diff --git a/.github/ISSUE_TEMPLATE/---document-issue-.md b/.github/ISSUE_TEMPLATE/---document-issue-.md deleted file mode 100644 index ffc2fcd7817b6..0000000000000 --- a/.github/ISSUE_TEMPLATE/---document-issue-.md +++ /dev/null @@ -1,59 +0,0 @@ ---- -name: 文档(Document Issue) -about: 您可以提问文档相关的问题。 You could use this template for reporting an document issue. - ---- - -非常感谢您提交关于飞桨文档的Issue,我们会认真听取您的意见,并进行改进。 - -建立issue时,为快速解决问题,请您根据情况给出如下信息: -- 标题:请包含关键词“XXX文档问题”,例如“add 文档问题” 或 ”paddle.add 文档问题“ -- 文档版本信息:请提供有问题的文档的版本号,例如 develop,1.8,2.0RC; - -### 文档问题描述: - -#### API文档描述是否清晰? -如:文档描述看不懂,不知道这个API该怎么用;文档公式错误; - -#### 参数说明是否清晰 -如:参数未解释清楚,包括用法、使用场景、默认值等 - -#### 返回/形状说明是否清晰 -如:API返回值、数据的形状描述错误、不清楚 - -#### 示例代码是否有效? -如:没有示例代码;示例代码没有可指导性;示例代码跑不通;示例代码格式有问题;示例代码没有注释; - -#### 中英文内容是否一致? -如:中英文API描述不一致;中英文API参数不一致; - -#### 其他 -如:文档页面打不开;文档缺失;文档中有死链; - - -Thanks for opening a document issue. We will listen to your opinions carefully and make improvements. - -In order to quickly solve your problem, when creating an issue, please provide the following information: -**Document Information** -- Title:Please include the keyword "XXX document issue", such as "add document issue" or "paddle.add document issue" -- Doc Version:Please provide the version of the document, such as develop, 1.8, 2.0RC; - -### Describe the problem: - -#### Document description is clear? -For example: I don’t understand this document, I don’t know how to use this API; The formula in this doc is unclear; - -#### Parameter description is clear? -For example: The parameters are confusing, including usage, scenarios, default values, etc. - -#### Return/Shape description is clear -For example: Data returned this doc is error, shape returned is not clear. - -#### The sample code is clear? -For example: no sample code; The sample code is not helpful; The sample code not run well; Format of the sample is not reasonable; The sample code has no comments. - -#### Chinese content and English content is consistent? -For example:Chinese API in this doc is inconsistent with English API, including params, description, sample code, formula, etc. - -#### Other -For example: The doc link is broken; The doc page is missing; Dead link in docs. diff --git a/.github/ISSUE_TEMPLATE/---feature-request-.md b/.github/ISSUE_TEMPLATE/---feature-request-.md deleted file mode 100644 index 7af1f7daeefbb..0000000000000 --- a/.github/ISSUE_TEMPLATE/---feature-request-.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -name: 建议(Feature request) -about: 您可以提出您的建议。 You could use this template for reporting a suggestion  issue. - ---- - -欢迎您对PaddlePaddle提出建议,非常感谢您对PaddlePaddle的贡献! -在留下您的建议时,辛苦您同步提供如下信息: -- 版本、环境信息 -1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1 -2)CPU/GPU:您是否使用GPU进行训练,如是,请提供您的CUDA和cuDNN版本号 -3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14 -注:您可以通过执行[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取以上信息。 -- 复现信息:如为报错,请给出复现环境、复现步骤 -- 建议描述:请您详细描述,您认为需优化的功能 - -Thank you for contributing to PaddlePaddle. -Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before. -Please make sure that this is a feature request. -**System information** --PaddlePaddle version (eg.1.1)or CommitID --CPU: including CPUMKL/OpenBlas/MKLDNN version --GPU: including CUDA/CUDNN version --OS Platform (eg.Mac OS 10.14) -Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py). 
-**To Reproduce** -Steps to reproduce the behavior -**Describe the feature and the current behavior/state.** -**Any Other info.** diff --git a/.github/ISSUE_TEMPLATE/---inference-issue-.md b/.github/ISSUE_TEMPLATE/---inference-issue-.md deleted file mode 100644 index ceb8b12d80572..0000000000000 --- a/.github/ISSUE_TEMPLATE/---inference-issue-.md +++ /dev/null @@ -1,42 +0,0 @@ ---- -name: 预测(Inference Issue) -about: 您可以提问预测中报错、应用等问题。 You could use this template for reporting an inference issue. - ---- - -为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 - -如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息: -- 标题:简洁、精准描述您的问题,例如“最新预测库的API文档在哪儿 ” -- 版本、环境信息: -    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID -    2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况 -    3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号 -    4)系统环境:请您描述系统类型、版本(如Mac OS 10.14),Python版本 -注:您可以通过执行[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取以上信息。 --预测信息 -    1)C++预测:请您提供预测库安装包的版本信息,及其中的version.txt文件 -    2)CMake包含路径的完整命令 -    3)API信息(如调用请提供) -    4)预测库来源:官网下载/特殊环境(如BCLOUD编译) -- 复现信息:如为报错,请给出复现环境、复现步骤 -- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 - -Thank you for contributing to PaddlePaddle. -Before submitting the issue, you could search issue in the github in case that th -If there is no solution,please make sure that this is an inference issue including the following details : -**System information** --PaddlePaddle version (eg.1.1)or CommitID --CPU: including CPUMKL/OpenBlas/MKLDNN version --GPU: including CUDA/CUDNN version --OS Platform (eg.Mac OS 10.14) --Python version --Cmake orders --C++version.txt --API information -Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py). -**To Reproduce** -Steps to reproduce the behavior -**Describe your current behavior** -**Code to reproduce the issue** -**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---installation-issue-.md b/.github/ISSUE_TEMPLATE/---installation-issue-.md deleted file mode 100644 index 5e761a6605a76..0000000000000 --- a/.github/ISSUE_TEMPLATE/---installation-issue-.md +++ /dev/null @@ -1,43 +0,0 @@ ---- -name: 安装(Installation Issue) -about: 您可以提问安装、编译出现报错等问题。 You could use this template for reporting an installation -  issue. - ---- - -为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 - -建立issue时,为快速解决问题,请您根据使用情况给出如下信息: -- 标题:请包含关键词“安装错误”/“编译错误”,例如“Mac编译错误” -- 版本、环境信息: -    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号(如1.1)或CommitID -    2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况 -    3)GPU:请提供GPU型号,CUDA和CUDNN版本号 -    4)系统环境:请说明系统类型、版本(如Mac OS 10.14)、Python版本 -注:您可以通过执行[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取以上信息。 -- 安装方式信息: -1)pip安装/docker安装 -2)本地编译:请提供cmake命令,编译命令 -3)docker编译:请提供docker镜像,编译命令            -  特殊环境请注明:如离线安装等 -- 复现信息:如为报错,请给出复现环境、复现步骤 -- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 - -Thank you for contributing to PaddlePaddle. -Before submitting the issue, you could search issue in Github in case that there was a similar issue submitted or resolved before. -If there is no solution,please make sure that this is an installation issue including the following details: -**System information** --PaddlePaddle version (eg.1.1)or CommitID --CPU: including CPUMKL/OpenBlas/MKLDNN version --GPU: including CUDA/CUDNN version --OS Platform (eg. 
Mac OS 10.14) --Python version -- Install method: pip install/install with docker/build from source(without docker)/build within docker -- Other special cases that you think may be related to this problem, eg. offline install, special internet condition -Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py).   - -**To Reproduce** -Steps to reproduce the behavior -**Describe your current behavior** -**Code to reproduce the issue** -**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---model-issue-.md b/.github/ISSUE_TEMPLATE/---model-issue-.md deleted file mode 100644 index 1e7c2e9c3e9ce..0000000000000 --- a/.github/ISSUE_TEMPLATE/---model-issue-.md +++ /dev/null @@ -1,38 +0,0 @@ ---- -name: 模型(Model Issue) -about: 您可以提问模型、算法、数据集方向的使用报错等问题。You could use this template for reporting a model/ - algorithm/dataset  issue. - ---- - -为使您的问题得到快速解决,在建立Issue前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 - -建立issue时,为快速解决问题,请您根据使用情况给出如下信息: -- 标题:简洁、精准描述您的问题,例如“ssd 模型前置lstm报错  ” -- 版本、环境信息: -    1)PaddlePaddle版本:请提供PaddlePaddle版本号,例如1.1或CommitID -    2)CPU:请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况 -    3)GPU:请提供GPU型号,CUDA和CUDNN版本号 -    4)系统环境:请说明系统类型、版本(例如Mac OS 10.14),Python版本 - 注:您可以通过执行[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取以上信息。 -- 模型信息 -    1)模型名称 2)使用数据集名称 3)使用算法名称 4)模型链接 -- 复现信息:如为报错,请给出复现环境、复现步骤 -- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 - -Thank you for contributing to PaddlePaddle. -Before submitting the issue, you could search issue in the github.Probably there was a similar issue submitted or resolved before. -If there is no solution,please make sure that this is a issue of models including the following details: -**System information** --PaddlePaddle version (eg.1.1)or CommitID --CPU: including CPUMKL/OpenBlas/MKLDNN version --GPU: including CUDA/CUDNN version --OS Platform (eg.Mac OS 10.14) --Python version --Name of Models&Dataset/details of operator -Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py). -**To Reproduce** -Steps to reproduce the behavior -**Describe your current behavior** -**Code to reproduce the issue** -**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---others-.md b/.github/ISSUE_TEMPLATE/---others-.md deleted file mode 100644 index ebab9023a6353..0000000000000 --- a/.github/ISSUE_TEMPLATE/---others-.md +++ /dev/null @@ -1,35 +0,0 @@ ---- -name: 其他(Others) -about: 如上述分类未包含您的问题,可在此提出。 You could use this template for reporting other issues - ---- - -为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 - -如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息: -- 标题:简洁、精准概括您的问题 -- 版本、环境信息: -    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID -    2)CPU/GPU:如果您使用GPU训练,请提供GPU驱动版本、CUDA和cuDNN版本号 -    3)系统环境:请您描述系统类型、版本,例如Mac OS 10.14 -    4)Python版本号 -    5)显存信息 - 注:您可以通过执行[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取以上信息。 -- 复现信息:如为报错,请给出复现环境、复现步骤 -- 问题描述:请详细描述您的问题,同步贴出报错信息、日志/代码关键片段 - -Thank you for contributing to PaddlePaddle. -Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before. 
-If there is no solution,please provide us with the following details : -**System information** --PaddlePaddle version (eg.1.1)or CommitID --CPU: including CPUMKL/OpenBlas/MKLDNN version --GPU: including CUDA/cuDNN version --OS Platform and Distribution(eg.Mac OS 10.14) --Python version -Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py). -**To Reproduce** -Steps to reproduce the behavior -**Describe your current behavior** -**Code to reproduce the issue** -**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/---training-issue-.md b/.github/ISSUE_TEMPLATE/---training-issue-.md deleted file mode 100644 index 15aa077619dc1..0000000000000 --- a/.github/ISSUE_TEMPLATE/---training-issue-.md +++ /dev/null @@ -1,40 +0,0 @@ ---- -name: 训练(Training issue) -about: 您可以提问训练中报错、应用、出core等问题。 You could use this template for reporting an training -  issue. - ---- - -为使您的问题得到快速解决,在建立Issues前,请您先通过如下方式搜索是否有相似问题:【搜索issue关键字】【使用labels筛选】【官方文档】 - -如果您没有查询到相似问题,为快速解决您的提问,建立issue时请提供如下细节信息: -- 标题:简洁、精准概括您的问题,例如“Insufficient Memory xxx" ” -- 版本、环境信息: -    1)PaddlePaddle版本:请提供您的PaddlePaddle版本号,例如1.1或CommitID -    2)CPU:预测若用CPU,请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库使用情况 -    3)GPU:预测若用GPU,请提供GPU型号、CUDA和CUDNN版本号 -    4)系统环境:请您描述系统类型、版本,例如Mac OS 10.14,Python版本 - 注:您可以通过执行[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取以上信息。 -- 训练信息 -    1)单机/多机,单卡/多卡 -    2)显存信息 -    3)Operator信息 -- 复现信息:如为报错,请给出复现环境、复现步骤 -- 问题描述:请详细描述您的问题,同步贴出报错信息、日志、可复现的代码片段 - -Thank you for contributing to PaddlePaddle. -Before submitting the issue, you could search issue in the github in case that there was a similar issue submitted or resolved before. -If there is no solution,please make sure that this is a training issue including the following details: -**System information** --PaddlePaddle version (eg.1.1)or CommitID --CPU: including CPUMKL/OpenBlas/MKLDNN version --GPU: including CUDA/CUDNN version --OS Platform (eg.Mac OS 10.14) --Other imformation: Distriuted training/informantion of operator/ -Graphics card storage -Note: You can get most of the information by running [summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py). -**To Reproduce** -Steps to reproduce the behavior -**Describe your current behavior** -**Code to reproduce the issue** -**Other info / logs** diff --git a/.github/ISSUE_TEMPLATE/1_bug-report.yml b/.github/ISSUE_TEMPLATE/1_bug-report.yml new file mode 100644 index 0000000000000..058589232fe1e --- /dev/null +++ b/.github/ISSUE_TEMPLATE/1_bug-report.yml @@ -0,0 +1,66 @@ + +name: 🐛 报BUG Bug Report +description: 报告一个可复现的BUG帮助我们修复框架。 Report a bug to help us reproduce and fix it. +labels: [type/bug-report, status/new-issue] + +body: +- type: markdown + attributes: + value: > + #### 在向Paddle报bug之前,请先查询[历史issue](https://github.com/PaddlePaddle/Paddle/issues)是否报过同样的bug。 + + #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/PaddlePaddle/Paddle/issues). + +- type: textarea + id: code + attributes: + label: bug描述 Describe the Bug + description: | + 请清晰简洁的描述这个bug,最好附上bug复现环境、bug复现步骤及最小代码集,以便我们可以通过运行代码来重现错误。代码片段需要尽可能简洁,请花些时间去掉不相关的代码以帮助我们有效地调试。我们希望通过复制代码并运行得到与你相同的结果,请避免任何外部数据或包含相关的导入等。例如: + ```python + # 导入所有必要的库。 All necessary imports at the beginning. 
+ # paddlepaddle <= 2.1.2 + import paddle + + # 一个简洁的片段,能够定位到bug。 A succinct reproducing example trimmed down to the essential parts. + a = paddle.rand(shape=[1,4]) + b = paddle.rand(shape=[1,4]) + a.stop_gradient = False + b.stop_gradient = False + + c = paddle.zeros((4, 4)) + c[0, :] = a/b + + print('Is c requires grad: ', not c.stop_gradient) # 注意:这里出现了bug,期望requires_grad=True + ``` + 如果代码太长,请将可执行代码放到[AIStudio](https://aistudio.baidu.com/aistudio/index)中并将项目设置为公开(或者放到github gist上),请在项目中描述清楚bug复现步骤,在issue中描述期望结果与实际结果。 + 如果你报告的是一个报错信息,请将完整回溯的报错贴在这里,并使用 ` ```三引号块``` `展示错误信息。 + + + placeholder: | + 请清晰简洁的描述这个bug。A clear and concise description of what the bug is. + + ```python + # 最小可复现代码。 Sample code to reproduce the problem. + ``` + + ```shell + 带有完整回溯的报错信息。 The error message you got, with the full traceback. + ``` + validations: + required: true + +- type: textarea + id: others + attributes: + label: 其他补充信息 Additional Supplementary Information + description: | + 如果你还有其他需要补充的内容,请写在这里。 + If you have anything else to add, please write it here. + validations: + required: false + +- type: markdown + attributes: + value: > + 感谢你的贡献 🎉!Thanks for your contribution 🎉! diff --git a/.github/ISSUE_TEMPLATE/2_feature-request.yml b/.github/ISSUE_TEMPLATE/2_feature-request.yml new file mode 100644 index 0000000000000..e9dd3465d1758 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/2_feature-request.yml @@ -0,0 +1,37 @@ +name: 🚀 新需求 Feature Request +description: 提交一个你对Paddle的新需求。 Submit a request for a new Paddle feature. +labels: [type/feature-request, status/new-issue] + +body: +- type: markdown + attributes: + value: > + #### 你可以在这里提出你对Paddle框架的新需求,包括但不限于:功能或模型缺失、功能不全或无法使用、精度/性能不符合预期等。 + + #### You could submit a request for a new Paddle feature here, including but not limited to: new features or models, incomplete or unusable features, accuracy/performance not as expected, etc. + +- type: textarea + id: description + attributes: + label: 需求描述 Feature Description + description: | + 请尽可能包含任务目标、需求场景、功能描述等信息,全面的信息有利于我们准确评估你的需求。 + Please include as much information as possible, such as mission objectives, requirement scenarios, functional descriptions, etc. Comprehensive information will help us accurately assess your feature request. + value: "任务目标(请描述你正在做的项目是什么,如模型、论文、项目是什么?);
需求场景(请描述你的项目中为什么需要用此功能);
功能描述(请简单描述或设计这个功能)" + validations: + required: true + +- type: textarea + id: alternatives + attributes: + label: 替代实现 Alternatives + description: | + 如果你考虑过的任何替代解决方案或功能,请简要描述下,我们会综合评估。 + A description of any alternative solutions or features you've considered, if any. + validations: + required: false + +- type: markdown + attributes: + value: > + 感谢你的贡献 🎉!Thanks for your contribution 🎉! diff --git a/.github/ISSUE_TEMPLATE/3_build-installation-issue.yml b/.github/ISSUE_TEMPLATE/3_build-installation-issue.yml new file mode 100644 index 0000000000000..2786175af6dc8 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/3_build-installation-issue.yml @@ -0,0 +1,65 @@ +name: 🗂 安装 Build/Installation Issue +description: 报告一个安装问题。 Report an issue related to build or install Paddle. +labels: [type/build, status/new-issue] + +body: +- type: markdown + attributes: + value: > + #### 安装请参考[官网文档](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html),若未能解决你的问题,你可以在这里提issue。 + + #### Before submitting a Build/Installation Issue, please make sure you have visited the [official website](https://www.paddlepaddle.org.cn/documentation/docs/en/install/index_en.html). + +- type: textarea + id: error + attributes: + label: 问题描述 Issue Description + description: | + 请详细描述你的问题,同步贴出报错信息、日志/代码关键片段、复现步骤,以便我们快速排查问题。 + Please describe your problem in detail, and synchronously post the error message, key log/code snippet, and reproduction steps, so that we can quickly troubleshoot the problem. + validations: + required: true + +- type: textarea + id: environment + attributes: + label: 版本&环境信息 Version & Environment Information + description: | + 请参考以下命令运行脚本[summary_env.py](https://github.com/PaddlePaddle/Paddle/blob/develop/tools/summary_env.py)获取版本&环境信息,并将输出拷贝在这里。 + Please run the following and paste the output below. + ```shell + wget https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/tools/summary_env.py + python3 -m pip install distro + python3 summary_env.py + ``` + 若运行脚本出现问题,请在issue中说明,并提供以下信息: + 1. PaddlePaddle版本:请提供你的PaddlePaddle版本号(如2.0.0)或CommitID。 + 2. CPU(可选):请提供CPU型号,MKL/OpenBlas/MKLDNN/等数学库的使用情况,是否支持AVX指令集。 + 3. GPU:请提供GPU型号,CUDA(如cuda10.2)和CUDNN版本号(如cudnn7.6.5)。 + 4. 系统环境:请说明系统类型、版本(如Mac OS 10.14)。 + 5. Python版本(如python 3.7)。 + 6. (可选)若安装过程遇到问题,请提供安装方式(pip/conda/docker/源码编译)和相应的安装命令。 + 7. (可选)若使用paddle过程中,遇到了无法使用gpu相关问题,请在命令行中键入`nvidia-smi`和`nvcc -V`,提供这两个命令输出的截图。 + 8. (可选)若使用特殊硬件,请单独注明。 + + + placeholder: | + **************************************** + Paddle version: + Paddle With CUDA: + + OS: + Python version: + + CUDA version: + cuDNN version: + Nvidia driver version: + **************************************** + validations: + required: true + + +- type: markdown + attributes: + value: > + 感谢你的贡献 🎉!Thanks for your contribution 🎉! diff --git a/.github/ISSUE_TEMPLATE/4_documentation-issue.yml b/.github/ISSUE_TEMPLATE/4_documentation-issue.yml new file mode 100644 index 0000000000000..936ed3d92e8c4 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/4_documentation-issue.yml @@ -0,0 +1,39 @@ +name: 📚 文档 Documentation Issue +description: 反馈一个官网文档错误。 Report an issue related to https://www.paddlepaddle.org.cn/. +labels: [type/docs, status/new-issue] + +body: +- type: markdown + attributes: + value: > + #### 请确认反馈的问题来自PaddlePaddle官网文档:https://www.paddlepaddle.org.cn/ 。 + + #### Before submitting a Documentation Issue, Please make sure that issue is related to https://www.paddlepaddle.org.cn/. 
+ +- type: textarea + id: link + attributes: + label: 文档链接&描述 Document Links & Description + description: | + 请说明有问题的文档链接以及该文档存在的问题。 + Please fill in the link to the document and describe the question. + validations: + required: true + + +- type: textarea + id: error + attributes: + label: 请提出你的建议 Please give your suggestion + description: | + 请告诉我们,你希望如何改进这个文档。或者你可以提个PR修复这个问题。[教程参考](https://github.com/PaddlePaddle/docs/wiki#%E8%B4%A1%E7%8C%AE%E6%96%87%E6%A1%A3) + Please tell us how you would like to improve this document. Or you can submit a PR to fix this problem. + + validations: + required: false + +- type: markdown + attributes: + value: > + 感谢你的贡献 🎉!Thanks for your contribution 🎉! + diff --git a/.github/ISSUE_TEMPLATE/5_ask-a-question.yml b/.github/ISSUE_TEMPLATE/5_ask-a-question.yml new file mode 100644 index 0000000000000..158918946f3fd --- /dev/null +++ b/.github/ISSUE_TEMPLATE/5_ask-a-question.yml @@ -0,0 +1,33 @@ +name: 🙋🏼‍♀️🙋🏻‍♂️提问 Ask a Question +description: 提出一个使用/咨询问题。 Ask a usage or consultation question. +labels: [type/question, status/new-issue] + +body: +- type: markdown + attributes: + value: > + #### 你可以在这里提出一个使用/咨询问题,提问之前请确保: + + - 1)已经百度/谷歌搜索过你的问题,但是没有找到解答; + + - 2)已经在官网查询过[API文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html)与[FAQ](https://www.paddlepaddle.org.cn/documentation/docs/zh/faq/index_cn.html),但是没有找到解答; + + - 3)已经在[历史issue](https://github.com/PaddlePaddle/Paddle/issues)中搜索过,没有找到同类issue或issue未被解答。 + + + #### You could ask a usage or consultation question here, before your start, please make sure: + + - 1) You have searched your question on Baidu/Google, but found no answer; + + - 2) You have checked the [API documentation](https://www.paddlepaddle.org.cn/documentation/docs/en/api/index_en.html), but found no answer; + + - 3) You have searched [the existing and past issues](https://github.com/PaddlePaddle/Paddle/issues), but found no similar issue or the issue has not been answered. + + + +- type: textarea + id: question + attributes: + label: 请提出你的问题 Please ask your question + validations: + required: true diff --git a/.github/ISSUE_TEMPLATE/6_others.yml b/.github/ISSUE_TEMPLATE/6_others.yml new file mode 100644 index 0000000000000..e8f4a9c232918 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/6_others.yml @@ -0,0 +1,26 @@ +name: 🧩 其他 Others +description: 提出其他问题。 Report any other non-support related issues. +labels: [type/others, status/new-issue] + +body: +- type: markdown + attributes: + value: > + #### 你可以在这里提出任何前面几类模板不适用的问题,包括但不限于:优化性建议、框架使用体验反馈、版本兼容性问题、报错信息不清楚等。 + + #### You can report any issues that are not applicable to the previous types of templates, including but not limited to: enhancement suggestions, feedback on the use of the framework, version compatibility issues, unclear error information, etc. + +- type: textarea + id: others + attributes: + label: 问题描述 Please describe your issue + validations: + required: true + +- type: markdown + attributes: + value: > + 感谢你的贡献 🎉! Thanks for your contribution 🎉! 
+ + + diff --git a/cmake/external/libmct.cmake b/cmake/external/libmct.cmake index 92c3165fbaa90..a166e43c7b95e 100644 --- a/cmake/external/libmct.cmake +++ b/cmake/external/libmct.cmake @@ -45,7 +45,7 @@ ExternalProject_Add( PREFIX ${LIBMCT_PREFIX_DIR} DOWNLOAD_DIR ${LIBMCT_DOWNLOAD_DIR} DOWNLOAD_COMMAND wget --no-check-certificate ${LIBMCT_URL} -c -q -O ${LIBMCT_NAME}.tar.gz - && tar zxvf ${LIBMCT_NAME}.tar.gz + && tar --no-same-owner -zxvf ${LIBMCT_NAME}.tar.gz DOWNLOAD_NO_PROGRESS 1 UPDATE_COMMAND "" CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${LIBMCT_INSTALL_ROOT} diff --git a/cmake/external/xpu.cmake b/cmake/external/xpu.cmake index 0d340ab638b1a..e83bdef327891 100644 --- a/cmake/external/xpu.cmake +++ b/cmake/external/xpu.cmake @@ -36,7 +36,7 @@ ENDIF() if(NOT DEFINED XPU_BASE_URL) SET(XPU_BASE_URL_WITHOUT_DATE "https://baidu-kunlun-product.cdn.bcebos.com/KL-SDK/klsdk-dev") - SET(XPU_BASE_URL "${XPU_BASE_URL_WITHOUT_DATE}/20220327") + SET(XPU_BASE_URL "${XPU_BASE_URL_WITHOUT_DATE}/20220402") else() SET(XPU_BASE_URL "${XPU_BASE_URL}") endif() diff --git a/cmake/flags.cmake b/cmake/flags.cmake index f90b71f9e60a8..5742a6b602ff3 100644 --- a/cmake/flags.cmake +++ b/cmake/flags.cmake @@ -244,3 +244,7 @@ if(WITH_ROCM) string (REPLACE "-Werror" "-Wno-error" CMAKE_C_FLAGS ${CMAKE_C_FLAGS}) endif() +if(WITH_PSCORE OR WITH_PSLIB) + string (REPLACE "-Wnon-virtual-dtor" "-Wno-non-virtual-dtor" CMAKE_CXX_FLAGS ${CMAKE_CXX_FLAGS}) + string (REPLACE "-Wnon-virtual-dtor" "-Wno-non-virtual-dtor" CMAKE_C_FLAGS ${CMAKE_C_FLAGS}) +endif() diff --git a/cmake/inference_lib.cmake b/cmake/inference_lib.cmake index cafd1406b256f..e3e6e1cced2aa 100644 --- a/cmake/inference_lib.cmake +++ b/cmake/inference_lib.cmake @@ -199,13 +199,6 @@ IF(WITH_XPU) DSTS ${dst_dir} ${dst_dir}) ENDIF() -IF(WITH_IPU) - set(dst_dir "${PADDLE_INFERENCE_INSTALL_DIR}/third_party/install/ipu") - copy(inference_lib_dist - SRCS ${CMAKE_BINARY_DIR}/paddle/fluid/platform/device/ipu/libpaddle_ipu.so - DSTS ${dst_dir}) -ENDIF() - # CMakeCache Info copy(inference_lib_dist SRCS ${CMAKE_CURRENT_BINARY_DIR}/CMakeCache.txt diff --git a/paddle/fluid/distributed/CMakeLists.txt b/paddle/fluid/distributed/CMakeLists.txt index 17432a0c043f2..06b0583eddf24 100644 --- a/paddle/fluid/distributed/CMakeLists.txt +++ b/paddle/fluid/distributed/CMakeLists.txt @@ -1,9 +1,6 @@ add_subdirectory(collective) add_subdirectory(store) if(NOT WITH_PSCORE) - if(WITH_HETERPS) - add_subdirectory(ps) - endif() add_subdirectory(fleet_executor) return() endif() diff --git a/paddle/fluid/distributed/collective/CMakeLists.txt b/paddle/fluid/distributed/collective/CMakeLists.txt index 6fb805a72e4de..6d736d5543ce4 100644 --- a/paddle/fluid/distributed/collective/CMakeLists.txt +++ b/paddle/fluid/distributed/collective/CMakeLists.txt @@ -7,14 +7,14 @@ endif() if(WITH_NCCL) cc_library(processgroup_nccl SRCS ProcessGroupNCCL.cc NCCLTools.cc Common.cc DEPS place cuda_stream enforce collective_helper device_context phi phi_api eager_api) - if (WITH_DISTRIBUTE) + if (WITH_DISTRIBUTE AND WITH_PSCORE) cc_library(processgroup_heter SRCS ProcessGroupHeter.cc NCCLTools.cc Common.cc DEPS place cuda_stream enforce collective_helper device_context phi phi_api eager_api) endif() endif() if(WITH_ASCEND_CL) cc_library(processgroup_hccl SRCS ProcessGroupHCCL.cc HCCLTools.cc Common.cc DEPS place npu_stream enforce collective_helper device_context phi phi_api eager_api) - if (WITH_DISTRIBUTE) + if (WITH_DISTRIBUTE AND WITH_PSCORE) cc_library(processgroup_heter SRCS ProcessGroupHeter.cc HCCLTools.cc 
Common.cc DEPS place npu_stream enforce collective_helper device_context phi phi_api eager_api) endif() endif() diff --git a/paddle/fluid/distributed/collective/ProcessGroup.cc b/paddle/fluid/distributed/collective/ProcessGroup.cc index ab118dadd5d88..6da83a888683b 100644 --- a/paddle/fluid/distributed/collective/ProcessGroup.cc +++ b/paddle/fluid/distributed/collective/ProcessGroup.cc @@ -35,10 +35,10 @@ bool ProcessGroup::Task::Wait(std::chrono::milliseconds timeout) { void ProcessGroup::Task::Synchronize() {} ProcessGroup::ProcessGroup(int rank, int size, int gid) - : rank_(rank), size_(size) { + : rank_(rank), size_(size), gid_(gid) { if (gid != IGNORE_ID) { auto map = ProcessGroupMapFromGid::getInstance(); - map->insert(gid, this); + map->insert(gid_, this); } } diff --git a/paddle/fluid/distributed/collective/ProcessGroup.h b/paddle/fluid/distributed/collective/ProcessGroup.h index 36a00a7d31758..17d021852671e 100644 --- a/paddle/fluid/distributed/collective/ProcessGroup.h +++ b/paddle/fluid/distributed/collective/ProcessGroup.h @@ -93,8 +93,8 @@ class ProcessGroup { } virtual void Broadcast(const phi::DenseTensor* in, phi::DenseTensor* out) { - PADDLE_THROW(platform::errors::InvalidArgument( - "ProcessGroup%s does not support broadcast for static", + PADDLE_THROW(platform::errors::Fatal( + "ProcessGroup%s does not support broadcast for static mode runtime", GetBackendName())); } @@ -148,6 +148,7 @@ class ProcessGroup { protected: const int rank_; const int size_; + const int gid_; }; class ProcessGroupMapFromGid { @@ -158,16 +159,20 @@ class ProcessGroupMapFromGid { } void insert(int gid, ProcessGroup* pg) { - PADDLE_ENFORCE_EQ(has(gid), false, - platform::errors::PreconditionNotMet( - "The process group with id %d doesnot exist.", gid)); + // TODO(sandyhouse): address ut and uncomment the following codes + // PADDLE_ENFORCE_EQ(has(gid), false, + // platform::errors::PreconditionNotMet( + // "The process group with id %d doesnot exist.", + // gid)); map_[gid] = pg; } ProcessGroup* get(int gid) { - PADDLE_ENFORCE_EQ(has(gid), false, - platform::errors::PreconditionNotMet( - "The process group with id %d doesnot exist.", gid)); + // TODO(sandyhouse): address ut and uncomment the following codes + // PADDLE_ENFORCE_EQ(has(gid), true, + // platform::errors::PreconditionNotMet( + // "The process group with id %d doesnot exist.", + // gid)); return map_.find(gid)->second; } diff --git a/paddle/fluid/distributed/collective/ProcessGroupHCCL.cc b/paddle/fluid/distributed/collective/ProcessGroupHCCL.cc index b21155e09d06e..55945b5e0e396 100644 --- a/paddle/fluid/distributed/collective/ProcessGroupHCCL.cc +++ b/paddle/fluid/distributed/collective/ProcessGroupHCCL.cc @@ -30,12 +30,6 @@ constexpr int64_t kWaitBlockTImeout = 10; namespace paddle { namespace distributed { -// bool CheckTensorsInNPUPlace(const std::vector& tensors) { -// return std::all_of(tensors.cbegin(), tensors.cend(), [&](const Tensor& t) { -// return t.place() == platform::DeviceType::NPU; -// }); -// } - void SyncDefaultStream( const std::vector& places, std::vector& hcclEvents, // NOLINT diff --git a/paddle/fluid/distributed/collective/ProcessGroupHeter.cc b/paddle/fluid/distributed/collective/ProcessGroupHeter.cc index ffd653042494d..b3c9ddde50116 100644 --- a/paddle/fluid/distributed/collective/ProcessGroupHeter.cc +++ b/paddle/fluid/distributed/collective/ProcessGroupHeter.cc @@ -56,7 +56,8 @@ ProcessGroupHeter::ProcessGroupHeter(const std::shared_ptr& store, local_size_(local_size), gloo_rank_(gloo_rank), 
gloo_size_(gloo_size), - with_switch_(with_switch) { + with_switch_(with_switch), + switch_endpoint_(switch_endpoint) { #if defined(PADDLE_WITH_NCCL) inner_pg_ = std::make_shared(store, local_rank, local_size, IGNORE_ID); @@ -64,14 +65,10 @@ ProcessGroupHeter::ProcessGroupHeter(const std::shared_ptr& store, inner_pg_ = std::make_shared(store, local_rank, local_size, IGNORE_ID); #else - PADDLE_THROW(platform::errors::InvalidArgument( + PADDLE_THROW(platform::errors::Fatal( "ProcessGroupHeter only supports NCCL and HCCL now."); #endif - if (with_switch_) { - // TODO(sandyhouse) starts a client to connect the cloud switch module - // std::shared_ptr client_ = - // HeterClient::GetInstance({switch_endpoint}, {}, 0); - } else if (local_rank_ == 0) { + if (local_rank_ == 0 && !with_switch_) { auto opts = ProcessGroupGloo::GlooOptions::create(); opts->device = ProcessGroupGloo::createDefaultDevice(); inter_pg_ = std::make_shared(store, gloo_rank_, @@ -79,6 +76,15 @@ ProcessGroupHeter::ProcessGroupHeter(const std::shared_ptr& store, } } +template +static void _do_add(T* dst, T* src, size_t size) { + for (size_t i = 0; i < size; i++) { + *dst += *src; + dst++; + src++; + } +} + std::shared_ptr ProcessGroupHeter::AllReduce( std::vector& tensors, const AllreduceOptions& opts) { #if defined(PADDLE_WITH_NCCL) @@ -93,33 +99,92 @@ std::shared_ptr ProcessGroupHeter::AllReduce( // Step2: copy tensors to CPU if (local_rank_ == 0) { - std::vector cpu_tensors(tensors.size()); + std::vector cpu_tensors; + cpu_tensors.reserve(tensors.size()); for (size_t i = 0; i < tensors.size(); i++) { auto dense_gpu_tensor = std::dynamic_pointer_cast(tensors[i].impl()); - auto dense_cpu_tensor = - std::dynamic_pointer_cast(cpu_tensors[i].impl()); - dense_cpu_tensor->Resize(tensors[i].dims()); + phi::DenseTensorMeta meta = phi::DenseTensorMeta( + dense_gpu_tensor->dtype(), dense_gpu_tensor->dims()); + std::shared_ptr dense_cpu_tensor = + std::make_shared( + std::make_unique( + paddle::platform::CPUPlace()) + .get(), + meta); + dense_cpu_tensor->ResizeAndAllocate(dense_gpu_tensor->dims()); + cpu_tensors[i] = paddle::experimental::Tensor(dense_cpu_tensor); framework::TensorCopySync(*dense_gpu_tensor, platform::CPUPlace(), dense_cpu_tensor.get()); } // Step3: do inter cluster allreduce if (with_switch_) { - // TODO(sandyhouse) send to and recv from switch, and do add + if (local_rank_ == 0) { + HeterClient* client_ = + HeterClient::GetInstance({switch_endpoint_}, {}, 0).get(); + auto dense_cpu_tensor = + std::dynamic_pointer_cast(cpu_tensors[0].impl()); + std::vector send_size; + send_size.push_back(dense_cpu_tensor->numel()); + int ret = client_->Send( + gid_, {dense_cpu_tensor->name()}, send_size, + dense_cpu_tensor->data(), + dense_cpu_tensor->numel() * + framework::DataTypeSize(dense_cpu_tensor->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, platform::errors::PreconditionNotMet( + "Send to the switch module error.")); + phi::DenseTensorMeta meta = phi::DenseTensorMeta( + dense_cpu_tensor->dtype(), dense_cpu_tensor->dims()); + std::shared_ptr dense_cpu_tensor2 = + std::make_shared( + std::make_unique( + paddle::platform::CPUPlace()) + .get(), + meta); + dense_cpu_tensor2->ResizeAndAllocate(dense_cpu_tensor->dims()); + Tensor cpu_tensor_temp = + paddle::experimental::Tensor(dense_cpu_tensor2); + ret = client_->Recv( + gid_, {dense_cpu_tensor->name()}, dense_cpu_tensor2->data(), + dense_cpu_tensor2->numel() * + framework::DataTypeSize(dense_cpu_tensor2->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, platform::errors::PreconditionNotMet( + 
"Recv from the switch module error.")); + + switch (dense_cpu_tensor->dtype()) { + case DataType::FLOAT32: + _do_add(reinterpret_cast(dense_cpu_tensor->data()), + reinterpret_cast(dense_cpu_tensor2->data()), + dense_cpu_tensor->numel()); + break; + case DataType::FLOAT64: + _do_add( + reinterpret_cast(dense_cpu_tensor->data()), + reinterpret_cast(dense_cpu_tensor2->data()), + dense_cpu_tensor->numel()); + break; + case DataType::INT32: + _do_add(reinterpret_cast(dense_cpu_tensor->data()), + reinterpret_cast(dense_cpu_tensor2->data()), + dense_cpu_tensor->numel()); + break; + default: + PADDLE_THROW(platform::errors::PreconditionNotMet( + "Unsupported data type (%s) to do add.", + framework::DataType2String(dense_cpu_tensor->dtype()))); + } + } } else { auto gloo_task = inter_pg_->AllReduce(cpu_tensors, opts); gloo_task->Wait(); } // Step4: copy cpu tensors to gpu - // TODO(sandyhouse) // copy cpu tensors to gpu for (size_t i = 0; i < tensors.size(); i++) { auto dense_gpu_tensor = std::dynamic_pointer_cast(tensors[i].impl()); auto dense_cpu_tensor = std::dynamic_pointer_cast(cpu_tensors[i].impl()); - // framework::TensorCopySync(*dense_cpu_tensor, tensors[i].place(), - // dense_gpu_tensor.get()); framework::TensorCopySync(*dense_cpu_tensor, dense_cpu_tensor->place(), dense_gpu_tensor.get()); } @@ -147,18 +212,57 @@ std::shared_ptr ProcessGroupHeter::Broadcast( inner_pg_->Broadcast(tensors, b_opts); if (local_rank_ == 0) { - std::vector cpu_tensors(tensors.size()); + std::vector cpu_tensors; + cpu_tensors.reserve(tensors.size()); for (size_t i = 0; i < tensors.size(); i++) { auto dense_gpu_tensor = std::dynamic_pointer_cast(tensors[i].impl()); - auto dense_cpu_tensor = - std::dynamic_pointer_cast(cpu_tensors[i].impl()); - dense_cpu_tensor->Resize(tensors[i].dims()); + phi::DenseTensorMeta meta = phi::DenseTensorMeta( + dense_gpu_tensor->dtype(), dense_gpu_tensor->dims()); + std::shared_ptr dense_cpu_tensor = + std::make_shared( + std::make_unique( + paddle::platform::CPUPlace()) + .get(), + meta); + dense_cpu_tensor->ResizeAndAllocate(dense_gpu_tensor->dims()); + cpu_tensors[i] = paddle::experimental::Tensor(dense_cpu_tensor); framework::TensorCopySync(*dense_gpu_tensor, platform::CPUPlace(), dense_cpu_tensor.get()); } if (with_switch_) { - // TODO(sandyhouse) send to and recv + if (local_rank_ == 0) { + HeterClient* client_ = + HeterClient::GetInstance({switch_endpoint_}, {}, 0).get(); + auto dense_cpu_tensor = + std::dynamic_pointer_cast(cpu_tensors[0].impl()); + if (gloo_rank_ == 0) { + std::vector send_size; + send_size.push_back(dense_cpu_tensor->numel()); + int ret = client_->Send( + gid_, {dense_cpu_tensor->name()}, send_size, + dense_cpu_tensor->data(), + dense_cpu_tensor->numel() * + framework::DataTypeSize(dense_cpu_tensor->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, platform::errors::PreconditionNotMet( + "Send to the switch module error.")); + } else { + int ret = client_->Recv( + gid_, {dense_cpu_tensor->name()}, dense_cpu_tensor->data(), + dense_cpu_tensor->numel() * + framework::DataTypeSize(dense_cpu_tensor->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, + platform::errors::PreconditionNotMet( + "Receive from the switch module error.")); + ret = client_->Recv( + gid_, {dense_cpu_tensor->name()}, dense_cpu_tensor->data(), + dense_cpu_tensor->numel() * + framework::DataTypeSize(dense_cpu_tensor->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, + platform::errors::PreconditionNotMet( + "Receive from the switch module error.")); + } + } } else { auto gloo_task = inter_pg_->Broadcast(cpu_tensors, opts); 
gloo_task->Wait(); @@ -168,8 +272,6 @@ std::shared_ptr ProcessGroupHeter::Broadcast( std::dynamic_pointer_cast(tensors[i].impl()); auto dense_cpu_tensor = std::dynamic_pointer_cast(cpu_tensors[i].impl()); - // framework::TensorCopySync(*dense_cpu_tensor, tensors[i].place(), - // dense_gpu_tensor.get()); framework::TensorCopySync(*dense_cpu_tensor, dense_cpu_tensor->place(), dense_gpu_tensor.get()); } @@ -185,22 +287,44 @@ void ProcessGroupHeter::Broadcast(const phi::DenseTensor* in, inner_pg_->Broadcast(in, out); if (local_rank_ == 0) { - Tensor cpu_tensor; - auto dense_cpu_tensor = - std::dynamic_pointer_cast(cpu_tensor.impl()); - dense_cpu_tensor->Resize(in->dims()); + phi::DenseTensorMeta meta = phi::DenseTensorMeta(in->dtype(), in->dims()); + std::shared_ptr dense_cpu_tensor = + std::make_shared( + std::make_unique( + paddle::platform::CPUPlace()) + .get(), + meta); + dense_cpu_tensor->ResizeAndAllocate(in->dims()); + Tensor cpu_tensor = paddle::experimental::Tensor(dense_cpu_tensor); framework::TensorCopySync(*in, platform::CPUPlace(), dense_cpu_tensor.get()); if (with_switch_) { - // TODO(sandyhouse) send to and recv + if (local_rank_ == 0) { + HeterClient* client_ = + HeterClient::GetInstance({switch_endpoint_}, {}, 0).get(); + if (gloo_rank_ == 0) { + std::vector send_size; + send_size.push_back(in->numel()); + int ret = client_->Send( + gid_, {in->name()}, send_size, dense_cpu_tensor->data(), + in->numel() * framework::DataTypeSize(in->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, platform::errors::PreconditionNotMet( + "Send to the switch module error.")); + } else { + int ret = + client_->Recv(gid_, {in->name()}, dense_cpu_tensor->data(), + in->numel() * framework::DataTypeSize(in->dtype())); + PADDLE_ENFORCE_EQ(ret, 0, + platform::errors::PreconditionNotMet( + "Receive from the switch module error.")); + } + } } else { std::vector cpu_tensors = {cpu_tensor}; - // auto gloo_task = inter_pg_->Broadcast(cpu_tensors); - // gloo_task->Wait(); - inter_pg_->Broadcast(cpu_tensors); + auto gloo_task = inter_pg_->Broadcast(cpu_tensors); + gloo_task->Wait(); } - framework::TensorCopySync(*dense_cpu_tensor, dense_cpu_tensor->place(), - out); + framework::TensorCopySync(*dense_cpu_tensor, out->place(), out); } inner_pg_->Broadcast(out, out); } diff --git a/paddle/fluid/distributed/collective/ProcessGroupHeter.h b/paddle/fluid/distributed/collective/ProcessGroupHeter.h index 8a26adbea4d78..892dbb9369e8d 100644 --- a/paddle/fluid/distributed/collective/ProcessGroupHeter.h +++ b/paddle/fluid/distributed/collective/ProcessGroupHeter.h @@ -23,7 +23,6 @@ #include "paddle/fluid/distributed/collective/ProcessGroup.h" #include "paddle/fluid/distributed/collective/ProcessGroupGloo.h" -// #include "paddle/fluid/distributed/ps/service/heter_client.h" #include "paddle/fluid/platform/device_context.h" #ifdef PADDLE_WITH_GLOO @@ -48,6 +47,11 @@ #include "paddle/fluid/distributed/collective/ProcessGroupHCCL.h" #endif +#if defined(PADDLE_WITH_DISTRIBUTE) && defined(PADDLE_WITH_PSCORE) && \ + (defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_ASCEND_CL)) +#include "paddle/fluid/distributed/ps/service/heter_client.h" +#endif + #include "paddle/fluid/distributed/collective/Common.h" constexpr const char* HETER_BACKEND_NAME = "HETER_BACKEND"; @@ -108,6 +112,7 @@ class ProcessGroupHeter : public ProcessGroup { int gloo_rank_; int gloo_size_; bool with_switch_; + std::string switch_endpoint_; }; } // namespace distributed diff --git a/paddle/fluid/distributed/collective/ProcessGroupNCCL.cc 
b/paddle/fluid/distributed/collective/ProcessGroupNCCL.cc index 7c0752b5f367c..b1d892e2521a3 100644 --- a/paddle/fluid/distributed/collective/ProcessGroupNCCL.cc +++ b/paddle/fluid/distributed/collective/ProcessGroupNCCL.cc @@ -110,7 +110,8 @@ void ProcessGroupNCCL::BroadcastUniqueNCCLID( std::vector& nccl_ids) { // NOLINT if (rank_ == 0) { for (size_t i = 0; i < nccl_ids.size(); i++) { - auto key = "ProcessGroupNCCL/nccl_ids/" + std::to_string(i); + auto key = "ProcessGroupNCCL/nccl_ids/" + std::to_string(gid_) + "/" + + std::to_string(i); auto nccl_id = std::vector( reinterpret_cast(&nccl_ids[i]), reinterpret_cast(&nccl_ids[i]) + NCCL_UNIQUE_ID_BYTES); @@ -118,7 +119,8 @@ void ProcessGroupNCCL::BroadcastUniqueNCCLID( } } else { for (size_t i = 0; i < nccl_ids.size(); i++) { - auto key = "ProcessGroupNCCL/nccl_ids/" + std::to_string(i); + auto key = "ProcessGroupNCCL/nccl_ids/" + std::to_string(gid_) + "/" + + std::to_string(i); auto ret = store_->get(key); std::memcpy(&nccl_ids[i], ret.data(), ret.size()); } @@ -226,6 +228,43 @@ std::shared_ptr ProcessGroupNCCL::Collective( return task; } +template +void ProcessGroupNCCL::Collective(const phi::DenseTensor* in, + phi::DenseTensor* out, Fn fn, + CommType op_type) { + std::vector places; + places.push_back(in->place()); + const auto key = GetKeyFromPlaces(places); + + { + std::lock_guard lock(mutex_); + if (places_to_ncclcomm_.find(key) == places_to_ncclcomm_.end()) { + CreateNCCLManagerCache(key, places); + } + } + + auto& nccl_comms = places_to_ncclcomm_[key]; + + SyncDefaultStream(places, places_to_events_[key], places_to_ctx_[key]); + + // construct uninitialize guard for device + platform::CUDADeviceGuard cuda_guard; + + if (FLAGS_use_stream_safe_cuda_allocator) { + cuda_guard.SetDevice(places[0]); + memory::RecordStream(in->Holder(), places_to_ctx_[key][0]->stream()); + } + + { + platform::NCCLGroupGuard nccl_guard; + cuda_guard.SetDevice(places[0]); + const auto& nccl_stream = places_to_ctx_[key][0]->stream(); + fn(in, out, nccl_comms[0]->GetNcclComm(), nccl_stream); + } + + cuda_guard.SetDevice(places[0]); +} + template std::shared_ptr ProcessGroupNCCL::PointToPoint( std::vector& tensors, Fn fn, int dst_rank, CommType op_type) { diff --git a/paddle/fluid/distributed/collective/ProcessGroupNCCL.h b/paddle/fluid/distributed/collective/ProcessGroupNCCL.h index 4ab5374dacaf4..fa73ed195b0c1 100644 --- a/paddle/fluid/distributed/collective/ProcessGroupNCCL.h +++ b/paddle/fluid/distributed/collective/ProcessGroupNCCL.h @@ -146,6 +146,10 @@ class ProcessGroupNCCL : public ProcessGroup { std::vector& outputs, // NOLINT Fn fn, CommType op_type); + template + void Collective(const phi::DenseTensor*, phi::DenseTensor*, Fn fn, + CommType op_type); + template std::shared_ptr PointToPoint( std::vector& tensors, // NOLINT diff --git a/paddle/fluid/distributed/collective/reducer.cc b/paddle/fluid/distributed/collective/reducer.cc index ec02406efc818..71741515c90d5 100644 --- a/paddle/fluid/distributed/collective/reducer.cc +++ b/paddle/fluid/distributed/collective/reducer.cc @@ -360,6 +360,7 @@ void EagerReducer::InitializeGroups( is_sparse_gradient_[tensor_indices_.front()]) { // process the sparse gradient. one sparse, one group group.dtype_ = first_var.dtype(); + group.is_sparse_ = true; } else { // process the dense gradient. 
InitializeDenseGroups(tensor_indices_, &group); @@ -391,6 +392,12 @@ void EagerReducer::InitializeDenseGroups( auto &tensor = tensors_[tensor_index]; auto &tensor_name = tensor.name(); + PADDLE_ENFORCE_EQ(is_sparse_gradient_[tensor_index], false, + platform::errors::PreconditionNotMet( + "Tensor %s's GRAD must be Tensor, but received " + "GRAD is SelectedRows", + tensor_name)); + PADDLE_ENFORCE_EQ(tensor.is_initialized(), true, platform::errors::PreconditionNotMet( "Tensor %s is not initialized.", tensor_name)); @@ -480,6 +487,7 @@ void EagerReducer::PrepareForBackward(const std::vector &outputs) { next_group_ = 0; std::for_each(groups_.begin(), groups_.end(), [](EagerGroup &group) { group.pending_ = group.tensor_indices_.size(); + group.sparse_contents_ = Tensor(); }); // reinitialize vars_marked_ready_ for next iteration @@ -544,9 +552,6 @@ void EagerReducer::AddDistHook(size_t var_index) { return; } - auto &tensor = tensors_[var_index]; - const auto &grad_node = GetGradNodeFromTensor(&tensor); - VLOG(3) << "Tensor[" << var_index << "] [" << tensors_[var_index].name() << "@Grad] arrived and triggered disthook"; @@ -608,33 +613,69 @@ void EagerReducer::MarkVarReady(const size_t var_index, auto &group_tensor = group.dense_tensors_[inside_group_index]; const auto length = group.length_[inside_group_index]; - if (is_used_var) { - auto *autograd_meta = tensors_[var_index].get_autograd_meta(); - auto &grad_tensor = static_cast(autograd_meta)->Grad(); - group_tensor - .ShareDataWith( - *(std::dynamic_pointer_cast(grad_tensor.impl()))) - .Resize({grad_tensor.numel()}); - } else { - // TODO(shenliang03): maybe save the memory by avoiding tensor construction - if (!group_tensor.initialized()) { - group_tensor.Resize({static_cast(length)}); - group_tensor.mutable_data(inner_place_, group.dtype_); - } - if (HasGrad(var_index)) { - VLOG(3) << "Tensor[" << tensors_[var_index].name() << "] has grad"; - auto grad_tensor = egr::EagerUtils::mutable_grad(tensors_[var_index]); + if (!group.is_sparse_) { + if (is_used_var) { + auto *autograd_meta = tensors_[var_index].get_autograd_meta(); + auto &grad_tensor = + static_cast(autograd_meta)->Grad(); group_tensor .ShareDataWith(*( - std::dynamic_pointer_cast(grad_tensor->impl()))) - .Resize({length}); + std::dynamic_pointer_cast(grad_tensor.impl()))) + .Resize({grad_tensor.numel()}); } else { - VLOG(3) << "Tensor[" << tensors_[var_index].name() - << "] doesn't have grad"; - auto *dev_ctx = platform::DeviceContextPool::Instance().Get(inner_place_); - group_tensor.Resize({static_cast(length)}); - phi::funcs::set_constant(*dev_ctx, &group_tensor, 0.0); + // TODO(shenliang03): maybe save the memory by avoiding tensor + // construction + if (!group_tensor.initialized()) { + group_tensor.Resize({static_cast(length)}); + group_tensor.mutable_data(inner_place_, group.dtype_); + } + if (HasGrad(var_index)) { + VLOG(3) << "Tensor[" << tensors_[var_index].name() << "] has grad"; + auto grad_tensor = egr::EagerUtils::mutable_grad(tensors_[var_index]); + group_tensor + .ShareDataWith(*(std::dynamic_pointer_cast( + grad_tensor->impl()))) + .Resize({length}); + } else { + VLOG(3) << "Tensor[" << tensors_[var_index].name() + << "] doesn't have grad"; + auto *dev_ctx = + platform::DeviceContextPool::Instance().Get(inner_place_); + group_tensor.Resize({static_cast(length)}); + phi::funcs::set_constant(*dev_ctx, &group_tensor, 0.0); + } } + } else { + auto *autograd_meta = tensors_[var_index].get_autograd_meta(); + auto &grad_tensor = static_cast(autograd_meta)->Grad(); + + // 
process sparse group + PADDLE_ENFORCE_EQ( + HasGrad(var_index), true, + platform::errors::PreconditionNotMet( + "The sparse parameter[%d][%s] should have gradient. " + "Currently, DataParallel does not support sparse " + "parameters without generating gradients during training. " + "For example, if is_sparese=True is used in Embedding, " + "the current step of this parameter cannot generate gradient " + "because of stop_gradient/detatch, where error will occur.", + var_index, tensors_[var_index].name())); + + // need to check tensor type + PADDLE_ENFORCE_EQ( + grad_tensor.is_selected_rows(), true, + platform::errors::PreconditionNotMet( + "The sparse parameter[%d][%s] must have a selectedrows gradient. " + "Before forward pass, the parameter type is inferred to be " + "SelectedRows, but after backward pass, its actual type becomes " + "LodTensor. It is currently not supported by DataParallel. " + "For example, if sparse embedding is used, and the weight of " + "embedding is shared with subsequent dense parameters, then " + "the parameter gradient of the embedding will be converted " + "to dense parameters.", + var_index, tensors_[var_index].name())); + + group.sparse_contents_.set_impl(grad_tensor.impl()); } if (--group.pending_ == 0) { @@ -666,7 +707,11 @@ void EagerReducer::MarkGroupReady(size_t group_index) { for (; next_group_ < groups_.size() && groups_[next_group_].pending_ == 0; ++next_group_) { UNUSED auto &group = groups_[next_group_]; - FusedAllReduceSchedule(&group, next_group_); + if (group.is_sparse_) { + AllReduceSparse(&group, next_group_); + } else { + FusedAllReduceSchedule(&group, next_group_); + } } } @@ -725,6 +770,11 @@ void EagerReducer::ProcessUnusedDenseVars() { const auto inside_group_index = var_locator.inside_group_index; auto &src_tensor = group.dense_tensors_[inside_group_index]; + // sparse no need to check and no support find_unused_parameters + if (group.is_sparse_) { + continue; + } + Tensor grad_value(std::make_shared(src_tensor)); auto dest_var_base = tensors_[var_index]; @@ -739,11 +789,15 @@ void EagerReducer::FinalizeBackward() { groups_need_finalize_ = false; grad_need_hooks_ = false; for (auto &group : groups_) { - group.task->Synchronize(); + if (!group.is_sparse_) { + group.task->Synchronize(); + } } for (auto &group : groups_) { - group.SplitTensors(inner_place_); + if (!group.is_sparse_) { + group.SplitTensors(inner_place_); + } } if (find_unused_vars_each_step_) { @@ -778,6 +832,127 @@ void EagerReducer::FusedAllReduceSchedule(EagerGroup *group, // split in FinalizeBackward() } +void EagerReducer::AllReduceSparse(EagerGroup *group, + const int curr_group_index) { + // div nranks + Tensor sparse_tensor(group->sparse_contents_); + paddle::experimental::scale_(sparse_tensor, 1.0 / nranks_, 0.0, false); + + VLOG(3) << "sparse_group [" << curr_group_index << "] start allreduce."; + + auto *dev_ctx = platform::DeviceContextPool::Instance().Get(inner_place_); + if (platform::is_gpu_place(inner_place_)) { +#if defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_RCCL) + dev_ctx = static_cast( + platform::DeviceContextPool::Instance().Get(inner_place_)); +#else + PADDLE_THROW(platform::errors::PermissionDenied( + "Paddle can't concat grad tensors since it's not compiled with NCCL," + "Please recompile or reinstall Paddle with NCCL support.")); +#endif + } else if (platform::is_cpu_place(inner_place_)) { + dev_ctx = static_cast( + platform::DeviceContextPool::Instance().Get(inner_place_)); + } else { + PADDLE_THROW(platform::errors::Unimplemented( + "Split 
grad tensor not supported on place (%s)", inner_place_)); + } + + auto src = std::dynamic_pointer_cast( + group->sparse_contents_.impl()); + const auto &src_rows = src->rows(); + + const auto &rank_ = process_group_->GetRank(); + const auto &size_ = process_group_->GetSize(); + + framework::Vector rows_num_vector(size_); + rows_num_vector[rank_] = static_cast(src_rows.size()); + + Tensor rows_num_tensor = paddle::experimental::empty( + IntArray({static_cast(size_)}), DataType::INT64, inner_place_); + auto *rows_num_dense_tensor = + std::dynamic_pointer_cast(rows_num_tensor.impl()).get(); + framework::TensorFromVector(rows_num_vector, *dev_ctx, + rows_num_dense_tensor); + + distributed::AllreduceOptions opts; + opts.reduce_op = ReduceOp::SUM; + std::vector reduce_tensors = {rows_num_tensor}; + process_group_->AllReduce(reduce_tensors, opts)->Synchronize(); + + framework::TensorToVector(*rows_num_dense_tensor, *dev_ctx, + &rows_num_vector); + dev_ctx->Wait(); + + const auto *cpu_rows_num_ptr = rows_num_vector.data(); + auto rows_num = std::accumulate(cpu_rows_num_ptr, cpu_rows_num_ptr + size_, + static_cast(0)); + + VLOG(3) << "Gather rows: " << string::join_strings(rows_num_vector, ',') + << ", total rows number: " << rows_num + << ", height: " << src->height(); + + dev_ctx->Wait(); + + if (std::all_of(cpu_rows_num_ptr, cpu_rows_num_ptr + size_, + [&](int64_t row) { return row == cpu_rows_num_ptr[0]; })) { + // During sparse communication, the number of each card is same. + // allgather is used to speed up the allreduce by replacing broadcast. + + VLOG(3) << "allgather replaces broadcast to speed up in sparse allreduce"; + + Tensor dst_rows_tensor = + paddle::experimental::empty(IntArray({static_cast(rows_num)}), + DataType::INT64, inner_place_); + Tensor src_rows_tensor = paddle::experimental::empty( + IntArray({static_cast((*src).rows().size())}), DataType::INT64, + inner_place_); + auto *src_rows_dense_tensor = + std::dynamic_pointer_cast(src_rows_tensor.impl()) + .get(); + framework::TensorFromVector((*src).rows(), *dev_ctx, + src_rows_dense_tensor); + + std::vector src_rows_tensors = {src_rows_tensor}; + std::vector dst_rows_tensors = {dst_rows_tensor}; + process_group_->AllGather(src_rows_tensors, dst_rows_tensors) + ->Synchronize(); + + framework::Vector dst_rows_vector(rows_num, 0); + auto *dst_rows_dense_tensor = + std::dynamic_pointer_cast(dst_rows_tensor.impl()) + .get(); + framework::TensorToVector(*dst_rows_dense_tensor, *dev_ctx, + &dst_rows_vector); + dev_ctx->Wait(); + + Tensor src_value_tensor(std::make_shared(src->value())); + std::vector dst_shape = src_value_tensor.shape(); + dst_shape[dst_shape.size() - 2] = rows_num; + auto dst_dense_tensor = std::dynamic_pointer_cast( + paddle::experimental::full(IntArray(dst_shape), 0, + src_value_tensor.dtype(), inner_place_) + .impl()); + + auto dst = + std::make_shared(dst_rows_vector, (*src).height()); + *(dst->mutable_value()) = *dst_dense_tensor; + Tensor dst_value_tensor(std::make_shared(dst->value())); + + std::vector src_value_tensors = {src_value_tensor}; + std::vector dst_value_tensors = {dst_value_tensor}; + process_group_->AllGather(src_value_tensors, dst_value_tensors) + ->Synchronize(); + + src->set_rows(dst_rows_vector); + *(src->mutable_value()) = + *(std::dynamic_pointer_cast(dst_value_tensor.impl())); + } else { + PADDLE_THROW( + platform::errors::Unimplemented("This case is not supported.")); + } +} + std::ostream &operator<<(std::ostream &out, const EagerGroup &group) { const auto &tensors_ = 
group.tensor_indices_; out << "numel: " << group.all_length_ << " ;var number: " << tensors_.size() diff --git a/paddle/fluid/distributed/collective/reducer.h b/paddle/fluid/distributed/collective/reducer.h index 848277f5fad4e..12c02509884e9 100644 --- a/paddle/fluid/distributed/collective/reducer.h +++ b/paddle/fluid/distributed/collective/reducer.h @@ -47,6 +47,8 @@ std::vector> Eager_AssignGroupBySize( class EagerGroup { public: Tensor dense_contents_; + Tensor sparse_contents_; + bool is_sparse_ = false; // for concat kernel std::vector dense_tensors_; @@ -104,6 +106,7 @@ class EagerReducer { void MarkVarReady(const size_t var_index, const bool is_used_var); void MarkGroupReady(const size_t group_index); void FusedAllReduceSchedule(EagerGroup *group, const int curr_group_index); + void AllReduceSparse(EagerGroup *group, const int curr_group_index); void FinalizeBackward(); void TraverseBackwardGraph(const std::vector &outputs); void ProcessUnusedDenseVars(); diff --git a/paddle/fluid/distributed/ps/service/CMakeLists.txt b/paddle/fluid/distributed/ps/service/CMakeLists.txt old mode 100644 new mode 100755 index ab6c2e2600274..b8de291072a1f --- a/paddle/fluid/distributed/ps/service/CMakeLists.txt +++ b/paddle/fluid/distributed/ps/service/CMakeLists.txt @@ -39,8 +39,8 @@ cc_library(server SRCS server.cc DEPS downpour_server boost ${RPC_DEPS}) cc_library(communicator SRCS communicator/communicator.cc DEPS scope client boost table math_function selected_rows_functor ${RPC_DEPS}) cc_library(ps_service SRCS ps_service/service.cc DEPS communicator client server boost ${RPC_DEPS}) -cc_library(heter_server SRCS heter_server.cc DEPS brpc_utils ${COMMON_DEPS} ${RPC_DEPS}) cc_library(heter_client SRCS heter_client.cc DEPS brpc_utils ${COMMON_DEPS} ${RPC_DEPS}) +cc_library(heter_server SRCS heter_server.cc DEPS heter_client brpc_utils ${COMMON_DEPS} ${RPC_DEPS}) set_source_files_properties(ps_service/graph_py_service.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) cc_library(graph_py_service SRCS ps_service/graph_py_service.cc DEPS ps_service) diff --git a/paddle/fluid/distributed/ps/service/brpc_ps_client.cc b/paddle/fluid/distributed/ps/service/brpc_ps_client.cc index 9674717ffc24b..971c448bf2714 100644 --- a/paddle/fluid/distributed/ps/service/brpc_ps_client.cc +++ b/paddle/fluid/distributed/ps/service/brpc_ps_client.cc @@ -55,6 +55,8 @@ DEFINE_int32(pserver_sparse_merge_thread, 1, "pserver sparse merge thread num"); DEFINE_int32(pserver_sparse_table_shard_num, 1000, "sparse table shard for save & load"); +DEFINE_int32(heter_world_size, 100, "group size"); // 可配置 + namespace paddle { namespace framework { class Scope; @@ -78,7 +80,7 @@ void DownpourPsClientService::service( const PsRequestMessage *request, PsResponseMessage *response, ::google::protobuf::Closure *done) { brpc::ClosureGuard done_guard(done); - int ret = _client->handle_client2client_msg( + int ret = _client->HandleClient2ClientMsg( request->cmd_id(), request->client_id(), request->data()); response->set_err_code(0); response->set_err_msg(""); @@ -89,8 +91,8 @@ void DownpourPsClientService::service( } // 启动client端RpcService 用于数据互发等操作 -int32_t BrpcPsClient::start_client_service() { - if (_service.configure(this, _client_id) != 0) { +int32_t BrpcPsClient::StartClientService() { + if (_service.Configure(this, _client_id) != 0) { LOG(ERROR) << "service initialize failed, service_name:DownpourPsClientService"; return -1; @@ -106,12 +108,12 @@ int32_t BrpcPsClient::start_client_service() { return -1; } _server_started = true; - 
_env->registe_ps_client(butil::my_ip_cstr(), _server.listen_address().port, - _client_id); + _env->RegistePsClient(butil::my_ip_cstr(), _server.listen_address().port, + _client_id); return 0; } -int32_t BrpcPsClient::create_client2client_connection( +int32_t BrpcPsClient::CreateClient2ClientConnection( int pserver_timeout_ms, int pserver_connect_timeout_ms, int max_retry) { brpc::ChannelOptions options; options.protocol = "baidu_std"; @@ -120,12 +122,12 @@ int32_t BrpcPsClient::create_client2client_connection( options.connect_timeout_ms = pserver_connect_timeout_ms; options.max_retry = max_retry; - std::vector client_list = _env->get_ps_clients(); + std::vector client_list = _env->GetPsClients(); VLOG(1) << "BrpcPsClient::create_c2c_connection client_list size: " << client_list.size(); for (auto cc : client_list) { VLOG(1) << "BrpcPsClient::create_c2c_connection client_list: " - << cc.to_string(); + << cc.ToString(); } _client_channels.resize(client_list.size()); std::ostringstream os; @@ -152,7 +154,7 @@ int32_t BrpcPsClient::create_client2client_connection( return 0; } -int32_t BrpcPsClient::initialize() { +int32_t BrpcPsClient::Initialize() { _async_call_num = 0; brpc::ChannelOptions options; @@ -167,7 +169,7 @@ int32_t BrpcPsClient::initialize() { std::string client_ip(butil::my_ip_cstr()); // 获取server列表,并连接 - std::vector server_list = _env->get_ps_servers(); + std::vector server_list = _env->GetPsServers(); _server_channels.resize(server_list.size()); for (size_t i = 0; i < server_list.size(); ++i) { server_ip_port.assign(server_list[i].ip.c_str()); @@ -192,7 +194,7 @@ int32_t BrpcPsClient::initialize() { os << server_ip_port << ","; } // 启动client探听接口, 并相互建立连接 - start_client_service(); + StartClientService(); // 异步push 请求队列初始化 const auto &worker_param = _config.worker_param().downpour_worker_param(); @@ -232,13 +234,13 @@ int32_t BrpcPsClient::initialize() { _flushing = false; // 启动异步push线程 _async_push_sparse_thread = - std::thread(std::bind(&BrpcPsClient::push_sparse_task_consume, this)); + std::thread(std::bind(&BrpcPsClient::PushSparseTaskConsume, this)); // _async_push_sparse_thread.detach(); _async_push_dense_thread = - std::thread(std::bind(&BrpcPsClient::push_dense_task_consume, this)); + std::thread(std::bind(&BrpcPsClient::PushDenseTaskConsume, this)); // for debug // _print_thread = - // std::thread(std::bind(&BrpcPsClient::print_queue_size_thread, this)); + // std::thread(std::bind(&BrpcPsClient::PrintQueueSizeThread, this)); return 0; } @@ -284,7 +286,7 @@ std::string DownpourBrpcClosure::get_response(size_t request_idx, int cmd_id) { return data; } -std::future BrpcPsClient::print_table_stat(uint32_t table_id) { +std::future BrpcPsClient::PrintTableStat(uint32_t table_id) { size_t request_call_num = _server_channels.size(); DownpourBrpcClosure *closure = new DownpourBrpcClosure( request_call_num, [request_call_num, table_id](void *done) { @@ -317,7 +319,7 @@ std::future BrpcPsClient::print_table_stat(uint32_t table_id) { closure->request(i)->set_cmd_id(PS_PRINT_TABLE_STAT); closure->request(i)->set_table_id(table_id); closure->request(i)->set_client_id(_client_id); - PsService_Stub rpc_stub(get_cmd_channel(i)); + PsService_Stub rpc_stub(GetCmdChannel(i)); closure->cntl(i)->set_timeout_ms( 10800000); // cmd msg don't limit timeout for save/load rpc_stub.service(closure->cntl(i), closure->request(i), @@ -325,7 +327,7 @@ std::future BrpcPsClient::print_table_stat(uint32_t table_id) { } return fut; } -std::future BrpcPsClient::send_cmd( +std::future BrpcPsClient::SendCmd( 
uint32_t table_id, int cmd_id, const std::vector ¶ms) { size_t request_call_num = _server_channels.size(); DownpourBrpcClosure *closure = new DownpourBrpcClosure( @@ -350,7 +352,7 @@ std::future BrpcPsClient::send_cmd( for (const auto ¶m : params) { closure->request(i)->add_params(param); } - PsService_Stub rpc_stub(get_cmd_channel(i)); + PsService_Stub rpc_stub(GetCmdChannel(i)); closure->cntl(i)->set_timeout_ms( 10800000 * 2); // cmd msg don't limit timeout for save/load rpc_stub.service(closure->cntl(i), closure->request(i), @@ -359,7 +361,7 @@ std::future BrpcPsClient::send_cmd( return fut; } -std::future BrpcPsClient::send_save_cmd( +std::future BrpcPsClient::SendSaveCmd( uint32_t table_id, int cmd_id, const std::vector ¶ms) { size_t request_call_num = _server_channels.size(); DownpourBrpcClosure *closure = new DownpourBrpcClosure( @@ -390,7 +392,7 @@ std::future BrpcPsClient::send_save_cmd( for (const auto ¶m : params) { closure->request(i)->add_params(param); } - PsService_Stub rpc_stub(get_cmd_channel(i)); + PsService_Stub rpc_stub(GetCmdChannel(i)); closure->cntl(i)->set_timeout_ms( 10800000); // cmd msg don't limit timeout for save/load rpc_stub.service(closure->cntl(i), closure->request(i), @@ -399,65 +401,42 @@ std::future BrpcPsClient::send_save_cmd( return fut; } -std::future BrpcPsClient::shrink(uint32_t table_id, +std::future BrpcPsClient::Shrink(uint32_t table_id, const std::string threshold) { - return send_cmd(table_id, PS_SHRINK_TABLE, {threshold}); + return SendCmd(table_id, PS_SHRINK_TABLE, {threshold}); } -std::future BrpcPsClient::load(const std::string &epoch, +std::future BrpcPsClient::Load(const std::string &epoch, const std::string &mode) { - return send_cmd(-1, PS_LOAD_ALL_TABLE, {epoch, mode}); + return SendCmd(-1, PS_LOAD_ALL_TABLE, {epoch, mode}); } -std::future BrpcPsClient::load(uint32_t table_id, +std::future BrpcPsClient::Load(uint32_t table_id, const std::string &epoch, const std::string &mode) { - return send_cmd(table_id, PS_LOAD_ONE_TABLE, {epoch, mode}); -} - -std::future BrpcPsClient::Load(const LoadSaveContext &load_context) { - if (load_context.table_id < 0) { - return send_cmd(-1, PS_LOAD_ALL_TABLE, - {load_context.epoch, load_context.mode}); - } else { - return send_cmd(load_context.table_id, PS_LOAD_ONE_TABLE, - {load_context.epoch, load_context.mode}); - } + return SendCmd(table_id, PS_LOAD_ONE_TABLE, {epoch, mode}); } -std::future BrpcPsClient::save(const std::string &epoch, +std::future BrpcPsClient::Save(const std::string &epoch, const std::string &mode) { VLOG(1) << "BrpcPsClient::save path " << epoch; - return send_save_cmd(-1, PS_SAVE_ALL_TABLE, {epoch, mode}); + return SendSaveCmd(-1, PS_SAVE_ALL_TABLE, {epoch, mode}); } -std::future BrpcPsClient::save(uint32_t table_id, +std::future BrpcPsClient::Save(uint32_t table_id, const std::string &epoch, const std::string &mode) { VLOG(1) << "BrpcPsClient::save one table path " << epoch << " table_id " << table_id; - return send_save_cmd(table_id, PS_SAVE_ONE_TABLE, {epoch, mode}); -} - -std::future BrpcPsClient::Save(const LoadSaveContext &save_context) { - if (save_context.table_id < 0) { - VLOG(1) << "BrpcPsClient::save path " << save_context.epoch; - return send_save_cmd(-1, PS_SAVE_ALL_TABLE, - {save_context.epoch, save_context.mode}); - } else { - VLOG(1) << "BrpcPsClient::save one table path " << save_context.epoch - << " table_id " << save_context.table_id; - return send_save_cmd(save_context.table_id, PS_SAVE_ONE_TABLE, - {save_context.epoch, save_context.mode}); - } + return 
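// Hedged illustration (not part of the patch): SendCmd/SendSaveCmd above keep
// the existing closure contract and only change names — the caller holds a
// std::future<int32_t>, the closure holds the matching shared promise, and the
// promise is fulfilled once every per-server response has been checked. The
// self-contained sketch below reproduces that contract with plain threads;
// CountdownClosure and the fake RPCs are hypothetical, not PaddlePaddle types.
#include <atomic>
#include <cstdint>
#include <future>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

class CountdownClosure {
 public:
  CountdownClosure(size_t expected, std::shared_ptr<std::promise<int32_t>> p)
      : remaining_(expected), promise_(std::move(p)) {}
  // Called once per server response; the last caller publishes the result.
  void Done(bool ok) {
    if (!ok) failed_ = true;
    if (remaining_.fetch_sub(1) == 1) promise_->set_value(failed_ ? -1 : 0);
  }

 private:
  std::atomic<size_t> remaining_;
  std::atomic<bool> failed_{false};
  std::shared_ptr<std::promise<int32_t>> promise_;
};

int main() {
  const size_t request_call_num = 3;  // one request per pserver shard
  auto promise = std::make_shared<std::promise<int32_t>>();
  std::future<int32_t> fut = promise->get_future();
  auto closure = std::make_shared<CountdownClosure>(request_call_num, promise);

  std::vector<std::thread> fake_rpcs;
  for (size_t i = 0; i < request_call_num; ++i) {
    fake_rpcs.emplace_back([closure] { closure->Done(/*ok=*/true); });
  }
  for (auto &t : fake_rpcs) t.join();

  std::cout << "all servers answered, status = " << fut.get() << "\n";
  return 0;
}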
SendSaveCmd(table_id, PS_SAVE_ONE_TABLE, {epoch, mode}); } -std::future BrpcPsClient::clear() { - return send_cmd(-1, PS_CLEAR_ALL_TABLE, {}); +std::future BrpcPsClient::Clear() { + return SendCmd(-1, PS_CLEAR_ALL_TABLE, {}); } -std::future BrpcPsClient::clear(uint32_t table_id) { - return send_cmd(table_id, PS_CLEAR_ONE_TABLE, {}); +std::future BrpcPsClient::Clear(uint32_t table_id) { + return SendCmd(table_id, PS_CLEAR_ONE_TABLE, {}); } -std::future BrpcPsClient::flush() { +std::future BrpcPsClient::Flush() { VLOG(0) << "BrpcPsClient::flush begin"; _flushing = true; std::promise promise; @@ -470,106 +449,69 @@ std::future BrpcPsClient::flush() { promise.set_value(0); _flushing = false; VLOG(0) << "BrpcPsClient::flush done"; - print_queue_size(); + PrintQueueSize(); return fut; } -void BrpcPsClient::print_queue_size() { +void BrpcPsClient::PrintQueueSize() { for (auto &push_sparse_task_itr : _push_sparse_task_queue_map) { auto table_id = push_sparse_task_itr.first; auto queue_size = push_sparse_task_itr.second->Size(); - VLOG(0) << "BrpcPsClient::print_queue_size: table " << table_id + VLOG(0) << "BrpcPsClient::PrintQueueSize: table " << table_id << " size: " << queue_size; } for (auto &task_queue_itr : _push_dense_task_queue_map) { auto table_id = task_queue_itr.first; auto queue_size = task_queue_itr.second->Size(); - VLOG(0) << "BrpcPsClient::print_queue_size: table " << table_id + VLOG(0) << "BrpcPsClient::PrintQueueSize: table " << table_id << " size: " << queue_size; } } -void BrpcPsClient::print_queue_size_thread() { +void BrpcPsClient::PrintQueueSizeThread() { while (_running) { usleep(1000000 * 60 * 2); - print_queue_size(); + PrintQueueSize(); } } -void BrpcPsClient::finalize_worker() { - flush(); - VLOG(0) << "BrpcPsClient::finalize_worker begin join thread"; +void BrpcPsClient::FinalizeWorker() { + Flush(); + VLOG(0) << "BrpcPsClient::FinalizeWorker begin join thread"; _running = false; _async_push_dense_thread.join(); _async_push_sparse_thread.join(); // _print_thread.join(); - VLOG(0) << "BrpcPsClient::finalize_worker begin join server"; + VLOG(0) << "BrpcPsClient::FinalizeWorker begin join server"; _server.Stop(1000); _server.Join(); _server_started = false; - VLOG(0) << "BrpcPsClient::finalize_worker done"; + VLOG(0) << "BrpcPsClient::FinalizeWorker done"; } -std::future BrpcPsClient::stop_server() { - return send_cmd(-1, PS_STOP_SERVER, {}); +std::future BrpcPsClient::StopServer() { + return SendCmd(-1, PS_STOP_SERVER, {}); } -std::future BrpcPsClient::start_profiler() { - return send_cmd(-1, PS_START_PROFILER, {}); +std::future BrpcPsClient::StartProfiler() { + return SendCmd(-1, PS_START_PROFILER, {}); } -std::future BrpcPsClient::stop_profiler() { - return send_cmd(-1, PS_STOP_PROFILER, {}); +std::future BrpcPsClient::StopProfiler() { + return SendCmd(-1, PS_STOP_PROFILER, {}); } -std::future BrpcPsClient::barrier(size_t table_id, +std::future BrpcPsClient::Barrier(size_t table_id, uint32_t barrier_type) { - return send_cmd(table_id, PS_BARRIER, {std::to_string(barrier_type)}); -} - -std::future BrpcPsClient::Pull(RequestContext &pull_context) { - if (pull_context.value_type == Dense) { // pull dense - Region *dense_region = - reinterpret_cast(pull_context.dense_values); - return pull_dense(dense_region, pull_context.num, pull_context.table); - } else { // pull sparse - size_t table_id = pull_context.table; - size_t num = pull_context.num; - bool is_training = pull_context.is_training; - if (pull_context.training_mode == Geo) { // for geo - return 
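// Hedged illustration (not part of the patch): FinalizeWorker above keeps the
// old shutdown order under a new name — flush pending pushes, clear _running so
// the consumer threads fall out of their loops, then join them and stop the
// brpc server. A stand-alone miniature of that flag-then-join shutdown; the
// names here are generic, not the real BrpcPsClient members.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
  std::atomic<bool> running{true};
  std::thread consumer([&running] {
    while (running.load()) {  // PushDenseTaskConsume-style loop
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    std::cout << "consumer drained, exiting\n";
  });

  // FinalizeWorker(): Flush() would run here, then:
  running = false;   // _running = false
  consumer.join();   // _async_push_*_thread.join()
  std::cout << "worker finalized\n";
  return 0;
}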
pull_sparse_param(pull_context.sparse_values, table_id, - pull_context.keys, num, is_training); - } else if (pull_context.training_mode == Async) { // for async - return pull_sparse(pull_context.sparse_values, table_id, - pull_context.keys, num, is_training); - } - } -} - -std::future BrpcPsClient::Push(RequestContext &push_context) { - if (push_context.value_type == Dense) { // push dense - const Region *dense_region = push_context.push_context.push_dense_values; - return push_dense(dense_region, push_context.num, push_context.table); - } else { // push sparse - size_t table_id = push_context.table; - size_t num = push_context.num; - bool is_training = push_context.is_training; - if (push_context.training_mode == Geo) { // for geo - // TODO(zhaocaibei) - } else if (push_context.training_mode == Async) { // for async - const uint64_t *keys = push_context.push_context.keys; - const float **update_values = push_context.push_context.push_values; - return push_sparse(table_id, keys, update_values, num); - } - } + return SendCmd(table_id, PS_BARRIER, {std::to_string(barrier_type)}); } -std::future BrpcPsClient::pull_geo_param(size_t table_id, - std::vector *values, - std::vector *keys, - int pserver_idx) { - auto *accessor = table_accessor(table_id); +std::future BrpcPsClient::PullGeoParam(size_t table_id, + std::vector *values, + std::vector *keys, + int pserver_idx) { + auto *accessor = GetTableAccessor(table_id); DownpourBrpcClosure *closure = new DownpourBrpcClosure(1, [keys, values, accessor](void *done) { int ret = 0; @@ -583,12 +525,12 @@ std::future BrpcPsClient::pull_geo_param(size_t table_id, io_buffer_itr.copy_and_forward(reinterpret_cast(&shard_nums), sizeof(uint32_t)); keys->resize(shard_nums); - values->resize(shard_nums * accessor->GetTableInfo(UPDATE_DIM)); + values->resize(shard_nums * accessor->GetAccessorInfo().update_dim); io_buffer_itr.copy_and_forward((void *)(keys->data()), // NOLINT sizeof(uint64_t) * shard_nums); io_buffer_itr.copy_and_forward( (void *)(values->data()), // NOLINT - shard_nums * accessor->GetTableInfo(UPDATE_SIZE)); + shard_nums * accessor->GetAccessorInfo().update_size); closure->set_promise_value(ret); }); auto promise = std::make_shared>(); @@ -598,7 +540,7 @@ std::future BrpcPsClient::pull_geo_param(size_t table_id, closure->request(0)->set_cmd_id(PS_PULL_GEO_PARAM); closure->request(0)->set_table_id(table_id); closure->request(0)->set_client_id(_client_id); - PsService_Stub rpc_stub(get_cmd_channel(pserver_idx)); + PsService_Stub rpc_stub(GetCmdChannel(pserver_idx)); closure->cntl(0)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(0), closure->request(0), closure->response(0), closure); @@ -606,10 +548,11 @@ std::future BrpcPsClient::pull_geo_param(size_t table_id, } // for GEO -std::future BrpcPsClient::push_sparse_param( - size_t table_id, const uint64_t *keys, const float **update_values, - size_t num, void *done) { - auto *accessor = table_accessor(table_id); +std::future BrpcPsClient::PushSparseParam(size_t table_id, + const uint64_t *keys, + const float **update_values, + size_t num, void *done) { + auto *accessor = GetTableAccessor(table_id); // 发送RPC请求 DownpourBrpcClosure *closure = reinterpret_cast(done); auto promise = std::make_shared>(); @@ -630,7 +573,7 @@ std::future BrpcPsClient::push_sparse_param( auto kvs = ids[shard_idx]; auto value_ptr = value_ptrs[shard_idx]; size_t kv_size = kvs.size(); - uint32_t value_size = accessor->GetTableInfo(UPDATE_SIZE); + uint32_t value_size = 
accessor->GetAccessorInfo().update_size; // 发送RPC请求 auto *push_request = closure->request(shard_idx); push_request->set_cmd_id(PS_PUSH_SPARSE_PARAM); @@ -638,16 +581,15 @@ std::future BrpcPsClient::push_sparse_param( push_request->set_client_id(_client_id); push_request->add_params((char *)&kv_size, sizeof(uint32_t)); // NOLINT auto *push_data = push_request->mutable_data(); - push_data->resize(kv_size * - (sizeof(uint64_t) + accessor->GetTableInfo(UPDATE_SIZE))); + push_data->resize(kv_size * (sizeof(uint64_t) + value_size)); char *push_data_ptr = const_cast(push_data->data()); memcpy(push_data_ptr, kvs.data(), kv_size * sizeof(uint64_t)); push_data_ptr += kv_size * sizeof(uint64_t); for (int i = 0; i < kv_size; ++i) { - memcpy(push_data_ptr, value_ptr[i], accessor->GetTableInfo(UPDATE_SIZE)); - push_data_ptr += accessor->GetTableInfo(UPDATE_SIZE); + memcpy(push_data_ptr, value_ptr[i], value_size); + push_data_ptr += value_size; } - PsService_Stub rpc_stub(get_sparse_channel(shard_idx)); + PsService_Stub rpc_stub(GetSparseChannel(shard_idx)); closure->cntl(shard_idx)->set_request_compress_type( (brpc::CompressType)FLAGS_pserver_communicate_compress_type); rpc_stub.service(closure->cntl(shard_idx), closure->request(shard_idx), @@ -656,16 +598,13 @@ std::future BrpcPsClient::push_sparse_param( return fut; } -std::future BrpcPsClient::pull_dense(Region *regions, - size_t region_num, - size_t table_id) { +std::future BrpcPsClient::PullDense(Region *regions, size_t region_num, + size_t table_id) { auto timer = std::make_shared("pserver_client_pull_dense"); - auto *accessor = table_accessor(table_id); - auto fea_dim = accessor->GetTableInfo(FEA_DIM); - auto select_size = accessor->GetTableInfo(SELECT_SIZE); + auto *accessor = GetTableAccessor(table_id); + auto fea_dim = accessor->GetAccessorInfo().fea_dim; size_t request_call_num = _server_channels.size(); - uint32_t num_per_shard = - dense_dim_per_shard(accessor->GetTableInfo(FEA_DIM), request_call_num); + uint32_t num_per_shard = DenseDimPerShard(fea_dim, request_call_num); // callback 将各shard结果,顺序填入region DownpourBrpcClosure *closure = new DownpourBrpcClosure( request_call_num, [request_call_num, num_per_shard, regions, region_num, @@ -675,7 +614,7 @@ std::future BrpcPsClient::pull_dense(Region *regions, size_t region_data_idx = 0; // 当前填充的region内data偏移 auto *closure = reinterpret_cast(done); size_t shard_data_size = - num_per_shard * accessor->GetTableInfo(SELECT_SIZE); + num_per_shard * accessor->GetAccessorInfo().select_size; for (size_t i = 0; i < request_call_num; ++i) { if (closure->check_response(i, PS_PULL_DENSE_TABLE) != 0) { ret = -1; @@ -728,23 +667,24 @@ std::future BrpcPsClient::pull_dense(Region *regions, closure->request(i)->set_client_id(_client_id); closure->request(i)->add_params((char *)&num_per_shard, // NOLINT sizeof(num_per_shard)); - PsService_Stub rpc_stub(get_dense_channel(i)); + PsService_Stub rpc_stub(GetDenseChannel(i)); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); } return fut; } -std::future BrpcPsClient::push_dense_param(const Region *regions, - size_t region_num, - size_t table_id) { - auto *accessor = table_accessor(table_id); +std::future BrpcPsClient::PushDenseParam(const Region *regions, + size_t region_num, + size_t table_id) { + auto *accessor = GetTableAccessor(table_id); + auto accessor_info = accessor->GetAccessorInfo(); size_t request_call_num = _server_channels.size(); // 1.拆分Region数据到shard中,后续多shard并行拷贝数据 std::vector> regions_partition(request_call_num); 
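// Hedged illustration (not part of the patch): the PushSparseParam /
// PushSparseRawGradient hunks above now read the per-key byte width from
// accessor->GetAccessorInfo().update_size and pack the request body as
// [kv_size keys][kv_size fixed-size value blobs]. A self-contained version of
// that packing, with update_size passed in directly instead of an accessor:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

std::string PackSparsePush(const std::vector<uint64_t> &keys,
                           const std::vector<const float *> &values,
                           size_t update_size /* bytes per key */) {
  const size_t kv_size = keys.size();
  std::string push_data;
  push_data.resize(kv_size * (sizeof(uint64_t) + update_size));
  char *ptr = &push_data[0];
  std::memcpy(ptr, keys.data(), kv_size * sizeof(uint64_t));  // key block first
  ptr += kv_size * sizeof(uint64_t);
  for (size_t i = 0; i < kv_size; ++i) {  // then one fixed-size blob per key
    std::memcpy(ptr, values[i], update_size);
    ptr += update_size;
  }
  return push_data;
}

int main() {
  std::vector<uint64_t> keys = {11, 42};
  std::vector<float> v0 = {0.1f, 0.2f}, v1 = {0.3f, 0.4f};
  std::vector<const float *> values = {v0.data(), v1.data()};
  std::string body =
      PackSparsePush(keys, values, /*update_size=*/2 * sizeof(float));
  std::cout << "packed " << body.size() << " bytes for " << keys.size()
            << " keys\n";
  return 0;
}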
uint32_t num_per_shard = - dense_dim_per_shard(accessor->GetTableInfo(FEA_DIM), request_call_num); - size_t shard_data_size = num_per_shard * accessor->GetTableInfo(UPDATE_SIZE); + DenseDimPerShard(accessor_info.fea_dim, request_call_num); + size_t shard_data_size = num_per_shard * accessor_info.update_size; size_t current_region_idx = 0; size_t current_region_data_idx = 0; for (size_t i = 0; i < request_call_num; ++i) { @@ -807,17 +747,17 @@ std::future BrpcPsClient::push_dense_param(const Region *regions, fill_num); fill_remain_size -= fill_num; } - PsService_Stub rpc_stub(get_dense_channel(i)); + PsService_Stub rpc_stub(GetDenseChannel(i)); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); } return fut; } -std::future BrpcPsClient::push_sparse_raw_gradient( +std::future BrpcPsClient::PushSparseRawGradient( size_t table_id, const uint64_t *keys, const float **update_values, size_t num, void *done) { - auto *accessor = table_accessor(table_id); + auto *accessor = GetTableAccessor(table_id); // 发送RPC请求 DownpourBrpcClosure *closure = reinterpret_cast(done); auto promise = std::make_shared>(); @@ -851,7 +791,7 @@ std::future BrpcPsClient::push_sparse_raw_gradient( auto value_ptr = value_ptrs[shard_idx]; size_t kv_size = kvs.size(); - uint32_t value_size = accessor->GetTableInfo(UPDATE_SIZE); + uint32_t value_size = accessor->GetAccessorInfo().update_size; // 发送RPC请求 auto *push_request = closure->request(shard_idx); @@ -860,17 +800,16 @@ std::future BrpcPsClient::push_sparse_raw_gradient( push_request->set_client_id(_client_id); push_request->add_params((char *)&kv_size, sizeof(uint32_t)); // NOLINT auto *push_data = push_request->mutable_data(); - push_data->resize(kv_size * - (sizeof(uint64_t) + accessor->GetTableInfo(UPDATE_SIZE))); + push_data->resize(kv_size * (sizeof(uint64_t) + value_size)); char *push_data_ptr = const_cast(push_data->data()); memcpy(push_data_ptr, kvs.data(), kv_size * sizeof(uint64_t)); push_data_ptr += kv_size * sizeof(uint64_t); for (int i = 0; i < kv_size; ++i) { - memcpy(push_data_ptr, value_ptr[i], accessor->GetTableInfo(UPDATE_SIZE)); - push_data_ptr += accessor->GetTableInfo(UPDATE_SIZE); + memcpy(push_data_ptr, value_ptr[i], value_size); + push_data_ptr += value_size; } - PsService_Stub rpc_stub(get_sparse_channel(shard_idx)); + PsService_Stub rpc_stub(GetSparseChannel(shard_idx)); closure->cntl(shard_idx)->set_request_compress_type( (brpc::CompressType)FLAGS_pserver_communicate_compress_type); rpc_stub.service(closure->cntl(shard_idx), closure->request(shard_idx), @@ -879,7 +818,7 @@ std::future BrpcPsClient::push_sparse_raw_gradient( return fut; } -std::future BrpcPsClient::push_dense_raw_gradient( +std::future BrpcPsClient::PushDenseRawGradient( int table_id, float *total_send_data, size_t total_send_data_size, void *done) { size_t request_call_num = _server_channels.size(); @@ -887,9 +826,9 @@ std::future BrpcPsClient::push_dense_raw_gradient( auto promise = std::make_shared>(); closure->add_promise(promise); std::future fut = promise->get_future(); - auto *accessor = table_accessor(table_id); + auto *accessor = GetTableAccessor(table_id); uint32_t num_per_shard = - dense_dim_per_shard(accessor->GetTableInfo(FEA_DIM), request_call_num); + DenseDimPerShard(accessor->GetAccessorInfo().fea_dim, request_call_num); for (size_t i = 0; i < request_call_num; ++i) { closure->request(i)->set_cmd_id(PS_PUSH_DENSE_TABLE); closure->request(i)->set_table_id(table_id); @@ -903,16 +842,16 @@ std::future 
BrpcPsClient::push_dense_raw_gradient( total_send_data + i * num_per_shard, num_per_shard * sizeof(float)); // closure->cntl(i)->set_request_compress_type( // (brpc::CompressType)FLAGS_pserver_communicate_compress_type); - PsService_Stub rpc_stub(get_dense_channel(i)); + PsService_Stub rpc_stub(GetDenseChannel(i)); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); } return fut; } -std::future BrpcPsClient::push_global_step(int table_id, - int64_t *total_send_data, - void *done) { +std::future BrpcPsClient::PushGlobalStep(int table_id, + int64_t *total_send_data, + void *done) { size_t request_call_num = _server_channels.size(); DownpourBrpcClosure *closure = reinterpret_cast(done); auto promise = std::make_shared>(); @@ -931,17 +870,17 @@ std::future BrpcPsClient::push_global_step(int table_id, memcpy(push_data_ptr + sizeof(uint32_t), total_send_data, num_per_shard * sizeof(int64_t)); - PsService_Stub rpc_stub(get_dense_channel(i)); + PsService_Stub rpc_stub(GetDenseChannel(i)); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); } return fut; } -std::future BrpcPsClient::pull_sparse(float **select_values, - size_t table_id, - const uint64_t *keys, size_t num, - bool is_training) { +std::future BrpcPsClient::PullSparse(float **select_values, + size_t table_id, + const uint64_t *keys, size_t num, + bool is_training) { auto timer = std::make_shared("pserver_client_pull_sparse"); auto local_timer = std::make_shared("pserver_client_pull_sparse_local"); @@ -966,9 +905,9 @@ std::future BrpcPsClient::pull_sparse(float **select_values, shard_sorted_kvs->at(shard_id).push_back({keys[i], select_values[i]}); } - auto *accessor = table_accessor(table_id); + auto *accessor = GetTableAccessor(table_id); - size_t value_size = accessor->GetTableInfo(SELECT_SIZE); + size_t value_size = accessor->GetAccessorInfo().select_size; DownpourBrpcClosure *closure = new DownpourBrpcClosure( request_call_num, [shard_sorted_kvs, value_size](void *done) { @@ -1053,7 +992,7 @@ std::future BrpcPsClient::pull_sparse(float **select_values, closure->request(i)->set_client_id(_client_id); closure->request(i)->add_params((char *)&kv_request_count, // NOLINT sizeof(uint32_t)); - PsService_Stub rpc_stub(get_cmd_channel(i)); + PsService_Stub rpc_stub(GetCmdChannel(i)); closure->cntl(i)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); @@ -1063,11 +1002,11 @@ std::future BrpcPsClient::pull_sparse(float **select_values, } // for GEO -std::future BrpcPsClient::pull_sparse_param(float **select_values, - size_t table_id, - const uint64_t *keys, - size_t num, - bool is_training) { +std::future BrpcPsClient::PullSparseParam(float **select_values, + size_t table_id, + const uint64_t *keys, + size_t num, + bool is_training) { auto timer = std::make_shared("pserver_client_pull_sparse_param"); size_t request_call_num = _server_channels.size(); @@ -1080,9 +1019,8 @@ std::future BrpcPsClient::pull_sparse_param(float **select_values, shard_sorted_kvs->at(shard_id).push_back({keys[i], select_values[i]}); } - auto *accessor = table_accessor(table_id); - size_t value_size = accessor->GetTableInfo(SELECT_SIZE); - + auto *accessor = GetTableAccessor(table_id); + size_t value_size = accessor->GetAccessorInfo().select_size; DownpourBrpcClosure *closure = new DownpourBrpcClosure( request_call_num, [shard_sorted_kvs, value_size](void *done) { int ret = 0; @@ -1167,7 +1105,7 @@ std::future 
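// Hedged illustration (not part of the patch): PullSparse / PullSparseParam
// above first bucket the requested keys into one shard per pserver and then
// issue a single request per channel. The bucketing rule itself is outside the
// shown hunks; the sketch below ASSUMES the common "key % server_count" scheme
// purely for illustration.
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

int main() {
  const size_t request_call_num = 3;  // number of pserver shards
  std::vector<uint64_t> keys = {5, 9, 12, 16, 23};
  std::vector<float> slots(keys.size());  // where pulled values will land

  // shard_id -> (key, destination pointer), mirroring shard_sorted_kvs above.
  std::vector<std::vector<std::pair<uint64_t, float *>>> shard_sorted_kvs(
      request_call_num);
  for (size_t i = 0; i < keys.size(); ++i) {
    size_t shard_id = keys[i] % request_call_num;  // assumed placement rule
    shard_sorted_kvs[shard_id].push_back({keys[i], &slots[i]});
  }

  for (size_t s = 0; s < request_call_num; ++s) {
    std::cout << "server " << s << " gets " << shard_sorted_kvs[s].size()
              << " keys\n";
  }
  return 0;
}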
BrpcPsClient::pull_sparse_param(float **select_values, closure->request(i)->set_client_id(_client_id); closure->request(i)->add_params((char *)&kv_request_count, // NOLINT sizeof(uint32_t)); - PsService_Stub rpc_stub(get_cmd_channel(i)); + PsService_Stub rpc_stub(GetCmdChannel(i)); closure->cntl(i)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); @@ -1176,7 +1114,7 @@ std::future BrpcPsClient::pull_sparse_param(float **select_values, return fut; } -std::future BrpcPsClient::send_client2client_msg( +std::future BrpcPsClient::SendClient2ClientMsg( int msg_type, int to_client_id, const std::string &msg) { auto promise = std::make_shared>(); std::future fut = promise->get_future(); @@ -1201,11 +1139,11 @@ std::future BrpcPsClient::send_client2client_msg( return fut; } -std::future BrpcPsClient::push_sparse_raw_gradient_partial( +std::future BrpcPsClient::PushSparseRawGradientPartial( size_t table_id, const uint64_t *keys, const float **update_values, uint32_t num, void *done, int pserver_idx) { - auto *accessor = table_accessor(table_id); - size_t value_size = accessor->GetTableInfo(UPDATE_SIZE); + auto *accessor = GetTableAccessor(table_id); + size_t value_size = accessor->GetAccessorInfo().update_size; DownpourBrpcClosure *closure = reinterpret_cast(done); auto promise = std::make_shared>(); closure->add_promise(promise); @@ -1226,7 +1164,7 @@ std::future BrpcPsClient::push_sparse_raw_gradient_partial( memcpy(push_data_ptr, update_values[i], value_size); push_data_ptr += value_size; } - PsService_Stub rpc_stub(get_sparse_channel(pserver_idx)); + PsService_Stub rpc_stub(GetSparseChannel(pserver_idx)); closure->cntl(0)->set_request_compress_type( (brpc::CompressType)FLAGS_pserver_communicate_compress_type); rpc_stub.service(closure->cntl(0), closure->request(0), closure->response(0), @@ -1234,8 +1172,8 @@ std::future BrpcPsClient::push_sparse_raw_gradient_partial( return fut; } -int32_t BrpcPsClient::recv_and_save_table(const uint64_t table_id, - const std::string &path) { +int32_t BrpcPsClient::RecvAndSaveTable(const uint64_t table_id, + const std::string &path) { // get var information std::string var_name = ""; int64_t var_num = 0; @@ -1269,17 +1207,17 @@ int32_t BrpcPsClient::recv_and_save_table(const uint64_t table_id, save_vec.push_back(save_huge_vec.data() + i * var_shape); } - VLOG(2) << "recv_and_save_table: table_class: " << table_class; + VLOG(2) << "RecvAndSaveTable: table_class: " << table_class; // TODO(zhaocaibei123): new GeoBrpcPSClient, move this to its - // recv_and_save_table + // RecvAndSaveTable if (table_class == "MemorySparseGeoTable") { auto status = - pull_sparse_param(reinterpret_cast(save_vec.data()), table_id, - save_key.data(), save_key.size(), true); + PullSparseParam(reinterpret_cast(save_vec.data()), table_id, + save_key.data(), save_key.size(), true); status.wait(); } else { - auto status = pull_sparse(reinterpret_cast(save_vec.data()), - table_id, save_key.data(), save_key.size(), true); + auto status = PullSparse(reinterpret_cast(save_vec.data()), + table_id, save_key.data(), save_key.size(), true); status.wait(); } @@ -1313,15 +1251,15 @@ int32_t BrpcPsClient::recv_and_save_table(const uint64_t table_id, return 0; } -std::future BrpcPsClient::push_sparse(size_t table_id, - const uint64_t *keys, - const float **update_values, - size_t num) { +std::future BrpcPsClient::PushSparse(size_t table_id, + const uint64_t *keys, + const float **update_values, + size_t num) { auto push_timer = 
std::make_shared("pserver_client_push_sparse"); CostTimer parse_timer("pserver_client_push_sparse_parse"); int push_sparse_async_num = _push_sparse_task_queue_map[table_id]->Size(); while (push_sparse_async_num > FLAGS_pserver_max_async_call_num) { - // LOG(INFO) << "push_sparse Waiting for async_call_num comsume, + // LOG(INFO) << "PushSparse Waiting for async_call_num comsume, // task_num:" // << push_sparse_async_num // << ", max_task_limit:" << FLAGS_pserver_max_async_call_num; @@ -1331,7 +1269,7 @@ std::future BrpcPsClient::push_sparse(size_t table_id, auto put_timer = std::make_shared("client_push_sparse_put"); thread_local std::vector>> shard_sorted_kv_list; - auto *accessor = table_accessor(table_id); + auto *accessor = GetTableAccessor(table_id); size_t request_call_num = _server_channels.size(); shard_sorted_kv_list.resize(request_call_num); for (auto &x : shard_sorted_kv_list) { @@ -1365,7 +1303,7 @@ std::future BrpcPsClient::push_sparse(size_t table_id, shard_kv_data.kv_num = 0; continue; } - uint32_t value_size = accessor->GetTableInfo(UPDATE_SIZE); + uint32_t value_size = accessor->GetAccessorInfo().update_size; for (size_t kv_idx = 0; kv_idx < sorted_kv_size; ++kv_idx) { shard_kv_data.key_list[kv_idx] = sorted_kv_list[kv_idx].first; shard_kv_data.value_list[kv_idx].assign( @@ -1379,7 +1317,7 @@ std::future BrpcPsClient::push_sparse(size_t table_id, return fut; } -void BrpcPsClient::push_sparse_task_consume() { +void BrpcPsClient::PushSparseTaskConsume() { uint64_t merge_size = FLAGS_pserver_push_sparse_merge_limit; std::vector> task_list; size_t request_call_num = _server_channels.size(); @@ -1390,7 +1328,7 @@ void BrpcPsClient::push_sparse_task_consume() { // 所有sparseTable的pushTask 进行处理 for (auto &push_sparse_task_itr : _push_sparse_task_queue_map) { auto table_id = push_sparse_task_itr.first; - auto *accessor = table_accessor(table_id); + auto *accessor = GetTableAccessor(table_id); auto &task_queue = push_sparse_task_itr.second; auto queue_size = task_queue->Size(); if (queue_size == 0) { @@ -1469,7 +1407,7 @@ void BrpcPsClient::push_sparse_task_consume() { for (int shard_idx = 0; shard_idx < request_call_num; ++shard_idx) { merge_status[shard_idx] = async_push_sparse_shard_threads.enqueue(std::bind( - &BrpcPsClient::push_sparse_async_shard_push, this, task_list, + &BrpcPsClient::PushSparseAsyncShardPush, this, task_list, request_kv_num, table_id, shard_idx, closure, accessor)); } for (int shard_idx = 0; shard_idx < request_call_num; ++shard_idx) { @@ -1485,7 +1423,7 @@ void BrpcPsClient::push_sparse_task_consume() { for (int shard_idx = 0; shard_idx < request_call_num; ++shard_idx) { merge_status[shard_idx] = async_push_sparse_shard_threads.enqueue(std::bind( - &BrpcPsClient::push_sparse_async_shard_merge, this, task_list, + &BrpcPsClient::PushSparseAsyncShardMerge, this, task_list, request_kv_num, table_id, shard_idx, accessor)); } for (int shard_idx = 0; shard_idx < request_call_num; ++shard_idx) { @@ -1511,23 +1449,23 @@ void BrpcPsClient::push_sparse_task_consume() { void sparse_local_merge(ValueAccessor *accessor, float *merge_data, const float *another_data) { - size_t col_num = accessor->GetTableInfo(UPDATE_SIZE) / sizeof(float); + size_t col_num = accessor->GetAccessorInfo().update_dim; float *merge_data_shell[col_num]; const float *another_data_shell[col_num]; for (int i = 0; i < col_num; ++i) { merge_data_shell[i] = merge_data + i; another_data_shell[i] = another_data + i; } - accessor->merge(merge_data_shell, another_data_shell, 1); + 
accessor->Merge(merge_data_shell, another_data_shell, 1); } -int BrpcPsClient::push_sparse_async_shard_merge( +int BrpcPsClient::PushSparseAsyncShardMerge( std::vector> &task_list, std::vector &request_kv_num, int table_id, int shard_idx, ValueAccessor *accessor) { size_t merged_kv_count = 0; uint64_t min_key = UINT64_MAX; - uint32_t value_size = accessor->GetTableInfo(UPDATE_SIZE); + uint32_t value_size = accessor->GetAccessorInfo().update_size; thread_local std::vector> sorted_kv_list; sorted_kv_list.clear(); @@ -1613,12 +1551,12 @@ int BrpcPsClient::push_sparse_async_shard_merge( return 0; } -int BrpcPsClient::push_sparse_async_shard_push( +int BrpcPsClient::PushSparseAsyncShardPush( std::vector> &task_list, std::vector &request_kv_num, int table_id, int shard_idx, DownpourBrpcClosure *closure, ValueAccessor *accessor) { - push_sparse_async_shard_merge(task_list, request_kv_num, table_id, shard_idx, - accessor); + PushSparseAsyncShardMerge(task_list, request_kv_num, table_id, shard_idx, + accessor); size_t merged_kv_count = task_list[0]->data()->shared_data[shard_idx].kv_num; auto &merged_key_list = task_list[0]->data()->shared_data[shard_idx].key_list; @@ -1633,9 +1571,8 @@ int BrpcPsClient::push_sparse_async_shard_push( push_request->add_params(reinterpret_cast(&merged_kv_count), sizeof(uint32_t)); // NOLINT auto *push_data = push_request->mutable_data(); - int update_size = accessor->GetTableInfo(UPDATE_SIZE); - push_data->resize(merged_kv_count * - (sizeof(uint64_t) + accessor->GetTableInfo(UPDATE_SIZE))); + int update_size = accessor->GetAccessorInfo().update_size; + push_data->resize(merged_kv_count * (sizeof(uint64_t) + update_size)); char *push_data_ptr = const_cast(push_data->data()); memcpy(push_data_ptr, merged_key_list.data(), merged_kv_count * sizeof(uint64_t)); @@ -1644,10 +1581,10 @@ int BrpcPsClient::push_sparse_async_shard_push( const char *task_data_ptr = merged_value_list[i].data(); memcpy(push_data_ptr, (float *)(task_data_ptr), // NOLINT - accessor->GetTableInfo(UPDATE_SIZE)); - push_data_ptr += accessor->GetTableInfo(UPDATE_SIZE); + update_size); + push_data_ptr += update_size; } - PsService_Stub rpc_stub(get_sparse_channel(shard_idx)); + PsService_Stub rpc_stub(GetSparseChannel(shard_idx)); closure->cntl(shard_idx)->set_request_compress_type( (brpc::CompressType)FLAGS_pserver_communicate_compress_type); rpc_stub.service(closure->cntl(shard_idx), closure->request(shard_idx), @@ -1656,18 +1593,18 @@ int BrpcPsClient::push_sparse_async_shard_push( return 0; } -std::future BrpcPsClient::push_dense(const Region *regions, - size_t region_num, - size_t table_id) { - auto *accessor = table_accessor(table_id); - int fea_dim = accessor->GetTableInfo(FEA_DIM); - int update_dim = accessor->GetTableInfo(UPDATE_DIM); +std::future BrpcPsClient::PushDense(const Region *regions, + size_t region_num, + size_t table_id) { + auto *accessor = GetTableAccessor(table_id); + int fea_dim = accessor->GetAccessorInfo().fea_dim; + int update_dim = accessor->GetAccessorInfo().update_dim; auto push_timer = std::make_shared("pserver_client_push_dense"); auto parse_timer = std::make_shared("pserver_client_push_dense_parse"); int push_dense_async_num = _push_dense_task_queue_map[table_id]->Size(); while (push_dense_async_num > FLAGS_pserver_max_async_call_num) { - // LOG(INFO) << "push_dense Waiting for async_call_num comsume, + // LOG(INFO) << "PushDense Waiting for async_call_num comsume, // task_num:" // << push_dense_async_num // << ", max_task_limit:" << FLAGS_pserver_max_async_call_num; @@ 
-1679,13 +1616,9 @@ std::future BrpcPsClient::push_dense(const Region *regions, auto dense_data = std::make_shared>(); auto async_task = new DenseAsyncTask(dense_data, table_id, push_timer); size_t request_call_num = _server_channels.size(); - - uint32_t num_per_shard = - dense_dim_per_shard(accessor->GetTableInfo(FEA_DIM), request_call_num); - + uint32_t num_per_shard = DenseDimPerShard(fea_dim, request_call_num); // 将region数据拷贝到转置矩阵中 - async_task->data()->resize(num_per_shard * request_call_num * - accessor->GetTableInfo(UPDATE_DIM)); + async_task->data()->resize(num_per_shard * request_call_num * update_dim); float *data = async_task->data()->data(); size_t data_size = async_task->data()->size(); uint32_t pos = 0; @@ -1703,7 +1636,7 @@ std::future BrpcPsClient::push_dense(const Region *regions, return fut; } -void BrpcPsClient::push_dense_task_consume() { +void BrpcPsClient::PushDenseTaskConsume() { uint64_t merge_size = FLAGS_pserver_push_dense_merge_limit; static bool scale_gradient = FLAGS_pserver_scale_gradient_by_merge; ::ThreadPool async_merge_dense_threads(10); @@ -1721,7 +1654,7 @@ void BrpcPsClient::push_dense_task_consume() { ++_async_call_num; DenseAsyncTask *task; task_queue->Get(task); - auto *accessor = table_accessor(task->table_id()); + auto *accessor = GetTableAccessor(task->table_id()); // 设置请求回调 size_t request_call_num = _server_channels.size(); @@ -1757,7 +1690,7 @@ void BrpcPsClient::push_dense_task_consume() { async_task]() -> int { auto &tmp_task_vec = *(async_task->data()); const float *merge_data = tmp_task_vec.data(); - accessor->merge(&total_send_data, &merge_data, + accessor->Merge(&total_send_data, &merge_data, total_send_data_size); #pragma optimize("", off) auto *debug_closure = closure; @@ -1772,7 +1705,7 @@ void BrpcPsClient::push_dense_task_consume() { merge_status[i].wait(); } - VLOG(3) << "BrpcPsClient::push_dense_task_consume before merge " + VLOG(3) << "BrpcPsClient::PushDenseTaskConsume before merge " "total_send_data[0]" << total_send_data[0] << " total_send_data[-2]" << total_send_data[total_send_data_size - 2] @@ -1785,7 +1718,7 @@ void BrpcPsClient::push_dense_task_consume() { mat *= (1.0 / (merge_count + 1)); } - VLOG(3) << "BrpcPsClient::push_dense_task_consume after merge " + VLOG(3) << "BrpcPsClient::PushDenseTaskConsume after merge " "total_send_data[0]" << total_send_data[0] << " total_send_data[-2]" << total_send_data[total_send_data_size - 2] @@ -1794,8 +1727,8 @@ void BrpcPsClient::push_dense_task_consume() { << merge_count; } std::shared_ptr task_ptr(task); - push_dense_raw_gradient(task_ptr, total_send_data, total_send_data_size, - closure); + PushDenseRawGradient(task_ptr, total_send_data, total_send_data_size, + closure); } auto wait_ms = FLAGS_pserver_async_push_dense_interval_ms - (butil::gettimeofday_ms() - async_start_time_ms); @@ -1805,16 +1738,17 @@ void BrpcPsClient::push_dense_task_consume() { } } -void BrpcPsClient::push_dense_raw_gradient( - std::shared_ptr &task, float *total_send_data, - size_t total_send_data_size, DownpourBrpcClosure *closure) { - auto *accessor = table_accessor(task->table_id()); +void BrpcPsClient::PushDenseRawGradient(std::shared_ptr &task, + float *total_send_data, + size_t total_send_data_size, + DownpourBrpcClosure *closure) { + auto *accessor = GetTableAccessor(task->table_id()); size_t request_call_num = _server_channels.size(); // 将数据拷贝到请求buffer区 auto timer = std::make_shared("pserver_client_push_dense_rpc"); closure->add_timer(timer); uint32_t num_per_shard = - 
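// Hedged illustration (not part of the patch): in PushDenseTaskConsume above,
// queued dense-gradient tasks are folded element-wise into the first task's
// buffer (accessor->Merge) and, when gradients are scaled by merge, the merged
// buffer is divided by merge_count + 1 before PushDenseRawGradient sends it.
// A plain-vector version of that arithmetic:
#include <iostream>
#include <vector>

int main() {
  // First queued task plus two later tasks merged into it.
  std::vector<float> total_send_data = {1.0f, 2.0f, 3.0f};
  std::vector<std::vector<float>> pending = {{0.5f, 0.5f, 0.5f},
                                             {1.5f, 1.5f, 1.5f}};
  const bool scale_gradient = true;  // FLAGS_pserver_scale_gradient_by_merge
  size_t merge_count = 0;

  for (const auto &task : pending) {  // stands in for accessor->Merge(...)
    for (size_t i = 0; i < total_send_data.size(); ++i) {
      total_send_data[i] += task[i];
    }
    ++merge_count;
  }
  if (scale_gradient && merge_count > 0) {  // mat *= 1.0 / (merge_count + 1)
    for (float &v : total_send_data) v *= 1.0f / (merge_count + 1);
  }

  // (1 + 0.5 + 1.5) / 3 = 1, (2 + 0.5 + 1.5) / 3 ~= 1.33, (3 + 0.5 + 1.5) / 3 ~= 1.67
  for (float v : total_send_data) std::cout << v << " ";
  std::cout << "\n";
  return 0;
}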
dense_dim_per_shard(accessor->GetTableInfo(FEA_DIM), request_call_num); + DenseDimPerShard(accessor->GetAccessorInfo().fea_dim, request_call_num); auto send_timer = std::make_shared("pserver_client_push_dense_send"); for (size_t i = 0; i < request_call_num; ++i) { @@ -1830,7 +1764,7 @@ void BrpcPsClient::push_dense_raw_gradient( total_send_data + i * num_per_shard, num_per_shard * sizeof(float)); closure->cntl(i)->set_request_compress_type( (brpc::CompressType)FLAGS_pserver_communicate_compress_type); - PsService_Stub rpc_stub(get_dense_channel(i)); + PsService_Stub rpc_stub(GetDenseChannel(i)); rpc_stub.service(closure->cntl(i), closure->request(i), closure->response(i), closure); } diff --git a/paddle/fluid/distributed/ps/service/brpc_ps_client.h b/paddle/fluid/distributed/ps/service/brpc_ps_client.h index 8b0cb0741b400..f109b473ca1f4 100644 --- a/paddle/fluid/distributed/ps/service/brpc_ps_client.h +++ b/paddle/fluid/distributed/ps/service/brpc_ps_client.h @@ -50,7 +50,7 @@ class DownpourPsClientService : public PsService { DownpourPsClientService() {} virtual ~DownpourPsClientService() {} - virtual int32_t configure(PSClient *client, size_t rank_id) { + virtual int32_t Configure(PSClient *client, size_t rank_id) { _client = client; _rank = rank_id; return 0; @@ -139,7 +139,7 @@ class BrpcPsClient : public PSClient { BrpcPsClient() {} virtual ~BrpcPsClient() { if (_running) { - flush(); + Flush(); _running = false; } if (_async_push_dense_thread.joinable()) { @@ -154,109 +154,98 @@ class BrpcPsClient : public PSClient { _server_started = false; } } - virtual int32_t create_client2client_connection( - int pserver_timeout_ms, int pserver_connect_timeout_ms, int max_retry); - std::future shrink(uint32_t table_id, + virtual int32_t CreateClient2ClientConnection(int pserver_timeout_ms, + int pserver_connect_timeout_ms, + int max_retry); + std::future Shrink(uint32_t table_id, const std::string threshold) override; - std::future load(const std::string &epoch, + std::future Load(const std::string &epoch, const std::string &mode) override; - std::future load(uint32_t table_id, const std::string &epoch, + std::future Load(uint32_t table_id, const std::string &epoch, const std::string &mode) override; - std::future Load(const LoadSaveContext &load_context) override; - - std::future save(const std::string &epoch, + std::future Save(const std::string &epoch, const std::string &mode) override; - std::future save(uint32_t table_id, const std::string &epoch, + std::future Save(uint32_t table_id, const std::string &epoch, const std::string &mode) override; - virtual std::future Save( - const LoadSaveContext &save_context) override; - - std::future clear() override; - - std::future clear(uint32_t table_id) override; + std::future Clear() override; - std::future stop_server() override; + std::future Clear(uint32_t table_id) override; - std::future start_profiler() override; - std::future stop_profiler() override; + std::future StopServer() override; - void finalize_worker() override; + std::future StartProfiler() override; + std::future StopProfiler() override; - virtual std::future pull_dense(Region *regions, size_t region_num, - size_t table_id); + void FinalizeWorker() override; - virtual std::future push_dense_param(const Region *regions, - size_t region_num, - size_t table_id); + virtual std::future PullDense(Region *regions, size_t region_num, + size_t table_id); - virtual std::future push_dense(const Region *regions, - size_t region_num, size_t table_id); - void push_dense_task_consume(); - 
virtual std::future pull_sparse(float **select_values, - size_t table_id, - const uint64_t *keys, size_t num, - bool is_training); - virtual std::future pull_sparse_param(float **select_values, - size_t table_id, - const uint64_t *keys, - size_t num, bool is_training); + virtual std::future PushDenseParam(const Region *regions, + size_t region_num, + size_t table_id); - virtual std::future Pull(RequestContext &pull_context) override; + virtual std::future PushDense(const Region *regions, + size_t region_num, size_t table_id); + void PushDenseTaskConsume(); + virtual std::future PullSparse(float **select_values, + size_t table_id, const uint64_t *keys, + size_t num, bool is_training); + virtual std::future PullSparseParam(float **select_values, + size_t table_id, + const uint64_t *keys, size_t num, + bool is_training); - virtual std::future Push(RequestContext &push_context) override; + virtual std::future PrintTableStat(uint32_t table_id); - virtual std::future print_table_stat(uint32_t table_id); + virtual std::future Barrier(size_t table_id, uint32_t barrier_type); - virtual std::future barrier(size_t table_id, uint32_t barrier_type); + virtual std::future PullGeoParam(size_t table_id, + std::vector *values, + std::vector *keys, + int pserver_idx); + virtual std::future PushGlobalStep(int table_id, + int64_t *total_send_data, + void *done); + virtual std::future Flush(); - virtual std::future pull_geo_param(size_t table_id, - std::vector *values, - std::vector *keys, - int pserver_idx); - virtual std::future push_global_step(int table_id, - int64_t *total_send_data, - void *done); - virtual std::future flush(); - - std::future send_client2client_msg(int msg_type, int to_client_id, - const std::string &msg) override; + std::future SendClient2ClientMsg(int msg_type, int to_client_id, + const std::string &msg) override; // for local save sparse - virtual int32_t recv_and_save_table(const uint64_t table_id, - const std::string &path); + virtual int32_t RecvAndSaveTable(const uint64_t table_id, + const std::string &path); - void print_queue_size(); - void print_queue_size_thread(); + void PrintQueueSize(); + void PrintQueueSizeThread(); protected: - virtual size_t get_server_nums() { return _server_channels.size(); } - inline brpc::Channel *get_sparse_channel(size_t server_id) { + virtual size_t GetServerNums() { return _server_channels.size(); } + inline brpc::Channel *GetSparseChannel(size_t server_id) { return _server_channels[server_id][0].get(); } - inline brpc::Channel *get_dense_channel(size_t server_id) { + inline brpc::Channel *GetDenseChannel(size_t server_id) { return _server_channels[server_id][1].get(); } - inline brpc::Channel *get_cmd_channel(size_t server_id) { + inline brpc::Channel *GetCmdChannel(size_t server_id) { return _server_channels[server_id][2].get(); } - int32_t initialize() override; + int32_t Initialize() override; private: - // virtual int32_t initialize() override; - - inline uint32_t dense_dim_per_shard(uint32_t dense_dim_total, - uint32_t shard_num) { + inline uint32_t DenseDimPerShard(uint32_t dense_dim_total, + uint32_t shard_num) { return dense_dim_total / shard_num + 1; } - std::future send_cmd(uint32_t table_id, int cmd_id, - const std::vector ¶m); + std::future SendCmd(uint32_t table_id, int cmd_id, + const std::vector ¶m); - std::future send_save_cmd(uint32_t table_id, int cmd_id, - const std::vector ¶m); + std::future SendSaveCmd(uint32_t table_id, int cmd_id, + const std::vector ¶m); bool _running = false; bool _flushing = false; @@ -276,12 +265,12 
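// Hedged illustration (not part of the patch): DenseDimPerShard above keeps the
// old arithmetic, dense_dim_total / shard_num + 1, and the renamed
// PullDense/PushDense* paths size every per-server buffer as
// num_per_shard * (select_size or update_size). A quick worked example; the
// concrete numbers are made up for illustration only.
#include <cstdint>
#include <cstdio>

static uint32_t DenseDimPerShardSketch(uint32_t dense_dim_total,
                                       uint32_t shard_num) {
  return dense_dim_total / shard_num + 1;  // same formula as the header hunk
}

int main() {
  const uint32_t fea_dim = 1000;        // accessor->GetAccessorInfo().fea_dim
  const uint32_t request_call_num = 3;  // number of pserver shards
  const uint32_t num_per_shard = DenseDimPerShardSketch(fea_dim, request_call_num);
  const size_t update_size = sizeof(float);  // assumed one float per dim

  // 1000 / 3 + 1 = 334 dims per shard, i.e. 3 * 334 = 1002 slots >= fea_dim,
  // so the last shard carries a little padding.
  std::printf("num_per_shard = %u, bytes per shard = %zu\n", num_per_shard,
              num_per_shard * update_size);
  return 0;
}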
@@ class BrpcPsClient : public PSClient { std::thread _print_thread; - int push_sparse_async_shard_merge( + int PushSparseAsyncShardMerge( std::vector> &task_list, // NOLINT std::vector &request_kv_num, int table_id, int shard_idx, // NOLINT ValueAccessor *accessor); - int push_sparse_async_shard_push( + int PushSparseAsyncShardPush( std::vector> &task_list, // NOLINT std::vector &request_kv_num, int table_id, int shard_idx, // NOLINT DownpourBrpcClosure *closure, ValueAccessor *accessor); @@ -292,36 +281,36 @@ class BrpcPsClient : public PSClient { _client_channels; // client2client std::vector, 3>> _server_channels; // client2server - std::future push_dense_raw_gradient(int table_id, - float *total_send_data, - size_t total_send_data_size, - void *done) override; - - std::future push_sparse_raw_gradient(size_t table_id, - const uint64_t *keys, - const float **update_values, - size_t num, - void *done) override; - - std::future push_sparse_raw_gradient_partial( - size_t table_id, const uint64_t *keys, const float **update_values, - uint32_t num, void *done, int pserver_idx) override; - - std::future push_sparse_param(size_t table_id, const uint64_t *keys, - const float **update_values, - size_t num, void *done) override; - std::future push_sparse(size_t table_id, const uint64_t *keys, - const float **update_values, - size_t num) override; - void push_sparse_task_consume(); + std::future PushDenseRawGradient(int table_id, + float *total_send_data, + size_t total_send_data_size, + void *done) override; + + std::future PushSparseRawGradient(size_t table_id, + const uint64_t *keys, + const float **update_values, + size_t num, void *done) override; + + std::future PushSparseRawGradientPartial(size_t table_id, + const uint64_t *keys, + const float **update_values, + uint32_t num, void *done, + int pserver_idx) override; + + std::future PushSparseParam(size_t table_id, const uint64_t *keys, + const float **update_values, size_t num, + void *done) override; + std::future PushSparse(size_t table_id, const uint64_t *keys, + const float **update_values, + size_t num) override; + void PushSparseTaskConsume(); private: - int32_t start_client_service(); + int32_t StartClientService(); - void push_dense_raw_gradient(std::shared_ptr &task, // NOLINT - float *total_send_data, - size_t total_send_data_size, - DownpourBrpcClosure *closure); + void PushDenseRawGradient(std::shared_ptr &task, // NOLINT + float *total_send_data, size_t total_send_data_size, + DownpourBrpcClosure *closure); float _mae = 0; float _mse = 0; uint16_t _push_times = 0; diff --git a/paddle/fluid/distributed/ps/service/brpc_ps_server.cc b/paddle/fluid/distributed/ps/service/brpc_ps_server.cc index 0d7624baec580..d22cca91f7816 100644 --- a/paddle/fluid/distributed/ps/service/brpc_ps_server.cc +++ b/paddle/fluid/distributed/ps/service/brpc_ps_server.cc @@ -31,7 +31,7 @@ class RpcController; namespace paddle { namespace distributed { -int32_t BrpcPsServer::initialize() { +int32_t BrpcPsServer::Initialize() { auto &service_config = _config.downpour_server_param().service_param(); if (!service_config.has_service_class()) { LOG(ERROR) << "miss service_class in ServerServiceParameter"; @@ -46,7 +46,7 @@ int32_t BrpcPsServer::initialize() { } _service.reset(service); - if (service->configure(this) != 0 || service->initialize() != 0) { + if (service->Configure(this) != 0 || service->Initialize() != 0) { LOG(ERROR) << "service initialize failed, service_name:" << service_config.service_class(); return -1; @@ -59,7 +59,7 @@ int32_t 
BrpcPsServer::initialize() { return 0; } -uint64_t BrpcPsServer::start(const std::string &ip, uint32_t port) { +uint64_t BrpcPsServer::Start(const std::string &ip, uint32_t port) { std::unique_lock lock(mutex_); std::string ip_port = ip + ":" + std::to_string(port); @@ -68,7 +68,7 @@ uint64_t BrpcPsServer::start(const std::string &ip, uint32_t port) { brpc::ServerOptions options; int num_threads = std::thread::hardware_concurrency(); - auto trainers = _environment->get_trainers(); + auto trainers = _environment->GetTrainers(); options.num_threads = trainers > num_threads ? trainers : num_threads; if (_server.Start(ip_port.c_str(), &options) != 0) { @@ -83,7 +83,7 @@ uint64_t BrpcPsServer::start(const std::string &ip, uint32_t port) { } } - _environment->registe_ps_server(ip, port, _rank); + _environment->RegistePsServer(ip, port, _rank); cv_.wait(lock, [&] { return stoped_; }); PSHost host; @@ -93,31 +93,30 @@ uint64_t BrpcPsServer::start(const std::string &ip, uint32_t port) { return host.rank; } -int32_t BrpcPsServer::port() { return _server.listen_address().port; } +int32_t BrpcPsServer::Port() { return _server.listen_address().port; } -int32_t BrpcPsService::initialize() { +int32_t BrpcPsService::Initialize() { _is_initialize_shard_info = false; - _service_handler_map[PS_STOP_SERVER] = &BrpcPsService::stop_server; - _service_handler_map[PS_PULL_DENSE_TABLE] = &BrpcPsService::pull_dense; - _service_handler_map[PS_PUSH_DENSE_TABLE] = &BrpcPsService::push_dense; - _service_handler_map[PS_PULL_SPARSE_TABLE] = &BrpcPsService::pull_sparse; - _service_handler_map[PS_PUSH_SPARSE_TABLE] = &BrpcPsService::push_sparse; - _service_handler_map[PS_SAVE_ONE_TABLE] = &BrpcPsService::save_one_table; - _service_handler_map[PS_SAVE_ALL_TABLE] = &BrpcPsService::save_all_table; - _service_handler_map[PS_SHRINK_TABLE] = &BrpcPsService::shrink_table; - _service_handler_map[PS_LOAD_ONE_TABLE] = &BrpcPsService::load_one_table; - _service_handler_map[PS_LOAD_ALL_TABLE] = &BrpcPsService::load_all_table; - _service_handler_map[PS_CLEAR_ONE_TABLE] = &BrpcPsService::clear_one_table; - _service_handler_map[PS_CLEAR_ALL_TABLE] = &BrpcPsService::clear_all_table; - _service_handler_map[PS_PUSH_DENSE_PARAM] = &BrpcPsService::push_dense_param; - _service_handler_map[PS_PRINT_TABLE_STAT] = &BrpcPsService::print_table_stat; - _service_handler_map[PS_PULL_GEO_PARAM] = &BrpcPsService::pull_geo_param; - _service_handler_map[PS_PUSH_SPARSE_PARAM] = - &BrpcPsService::push_sparse_param; - _service_handler_map[PS_BARRIER] = &BrpcPsService::barrier; - _service_handler_map[PS_START_PROFILER] = &BrpcPsService::start_profiler; - _service_handler_map[PS_STOP_PROFILER] = &BrpcPsService::stop_profiler; - _service_handler_map[PS_PUSH_GLOBAL_STEP] = &BrpcPsService::push_global_step; + _service_handler_map[PS_STOP_SERVER] = &BrpcPsService::StopServer; + _service_handler_map[PS_PULL_DENSE_TABLE] = &BrpcPsService::PullDense; + _service_handler_map[PS_PUSH_DENSE_TABLE] = &BrpcPsService::PushDense; + _service_handler_map[PS_PULL_SPARSE_TABLE] = &BrpcPsService::PullSparse; + _service_handler_map[PS_PUSH_SPARSE_TABLE] = &BrpcPsService::PushSparse; + _service_handler_map[PS_SAVE_ONE_TABLE] = &BrpcPsService::SaveOneTable; + _service_handler_map[PS_SAVE_ALL_TABLE] = &BrpcPsService::SaveAllTable; + _service_handler_map[PS_SHRINK_TABLE] = &BrpcPsService::ShrinkTable; + _service_handler_map[PS_LOAD_ONE_TABLE] = &BrpcPsService::LoadOneTable; + _service_handler_map[PS_LOAD_ALL_TABLE] = &BrpcPsService::LoadAllTable; + 
_service_handler_map[PS_CLEAR_ONE_TABLE] = &BrpcPsService::ClearOneTable; + _service_handler_map[PS_CLEAR_ALL_TABLE] = &BrpcPsService::ClearAllTable; + _service_handler_map[PS_PUSH_DENSE_PARAM] = &BrpcPsService::PushDenseParam; + _service_handler_map[PS_PRINT_TABLE_STAT] = &BrpcPsService::PrintTableStat; + _service_handler_map[PS_PULL_GEO_PARAM] = &BrpcPsService::PullGeoParam; + _service_handler_map[PS_PUSH_SPARSE_PARAM] = &BrpcPsService::PushSparseParam; + _service_handler_map[PS_BARRIER] = &BrpcPsService::Barrier; + _service_handler_map[PS_START_PROFILER] = &BrpcPsService::StartProfiler; + _service_handler_map[PS_STOP_PROFILER] = &BrpcPsService::StopProfiler; + _service_handler_map[PS_PUSH_GLOBAL_STEP] = &BrpcPsService::PushGlobalStep; auto &profiler = CostProfiler::instance(); profiler.register_profiler("pserver_server_pull_dense"); profiler.register_profiler("pserver_server_push_dense"); @@ -125,7 +124,7 @@ int32_t BrpcPsService::initialize() { profiler.register_profiler("pserver_server_push_sparse"); // shard初始化,server启动后才可从env获取到server_list的shard信息 - initialize_shard_info(); + InitializeShardInfo(); return 0; } @@ -138,16 +137,16 @@ int32_t BrpcPsService::initialize() { return -1; \ } -int32_t BrpcPsService::initialize_shard_info() { +int32_t BrpcPsService::InitializeShardInfo() { if (!_is_initialize_shard_info) { std::lock_guard guard(_initialize_shard_mutex); if (_is_initialize_shard_info) { return 0; } - size_t shard_num = _server->environment()->get_ps_servers().size(); - auto &table_map = *(_server->table()); + size_t shard_num = _server->Environment()->GetPsServers().size(); + auto &table_map = *(_server->GetTable()); for (auto itr : table_map) { - itr.second->set_shard(_rank, shard_num); + itr.second->SetShard(_rank, shard_num); } _is_initialize_shard_info = true; } @@ -167,7 +166,7 @@ void BrpcPsService::service(google::protobuf::RpcController *cntl_base, response->set_err_code(0); response->set_err_msg(""); - auto *table = _server->table(request->table_id()); + auto *table = _server->GetTable(request->table_id()); brpc::Controller *cntl = static_cast(cntl_base); auto itr = _service_handler_map.find(request->cmd_id()); if (itr == _service_handler_map.end()) { @@ -185,11 +184,11 @@ void BrpcPsService::service(google::protobuf::RpcController *cntl_base, } } -int32_t BrpcPsService::pull_dense(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PullDense(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::RecordEvent record_event( - "PsService->pull_dense", platform::TracerEventType::Communication, 1); + "PsService->PullDense", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) if (request.params_size() < 1) { set_response_code( @@ -206,13 +205,15 @@ int32_t BrpcPsService::pull_dense(Table *table, const PsRequestMessage &request, } auto res_data = butil::get_object>(); - res_data->resize(num * table->value_accesor()->select_size() / sizeof(float)); + res_data->resize(num * table->ValueAccesor()->GetAccessorInfo().select_size / + sizeof(float)); + TableContext table_context; table_context.value_type = Dense; table_context.pull_context.values = res_data->data(); table_context.num = num; table->Pull(table_context); - // table->pull_dense(res_data->data(), num); + // table->PullDense(res_data->data(), num); cntl->response_attachment().append((char *)(res_data->data()), res_data->size() * sizeof(float)); @@ 
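// Hedged illustration (not part of the patch): BrpcPsService::Initialize above
// only renames the handlers it registers; the dispatch stays a map from cmd_id
// to a member-function pointer that service() looks up per request. A
// self-contained miniature of that table-driven dispatch, with made-up cmd ids:
#include <cstdint>
#include <iostream>
#include <unordered_map>

class MiniService {
 public:
  using Handler = int32_t (MiniService::*)();
  MiniService() {
    // Mirrors _service_handler_map[PS_PULL_DENSE_TABLE] = &BrpcPsService::PullDense;
    handlers_[0] = &MiniService::PullDense;
    handlers_[1] = &MiniService::PushDense;
  }
  int32_t Dispatch(int cmd_id) {
    auto itr = handlers_.find(cmd_id);
    if (itr == handlers_.end()) {
      std::cout << "cmd " << cmd_id << " not supported\n";
      return -1;
    }
    return (this->*(itr->second))();
  }

 private:
  int32_t PullDense() { std::cout << "PullDense handled\n"; return 0; }
  int32_t PushDense() { std::cout << "PushDense handled\n"; return 0; }
  std::unordered_map<int, Handler> handlers_;
};

int main() {
  MiniService service;
  service.Dispatch(0);
  service.Dispatch(1);
  service.Dispatch(99);  // unknown cmd_id -> error path
  return 0;
}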
-221,13 +222,12 @@ int32_t BrpcPsService::pull_dense(Table *table, const PsRequestMessage &request, return 0; } -int32_t BrpcPsService::push_dense_param(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { - platform::RecordEvent record_event("PsService->push_dense_param", - platform::TracerEventType::Communication, - 1); +int32_t BrpcPsService::PushDenseParam(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { + platform::RecordEvent record_event( + "PsService->PushDenseParam", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) thread_local std::string push_buffer; auto &req_io_buffer = cntl->request_attachment(); @@ -244,17 +244,24 @@ int32_t BrpcPsService::push_dense_param(Table *table, uint32_t num = *(const uint32_t *)data; const float *values = (const float *)(data + sizeof(uint32_t)); - if (table->push_dense_param(values, num) != 0) { - set_response_code(response, -1, "push_dense_param failed"); + TableContext table_context; + table_context.value_type = Dense; + table_context.push_context.values = values; + table_context.push_context.is_param = true; + table_context.num = num; + + // if (table->PushDenseParam(values, num) != 0) { + if (table->Push(table_context) != 0) { + set_response_code(response, -1, "PushDenseParam failed"); } return 0; } -int32_t BrpcPsService::push_dense(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PushDense(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::RecordEvent record_event( - "PsService->push_dense", platform::TracerEventType::Communication, 1); + "PsService->PushDense", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) auto req_buffer_size = request.data().size(); if (req_buffer_size < 1) { @@ -277,14 +284,14 @@ int32_t BrpcPsService::push_dense(Table *table, const PsRequestMessage &request, // const float *values = (const float *)(request.data().data() + // sizeof(uint32_t)); if (table->Push(table_context) != 0) { - // if (table->push_dense(values, num) != 0) { - set_response_code(response, -1, "push_dense failed"); + // if (table->PushDense(values, num) != 0) { + set_response_code(response, -1, "PushDense failed"); } return 0; } -int32_t BrpcPsService::barrier(Table *table, const PsRequestMessage &request, +int32_t BrpcPsService::Barrier(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) @@ -298,15 +305,15 @@ int32_t BrpcPsService::barrier(Table *table, const PsRequestMessage &request, auto trainer_id = request.client_id(); auto barrier_type = request.params(0); - table->barrier(trainer_id, barrier_type); + table->Barrier(trainer_id, barrier_type); return 0; } -int32_t BrpcPsService::push_sparse_param(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { - platform::RecordEvent record_event("PsService->push_sparse_param", +int32_t BrpcPsService::PushSparseParam(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { + platform::RecordEvent record_event("PsService->PushSparseParam", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) @@ -330,16 +337,24 @@ int32_t 
BrpcPsService::push_sparse_param(Table *table, const uint64_t *keys = (const uint64_t *)push_data.data(); const float *values = (const float *)(push_data.data() + sizeof(uint64_t) * num); - if (table->push_sparse_param(keys, values, num) != 0) { - set_response_code(response, -1, "push_sparse_param error"); + + TableContext table_context; + table_context.value_type = Sparse; + table_context.push_context.keys = keys; + table_context.push_context.values = values; + table_context.push_context.is_param = true; + table_context.num = num; + // if (table->PushSparseParam(keys, values, num) != 0) { + if (table->Push(table_context) != 0) { + set_response_code(response, -1, "PushSparseParam error"); } return 0; } -int32_t BrpcPsService::pull_geo_param(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PullGeoParam(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::RecordEvent record_event( "PsService->pull_geo_param", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) @@ -349,7 +364,14 @@ int32_t BrpcPsService::pull_geo_param(Table *table, std::vector values; std::vector ids; - table->pull_geo_param(trainer_id, &values, &ids); + + TableContext table_context; + table_context.value_type = Sparse; + table_context.pull_context.geo_pull_keys = &ids; + table_context.pull_context.geo_pull_values = &values; + table_context.trainer_id = trainer_id; + table->Pull(table_context); + // table->PullGeoParam(trainer_id, &values, &ids); uint32_t num = ids.size(); cntl->response_attachment().append((char *)(&num), sizeof(uint32_t)); @@ -360,12 +382,11 @@ int32_t BrpcPsService::pull_geo_param(Table *table, return 0; } -int32_t BrpcPsService::pull_sparse(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PullSparse(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::RecordEvent record_event( - "PsService->pull_sparse", platform::TracerEventType::Communication, 1); + "PsService->PullSparse", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) auto &req_io_buffer = cntl->request_attachment(); @@ -385,7 +406,7 @@ int32_t BrpcPsService::pull_sparse(Table *table, CostTimer timer("pserver_server_pull_sparse"); uint32_t num = *(uint32_t *)(request.params(0).c_str()); - auto dim = table->value_accesor()->select_dim(); + auto dim = table->ValueAccesor()->GetAccessorInfo().select_dim; thread_local std::string req_buffer; req_buffer.reserve(req_buffer_size); @@ -404,7 +425,7 @@ int32_t BrpcPsService::pull_sparse(Table *table, table_context.pull_context.pull_value = value; table_context.pull_context.values = res_data->data(); table->Pull(table_context); - // table->pull_sparse(res_data->data(), value); + // table->PullSparse(res_data->data(), value); cntl->response_attachment().append((char *)(res_data->data()), res_data->size() * sizeof(float)); @@ -412,12 +433,11 @@ int32_t BrpcPsService::pull_sparse(Table *table, return 0; } -int32_t BrpcPsService::push_sparse(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PushSparse(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::RecordEvent record_event( - 
"PsService->push_sparse", platform::TracerEventType::Communication, 1); + "PsService->PushSparse", platform::TracerEventType::Communication, 1); CHECK_TABLE_EXIST(table, request, response) auto &push_data = request.data(); if (push_data.size() < 1) { @@ -447,18 +467,18 @@ int32_t BrpcPsService::push_sparse(Table *table, // const float *values = (const float *)(push_data.data() + sizeof(uint64_t) * // num); if (table->Push(table_context) != 0) { - // if (table->push_sparse(keys, values, num) != 0) { - set_response_code(response, -1, "push_sparse error"); + // if (table->PushSparse(keys, values, num) != 0) { + set_response_code(response, -1, "PushSparse error"); } return 0; } -int32_t BrpcPsService::print_table_stat(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PrintTableStat(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) - std::pair ret = table->print_table_stat(); + std::pair ret = table->PrintTableStat(); paddle::framework::BinaryArchive ar; ar << ret.first << ret.second; std::string table_info(ar.Buffer(), ar.Length()); @@ -467,10 +487,10 @@ int32_t BrpcPsService::print_table_stat(Table *table, return 0; } -int32_t BrpcPsService::load_one_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::LoadOneTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) if (request.params_size() < 2) { set_response_code( @@ -478,20 +498,20 @@ int32_t BrpcPsService::load_one_table(Table *table, "PsRequestMessage.datas is requeired at least 2 for path & load_param"); return -1; } - if (table->load(request.params(0), request.params(1)) != 0) { + if (table->Load(request.params(0), request.params(1)) != 0) { set_response_code(response, -1, "table load failed"); return -1; } return 0; } -int32_t BrpcPsService::load_all_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { - auto &table_map = *(_server->table()); +int32_t BrpcPsService::LoadAllTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { + auto &table_map = *(_server->GetTable()); for (auto &itr : table_map) { - if (load_one_table(itr.second.get(), request, response, cntl) != 0) { + if (LoadOneTable(itr.second.get(), request, response, cntl) != 0) { LOG(ERROR) << "load table[" << itr.first << "] failed"; return -1; } @@ -499,10 +519,10 @@ int32_t BrpcPsService::load_all_table(Table *table, return 0; } -int32_t BrpcPsService::save_one_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::SaveOneTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) if (request.params_size() < 2) { set_response_code( @@ -510,12 +530,12 @@ int32_t BrpcPsService::save_one_table(Table *table, "PsRequestMessage.datas is requeired at least 2, path&mode"); return -1; } - table->flush(); + table->Flush(); int32_t feasign_size = 0; VLOG(3) << "save table " << request.params(0) << " " << request.params(1); - feasign_size = table->save(request.params(0), request.params(1)); + feasign_size = 
table->Save(request.params(0), request.params(1)); if (feasign_size < 0) { set_response_code(response, -1, "table save failed"); return -1; @@ -523,16 +543,16 @@ int32_t BrpcPsService::save_one_table(Table *table, return feasign_size; } -int32_t BrpcPsService::save_all_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { - auto &table_map = *(_server->table()); +int32_t BrpcPsService::SaveAllTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { + auto &table_map = *(_server->GetTable()); int32_t all_feasign_size = 0; int32_t feasign_size = 0; for (auto &itr : table_map) { - feasign_size = save_one_table(itr.second.get(), request, response, cntl); + feasign_size = SaveOneTable(itr.second.get(), request, response, cntl); if (feasign_size < 0) { LOG(ERROR) << "save table[" << itr.first << "] failed"; return -1; @@ -541,10 +561,10 @@ int32_t BrpcPsService::save_all_table(Table *table, return 0; } -int32_t BrpcPsService::shrink_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::ShrinkTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) if (request.params_size() < 1) { set_response_code( @@ -552,8 +572,8 @@ int32_t BrpcPsService::shrink_table(Table *table, "PsRequestMessage.datas is requeired at least 1, threshold"); return -1; } - table->flush(); - if (table->shrink(request.params(0)) != 0) { + table->Flush(); + if (table->Shrink(request.params(0)) != 0) { set_response_code(response, -1, "table shrink failed"); return -1; } @@ -561,63 +581,62 @@ int32_t BrpcPsService::shrink_table(Table *table, return 0; } -int32_t BrpcPsService::clear_one_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::ClearOneTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) - table->flush(); - table->clear(); + table->Flush(); + table->Clear(); return 0; } -int32_t BrpcPsService::clear_all_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { - auto &table_map = *(_server->table()); +int32_t BrpcPsService::ClearAllTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { + auto &table_map = *(_server->GetTable()); for (auto &itr : table_map) { - if (clear_one_table(itr.second.get(), request, response, cntl) != 0) { + if (ClearOneTable(itr.second.get(), request, response, cntl) != 0) { return -1; } } return 0; } -int32_t BrpcPsService::stop_server(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::StopServer(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { auto *p_server = _server; std::thread t_stop([p_server]() { - p_server->stop(); + p_server->Stop(); VLOG(3) << "Server Stoped"; }); t_stop.detach(); return 0; } -int32_t BrpcPsService::stop_profiler(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::StopProfiler(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + 
brpc::Controller *cntl) { platform::DisableProfiler(platform::EventSortingKey::kDefault, string::Sprintf("server_%s_profile", _rank)); return 0; } -int32_t BrpcPsService::start_profiler(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::StartProfiler(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::EnableProfiler(platform::ProfilerState::kCPU); return 0; } -int32_t BrpcPsService::push_global_step(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t BrpcPsService::PushGlobalStep(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response); auto req_buffer_size = request.data().size(); if (req_buffer_size < 1) { @@ -628,7 +647,13 @@ int32_t BrpcPsService::push_global_step(Table *table, const int64_t *values = (const int64_t *)(request.data().data() + sizeof(uint32_t)); auto trainer_id = request.client_id(); - if (table->push_dense(values, trainer_id) != 0) { + + TableContext context; + context.trainer_id = trainer_id; + context.push_context.push_steps = values; + + // if (table->PushDense(values, trainer_id) != 0) { + if (table->Push(context) != 0) { set_response_code(response, -1, "run_program failed"); } diff --git a/paddle/fluid/distributed/ps/service/brpc_ps_server.h b/paddle/fluid/distributed/ps/service/brpc_ps_server.h index d81a3a5df07f1..250f465d84253 100644 --- a/paddle/fluid/distributed/ps/service/brpc_ps_server.h +++ b/paddle/fluid/distributed/ps/service/brpc_ps_server.h @@ -41,8 +41,8 @@ class BrpcPsServer : public PSServer { public: BrpcPsServer() {} virtual ~BrpcPsServer() {} - virtual uint64_t start(const std::string &ip, uint32_t port); - virtual int32_t stop() { + virtual uint64_t Start(const std::string &ip, uint32_t port); + virtual int32_t Stop() { std::unique_lock lock(mutex_); stoped_ = true; cv_.notify_all(); @@ -51,10 +51,10 @@ class BrpcPsServer : public PSServer { _server.Join(); return 0; } - int32_t port(); + int32_t Port(); private: - virtual int32_t initialize(); + virtual int32_t Initialize(); mutable std::mutex mutex_; std::condition_variable cv_; bool stoped_ = false; @@ -71,7 +71,7 @@ typedef int32_t (BrpcPsService::*serviceHandlerFunc)( class BrpcPsService : public PsBaseService { public: - virtual int32_t initialize() override; + virtual int32_t Initialize() override; virtual void service(::google::protobuf::RpcController *controller, const PsRequestMessage *request, @@ -79,50 +79,49 @@ class BrpcPsService : public PsBaseService { ::google::protobuf::Closure *done) override; private: - int32_t initialize_shard_info(); - int32_t pull_dense(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t push_dense(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t push_dense_param(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t push_sparse_param(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl); - int32_t pull_sparse(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t pull_geo_param(Table *table, const PsRequestMessage &request, + int32_t InitializeShardInfo(); + int32_t 
PullDense(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t PushDense(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t PushDenseParam(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); - int32_t barrier(Table *table, const PsRequestMessage &request, + int32_t PushSparseParam(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t PullSparse(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t PullGeoParam(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t Barrier(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); - int32_t push_sparse(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t load_one_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t load_all_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t save_one_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t save_all_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t shrink_table(Table *table, const PsRequestMessage &request, + int32_t PushSparse(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t LoadOneTable(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); - int32_t clear_one_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t clear_all_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t stop_server(Table *table, const PsRequestMessage &request, + int32_t LoadAllTable(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t SaveOneTable(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t SaveAllTable(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t ShrinkTable(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); - int32_t start_profiler(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t stop_profiler(Table *table, const PsRequestMessage &request, + int32_t ClearOneTable(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); + int32_t ClearAllTable(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t StopServer(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t StartProfiler(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t StopProfiler(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); - int32_t 
print_table_stat(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); + int32_t PrintTableStat(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); - int32_t push_global_step(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); + int32_t PushGlobalStep(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); bool _is_initialize_shard_info; std::mutex _initialize_shard_mutex; diff --git a/paddle/fluid/distributed/ps/service/communicator/communicator.cc b/paddle/fluid/distributed/ps/service/communicator/communicator.cc index 50c34bd319253..c4b833f294e17 100644 --- a/paddle/fluid/distributed/ps/service/communicator/communicator.cc +++ b/paddle/fluid/distributed/ps/service/communicator/communicator.cc @@ -39,7 +39,7 @@ inline double GetCurrentUS() { Communicator::Communicator() {} -void Communicator::init_gflag(const std::string &gflags) { +void Communicator::InitGFlag(const std::string &gflags) { VLOG(3) << "Init With Gflags:" << gflags; std::vector flags = paddle::string::split_string(gflags); if (flags.size() < 1) { @@ -73,7 +73,7 @@ void Communicator::InitBrpcClient( } std::vector Communicator::GetClientInfo() { - std::vector res = _ps_env.get_client_info(); + std::vector res = _ps_env.GetClientInfo(); for (auto rr : res) { VLOG(2) << "Communicator::GetClientInfo " << rr; } @@ -82,7 +82,7 @@ std::vector Communicator::GetClientInfo() { int Communicator::SetClients(std::vector &host_sign_list) { int node = host_sign_list.size(); - return _ps_env.set_ps_clients(host_sign_list.data(), node); + return _ps_env.SetPsClients(host_sign_list.data(), node); } void Communicator::RpcRecvDense(const std::vector &varnames, @@ -114,7 +114,7 @@ void Communicator::RpcRecvDense(const std::vector &varnames, } } auto status = - _worker_ptr->pull_dense(regions.data(), regions.size(), table_id); + _worker_ptr->PullDense(regions.data(), regions.size(), table_id); status.wait(); for (auto &t : varnames) { @@ -177,7 +177,7 @@ void Communicator::RpcSendDenseParam(const std::vector &varnames, } } auto status = - _worker_ptr->push_dense_param(regions.data(), regions.size(), table_id); + _worker_ptr->PushDenseParam(regions.data(), regions.size(), table_id); status.wait(); VLOG(4) << "RPC Send Dense Param " << table_id << " done!"; return; @@ -190,9 +190,9 @@ void Communicator::RpcSendDense(const CommContext &ctx, const Scope &scope) { auto &var_names = ctx.origin_varnames; auto &table_id = ctx.table_id; auto dense_data = std::make_shared>(); - size_t request_call_num = _worker_ptr->get_server_nums(); + size_t request_call_num = _worker_ptr->GetServerNums(); uint32_t num_per_shard = - dense_dim_per_shard(ctx.height_sections[0], request_call_num); + DenseDimPerShard(ctx.height_sections[0], request_call_num); dense_data->resize(num_per_shard * request_call_num); // accessor->update_dim() = 1 float *data = dense_data->data(); @@ -222,8 +222,8 @@ void Communicator::RpcSendDense(const CommContext &ctx, const Scope &scope) { closure->set_promise_value(ret); --_async_call_num; }); - auto status = _worker_ptr->push_dense_raw_gradient( - table_id, data, dense_data->size(), closure); + auto status = _worker_ptr->PushDenseRawGradient(table_id, data, + dense_data->size(), closure); status.wait(); return; } @@ -233,7 +233,7 @@ void Communicator::RpcSendSparseParam(const std::string &varname, int table_id, platform::RecordEvent 
record_event("Communicator->RpcSendSparseParam", platform::TracerEventType::Communication, 1); - size_t request_call_num = _worker_ptr->get_server_nums(); + size_t request_call_num = _worker_ptr->GetServerNums(); std::vector push_g_vec; auto *send_var = scope.FindVar(varname); @@ -260,9 +260,9 @@ void Communicator::RpcSendSparseParam(const std::string &varname, int table_id, } closure->set_promise_value(ret); }); - auto status = _worker_ptr->push_sparse_param( - table_id, sparse_push_keys.data(), (const float **)push_g_vec.data(), - sparse_push_keys.size(), closure); + auto status = _worker_ptr->PushSparseParam(table_id, sparse_push_keys.data(), + (const float **)push_g_vec.data(), + sparse_push_keys.size(), closure); status.wait(); return; } @@ -272,7 +272,7 @@ void Communicator::RpcSendSparse(const std::string &var_name, int table_id, platform::RecordEvent record_event("Communicator->RpcSendSparse", platform::TracerEventType::Communication, 1); - size_t request_call_num = _worker_ptr->get_server_nums(); + size_t request_call_num = _worker_ptr->GetServerNums(); std::vector sparse_push_keys; std::vector push_g_vec; @@ -313,7 +313,7 @@ void Communicator::RpcSendSparse(const std::string &var_name, int table_id, closure->set_promise_value(ret); --_async_call_num; }); - auto status = _worker_ptr->push_sparse_raw_gradient( + auto status = _worker_ptr->PushSparseRawGradient( table_id, sparse_push_keys.data(), (const float **)push_g_vec.data(), sparse_push_keys.size(), closure); status.wait(); @@ -340,7 +340,7 @@ void Communicator::RpcRecvSparse(const std::string &varname, int table_id, bool training = true; - auto status = _worker_ptr->pull_sparse_param( + auto status = _worker_ptr->PullSparseParam( (float **)push_g_vec.data(), table_id, // NOLINT sparse_push_keys.data(), sparse_push_keys.size(), training); status.wait(); @@ -376,11 +376,11 @@ void Communicator::RpcProfilerControl() { if (!do_server_profiler_ && platform::IsProfileEnabled()) { // send profiler start flag do_server_profiler_ = true; - auto start_status = _worker_ptr->start_profiler(); + auto start_status = _worker_ptr->StartProfiler(); start_status.wait(); } else if (do_server_profiler_ && !platform::IsProfileEnabled()) { // send profiler end flag - auto stop_status = _worker_ptr->stop_profiler(); + auto stop_status = _worker_ptr->StopProfiler(); stop_status.wait(); do_server_profiler_ = false; } @@ -396,7 +396,7 @@ void Communicator::SendGlobalStep(const CommContext &ctx, int batches, platform::TracerEventType::Communication, 1); auto &table_id = ctx.table_id; - size_t request_call_num = _worker_ptr->get_server_nums(); + size_t request_call_num = _worker_ptr->GetServerNums(); auto &var_name = STEP_COUNTER; auto *out_var = send_scope->Var(var_name); @@ -416,7 +416,7 @@ void Communicator::SendGlobalStep(const CommContext &ctx, int batches, } closure->set_promise_value(ret); }); - auto status = _worker_ptr->push_global_step(table_id, data, closure); + auto status = _worker_ptr->PushGlobalStep(table_id, data, closure); status.wait(); return; } @@ -605,8 +605,8 @@ void AsyncCommunicator::PullSparseToTensorSync( } } auto status = - _worker_ptr->pull_sparse(pull_result_ptr.data(), table_id, - fea_keys.data(), fea_keys.size(), is_training); + _worker_ptr->PullSparse(pull_result_ptr.data(), table_id, fea_keys.data(), + fea_keys.size(), is_training); status.wait(); auto ret = status.get(); if (ret != 0) { @@ -738,9 +738,9 @@ void AsyncCommunicator::PushSparseFromTensorAsync( this->Check(table_id), true, platform::errors::InvalidArgument( 
"can not find table: %s, please check your config", table_id)); - auto status = _worker_ptr->push_sparse(table_id, push_keys.data(), - (const float **)push_g_vec.data(), - push_keys.size()); + auto status = _worker_ptr->PushSparse(table_id, push_keys.data(), + (const float **)push_g_vec.data(), + push_keys.size()); } void HalfAsyncCommunicator::MainThread() { @@ -813,7 +813,7 @@ void AsyncCommunicator::Stop() { if (!communicator_) { VLOG(0) << "Communicator is not inited, do nothing"; } else { - // _worker_ptr->finalize_worker(); + // _worker_ptr->FinalizeWorker(); VLOG(1) << "client finalize_worker done"; if (recv_thread_) { VLOG(1) << "stop recv thread"; @@ -1327,7 +1327,7 @@ void GeoCommunicator::SendSparse(const std::string &varname, closure->set_promise_value(ret); --_async_call_num; }); - auto status = _worker_ptr->push_sparse_raw_gradient_partial( + auto status = _worker_ptr->PushSparseRawGradientPartial( table_id, (const uint64_t *)sparse_ids.data(), (const float **)push_g_vec.data(), sparse_ids.size(), closure, ep_idx); status.wait(); @@ -1345,7 +1345,7 @@ void GeoCommunicator::RecvSparse(const std::string &varname, int table_id, // 1. recv from pserver std::vector keys; std::vector values; - auto status = _worker_ptr->pull_geo_param(table_id, &values, &keys, ep_idx); + auto status = _worker_ptr->PullGeoParam(table_id, &values, &keys, ep_idx); status.wait(); std::string param = SplitedGradToParam(varname); diff --git a/paddle/fluid/distributed/ps/service/communicator/communicator.h b/paddle/fluid/distributed/ps/service/communicator/communicator.h index da4b46928d55c..75676c392435c 100644 --- a/paddle/fluid/distributed/ps/service/communicator/communicator.h +++ b/paddle/fluid/distributed/ps/service/communicator/communicator.h @@ -299,7 +299,7 @@ class Communicator { virtual void Barrier() {} virtual void BarrierWithTable(uint32_t barrier_type) { - auto rets = _worker_ptr->barrier(barrier_table_id_, barrier_type); + auto rets = _worker_ptr->Barrier(barrier_table_id_, barrier_type); rets.wait(); int status = rets.get(); PADDLE_ENFORCE_EQ(status, 0, @@ -310,7 +310,7 @@ class Communicator { virtual void CreateC2CConnection(int pserver_timeout_ms, int pserver_connect_timeout_ms, int max_retry) { - _worker_ptr->create_client2client_connection( + _worker_ptr->CreateClient2ClientConnection( pserver_timeout_ms, pserver_connect_timeout_ms, max_retry); } @@ -379,12 +379,12 @@ class Communicator { std::unordered_map envs; // 计算每个shard 对 dense的存储量 - inline uint32_t dense_dim_per_shard(uint32_t dense_dim_total, - uint32_t shard_num) { + inline uint32_t DenseDimPerShard(uint32_t dense_dim_total, + uint32_t shard_num) { return dense_dim_total / shard_num + 1; } - void init_gflag(const std::string &gflags); + void InitGFlag(const std::string &gflags); paddle::distributed::PSParameter _ps_param; paddle::distributed::PaddlePSEnvironment _ps_env; int servers_ = 0; diff --git a/paddle/fluid/distributed/ps/service/env.h b/paddle/fluid/distributed/ps/service/env.h index 0cc57229b7a82..162ee6f098422 100644 --- a/paddle/fluid/distributed/ps/service/env.h +++ b/paddle/fluid/distributed/ps/service/env.h @@ -40,7 +40,7 @@ struct PSHost { // |---ip---|---port---|--rank--| // |-32bit--|--20bit---|--12bit-| - uint64_t serialize_to_uint64() { + uint64_t SerializeToUint64() { uint64_t host_label = 0; host_label = inet_addr(ip.c_str()); host_label = host_label << 32; @@ -49,7 +49,7 @@ struct PSHost { return host_label; } - void parse_from_uint64(uint64_t host_label) { + void ParseFromUint64(uint64_t host_label) { 
static uint64_t rank_label_mask = (1L << 12) - 1; static uint64_t port_label_mask = (1L << 20) - 1; rank = host_label & rank_label_mask; @@ -58,17 +58,17 @@ struct PSHost { ip = inet_ntoa(*(in_addr *)&ip_addr); // NOLINT } - std::string to_string() { + std::string ToString() { std::stringstream s; s << "host: " << ip; s << " port: " << port; s << " rank: " << rank; - s << " uint: " << serialize_to_uint64(); + s << " uint: " << SerializeToUint64(); return s.str(); } // for open source parameter server - std::string serialize_to_string() { + std::string SerializeToString() { std::stringstream s; s << ip << ":"; s << port << ":"; @@ -76,16 +76,16 @@ struct PSHost { return s.str(); } - void parse_from_string(std::string endpoint) { + void ParseFromString(std::string endpoint) { std::vector endpoint_info; - string_split(endpoint, ':', &endpoint_info); + StringSplit(endpoint, ':', &endpoint_info); ip = endpoint_info[0]; port = std::stoi(endpoint_info[1]); rank = std::stoi(endpoint_info[2]); } - void string_split(const std::string &str, char sep, - std::vector *pieces, bool ignore_null = true) { + void StringSplit(const std::string &str, char sep, + std::vector *pieces, bool ignore_null = true) { pieces->clear(); if (str.empty()) { if (!ignore_null) { @@ -111,63 +111,60 @@ class PSEnvironment { explicit PSEnvironment() {} // NOLINT virtual ~PSEnvironment() {} - virtual int32_t set_ps_servers(uint64_t *host_sign_list, int node_num) { + virtual int32_t SetPsServers(uint64_t *host_sign_list, int node_num) { return 0; } - virtual int32_t set_ps_servers( + virtual int32_t SetPsServers( const std::vector *host_endpoint_list, int node_num) { return 0; } - virtual int32_t set_ps_clients(uint64_t *host_sign_list, int node_num) { + virtual int32_t SetPsClients(uint64_t *host_sign_list, int node_num) { return 0; } - virtual int32_t set_ps_clients(std::string *host_endpoint_list, - int node_num) { + virtual int32_t SetPsClients(std::string *host_endpoint_list, int node_num) { return 0; } - virtual uint64_t get_local_host_sign() { return 0; } - virtual std::vector get_ps_servers() const { return _ps_server_list; } - virtual int32_t registe_ps_server(const std::string &ip, uint32_t port, - int32_t rank) { - return registe_ps_host(ip, port, rank, _ps_server_list, - _ps_server_sign_set); + virtual uint64_t GetLocalHostSign() { return 0; } + virtual std::vector GetPsServers() const { return _ps_server_list; } + virtual int32_t RegistePsServer(const std::string &ip, uint32_t port, + int32_t rank) { + return RegistePsHost(ip, port, rank, _ps_server_list, _ps_server_sign_set); } - virtual std::vector get_ps_clients() const { return _ps_client_list; } - virtual int32_t registe_ps_client(const std::string &ip, uint32_t port, - int32_t rank) { - return registe_ps_host(ip, port, rank, _ps_client_list, - _ps_client_sign_set); + virtual std::vector GetPsClients() const { return _ps_client_list; } + virtual int32_t RegistePsClient(const std::string &ip, uint32_t port, + int32_t rank) { + return RegistePsHost(ip, port, rank, _ps_client_list, _ps_client_sign_set); } - virtual std::vector get_client_info() { + virtual std::vector GetClientInfo() { std::vector client_info; for (auto &i : _ps_client_list) { - client_info.push_back(i.serialize_to_uint64()); + client_info.push_back(i.SerializeToUint64()); } return client_info; } - virtual std::vector get_client_info(bool use_string_endpoint) { + virtual std::vector GetClientInfo(bool use_string_endpoint) { if (use_string_endpoint) { std::vector client_info; for (auto &i : 
_ps_client_list) { - client_info.push_back(i.serialize_to_string()); + client_info.push_back(i.SerializeToString()); } return client_info; } return {}; } - virtual void set_trainers(int trainers) { trainers_ = trainers; } + virtual void SetTrainers(int trainers) { trainers_ = trainers; } - virtual int get_trainers() { return trainers_; } + virtual int GetTrainers() { return trainers_; } protected: //注册一个host // NOLINT - virtual int32_t registe_ps_host( + virtual int32_t RegistePsHost( const std::string &ip, uint32_t port, int32_t rank, std::vector &host_list, // NOLINT std::unordered_set &sign_set) { // NOLINT @@ -198,15 +195,15 @@ class PaddlePSEnvironment : public PSEnvironment { explicit PaddlePSEnvironment() {} // NOLINT virtual ~PaddlePSEnvironment() {} - virtual int32_t set_ps_servers(uint64_t *host_sign_list, int node_num) { + virtual int32_t SetPsServers(uint64_t *host_sign_list, int node_num) { _ps_server_list.clear(); _ps_server_sign_set.clear(); for (int i = 0; i < node_num; ++i) { if (host_sign_list[i] > 0) { PSHost host; - host.parse_from_uint64(host_sign_list[i]); + host.ParseFromUint64(host_sign_list[i]); _ps_server_list.push_back(host); - _ps_server_sign_set.insert(host.serialize_to_uint64()); + _ps_server_sign_set.insert(host.SerializeToUint64()); } } std::sort( @@ -215,14 +212,14 @@ class PaddlePSEnvironment : public PSEnvironment { return 0; } - virtual int32_t set_ps_servers(const std::vector *host_sign_list, - int node_num) { + virtual int32_t SetPsServers(const std::vector *host_sign_list, + int node_num) { _ps_server_list.clear(); _ps_server_sign_set.clear(); for (int i = 0; i < node_num; ++i) { if (host_sign_list->at(i) != "") { PSHost host; - host.parse_from_string(host_sign_list->at(i)); + host.ParseFromString(host_sign_list->at(i)); _ps_server_list.push_back(host); _ps_server_sign_set.insert(host.rank); } @@ -233,15 +230,15 @@ class PaddlePSEnvironment : public PSEnvironment { return 0; } - virtual int32_t set_ps_clients(uint64_t *host_sign_list, int node_num) { + virtual int32_t SetPsClients(uint64_t *host_sign_list, int node_num) { _ps_client_list.clear(); _ps_client_sign_set.clear(); for (int i = 0; i < node_num; ++i) { if (host_sign_list[i] > 0) { PSHost host; - host.parse_from_uint64(host_sign_list[i]); + host.ParseFromUint64(host_sign_list[i]); _ps_client_list.push_back(host); - _ps_client_sign_set.insert(host.serialize_to_uint64()); + _ps_client_sign_set.insert(host.SerializeToUint64()); } } std::sort( @@ -250,14 +247,14 @@ class PaddlePSEnvironment : public PSEnvironment { return 0; } - virtual int32_t set_ps_clients(const std::vector *host_sign_list, - int node_num) { + virtual int32_t SetPsClients(const std::vector *host_sign_list, + int node_num) { _ps_client_list.clear(); _ps_client_sign_set.clear(); for (int i = 0; i < node_num; ++i) { if (host_sign_list->at(i) != "") { PSHost host; - host.parse_from_string(host_sign_list->at(i)); + host.ParseFromString(host_sign_list->at(i)); _ps_client_list.push_back(host); _ps_client_sign_set.insert(host.rank); } @@ -269,9 +266,9 @@ class PaddlePSEnvironment : public PSEnvironment { return 0; } - virtual uint64_t get_local_host_sign() { + virtual uint64_t GetLocalHostSign() { if (_ps_client_list.size() > 0) { - return _ps_client_list[0].serialize_to_uint64(); + return _ps_client_list[0].SerializeToUint64(); } else { return 0; } diff --git a/paddle/fluid/distributed/ps/service/graph_brpc_client.cc b/paddle/fluid/distributed/ps/service/graph_brpc_client.cc index a3db88e3b679d..827a643ee50d6 100644 --- 
a/paddle/fluid/distributed/ps/service/graph_brpc_client.cc +++ b/paddle/fluid/distributed/ps/service/graph_brpc_client.cc @@ -135,8 +135,7 @@ std::future GraphBrpcClient::get_node_feat( closure->request(request_idx) ->add_params(joint_feature_name.c_str(), joint_feature_name.size()); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(request_idx)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(request_idx), closure->request(request_idx), closure->response(request_idx), closure); @@ -169,8 +168,7 @@ std::future GraphBrpcClient::clear_nodes(uint32_t table_id) { closure->request(server_index)->set_table_id(table_id); closure->request(server_index)->set_client_id(_client_id); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(server_index)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(server_index), closure->request(server_index), @@ -238,9 +236,8 @@ std::future GraphBrpcClient::add_graph_node( ->add_params((char *)weighted, sizeof(bool) * is_weighted_bucket[request_idx].size()); } - // PsService_Stub rpc_stub(get_cmd_channel(server_index)); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + // PsService_Stub rpc_stub(GetCmdChannel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(request_idx)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(request_idx), closure->request(request_idx), closure->response(request_idx), closure); @@ -292,9 +289,8 @@ std::future GraphBrpcClient::remove_graph_node( closure->request(request_idx) ->add_params((char *)request_bucket[request_idx].data(), sizeof(int64_t) * node_num); - // PsService_Stub rpc_stub(get_cmd_channel(server_index)); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + // PsService_Stub rpc_stub(GetCmdChannel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(request_idx)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(request_idx), closure->request(request_idx), closure->response(request_idx), closure); @@ -362,9 +358,8 @@ std::future GraphBrpcClient::batch_sample_neighbors( closure->request(0)->add_params((char *)&sample_size, sizeof(int)); closure->request(0)->add_params((char *)&need_weight, sizeof(bool)); ; - // PsService_Stub rpc_stub(get_cmd_channel(server_index)); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + // PsService_Stub rpc_stub(GetCmdChannel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(0)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(0), closure->request(0), closure->response(0), closure); @@ -464,9 +459,8 @@ std::future GraphBrpcClient::batch_sample_neighbors( ->add_params((char *)&sample_size, sizeof(int)); closure->request(request_idx) ->add_params((char *)&need_weight, sizeof(bool)); - // PsService_Stub rpc_stub(get_cmd_channel(server_index)); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + // PsService_Stub rpc_stub(GetCmdChannel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); 
closure->cntl(request_idx)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(request_idx), closure->request(request_idx), closure->response(request_idx), closure); @@ -506,8 +500,8 @@ std::future GraphBrpcClient::random_sample_nodes( closure->request(0)->set_client_id(_client_id); closure->request(0)->add_params((char *)&sample_size, sizeof(int)); ; - // PsService_Stub rpc_stub(get_cmd_channel(server_index)); - GraphPsService_Stub rpc_stub = getServiceStub(get_cmd_channel(server_index)); + // PsService_Stub rpc_stub(GetCmdChannel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(0)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(0), closure->request(0), closure->response(0), closure); @@ -541,8 +535,7 @@ std::future GraphBrpcClient::load_graph_split_config( closure->request(server_index)->set_table_id(table_id); closure->request(server_index)->set_client_id(_client_id); closure->request(server_index)->add_params(path); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(server_index)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(server_index), closure->request(server_index), @@ -581,8 +574,7 @@ std::future GraphBrpcClient::use_neighbors_sample_cache( closure->request(server_index) ->add_params((char *)&size_limit, sizeof(size_t)); closure->request(server_index)->add_params((char *)&ttl, sizeof(size_t)); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(server_index)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(server_index), closure->request(server_index), @@ -624,8 +616,8 @@ std::future GraphBrpcClient::pull_graph_list( closure->request(0)->add_params((char *)&start, sizeof(int)); closure->request(0)->add_params((char *)&size, sizeof(int)); closure->request(0)->add_params((char *)&step, sizeof(int)); - // PsService_Stub rpc_stub(get_cmd_channel(server_index)); - GraphPsService_Stub rpc_stub = getServiceStub(get_cmd_channel(server_index)); + // PsService_Stub rpc_stub(GetCmdChannel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(0)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(0), closure->request(0), closure->response(0), closure); @@ -717,8 +709,7 @@ std::future GraphBrpcClient::set_node_feat( closure->request(request_idx) ->add_params(set_feature.c_str(), set_feature.size()); - GraphPsService_Stub rpc_stub = - getServiceStub(get_cmd_channel(server_index)); + GraphPsService_Stub rpc_stub = getServiceStub(GetCmdChannel(server_index)); closure->cntl(request_idx)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(request_idx), closure->request(request_idx), closure->response(request_idx), closure); @@ -727,10 +718,10 @@ std::future GraphBrpcClient::set_node_feat( return fut; } -int32_t GraphBrpcClient::initialize() { +int32_t GraphBrpcClient::Initialize() { // set_shard_num(_config.shard_num()); - BrpcPsClient::initialize(); - server_size = get_server_nums(); + BrpcPsClient::Initialize(); + server_size = GetServerNums(); graph_service = NULL; local_channel = NULL; return 0; diff --git a/paddle/fluid/distributed/ps/service/graph_brpc_client.h b/paddle/fluid/distributed/ps/service/graph_brpc_client.h 
index e2b8a518615dc..d1d3c95260df4 100644 --- a/paddle/fluid/distributed/ps/service/graph_brpc_client.h +++ b/paddle/fluid/distributed/ps/service/graph_brpc_client.h @@ -97,12 +97,12 @@ class GraphBrpcClient : public BrpcPsClient { std::string path); virtual std::future remove_graph_node( uint32_t table_id, std::vector& node_id_list); - virtual int32_t initialize(); + virtual int32_t Initialize(); int get_shard_num() { return shard_num; } void set_shard_num(int shard_num) { this->shard_num = shard_num; } int get_server_index_by_id(int64_t id); void set_local_channel(int index) { - this->local_channel = get_cmd_channel(index); + this->local_channel = GetCmdChannel(index); } void set_local_graph_service(GraphBrpcService* graph_service) { this->graph_service = graph_service; diff --git a/paddle/fluid/distributed/ps/service/graph_brpc_server.cc b/paddle/fluid/distributed/ps/service/graph_brpc_server.cc index 20a55e4d11983..21e590997b178 100644 --- a/paddle/fluid/distributed/ps/service/graph_brpc_server.cc +++ b/paddle/fluid/distributed/ps/service/graph_brpc_server.cc @@ -33,7 +33,7 @@ namespace distributed { return -1; \ } -int32_t GraphBrpcServer::initialize() { +int32_t GraphBrpcServer::Initialize() { auto &service_config = _config.downpour_server_param().service_param(); if (!service_config.has_service_class()) { LOG(ERROR) << "miss service_class in ServerServiceParameter"; @@ -48,7 +48,7 @@ int32_t GraphBrpcServer::initialize() { } _service.reset(service); - if (service->configure(this) != 0 || service->initialize() != 0) { + if (service->Configure(this) != 0 || service->Initialize() != 0) { LOG(ERROR) << "service initialize failed, service_name:" << service_config.service_class(); return -1; @@ -61,11 +61,11 @@ int32_t GraphBrpcServer::initialize() { return 0; } -brpc::Channel *GraphBrpcServer::get_cmd_channel(size_t server_index) { +brpc::Channel *GraphBrpcServer::GetCmdChannel(size_t server_index) { return _pserver_channels[server_index].get(); } -uint64_t GraphBrpcServer::start(const std::string &ip, uint32_t port) { +uint64_t GraphBrpcServer::Start(const std::string &ip, uint32_t port) { std::unique_lock lock(mutex_); std::string ip_port = ip + ":" + std::to_string(port); @@ -73,20 +73,20 @@ uint64_t GraphBrpcServer::start(const std::string &ip, uint32_t port) { brpc::ServerOptions options; int num_threads = std::thread::hardware_concurrency(); - auto trainers = _environment->get_trainers(); + auto trainers = _environment->GetTrainers(); options.num_threads = trainers > num_threads ? 
trainers : num_threads; if (_server.Start(ip_port.c_str(), &options) != 0) { LOG(ERROR) << "GraphBrpcServer start failed, ip_port=" << ip_port; return 0; } - _environment->registe_ps_server(ip, port, _rank); + _environment->RegistePsServer(ip, port, _rank); return 0; } int32_t GraphBrpcServer::build_peer2peer_connection(int rank) { this->rank = rank; - auto _env = environment(); + auto _env = Environment(); brpc::ChannelOptions options; options.protocol = "baidu_std"; options.timeout_ms = 500000; @@ -94,7 +94,7 @@ int32_t GraphBrpcServer::build_peer2peer_connection(int rank) { options.connect_timeout_ms = 10000; options.max_retry = 3; - std::vector server_list = _env->get_ps_servers(); + std::vector server_list = _env->GetPsServers(); _pserver_channels.resize(server_list.size()); std::ostringstream os; std::string server_ip_port; @@ -172,19 +172,18 @@ int32_t GraphBrpcService::remove_graph_node(Table *table, ((GraphTable *)table)->remove_graph_node(node_ids); return 0; } -int32_t GraphBrpcServer::port() { return _server.listen_address().port; } +int32_t GraphBrpcServer::Port() { return _server.listen_address().port; } -int32_t GraphBrpcService::initialize() { +int32_t GraphBrpcService::Initialize() { _is_initialize_shard_info = false; - _service_handler_map[PS_STOP_SERVER] = &GraphBrpcService::stop_server; - _service_handler_map[PS_LOAD_ONE_TABLE] = &GraphBrpcService::load_one_table; - _service_handler_map[PS_LOAD_ALL_TABLE] = &GraphBrpcService::load_all_table; + _service_handler_map[PS_STOP_SERVER] = &GraphBrpcService::StopServer; + _service_handler_map[PS_LOAD_ONE_TABLE] = &GraphBrpcService::LoadOneTable; + _service_handler_map[PS_LOAD_ALL_TABLE] = &GraphBrpcService::LoadAllTable; - _service_handler_map[PS_PRINT_TABLE_STAT] = - &GraphBrpcService::print_table_stat; - _service_handler_map[PS_BARRIER] = &GraphBrpcService::barrier; - _service_handler_map[PS_START_PROFILER] = &GraphBrpcService::start_profiler; - _service_handler_map[PS_STOP_PROFILER] = &GraphBrpcService::stop_profiler; + _service_handler_map[PS_PRINT_TABLE_STAT] = &GraphBrpcService::PrintTableStat; + _service_handler_map[PS_BARRIER] = &GraphBrpcService::Barrier; + _service_handler_map[PS_START_PROFILER] = &GraphBrpcService::StartProfiler; + _service_handler_map[PS_STOP_PROFILER] = &GraphBrpcService::StopProfiler; _service_handler_map[PS_PULL_GRAPH_LIST] = &GraphBrpcService::pull_graph_list; _service_handler_map[PS_GRAPH_SAMPLE_NEIGHBORS] = @@ -207,21 +206,21 @@ int32_t GraphBrpcService::initialize() { _service_handler_map[PS_GRAPH_LOAD_GRAPH_SPLIT_CONFIG] = &GraphBrpcService::load_graph_split_config; // shard初始化,server启动后才可从env获取到server_list的shard信息 - initialize_shard_info(); + InitializeShardInfo(); return 0; } -int32_t GraphBrpcService::initialize_shard_info() { +int32_t GraphBrpcService::InitializeShardInfo() { if (!_is_initialize_shard_info) { std::lock_guard guard(_initialize_shard_mutex); if (_is_initialize_shard_info) { return 0; } - server_size = _server->environment()->get_ps_servers().size(); - auto &table_map = *(_server->table()); + server_size = _server->Environment()->GetPsServers().size(); + auto &table_map = *(_server->GetTable()); for (auto itr : table_map) { - itr.second->set_shard(_rank, server_size); + itr.second->SetShard(_rank, server_size); } _is_initialize_shard_info = true; } @@ -241,7 +240,7 @@ void GraphBrpcService::service(google::protobuf::RpcController *cntl_base, response->set_err_code(0); response->set_err_msg(""); - auto *table = _server->table(request->table_id()); + auto *table = 
_server->GetTable(request->table_id()); brpc::Controller *cntl = static_cast(cntl_base); auto itr = _service_handler_map.find(request->cmd_id()); if (itr == _service_handler_map.end()) { @@ -261,7 +260,7 @@ void GraphBrpcService::service(google::protobuf::RpcController *cntl_base, } } -int32_t GraphBrpcService::barrier(Table *table, const PsRequestMessage &request, +int32_t GraphBrpcService::Barrier(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) @@ -275,16 +274,16 @@ int32_t GraphBrpcService::barrier(Table *table, const PsRequestMessage &request, auto trainer_id = request.client_id(); auto barrier_type = request.params(0); - table->barrier(trainer_id, barrier_type); + table->Barrier(trainer_id, barrier_type); return 0; } -int32_t GraphBrpcService::print_table_stat(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t GraphBrpcService::PrintTableStat(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) - std::pair ret = table->print_table_stat(); + std::pair ret = table->PrintTableStat(); paddle::framework::BinaryArchive ar; ar << ret.first << ret.second; std::string table_info(ar.Buffer(), ar.Length()); @@ -293,10 +292,10 @@ int32_t GraphBrpcService::print_table_stat(Table *table, return 0; } -int32_t GraphBrpcService::load_one_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t GraphBrpcService::LoadOneTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { CHECK_TABLE_EXIST(table, request, response) if (request.params_size() < 2) { set_response_code( @@ -304,20 +303,20 @@ int32_t GraphBrpcService::load_one_table(Table *table, "PsRequestMessage.datas is requeired at least 2 for path & load_param"); return -1; } - if (table->load(request.params(0), request.params(1)) != 0) { + if (table->Load(request.params(0), request.params(1)) != 0) { set_response_code(response, -1, "table load failed"); return -1; } return 0; } -int32_t GraphBrpcService::load_all_table(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { - auto &table_map = *(_server->table()); +int32_t GraphBrpcService::LoadAllTable(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { + auto &table_map = *(_server->GetTable()); for (auto &itr : table_map) { - if (load_one_table(itr.second.get(), request, response, cntl) != 0) { + if (LoadOneTable(itr.second.get(), request, response, cntl) != 0) { LOG(ERROR) << "load table[" << itr.first << "] failed"; return -1; } @@ -325,13 +324,13 @@ int32_t GraphBrpcService::load_all_table(Table *table, return 0; } -int32_t GraphBrpcService::stop_server(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t GraphBrpcService::StopServer(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { GraphBrpcServer *p_server = (GraphBrpcServer *)_server; std::thread t_stop([p_server]() { - p_server->stop(); + p_server->Stop(); LOG(INFO) << "Server Stoped"; }); p_server->export_cv()->notify_all(); @@ -339,19 +338,19 @@ int32_t GraphBrpcService::stop_server(Table *table, return 0; } -int32_t 
GraphBrpcService::stop_profiler(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t GraphBrpcService::StopProfiler(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::DisableProfiler(platform::EventSortingKey::kDefault, string::Sprintf("server_%s_profile", _rank)); return 0; } -int32_t GraphBrpcService::start_profiler(Table *table, - const PsRequestMessage &request, - PsResponseMessage &response, - brpc::Controller *cntl) { +int32_t GraphBrpcService::StartProfiler(Table *table, + const PsRequestMessage &request, + PsResponseMessage &response, + brpc::Controller *cntl) { platform::EnableProfiler(platform::ProfilerState::kCPU); return 0; } @@ -475,7 +474,7 @@ int32_t GraphBrpcService::sample_neighbors_across_multi_servers( std::vector server2request(server_size, -1); std::vector local_id; std::vector local_query_idx; - size_t rank = get_rank(); + size_t rank = GetRank(); for (int query_idx = 0; query_idx < node_num; ++query_idx) { int server_index = ((GraphTable *)table)->get_server_index_by_id(node_data[query_idx]); @@ -589,9 +588,9 @@ int32_t GraphBrpcService::sample_neighbors_across_multi_servers( closure->request(request_idx) ->add_params((char *)&need_weight, sizeof(bool)); PsService_Stub rpc_stub( - ((GraphBrpcServer *)get_server())->get_cmd_channel(server_index)); + ((GraphBrpcServer *)GetServer())->GetCmdChannel(server_index)); // GraphPsService_Stub rpc_stub = - // getServiceStub(get_cmd_channel(server_index)); + // getServiceStub(GetCmdChannel(server_index)); closure->cntl(request_idx)->set_log_id(butil::gettimeofday_ms()); rpc_stub.service(closure->cntl(request_idx), closure->request(request_idx), closure->response(request_idx), closure); diff --git a/paddle/fluid/distributed/ps/service/graph_brpc_server.h b/paddle/fluid/distributed/ps/service/graph_brpc_server.h index a978d97b296b0..caf728701b289 100644 --- a/paddle/fluid/distributed/ps/service/graph_brpc_server.h +++ b/paddle/fluid/distributed/ps/service/graph_brpc_server.h @@ -31,10 +31,10 @@ class GraphBrpcServer : public PSServer { GraphBrpcServer() {} virtual ~GraphBrpcServer() {} PsBaseService *get_service() { return _service.get(); } - virtual uint64_t start(const std::string &ip, uint32_t port); + virtual uint64_t Start(const std::string &ip, uint32_t port); virtual int32_t build_peer2peer_connection(int rank); - virtual brpc::Channel *get_cmd_channel(size_t server_index); - virtual int32_t stop() { + virtual brpc::Channel *GetCmdChannel(size_t server_index); + virtual int32_t Stop() { std::unique_lock lock(mutex_); if (stoped_) return 0; stoped_ = true; @@ -43,12 +43,12 @@ class GraphBrpcServer : public PSServer { _server.Join(); return 0; } - int32_t port(); + int32_t Port(); std::condition_variable *export_cv() { return &cv_; } private: - virtual int32_t initialize(); + virtual int32_t Initialize(); mutable std::mutex mutex_; std::condition_variable cv_; bool stoped_ = false; @@ -66,7 +66,7 @@ typedef int32_t (GraphBrpcService::*serviceFunc)( class GraphBrpcService : public PsBaseService { public: - virtual int32_t initialize() override; + virtual int32_t Initialize() override; virtual void service(::google::protobuf::RpcController *controller, const PsRequestMessage *request, @@ -75,7 +75,7 @@ class GraphBrpcService : public PsBaseService { protected: std::unordered_map _service_handler_map; - int32_t initialize_shard_info(); + int32_t InitializeShardInfo(); int32_t pull_graph_list(Table 
*table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); int32_t graph_random_sample_neighbors(Table *table, @@ -100,21 +100,21 @@ class GraphBrpcService : public PsBaseService { int32_t remove_graph_node(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); - int32_t barrier(Table *table, const PsRequestMessage &request, + int32_t Barrier(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); - int32_t load_one_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t load_all_table(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t stop_server(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t start_profiler(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); - int32_t stop_profiler(Table *table, const PsRequestMessage &request, + int32_t LoadOneTable(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t LoadAllTable(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t StopServer(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); + int32_t StartProfiler(Table *table, const PsRequestMessage &request, PsResponseMessage &response, brpc::Controller *cntl); + int32_t StopProfiler(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); - int32_t print_table_stat(Table *table, const PsRequestMessage &request, - PsResponseMessage &response, brpc::Controller *cntl); + int32_t PrintTableStat(Table *table, const PsRequestMessage &request, + PsResponseMessage &response, brpc::Controller *cntl); int32_t sample_neighbors_across_multi_servers(Table *table, const PsRequestMessage &request, diff --git a/paddle/fluid/distributed/ps/service/heter_client.cc b/paddle/fluid/distributed/ps/service/heter_client.cc index d6287cda6d443..4ca25dac826f0 100644 --- a/paddle/fluid/distributed/ps/service/heter_client.cc +++ b/paddle/fluid/distributed/ps/service/heter_client.cc @@ -13,18 +13,14 @@ // limitations under the License. 
#include "paddle/fluid/distributed/ps/service/heter_client.h" + #include "paddle/fluid/framework/convert_utils.h" #include "paddle/fluid/platform/profiler.h" -#include "paddle/fluid/string/split.h" - -DECLARE_int32(rpc_deadline); -DECLARE_int32(pserver_timeout_ms); namespace paddle { namespace distributed { -std::shared_ptr HeterClient::s_instance_ = NULL; -bool HeterClient::is_initialized_ = false; +std::shared_ptr HeterClient::s_instance_ = nullptr; int GetMicroId(const platform::DeviceContext& ctx, const framework::Scope* scope) { @@ -54,58 +50,21 @@ int GetMicroId(const platform::DeviceContext& ctx, return micro_id; } -void HeterClient::MainThread() { - while (running_) { - RpcProfilerControl(); - } -} - void HeterClient::Stop() { - running_ = false; - if (!is_initialized_) { - VLOG(3) << "HeterClient is not inited, do nothing"; - } else { - if (main_thread_) { - auto status = StopHeterWorker(); - status.wait(); - main_thread_->join(); - main_thread_.reset(nullptr); - } - VLOG(3) << "HeterClient Stop Done"; - } -} - -void HeterClient::FinalizeWorker() { - running_ = false; - if (!is_initialized_) { - VLOG(3) << "HeterClient is not inited, do nothing"; - } else { - if (main_thread_) { - main_thread_->join(); - main_thread_.reset(nullptr); - } - VLOG(3) << "HeterClient Stop Done"; - } + auto status = StopHeterWorker(); + status.wait(); } std::future HeterClient::StopHeterWorker() { return SendCmd(-1, PS_STOP_SERVER, {}); } -void HeterClient::RpcProfilerControl() { - if (trainer_id_ == 0) { - if (!do_server_profiler_ && platform::IsProfileEnabled()) { - // send profiler start flag - do_server_profiler_ = true; - auto start_status = StartProfiler(); - start_status.wait(); - } else if (do_server_profiler_ && !platform::IsProfileEnabled()) { - // send profiler end flag - auto stop_status = StopProfiler(); - stop_status.wait(); - do_server_profiler_ = false; - } - } +std::future HeterClient::StartProfiler() { + return SendCmd(-1, PS_START_PROFILER, {}); +} + +std::future HeterClient::StopProfiler() { + return SendCmd(-1, PS_STOP_PROFILER, {}); } void HeterClient::CreateClient2XpuConnection() { @@ -156,27 +115,24 @@ void HeterClient::SendAndRecvAsync( 1); const platform::DeviceContext* p_ctx = &ctx; const framework::Scope* p_scope = &scope; - const std::string message_name_val = message_name; const std::vector send_var_name_val = send_var_name; const std::vector recv_var_name_val = recv_var_name; - VLOG(3) << "BRPCClient::SendAndRecv Begin, message_name: " - << message_name_val; + VLOG(3) << "BRPCClient::SendAndRecv Begin, message_name: " << message_name; brpc::Channel* channel = nullptr; distributed::MultiVarMsg request; - OnHeterRpcDone* closure = new OnHeterRpcDone([p_ctx, p_scope](void* done) { + OnHeterRpcDone* closure = new OnHeterRpcDone([](void* done) { auto* closure = reinterpret_cast(done); PADDLE_ENFORCE_NE( closure->cntl.Failed(), true, platform::errors::Unimplemented( "HeterClient::SendAndRecv meets brpc error, error message is %s", closure->cntl.ErrorText())); - VLOG(4) << "call heter_worker success"; }); closure->cntl.set_timeout_ms(FLAGS_pserver_timeout_ms); auto& request_io_buffer = closure->cntl.request_attachment(); distributed::SerializeToMultiVarMsgAndIOBuf( - message_name_val, send_var_name_val, recv_var_name_val, *p_ctx, p_scope, + message_name, send_var_name_val, recv_var_name_val, *p_ctx, p_scope, &request, &request_io_buffer); int micro_id = GetMicroId(ctx, p_scope); @@ -188,6 +144,19 @@ void HeterClient::SendAndRecvAsync( } else if (mode == "backward") { int num = 
minibatch_id % previous_xpu_channels_.size(); channel = previous_xpu_channels_[num].get(); + } else if (mode == "send_to_switch") { + VLOG(4) << "calling switch service"; + // auto promise = std::make_shared>(); + // closure->add_promise(promise); + // std::future fut = promise->get_future(); + // int idx = 1; // for test + // LOG(INFO) << "xpu_channels_ size: " << xpu_channels_.size(); + // channel = xpu_channels_[idx].get(); // 为了适配 send_and_recv op + // ::paddle::distributed::PsService_Stub stub(channel); + // stub.SendToSwitch(&closure->cntl, &request, &closure->response, + // closure); fut.wait(); + VLOG(4) << "calling switch service done"; + return; } ::paddle::distributed::PsService_Stub stub(channel); stub.SendAndRecvVariable(&closure->cntl, &request, &closure->response, @@ -229,13 +198,209 @@ std::future HeterClient::SendCmd( return fut; } -std::future HeterClient::StartProfiler() { - return SendCmd(-1, PS_START_PROFILER, {}); +int HeterClient::Send(const platform::DeviceContext& ctx, + const framework::Scope& scope, + const std::string& message_name, + const std::vector& send_var_names) { + const framework::Scope* p_scope = &scope; // 注意是 const + OnHeterRpcDone* closure = new OnHeterRpcDone([](void* done) { + auto* closure = reinterpret_cast(done); + int ret = 0; + closure->set_promise_value(ret); + if (closure->cntl.Failed()) { + PADDLE_ENFORCE_NE( + closure->cntl.Failed(), true, + platform::errors::Unimplemented( + "HeterClient::SendToSwitch meets brpc error, error message is %s", + closure->cntl.ErrorText())); + } + }); + + closure->cntl.set_timeout_ms(FLAGS_pserver_timeout_ms); + auto& request_io_buffer = closure->cntl.request_attachment(); + + distributed::MultiVarMsg request; + // 1. set req message_name(string) + request.set_message_name(message_name); + + // 2. set req send_var_names() + for (auto& send_var_name : send_var_names) { + request.add_send_var_names(send_var_name); + } + + // 3. 
set req var_messages() + for (auto& send_var_name : send_var_names) { + auto* send_var_msg = request.add_var_messages(); + send_var_msg->set_varname(send_var_name); + framework::Variable* var = p_scope->FindVar(send_var_name); + butil::IOBuf temp_iobuf; + if (var->IsType()) { + SerializeLodTensor(var, ctx, send_var_msg, &temp_iobuf); + } else if (var->IsType()) { + SerializeSelectedRows(var, ctx, send_var_msg, &temp_iobuf); + } + request_io_buffer.append(temp_iobuf); + } + auto promise = std::make_shared>(); + closure->add_promise(promise); + std::future fut = promise->get_future(); + if (send_switch_channels_.empty()) { + LOG(ERROR) << "send_switch_channels_ is null, get xpu_channels_[0]"; + if (xpu_channels_.empty()) { + LOG(ERROR) << "xpu_channels_ is null"; + } + send_switch_channels_.push_back(xpu_channels_[0]); + } + brpc::Channel* channel = send_switch_channels_[0].get(); + // brpc::Channel* channel = xpu_channels_[0].get(); + ::paddle::distributed::PsService_Stub stub(channel); + stub.SendToSwitch(&closure->cntl, &request, &closure->ps_response, closure); + + VLOG(4) << "waiting SendToSwitch response result......"; + fut.wait(); + VLOG(4) << "Send done"; + return 0; } -std::future HeterClient::StopProfiler() { - return SendCmd(-1, PS_STOP_PROFILER, {}); +int HeterClient::Send(int group_id, const std::vector& var_names, + const std::vector& vars_len, void* data_ptr, + int64_t data_size) { + OnHeterRpcDone* closure = new OnHeterRpcDone([](void* done) { + auto* closure = reinterpret_cast(done); + int ret = 0; + closure->set_promise_value(ret); + if (closure->cntl.Failed()) { + LOG(ERROR) << "Send meets brpc error, err msg is %s" + << closure->cntl.ErrorText(); + } + }); + distributed::MultiVarMsg request; + closure->cntl.set_timeout_ms(FLAGS_pserver_timeout_ms); + std::string message_name = "send and save"; + request.set_message_name(message_name); + request.set_group_id(group_id); + for (auto& send_var_name : var_names) { + request.add_send_var_names(send_var_name); + } + for (auto var_len : vars_len) { + request.add_vars_len(var_len); + } + auto& request_buffer = closure->cntl.request_attachment(); + request_buffer.append(reinterpret_cast(data_ptr), + data_size * sizeof(float)); + auto promise = std::make_shared>(); + closure->add_promise(promise); + std::future fut = promise->get_future(); + if (send_switch_channels_.empty()) { + LOG(ERROR) << "send_switch_channels_ is null, get xpu_channels_[0]"; + if (xpu_channels_.empty()) { + LOG(ERROR) << "xpu_channels_ is null"; + } + send_switch_channels_.push_back(xpu_channels_[0]); + } + brpc::Channel* channel = send_switch_channels_[0].get(); + ::paddle::distributed::PsService_Stub stub(channel); + stub.SendToSwitch(&closure->cntl, &request, &closure->ps_response, closure); + fut.wait(); + return 0; } -} // end namespace distributed +int HeterClient::Recv(const platform::DeviceContext& ctx, + framework::Scope& recv_scope, // NOLINT + const std::string& message_name, + const std::vector& recv_var_names) { + OnHeterRpcDone* closure = new OnHeterRpcDone([](void* done) { + auto* closure = reinterpret_cast(done); + VLOG(4) << "Recv service call done"; + int ret = 0; + closure->set_promise_value(ret); + if (closure->cntl.Failed()) { + VLOG(4) << "HeterClient::RecvFromSwitch meets " + "brpc error, error message is %s" + << closure->cntl.ErrorText(); + } + }); + + closure->cntl.set_timeout_ms(FLAGS_pserver_timeout_ms); + + distributed::MultiVarMsg request; + // 1. set req message_name(string) + request.set_message_name(message_name); + + // 2. 
set req recv_var_names() + for (auto& recv_var_name : recv_var_names) { + request.add_recv_var_names(recv_var_name); + } + auto promise = std::make_shared>(); + closure->add_promise(promise); + std::future fut = promise->get_future(); + if (recv_switch_channels_.empty()) { + LOG(ERROR) << "peer_switch_channels_ is null, get xpu_channels_[1]"; + if (xpu_channels_.size() < 2) { + LOG(ERROR) << "xpu_channels_ is null"; + } + recv_switch_channels_.push_back(xpu_channels_[1]); + } + brpc::Channel* channel = recv_switch_channels_[0].get(); + ::paddle::distributed::PsService_Stub stub(channel); + stub.RecvFromSwitch(&closure->cntl, &request, &closure->response, closure); + fut.wait(); + VLOG(4) << "RecvFromSwitch done"; + // save in worker + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + platform::CPUPlace cpu_place; + auto& cpu_dev_ctx = *pool.Get(cpu_place); + auto& res_io_buffer = closure->cntl.response_attachment(); + VLOG(4) << "entering DeserializeFromMultiVarMsgAndIOBuf"; + distributed::DeserializeFromMultiVarMsgAndIOBuf( + closure->response, &res_io_buffer, cpu_dev_ctx, &recv_scope); + VLOG(4) << "Recv done"; + return 0; +} + +int HeterClient::Recv(int group_id, const std::vector& var_names, + void* data_ptr, int64_t data_size) { + OnHeterRpcDone* closure = new OnHeterRpcDone([](void* done) { + auto* closure = reinterpret_cast(done); + int ret = 0; + closure->set_promise_value(ret); + if (closure->cntl.Failed()) { + LOG(ERROR) << "Recv meets brpc error, err msg is %s" + << closure->cntl.ErrorText(); + } + }); + closure->cntl.set_timeout_ms(FLAGS_pserver_timeout_ms); + + distributed::MultiVarMsg request; + std::string message_name = "query and recv"; + request.set_message_name(message_name); + request.set_group_id(group_id); + + for (auto& recv_var_name : var_names) { + request.add_recv_var_names(recv_var_name); + } + auto promise = std::make_shared>(); + closure->add_promise(promise); + std::future fut = promise->get_future(); + if (recv_switch_channels_.empty()) { + LOG(ERROR) << "peer_switch_channels_ is null, get xpu_channels_[1]"; + if (xpu_channels_.size() < 2) { + LOG(ERROR) << "xpu_channels_ is null"; + } + recv_switch_channels_.push_back(xpu_channels_[1]); + } + brpc::Channel* channel = recv_switch_channels_[0].get(); + ::paddle::distributed::PsService_Stub stub(channel); + stub.RecvFromSwitch(&closure->cntl, &request, &closure->response, closure); + fut.wait(); + VLOG(4) << "RecvFromSwitch done"; + // save in worker + auto& res_io_buffer = closure->cntl.response_attachment(); + butil::IOBufBytesIterator io_buffer_itr(res_io_buffer); + io_buffer_itr.copy_and_forward(reinterpret_cast(data_ptr), + data_size * sizeof(float)); + VLOG(4) << "Recv done"; + return 0; +} +} // namespace distributed } // end namespace paddle diff --git a/paddle/fluid/distributed/ps/service/heter_client.h b/paddle/fluid/distributed/ps/service/heter_client.h old mode 100644 new mode 100755 index 4f27ef75ea954..006f87ddf5b06 --- a/paddle/fluid/distributed/ps/service/heter_client.h +++ b/paddle/fluid/distributed/ps/service/heter_client.h @@ -32,13 +32,14 @@ limitations under the License. 
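The two raw-buffer overloads above move a flat float payload through the switch node: Send appends data_size * sizeof(float) bytes to the request attachment under a group id, and Recv copies the same amount back out of the response attachment. Below is a minimal, hypothetical round-trip sketch; the container element types (std::string for names, int64_t for lengths) are assumptions here because the paste dropped the template arguments, and `client` stands in for an already constructed HeterClient:

// Sketch only: client construction, channel setup and error handling omitted.
// #include "paddle/fluid/distributed/ps/service/heter_client.h"
#include <string>
#include <vector>

void RoundTripSketch(paddle::distributed::HeterClient& client) {
  std::vector<float> payload(16, 1.0f);        // data pushed to the switch
  std::vector<std::string> names = {"w0"};     // hypothetical variable name
  std::vector<int64_t> lens = {16};            // per-variable length, in floats
  const int group_id = 0;
  client.Send(group_id, names, lens, payload.data(),
              static_cast<int64_t>(payload.size()));  // size counted in floats
  std::vector<float> out(payload.size(), 0.0f);
  client.Recv(group_id, names, out.data(),
              static_cast<int64_t>(out.size()));      // size counted in floats
}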
*/ #include "paddle/fluid/framework/tensor.h" #include "paddle/fluid/framework/variable_helper.h" #include "paddle/fluid/platform/macros.h" // for DISABLE_COPY_AND_ASSIGN +#include "paddle/fluid/string/split.h" namespace paddle { namespace framework { class Scope; } // namespace framework } // namespace paddle - +DECLARE_int32(pserver_timeout_ms); namespace paddle { namespace distributed { @@ -51,24 +52,72 @@ class OnHeterRpcDone : public google::protobuf::Closure { public: explicit OnHeterRpcDone(HeterRpcCallbackFunc func) : handler_(func) {} virtual ~OnHeterRpcDone() {} - void Run() { - std::unique_ptr self_guard(this); - handler_(this); + void Run() { handler_(this); } + + void add_promise(std::shared_ptr>& promise) { // NOLINT + _promises.push_back(promise); } + void set_promise_value(int value) { + for (auto& promise : _promises) { + promise->set_value(value); + } + } + int CheckResponse() { return 0; } + std::vector>> _promises; HeterRpcCallbackFunc handler_; + + MultiVariableMessage request; MultiVariableMessage response; + + PsResponseMessage ps_response; + brpc::Controller cntl; + // PsRequestMessage *request(size_t i) { return &_requests[i]; } + // PsResponseMessage *response(size_t i) { return &_responses[i]; } + // std::vector _requests; + // std::vector _responses; + // std::vector> _cntls; }; class HeterClient { public: virtual ~HeterClient() {} - HeterClient() { - running_ = true; - main_thread_.reset( - new std::thread(std::bind(&HeterClient::MainThread, this))); + void InitClientChannels(bool need_encrypt, + const std::vector& node_list, + int32_t peer_role) { + brpc::ChannelOptions options; + options.protocol = "baidu_std"; + options.connection_type = "single"; + options.timeout_ms = FLAGS_pserver_timeout_ms; + std::vector>* client_channels = nullptr; + if (peer_role == PEER_ROLE_IS_SWITCH) { + options.ssl_options.enable = need_encrypt; + client_channels = &peer_switch_channels_; + } else if (peer_role == PEER_ROLE_IS_WORKER) { + client_channels = &peer_worker_channels_; + } else { + LOG(ERROR) << "init switch client failed, peer_role not valid"; + } + (*client_channels).resize(node_list.size()); + for (size_t i = 0; i < node_list.size(); ++i) { + (*client_channels)[i].reset(new brpc::Channel()); + if ((*client_channels)[i]->Init(node_list[i].c_str(), "", &options) != + 0) { + VLOG(0) << "client channel init failed! try again"; + auto ip_port = paddle::string::Split(node_list[i], ':'); + std::string ip = ip_port[0]; + int port = std::stoi(ip_port[1]); + std::string int_ip_port = GetIntTypeEndpoint(ip, port); + if ((*client_channels)[i]->Init(int_ip_port.c_str(), "", &options) != + 0) { + LOG(ERROR) << "client channel init failed! 
peer ip_port = " + << int_ip_port; + } + } + } + VLOG(4) << "InitClientChannels success"; } void CreateClient2XpuConnection(); @@ -80,14 +129,28 @@ class HeterClient { const std::vector& recv_var_name, const std::string& mode = "forward"); + int Send(int group_id, const std::vector& var_names, + const std::vector& vars_len, void* data_ptr, int64_t data_size); + + int Send(const platform::DeviceContext& ctx, const framework::Scope& scope, + const std::string& message_name, + const std::vector& send_var_names); + + int Recv(int group_id, const std::vector& var_names, + void* data_ptr, int64_t data_size); + + int Recv(const platform::DeviceContext& ctx, + framework::Scope& recv_scope, // NOLINT + const std::string& message_name, + const std::vector& recv_var_names); + // HeterClient singleton static std::shared_ptr GetInstance( const std::vector& endpoint, const std::vector& previous_endpoint, const int& trainer_id) { if (NULL == s_instance_) { - is_initialized_ = true; - s_instance_.reset(new paddle::distributed::HeterClient()); + s_instance_.reset(new HeterClient()); s_instance_->SetXpuList(endpoint); s_instance_->SetPreviousXpuList(previous_endpoint); s_instance_->SetTrainerID(trainer_id); @@ -96,13 +159,29 @@ class HeterClient { return s_instance_; } - void Stop(); + // switch client singleton + static HeterClient& GetSwitchInstance( + const std::vector& peer_endpoints, int32_t peer_role) { + static HeterClient switch_s_instance_; + if (peer_endpoints.empty()) { + VLOG(4) << "init switch client failed, null peer_endpoints"; + } + VLOG(4) << "peer role is: " << peer_role + << ", addr is: " << peer_endpoints[0]; + switch_s_instance_.SetPeerSwitchList(peer_endpoints); + switch_s_instance_.InitClientChannels(false, peer_endpoints, peer_role); + return switch_s_instance_; + } - void FinalizeWorker(); + void SetPeerSwitchList(const std::vector& peer_endpoints) { + peer_switch_list_ = peer_endpoints; + } - void MainThread(); + void SetPeerWorkerList(const std::vector& worker_endpoints) { + peer_worker_list_ = worker_endpoints; + } - void RpcProfilerControl(); + void Stop(); std::future SendCmd(uint32_t table_id, int cmd_id, const std::vector& params); @@ -124,20 +203,32 @@ class HeterClient { void SetTrainerID(const int& trainer_id) { trainer_id_ = trainer_id; } + public: + std::vector send_switch_list_; + std::vector recv_switch_list_; + + std::vector peer_switch_list_; + std::vector peer_worker_list_; + std::vector> send_switch_channels_; + std::vector> recv_switch_channels_; + + std::vector> peer_switch_channels_; + std::vector> peer_worker_channels_; + private: + HeterClient() {} + HeterClient& operator=(const HeterClient&); + HeterClient(const HeterClient&); + static std::shared_ptr s_instance_; - static bool is_initialized_; - std::unique_ptr main_thread_{nullptr}; std::vector> xpu_channels_; std::vector> previous_xpu_channels_; - DISABLE_COPY_AND_ASSIGN(HeterClient); + // DISABLE_COPY_AND_ASSIGN(HeterClient); std::vector xpu_list_; std::vector previous_xpu_list_; - bool running_ = false; int trainer_id_; - bool do_server_profiler_ = false; }; } // end namespace distributed diff --git a/paddle/fluid/distributed/ps/service/heter_server.cc b/paddle/fluid/distributed/ps/service/heter_server.cc index 01afed3f12375..e21bf093f1915 100644 --- a/paddle/fluid/distributed/ps/service/heter_server.cc +++ b/paddle/fluid/distributed/ps/service/heter_server.cc @@ -13,21 +13,28 @@ // limitations under the License. 
#include "paddle/fluid/distributed/ps/service/heter_server.h" + #include "paddle/fluid/string/split.h" namespace paddle { namespace distributed { +// DEFINE_string(cert_path, "./cert.pem", "cert.pem path"); +// DEFINE_string(key_path, "./key.pem", "key.pem path"); -std::shared_ptr HeterServer::s_instance_ = NULL; +std::shared_ptr HeterServer::s_instance_ = nullptr; void HeterServer::RegisterServiceHandler(std::string message_name, HeterServiceHandler func) { service_.RegisterServiceHandler(message_name, func); } -void HeterServer::StartHeterService() { +void HeterServer::StartHeterService(bool neeed_encrypt) { server_.AddService(&service_, brpc::SERVER_DOESNT_OWN_SERVICE); brpc::ServerOptions options; + if (neeed_encrypt) { + options.ssl_options.default_cert.certificate = "/cert.pem"; + options.ssl_options.default_cert.private_key = "/key.pem"; + } if (server_.Start(endpoint_.c_str(), &options) != 0) { VLOG(0) << "HeterServer start fail. Try again."; auto ip_port = paddle::string::Split(endpoint_, ':'); @@ -47,16 +54,50 @@ void HeterServer::StartHeterService() { ready_ = 1; } condition_ready_.notify_all(); + VLOG(4) << "stopped: " << stoped_ << ", ready_: " << ready_; std::unique_lock running_lock(mutex_); cv_.wait(running_lock, [&] { - VLOG(1) << "Heter Server is Stop? " << stoped_; + VLOG(4) << "Heter Server is Stop? " << stoped_; return stoped_; }); + VLOG(4) << "start service done"; } -void HeterServer::SetEndPoint(const std::string& endpoint) { - endpoint_ = endpoint; - service_.SetEndpoint(endpoint); +void HeterServer::StartHeterInterService(bool neeed_encrypt) { + server_inter_.AddService(&service_, brpc::SERVER_DOESNT_OWN_SERVICE); + brpc::ServerOptions options; + if (neeed_encrypt) { + options.ssl_options.default_cert.certificate = "/cert.pem"; + options.ssl_options.default_cert.private_key = "/key.pem"; + } + if (server_inter_.Start(endpoint_inter_.c_str(), &options) != 0) { + VLOG(4) << "switch inter server start fail. Try again."; + auto ip_port = paddle::string::Split(endpoint_inter_, ':'); + std::string ip = ip_port[0]; + int port = std::stoi(ip_port[1]); + std::string int_ip_port = GetIntTypeEndpoint(ip, port); + if (server_inter_.Start(endpoint_inter_.c_str(), &options) != 0) { + LOG(ERROR) << "switch inter server start failed, ip_port= " + << int_ip_port; + } + } else { + VLOG(4) << "switch inter server server start success! listen on " + << endpoint_inter_; + } + + { + std::lock_guard lock(this->mutex_ready_); + stoped_ = false; + ready_ = 1; + } + condition_ready_.notify_all(); + VLOG(4) << "stopped: " << stoped_ << ", ready_: " << ready_; + std::unique_lock running_lock(mutex_); + cv_.wait(running_lock, [&] { + VLOG(4) << "Heter Server is Stop? 
" << stoped_; + return stoped_; + }); + VLOG(4) << "start service done"; } void HeterServer::SetFanin(const int& fan_in) { service_.SetFanin(fan_in); } @@ -64,35 +105,180 @@ void HeterServer::SetFanin(const int& fan_in) { service_.SetFanin(fan_in); } void HeterServer::WaitServerReady() { std::unique_lock lock(this->mutex_ready_); condition_ready_.wait(lock, [=] { return this->ready_ == 1; }); + while (!this->ready_) { + sleep(1); + } } -int32_t HeterService::stop_profiler(const PsRequestMessage& request, - PsResponseMessage& response, - brpc::Controller* cntl) { - platform::DisableProfiler( - platform::EventSortingKey::kDefault, - string::Sprintf("heter_worker_%s_profile", endpoint_)); +int SendAndRecvVariableHandler::SaveInSwitchWithShard( + const MultiVarMsg* request, PsResponseMessage* response, + brpc::Controller* cntl) { + VLOG(4) << "entering SaveInSwitchWithShard"; + int32_t group_id = request->group_id(); + auto& local_shard = _local_shards[group_id]; + auto& request_io_buffer = cntl->request_attachment(); + butil::IOBufBytesIterator io_buffer_itr(request_io_buffer); + for (int idx = 0; idx < request->send_var_names_size(); idx++) { + const auto& var_name = request->send_var_names(idx); + const auto& var_len = request->vars_len(idx); + auto itr = local_shard.find(var_name); + if (itr != local_shard.end()) { + LOG(INFO) << "var: " << var_name << "has not been consumed!" + << "check again"; + WaitForVarsConsumed(group_id, var_name); + } + auto& value = local_shard[var_name]; + value.resize(var_len); + io_buffer_itr.copy_and_forward(reinterpret_cast(value.data()), + var_len * sizeof(float)); + VLOG(4) << "saved data in shards: "; + for (uint32_t i = 0; i < local_shard[var_name].size(); i++) { + VLOG(4) << *(local_shard[var_name].data() + i); + } + } + VLOG(4) << "SaveInSwitchWithShard success"; return 0; } -int32_t HeterService::start_profiler(const PsRequestMessage& request, - PsResponseMessage& response, - brpc::Controller* cntl) { - platform::EnableProfiler(platform::ProfilerState::kAll); +int SendAndRecvVariableHandler::QueryInSwitchWithShard( + const MultiVarMsg* request, MultiVarMsg* response, brpc::Controller* cntl) { + VLOG(4) << "entering QueryInSwitchWithShard"; + int32_t group_id = request->group_id(); + VLOG(4) << "group id: " << group_id; + auto& local_shard = _local_shards[group_id]; + auto& response_io_buffer = cntl->response_attachment(); + auto req_var_nums = request->recv_var_names_size(); + std::vector req_var_names(req_var_nums); + for (int var_idx = 0; var_idx < req_var_nums; ++var_idx) { + req_var_names[var_idx] = request->recv_var_names(var_idx); + } + auto msg_name = request->message_name(); + response->set_message_name(msg_name); + + for (auto& req_var_name : req_var_names) { + VLOG(4) << "req var name: " << req_var_name; + response->add_send_var_names(req_var_name); + auto itr = local_shard.find(req_var_name); + if (itr == local_shard.end()) { + LOG(INFO) << "var: " << req_var_name << " not found in shards"; + WaitForVarsProduced(group_id, req_var_name); + } + LOG(INFO) << "var: " << req_var_name << " found in shards"; + itr = local_shard.find(req_var_name); + auto& value = itr.value(); + response_io_buffer.append(value.data(), value.size() * sizeof(float)); + value.resize(0); // 标记位 + } + VLOG(4) << "heter server QueryInSwitchWithShard done"; return 0; } -int32_t HeterService::stop_heter_worker(const PsRequestMessage& request, - PsResponseMessage& response, - brpc::Controller* cntl) { - auto client_id = request.client_id(); - 
stop_cpu_worker_set_.insert(client_id); - if (stop_cpu_worker_set_.size() == fan_in_) { - is_exit_ = true; - VLOG(3) << "Stop heter Service done."; +int SendAndRecvVariableHandler::SaveInSwitchWithScope( + const MultiVarMsg* request, PsResponseMessage* response, + brpc::Controller* cntl) { + VLOG(4) << "entering SaveInSwitchWithScope"; + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + platform::CPUPlace cpu_place; + auto& cpu_dev_ctx = *pool.Get(cpu_place); + auto message_name = request->message_name(); + VLOG(4) << "message_name in heter server: " << message_name; + std::unique_lock lk(scope_mutex_); + auto local_scope = local_scope_ptr.get(); + if (!local_scope) { + LOG(ERROR) << "local_scope_ptr is null in SaveInSwitchWithScope"; + } + for (int idx = 0; idx < request->send_var_names_size(); idx++) { + const auto& msg = request->var_messages(idx); + std::string var_name = msg.varname(); + auto* var_exist_ptr = local_scope->FindVar(var_name); + if (!var_exist_ptr) { + VLOG(4) << "not find var: " << var_name << " in local_scope"; + } + vars_table[var_name] += 1; + VLOG(4) << "saved var_name: " << var_name + << ", cnt = " << vars_table[var_name]; + } + auto& request_io_buffer = cntl->request_attachment(); + distributed::DeserializeFromMultiVarMsgAndIOBuf(*request, &request_io_buffer, + cpu_dev_ctx, local_scope); + lk.unlock(); + while (true) { + int ret = 0; + for (int idx = 0; idx < request->send_var_names_size(); idx++) { + ret |= vars_table[request->var_messages(idx).varname()]; + } + if (!ret) { + VLOG(4) << "all saved vars consumed"; + break; + } + VLOG(4) << "waiting consume result......"; + sleep(1); } + VLOG(4) << "SaveInSwitchWithScope success"; return 0; } +int SendAndRecvVariableHandler::QueryInSwitchWithScope( + const MultiVarMsg* request, MultiVarMsg* response, brpc::Controller* cntl) { + VLOG(4) << "entering QueryInSwitchWithScope"; + auto local_scope = local_scope_ptr.get(); + if (!local_scope) { + LOG(INFO) << "local_scope is null"; + } + platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance(); + platform::CPUPlace cpu_place; + auto& cpu_dev_ctx = *pool.Get(cpu_place); + + // get req message_name & req_var_names + auto msg_name = request->message_name(); + auto req_var_nums = request->recv_var_names_size(); + std::vector req_var_names(req_var_nums); + for (int var_idx = 0; var_idx < req_var_nums; ++var_idx) { + req_var_names[var_idx] = request->recv_var_names(var_idx); + } + auto& response_io_buffer = cntl->response_attachment(); + + // 1. fill message_name(string) + response->set_message_name(msg_name); + + // 2. fill var_names(string) + for (auto& req_var_name : req_var_names) { + response->add_send_var_names(req_var_name); + } + + // 3. 
fill var_messages(VarMessage) + for (auto& req_var_name : req_var_names) { + LOG(INFO) << "query var_name: " << req_var_name; + auto* send_var_msg = response->add_var_messages(); + send_var_msg->set_varname(req_var_name); + + framework::Variable* var_ptr; + while (true) { + var_ptr = local_scope->FindVar(req_var_name); + if (!var_ptr) { + LOG(INFO) << "local_scope not find var: " << req_var_name; + } else { + break; + } + sleep(1); + } + butil::IOBuf temp_iobuf; + if (var_ptr->IsType()) { + SerializeLodTensor(var_ptr, cpu_dev_ctx, send_var_msg, &temp_iobuf); + } else if (var_ptr->IsType()) { + SerializeSelectedRows(var_ptr, cpu_dev_ctx, send_var_msg, &temp_iobuf); + } + response_io_buffer.append(temp_iobuf); + } + for (auto& req_var_name : req_var_names) { + std::unique_lock lk(scope_mutex_); + vars_table[req_var_name] -= 1; + VLOG(4) << "remained var: " << req_var_name + << ", cnt = " << vars_table[req_var_name]; + lk.unlock(); + } + VLOG(4) << "heter server QueryInSwitchWithScope done"; + return 0; +} } // end namespace distributed -} // end namespace paddle +} // namespace paddle diff --git a/paddle/fluid/distributed/ps/service/heter_server.h b/paddle/fluid/distributed/ps/service/heter_server.h index a14fb5f6cc04a..624e76112c7b0 100644 --- a/paddle/fluid/distributed/ps/service/heter_server.h +++ b/paddle/fluid/distributed/ps/service/heter_server.h @@ -22,11 +22,14 @@ limitations under the License. */ #include #include #include + #include "brpc/channel.h" #include "brpc/controller.h" #include "brpc/server.h" #include "paddle/fluid/distributed/ps/service/brpc_utils.h" +#include "paddle/fluid/distributed/ps/service/heter_client.h" #include "paddle/fluid/distributed/ps/service/sendrecv.pb.h" +#include "paddle/fluid/distributed/ps/table/depends/feature_value.h" #include "paddle/fluid/framework/blocking_queue.h" #include "paddle/fluid/framework/executor.h" #include "paddle/fluid/framework/program_desc.h" @@ -51,108 +54,37 @@ class Scope; } // namespace paddle DECLARE_double(eager_delete_tensor_gb); +DECLARE_int32(pserver_timeout_ms); +DECLARE_int32(heter_world_size); namespace paddle { namespace distributed { -using MultiVarMsg = ::paddle::distributed::MultiVariableMessage; -using VarMsg = ::paddle::distributed::VariableMessage; - -class HeterService; +using MultiVarMsg = MultiVariableMessage; +using VarMsg = VariableMessage; -typedef int32_t (HeterService::*serviceHandlerFunc)( +using serviceHandler = std::function; +using HeterServiceHandler = + std::function; -typedef std::function HeterRpcCallbackFunc; -typedef std::function - HeterServiceHandler; +using HeterRpcCallbackFunc = std::function; -class HeterService : public ::paddle::distributed::PsService { +class ServiceHandlerBase { public: - HeterService() { - _service_handler_map[PS_STOP_SERVER] = &HeterService::stop_heter_worker; - _service_handler_map[PS_START_PROFILER] = &HeterService::start_profiler; - _service_handler_map[PS_STOP_PROFILER] = &HeterService::stop_profiler; - } + ServiceHandlerBase() : dev_ctx_(nullptr), scope_(nullptr) {} - virtual ~HeterService() {} + virtual ~ServiceHandlerBase() {} - virtual void service(::google::protobuf::RpcController* controller, - const PsRequestMessage* request, - PsResponseMessage* response, - ::google::protobuf::Closure* done) { - brpc::ClosureGuard done_guard(done); - std::string log_label("ReceiveCmd-"); - - response->set_err_code(0); - response->set_err_msg(""); - brpc::Controller* cntl = static_cast(controller); - auto itr = _service_handler_map.find(request->cmd_id()); - if (itr == 
_service_handler_map.end()) { - std::string err_msg( - "undefined cmd_id, should match PsCmdID in ps.proto, cmd_id:"); - err_msg.append(std::to_string(request->cmd_id())); - return; - } - serviceHandlerFunc handler_func = itr->second; - int service_ret = (this->*handler_func)(*request, *response, cntl); - if (service_ret != 0) { - response->set_err_code(service_ret); - response->set_err_msg("server internal error"); - } - } - - void SendAndRecvVariable(::google::protobuf::RpcController* controller, - const MultiVarMsg* request, MultiVarMsg* response, - ::google::protobuf::Closure* done) { - brpc::ClosureGuard done_guard(done); - std::string message_name = request->message_name(); - auto itr = handler_map_.find(message_name); - brpc::Controller* cntl = static_cast(controller); - PADDLE_ENFORCE_NE( - itr, handler_map_.end(), - platform::errors::InvalidArgument( - "HeterService::SendAndRecvVariable Get illegal message_name: %s " - "which is not in HeterService::handler_map_", - message_name)); - itr->second(request, response, cntl); - } - - void RegisterServiceHandler(std::string message_name, - HeterServiceHandler func) { - handler_map_[message_name] = func; - } - - int32_t ForceExit() { - VLOG(3) << "heter service force exit"; - is_exit_ = true; - return 0; - } - - void SetEndpoint(const std::string& end_point) { endpoint_ = end_point; } - void SetFanin(const int& fan_in) { fan_in_ = fan_in; } - bool IsExit() { return is_exit_; } - - private: - int32_t stop_profiler(const PsRequestMessage& request, - PsResponseMessage& response, // NOLINT - brpc::Controller* cntl); - - int32_t start_profiler(const PsRequestMessage& request, - PsResponseMessage& response, // NOLINT - brpc::Controller* cntl); + void SetScope(const framework::Scope* scope) { scope_ = scope; } + void SetDevCtx(const platform::DeviceContext* dev_ctx) { dev_ctx_ = dev_ctx; } - int32_t stop_heter_worker(const PsRequestMessage& request, - PsResponseMessage& response, // NOLINT - brpc::Controller* cntl); + virtual int Handle(const MultiVarMsg* request, MultiVarMsg* response, + brpc::Controller* cntl) = 0; - private: - std::string endpoint_; - std::unordered_map handler_map_; - std::unordered_map _service_handler_map; - std::unordered_set stop_cpu_worker_set_; - int fan_in_; - bool is_exit_ = false; + protected: + const platform::DeviceContext* dev_ctx_; + const framework::Scope* scope_; }; using SharedMiniScope = @@ -163,31 +95,15 @@ using SharedTaskQueue = std::shared_ptr< std::unordered_map>>>>; -class HeterRequestHandler { - public: - HeterRequestHandler() : dev_ctx_(nullptr), scope_(nullptr) {} - - virtual ~HeterRequestHandler() {} - - void SetScope(const framework::Scope* scope) { scope_ = scope; } - void SetDevCtx(const platform::DeviceContext* dev_ctx) { dev_ctx_ = dev_ctx; } - - virtual int Handle(const MultiVarMsg* request, MultiVarMsg* response, - brpc::Controller* cntl) = 0; - - protected: - const platform::DeviceContext* dev_ctx_; - const framework::Scope* scope_; -}; - -class RequestSendAndRecvHandler final : public HeterRequestHandler { +class SendAndRecvVariableHandler final : public ServiceHandlerBase { public: - RequestSendAndRecvHandler() { + SendAndRecvVariableHandler() { this->num_microbatch_ = 0; this->num_minibatch_ = 0; + _local_shards.reset(new shard_type[FLAGS_heter_world_size]); } - virtual ~RequestSendAndRecvHandler() {} + virtual ~SendAndRecvVariableHandler() {} void SetMiniScopes(SharedMiniScope mini_scopes) { mini_scopes_ = mini_scopes; @@ -209,11 +125,47 @@ class RequestSendAndRecvHandler final : 
public HeterRequestHandler { return (*task_queue_).size(); } + int SaveInSwitchWithScope(const MultiVarMsg* request, + PsResponseMessage* response, + brpc::Controller* cntl); + + void WaitForVarsConsumed(int32_t group_id, const std::string& var_name) { + auto& local_shard = _local_shards[group_id]; + while (local_shard.find(var_name) != local_shard.end()) { + if (local_shard[var_name].size() == 0) { + break; + } + VLOG(4) << "waiting consume result......"; + sleep(1); + } + return; + } + + void WaitForVarsProduced(int32_t group_id, const std::string& var_name) { + auto& local_shard = _local_shards[group_id]; + while (local_shard.find(var_name) == local_shard.end()) { + VLOG(4) << "waiting produce result......"; + sleep(1); + } + return; + } + + int SaveInSwitchWithShard(const MultiVarMsg* request, + PsResponseMessage* response, + brpc::Controller* cntl); + + int QueryInSwitchWithShard(const MultiVarMsg* request, MultiVarMsg* response, + brpc::Controller* cntl); + + int QueryInSwitchWithScope(const MultiVarMsg* request, MultiVarMsg* response, + brpc::Controller* cntl); + void SetTaskQueue(SharedTaskQueue task_queue) { task_queue_ = task_queue; } int Handle(const MultiVarMsg* request, MultiVarMsg* response, brpc::Controller* cntl) override { - platform::RecordEvent record_event("RequestSendAndRecvHandler->Handle", + LOG(INFO) << "entered Handle"; + platform::RecordEvent record_event("SendAndRecvVariableHandler->Handle", platform::TracerEventType::Communication, 1); FLAGS_eager_delete_tensor_gb = -1; @@ -241,7 +193,6 @@ class RequestSendAndRecvHandler final : public HeterRequestHandler { auto* tensor = var->GetMutable(); auto data = reinterpret_cast(tensor->data()); auto micro_id = static_cast(data[0]); - int minibatch_index = micro_id / 10; int microbatch_index = micro_id % 10; @@ -249,10 +200,7 @@ class RequestSendAndRecvHandler final : public HeterRequestHandler { std::unique_lock lk(scope_mutex_); if ((*mini_scopes_).find(minibatch_index) != (*mini_scopes_).end()) { lk.unlock(); - // PADDLE_ENFORCE_EQ( - // (*mini_scopes_).find(minibatch_index) != (*mini_scopes_).end(), 1, - // platform::errors::InvalidArgument( - // "minibatch index should in current trainer")); + PADDLE_ENFORCE_EQ( (*micro_scopes_).find(minibatch_index) != (*micro_scopes_).end(), 1, platform::errors::InvalidArgument( @@ -282,6 +230,7 @@ class RequestSendAndRecvHandler final : public HeterRequestHandler { // blocking queue handles multi thread (*task_queue_)[minibatch_index]->Push( std::make_pair(message_name, microbatch_index)); + auto response_var_nums = request->recv_var_names_size(); std::vector response_var_names(response_var_nums), empty_var_names{}; @@ -295,6 +244,12 @@ class RequestSendAndRecvHandler final : public HeterRequestHandler { return 0; } + public: + using shard_type = SparseTableShard; + std::shared_ptr local_scope_ptr; // for switch + std::unordered_map vars_table; + std::unique_ptr _local_shards; + private: // share with HeterPipelineTrainer SharedMiniScope mini_scopes_{nullptr}; @@ -310,15 +265,254 @@ class RequestSendAndRecvHandler final : public HeterRequestHandler { SharedTaskQueue task_queue_; }; +class HeterService : public PsService { + public: + HeterService() { + _service_handler_map[PS_STOP_SERVER] = + std::bind(&HeterService::stop_heter_worker, this, std::placeholders::_1, + std::placeholders::_2, std::placeholders::_3); + _service_handler_map[PS_START_PROFILER] = + std::bind(&HeterService::start_profiler, this, std::placeholders::_1, + std::placeholders::_2, std::placeholders::_3); + 
_service_handler_map[PS_STOP_PROFILER] = + std::bind(&HeterService::stop_profiler, this, std::placeholders::_1, + std::placeholders::_2, std::placeholders::_3); + + service_handler_.local_scope_ptr = + std::make_shared(); + } + + virtual ~HeterService() {} + + virtual void service(::google::protobuf::RpcController* controller, + const PsRequestMessage* request, + PsResponseMessage* response, + ::google::protobuf::Closure* done) { + brpc::ClosureGuard done_guard(done); + + response->set_err_code(0); + response->set_err_msg(""); + brpc::Controller* cntl = static_cast(controller); + auto itr = _service_handler_map.find(request->cmd_id()); + if (itr == _service_handler_map.end()) { + std::string err_msg( + "undefined cmd_id, should match PsCmdID in ps.proto, cmd_id:"); + err_msg.append(std::to_string(request->cmd_id())); + return; + } + serviceHandler handler = itr->second; + int service_ret = handler(*request, *response, cntl); + VLOG(4) << "handler in service ret: " << service_ret; + if (service_ret != 0) { + response->set_err_code(service_ret); + response->set_err_msg("server internal error"); + } + } + + virtual void SendAndRecvVariable( + ::google::protobuf::RpcController* controller, const MultiVarMsg* request, + MultiVarMsg* response, ::google::protobuf::Closure* done) { + // This object helps you to call done->Run() in RAII style. If you need + // to process the request asynchronously, pass done_guard.release(). + brpc::ClosureGuard done_guard(done); + std::string message_name = request->message_name(); + VLOG(0) << "SendAndRecvVariable message_name: " << message_name; + auto itr = handler_map_.find(message_name); + brpc::Controller* cntl = static_cast(controller); + LOG(INFO) << "SendAndRecvVariable(client addr) =" << cntl->remote_side(); + PADDLE_ENFORCE_NE( + itr, handler_map_.end(), + platform::errors::InvalidArgument( + "HeterService::SendAndRecvVariable Get illegal message_name: %s " + "which is not in HeterService::handler_map_", + message_name)); + itr->second(request, response, cntl); + // We don't want to call done->Run() here, release the guard. 
+ // done_guard.release(); + } + + virtual void RecvFromSwitch(::google::protobuf::RpcController* controller, + const MultiVarMsg* request, MultiVarMsg* response, + ::google::protobuf::Closure* done) { + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(controller); + // int ret = service_handler_.QueryInSwitchWithScope(request, response, + // cntl); + int ret = service_handler_.QueryInSwitchWithShard(request, response, cntl); + // std::string message_name = request->message_name(); + // auto itr = handler_map_.find(message_name); + // int ret = itr->second(request, response, cntl); + if (ret != 0) { + LOG(ERROR) << "QueryInSwitchWithScope failed!"; + } + // response->set_message_name(message_name); + } + + virtual void SendToSwitch(::google::protobuf::RpcController* controller, + const MultiVarMsg* request, + PsResponseMessage* response, + ::google::protobuf::Closure* done) { + VLOG(4) << "entering SendToSwitch"; + brpc::ClosureGuard done_guard(done); + auto& switch_client_ptr_ = + HeterClient::GetSwitchInstance(peer_endpoints_, PEER_ROLE_IS_SWITCH); + if (switch_client_ptr_.peer_switch_channels_.empty()) { + LOG(ERROR) << "switch_client_ptr_.peer_switch_channels_ null"; + } + brpc::Channel* channel = switch_client_ptr_.peer_switch_channels_[0].get(); + brpc::Controller* cntl = static_cast(controller); + // proxy: 定义新的 OnHeterRpcDone 对象(或者在类 OnHeterRpcDone 中 reset) + OnHeterRpcDone* closure2 = new OnHeterRpcDone([](void* done) { + auto* closure = reinterpret_cast(done); + int ret = closure->CheckResponse(); + closure->set_promise_value(ret); + if (closure->cntl.Failed()) { + PADDLE_ENFORCE_NE( + closure->cntl.Failed(), true, + platform::errors::Unimplemented( + "HeterClient::SendS2S meets brpc error, error message is %s", + closure->cntl.ErrorText())); + } + }); + auto& std_cntl = closure2->cntl; + std_cntl.set_timeout_ms(FLAGS_pserver_timeout_ms); + std_cntl.request_attachment().append(cntl->request_attachment().movable()); + + auto promise = std::make_shared>(); + closure2->add_promise(promise); + std::future fut = promise->get_future(); + // brpc::Controller std_cntl; + // std_cntl.request_attachment().append(cntl->request_attachment().movable()); + PsService_Stub stub(channel); + stub.SendS2S(&std_cntl, request, response, closure2); + cntl->response_attachment().append( + std_cntl.response_attachment().movable()); + fut.wait(); + VLOG(4) << "SendToSwitch done"; + } + + void SendS2S(::google::protobuf::RpcController* controller, + const MultiVarMsg* request, PsResponseMessage* response, + ::google::protobuf::Closure* done) { + VLOG(4) << "entering SendS2S"; + brpc::ClosureGuard done_guard(done); + brpc::Controller* cntl = static_cast(controller); + // int ret = service_handler_.SaveInSwitchWithScope(request, response, + // cntl); + int ret = service_handler_.SaveInSwitchWithShard(request, response, cntl); + // std::string message_name = request->message_name(); + // auto itr = handler_map_.find(message_name); + // if (itr == handler_map_.end()) { + // LOG(ERROR) << "can not find func handler"; + //} + // int ret = itr->second(request, response, cntl); + if (ret != 0) { + LOG(ERROR) << "SaveInSwitchWithScope failed"; + } + std::string err_msg = "ok"; + response->set_err_msg(err_msg.c_str()); + response->set_err_code(ret); + VLOG(4) << "heter server SendS2S done"; + } + + void SendToWorker(::google::protobuf::RpcController* controller, + const MultiVarMsg* request, PsResponseMessage* response, + ::google::protobuf::Closure* done) { + brpc::ClosureGuard 
done_guard(done); + brpc::Controller* cntl = static_cast(controller); + VLOG(4) << "SendToWorker(client addr) =" << cntl->remote_side(); + auto& switch_client_ptr_ = + HeterClient::GetSwitchInstance(peer_endpoints_, PEER_ROLE_IS_WORKER); + VLOG(4) << "in switch client, peer worker 0: " + << switch_client_ptr_.peer_worker_list_[0]; + brpc::Channel* channel = switch_client_ptr_.peer_worker_channels_[0].get(); + + auto* closure = reinterpret_cast(done); + PsService_Stub stub(channel); + stub.SendAndRecvVariable(controller, request, &closure->response, done); + // fill response content + std::string err_msg("pass to worker"); + response->set_err_msg(err_msg.c_str()); + response->set_err_code(0); + } + + void RegisterServiceHandler(std::string message_name, + HeterServiceHandler func) { + handler_map_[message_name] = func; + } + + void SetEndpoint(const std::string& end_point) { endpoint_ = end_point; } + + void SetInterEndpoint(const std::string& end_point) { + endpoint_inter_ = end_point; + } + + void SetPeerEndPoints(const std::vector& peer_endpoints) { + peer_endpoints_ = peer_endpoints; + } + + void SetFanin(const int& fan_in) { fan_in_ = fan_in; } + + void ForceExit() { + VLOG(3) << "heter service force exit"; + is_exit_ = true; + return; + } + + bool IsExit() { return is_exit_; } + + private: + int32_t stop_profiler(const PsRequestMessage& request, + PsResponseMessage& response, // NOLINT + brpc::Controller* cntl) { + platform::DisableProfiler( + platform::EventSortingKey::kDefault, + string::Sprintf("heter_worker_%s_profile", endpoint_)); + return 0; + } + + int32_t start_profiler(const PsRequestMessage& request, + PsResponseMessage& response, // NOLINT + brpc::Controller* cntl) { + platform::EnableProfiler(platform::ProfilerState::kAll); + return 0; + } + + int32_t stop_heter_worker(const PsRequestMessage& request, + PsResponseMessage& response, // NOLINT + brpc::Controller* cntl) { + auto client_id = request.client_id(); + stop_cpu_worker_set_.insert(client_id); + if (stop_cpu_worker_set_.size() == fan_in_) { + is_exit_ = true; + } + return 0; + } + + private: + SendAndRecvVariableHandler service_handler_; + std::string endpoint_; + std::string endpoint_inter_; + // for switch + std::vector peer_endpoints_; + + std::unordered_map _service_handler_map; + std::unordered_map handler_map_; + std::unordered_set stop_cpu_worker_set_; + uint32_t fan_in_; + bool is_exit_ = false; +}; + class HeterServer { public: + HeterServer() : ready_(0) {} virtual ~HeterServer() {} - void Stop() { std::unique_lock lock(mutex_); if (stoped_ == true) return; - if (!IsExit()) service_.ForceExit(); - VLOG(3) << "HeterServer Stop()"; + if (!IsExit()) { + service_.ForceExit(); + } stoped_ = true; cv_.notify_all(); server_.Stop(1000); @@ -327,26 +521,42 @@ class HeterServer { bool IsStop() { std::unique_lock lock(mutex_); - if (stoped_ == true) - return true; - else - return false; + return stoped_; } bool IsExit() { return service_.IsExit(); } - HeterServer() : service_(), ready_(0) {} - void RegisterServiceHandler(std::string message_name, HeterServiceHandler func); - void StartHeterService(); + void StartHeterService(bool need_encrypt = false); + + void StartHeterInterService(bool need_encrypt = false); + + void SetEndPoint(const std::string& endpoint) { + this->endpoint_ = endpoint; + service_.SetEndpoint(endpoint); + } + + void SetLocalScope() { + request_handler_->local_scope_ptr = + std::make_shared(); + } + + void SetInterEndpoint(const std::string& endpoint) { + this->endpoint_inter_ = endpoint; + 
service_.SetInterEndpoint(endpoint); + } + + void SetPeerEndPoints(const std::vector& peer_endpoints) { + this->peer_endpoints_ = peer_endpoints; + service_.SetPeerEndPoints(peer_endpoints); + } - void SetEndPoint(const std::string& endpoint); void SetFanin(const int& fan_in); - void SetRequestHandler( - std::shared_ptr request_handler) { + void SetServiceHandler( + std::shared_ptr request_handler) { request_handler_ = request_handler; } @@ -381,11 +591,15 @@ class HeterServer { std::condition_variable condition_ready_; bool stoped_ = true; std::string endpoint_; + std::string endpoint_inter_; + // for switch + std::vector peer_endpoints_; protected: brpc::Server server_; + brpc::Server server_inter_; HeterService service_; - std::shared_ptr request_handler_; + std::shared_ptr request_handler_; DISABLE_COPY_AND_ASSIGN(HeterServer); std::mutex mutex_ready_; diff --git a/paddle/fluid/distributed/ps/service/ps_client.cc b/paddle/fluid/distributed/ps/service/ps_client.cc index fd956b758de1a..f7df99ec13cdf 100644 --- a/paddle/fluid/distributed/ps/service/ps_client.cc +++ b/paddle/fluid/distributed/ps/service/ps_client.cc @@ -25,7 +25,7 @@ REGISTER_PSCORE_CLASS(PSClient, BrpcPsClient); REGISTER_PSCORE_CLASS(PSClient, PsLocalClient); REGISTER_PSCORE_CLASS(PSClient, GraphBrpcClient); -int32_t PSClient::configure( +int32_t PSClient::Configure( const PSParameter &config, const std::map> ®ions, PSEnvironment &env, size_t client_id) { @@ -46,15 +46,15 @@ int32_t PSClient::configure( auto *accessor = CREATE_PSCORE_CLASS( ValueAccessor, work_param.downpour_table_param(i).accessor().accessor_class()); - accessor->configure(work_param.downpour_table_param(i).accessor()); - accessor->initialize(); + accessor->Configure(work_param.downpour_table_param(i).accessor()); + accessor->Initialize(); _table_accessors[work_param.downpour_table_param(i).table_id()].reset( accessor); } - return initialize(); + return Initialize(); } -PSClient *PSClientFactory::create(const PSParameter &ps_config) { +PSClient *PSClientFactory::Create(const PSParameter &ps_config) { const auto &config = ps_config.server_param(); if (!config.has_downpour_server_param()) { LOG(ERROR) << "miss downpour_server_param in ServerParameter"; @@ -81,7 +81,7 @@ PSClient *PSClientFactory::create(const PSParameter &ps_config) { return NULL; } - TableManager::instance().initialize(); + TableManager::Instance().Initialize(); VLOG(3) << "Create PSClient[" << service_param.client_class() << "] success"; return client; } diff --git a/paddle/fluid/distributed/ps/service/ps_client.h b/paddle/fluid/distributed/ps/service/ps_client.h index 83d2aba1db445..6f27b0eb04624 100644 --- a/paddle/fluid/distributed/ps/service/ps_client.h +++ b/paddle/fluid/distributed/ps/service/ps_client.h @@ -26,7 +26,6 @@ #include "paddle/fluid/distributed/ps/service/sendrecv.pb.h" #include "paddle/fluid/distributed/ps/table/accessor.h" #include "paddle/fluid/distributed/ps/table/graph/graph_node.h" -#include "paddle/fluid/distributed/ps/table/table.h" #include "paddle/fluid/platform/timer.h" namespace paddle { @@ -60,41 +59,6 @@ class PSClientClosure : public google::protobuf::Closure { std::vector>> _promises; }; -struct LoadSaveContext { - int table_id; - std::string epoch; - std::string mode; -}; - -enum TrainingMode { Async = 0, Sync = 1, Geo = 3 }; - -enum TrainingPhase { Init = 0, Train = 1, Save = 2 }; - -// enum ValueType { -// Sparse = 0, -// Dense = 1 -// }; - -struct PushContext { - const uint64_t *keys; - const float **push_values; - const Region *push_dense_values; 
-}; - -struct RequestContext { - int table; - TrainingMode training_mode; // 1 for async, 2 for geo, 3 for sync - TrainingPhase training_phase; // 1 for init, 2 for train - ValueType value_type; // 1 for sparse, 2 for dense - uint64_t *keys; - float **sparse_values; // for sparse values - Region *dense_values; // for dense values - PushContext push_context; - size_t num; - bool is_training; - void *callback; -}; - class PSClient { public: PSClient() {} @@ -102,41 +66,37 @@ class PSClient { PSClient(PSClient &&) = delete; PSClient(const PSClient &) = delete; - virtual int32_t configure( // NOLINT + virtual int32_t Configure( // NOLINT const PSParameter &config, const std::map> ®ions, PSEnvironment &_env, size_t client_id) final; // NOLINT - virtual int32_t create_client2client_connection( - int pserver_timeout_ms, int pserver_connect_timeout_ms, - int max_retry) = 0; + virtual int32_t CreateClient2ClientConnection(int pserver_timeout_ms, + int pserver_connect_timeout_ms, + int max_retry) = 0; // 触发table数据退场 - virtual std::future shrink(uint32_t table_id, + virtual std::future Shrink(uint32_t table_id, const std::string threshold) = 0; // 全量table进行数据load - virtual std::future load(const std::string &epoch, + virtual std::future Load(const std::string &epoch, const std::string &mode) = 0; // 指定table数据load - virtual std::future load(uint32_t table_id, const std::string &epoch, + virtual std::future Load(uint32_t table_id, const std::string &epoch, const std::string &mode) = 0; - // context配置load选项 - virtual std::future Load(const LoadSaveContext &load_context) = 0; // 全量table数据save value_accessor根据mode,可能有不同的save条件 - virtual std::future save(const std::string &epoch, + virtual std::future Save(const std::string &epoch, const std::string &mode) = 0; // 指定table数据save value_accessor根据mode,可能有不同的save条件 - virtual std::future save(uint32_t table_id, const std::string &epoch, + virtual std::future Save(uint32_t table_id, const std::string &epoch, const std::string &mode) = 0; - virtual std::future Save(const LoadSaveContext &save_context) = 0; - // 清空table数据 - virtual std::future clear() = 0; - virtual std::future clear(uint32_t table_id) = 0; + virtual std::future Clear() = 0; + virtual std::future Clear(uint32_t table_id) = 0; // pull dense的参数部分,并分块填充到本地网络参数中 // start和num用于拉取部分参数 @@ -145,23 +105,19 @@ class PSClient { // sender聚集同一区块的请求,累计多个填充buffer // server将参数区块中配置的某一维提取返回 // 返回数据解包后填充到累计的多个buffer中 - virtual std::future pull_dense(Region *regions, size_t region_num, - size_t table_id) = 0; // 保留 - - virtual std::future Push(RequestContext &push_context) = 0; + virtual std::future PullDense(Region *regions, size_t region_num, + size_t table_id) = 0; // 保留 // firstly push dense param for parameter server // this is neccessary because dense weight initialized in trainer on cold // start - virtual std::future push_dense_param(const Region *regions, - size_t region_num, - size_t table_id) = 0; - - virtual std::future push_dense(const Region *regions, - size_t region_num, - size_t table_id) = 0; + virtual std::future PushDenseParam(const Region *regions, + size_t region_num, + size_t table_id) = 0; - virtual std::future Pull(RequestContext &pull_context) = 0; + virtual std::future PushDense(const Region *regions, + size_t region_num, + size_t table_id) = 0; // 使用keys进行pull请求,结果填充values // keys和values的个数均为num个,每个value占用select_size空间 @@ -169,15 +125,14 @@ class PSClient { // 整合多个线程请求的keys,聚集并分散发送到server // 返回结果后,遍历buffer并对values赋值 // is_training 用于区分请求是训练/预测,server端对于特征和准入会有不同的处理. 
- virtual std::future pull_sparse(float **select_values, - size_t table_id, - const uint64_t *keys, size_t num, - bool is_training) = 0; - - virtual std::future pull_sparse_param(float **select_values, - size_t table_id, - const uint64_t *keys, - size_t num, bool is_training) { + virtual std::future PullSparse(float **select_values, + size_t table_id, const uint64_t *keys, + size_t num, bool is_training) = 0; + + virtual std::future PullSparseParam(float **select_values, + size_t table_id, + const uint64_t *keys, size_t num, + bool is_training) { VLOG(0) << "Did not implement"; std::promise promise; std::future fut = promise.get_future(); @@ -185,10 +140,10 @@ class PSClient { return fut; } - virtual ::std::future pull_sparse_ptr(char **select_values, - size_t table_id, - const uint64_t *keys, - size_t num) { + virtual ::std::future PullSparsePtr(char **select_values, + size_t table_id, + const uint64_t *keys, + size_t num) { VLOG(0) << "Did not implement"; std::promise promise; std::future fut = promise.get_future(); @@ -196,38 +151,38 @@ class PSClient { return fut; } - virtual std::future print_table_stat(uint32_t table_id) = 0; + virtual std::future PrintTableStat(uint32_t table_id) = 0; // 确保所有积攒中的请求都发起发送 - virtual std::future flush() = 0; + virtual std::future Flush() = 0; // server优雅退出 - virtual std::future stop_server() = 0; + virtual std::future StopServer() = 0; // server profilera - virtual std::future start_profiler() = 0; - virtual std::future stop_profiler() = 0; + virtual std::future StartProfiler() = 0; + virtual std::future StopProfiler() = 0; - virtual std::future barrier(size_t table_id, + virtual std::future Barrier(size_t table_id, uint32_t barrier_type) = 0; - virtual std::future pull_geo_param(size_t table_id, - std::vector *values, - std::vector *keys, - int pserver_idx) = 0; + virtual std::future PullGeoParam(size_t table_id, + std::vector *values, + std::vector *keys, + int pserver_idx) = 0; - virtual std::future push_global_step(int table_id, - int64_t *total_send_data, - void *done) = 0; + virtual std::future PushGlobalStep(int table_id, + int64_t *total_send_data, + void *done) = 0; // recv table from server and save it in LodTensor - virtual int32_t recv_and_save_table(const uint64_t table_id, - const std::string &path) = 0; + virtual int32_t RecvAndSaveTable(const uint64_t table_id, + const std::string &path) = 0; - virtual void finalize_worker() = 0; + virtual void FinalizeWorker() = 0; // client to client, 消息发送 - virtual std::future send_client2client_msg(int msg_type, - int to_client_id, - const std::string &msg) { + virtual std::future SendClient2ClientMsg(int msg_type, + int to_client_id, + const std::string &msg) { VLOG(0) << "Did not implement"; std::promise promise; std::future fut = promise.get_future(); @@ -238,13 +193,13 @@ class PSClient { // client2client消息处理,std::function ret (msg_type, from_client_id, msg) typedef std::function MsgHandlerFunc; - virtual int registe_client2client_msg_handler(int msg_type, - MsgHandlerFunc handler) { + virtual int RegisteClient2ClientMsgHandler(int msg_type, + MsgHandlerFunc handler) { _msg_handler_map[msg_type] = handler; return 0; } - virtual int handle_client2client_msg(int msg_type, int from_client_id, - const std::string &msg) { + virtual int HandleClient2ClientMsg(int msg_type, int from_client_id, + const std::string &msg) { auto itr = _msg_handler_map.find(msg_type); if (itr == _msg_handler_map.end()) { LOG(WARNING) << "unknown client2client_msg type:" << msg_type; @@ -253,7 +208,7 @@ class PSClient { 
return itr->second(msg_type, from_client_id, msg); } - virtual ValueAccessor *table_accessor(size_t table_id) { + virtual ValueAccessor *GetTableAccessor(size_t table_id) { auto itr = _table_accessors.find(table_id); if (itr == _table_accessors.end()) { return NULL; @@ -261,31 +216,31 @@ class PSClient { return itr->second.get(); } - virtual size_t get_server_nums() = 0; + virtual size_t GetServerNums() = 0; - virtual std::future push_dense_raw_gradient( - int table_id, float *total_send_data, size_t total_send_data_size, - void *done) = 0; + virtual std::future PushDenseRawGradient(int table_id, + float *total_send_data, + size_t total_send_data_size, + void *done) = 0; - virtual std::future push_sparse_raw_gradient( + virtual std::future PushSparseRawGradient( size_t table_id, const uint64_t *keys, const float **update_values, size_t num, void *done) = 0; - virtual std::future push_sparse_raw_gradient_partial( + virtual std::future PushSparseRawGradientPartial( size_t table_id, const uint64_t *keys, const float **update_values, uint32_t num, void *done, int pserver_idx) = 0; - virtual std::future push_sparse_param(size_t table_id, - const uint64_t *keys, - const float **update_values, - size_t num, void *done) = 0; - virtual std::future push_sparse(size_t table_id, - const uint64_t *keys, - const float **update_values, - size_t num) = 0; + virtual std::future PushSparseParam(size_t table_id, + const uint64_t *keys, + const float **update_values, + size_t num, void *done) = 0; + virtual std::future PushSparse(size_t table_id, const uint64_t *keys, + const float **update_values, + size_t num) = 0; protected: - virtual int32_t initialize() = 0; + virtual int32_t Initialize() = 0; size_t _client_id; PSParameter _config; std::map> @@ -333,7 +288,7 @@ REGISTER_PSCORE_REGISTERER(PSClient); class PSClientFactory { public: - static PSClient *create(const PSParameter &config); + static PSClient *Create(const PSParameter &config); }; } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/service/ps_local_client.cc b/paddle/fluid/distributed/ps/service/ps_local_client.cc old mode 100755 new mode 100644 index fe5cbe682ea67..bc024ed3175bc --- a/paddle/fluid/distributed/ps/service/ps_local_client.cc +++ b/paddle/fluid/distributed/ps/service/ps_local_client.cc @@ -19,165 +19,98 @@ namespace paddle { namespace distributed { -int32_t PsLocalClient::initialize() { +int32_t PsLocalClient::Initialize() { const auto& downpour_param = _config.server_param().downpour_server_param(); - TableManager::instance().initialize(); + TableManager::Instance().Initialize(); for (size_t i = 0; i < downpour_param.downpour_table_param_size(); ++i) { auto* table = CREATE_PSCORE_CLASS( Table, downpour_param.downpour_table_param(i).table_class()); - table->set_shard(0, 1); - table->initialize(downpour_param.downpour_table_param(i), + table->SetShard(0, 1); + table->Initialize(downpour_param.downpour_table_param(i), _config.fs_client_param()); _table_map[downpour_param.downpour_table_param(i).table_id()].reset(table); } return 0; } -::std::future PsLocalClient::shrink(uint32_t table_id, +::std::future PsLocalClient::Shrink(uint32_t table_id, const std::string threshold) { // TODO return done(); } -::std::future PsLocalClient::load(const std::string& epoch, +::std::future PsLocalClient::Load(const std::string& epoch, const std::string& mode) { // TODO for (auto& it : _table_map) { - load(it.first, epoch, mode); + Load(it.first, epoch, mode); } return done(); } -::std::future 
PsLocalClient::load(uint32_t table_id, +::std::future PsLocalClient::Load(uint32_t table_id, const std::string& epoch, const std::string& mode) { // TODO - auto* table_ptr = table(table_id); - table_ptr->load(epoch, mode); + auto* table_ptr = GetTable(table_id); + table_ptr->Load(epoch, mode); return done(); } -std::future PsLocalClient::Load(const LoadSaveContext& load_context) { - if (load_context.table_id < 0) { - for (auto& it : _table_map) { - load(it.first, load_context.epoch, load_context.mode); - } - return done(); - } else { - auto* table_ptr = table(load_context.table_id); - table_ptr->load(load_context.epoch, load_context.mode); - return done(); - } -} - -::std::future PsLocalClient::save(const std::string& epoch, +::std::future PsLocalClient::Save(const std::string& epoch, const std::string& mode) { // TODO for (auto& it : _table_map) { - save(it.first, epoch, mode); + Save(it.first, epoch, mode); } return done(); } -::std::future PsLocalClient::save(uint32_t table_id, +::std::future PsLocalClient::Save(uint32_t table_id, const std::string& epoch, const std::string& mode) { // TODO - auto* table_ptr = table(table_id); - table_ptr->flush(); - table_ptr->save(epoch, mode); + auto* table_ptr = GetTable(table_id); + table_ptr->Flush(); + table_ptr->Save(epoch, mode); return done(); } -::std::future PsLocalClient::Save( - const LoadSaveContext& save_context) { - if (save_context.table_id < 0) { - for (auto& it : _table_map) { - save(it.first, save_context.epoch, save_context.mode); - } - return done(); - } else { - auto* table_ptr = table(save_context.table_id); - table_ptr->flush(); - table_ptr->save(save_context.epoch, save_context.mode); - return done(); - } -} - -::std::future PsLocalClient::clear() { +::std::future PsLocalClient::Clear() { // TODO return done(); } -::std::future PsLocalClient::clear(uint32_t table_id) { +::std::future PsLocalClient::Clear(uint32_t table_id) { // TODO return done(); } -::std::future PsLocalClient::flush() { +::std::future PsLocalClient::Flush() { // no need return done(); } -::std::future PsLocalClient::stop_server() { +::std::future PsLocalClient::StopServer() { // no need return done(); } -::std::future PsLocalClient::Pull(RequestContext& pull_context) { - if (pull_context.value_type == Dense) { // pull dense - Region* dense_region = reinterpret_cast(pull_context.dense_values); - pull_dense(dense_region, pull_context.num, pull_context.table); - } else { // pull sparse - // uint64_t* keys = reinterpret_cast(pull_context.keys); - // char** select_values = - // reinterpret_cast(pull_context.sparse_values); - size_t table_id = pull_context.table; - size_t num = pull_context.num; - pull_sparse_ptr(reinterpret_cast(pull_context.sparse_values), - table_id, pull_context.keys, num); - } -} +::std::future PsLocalClient::PullDense(Region* regions, + size_t region_num, + size_t table_id) { + auto* accessor = GetTableAccessor(table_id); + auto* table_ptr = GetTable(table_id); -::std::future PsLocalClient::Push(RequestContext& push_context) { - if (push_context.value_type == Dense) { // push dense - if (push_context.training_phase == Init) { - const Region* regions = push_context.push_context.push_dense_values; - size_t region_num = push_context.num; - push_dense_param(regions, region_num, push_context.table); - } else { - if (push_context.training_mode == Geo) { // geo - float* total_send_data = - reinterpret_cast(push_context.dense_values); - size_t total_send_data_size = push_context.num; - push_dense_raw_gradient(push_context.table, total_send_data, - 
total_send_data_size, push_context.callback); - } else { // async and sync - const Region* regions = push_context.push_context.push_dense_values; - size_t region_num = push_context.num; - push_dense(regions, region_num, push_context.table); - } - } - } else { // push sparse - if (push_context.training_mode == Async) { - const uint64_t* keys = push_context.push_context.keys; - const float** update_values = push_context.push_context.push_values; - size_t table_id = push_context.table; - size_t num = push_context.num; - push_sparse(table_id, keys, update_values, num); - } else { - // TODO - } - } -} - -::std::future PsLocalClient::pull_dense(Region* regions, - size_t region_num, - size_t table_id) { - auto* accessor = table_accessor(table_id); - auto* table_ptr = table(table_id); + uint32_t num_per_shard = + DenseDimPerShard(accessor->GetAccessorInfo().fea_dim, 1); - uint32_t num_per_shard = dense_dim_per_shard(accessor->fea_dim(), 1); std::vector region_buffer; region_buffer.resize(num_per_shard); - table_ptr->pull_dense(region_buffer.data(), region_buffer.size()); + + TableContext table_context; + table_context.value_type = Dense; + table_context.pull_context.values = region_buffer.data(); + table_context.num = region_buffer.size(); + table_ptr->Pull(table_context); + // table_ptr->PullDense(region_buffer.data(), region_buffer.size()); size_t region_idx = 0; size_t region_data_idx = 0; @@ -212,47 +145,62 @@ ::std::future PsLocalClient::pull_dense(Region* regions, return done(); } -::std::future PsLocalClient::push_dense_param(const Region* regions, - size_t region_num, - size_t table_id) { - auto* accessor = table_accessor(table_id); - auto* table_ptr = table(table_id); +::std::future PsLocalClient::PushDenseParam(const Region* regions, + size_t region_num, + size_t table_id) { + auto* accessor = GetTableAccessor(table_id); + auto* table_ptr = GetTable(table_id); std::vector region_buffer; - region_buffer.resize(dense_dim_per_shard(accessor->fea_dim(), 1), 0); + region_buffer.resize(DenseDimPerShard(accessor->GetAccessorInfo().fea_dim, 1), + 0); for (size_t i = 0, offset = 0; i < region_num; ++i) { uint32_t data_num = regions[i].size / sizeof(float); memcpy(region_buffer.data() + offset, regions[i].data, regions[i].size); offset += data_num; } - // table_ptr->push_dense_param(region_buffer.data(), region_buffer.size()); + TableContext table_context; + table_context.value_type = Dense; + table_context.push_context.values = region_buffer.data(); + table_context.push_context.is_param = true; + table_context.num = region_buffer.size(); + + table_ptr->Push(table_context); + // table_ptr->PushDenseParam(region_buffer.data(), region_buffer.size()); return done(); } -::std::future PsLocalClient::push_dense_raw_gradient( +::std::future PsLocalClient::PushDenseRawGradient( int table_id, float* total_send_data, size_t total_send_data_size, void* callback) { VLOG(1) << "wxx push_dense_raw_gradient"; PSClientClosure* closure = reinterpret_cast(callback); - auto* table_ptr = table(table_id); + auto* table_ptr = GetTable(table_id); + + TableContext table_context; + table_context.value_type = Dense; + table_context.push_context.values = total_send_data; + table_context.num = total_send_data_size; + // table_ptr->PushDense(total_send_data, total_send_data_size); + table_ptr->Push(table_context); - table_ptr->push_dense(total_send_data, total_send_data_size); delete closure; return done(); } -::std::future PsLocalClient::push_dense(const Region* regions, - size_t region_num, - size_t table_id) { - auto* 
accessor = table_accessor(table_id); - auto* table_ptr = table(table_id); +::std::future PsLocalClient::PushDense(const Region* regions, + size_t region_num, + size_t table_id) { + auto* accessor = GetTableAccessor(table_id); + auto* table_ptr = GetTable(table_id); std::vector region_buffer; - region_buffer.resize(dense_dim_per_shard(accessor->fea_dim(), 1)); + region_buffer.resize( + DenseDimPerShard(accessor->GetAccessorInfo().fea_dim, 1)); size_t data_size = region_buffer.size(); for (size_t i = 0, offset = 0; i < region_num; ++i) { uint32_t data_num = regions[i].size / sizeof(float); @@ -265,12 +213,17 @@ ::std::future PsLocalClient::push_dense(const Region* regions, offset += data_num; } - table_ptr->push_dense(region_buffer.data(), region_buffer.size()); + TableContext table_context; + table_context.value_type = Dense; + table_context.push_context.values = region_buffer.data(); + table_context.num = region_buffer.size(); + // table_ptr->PushDense(total_send_data, total_send_data_size); + table_ptr->Push(table_context); return done(); } -//::std::future PsLocalClient::pull_sparse(float** select_values, +//::std::future PsLocalClient::PullSparse(float** select_values, // size_t table_id, // const uint64_t* keys, // size_t num) { @@ -280,14 +233,14 @@ ::std::future PsLocalClient::push_dense(const Region* regions, // // auto local_timer = // // std::make_shared("pslib_downpour_client_pull_sparse_local"); // //将key拆分到各shard请求,并记录原始对应value指针 -// auto* accessor = table_accessor(table_id); -// auto* table_ptr = table(table_id); +// auto* accessor = GetTableAccessor(table_id); +// auto* table_ptr = GetTable(table_id); // size_t value_size = accessor->select_size(); // -// // table_ptr->pull_sparse(keys, num); +// // table_ptr->PullSparse(keys, num); // std::vector res_data; // res_data.resize(num * value_size / sizeof(float)); -// table_ptr->pull_sparse(res_data.data(), keys, num); +// table_ptr->PullSparse(res_data.data(), keys, num); // // memcpy(select_values[0], res_data->data(), res_data->size() * // // sizeof(float)); // size_t offset = 0; @@ -300,43 +253,67 @@ ::std::future PsLocalClient::push_dense(const Region* regions, // return done(); //} -::std::future PsLocalClient::pull_sparse_ptr(char** select_values, - size_t table_id, - const uint64_t* keys, - size_t num) { +::std::future PsLocalClient::PullSparsePtr(char** select_values, + size_t table_id, + const uint64_t* keys, + size_t num) { // FIXME // auto timer = // std::make_shared("pslib_downpour_client_pull_sparse"); // auto local_timer = // std::make_shared("pslib_downpour_client_pull_sparse_local"); //将key拆分到各shard请求,并记录原始对应value指针 - auto* table_ptr = table(table_id); + auto* table_ptr = GetTable(table_id); + + TableContext table_context; + table_context.value_type = Sparse; + table_context.pull_context.keys = keys; + table_context.pull_context.ptr_values = select_values; + table_context.use_ptr = true; + table_context.num = num; - table_ptr->pull_sparse_ptr(select_values, keys, num); + // table_ptr->PullSparsePtr(select_values, keys, num); + table_ptr->Pull(table_context); return done(); } -::std::future PsLocalClient::push_sparse_raw_gradient( +::std::future PsLocalClient::PushSparseRawGradient( size_t table_id, const uint64_t* keys, const float** update_values, size_t num, void* callback) { PSClientClosure* closure = reinterpret_cast(callback); - auto* accessor = table_accessor(table_id); - auto* table_ptr = table(table_id); - - table_ptr->push_sparse(keys, update_values, num); + auto* accessor = GetTableAccessor(table_id); + 
auto* table_ptr = GetTable(table_id); + + TableContext table_context; + table_context.value_type = Sparse; + table_context.push_context.keys = keys; + table_context.push_context.ptr_values = update_values; + table_context.num = num; + table_context.use_ptr = true; + + // table_ptr->PushSparse(keys, update_values, num); + table_ptr->Push(table_context); delete closure; return done(); } -::std::future PsLocalClient::push_sparse(size_t table_id, - const uint64_t* keys, - const float** update_values, - size_t num) { - auto* accessor = table_accessor(table_id); - auto* table_ptr = table(table_id); - - table_ptr->push_sparse(keys, update_values, num); +::std::future PsLocalClient::PushSparse(size_t table_id, + const uint64_t* keys, + const float** update_values, + size_t num) { + auto* accessor = GetTableAccessor(table_id); + auto* table_ptr = GetTable(table_id); + + TableContext table_context; + table_context.value_type = Sparse; + table_context.push_context.keys = keys; + table_context.push_context.ptr_values = update_values; + table_context.num = num; + table_context.use_ptr = true; + + // table_ptr->PushSparse(keys, update_values, num); + table_ptr->Push(table_context); return done(); } } diff --git a/paddle/fluid/distributed/ps/service/ps_local_client.h b/paddle/fluid/distributed/ps/service/ps_local_client.h index 83ca558e3d2cb..439ecf79f2f80 100644 --- a/paddle/fluid/distributed/ps/service/ps_local_client.h +++ b/paddle/fluid/distributed/ps/service/ps_local_client.h @@ -26,54 +26,46 @@ class PsLocalClient : public PSClient { public: PsLocalClient() {} virtual ~PsLocalClient() { _running = false; } - virtual int32_t create_client2client_connection(int pslib_timeout_ms, - int pslib_connect_timeout_ms, - int max_retry) { + virtual int32_t CreateClient2ClientConnection(int pslib_timeout_ms, + int pslib_connect_timeout_ms, + int max_retry) { return 0; } - virtual ::std::future shrink(uint32_t table_id, + virtual ::std::future Shrink(uint32_t table_id, const std::string threshold) override; - virtual ::std::future load(const std::string& epoch, + virtual ::std::future Load(const std::string& epoch, const std::string& mode) override; - virtual ::std::future load(uint32_t table_id, + virtual ::std::future Load(uint32_t table_id, const std::string& epoch, const std::string& mode) override; - virtual std::future Load( - const LoadSaveContext& load_context) override; - virtual ::std::future save(const std::string& epoch, + virtual ::std::future Save(const std::string& epoch, const std::string& mode) override; - virtual ::std::future save(uint32_t table_id, + virtual ::std::future Save(uint32_t table_id, const std::string& epoch, const std::string& mode) override; - virtual std::future Save( - const LoadSaveContext& save_context) override; - virtual ::std::future clear() override; - virtual ::std::future clear(uint32_t table_id) override; + virtual ::std::future Clear() override; + virtual ::std::future Clear(uint32_t table_id) override; - virtual ::std::future stop_server() override; + virtual ::std::future StopServer() override; - virtual void finalize_worker() override {} - virtual ::std::future pull_dense(Region* regions, size_t region_num, - size_t table_id); + virtual void FinalizeWorker() override {} + virtual ::std::future PullDense(Region* regions, size_t region_num, + size_t table_id); - virtual ::std::future Pull(RequestContext& pull_context) override; + virtual ::std::future PushDense(const Region* regions, + size_t region_num, size_t table_id); - virtual ::std::future Push(RequestContext& 
push_context) override; + virtual ::std::future PushDenseParam(const Region* regions, + size_t region_num, + size_t table_id); - virtual ::std::future push_dense(const Region* regions, - size_t region_num, size_t table_id); - - virtual ::std::future push_dense_param(const Region* regions, - size_t region_num, - size_t table_id); - - virtual ::std::future pull_sparse(float** select_values, - size_t table_id, - const uint64_t* keys, size_t num, - bool is_training) { + virtual ::std::future PullSparse(float** select_values, + size_t table_id, + const uint64_t* keys, size_t num, + bool is_training) { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -81,26 +73,26 @@ class PsLocalClient : public PSClient { return fut; } - virtual ::std::future pull_sparse_ptr(char** select_values, - size_t table_id, - const uint64_t* keys, - size_t num); + virtual ::std::future PullSparsePtr(char** select_values, + size_t table_id, + const uint64_t* keys, + size_t num); - virtual ::std::future print_table_stat(uint32_t table_id) { + virtual ::std::future PrintTableStat(uint32_t table_id) { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); return fut; } - virtual ::std::future push_sparse(size_t table_id, - const uint64_t* keys, - const float** update_values, - size_t num); + virtual ::std::future PushSparse(size_t table_id, + const uint64_t* keys, + const float** update_values, + size_t num); - virtual ::std::future flush(); + virtual ::std::future Flush(); // server profilera - virtual std::future start_profiler() { + virtual std::future StartProfiler() { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -108,7 +100,7 @@ class PsLocalClient : public PSClient { return fut; }; - virtual std::future stop_profiler() { + virtual std::future StopProfiler() { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -116,7 +108,7 @@ class PsLocalClient : public PSClient { return fut; } - virtual std::future barrier(size_t table_id, uint32_t barrier_type) { + virtual std::future Barrier(size_t table_id, uint32_t barrier_type) { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -124,10 +116,10 @@ class PsLocalClient : public PSClient { return fut; } - virtual std::future pull_geo_param(size_t table_id, - std::vector* values, - std::vector* keys, - int pserver_idx) { + virtual std::future PullGeoParam(size_t table_id, + std::vector* values, + std::vector* keys, + int pserver_idx) { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -135,9 +127,9 @@ class PsLocalClient : public PSClient { return fut; } - virtual std::future push_global_step(int table_id, - int64_t* total_send_data, - void* done) { + virtual std::future PushGlobalStep(int table_id, + int64_t* total_send_data, + void* done) { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -146,12 +138,12 @@ class PsLocalClient : public PSClient { } // recv table from server and save it in LodTensor - virtual int32_t recv_and_save_table(const uint64_t table_id, - const std::string& path) { + virtual int32_t RecvAndSaveTable(const uint64_t table_id, + const std::string& path) { return 0; } - virtual ::std::future send_client2client_msg( + virtual ::std::future SendClient2ClientMsg( int msg_type, int to_client_id, const std::string& msg) override { std::promise prom; std::future fut = prom.get_future(); @@ -159,17 +151,18 @@ class PsLocalClient : public PSClient { return fut; } - virtual 
size_t get_server_nums() { return 1; } + virtual size_t GetServerNums() { return 1; } - virtual std::future push_dense_raw_gradient( - int table_id, float* total_send_data, size_t total_send_data_size, - void* callback) override; + virtual std::future PushDenseRawGradient(int table_id, + float* total_send_data, + size_t total_send_data_size, + void* callback) override; - virtual std::future push_sparse_raw_gradient( + virtual std::future PushSparseRawGradient( size_t table_id, const uint64_t* keys, const float** update_values, size_t num, void* callback) override; - virtual std::future push_sparse_raw_gradient_partial( + virtual std::future PushSparseRawGradientPartial( size_t table_id, const uint64_t* keys, const float** update_values, uint32_t num, void* done, int pserver_idx) override { std::promise prom; @@ -179,11 +172,11 @@ class PsLocalClient : public PSClient { return fut; } - virtual std::future push_sparse_param(size_t table_id, - const uint64_t* keys, - const float** update_values, - size_t num, - void* done) override { + virtual std::future PushSparseParam(size_t table_id, + const uint64_t* keys, + const float** update_values, + size_t num, + void* done) override { std::promise prom; std::future fut = prom.get_future(); prom.set_value(0); @@ -192,7 +185,7 @@ class PsLocalClient : public PSClient { } private: - virtual int32_t initialize() override; + virtual int32_t Initialize() override; std::future done() { std::shared_ptr> prom = @@ -202,16 +195,16 @@ class PsLocalClient : public PSClient { return fut; } - inline uint32_t dense_dim_per_shard(uint32_t dense_dim_total, - uint32_t shard_num) { + inline uint32_t DenseDimPerShard(uint32_t dense_dim_total, + uint32_t shard_num) { return dense_dim_total / shard_num + 1; } - inline std::unordered_map>* table() { + inline std::unordered_map>* GetTable() { return &_table_map; } - inline Table* table(size_t table_id) { + inline Table* GetTable(size_t table_id) { auto itr = _table_map.find(table_id); if (itr != _table_map.end()) { return itr->second.get(); diff --git a/paddle/fluid/distributed/ps/service/ps_local_server.h b/paddle/fluid/distributed/ps/service/ps_local_server.h index 31b52126fc576..c09f8585b659d 100644 --- a/paddle/fluid/distributed/ps/service/ps_local_server.h +++ b/paddle/fluid/distributed/ps/service/ps_local_server.h @@ -25,17 +25,17 @@ class PsLocalServer : public PSServer { public: PsLocalServer() {} virtual ~PsLocalServer() {} - virtual uint64_t start() { return 0; } - virtual uint64_t start(const std::string &ip, uint32_t port) { return 0; } - virtual int32_t stop() { return 0; } - virtual int32_t configure( + virtual uint64_t Start() { return 0; } + virtual uint64_t Start(const std::string &ip, uint32_t port) { return 0; } + virtual int32_t Stop() { return 0; } + virtual int32_t Configure( const PSParameter &config, PSEnvironment &env, size_t server_rank, const std::vector &server_sub_program = {}) { return 0; } private: - virtual int32_t initialize() { return 0; } + virtual int32_t Initialize() { return 0; } }; } } diff --git a/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.cc b/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.cc index c8be0f7971090..92dfeb6818a28 100644 --- a/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.cc +++ b/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.cc @@ -70,7 +70,7 @@ void GraphPyService::set_up(std::string ips_str, int shard_num, port_list.push_back(ip_and_port[1]); uint32_t port = stoul(ip_and_port[1]); auto 
ph_host = paddle::distributed::PSHost(ip_and_port[0], port, index); - host_sign_list.push_back(ph_host.serialize_to_string()); + host_sign_list.push_back(ph_host.SerializeToString()); index++; } } @@ -83,11 +83,11 @@ void GraphPyClient::start_client() { paddle::distributed::PaddlePSEnvironment _ps_env; auto servers_ = host_sign_list.size(); _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list, servers_); + _ps_env.SetPsServers(&host_sign_list, servers_); worker_ptr = std::shared_ptr( (paddle::distributed::GraphBrpcClient*) - paddle::distributed::PSClientFactory::create(worker_proto)); - worker_ptr->configure(worker_proto, dense_regions, _ps_env, client_id); + paddle::distributed::PSClientFactory::Create(worker_proto)); + worker_ptr->Configure(worker_proto, dense_regions, _ps_env, client_id); worker_ptr->set_shard_num(get_shard_num()); } void GraphPyServer::start_server(bool block) { @@ -96,17 +96,17 @@ void GraphPyServer::start_server(bool block) { ::paddle::distributed::PSParameter server_proto = this->GetServerProto(); auto _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&this->host_sign_list, - this->host_sign_list.size()); // test + _ps_env.SetPsServers(&this->host_sign_list, + this->host_sign_list.size()); // test pserver_ptr = std::shared_ptr( (paddle::distributed::GraphBrpcServer*) - paddle::distributed::PSServerFactory::create(server_proto)); + paddle::distributed::PSServerFactory::Create(server_proto)); VLOG(0) << "pserver-ptr created "; std::vector empty_vec; framework::ProgramDesc empty_prog; empty_vec.push_back(empty_prog); - pserver_ptr->configure(server_proto, _ps_env, rank, empty_vec); - pserver_ptr->start(ip, port); + pserver_ptr->Configure(server_proto, _ps_env, rank, empty_vec); + pserver_ptr->Start(ip, port); pserver_ptr->build_peer2peer_connection(rank); std::condition_variable* cv_ = pserver_ptr->export_cv(); if (block) { @@ -246,7 +246,7 @@ void GraphPyClient::load_edge_file(std::string name, std::string filepath, VLOG(0) << "loadding data with type " << name << " from " << filepath; uint32_t table_id = this->table_id_map[name]; auto status = - get_ps_client()->load(table_id, std::string(filepath), params); + get_ps_client()->Load(table_id, std::string(filepath), params); status.wait(); } } @@ -285,7 +285,7 @@ void GraphPyClient::load_node_file(std::string name, std::string filepath) { if (this->table_id_map.count(name)) { uint32_t table_id = this->table_id_map[name]; auto status = - get_ps_client()->load(table_id, std::string(filepath), params); + get_ps_client()->Load(table_id, std::string(filepath), params); status.wait(); } } @@ -396,13 +396,13 @@ std::vector GraphPyClient::pull_graph_list(std::string name, return res; } -void GraphPyClient::stop_server() { +void GraphPyClient::StopServer() { VLOG(0) << "going to stop server"; std::unique_lock lock(mutex_); if (stoped_) return; - auto status = this->worker_ptr->stop_server(); + auto status = this->worker_ptr->StopServer(); if (status.get() == 0) stoped_ = true; } -void GraphPyClient::finalize_worker() { this->worker_ptr->finalize_worker(); } +void GraphPyClient::FinalizeWorker() { this->worker_ptr->FinalizeWorker(); } } } diff --git a/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.h b/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.h index 85707137c1800..19f34dad80745 100644 --- a/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.h +++ 
b/paddle/fluid/distributed/ps/service/ps_service/graph_py_service.h @@ -123,7 +123,7 @@ class GraphPyServer : public GraphPyService { set_rank(rank); GraphPyService::set_up(ips_str, shard_num, node_types, edge_types); } - int get_rank() { return rank; } + int GetRank() { return rank; } void set_rank(int rank) { this->rank = rank; } void start_server(bool block = true); @@ -154,8 +154,8 @@ class GraphPyClient : public GraphPyService { (paddle::distributed::GraphBrpcService*)server.get_ps_server() ->get_service()); } - void stop_server(); - void finalize_worker(); + void StopServer(); + void FinalizeWorker(); void load_edge_file(std::string name, std::string filepath, bool reverse); void load_node_file(std::string name, std::string filepath); void clear_nodes(std::string name); diff --git a/paddle/fluid/distributed/ps/service/ps_service/service.cc b/paddle/fluid/distributed/ps/service/ps_service/service.cc index 73793d2f9bd0e..9c3a06c2212e6 100644 --- a/paddle/fluid/distributed/ps/service/ps_service/service.cc +++ b/paddle/fluid/distributed/ps/service/ps_service/service.cc @@ -46,7 +46,7 @@ paddle::distributed::PSParameter load_from_prototxt( return param; } -void PSCore::init_gflag(const std::string& gflags) { +void PSCore::InitGFlag(const std::string& gflags) { VLOG(3) << "Init With Gflags:" << gflags; std::vector flags = paddle::string::split_string(gflags); if (flags.size() < 1) { @@ -65,67 +65,67 @@ void PSCore::init_gflag(const std::string& gflags) { ::GFLAGS_NAMESPACE::ParseCommandLineFlags(¶ms_cnt, ¶ms_ptr, true); } -int PSCore::init_server( +int PSCore::InitServer( const std::string& dist_desc, const std::vector* host_sign_list, int node_num, int index, int trainers, const std::vector& server_sub_program) { google::protobuf::TextFormat::ParseFromString(dist_desc, &_ps_param); - init_gflag(_ps_param.init_gflags()); + InitGFlag(_ps_param.init_gflags()); _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(host_sign_list, node_num); - _ps_env.set_trainers(trainers); + _ps_env.SetPsServers(host_sign_list, node_num); + _ps_env.SetTrainers(trainers); int ret = 0; _server_ptr = std::shared_ptr( - paddle::distributed::PSServerFactory::create(_ps_param)); - ret = _server_ptr->configure(_ps_param, _ps_env, index, server_sub_program); + paddle::distributed::PSServerFactory::Create(_ps_param)); + ret = _server_ptr->Configure(_ps_param, _ps_env, index, server_sub_program); CHECK(ret == 0) << "failed to configure server"; return ret; } -int PSCore::init_worker( +int PSCore::InitWorker( const std::string& dist_desc, const std::map>& regions, const std::vector* host_sign_list, int node_num, int index) { google::protobuf::TextFormat::ParseFromString(dist_desc, &_ps_param); - init_gflag(_ps_param.init_gflags()); + InitGFlag(_ps_param.init_gflags()); _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(host_sign_list, node_num); + _ps_env.SetPsServers(host_sign_list, node_num); int ret = 0; - VLOG(1) << "PSCore::init_worker"; + VLOG(1) << "PSCore::InitWorker"; auto* communicator = Communicator::GetInstance(); - ret = communicator->GetPsClient()->configure(_ps_param, regions, _ps_env, + ret = communicator->GetPsClient()->Configure(_ps_param, regions, _ps_env, index); communicator->Start(); return ret; } -std::vector PSCore::get_client_info() { - return _ps_env.get_client_info(); +std::vector PSCore::GetClientInfo() { + return _ps_env.GetClientInfo(); } -int PSCore::create_client2client_connection(int pserver_timeout_ms, - int 
pserver_connect_timeout_ms, - int max_retry) { - int ret = _worker_ptr->create_client2client_connection( +int PSCore::CreateClient2ClientConnection(int pserver_timeout_ms, + int pserver_connect_timeout_ms, + int max_retry) { + int ret = _worker_ptr->CreateClient2ClientConnection( pserver_timeout_ms, pserver_connect_timeout_ms, max_retry); return ret; } -uint64_t PSCore::run_server(const std::string& ip, uint32_t port) { - return _server_ptr->start(ip, port); +uint64_t PSCore::RunServer(const std::string& ip, uint32_t port) { + return _server_ptr->Start(ip, port); } -int PSCore::finalize_worker() { - _worker_ptr->finalize_worker(); +int PSCore::FinalizeWorker() { + _worker_ptr->FinalizeWorker(); return 0; } -int PSCore::stop_server() { - auto stop_status = _worker_ptr->stop_server(); +int PSCore::StopServer() { + auto stop_status = _worker_ptr->StopServer(); stop_status.wait(); return 0; } -paddle::distributed::PSParameter* PSCore::get_param() { return &_ps_param; } +paddle::distributed::PSParameter* PSCore::GetParam() { return &_ps_param; } } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/service/ps_service/service.h b/paddle/fluid/distributed/ps/service/ps_service/service.h index 202c2407f15ae..112fdc3e14183 100644 --- a/paddle/fluid/distributed/ps/service/ps_service/service.h +++ b/paddle/fluid/distributed/ps/service/ps_service/service.h @@ -42,31 +42,31 @@ class PSCore { explicit PSCore() {} virtual ~PSCore() {} - virtual int init_server( + virtual int InitServer( const std::string& dist_desc, const std::vector* host_sign_list, int node_num, int index, int trainers, const std::vector& server_sub_program = {}); - virtual int init_worker( + virtual int InitWorker( const std::string& dist_desc, const std::map>& regions, const std::vector* host_sign_list, int node_num, int index); - virtual uint64_t run_server(const std::string& ip, uint32_t port); - virtual int stop_server(); - virtual int finalize_worker(); - virtual std::vector get_client_info(); - virtual int create_client2client_connection(int pserver_timeout_ms, - int pserver_connect_timeout_ms, - int max_retry); + virtual uint64_t RunServer(const std::string& ip, uint32_t port); + virtual int StopServer(); + virtual int FinalizeWorker(); + virtual std::vector GetClientInfo(); + virtual int CreateClient2ClientConnection(int pserver_timeout_ms, + int pserver_connect_timeout_ms, + int max_retry); std::shared_ptr _server_ptr; // pointer to server std::shared_ptr _worker_ptr; // pointer to worker - virtual paddle::distributed::PSParameter* get_param(); + virtual paddle::distributed::PSParameter* GetParam(); private: - void init_gflag(const std::string& gflags); + void InitGFlag(const std::string& gflags); paddle::distributed::PSParameter _ps_param; paddle::distributed::PaddlePSEnvironment _ps_env; }; diff --git a/paddle/fluid/distributed/ps/service/sendrecv.proto b/paddle/fluid/distributed/ps/service/sendrecv.proto old mode 100644 new mode 100755 index 6dfaff1ffa1df..580f411c28c07 --- a/paddle/fluid/distributed/ps/service/sendrecv.proto +++ b/paddle/fluid/distributed/ps/service/sendrecv.proto @@ -59,6 +59,12 @@ enum PsCmdID { PS_GRAPH_SAMPLE_NODES_FROM_ONE_SERVER = 38; PS_GRAPH_USE_NEIGHBORS_SAMPLE_CACHE = 39; PS_GRAPH_LOAD_GRAPH_SPLIT_CONFIG = 40; + PEER_ROLE_IS_WORKER = 41; + PEER_ROLE_IS_SWITCH = 42; + PS_SAVE_WITH_SCOPE = 43; + PS_SAVE_WITH_SHARD = 44; + PS_QUERY_WITH_SCOPE = 45; + PS_QUERY_WITH_SHARD = 46; } message PsRequestMessage { @@ -117,9 +123,16 @@ message MultiVariableMessage { 
repeated string send_var_names = 2; repeated string recv_var_names = 3; repeated VariableMessage var_messages = 4; + optional bytes data = 5; + repeated int32 vars_len = 6; + optional int32 group_id = 7; }; service PsService { rpc service(PsRequestMessage) returns (PsResponseMessage); rpc SendAndRecvVariable(MultiVariableMessage) returns (MultiVariableMessage); + rpc SendToWorker(MultiVariableMessage) returns (PsResponseMessage); + rpc SendToSwitch(MultiVariableMessage) returns (PsResponseMessage); + rpc SendS2S(MultiVariableMessage) returns (PsResponseMessage); + rpc RecvFromSwitch(MultiVariableMessage) returns (MultiVariableMessage); }; diff --git a/paddle/fluid/distributed/ps/service/server.cc b/paddle/fluid/distributed/ps/service/server.cc index 893f671359e40..65f7ae821cef1 100644 --- a/paddle/fluid/distributed/ps/service/server.cc +++ b/paddle/fluid/distributed/ps/service/server.cc @@ -29,7 +29,7 @@ REGISTER_PSCORE_CLASS(PsBaseService, BrpcPsService); REGISTER_PSCORE_CLASS(PSServer, GraphBrpcServer); REGISTER_PSCORE_CLASS(PsBaseService, GraphBrpcService); -PSServer *PSServerFactory::create(const PSParameter &ps_config) { +PSServer *PSServerFactory::Create(const PSParameter &ps_config) { const auto &config = ps_config.server_param(); if (!config.has_downpour_server_param()) { @@ -56,18 +56,18 @@ PSServer *PSServerFactory::create(const PSParameter &ps_config) { << service_param.server_class(); return NULL; } - TableManager::instance().initialize(); + TableManager::Instance().Initialize(); return server; } -int32_t PSServer::configure( +int32_t PSServer::Configure( const PSParameter &config, PSEnvironment &env, size_t server_rank, const std::vector &server_sub_program) { scope_.reset(new framework::Scope()); _config = config.server_param(); _rank = server_rank; _environment = &env; - size_t shard_num = env.get_ps_servers().size(); + size_t shard_num = env.GetPsServers().size(); const auto &downpour_param = _config.downpour_server_param(); @@ -87,21 +87,21 @@ int32_t PSServer::configure( global_step_table = downpour_param.downpour_table_param(i).table_id(); } - table->set_program_env(scope_.get(), place_, &server_sub_program); - table->set_shard(_rank, shard_num); - table->initialize(downpour_param.downpour_table_param(i), + table->SetProgramEnv(scope_.get(), place_, &server_sub_program); + table->SetShard(_rank, shard_num); + table->Initialize(downpour_param.downpour_table_param(i), config.fs_client_param()); _table_map[downpour_param.downpour_table_param(i).table_id()].reset(table); } if (barrier_table != UINT32_MAX) { - _table_map[barrier_table]->set_table_map(&_table_map); + _table_map[barrier_table]->SetTableMap(&_table_map); } if (global_step_table != UINT32_MAX) { - _table_map[global_step_table]->set_table_map(&_table_map); + _table_map[global_step_table]->SetTableMap(&_table_map); } - return initialize(); + return Initialize(); } } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/service/server.h b/paddle/fluid/distributed/ps/service/server.h index d2804405b4198..5da819326b052 100644 --- a/paddle/fluid/distributed/ps/service/server.h +++ b/paddle/fluid/distributed/ps/service/server.h @@ -65,19 +65,19 @@ class PSServer { PSServer(PSServer &&) = delete; PSServer(const PSServer &) = delete; - virtual int32_t configure( + virtual int32_t Configure( const PSParameter &config, PSEnvironment &env, size_t server_rank, const std::vector &server_sub_program = {}); - virtual uint64_t start(const std::string &ip, uint32_t port) = 0; - virtual int32_t 
stop() = 0; + virtual uint64_t Start(const std::string &ip, uint32_t port) = 0; + virtual int32_t Stop() = 0; - inline size_t rank() const { return _rank; } + inline size_t Rank() const { return _rank; } - inline PSEnvironment *environment() { return _environment; } + inline PSEnvironment *Environment() { return _environment; } - inline const ServerParameter *config() const { return &_config; } - inline Table *table(size_t table_id) { + inline const ServerParameter *Config() const { return &_config; } + inline Table *GetTable(size_t table_id) { auto itr = _table_map.find(table_id); if (itr != _table_map.end()) { return itr->second.get(); @@ -85,12 +85,12 @@ class PSServer { return NULL; } - inline std::unordered_map> *table() { + inline std::unordered_map> *GetTable() { return &_table_map; } protected: - virtual int32_t initialize() = 0; + virtual int32_t Initialize() = 0; protected: size_t _rank; @@ -129,11 +129,11 @@ class PsBaseService : public PsService { public: PsBaseService() : _rank(0), _server(NULL), _config(NULL) {} virtual ~PsBaseService() {} - virtual size_t get_rank() { return _rank; } - virtual int32_t configure(PSServer *server) { + virtual size_t GetRank() { return _rank; } + virtual int32_t Configure(PSServer *server) { _server = server; - _rank = _server->rank(); - _config = _server->config(); + _rank = _server->Rank(); + _config = _server->Config(); return 0; } virtual void service(::google::protobuf::RpcController *controller, @@ -148,8 +148,8 @@ class PsBaseService : public PsService { LOG(WARNING) << "Resonse err_code:" << err_code << " msg:" << err_msg; } - virtual int32_t initialize() = 0; - PSServer *get_server() { return _server; } + virtual int32_t Initialize() = 0; + PSServer *GetServer() { return _server; } protected: size_t _rank; @@ -160,7 +160,7 @@ REGISTER_PSCORE_REGISTERER(PsBaseService); class PSServerFactory { public: - static PSServer *create(const PSParameter &config); + static PSServer *Create(const PSParameter &config); }; } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/CMakeLists.txt b/paddle/fluid/distributed/ps/table/CMakeLists.txt index 227d0a9f1cdb8..aebe36b5e0496 100644 --- a/paddle/fluid/distributed/ps/table/CMakeLists.txt +++ b/paddle/fluid/distributed/ps/table/CMakeLists.txt @@ -7,10 +7,7 @@ set_source_files_properties(${graphDir}/graph_weighted_sampler.cc PROPERTIES COM cc_library(WeightedSampler SRCS ${graphDir}/graph_weighted_sampler.cc DEPS graph_edge) set_source_files_properties(${graphDir}/graph_node.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) cc_library(graph_node SRCS ${graphDir}/graph_node.cc DEPS WeightedSampler) -set_source_files_properties(common_dense_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) -set_source_files_properties(common_sparse_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) -set_source_files_properties(ssd_sparse_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) -set_source_files_properties(sparse_geo_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) +set_source_files_properties(memory_dense_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) set_source_files_properties(barrier_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) set_source_files_properties(common_graph_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) @@ -23,10 +20,10 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp") set(EXTERN_DEP "") if(WITH_HETERPS) - set(TABLE_SRC common_sparse_table.cc 
ssd_sparse_table.cc common_dense_table.cc sparse_geo_table.cc barrier_table.cc common_graph_table.cc) + set(TABLE_SRC memory_dense_table.cc barrier_table.cc common_graph_table.cc) set(EXTERN_DEP rocksdb) else() - set(TABLE_SRC common_sparse_table.cc common_dense_table.cc sparse_geo_table.cc barrier_table.cc common_graph_table.cc) + set(TABLE_SRC memory_dense_table.cc barrier_table.cc common_graph_table.cc) endif() cc_library(common_table SRCS ${TABLE_SRC} DEPS ${TABLE_DEPS} @@ -43,12 +40,10 @@ set_source_files_properties(sparse_sgd_rule.cc PROPERTIES COMPILE_FLAGS ${DISTRI set_source_files_properties(ctr_double_accessor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) set_source_files_properties(ctr_accessor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) set_source_files_properties(sparse_accessor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) -set_source_files_properties(downpour_ctr_accessor.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) set_source_files_properties(memory_sparse_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) cc_library(sparse_sgd_rule SRCS sparse_sgd_rule.cc DEPS ${TABLE_DEPS} ps_framework_proto) cc_library(ctr_double_accessor SRCS ctr_double_accessor.cc DEPS ${TABLE_DEPS} ps_framework_proto sparse_sgd_rule) cc_library(ctr_accessor SRCS ctr_accessor.cc sparse_accessor.cc DEPS ${TABLE_DEPS} ps_framework_proto sparse_sgd_rule) -cc_library(downpour_ctr_accessor SRCS downpour_ctr_accessor.cc DEPS ${TABLE_DEPS} ps_framework_proto sparse_sgd_rule) cc_library(memory_sparse_table SRCS memory_sparse_table.cc DEPS ps_framework_proto ${TABLE_DEPS} fs afs_wrapper ctr_accessor common_table) set_source_files_properties(memory_sparse_geo_table.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) diff --git a/paddle/fluid/distributed/ps/table/accessor.h b/paddle/fluid/distributed/ps/table/accessor.h index 207cc94b4cb15..024af327a33af 100644 --- a/paddle/fluid/distributed/ps/table/accessor.h +++ b/paddle/fluid/distributed/ps/table/accessor.h @@ -46,33 +46,30 @@ struct DataConverter { }; struct AccessorInfo { + // value维度 size_t dim; + // value各个维度的size size_t size; - size_t select_size; + // pull value维度 size_t select_dim; - size_t update_size; + // pull value各维度相加总size + size_t select_size; + // push value维度 size_t update_dim; + // push value各个维度的size + size_t update_size; + // value中mf动态长度部分总size大小, sparse下生效 size_t mf_size; + // value总维度,dense下生效 size_t fea_dim; }; -enum InfoKey { - DIM = 0, - SIZE = 1, - SELECT_SIZE = 2, - SELECT_DIM = 3, - UPDATE_SIZE = 4, - UPDATE_DIM = 5, - MF_SIZE = 6, - FEA_DIM = 7 -}; - class ValueAccessor { public: ValueAccessor() {} virtual ~ValueAccessor() {} - virtual int configure(const TableAccessorParameter& parameter) { + virtual int Configure(const TableAccessorParameter& parameter) { _config = parameter; // data_convert结构体初始化 if (_config.table_accessor_save_param_size() != 0) { @@ -88,38 +85,14 @@ class ValueAccessor { } return 0; } - virtual int initialize() = 0; - - virtual void SetTableInfo(AccessorInfo& info) = 0; - virtual size_t GetTableInfo(InfoKey key) = 0; + virtual int Initialize() = 0; - // value维度 - virtual size_t dim() = 0; - // value各个维度的size - virtual size_t dim_size(size_t dim) = 0; - // value各维度相加总size - virtual size_t size() = 0; + virtual AccessorInfo GetAccessorInfo() { return _accessor_info; } - // value中mf动态长度部分总size大小, sparse下生效 - virtual size_t mf_size() { return 0; } - virtual bool need_extend_mf(float* value) { return false; } - virtual bool has_mf(size_t size) { 
return false; } - // pull value维度 - virtual size_t select_dim() = 0; - // pull value各个维度的size - virtual size_t select_dim_size(size_t dim) = 0; - // pull value各维度相加总size - virtual size_t select_size() = 0; - // push value维度 - virtual size_t update_dim() = 0; - // push value各个维度的size - virtual size_t update_dim_size(size_t dim) = 0; - // push value各维度相加总size - virtual size_t update_size() = 0; - // fea total for dense - virtual size_t fea_dim() { return _config.fea_dim(); } + virtual bool NeedExtendMF(float* value) { return false; } + virtual bool HasMF(size_t size) { return false; } // converter for save - virtual std::string get_converter(int param) { + virtual std::string GetConverter(int param) { auto itr = _data_coverter_map.find(param); if (itr == _data_coverter_map.end()) { return ""; @@ -128,7 +101,7 @@ class ValueAccessor { } } // deconverter for load - virtual std::string get_deconverter(int param) { + virtual std::string GetDeconverter(int param) { auto itr = _data_coverter_map.find(param); if (itr == _data_coverter_map.end()) { return ""; @@ -137,47 +110,47 @@ class ValueAccessor { } } // 判断该value是否进行shrink - virtual bool shrink(float* value) = 0; + virtual bool Shrink(float* value) = 0; // 判断该value是否在save阶段dump, // param作为参数用于标识save阶段,如downpour的xbox与batch_model - virtual bool save(float* value, int param) = 0; + virtual bool Save(float* value, int param) = 0; // update delta_score and unseen_days after save - virtual void update_stat_after_save(float* value, int param) {} + virtual void UpdateStatAfterSave(float* value, int param) {} // keys不存在时,为values生成随机值 - virtual int32_t create(float** value, size_t num) = 0; - virtual bool create_value(int type, const float* value) { return true; } + virtual int32_t Create(float** value, size_t num) = 0; + virtual bool CreateValue(int type, const float* value) { return true; } // 从values中选取到select_values中 - virtual int32_t select(float** select_values, const float** values, + virtual int32_t Select(float** select_values, const float** values, size_t num) = 0; // 将update_values聚合到一起 - virtual int32_t merge(float** update_values, + virtual int32_t Merge(float** update_values, const float** other_update_values, size_t num) = 0; // 将update_values聚合到一起,通过it.next判定是否进入下一个key - // virtual int32_t merge(float** update_values, iterator it); + // virtual int32_t Merge(float** update_values, iterator it); // 将update_values更新应用到values中 - virtual int32_t update(float** values, const float** update_values, + virtual int32_t Update(float** values, const float** update_values, size_t num) = 0; // used to save model, will filter feature - virtual std::string parse_to_string(const float* value, int param) = 0; + virtual std::string ParseToString(const float* value, int param) = 0; // parse value from string, used to load model - virtual int32_t parse_from_string(const std::string& data, float* value) = 0; + virtual int32_t ParseFromString(const std::string& data, float* value) = 0; - virtual FsDataConverter converter(int param) { + virtual FsDataConverter Converter(int param) { FsDataConverter data_convert; - data_convert.converter = this->get_converter(param); - data_convert.deconverter = this->get_deconverter(param); + data_convert.converter = this->GetConverter(param); + data_convert.deconverter = this->GetDeconverter(param); return data_convert; } - virtual int set_weight(float** values, const float** update_values, - size_t num) { + virtual int SetWeight(float** values, const float** update_values, + size_t num) { return 0; } - virtual float 
get_field(float* value, const std::string& name) { return 0.0; } + virtual float GetField(float* value, const std::string& name) { return 0.0; } #define DEFINE_GET_INDEX(class, field) \ virtual int get_##field##_index() override { return class ::field##_index(); } diff --git a/paddle/fluid/distributed/ps/table/barrier_table.cc b/paddle/fluid/distributed/ps/table/barrier_table.cc index 25838e7ac2f04..b9d0345313cc3 100644 --- a/paddle/fluid/distributed/ps/table/barrier_table.cc +++ b/paddle/fluid/distributed/ps/table/barrier_table.cc @@ -17,7 +17,7 @@ namespace paddle { namespace distributed { -int32_t BarrierTable::initialize() { +int32_t BarrierTable::Initialize() { auto trainers = _config.common().trainer_num(); trigger_.store(trainers); @@ -29,7 +29,7 @@ int32_t BarrierTable::initialize() { } // 0: send_barrier 1: recv_barrier 2: complete -int32_t BarrierTable::barrier(const uint32_t trainer_id, +int32_t BarrierTable::Barrier(const uint32_t trainer_id, const std::string barrier_type) { std::unique_lock lock(mutex_); @@ -56,7 +56,7 @@ int32_t BarrierTable::barrier(const uint32_t trainer_id, VLOG(1) << "barrier table optimize begin"; for (auto& x : *table_map_) { auto table = x.second; - table->pour(); + table->Pour(); } VLOG(1) << "barrier table optimize done"; @@ -66,7 +66,7 @@ int32_t BarrierTable::barrier(const uint32_t trainer_id, return 0; } -int32_t BarrierTable::set_table_map( +int32_t BarrierTable::SetTableMap( std::unordered_map>* table_map) { table_map_ = table_map; return 0; diff --git a/paddle/fluid/distributed/ps/table/common_graph_table.cc b/paddle/fluid/distributed/ps/table/common_graph_table.cc index dcce46270d026..7aab679954709 100644 --- a/paddle/fluid/distributed/ps/table/common_graph_table.cc +++ b/paddle/fluid/distributed/ps/table/common_graph_table.cc @@ -448,7 +448,7 @@ int32_t GraphTable::load_graph_split_config(const std::string &path) { return 0; } -int32_t GraphTable::load(const std::string &path, const std::string ¶m) { +int32_t GraphTable::Load(const std::string &path, const std::string ¶m) { bool load_edge = (param[0] == 'e'); bool load_node = (param[0] == 'n'); if (load_edge) { @@ -1066,11 +1066,11 @@ int32_t GraphTable::pull_graph_list(int start, int total_size, int32_t GraphTable::get_server_index_by_id(int64_t id) { return id % shard_num / shard_num_per_server; } -int32_t GraphTable::initialize(const TableParameter &config, +int32_t GraphTable::Initialize(const TableParameter &config, const FsClientParameter &fs_config) { LOG(INFO) << "in graphTable initialize"; _config = config; - if (initialize_accessor() != 0) { + if (InitializeAccessor() != 0) { LOG(WARNING) << "Table accessor initialize failed"; return -1; } @@ -1082,9 +1082,9 @@ int32_t GraphTable::initialize(const TableParameter &config, auto graph = config.graph_parameter(); shard_num = _config.shard_num(); LOG(INFO) << "in graphTable initialize over"; - return initialize(graph); + return Initialize(graph); } -int32_t GraphTable::initialize(const GraphParameter &graph) { +int32_t GraphTable::Initialize(const GraphParameter &graph) { #ifdef PADDLE_WITH_HETERPS if (graph.gpups_mode()) { gpups_mode = true; diff --git a/paddle/fluid/distributed/ps/table/common_graph_table.h b/paddle/fluid/distributed/ps/table/common_graph_table.h index 72600b42b8282..acc484e6098d4 100644 --- a/paddle/fluid/distributed/ps/table/common_graph_table.h +++ b/paddle/fluid/distributed/ps/table/common_graph_table.h @@ -280,7 +280,7 @@ class ScaledLRU { } } auto status = - thread_pool->enqueue([this]() -> int { return 
shrink(); }); + thread_pool->enqueue([this]() -> int { return Shrink(); }); status.wait(); } }); @@ -298,7 +298,7 @@ class ScaledLRU { LRUResponse insert(size_t index, K *keys, V *data, size_t length) { return lru_pool[index].insert(keys, data, length); } - int shrink() { + int Shrink() { int node_size = 0; for (size_t i = 0; i < lru_pool.size(); i++) { node_size += lru_pool[i].node_size - lru_pool[i].remove_count; @@ -329,7 +329,7 @@ class ScaledLRU { if (diff != 0) { __sync_fetch_and_add(&global_count, diff); if (global_count > int(1.25 * size_limit)) { - thread_pool->enqueue([this]() -> int { return shrink(); }); + thread_pool->enqueue([this]() -> int { return Shrink(); }); } } } @@ -404,7 +404,7 @@ class GraphSampler { }; #endif -class GraphTable : public SparseTable { +class GraphTable : public Table { public: GraphTable() { use_cache = false; @@ -415,6 +415,23 @@ class GraphTable : public SparseTable { rw_lock.reset(new pthread_rwlock_t()); } virtual ~GraphTable(); + + virtual void *GetShard(size_t shard_idx) { return 0; } + + static int32_t sparse_local_shard_num(uint32_t shard_num, + uint32_t server_num) { + if (shard_num % server_num == 0) { + return shard_num / server_num; + } + size_t local_shard_num = shard_num / server_num + 1; + return local_shard_num; + } + + static size_t get_sparse_shard(uint32_t shard_num, uint32_t server_num, + uint64_t key) { + return (key % shard_num) / sparse_local_shard_num(shard_num, server_num); + } + virtual int32_t pull_graph_list(int start, int size, std::unique_ptr &buffer, int &actual_size, bool need_feature, @@ -430,11 +447,11 @@ class GraphTable : public SparseTable { virtual int32_t get_nodes_ids_by_ranges( std::vector> ranges, std::vector &res); - virtual int32_t initialize() { return 0; } - virtual int32_t initialize(const TableParameter &config, + virtual int32_t Initialize() { return 0; } + virtual int32_t Initialize(const TableParameter &config, const FsClientParameter &fs_config); - virtual int32_t initialize(const GraphParameter &config); - int32_t load(const std::string &path, const std::string ¶m); + virtual int32_t Initialize(const GraphParameter &config); + int32_t Load(const std::string &path, const std::string ¶m); int32_t load_graph_split_config(const std::string &path); int32_t load_edges(const std::string &path, bool reverse); @@ -452,26 +469,16 @@ class GraphTable : public SparseTable { virtual int32_t Pull(TableContext &context) { return 0; } virtual int32_t Push(TableContext &context) { return 0; } - virtual int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) { - return 0; - } - - virtual int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) { - return 0; - } - virtual int32_t clear_nodes(); - virtual void clear() {} - virtual int32_t flush() { return 0; } - virtual int32_t shrink(const std::string ¶m) { return 0; } + virtual void Clear() {} + virtual int32_t Flush() { return 0; } + virtual int32_t Shrink(const std::string ¶m) { return 0; } //指定保存路径 - virtual int32_t save(const std::string &path, const std::string &converter) { + virtual int32_t Save(const std::string &path, const std::string &converter) { return 0; } - virtual int32_t initialize_shard() { return 0; } - virtual int32_t set_shard(size_t shard_idx, size_t server_num) { + virtual int32_t InitializeShard() { return 0; } + virtual int32_t SetShard(size_t shard_idx, size_t server_num) { _shard_idx = shard_idx; /* _shard_num is not used in graph_table, this following operation is for the diff --git 
a/paddle/fluid/distributed/ps/table/common_sparse_table.cc b/paddle/fluid/distributed/ps/table/common_sparse_table.cc deleted file mode 100644 index 1fc8adc2b92eb..0000000000000 --- a/paddle/fluid/distributed/ps/table/common_sparse_table.cc +++ /dev/null @@ -1,605 +0,0 @@ -// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include - -#include "glog/logging.h" -#include "paddle/fluid/platform/enforce.h" - -namespace paddle { -namespace distributed { -class ValueBlock; -} // namespace distributed -} // namespace paddle - -namespace paddle { -namespace distributed { - -void CommonSparseTable::ProcessALine(const std::vector& columns, - const Meta& meta, const int64_t id, - std::vector>* values) { - auto colunmn_size = columns.size(); - auto load_values = - paddle::string::split_string(columns[colunmn_size - 1], ","); - values->reserve(meta.names.size()); - - int offset = 0; - for (int x = 0; x < meta.names.size(); ++x) { - std::vector val; - auto start = load_values.begin() + offset; - auto end = load_values.begin() + offset + meta.dims[x]; - PADDLE_ENFORCE_LE(offset + meta.dims[x], load_values.size(), - paddle::platform::errors::InvalidArgument( - "The data format in txt does not meet the field " - "requirements defined in meta")); - - std::transform(start, end, std::back_inserter(val), [id](std::string va) { - float v = 0.0; - - try { - v = std::stof(va); - } catch (std::invalid_argument& e) { - VLOG(0) << "id: " << id << " get unexpected value: " << va - << " and be reset to: 0.0"; - } catch (std::out_of_range& e) { - VLOG(0) << "id: " << id << " get unexpected value: " << va - << " and be reset to: 0.0"; - } - return v; - }); - - values->push_back(val); - offset += meta.dims[x]; - } -} - -void CommonSparseTable::SaveMetaToText(std::ostream* os, - const CommonAccessorParameter& common, - const size_t shard_idx, - const int64_t total) { - // save meta - std::stringstream stream; - stream << "param=" << common.table_name() << "\n"; - stream << "shard_id=" << shard_idx << "\n"; - stream << "row_names=" << paddle::string::join_strings(common.params(), ',') - << "\n"; - stream << "row_dims=" << paddle::string::join_strings(common.dims(), ',') - << "\n"; - stream << "count=" << total << "\n"; - os->write(stream.str().c_str(), sizeof(char) * stream.str().size()); -} - -int64_t CommonSparseTable::SaveValueToText(std::ostream* os, - std::shared_ptr block, - std::shared_ptr<::ThreadPool> pool, - const int mode, int shard_id) { - int64_t save_num = 0; - for (auto& table : block->values_) { - for (auto& value : table) { - if (mode == SaveMode::delta && !value.second->need_save_) { - continue; - } - - ++save_num; - - std::stringstream ss; - auto* vs = value.second->data_.data(); - - auto id = value.first; - - ss << id << "\t" << value.second->count_ << "\t" - << value.second->unseen_days_ << "\t" << value.second->is_entry_ - << "\t"; - - for (int i = 0; i < 
block->value_length_ - 1; i++) { - ss << std::to_string(vs[i]) << ","; - } - - ss << std::to_string(vs[block->value_length_ - 1]); - ss << "\n"; - - os->write(ss.str().c_str(), sizeof(char) * ss.str().size()); - - if (mode == SaveMode::base || mode == SaveMode::delta) { - value.second->need_save_ = false; - } - } - } - - return save_num; -} - -int64_t CommonSparseTable::LoadFromText( - const std::string& valuepath, const std::string& metapath, - const int pserver_id, const int pserver_num, const int local_shard_num, - std::vector>* blocks) { - Meta meta = Meta(metapath); - - int num_lines = 0; - std::ifstream file(valuepath); - std::string line; - - while (std::getline(file, line)) { - auto values = paddle::string::split_string(line, "\t"); - auto id = std::stoull(values[0]); - - if (id % pserver_num != pserver_id) { - VLOG(3) << "will not load " << values[0] << " from " << valuepath - << ", please check id distribution"; - continue; - } - - auto shard_id = id % local_shard_num; - auto block = blocks->at(shard_id); - - std::vector> kvalues; - ProcessALine(values, meta, id, &kvalues); - - block->Init(id, false); - - VALUE* value_instant = block->GetValue(id); - - if (values.size() == 5) { - value_instant->count_ = std::stoi(values[1]); - value_instant->unseen_days_ = std::stoi(values[2]); - value_instant->is_entry_ = static_cast(std::stoi(values[3])); - } - - std::vector block_values = block->Get(id, meta.names, meta.dims); - auto blas = GetBlas(); - for (int x = 0; x < meta.names.size(); ++x) { - blas.VCOPY(meta.dims[x], kvalues[x].data(), block_values[x]); - } - } - - return 0; -} - -int32_t CommonSparseTable::initialize() { - _shards_task_pool.resize(task_pool_size_); - for (int i = 0; i < _shards_task_pool.size(); ++i) { - _shards_task_pool[i].reset(new ::ThreadPool(1)); - } - - sync = _config.common().sync(); - VLOG(1) << "table " << _config.common().table_name() << " is sync: " << sync; - - _global_lr = new float(1.0); - - auto common = _config.common(); - int size = static_cast(common.params().size()); - - size_t offset = 0; - for (int x = 0; x < size; ++x) { - auto& varname = common.params()[x]; - auto& dim = common.dims()[x]; - - value_idx_[varname] = x; - value_names_.push_back(varname); - value_dims_.push_back(dim); - value_offsets_.push_back(offset); - initializer_attrs_.push_back(common.initializers()[x]); - - if (varname == "Param") { - param_dim_ = dim; - param_offset_ = offset; - } - - offset += dim; - } - - initialize_value(); - initialize_optimizer(); - initialize_recorder(); - return 0; -} - -int32_t CommonSparseTable::initialize_recorder() { return 0; } - -int32_t CommonSparseTable::initialize_value() { - auto common = _config.common(); - shard_values_.reserve(task_pool_size_); - - for (int x = 0; x < task_pool_size_; ++x) { - auto shard = std::make_shared( - value_names_, value_dims_, value_offsets_, value_idx_, - initializer_attrs_, common.entry()); - - shard_values_.emplace_back(shard); - } - - return 0; -} - -int32_t CommonSparseTable::initialize_optimizer() { - auto common = _config.common(); - auto name = common.name(); - - if (name == "sgd") { - optimizer_ = std::make_shared(value_names_, value_dims_, - value_offsets_, value_idx_); - optimizer_->set_global_lr(_global_lr); - } else if (name == "adam") { - optimizer_ = std::make_shared(value_names_, value_dims_, - value_offsets_, value_idx_); - optimizer_->set_global_lr(_global_lr); - } else if (name == "sum") { - optimizer_ = std::make_shared(value_names_, value_dims_, - value_offsets_, value_idx_); - } else { - 
VLOG(3) << "init optimizer failed"; - } - - VLOG(3) << "init optimizer " << name << " done"; - return 0; -} - -int32_t CommonSparseTable::set_global_lr(float* lr) { - _global_lr = lr; - optimizer_->set_global_lr(_global_lr); - return 0; -} - -int32_t CommonSparseTable::load(const std::string& dirname, - const std::string& param) { - auto begin = GetCurrentUS(); - rwlock_->WRLock(); - auto varname = _config.common().table_name(); - std::string var_store = - string::Sprintf("%s/%s%s", dirname, varname, PSERVER_SAVE_SUFFIX); - std::string shard_var_pre = - string::Sprintf("%s.block%d", varname, _shard_idx); - std::string value_ = string::Sprintf("%s/%s.txt", var_store, shard_var_pre); - std::string meta_ = string::Sprintf("%s/%s.meta", var_store, shard_var_pre); - - LoadFromText(value_, meta_, _shard_idx, _shard_num, task_pool_size_, - &shard_values_); - rwlock_->UNLock(); - auto end = GetCurrentUS(); - - VLOG(0) << "load " << varname << " with value: " << value_ - << " , meta: " << meta_ - << " using: " << std::to_string((end - begin) / 1e+6) << " seconds"; - - return 0; -} - -int32_t CommonSparseTable::save(const std::string& dirname, - const std::string& param) { - auto begin = GetCurrentUS(); - rwlock_->WRLock(); - int mode = std::stoi(param); - VLOG(3) << "sparse table save: " << dirname << " mode: " << mode; - - auto varname = _config.common().table_name(); - std::string var_store = - string::Sprintf("%s/%s%s", dirname, varname, PSERVER_SAVE_SUFFIX); - MkDirRecursively(var_store.c_str()); - - VLOG(3) << "save " << varname << " in dir: " << var_store << " begin"; - std::vector params(_config.common().params().begin(), - _config.common().params().end()); - - std::string shard_var_pre = - string::Sprintf("%s.block%d", varname, _shard_idx); - - std::string value_ = string::Sprintf("%s/%s.txt", var_store, shard_var_pre); - - std::unique_ptr vs(new std::ofstream(value_)); - - int64_t total_ins = 0; - for (int shard_id = 0; shard_id < task_pool_size_; ++shard_id) { - // save values - auto shard_save_num = - SaveValueToText(vs.get(), shard_values_[shard_id], - _shards_task_pool[shard_id], mode, shard_id); - total_ins += shard_save_num; - } - vs->close(); - - std::string meta_ = string::Sprintf("%s/%s.meta", var_store, shard_var_pre); - std::unique_ptr ms(new std::ofstream(meta_)); - SaveMetaToText(ms.get(), _config.common(), _shard_idx, total_ins); - ms->close(); - - auto end = GetCurrentUS(); - rwlock_->UNLock(); - VLOG(0) << "save " << varname << " with path: " << value_ - << " using: " << std::to_string((end - begin) / 1e+6) << " seconds"; - - return 0; -} - -std::pair CommonSparseTable::print_table_stat() { - int64_t feasign_size = 0; - int64_t mf_size = 0; - - for (auto& shard : shard_values_) { - for (auto& table : shard->values_) { - feasign_size += table.size(); - } - } - - return {feasign_size, mf_size}; -} - -int32_t CommonSparseTable::pour() { - std::vector values; - std::vector keys; - - keys.reserve(pull_reservoir_.size()); - values.reserve(pull_reservoir_.size() * param_dim_); - - for (auto& val : pull_reservoir_) { - keys.push_back(val.first); - auto& reservoir = val.second; - reservoir.avg(); - std::copy(reservoir.values.begin(), reservoir.values.end(), - std::back_inserter(values)); - } - _push_sparse(keys.data(), values.data(), pull_reservoir_.size()); - - pull_reservoir_.clear(); - return 0; -} - -int32_t CommonSparseTable::Pull(TableContext& context) { - CHECK(context.value_type == Sparse); - if (context.use_ptr) { - char** pull_values = context.pull_context.ptr_values; - 
const uint64_t* keys = context.pull_context.keys; - return pull_sparse_ptr(pull_values, keys, context.num); - } else { - float* pull_values = context.pull_context.values; - const PullSparseValue& pull_value = context.pull_context.pull_value; - return pull_sparse(pull_values, pull_value); - } -} - -int32_t CommonSparseTable::Push(TableContext& context) { - CHECK(context.value_type == Sparse); - if (context.push_context.values != nullptr) { - const float* values = context.push_context.values; - const uint64_t* keys = context.push_context.keys; - return push_sparse(keys, values, context.num); - } else { - const float** values = context.push_context.ptr_values; - const uint64_t* keys = context.push_context.keys; - return push_sparse(keys, values, context.num); - } -} - -int32_t CommonSparseTable::pull_sparse(float* pull_values, - const PullSparseValue& pull_value) { - auto shard_num = task_pool_size_; - std::vector> tasks(shard_num); - - for (int shard_id = 0; shard_id < shard_num; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, shard_num, &pull_value, &pull_values]() -> int { - auto& block = shard_values_[shard_id]; - - std::vector offsets; - pull_value.Fission(shard_id, shard_num, &offsets); - - if (pull_value.is_training_) { - for (auto& offset : offsets) { - auto feasign = pull_value.feasigns_[offset]; - auto frequencie = pull_value.frequencies_[offset]; - auto* value = block->Init(feasign, true, frequencie); - std::copy_n(value + param_offset_, param_dim_, - pull_values + param_dim_ * offset); - } - } else { - for (auto& offset : offsets) { - auto feasign = pull_value.feasigns_[offset]; - auto* value = block->Init(feasign, false); - std::copy_n(value + param_offset_, param_dim_, - pull_values + param_dim_ * offset); - } - } - - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t CommonSparseTable::pull_sparse_ptr(char** pull_values, - const uint64_t* keys, size_t num) { - std::vector> offset_bucket; - offset_bucket.resize(task_pool_size_); - - for (int x = 0; x < num; ++x) { - auto y = keys[x] % task_pool_size_; - offset_bucket[y].push_back(x); - } - - std::vector> tasks(task_pool_size_); - - for (int shard_id = 0; shard_id < task_pool_size_; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, &keys, &offset_bucket, &pull_values]() -> int { - auto& block = shard_values_[shard_id]; - auto& offsets = offset_bucket[shard_id]; - - for (int i = 0; i < offsets.size(); ++i) { - auto offset = offsets[i]; - auto id = keys[offset]; - auto* value = block->InitGet(id); - // std::copy_n(value + param_offset_, param_dim_, - // pull_values + param_dim_ * offset); - pull_values[offset] = reinterpret_cast(value); - } - - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t CommonSparseTable::_push_sparse(const uint64_t* keys, - const float* values, size_t num) { - std::vector> offset_bucket; - offset_bucket.resize(task_pool_size_); - - for (int x = 0; x < num; ++x) { - auto y = keys[x] % task_pool_size_; - offset_bucket[y].push_back(x); - } - - std::vector> tasks(task_pool_size_); - - for (int shard_id = 0; shard_id < task_pool_size_; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, &keys, &values, num, &offset_bucket]() -> int { - auto& offsets = offset_bucket[shard_id]; - optimizer_->update(keys, 
values, num, offsets, - shard_values_[shard_id].get()); - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t CommonSparseTable::push_sparse(const uint64_t* keys, - const float* values, size_t num) { - if (sync) { - std::future task = - _shards_task_pool[0]->enqueue([this, &keys, &values, num]() -> int { - for (int x = 0; x < num; ++x) { - auto id = keys[x]; - auto has = pull_reservoir_.find(id); - - if (has == pull_reservoir_.end()) { - pull_reservoir_[id] = ReservoirValue(param_dim_); - } - - auto& reservoir = pull_reservoir_[id]; - reservoir.add(values + x * param_dim_, param_dim_); - } - return 0; - }); - task.wait(); - } else { - _push_sparse(keys, values, num); - } - - return 0; -} - -int32_t CommonSparseTable::push_sparse(const uint64_t* keys, - const float** values, size_t num) { - _push_sparse(keys, values, num); - return 0; -} - -int32_t CommonSparseTable::_push_sparse(const uint64_t* keys, - const float** values, size_t num) { - std::vector> offset_bucket; - offset_bucket.resize(task_pool_size_); - - for (int x = 0; x < num; ++x) { - auto y = keys[x] % task_pool_size_; - offset_bucket[y].push_back(x); - } - - std::vector> tasks(task_pool_size_); - - for (int shard_id = 0; shard_id < task_pool_size_; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, &keys, &values, num, &offset_bucket]() -> int { - auto& offsets = offset_bucket[shard_id]; - for (size_t i = 0; i < offsets.size(); ++i) { - std::vector tmp_off = {0}; - optimizer_->update(keys + offsets[i], values[offsets[i]], num, - tmp_off, shard_values_[shard_id].get()); - } - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t CommonSparseTable::push_sparse_param(const uint64_t* keys, - const float* values, size_t num) { - std::vector> offset_bucket; - offset_bucket.resize(task_pool_size_); - - for (int x = 0; x < num; ++x) { - auto y = keys[x] % task_pool_size_; - offset_bucket[y].push_back(x); - } - - std::vector> tasks(task_pool_size_); - - for (int shard_id = 0; shard_id < task_pool_size_; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, &keys, &offset_bucket, &values]() -> int { - auto& block = shard_values_[shard_id]; - auto& offsets = offset_bucket[shard_id]; - - for (int i = 0; i < offsets.size(); ++i) { - auto offset = offsets[i]; - auto id = keys[offset]; - auto* value = block->Init(id, false); - std::copy_n(values + param_dim_ * offset, param_dim_, - value + param_offset_); - block->SetEntry(id, true); - } - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t CommonSparseTable::flush() { return 0; } - -int32_t CommonSparseTable::shrink(const std::string& param) { - int threshold = std::stoi(param); - VLOG(3) << "sparse table shrink: " << threshold; - - for (int shard_id = 0; shard_id < task_pool_size_; ++shard_id) { - // shrink - VLOG(4) << shard_id << " " << task_pool_size_ << " begin shrink"; - shard_values_[shard_id]->Shrink(threshold); - } - return 0; -} - -void CommonSparseTable::clear() { VLOG(0) << "clear coming soon"; } - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/common_sparse_table.h b/paddle/fluid/distributed/ps/table/common_sparse_table.h index 138c544742066..2673e8dfae3c6 100644 --- 
a/paddle/fluid/distributed/ps/table/common_sparse_table.h
+++ b/paddle/fluid/distributed/ps/table/common_sparse_table.h
@@ -108,31 +108,30 @@ struct Meta {
   }
 };

-class CommonSparseTable : public SparseTable {
+class CommonSparseTable : public Table {
  public:
   CommonSparseTable() { rwlock_.reset(new phi::RWLock); }
   virtual ~CommonSparseTable() {}

   // unused method begin
-  virtual int32_t pull_dense(float* pull_values, size_t num) { return 0; }
-  virtual int32_t push_dense_param(const float* values, size_t num) {
-    return 0;
-  }
-  virtual int32_t push_dense(const float* values, size_t num) { return 0; }
+  // virtual int32_t PullDense(float* pull_values, size_t num) { return 0; }
+  // virtual int32_t PushDenseParam(const float* values, size_t num) { return
+  // 0; }
+  // virtual int32_t PushDense(const float* values, size_t num) { return 0; }
   // unused method end

   virtual int32_t Pull(TableContext& context);
   virtual int32_t Push(TableContext& context);

-  virtual int32_t initialize();
-  virtual int32_t initialize_shard() { return 0; }
-  virtual int32_t initialize_value();
-  virtual int32_t initialize_optimizer();
-  virtual int32_t initialize_recorder();
+  virtual int32_t Initialize();
+  virtual int32_t InitializeShard() { return 0; }
+  virtual int32_t InitializeValue();
+  virtual int32_t InitializeOptimizer();
+  virtual int32_t InitializeRecorder();

-  virtual int32_t load(const std::string& path, const std::string& param);
+  virtual int32_t Load(const std::string& path, const std::string& param);

-  virtual int32_t save(const std::string& path, const std::string& param);
+  virtual int32_t Save(const std::string& path, const std::string& param);

   void SaveMetaToText(std::ostream* os, const CommonAccessorParameter& common,
                       const size_t shard_idx, const int64_t total);
@@ -150,34 +149,35 @@ class CommonSparseTable : public SparseTable {
       const int pserver_id, const int pserver_num, const int local_shard_num,
       std::vector>* blocks);

-  virtual std::pair print_table_stat();
-  virtual int32_t pull_sparse(float* values, const PullSparseValue& pull_value);
+  virtual std::pair PrintTableStat();
+  virtual int32_t PullSparse(float* values, const PullSparseValue& pull_value);

-  virtual int32_t pull_sparse_ptr(char** pull_values, const uint64_t* keys,
-                                  size_t num);
+  virtual int32_t PullSparsePtr(char** pull_values, const uint64_t* keys,
+                                size_t num);

-  virtual int32_t push_sparse(const uint64_t* keys, const float* values,
-                              size_t num);
+  virtual int32_t PushSparse(const uint64_t* keys, const float* values,
+                             size_t num);

-  virtual int32_t push_sparse(const uint64_t* keys, const float** values,
-                              size_t num);
+  virtual int32_t PushSparse(const uint64_t* keys, const float** values,
+                             size_t num);

   // only for sparse geo table
-  virtual int32_t push_sparse_param(const uint64_t* keys, const float* values,
-                                    size_t num);
+  virtual int32_t PushSparseParam(const uint64_t* keys, const float* values,
+                                  size_t num);
+  virtual int32_t SetGlobalLR(float* lr);

-  virtual int32_t set_global_lr(float* lr) override;
+  virtual int32_t Pour();
+  virtual int32_t Flush();
+  virtual int32_t Shrink(const std::string& param);
+  virtual void Clear();

-  virtual int32_t pour();
-  virtual int32_t flush();
-  virtual int32_t shrink(const std::string& param);
-  virtual void clear();
+  virtual void* GetShard(size_t shard_idx) { return 0; }

  protected:
-  virtual int32_t _push_sparse(const uint64_t* keys, const float* values,
-                               size_t num);
-  virtual int32_t _push_sparse(const uint64_t* keys, const float** values,
-                               size_t num);
+  virtual 
int32_t _PushSparse(const uint64_t* keys, const float* values, + size_t num); + virtual int32_t _PushSparse(const uint64_t* keys, const float** values, + size_t num); protected: const int task_pool_size_ = 11; diff --git a/paddle/fluid/distributed/ps/table/common_table.h b/paddle/fluid/distributed/ps/table/common_table.h index 3d291c0152246..f69d9ccbf1453 100644 --- a/paddle/fluid/distributed/ps/table/common_table.h +++ b/paddle/fluid/distributed/ps/table/common_table.h @@ -66,95 +66,34 @@ struct ReservoirValue { } }; -class SparseTable : public Table { - public: - SparseTable() {} - virtual ~SparseTable() {} - - virtual void *get_shard(size_t shard_idx) { return 0; } - - int32_t pull_dense(float *values, size_t num) override { return 0; } - - int32_t push_dense(const float *values, size_t num) override { return 0; } - - static int32_t sparse_local_shard_num(uint32_t shard_num, - uint32_t server_num) { - if (shard_num % server_num == 0) { - return shard_num / server_num; - } - size_t local_shard_num = shard_num / server_num + 1; - return local_shard_num; - } - - static size_t get_sparse_shard(uint32_t shard_num, uint32_t server_num, - uint64_t key) { - return (key % shard_num) / sparse_local_shard_num(shard_num, server_num); - } -}; - -class DenseTable : public Table { - public: - DenseTable() {} - virtual ~DenseTable() {} - - virtual void *get_shard(size_t shard_idx) { return 0; } - int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) override { - return 0; - } - int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) override { - return 0; - } - int32_t push_dense_param(const float *values, size_t num) override { - return 0; - } - int32_t shrink(const std::string ¶m) override { return 0; } -}; - class BarrierTable : public Table { public: BarrierTable() {} virtual ~BarrierTable() {} - virtual void *get_shard(size_t shard_idx) { return 0; } + virtual void *GetShard(size_t shard_idx) { return 0; } virtual int32_t Pull(TableContext &context) { return 0; } virtual int32_t Push(TableContext &context) { return 0; } - int32_t pull_dense(float *values, size_t num) override { return 0; } - - int32_t push_dense(const float *values, size_t num) override { return 0; } - - int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) override { - return 0; - } - int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) override { - return 0; - } - int32_t push_dense_param(const float *values, size_t num) override { - return 0; - } - int32_t shrink(const std::string ¶m) override { return 0; } - virtual void clear() {} - virtual int32_t flush() { return 0; } - virtual int32_t load(const std::string &path, const std::string ¶m) { + int32_t Shrink(const std::string ¶m) override { return 0; } + virtual void Clear() {} + virtual int32_t Flush() { return 0; } + virtual int32_t Load(const std::string &path, const std::string ¶m) { return 0; } - virtual int32_t save(const std::string &path, const std::string ¶m) { + virtual int32_t Save(const std::string &path, const std::string ¶m) { return 0; } - virtual int32_t initialize_shard() { return 0; } + virtual int32_t InitializeShard() { return 0; } - virtual int32_t initialize() override; + virtual int32_t Initialize() override; // only for barrier // 0: send_barrier 1: recv_barrier 2: complete - virtual int32_t barrier(const uint32_t trainer_id, + virtual int32_t Barrier(const uint32_t trainer_id, const std::string barrier_type) override; - virtual int32_t set_table_map( + virtual int32_t 
SetTableMap( std::unordered_map> *table_map) override; private: diff --git a/paddle/fluid/distributed/ps/table/ctr_accessor.cc b/paddle/fluid/distributed/ps/table/ctr_accessor.cc index ffb97914fb8c0..4446c8297c5b3 100644 --- a/paddle/fluid/distributed/ps/table/ctr_accessor.cc +++ b/paddle/fluid/distributed/ps/table/ctr_accessor.cc @@ -20,92 +20,39 @@ namespace paddle { namespace distributed { -int CtrCommonAccessor::initialize() { +int CtrCommonAccessor::Initialize() { auto name = _config.embed_sgd_param().name(); _embed_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embed_sgd_rule->load_config(_config.embed_sgd_param(), 1); + _embed_sgd_rule->LoadConfig(_config.embed_sgd_param(), 1); name = _config.embedx_sgd_param().name(); _embedx_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embedx_sgd_rule->load_config(_config.embedx_sgd_param(), - _config.embedx_dim()); + _embedx_sgd_rule->LoadConfig(_config.embedx_sgd_param(), + _config.embedx_dim()); - common_feature_value.embed_sgd_dim = _embed_sgd_rule->dim(); + common_feature_value.embed_sgd_dim = _embed_sgd_rule->Dim(); common_feature_value.embedx_dim = _config.embedx_dim(); - common_feature_value.embedx_sgd_dim = _embedx_sgd_rule->dim(); + common_feature_value.embedx_sgd_dim = _embedx_sgd_rule->Dim(); _show_click_decay_rate = _config.ctr_accessor_param().show_click_decay_rate(); + InitAccessorInfo(); return 0; } -void CtrCommonAccessor::SetTableInfo(AccessorInfo& info) { - info.dim = dim(); - info.size = size(); - info.select_dim = select_dim(); - info.select_size = select_size(); - info.update_dim = update_dim(); - info.update_size = update_size(); - info.mf_size = mf_size(); - info.fea_dim = fea_dim(); -} - -size_t CtrCommonAccessor::GetTableInfo(InfoKey key) { - switch (key) { - case DIM: - return dim(); - case SIZE: - return size(); - case SELECT_DIM: - return select_dim(); - case SELECT_SIZE: - return select_size(); - case UPDATE_DIM: - return update_dim(); - case UPDATE_SIZE: - return update_size(); - case MF_SIZE: - return mf_size(); - case FEA_DIM: - return fea_dim(); - } - return 0; -} +void CtrCommonAccessor::InitAccessorInfo() { + _accessor_info.dim = common_feature_value.Dim(); + _accessor_info.size = common_feature_value.Size(); -size_t CtrCommonAccessor::dim() { return common_feature_value.dim(); } - -size_t CtrCommonAccessor::dim_size(size_t dim) { auto embedx_dim = _config.embedx_dim(); - return common_feature_value.dim_size(dim, embedx_dim); + _accessor_info.select_dim = 3 + embedx_dim; + _accessor_info.select_size = _accessor_info.select_dim * sizeof(float); + _accessor_info.update_dim = 4 + embedx_dim; + _accessor_info.update_size = _accessor_info.update_dim * sizeof(float); + _accessor_info.mf_size = + (embedx_dim + common_feature_value.embedx_sgd_dim) * sizeof(float); } -size_t CtrCommonAccessor::size() { return common_feature_value.size(); } - -size_t CtrCommonAccessor::mf_size() { - return (_config.embedx_dim() + common_feature_value.embedx_sgd_dim) * - sizeof(float); // embedx embedx_g2sum -} - -// pull value -size_t CtrCommonAccessor::select_dim() { - auto embedx_dim = _config.embedx_dim(); - return 3 + embedx_dim; -} - -size_t CtrCommonAccessor::select_dim_size(size_t dim) { return sizeof(float); } - -size_t CtrCommonAccessor::select_size() { return select_dim() * sizeof(float); } - -// push value -size_t CtrCommonAccessor::update_dim() { - auto embedx_dim = _config.embedx_dim(); - return 4 + embedx_dim; -} - -size_t CtrCommonAccessor::update_dim_size(size_t dim) { return sizeof(float); } 
- -size_t CtrCommonAccessor::update_size() { return update_dim() * sizeof(float); } - -bool CtrCommonAccessor::shrink(float* value) { +bool CtrCommonAccessor::Shrink(float* value) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delete_after_unseen_days = @@ -113,20 +60,20 @@ bool CtrCommonAccessor::shrink(float* value) { auto delete_threshold = _config.ctr_accessor_param().delete_threshold(); // time_decay first - common_feature_value.show(value) *= _show_click_decay_rate; - common_feature_value.click(value) *= _show_click_decay_rate; + common_feature_value.Show(value) *= _show_click_decay_rate; + common_feature_value.Click(value) *= _show_click_decay_rate; // shrink after - auto score = show_click_score(common_feature_value.show(value), - common_feature_value.click(value)); - auto unseen_days = common_feature_value.unseen_days(value); + auto score = ShowClickScore(common_feature_value.Show(value), + common_feature_value.Click(value)); + auto unseen_days = common_feature_value.UnseenDays(value); if (score < delete_threshold || unseen_days > delete_after_unseen_days) { return true; } return false; } -bool CtrCommonAccessor::save(float* value, int param) { +bool CtrCommonAccessor::Save(float* value, int param) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); @@ -142,14 +89,13 @@ bool CtrCommonAccessor::save(float* value, int param) { case 1: // save xbox base case 2: { - if (show_click_score(common_feature_value.show(value), - common_feature_value.click(value)) >= - base_threshold && - common_feature_value.delta_score(value) >= delta_threshold && - common_feature_value.unseen_days(value) <= delta_keep_days) { + if (ShowClickScore(common_feature_value.Show(value), + common_feature_value.Click(value)) >= base_threshold && + common_feature_value.DeltaScore(value) >= delta_threshold && + common_feature_value.UnseenDays(value) <= delta_keep_days) { // do this after save, because it must not be modified when retry if (param == 2) { - common_feature_value.delta_score(value) = 0; + common_feature_value.DeltaScore(value) = 0; } return true; } else { @@ -159,7 +105,7 @@ bool CtrCommonAccessor::save(float* value, int param) { // already decayed in shrink case 3: { // do this after save, because it must not be modified when retry - // common_feature_value.unseen_days(value)++; + // common_feature_value.UnseenDays(value)++; return true; } // save revert batch_model @@ -171,7 +117,7 @@ bool CtrCommonAccessor::save(float* value, int param) { } } -void CtrCommonAccessor::update_stat_after_save(float* value, int param) { +void CtrCommonAccessor::UpdateStatAfterSave(float* value, int param) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); @@ -180,17 +126,16 @@ void CtrCommonAccessor::update_stat_after_save(float* value, int param) { } switch (param) { case 1: { - if (show_click_score(common_feature_value.show(value), - common_feature_value.click(value)) >= - base_threshold && - common_feature_value.delta_score(value) >= delta_threshold && - common_feature_value.unseen_days(value) <= delta_keep_days) { - common_feature_value.delta_score(value) = 0; + if 
(ShowClickScore(common_feature_value.Show(value), + common_feature_value.Click(value)) >= base_threshold && + common_feature_value.DeltaScore(value) >= delta_threshold && + common_feature_value.UnseenDays(value) <= delta_keep_days) { + common_feature_value.DeltaScore(value) = 0; } } return; case 3: { - common_feature_value.unseen_days(value)++; + common_feature_value.UnseenDays(value)++; } return; default: @@ -198,52 +143,51 @@ void CtrCommonAccessor::update_stat_after_save(float* value, int param) { } } -int32_t CtrCommonAccessor::create(float** values, size_t num) { +int32_t CtrCommonAccessor::Create(float** values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* value = values[value_item]; - value[common_feature_value.unseen_days_index()] = 0; - value[common_feature_value.delta_score_index()] = 0; - value[common_feature_value.show_index()] = 0; - value[common_feature_value.click_index()] = 0; - value[common_feature_value.slot_index()] = -1; - _embed_sgd_rule->init_value( - value + common_feature_value.embed_w_index(), - value + common_feature_value.embed_g2sum_index()); - _embedx_sgd_rule->init_value( - value + common_feature_value.embedx_w_index(), - value + common_feature_value.embedx_g2sum_index(), false); + value[common_feature_value.UnseenDaysIndex()] = 0; + value[common_feature_value.DeltaScoreIndex()] = 0; + value[common_feature_value.ShowIndex()] = 0; + value[common_feature_value.ClickIndex()] = 0; + value[common_feature_value.SlotIndex()] = -1; + _embed_sgd_rule->InitValue(value + common_feature_value.EmbedWIndex(), + value + common_feature_value.EmbedG2SumIndex()); + _embedx_sgd_rule->InitValue(value + common_feature_value.EmbedxWIndex(), + value + common_feature_value.EmbedxG2SumIndex(), + false); } return 0; } -bool CtrCommonAccessor::need_extend_mf(float* value) { - float show = value[common_feature_value.show_index()]; - float click = value[common_feature_value.click_index()]; +bool CtrCommonAccessor::NeedExtendMF(float* value) { + float show = value[common_feature_value.ShowIndex()]; + float click = value[common_feature_value.ClickIndex()]; float score = (show - click) * _config.ctr_accessor_param().nonclk_coeff() + click * _config.ctr_accessor_param().click_coeff(); return score >= _config.embedx_threshold(); } -bool CtrCommonAccessor::has_mf(size_t size) { - return size > common_feature_value.embedx_g2sum_index(); +bool CtrCommonAccessor::HasMF(size_t size) { + return size > common_feature_value.EmbedxG2SumIndex(); } // from CommonFeatureValue to CtrCommonPullValue -int32_t CtrCommonAccessor::select(float** select_values, const float** values, +int32_t CtrCommonAccessor::Select(float** select_values, const float** values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* select_value = select_values[value_item]; const float* value = values[value_item]; - select_value[CtrCommonPullValue::show_index()] = - value[common_feature_value.show_index()]; - select_value[CtrCommonPullValue::click_index()] = - value[common_feature_value.click_index()]; - select_value[CtrCommonPullValue::embed_w_index()] = - value[common_feature_value.embed_w_index()]; - memcpy(select_value + CtrCommonPullValue::embedx_w_index(), - value + common_feature_value.embedx_w_index(), + select_value[CtrCommonPullValue::ShowIndex()] = + value[common_feature_value.ShowIndex()]; + select_value[CtrCommonPullValue::ClickIndex()] = + 
value[common_feature_value.ClickIndex()]; + select_value[CtrCommonPullValue::EmbedWIndex()] = + value[common_feature_value.EmbedWIndex()]; + memcpy(select_value + CtrCommonPullValue::EmbedxWIndex(), + value + common_feature_value.EmbedxWIndex(), embedx_dim * sizeof(float)); } return 0; @@ -252,16 +196,16 @@ int32_t CtrCommonAccessor::select(float** select_values, const float** values, // from CtrCommonPushValue to CtrCommonPushValue // first dim: item // second dim: field num -int32_t CtrCommonAccessor::merge(float** update_values, +int32_t CtrCommonAccessor::Merge(float** update_values, const float** other_update_values, size_t num) { auto embedx_dim = _config.embedx_dim(); - size_t total_dim = CtrCommonPushValue::dim(embedx_dim); + size_t total_dim = CtrCommonPushValue::Dim(embedx_dim); for (size_t value_item = 0; value_item < num; ++value_item) { float* update_value = update_values[value_item]; const float* other_update_value = other_update_values[value_item]; for (auto i = 0u; i < total_dim; ++i) { - if (i != CtrCommonPushValue::slot_index()) { + if (i != CtrCommonPushValue::SlotIndex()) { update_value[i] += other_update_value[i]; } } @@ -272,44 +216,45 @@ int32_t CtrCommonAccessor::merge(float** update_values, // from CtrCommonPushValue to CommonFeatureValue // first dim: item // second dim: field num -int32_t CtrCommonAccessor::update(float** update_values, +int32_t CtrCommonAccessor::Update(float** update_values, const float** push_values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* update_value = update_values[value_item]; const float* push_value = push_values[value_item]; - float push_show = push_value[CtrCommonPushValue::show_index()]; - float push_click = push_value[CtrCommonPushValue::click_index()]; - float slot = push_value[CtrCommonPushValue::slot_index()]; - update_value[common_feature_value.show_index()] += push_show; - update_value[common_feature_value.click_index()] += push_click; - update_value[common_feature_value.slot_index()] = slot; - update_value[common_feature_value.delta_score_index()] += + float push_show = push_value[CtrCommonPushValue::ShowIndex()]; + float push_click = push_value[CtrCommonPushValue::ClickIndex()]; + float slot = push_value[CtrCommonPushValue::SlotIndex()]; + update_value[common_feature_value.ShowIndex()] += push_show; + update_value[common_feature_value.ClickIndex()] += push_click; + update_value[common_feature_value.SlotIndex()] = slot; + update_value[common_feature_value.DeltaScoreIndex()] += (push_show - push_click) * _config.ctr_accessor_param().nonclk_coeff() + push_click * _config.ctr_accessor_param().click_coeff(); - update_value[common_feature_value.unseen_days_index()] = 0; - _embed_sgd_rule->update_value( - update_value + common_feature_value.embed_w_index(), - update_value + common_feature_value.embed_g2sum_index(), - push_value + CtrCommonPushValue::embed_g_index()); - _embedx_sgd_rule->update_value( - update_value + common_feature_value.embedx_w_index(), - update_value + common_feature_value.embedx_g2sum_index(), - push_value + CtrCommonPushValue::embedx_g_index()); + update_value[common_feature_value.UnseenDaysIndex()] = 0; + // TODO(zhaocaibei123): add configure show_scale + _embed_sgd_rule->UpdateValue( + update_value + common_feature_value.EmbedWIndex(), + update_value + common_feature_value.EmbedG2SumIndex(), + push_value + CtrCommonPushValue::EmbedGIndex(), push_show); + _embedx_sgd_rule->UpdateValue( + update_value + 
common_feature_value.EmbedxWIndex(), + update_value + common_feature_value.EmbedxG2SumIndex(), + push_value + CtrCommonPushValue::EmbedxGIndex(), push_show); } return 0; } -bool CtrCommonAccessor::create_value(int stage, const float* value) { +bool CtrCommonAccessor::CreateValue(int stage, const float* value) { // stage == 0, pull // stage == 1, push if (stage == 0) { return true; } else if (stage == 1) { // operation - auto show = CtrCommonPushValue::show(const_cast(value)); - auto click = CtrCommonPushValue::click(const_cast(value)); - auto score = show_click_score(show, click); + auto show = CtrCommonPushValue::Show(const_cast(value)); + auto click = CtrCommonPushValue::Click(const_cast(value)); + auto score = ShowClickScore(show, click); if (score <= 0) { return false; } @@ -323,41 +268,40 @@ bool CtrCommonAccessor::create_value(int stage, const float* value) { } } -float CtrCommonAccessor::show_click_score(float show, float click) { +float CtrCommonAccessor::ShowClickScore(float show, float click) { auto nonclk_coeff = _config.ctr_accessor_param().nonclk_coeff(); auto click_coeff = _config.ctr_accessor_param().click_coeff(); return (show - click) * nonclk_coeff + click * click_coeff; } -std::string CtrCommonAccessor::parse_to_string(const float* v, int param) { +std::string CtrCommonAccessor::ParseToString(const float* v, int param) { thread_local std::ostringstream os; os.clear(); os.str(""); os << v[0] << " " << v[1] << " " << v[2] << " " << v[3] << " " << v[4] << " " << v[5]; - for (int i = common_feature_value.embed_g2sum_index(); - i < common_feature_value.embedx_w_index(); i++) { + for (int i = common_feature_value.EmbedG2SumIndex(); + i < common_feature_value.EmbedxWIndex(); i++) { os << " " << v[i]; } - auto show = common_feature_value.show(const_cast(v)); - auto click = common_feature_value.click(const_cast(v)); - auto score = show_click_score(show, click); + auto show = common_feature_value.Show(const_cast(v)); + auto click = common_feature_value.Click(const_cast(v)); + auto score = ShowClickScore(show, click); if (score >= _config.embedx_threshold() && - param > common_feature_value.embedx_w_index()) { - for (auto i = common_feature_value.embedx_w_index(); - i < common_feature_value.dim(); ++i) { + param > common_feature_value.EmbedxWIndex()) { + for (auto i = common_feature_value.EmbedxWIndex(); + i < common_feature_value.Dim(); ++i) { os << " " << v[i]; } } return os.str(); } -int CtrCommonAccessor::parse_from_string(const std::string& str, float* value) { +int CtrCommonAccessor::ParseFromString(const std::string& str, float* value) { int embedx_dim = _config.embedx_dim(); - _embedx_sgd_rule->init_value( - value + common_feature_value.embedx_w_index(), - value + common_feature_value.embedx_g2sum_index()); + _embedx_sgd_rule->InitValue(value + common_feature_value.EmbedxWIndex(), + value + common_feature_value.EmbedxG2SumIndex()); auto ret = paddle::string::str_to_float(str.data(), value); CHECK(ret >= 6) << "expect more than 6 real:" << ret; return ret; diff --git a/paddle/fluid/distributed/ps/table/ctr_accessor.h b/paddle/fluid/distributed/ps/table/ctr_accessor.h index a2121b21d9fe6..b8895e74d1d09 100644 --- a/paddle/fluid/distributed/ps/table/ctr_accessor.h +++ b/paddle/fluid/distributed/ps/table/ctr_accessor.h @@ -40,28 +40,28 @@ class CtrCommonAccessor : public ValueAccessor { std::float embedx_g2sum; */ - int dim() { return 6 + embed_sgd_dim + embedx_sgd_dim + embedx_dim; } - int dim_size(size_t dim, int embedx_dim) { return sizeof(float); } - int size() { return 
dim() * sizeof(float); } - int slot_index() { return 0; } - int unseen_days_index() { return slot_index() + 1; } - int delta_score_index() { return unseen_days_index() + 1; } - int show_index() { return delta_score_index() + 1; } - int click_index() { return show_index() + 1; } - int embed_w_index() { return click_index() + 1; } - int embed_g2sum_index() { return embed_w_index() + 1; } - int embedx_w_index() { return embed_g2sum_index() + embed_sgd_dim; } - int embedx_g2sum_index() { return embedx_w_index() + embedx_dim; } - - float& unseen_days(float* val) { return val[unseen_days_index()]; } - float& delta_score(float* val) { return val[delta_score_index()]; } - float& show(float* val) { return val[show_index()]; } - float& click(float* val) { return val[click_index()]; } - float& slot(float* val) { return val[slot_index()]; } - float& embed_w(float* val) { return val[embed_w_index()]; } - float& embed_g2sum(float* val) { return val[embed_g2sum_index()]; } - float& embedx_w(float* val) { return val[embedx_w_index()]; } - float& embedx_g2sum(float* val) { return val[embedx_g2sum_index()]; } + int Dim() { return 6 + embed_sgd_dim + embedx_sgd_dim + embedx_dim; } + int DimSize(size_t dim, int embedx_dim) { return sizeof(float); } + int Size() { return Dim() * sizeof(float); } + int SlotIndex() { return 0; } + int UnseenDaysIndex() { return SlotIndex() + 1; } + int DeltaScoreIndex() { return UnseenDaysIndex() + 1; } + int ShowIndex() { return DeltaScoreIndex() + 1; } + int ClickIndex() { return ShowIndex() + 1; } + int EmbedWIndex() { return ClickIndex() + 1; } + int EmbedG2SumIndex() { return EmbedWIndex() + 1; } + int EmbedxWIndex() { return EmbedG2SumIndex() + embed_sgd_dim; } + int EmbedxG2SumIndex() { return EmbedxWIndex() + embedx_dim; } + + float& UnseenDays(float* val) { return val[UnseenDaysIndex()]; } + float& DeltaScore(float* val) { return val[DeltaScoreIndex()]; } + float& Show(float* val) { return val[ShowIndex()]; } + float& Click(float* val) { return val[ClickIndex()]; } + float& Slot(float* val) { return val[SlotIndex()]; } + float& EmbedW(float* val) { return val[EmbedWIndex()]; } + float& EmbedG2Sum(float* val) { return val[EmbedG2SumIndex()]; } + float& EmbedxW(float* val) { return val[EmbedxWIndex()]; } + float& EmbedxG2Sum(float* val) { return val[EmbedxG2SumIndex()]; } int embed_sgd_dim; int embedx_dim; @@ -77,31 +77,29 @@ class CtrCommonAccessor : public ValueAccessor { std::vector embedx_g; */ - static int dim(int embedx_dim) { return 4 + embedx_dim; } - - static int dim_size(int dim, int embedx_dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int slot_index() { return 0; } - static int show_index() { return CtrCommonPushValue::slot_index() + 1; } - static int click_index() { return CtrCommonPushValue::show_index() + 1; } - static int embed_g_index() { return CtrCommonPushValue::click_index() + 1; } - static int embedx_g_index() { - return CtrCommonPushValue::embed_g_index() + 1; - } - static float& slot(float* val) { - return val[CtrCommonPushValue::slot_index()]; + static int Dim(int embedx_dim) { return 4 + embedx_dim; } + + static int DimSize(int dim, int embedx_dim) { return sizeof(float); } + static int Size(int embedx_dim) { return Dim(embedx_dim) * sizeof(float); } + static int SlotIndex() { return 0; } + static int ShowIndex() { return CtrCommonPushValue::SlotIndex() + 1; } + static int ClickIndex() { return CtrCommonPushValue::ShowIndex() + 1; } + static int EmbedGIndex() { return 
CtrCommonPushValue::ClickIndex() + 1; } + static int EmbedxGIndex() { return CtrCommonPushValue::EmbedGIndex() + 1; } + static float& Slot(float* val) { + return val[CtrCommonPushValue::SlotIndex()]; } - static float& show(float* val) { - return val[CtrCommonPushValue::show_index()]; + static float& Show(float* val) { + return val[CtrCommonPushValue::ShowIndex()]; } - static float& click(float* val) { - return val[CtrCommonPushValue::click_index()]; + static float& Click(float* val) { + return val[CtrCommonPushValue::ClickIndex()]; } - static float& embed_g(float* val) { - return val[CtrCommonPushValue::embed_g_index()]; + static float& EmbedG(float* val) { + return val[CtrCommonPushValue::EmbedGIndex()]; } - static float* embedx_g(float* val) { - return val + CtrCommonPushValue::embedx_g_index(); + static float* EmbedxG(float* val) { + return val + CtrCommonPushValue::EmbedxGIndex(); } }; @@ -113,96 +111,75 @@ class CtrCommonAccessor : public ValueAccessor { std::vector embedx_w; */ - static int dim(int embedx_dim) { return 3 + embedx_dim; } - static int dim_size(size_t dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int show_index() { return 0; } - static int click_index() { return 1; } - static int embed_w_index() { return 2; } - static int embedx_w_index() { return 3; } - static float& show(float* val) { - return val[CtrCommonPullValue::show_index()]; + static int Dim(int embedx_dim) { return 3 + embedx_dim; } + static int DimSize(size_t dim) { return sizeof(float); } + static int Size(int embedx_dim) { return Dim(embedx_dim) * sizeof(float); } + static int ShowIndex() { return 0; } + static int ClickIndex() { return 1; } + static int EmbedWIndex() { return 2; } + static int EmbedxWIndex() { return 3; } + static float& Show(float* val) { + return val[CtrCommonPullValue::ShowIndex()]; } - static float& click(float* val) { - return val[CtrCommonPullValue::click_index()]; + static float& Click(float* val) { + return val[CtrCommonPullValue::ClickIndex()]; } - static float& embed_w(float* val) { - return val[CtrCommonPullValue::embed_w_index()]; + static float& EmbedW(float* val) { + return val[CtrCommonPullValue::EmbedWIndex()]; } - static float* embedx_w(float* val) { - return val + CtrCommonPullValue::embedx_w_index(); + static float* EmbedxW(float* val) { + return val + CtrCommonPullValue::EmbedxWIndex(); } }; CtrCommonAccessor() {} - virtual int initialize(); virtual ~CtrCommonAccessor() {} - - virtual void SetTableInfo(AccessorInfo& info); - virtual size_t GetTableInfo(InfoKey key); - // value维度 - virtual size_t dim(); - // value各个维度的size - virtual size_t dim_size(size_t dim); - // value各维度相加总size - virtual size_t size(); - // value中mf动态长度部分总size大小, sparse下生效 - virtual size_t mf_size(); - // pull value维度 - virtual size_t select_dim(); - // pull value各个维度的size - virtual size_t select_dim_size(size_t dim); - // pull value各维度相加总size - virtual size_t select_size(); - // push value维度 - virtual size_t update_dim(); - // push value各个维度的size - virtual size_t update_dim_size(size_t dim); - // push value各维度相加总size - virtual size_t update_size(); + virtual int Initialize(); + // 初始化AccessorInfo + virtual void InitAccessorInfo(); // 判断该value是否进行shrink - virtual bool shrink(float* value); + virtual bool Shrink(float* value); // 判断该value是否保存到ssd // virtual bool save_ssd(float* value); - virtual bool need_extend_mf(float* value); - virtual bool has_mf(size_t size); + virtual bool NeedExtendMF(float* value); + virtual bool 
HasMF(size_t size); // 判断该value是否在save阶段dump, // param作为参数用于标识save阶段,如downpour的xbox与batch_model // param = 0, save all feature // param = 1, save delta feature // param = 2, save xbox base feature - bool save(float* value, int param) override; + bool Save(float* value, int param) override; // update delta_score and unseen_days after save - void update_stat_after_save(float* value, int param) override; + void UpdateStatAfterSave(float* value, int param) override; // keys不存在时,为values生成随机值 // 要求value的内存由外部调用者分配完毕 - virtual int32_t create(float** value, size_t num); + virtual int32_t Create(float** value, size_t num); // 从values中选取到select_values中 - virtual int32_t select(float** select_values, const float** values, + virtual int32_t Select(float** select_values, const float** values, size_t num); // 将update_values聚合到一起 - virtual int32_t merge(float** update_values, + virtual int32_t Merge(float** update_values, const float** other_update_values, size_t num); // 将update_values聚合到一起,通过it.next判定是否进入下一个key - // virtual int32_t merge(float** update_values, iterator it); + // virtual int32_t Merge(float** update_values, iterator it); // 将update_values更新应用到values中 - virtual int32_t update(float** values, const float** update_values, + virtual int32_t Update(float** values, const float** update_values, size_t num); - std::string parse_to_string(const float* value, int param) override; - int32_t parse_from_string(const std::string& str, float* v) override; - virtual bool create_value(int type, const float* value); + std::string ParseToString(const float* value, int param) override; + int32_t ParseFromString(const std::string& str, float* v) override; + virtual bool CreateValue(int type, const float* value); // 这个接口目前只用来取show - float get_field(float* value, const std::string& name) override { + float GetField(float* value, const std::string& name) override { // CHECK(name == "show"); if (name == "show") { - return common_feature_value.show(value); + return common_feature_value.Show(value); } return 0.0; } private: - // float show_click_score(float show, float click); + // float ShowClickScore(float show, float click); // SparseValueSGDRule* _embed_sgd_rule; // SparseValueSGDRule* _embedx_sgd_rule; @@ -213,7 +190,7 @@ class CtrCommonAccessor : public ValueAccessor { public: // TODO(zhaocaibei123): it should be private, but we make it public // for unit test CtrCommonFeatureValue common_feature_value; - float show_click_score(float show, float click); + float ShowClickScore(float show, float click); SparseValueSGDRule* _embed_sgd_rule; SparseValueSGDRule* _embedx_sgd_rule; }; diff --git a/paddle/fluid/distributed/ps/table/ctr_double_accessor.cc b/paddle/fluid/distributed/ps/table/ctr_double_accessor.cc index 0e3df6e82521d..740b03a84e461 100644 --- a/paddle/fluid/distributed/ps/table/ctr_double_accessor.cc +++ b/paddle/fluid/distributed/ps/table/ctr_double_accessor.cc @@ -20,94 +20,36 @@ namespace paddle { namespace distributed { -int DownpourCtrDoubleAccessor::initialize() { +int DownpourCtrDoubleAccessor::Initialize() { auto name = _config.embed_sgd_param().name(); _embed_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embed_sgd_rule->load_config(_config.embed_sgd_param(), 1); + _embed_sgd_rule->LoadConfig(_config.embed_sgd_param(), 1); name = _config.embedx_sgd_param().name(); _embedx_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embedx_sgd_rule->load_config(_config.embedx_sgd_param(), - _config.embedx_dim()); + _embedx_sgd_rule->LoadConfig(_config.embedx_sgd_param(), + 
_config.embedx_dim()); _show_click_decay_rate = _config.ctr_accessor_param().show_click_decay_rate(); _ssd_unseenday_threshold = _config.ctr_accessor_param().ssd_unseenday_threshold(); + InitAccessorInfo(); return 0; } -void DownpourCtrDoubleAccessor::SetTableInfo(AccessorInfo& info) { - info.dim = dim(); - info.size = size(); - info.select_dim = select_dim(); - info.select_size = select_size(); - info.update_dim = update_dim(); - info.update_size = update_size(); - info.mf_size = mf_size(); - info.fea_dim = fea_dim(); -} - -size_t DownpourCtrDoubleAccessor::GetTableInfo(InfoKey key) { - switch (key) { - case DIM: - return dim(); - case SIZE: - return size(); - case SELECT_DIM: - return select_dim(); - case SELECT_SIZE: - return select_size(); - case UPDATE_DIM: - return update_dim(); - case UPDATE_SIZE: - return update_size(); - case MF_SIZE: - return mf_size(); - case FEA_DIM: - return fea_dim(); - } - return 0; -} - -size_t DownpourCtrDoubleAccessor::dim() { - auto embedx_dim = _config.embedx_dim(); - return DownpourCtrDoubleFeatureValue::dim(embedx_dim); -} -size_t DownpourCtrDoubleAccessor::dim_size(size_t dim) { - auto embedx_dim = _config.embedx_dim(); - return DownpourCtrDoubleFeatureValue::dim_size(dim, embedx_dim); -} -size_t DownpourCtrDoubleAccessor::size() { +void DownpourCtrDoubleAccessor::InitAccessorInfo() { auto embedx_dim = _config.embedx_dim(); - return DownpourCtrDoubleFeatureValue::size(embedx_dim); + _accessor_info.dim = DownpourCtrDoubleFeatureValue::Dim(embedx_dim); + _accessor_info.size = DownpourCtrDoubleFeatureValue::Size(embedx_dim); + _accessor_info.select_dim = 3 + embedx_dim; + _accessor_info.select_size = _accessor_info.select_dim * sizeof(float); + _accessor_info.update_dim = 4 + embedx_dim; + _accessor_info.update_size = _accessor_info.update_dim * sizeof(float); + _accessor_info.mf_size = (embedx_dim + 1) * sizeof(float); } -size_t DownpourCtrDoubleAccessor::mf_size() { - return (_config.embedx_dim() + 1) * sizeof(float); // embedx embedx_g2sum -} -// pull value -size_t DownpourCtrDoubleAccessor::select_dim() { - auto embedx_dim = _config.embedx_dim(); - return 3 + embedx_dim; -} -size_t DownpourCtrDoubleAccessor::select_dim_size(size_t dim) { - return sizeof(float); -} -size_t DownpourCtrDoubleAccessor::select_size() { - return select_dim() * sizeof(float); -} -// push value -size_t DownpourCtrDoubleAccessor::update_dim() { - auto embedx_dim = _config.embedx_dim(); - return 4 + embedx_dim; -} -size_t DownpourCtrDoubleAccessor::update_dim_size(size_t dim) { - return sizeof(float); -} -size_t DownpourCtrDoubleAccessor::update_size() { - return update_dim() * sizeof(float); -} -bool DownpourCtrDoubleAccessor::shrink(float* value) { + +bool DownpourCtrDoubleAccessor::Shrink(float* value) { // auto base_threshold = _config.ctr_accessor_param().base_threshold(); // auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); // auto delete_threshold = _config.ctr_accessor_param().delete_threshold(); @@ -117,19 +59,19 @@ bool DownpourCtrDoubleAccessor::shrink(float* value) { _config.ctr_accessor_param().delete_after_unseen_days(); auto delete_threshold = _config.ctr_accessor_param().delete_threshold(); // time_decay first - DownpourCtrDoubleFeatureValue::show(value) *= _show_click_decay_rate; - DownpourCtrDoubleFeatureValue::click(value) *= _show_click_decay_rate; + DownpourCtrDoubleFeatureValue::Show(value) *= _show_click_decay_rate; + DownpourCtrDoubleFeatureValue::Click(value) *= _show_click_decay_rate; // shrink after - auto score = 
show_click_score(DownpourCtrDoubleFeatureValue::show(value), - DownpourCtrDoubleFeatureValue::click(value)); - auto unseen_days = DownpourCtrDoubleFeatureValue::unseen_days(value); + auto score = ShowClickScore(DownpourCtrDoubleFeatureValue::Show(value), + DownpourCtrDoubleFeatureValue::Click(value)); + auto unseen_days = DownpourCtrDoubleFeatureValue::UnseenDays(value); if (score < delete_threshold || unseen_days > delete_after_unseen_days) { return true; } return false; } bool DownpourCtrDoubleAccessor::save_ssd(float* value) { - if (DownpourCtrDoubleFeatureValue::unseen_days(value) > + if (DownpourCtrDoubleFeatureValue::UnseenDays(value) > _ssd_unseenday_threshold) { return true; } @@ -139,16 +81,16 @@ bool DownpourCtrDoubleAccessor::save_ssd(float* value) { // float* value, int param, double global_cache_threshold) { // auto base_threshold = _config.ctr_accessor_param().base_threshold(); // auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); -// if (show_click_score(DownpourCtrDoubleFeatureValue::show(value), -// DownpourCtrDoubleFeatureValue::click(value)) >= base_threshold -// && DownpourCtrDoubleFeatureValue::unseen_days(value) <= +// if (ShowClickScore(DownpourCtrDoubleFeatureValue::Show(value), +// DownpourCtrDoubleFeatureValue::Click(value)) >= base_threshold +// && DownpourCtrDoubleFeatureValue::UnseenDays(value) <= // delta_keep_days) { -// return DownpourCtrDoubleFeatureValue::show(value) > +// return DownpourCtrDoubleFeatureValue::Show(value) > // global_cache_threshold; // } // return false; // } -bool DownpourCtrDoubleAccessor::save(float* value, int param) { +bool DownpourCtrDoubleAccessor::Save(float* value, int param) { // auto base_threshold = _config.ctr_accessor_param().base_threshold(); // auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); // auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); @@ -167,16 +109,14 @@ bool DownpourCtrDoubleAccessor::save(float* value, int param) { case 1: // save xbox base case 2: { - if (show_click_score(DownpourCtrDoubleFeatureValue::show(value), - DownpourCtrDoubleFeatureValue::click(value)) >= + if (ShowClickScore(DownpourCtrDoubleFeatureValue::Show(value), + DownpourCtrDoubleFeatureValue::Click(value)) >= base_threshold && - DownpourCtrDoubleFeatureValue::delta_score(value) >= - delta_threshold && - DownpourCtrDoubleFeatureValue::unseen_days(value) <= - delta_keep_days) { + DownpourCtrDoubleFeatureValue::DeltaScore(value) >= delta_threshold && + DownpourCtrDoubleFeatureValue::UnseenDays(value) <= delta_keep_days) { // do this after save, because it must not be modified when retry if (param == 2) { - DownpourCtrDoubleFeatureValue::delta_score(value) = 0; + DownpourCtrDoubleFeatureValue::DeltaScore(value) = 0; } return true; } else { @@ -185,10 +125,10 @@ bool DownpourCtrDoubleAccessor::save(float* value, int param) { } // already decayed in shrink case 3: { - // DownpourCtrFeatureValue::show(value) *= _show_click_decay_rate; - // DownpourCtrFeatureValue::click(value) *= _show_click_decay_rate; + // DownpourCtrFeatureValue::Show(value) *= _show_click_decay_rate; + // DownpourCtrFeatureValue::Click(value) *= _show_click_decay_rate; // do this after save, because it must not be modified when retry - // DownpourCtrDoubleFeatureValue::unseen_days(value)++; + // DownpourCtrDoubleFeatureValue::UnseenDays(value)++; return true; } default: @@ -196,8 +136,7 @@ bool DownpourCtrDoubleAccessor::save(float* value, int param) { }; } -void 
DownpourCtrDoubleAccessor::update_stat_after_save(float* value, - int param) { +void DownpourCtrDoubleAccessor::UpdateStatAfterSave(float* value, int param) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); @@ -206,19 +145,17 @@ void DownpourCtrDoubleAccessor::update_stat_after_save(float* value, } switch (param) { case 1: { - if (show_click_score(DownpourCtrDoubleFeatureValue::show(value), - DownpourCtrDoubleFeatureValue::click(value)) >= + if (ShowClickScore(DownpourCtrDoubleFeatureValue::Show(value), + DownpourCtrDoubleFeatureValue::Click(value)) >= base_threshold && - DownpourCtrDoubleFeatureValue::delta_score(value) >= - delta_threshold && - DownpourCtrDoubleFeatureValue::unseen_days(value) <= - delta_keep_days) { - DownpourCtrDoubleFeatureValue::delta_score(value) = 0; + DownpourCtrDoubleFeatureValue::DeltaScore(value) >= delta_threshold && + DownpourCtrDoubleFeatureValue::UnseenDays(value) <= delta_keep_days) { + DownpourCtrDoubleFeatureValue::DeltaScore(value) = 0; } } return; case 3: { - DownpourCtrDoubleFeatureValue::unseen_days(value)++; + DownpourCtrDoubleFeatureValue::UnseenDays(value)++; } return; default: @@ -226,29 +163,29 @@ void DownpourCtrDoubleAccessor::update_stat_after_save(float* value, }; } -int32_t DownpourCtrDoubleAccessor::create(float** values, size_t num) { +int32_t DownpourCtrDoubleAccessor::Create(float** values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* value = values[value_item]; - value[DownpourCtrDoubleFeatureValue::unseen_days_index()] = 0; - value[DownpourCtrDoubleFeatureValue::delta_score_index()] = 0; - *(double*)(value + DownpourCtrDoubleFeatureValue::show_index()) = 0; - *(double*)(value + DownpourCtrDoubleFeatureValue::click_index()) = 0; - value[DownpourCtrDoubleFeatureValue::slot_index()] = -1; - _embed_sgd_rule->init_value( - value + DownpourCtrDoubleFeatureValue::embed_w_index(), - value + DownpourCtrDoubleFeatureValue::embed_g2sum_index()); - _embedx_sgd_rule->init_value( - value + DownpourCtrDoubleFeatureValue::embedx_w_index(), - value + DownpourCtrDoubleFeatureValue::embedx_g2sum_index(), false); + value[DownpourCtrDoubleFeatureValue::UnseenDaysIndex()] = 0; + value[DownpourCtrDoubleFeatureValue::DeltaScoreIndex()] = 0; + *(double*)(value + DownpourCtrDoubleFeatureValue::ShowIndex()) = 0; + *(double*)(value + DownpourCtrDoubleFeatureValue::ClickIndex()) = 0; + value[DownpourCtrDoubleFeatureValue::SlotIndex()] = -1; + _embed_sgd_rule->InitValue( + value + DownpourCtrDoubleFeatureValue::EmbedWIndex(), + value + DownpourCtrDoubleFeatureValue::EmbedG2SumIndex()); + _embedx_sgd_rule->InitValue( + value + DownpourCtrDoubleFeatureValue::EmbedxWIndex(), + value + DownpourCtrDoubleFeatureValue::EmbedxG2SumIndex(), false); } return 0; } -bool DownpourCtrDoubleAccessor::need_extend_mf(float* value) { +bool DownpourCtrDoubleAccessor::NeedExtendMF(float* value) { auto show = - ((double*)(value + DownpourCtrDoubleFeatureValue::show_index()))[0]; + ((double*)(value + DownpourCtrDoubleFeatureValue::ShowIndex()))[0]; auto click = - ((double*)(value + DownpourCtrDoubleFeatureValue::click_index()))[0]; + ((double*)(value + DownpourCtrDoubleFeatureValue::ClickIndex()))[0]; // float score = (show - click) * _config.ctr_accessor_param().nonclk_coeff() auto score = (show - click) * 
_config.ctr_accessor_param().nonclk_coeff() + click * _config.ctr_accessor_param().click_coeff(); @@ -256,20 +193,20 @@ bool DownpourCtrDoubleAccessor::need_extend_mf(float* value) { return score >= _config.embedx_threshold(); } // from DownpourCtrFeatureValue to DownpourCtrPullValue -int32_t DownpourCtrDoubleAccessor::select(float** select_values, +int32_t DownpourCtrDoubleAccessor::Select(float** select_values, const float** values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* select_value = select_values[value_item]; float* value = const_cast(values[value_item]); - select_value[DownpourCtrDoublePullValue::show_index()] = - (float)*(double*)(value + DownpourCtrDoubleFeatureValue::show_index()); - select_value[DownpourCtrDoublePullValue::click_index()] = - (float)*(double*)(value + DownpourCtrDoubleFeatureValue::click_index()); - select_value[DownpourCtrDoublePullValue::embed_w_index()] = - value[DownpourCtrDoubleFeatureValue::embed_w_index()]; - memcpy(select_value + DownpourCtrDoublePullValue::embedx_w_index(), - value + DownpourCtrDoubleFeatureValue::embedx_w_index(), + select_value[DownpourCtrDoublePullValue::ShowIndex()] = + (float)*(double*)(value + DownpourCtrDoubleFeatureValue::ShowIndex()); + select_value[DownpourCtrDoublePullValue::ClickIndex()] = + (float)*(double*)(value + DownpourCtrDoubleFeatureValue::ClickIndex()); + select_value[DownpourCtrDoublePullValue::EmbedWIndex()] = + value[DownpourCtrDoubleFeatureValue::EmbedWIndex()]; + memcpy(select_value + DownpourCtrDoublePullValue::EmbedxWIndex(), + value + DownpourCtrDoubleFeatureValue::EmbedxWIndex(), embedx_dim * sizeof(float)); } return 0; @@ -277,23 +214,23 @@ int32_t DownpourCtrDoubleAccessor::select(float** select_values, // from DownpourCtrPushValue to DownpourCtrPushValue // first dim: item // second dim: field num -int32_t DownpourCtrDoubleAccessor::merge(float** update_values, +int32_t DownpourCtrDoubleAccessor::Merge(float** update_values, const float** other_update_values, size_t num) { auto embedx_dim = _config.embedx_dim(); - size_t total_dim = DownpourCtrDoublePushValue::dim(embedx_dim); + size_t total_dim = DownpourCtrDoublePushValue::Dim(embedx_dim); for (size_t value_item = 0; value_item < num; ++value_item) { float* update_value = update_values[value_item]; const float* other_update_value = other_update_values[value_item]; - /**(double*)(update_value + DownpourCtrDoublePushValue::show_index()) += - *(double*)(other_update_value + DownpourCtrDoublePushValue::show_index()); - *(double*)(update_value + DownpourCtrDoublePushValue::click_index()) += - *(double*)(other_update_value + DownpourCtrDoublePushValue::click_index()); + /**(double*)(update_value + DownpourCtrDoublePushValue::ShowIndex()) += + *(double*)(other_update_value + DownpourCtrDoublePushValue::ShowIndex()); + *(double*)(update_value + DownpourCtrDoublePushValue::ClickIndex()) += + *(double*)(other_update_value + DownpourCtrDoublePushValue::ClickIndex()); for (auto i = 3u; i < total_dim; ++i) { update_value[i] += other_update_value[i]; }*/ for (auto i = 0u; i < total_dim; ++i) { - if (i != DownpourCtrDoublePushValue::slot_index()) { + if (i != DownpourCtrDoublePushValue::SlotIndex()) { update_value[i] += other_update_value[i]; } } @@ -303,47 +240,47 @@ int32_t DownpourCtrDoubleAccessor::merge(float** update_values, // from DownpourCtrPushValue to DownpourCtrFeatureValue // first dim: item // second dim: field num -int32_t DownpourCtrDoubleAccessor::update(float** 
update_values, +int32_t DownpourCtrDoubleAccessor::Update(float** update_values, const float** push_values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* update_value = update_values[value_item]; const float* push_value = push_values[value_item]; - float push_show = push_value[DownpourCtrDoublePushValue::show_index()]; - float push_click = push_value[DownpourCtrDoublePushValue::click_index()]; - float slot = push_value[DownpourCtrDoublePushValue::slot_index()]; - *(double*)(update_value + DownpourCtrDoubleFeatureValue::show_index()) += + float push_show = push_value[DownpourCtrDoublePushValue::ShowIndex()]; + float push_click = push_value[DownpourCtrDoublePushValue::ClickIndex()]; + float slot = push_value[DownpourCtrDoublePushValue::SlotIndex()]; + *(double*)(update_value + DownpourCtrDoubleFeatureValue::ShowIndex()) += (double)push_show; - *(double*)(update_value + DownpourCtrDoubleFeatureValue::click_index()) += + *(double*)(update_value + DownpourCtrDoubleFeatureValue::ClickIndex()) += (double)push_click; - update_value[DownpourCtrDoubleFeatureValue::slot_index()] = slot; - update_value[DownpourCtrDoubleFeatureValue::delta_score_index()] += + update_value[DownpourCtrDoubleFeatureValue::SlotIndex()] = slot; + update_value[DownpourCtrDoubleFeatureValue::DeltaScoreIndex()] += (push_show - push_click) * _config.ctr_accessor_param().nonclk_coeff() + push_click * _config.ctr_accessor_param().click_coeff(); //(push_show - push_click) * _config.ctr_accessor_param().nonclk_coeff() + // push_click * _config.ctr_accessor_param().click_coeff(); - update_value[DownpourCtrDoubleFeatureValue::unseen_days_index()] = 0; - _embed_sgd_rule->update_value( - update_value + DownpourCtrDoubleFeatureValue::embed_w_index(), - update_value + DownpourCtrDoubleFeatureValue::embed_g2sum_index(), - push_value + DownpourCtrDoublePushValue::embed_g_index(), push_show); - _embedx_sgd_rule->update_value( - update_value + DownpourCtrDoubleFeatureValue::embedx_w_index(), - update_value + DownpourCtrDoubleFeatureValue::embedx_g2sum_index(), - push_value + DownpourCtrDoublePushValue::embedx_g_index(), push_show); + update_value[DownpourCtrDoubleFeatureValue::UnseenDaysIndex()] = 0; + _embed_sgd_rule->UpdateValue( + update_value + DownpourCtrDoubleFeatureValue::EmbedWIndex(), + update_value + DownpourCtrDoubleFeatureValue::EmbedG2SumIndex(), + push_value + DownpourCtrDoublePushValue::EmbedGIndex(), push_show); + _embedx_sgd_rule->UpdateValue( + update_value + DownpourCtrDoubleFeatureValue::EmbedxWIndex(), + update_value + DownpourCtrDoubleFeatureValue::EmbedxG2SumIndex(), + push_value + DownpourCtrDoublePushValue::EmbedxGIndex(), push_show); } return 0; } -bool DownpourCtrDoubleAccessor::create_value(int stage, const float* value) { +bool DownpourCtrDoubleAccessor::CreateValue(int stage, const float* value) { // stage == 0, pull // stage == 1, push if (stage == 0) { return true; } else if (stage == 1) { - auto show = DownpourCtrDoublePushValue::show(const_cast(value)); - auto click = DownpourCtrDoublePushValue::click(const_cast(value)); - auto score = show_click_score(show, click); + auto show = DownpourCtrDoublePushValue::Show(const_cast(value)); + auto click = DownpourCtrDoublePushValue::Click(const_cast(value)); + auto score = ShowClickScore(show, click); if (score <= 0) { return false; } @@ -356,24 +293,24 @@ bool DownpourCtrDoubleAccessor::create_value(int stage, const float* value) { return true; } } -double 
DownpourCtrDoubleAccessor::show_click_score(double show, double click) { +double DownpourCtrDoubleAccessor::ShowClickScore(double show, double click) { // auto nonclk_coeff = _config.ctr_accessor_param().nonclk_coeff(); // auto click_coeff = _config.ctr_accessor_param().click_coeff(); auto nonclk_coeff = _config.ctr_accessor_param().nonclk_coeff(); auto click_coeff = _config.ctr_accessor_param().click_coeff(); return (show - click) * nonclk_coeff + click * click_coeff; } -std::string DownpourCtrDoubleAccessor::parse_to_string(const float* v, - int param_size) { +std::string DownpourCtrDoubleAccessor::ParseToString(const float* v, + int param_size) { thread_local std::ostringstream os; os.clear(); os.str(""); os << v[0] << " " << v[1] << " " << (float)((double*)(v + 2))[0] << " " << (float)((double*)(v + 4))[0] << " " << v[6] << " " << v[7] << " " << v[8]; - auto show = DownpourCtrDoubleFeatureValue::show(const_cast(v)); - auto click = DownpourCtrDoubleFeatureValue::click(const_cast(v)); - auto score = show_click_score(show, click); + auto show = DownpourCtrDoubleFeatureValue::Show(const_cast(v)); + auto click = DownpourCtrDoubleFeatureValue::Click(const_cast(v)); + auto score = ShowClickScore(show, click); if (score >= _config.embedx_threshold() && param_size > 9) { os << " " << v[9]; for (auto i = 0; i < _config.embedx_dim(); ++i) { @@ -382,23 +319,23 @@ std::string DownpourCtrDoubleAccessor::parse_to_string(const float* v, } return os.str(); } -int DownpourCtrDoubleAccessor::parse_from_string(const std::string& str, - float* value) { +int DownpourCtrDoubleAccessor::ParseFromString(const std::string& str, + float* value) { int embedx_dim = _config.embedx_dim(); - float data_buff[dim() + 2]; + float data_buff[_accessor_info.dim + 2]; float* data_buff_ptr = data_buff; - _embedx_sgd_rule->init_value( - data_buff_ptr + DownpourCtrDoubleFeatureValue::embedx_w_index(), - data_buff_ptr + DownpourCtrDoubleFeatureValue::embedx_g2sum_index()); + _embedx_sgd_rule->InitValue( + data_buff_ptr + DownpourCtrDoubleFeatureValue::EmbedxWIndex(), + data_buff_ptr + DownpourCtrDoubleFeatureValue::EmbedxG2SumIndex()); auto str_len = paddle::string::str_to_float(str.data(), data_buff_ptr); CHECK(str_len >= 6) << "expect more than 6 real:" << str_len; - int show_index = DownpourCtrDoubleFeatureValue::show_index(); - int click_index = DownpourCtrDoubleFeatureValue::click_index(); - int embed_w_index = DownpourCtrDoubleFeatureValue::embed_w_index(); + int show_index = DownpourCtrDoubleFeatureValue::ShowIndex(); + int click_index = DownpourCtrDoubleFeatureValue::ClickIndex(); + int embed_w_index = DownpourCtrDoubleFeatureValue::EmbedWIndex(); // no slot, embedx - int value_dim = dim(); - int embedx_g2sum_index = DownpourCtrDoubleFeatureValue::embedx_g2sum_index(); - value[DownpourCtrDoubleFeatureValue::slot_index()] = -1; + int value_dim = _accessor_info.dim; + int embedx_g2sum_index = DownpourCtrDoubleFeatureValue::EmbedxG2SumIndex(); + value[DownpourCtrDoubleFeatureValue::SlotIndex()] = -1; // other case if (str_len == (value_dim - 1)) { // copy unseen_days..delta_score @@ -407,9 +344,8 @@ int DownpourCtrDoubleAccessor::parse_from_string(const std::string& str, *(double*)(value + show_index) = (double)data_buff_ptr[2]; *(double*)(value + click_index) = (double)data_buff_ptr[3]; // copy others - value[DownpourCtrDoubleFeatureValue::embed_w_index()] = data_buff_ptr[4]; - value[DownpourCtrDoubleFeatureValue::embed_g2sum_index()] = - data_buff_ptr[5]; + value[DownpourCtrDoubleFeatureValue::EmbedWIndex()] = 
data_buff_ptr[4]; + value[DownpourCtrDoubleFeatureValue::EmbedG2SumIndex()] = data_buff_ptr[5]; memcpy(value + embedx_g2sum_index, data_buff_ptr + 6, (embedx_dim + 1) * sizeof(float)); } else { diff --git a/paddle/fluid/distributed/ps/table/ctr_double_accessor.h b/paddle/fluid/distributed/ps/table/ctr_double_accessor.h index fb8b27ecfd985..3995903463637 100644 --- a/paddle/fluid/distributed/ps/table/ctr_double_accessor.h +++ b/paddle/fluid/distributed/ps/table/ctr_double_accessor.h @@ -38,64 +38,64 @@ class DownpourCtrDoubleAccessor : public ValueAccessor { float embedx_g2sum; std::vector embedx_w; */ - static int dim(int embedx_dim) { return 8 + embedx_dim; } - static int dim_size(size_t dim, int embedx_dim) { return sizeof(float); } - static int size(int embedx_dim) { - return (dim(embedx_dim) + 2) * sizeof(float); + static int Dim(int embedx_dim) { return 8 + embedx_dim; } + static int DimSize(size_t dim, int embedx_dim) { return sizeof(float); } + static int Size(int embedx_dim) { + return (Dim(embedx_dim) + 2) * sizeof(float); } - static int unseen_days_index() { return 0; } - static int delta_score_index() { - return DownpourCtrDoubleFeatureValue::unseen_days_index() + 1; + static int UnseenDaysIndex() { return 0; } + static int DeltaScoreIndex() { + return DownpourCtrDoubleFeatureValue::UnseenDaysIndex() + 1; } - static int show_index() { - return DownpourCtrDoubleFeatureValue::delta_score_index() + 1; + static int ShowIndex() { + return DownpourCtrDoubleFeatureValue::DeltaScoreIndex() + 1; } // show is double - static int click_index() { - return DownpourCtrDoubleFeatureValue::show_index() + 2; + static int ClickIndex() { + return DownpourCtrDoubleFeatureValue::ShowIndex() + 2; } // click is double - static int embed_w_index() { - return DownpourCtrDoubleFeatureValue::click_index() + 2; + static int EmbedWIndex() { + return DownpourCtrDoubleFeatureValue::ClickIndex() + 2; } - static int embed_g2sum_index() { - return DownpourCtrDoubleFeatureValue::embed_w_index() + 1; + static int EmbedG2SumIndex() { + return DownpourCtrDoubleFeatureValue::EmbedWIndex() + 1; } - static int slot_index() { - return DownpourCtrDoubleFeatureValue::embed_g2sum_index() + 1; + static int SlotIndex() { + return DownpourCtrDoubleFeatureValue::EmbedG2SumIndex() + 1; } - static int embedx_g2sum_index() { - return DownpourCtrDoubleFeatureValue::slot_index() + 1; + static int EmbedxG2SumIndex() { + return DownpourCtrDoubleFeatureValue::SlotIndex() + 1; } - static int embedx_w_index() { - return DownpourCtrDoubleFeatureValue::embedx_g2sum_index() + 1; + static int EmbedxWIndex() { + return DownpourCtrDoubleFeatureValue::EmbedxG2SumIndex() + 1; } - static float& unseen_days(float* val) { - return val[DownpourCtrDoubleFeatureValue::unseen_days_index()]; + static float& UnseenDays(float* val) { + return val[DownpourCtrDoubleFeatureValue::UnseenDaysIndex()]; } - static float& delta_score(float* val) { - return val[DownpourCtrDoubleFeatureValue::delta_score_index()]; + static float& DeltaScore(float* val) { + return val[DownpourCtrDoubleFeatureValue::DeltaScoreIndex()]; } - static double& show(float* val) { - return ((double*)(val + DownpourCtrDoubleFeatureValue::show_index()))[0]; + static double& Show(float* val) { + return ((double*)(val + DownpourCtrDoubleFeatureValue::ShowIndex()))[0]; } - static double& click(float* val) { - return ((double*)(val + DownpourCtrDoubleFeatureValue::click_index()))[0]; + static double& Click(float* val) { + return ((double*)(val + DownpourCtrDoubleFeatureValue::ClickIndex()))[0]; } 
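A note on the value layout being renamed above: DownpourCtrDoubleFeatureValue keeps show and click as 64-bit doubles packed into the float array, which is why ClickIndex() is ShowIndex() + 2 and EmbedWIndex() is ClickIndex() + 2. The following is a minimal standalone sketch of that packing, not part of the patch; the index constants mirror the struct comment and the counter values are purely illustrative.

#include <cstring>
#include <cstdio>

int main() {
  // Layout mirrored from the struct comment: unseen_days, delta_score,
  // show (one double spanning 2 float slots), click (2 float slots),
  // embed_w, embed_g2sum, slot, embedx_g2sum, embedx_w...
  float value[10] = {};
  const int kShowIndex = 2;                // ShowIndex()
  const int kClickIndex = kShowIndex + 2;  // ClickIndex(): show occupies 2 floats
  double show = 123456.0, click = 789.0;   // illustrative counters

  // The accessor reads/writes through a double* cast; memcpy is used here
  // only to keep the sketch free of alignment/aliasing concerns.
  std::memcpy(value + kShowIndex, &show, sizeof(double));
  std::memcpy(value + kClickIndex, &click, sizeof(double));

  double show_back = 0.0, click_back = 0.0;
  std::memcpy(&show_back, value + kShowIndex, sizeof(double));
  std::memcpy(&click_back, value + kClickIndex, sizeof(double));
  std::printf("show=%.0f click=%.0f\n", show_back, click_back);  // show=123456 click=789
  return 0;
}

This is also why Create() and Select() in the .cc diff cast (value + ShowIndex()) to double* rather than indexing the float array directly.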
- static float& slot(float* val) { - return val[DownpourCtrDoubleFeatureValue::slot_index()]; + static float& Slot(float* val) { + return val[DownpourCtrDoubleFeatureValue::SlotIndex()]; } - static float& embed_w(float* val) { - return val[DownpourCtrDoubleFeatureValue::embed_w_index()]; + static float& EmbedW(float* val) { + return val[DownpourCtrDoubleFeatureValue::EmbedWIndex()]; } - static float& embed_g2sum(float* val) { - return val[DownpourCtrDoubleFeatureValue::embed_g2sum_index()]; + static float& EmbedG2Sum(float* val) { + return val[DownpourCtrDoubleFeatureValue::EmbedG2SumIndex()]; } - static float& embedx_g2sum(float* val) { - return val[DownpourCtrDoubleFeatureValue::embedx_g2sum_index()]; + static float& EmbedxG2Sum(float* val) { + return val[DownpourCtrDoubleFeatureValue::EmbedxG2SumIndex()]; } - static float* embedx_w(float* val) { - return (val + DownpourCtrDoubleFeatureValue::embedx_w_index()); + static float* EmbedxW(float* val) { + return (val + DownpourCtrDoubleFeatureValue::EmbedxWIndex()); } }; struct DownpourCtrDoublePushValue { @@ -106,36 +106,36 @@ class DownpourCtrDoubleAccessor : public ValueAccessor { float embed_g; std::vector embedx_g; */ - static int dim(int embedx_dim) { return 4 + embedx_dim; } - static int dim_size(int dim, int embedx_dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int slot_index() { return 0; } - static int show_index() { - return DownpourCtrDoublePushValue::slot_index() + 1; + static int Dim(int embedx_dim) { return 4 + embedx_dim; } + static int DimSize(int dim, int embedx_dim) { return sizeof(float); } + static int Size(int embedx_dim) { return Dim(embedx_dim) * sizeof(float); } + static int SlotIndex() { return 0; } + static int ShowIndex() { + return DownpourCtrDoublePushValue::SlotIndex() + 1; } - static int click_index() { - return DownpourCtrDoublePushValue::show_index() + 1; + static int ClickIndex() { + return DownpourCtrDoublePushValue::ShowIndex() + 1; } - static int embed_g_index() { - return DownpourCtrDoublePushValue::click_index() + 1; + static int EmbedGIndex() { + return DownpourCtrDoublePushValue::ClickIndex() + 1; } - static int embedx_g_index() { - return DownpourCtrDoublePushValue::embed_g_index() + 1; + static int EmbedxGIndex() { + return DownpourCtrDoublePushValue::EmbedGIndex() + 1; } - static float& slot(float* val) { - return val[DownpourCtrDoublePushValue::slot_index()]; + static float& Slot(float* val) { + return val[DownpourCtrDoublePushValue::SlotIndex()]; } - static float& show(float* val) { - return val[DownpourCtrDoublePushValue::show_index()]; + static float& Show(float* val) { + return val[DownpourCtrDoublePushValue::ShowIndex()]; } - static float& click(float* val) { - return val[DownpourCtrDoublePushValue::click_index()]; + static float& Click(float* val) { + return val[DownpourCtrDoublePushValue::ClickIndex()]; } - static float& embed_g(float* val) { - return val[DownpourCtrDoublePushValue::embed_g_index()]; + static float& EmbedG(float* val) { + return val[DownpourCtrDoublePushValue::EmbedGIndex()]; } - static float* embedx_g(float* val) { - return val + DownpourCtrDoublePushValue::embedx_g_index(); + static float* EmbedxG(float* val) { + return val + DownpourCtrDoublePushValue::EmbedxGIndex(); } }; struct DownpourCtrDoublePullValue { @@ -145,88 +145,68 @@ class DownpourCtrDoubleAccessor : public ValueAccessor { float embed_w; std::vector embedx_w; */ - static int dim(int embedx_dim) { return 3 + embedx_dim; } - static int 
dim_size(size_t dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int show_index() { return 0; } - static int click_index() { return 1; } - static int embed_w_index() { return 2; } - static int embedx_w_index() { return 3; } - static float& show(float* val) { - return val[DownpourCtrDoublePullValue::show_index()]; + static int Dim(int embedx_dim) { return 3 + embedx_dim; } + static int DimSize(size_t dim) { return sizeof(float); } + static int Size(int embedx_dim) { return Dim(embedx_dim) * sizeof(float); } + static int ShowIndex() { return 0; } + static int ClickIndex() { return 1; } + static int EmbedWIndex() { return 2; } + static int EmbedxWIndex() { return 3; } + static float& Show(float* val) { + return val[DownpourCtrDoublePullValue::ShowIndex()]; } - static float& click(float* val) { - return val[DownpourCtrDoublePullValue::click_index()]; + static float& Click(float* val) { + return val[DownpourCtrDoublePullValue::ClickIndex()]; } - static float& embed_w(float* val) { - return val[DownpourCtrDoublePullValue::embed_w_index()]; + static float& EmbedW(float* val) { + return val[DownpourCtrDoublePullValue::EmbedWIndex()]; } - static float* embedx_w(float* val) { - return val + DownpourCtrDoublePullValue::embedx_w_index(); + static float* EmbedxW(float* val) { + return val + DownpourCtrDoublePullValue::EmbedxWIndex(); } }; DownpourCtrDoubleAccessor() {} virtual ~DownpourCtrDoubleAccessor() {} - virtual int initialize(); - virtual void SetTableInfo(AccessorInfo& info); - virtual size_t GetTableInfo(InfoKey key); - // value维度 - virtual size_t dim(); - // value各个维度的size - virtual size_t dim_size(size_t dim); - // value各维度相加总size - virtual size_t size(); - // value中mf动态长度部分总size大小, sparse下生效 - virtual size_t mf_size(); - // pull value维度 - virtual size_t select_dim(); - // pull value各个维度的size - virtual size_t select_dim_size(size_t dim); - // pull value各维度相加总size - virtual size_t select_size(); - // push value维度 - virtual size_t update_dim(); - // push value各个维度的size - virtual size_t update_dim_size(size_t dim); - // push value各维度相加总size - virtual size_t update_size(); + virtual int Initialize(); + // 初始化AccessorInfo + virtual void InitAccessorInfo(); // 判断该value是否进行shrink - virtual bool shrink(float* value); - virtual bool need_extend_mf(float* value); + virtual bool Shrink(float* value); + virtual bool NeedExtendMF(float* value); // 判断该value是否在save阶段dump, // param作为参数用于标识save阶段,如downpour的xbox与batch_model // param = 0, save all feature // param = 1, save delta feature // param = 3, save all feature with time decay - virtual bool save(float* value, int param) override; + virtual bool Save(float* value, int param) override; // update delta_score and unseen_days after save - virtual void update_stat_after_save(float* value, int param) override; + virtual void UpdateStatAfterSave(float* value, int param) override; // 判断该value是否保存到ssd virtual bool save_ssd(float* value); // virtual bool save_cache(float* value, int param, double // global_cache_threshold) override; // keys不存在时,为values生成随机值 // 要求value的内存由外部调用者分配完毕 - virtual int32_t create(float** value, size_t num); + virtual int32_t Create(float** value, size_t num); // 从values中选取到select_values中 - virtual int32_t select(float** select_values, const float** values, + virtual int32_t Select(float** select_values, const float** values, size_t num); // 将update_values聚合到一起 - virtual int32_t merge(float** update_values, + virtual int32_t Merge(float** update_values, const float** 
other_update_values, size_t num); // 将update_values聚合到一起,通过it.next判定是否进入下一个key - // virtual int32_t merge(float** update_values, iterator it); + // virtual int32_t Merge(float** update_values, iterator it); // 将update_values更新应用到values中 - virtual int32_t update(float** values, const float** update_values, + virtual int32_t Update(float** values, const float** update_values, size_t num); - virtual std::string parse_to_string(const float* value, int param) override; - virtual int32_t parse_from_string(const std::string& str, float* v) override; - virtual bool create_value(int type, const float* value); + virtual std::string ParseToString(const float* value, int param) override; + virtual int32_t ParseFromString(const std::string& str, float* v) override; + virtual bool CreateValue(int type, const float* value); //这个接口目前只用来取show - virtual float get_field(float* value, const std::string& name) override { + virtual float GetField(float* value, const std::string& name) override { CHECK(name == "show"); if (name == "show") { - return (float)DownpourCtrDoubleFeatureValue::show(value); + return (float)DownpourCtrDoubleFeatureValue::Show(value); } return 0.0; } @@ -235,7 +215,7 @@ class DownpourCtrDoubleAccessor : public ValueAccessor { // DEFINE_GET_INDEX(DownpourCtrDoubleFeatureValue, embed_w) // DEFINE_GET_INDEX(DownpourCtrDoubleFeatureValue, embedx_w) private: - double show_click_score(double show, double click); + double ShowClickScore(double show, double click); private: SparseValueSGDRule* _embed_sgd_rule; diff --git a/paddle/fluid/distributed/ps/table/depends/dense.h b/paddle/fluid/distributed/ps/table/depends/dense.h index 8661eb1feecc8..aea757e8d5959 100644 --- a/paddle/fluid/distributed/ps/table/depends/dense.h +++ b/paddle/fluid/distributed/ps/table/depends/dense.h @@ -34,9 +34,9 @@ class DenseOptimizer { DenseOptimizer() {} explicit DenseOptimizer(const CommonAccessorParameter& accessor, std::vector>* values) {} - virtual void update(const float* update_values, size_t num, int begin, + virtual void Update(const float* update_values, size_t num, int begin, int end) = 0; - virtual void set_global_lr(float* lr) { global_learning_rate_ = lr; } + virtual void SetGlobalLR(float* lr) { global_learning_rate_ = lr; } protected: float* global_learning_rate_; @@ -55,7 +55,7 @@ class DSUM : public DenseOptimizer { } } - void update(const float* update_values, size_t num, int begin, + void Update(const float* update_values, size_t num, int begin, int end) override { auto update_numel = end - begin; GetBlas().VADD(update_numel, update_values + begin, param + begin, @@ -81,7 +81,7 @@ class DSGD : public DenseOptimizer { } } - void update(const float* update_values, size_t num, int begin, + void Update(const float* update_values, size_t num, int begin, int end) override { auto update_numel = end - begin; std::vector grads; @@ -99,7 +99,7 @@ class DSGD : public DenseOptimizer { }; // adam optimizer for dense tensor -// TODO(zhaocaibei123): add CHECK(common_dense_table.task_pool_size_) == 1 +// TODO(zhaocaibei123): add CHECK(memory_dense_table.task_pool_size_) == 1 class DAdam : public DenseOptimizer { public: explicit DAdam(const CommonAccessorParameter& accessor, @@ -132,9 +132,9 @@ class DAdam : public DenseOptimizer { epsilon = 1.0e-8; } - // make sure common_dense_table.task_pool_size_ == 1; + // make sure memory_dense_table.task_pool_size_ == 1; // otherwise, task_pool_size_ times beta1_pow/beta2_pow multiplication - void update(const float* update_values, size_t num, int begin, + void Update(const 
float* update_values, size_t num, int begin, int end) override { auto update_numel = end - begin; std::vector grad, grad2, tmp; @@ -214,7 +214,7 @@ class DAdamD2Sum : public DenseOptimizer { } } - void update(const float* update_values, size_t num, int begin, + void Update(const float* update_values, size_t num, int begin, int end) override { auto update_numel = end - begin; Eigen::Map mat_ada_g2sum(ada_g2sum + begin, 1, @@ -276,7 +276,7 @@ class DSummary : public DenseOptimizer { } } - void update(const float* update_values, size_t num, int begin, + void Update(const float* update_values, size_t num, int begin, int end) override { auto update_numel = end - begin; Eigen::Map mat_w(param + begin, 1, update_numel); diff --git a/paddle/fluid/distributed/ps/table/depends/sparse.h b/paddle/fluid/distributed/ps/table/depends/sparse.h deleted file mode 100644 index d4ea7829e45f8..0000000000000 --- a/paddle/fluid/distributed/ps/table/depends/sparse.h +++ /dev/null @@ -1,220 +0,0 @@ -// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once - -#include // for sqrt in CPU and CUDA -#include -#include -#include -#include -#include -#include -#include "gflags/gflags.h" - -#include "paddle/fluid/distributed/common/utils.h" -#include "paddle/fluid/distributed/ps/table/depends/large_scale_kv.h" - -namespace paddle { -namespace distributed { - -class SparseOptimizer { - public: - explicit SparseOptimizer( - const std::vector& value_names, - const std::vector& value_dims, const std::vector& value_offsets, - const std::unordered_map& value_idx) - : value_names_(value_names), - value_dims_(value_dims), - value_offsets_(value_offsets), - value_idx_(value_idx) {} - - virtual void update(const uint64_t* keys, const float* update_values, - size_t num, const std::vector& offsets, - ValueBlock* block) = 0; - - virtual void set_global_lr(float* lr) { global_learning_rate_ = lr; } - - const std::vector& value_names_; - const std::vector& value_dims_; - const std::vector& value_offsets_; - const std::unordered_map& value_idx_; - int param_offset = 0; - int update_numel = 0; - - protected: - float* global_learning_rate_; -}; - -// sum calc for sparse tensor -class SSUM : public SparseOptimizer { - public: - explicit SSUM(const std::vector& value_names, - const std::vector& value_dims, - const std::vector& value_offsets, - const std::unordered_map& value_idx) - : SparseOptimizer(value_names, value_dims, value_offsets, value_idx) { - auto idx = value_idx.at("Param"); - param_offset = value_offsets.at(idx); - update_numel = value_dims.at(idx); - } - - void update(const uint64_t* keys, const float* update_values, size_t num, - const std::vector& offsets, - ValueBlock* block) override { - auto blas = GetBlas(); - for (auto x : offsets) { - auto id = keys[x]; - if (!block->GetEntry(id)) continue; - auto* value = block->Get(id); - float* param = value + param_offset; - blas.VADD(update_numel, update_values + x * update_numel, param, param); - } - 
} -}; - -// sgd optimzer for sparse tensor -class SSGD : public SparseOptimizer { - public: - explicit SSGD(const std::vector& value_names, - const std::vector& value_dims, - const std::vector& value_offsets, - const std::unordered_map& value_idx) - : SparseOptimizer(value_names, value_dims, value_offsets, value_idx) { - auto idx = value_idx.at("Param"); - param_offset = value_offsets.at(idx); - update_numel = value_dims.at(idx); - - idx = value_idx.at("LearningRate"); - lr_offset = value_offsets.at(idx); - } - - void update(const uint64_t* keys, const float* update_values, size_t num, - const std::vector& offsets, - ValueBlock* block) override { - auto blas = GetBlas(); - for (auto x : offsets) { - auto id = keys[x]; - if (!block->GetEntry(id)) continue; - auto* value = block->Get(id); - - float learning_rate = *(global_learning_rate_) * (value + lr_offset)[0]; - float* param = value + param_offset; - - std::vector grads; - grads.resize(update_numel); - blas.VCOPY(update_numel, update_values + x * update_numel, grads.data()); - blas.SCAL(update_numel, learning_rate, grads.data()); - blas.VSUB(update_numel, param, grads.data(), param); - } - } - - int lr_offset; -}; - -// adam optimzer for sparse tensor -class SAdam : public SparseOptimizer { - public: - explicit SAdam(const std::vector& value_names, - const std::vector& value_dims, - const std::vector& value_offsets, - const std::unordered_map& value_idx) - : SparseOptimizer(value_names, value_dims, value_offsets, value_idx) { - auto idx = value_idx.at("Param"); - param_offset = value_offsets.at(idx); - update_numel = value_dims.at(idx); - - idx = value_idx.at("LearningRate"); - lr_offset = value_offsets.at(idx); - - idx = value_idx.at("Moment1"); - m1_offset = value_offsets.at(idx); - - idx = value_idx.at("Moment2"); - m2_offset = value_offsets.at(idx); - - idx = value_idx.at("Beta1Pow"); - beta1_pow_offset = value_offsets.at(idx); - - idx = value_idx.at("Beta2Pow"); - beta2_pow_offset = value_offsets.at(idx); - - // add attr later - beta1 = 0.9; - beta2 = 0.999; - epsilon = 1.0e-8; - } - - void update(const uint64_t* keys, const float* update_values, size_t num, - const std::vector& offsets, - ValueBlock* block) override { - auto blas = GetBlas(); - for (auto x : offsets) { - auto id = keys[x]; - if (!block->GetEntry(id)) continue; - auto* values = block->Get(id); - float lr_ = *(global_learning_rate_) * (values + lr_offset)[0]; - float* param = values + param_offset; - float* moment1 = values + m1_offset; - float* moment2 = values + m2_offset; - float* beta1_pow = values + beta1_pow_offset; - float* beta2_pow = values + beta2_pow_offset; - - beta1_pow[0] = beta1_pow[0] * beta1; - beta2_pow[0] = beta2_pow[0] * beta2; - - lr_ *= sqrt(1 - beta2_pow[0]) / (1 - beta1_pow[0]); - - std::vector grad, grad2, tmp; - grad.resize(update_numel); - grad2.resize(update_numel); - tmp.resize(update_numel); - - blas.VCOPY(update_numel, update_values + x * update_numel, grad.data()); - blas.VCOPY(update_numel, update_values + x * update_numel, grad2.data()); - - blas.SCAL(update_numel, 1 - beta1, grad.data()); - blas.VSQUARE(update_numel, grad2.data(), grad2.data()); - blas.SCAL(update_numel, 1 - beta2, grad2.data()); - - blas.SCAL(update_numel, beta1, moment1); - blas.VADD(update_numel, moment1, grad.data(), moment1); - blas.SCAL(update_numel, beta2, moment2); - blas.VADD(update_numel, moment2, grad2.data(), moment2); - - float* tmp_ = tmp.data(); - float eps_ = epsilon * sqrt(1 - beta2_pow[0]); - - SQRT(update_numel, moment2, tmp_); - ADD(update_numel, 
tmp_, eps_, tmp_); - - blas.VDIV(update_numel, moment1, tmp_, tmp_); - blas.SCAL(update_numel, lr_, tmp_); - blas.VSUB(update_numel, param, tmp_, param); - } - } - - int lr_offset; - int m1_offset; - int m2_offset; - int beta1_pow_offset; - int beta2_pow_offset; - - float beta1; - float beta2; - float epsilon; -}; - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/downpour_ctr_accessor.cc b/paddle/fluid/distributed/ps/table/downpour_ctr_accessor.cc deleted file mode 100644 index 2fff81b1a4dc6..0000000000000 --- a/paddle/fluid/distributed/ps/table/downpour_ctr_accessor.cc +++ /dev/null @@ -1,505 +0,0 @@ -// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#include "paddle/fluid/distributed/ps/table/downpour_ctr_accessor.h" -#include -#include "glog/logging.h" -#include "paddle/fluid/string/string_helper.h" - -namespace paddle { -namespace distributed { - -int DownpourCtrAccessor::initialize() { - auto name = _config.embed_sgd_param().name(); - _embed_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embed_sgd_rule->load_config(_config.embed_sgd_param(), 1); - - name = _config.embedx_sgd_param().name(); - _embedx_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embedx_sgd_rule->load_config(_config.embedx_sgd_param(), - _config.embedx_dim()); - - _show_click_decay_rate = _config.ctr_accessor_param().show_click_decay_rate(); - _ssd_unseenday_threshold = - _config.ctr_accessor_param().ssd_unseenday_threshold(); - set_time_decay_rates(); - return 0; -} - -void DownpourCtrAccessor::SetTableInfo(AccessorInfo& info) { - info.dim = dim(); - info.size = size(); - info.select_dim = select_dim(); - info.select_size = select_size(); - info.update_dim = update_dim(); - info.update_size = update_size(); - info.mf_size = mf_size(); - info.fea_dim = fea_dim(); -} - -size_t DownpourCtrAccessor::GetTableInfo(InfoKey key) { - switch (key) { - case DIM: - return dim(); - case SIZE: - return size(); - case SELECT_DIM: - return select_dim(); - case SELECT_SIZE: - return select_size(); - case UPDATE_DIM: - return update_dim(); - case UPDATE_SIZE: - return update_size(); - case MF_SIZE: - return mf_size(); - case FEA_DIM: - return fea_dim(); - } - return 0; -} - -size_t DownpourCtrAccessor::dim() { - auto embedx_dim = _config.embedx_dim(); - return DownpourCtrFeatureValue::dim(embedx_dim); -} - -size_t DownpourCtrAccessor::dim_size(size_t dim) { - auto embedx_dim = _config.embedx_dim(); - return DownpourCtrFeatureValue::dim_size(dim, embedx_dim); -} - -size_t DownpourCtrAccessor::size() { - auto embedx_dim = _config.embedx_dim(); - return DownpourCtrFeatureValue::size(embedx_dim); -} - -size_t DownpourCtrAccessor::mf_size() { - return (_config.embedx_dim() + 1) * sizeof(float); // embedx embedx_g2sum -} - -// pull value -size_t DownpourCtrAccessor::select_dim() { - auto embedx_dim = _config.embedx_dim(); - return 3 + embedx_dim; -} - -size_t 
DownpourCtrAccessor::select_dim_size(size_t dim) { - return sizeof(float); -} - -size_t DownpourCtrAccessor::select_size() { - return select_dim() * sizeof(float); -} - -// push value -size_t DownpourCtrAccessor::update_dim() { - auto embedx_dim = _config.embedx_dim(); - return 4 + embedx_dim; -} - -size_t DownpourCtrAccessor::update_dim_size(size_t dim) { - return sizeof(float); -} - -size_t DownpourCtrAccessor::update_size() { - return update_dim() * sizeof(float); -} - -bool DownpourCtrAccessor::shrink(float* value) { - // auto base_threshold = _config.ctr_accessor_param().base_threshold(); - // auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); - // auto delete_threshold = _config.ctr_accessor_param().delete_threshold(); - auto base_threshold = _config.ctr_accessor_param().base_threshold(); - auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); - auto delete_after_unseen_days = - _config.ctr_accessor_param().delete_after_unseen_days(); - auto delete_threshold = _config.ctr_accessor_param().delete_threshold(); - - // time_decay first - auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); - int16_t day_diff = _day_id - unseen_days; - if (day_diff < 0 || day_diff > delete_after_unseen_days) { - return true; - } - auto show_right = - DownpourCtrFeatureValue::show(value) * _time_decay_rates[day_diff]; - auto click_right = - DownpourCtrFeatureValue::click(value) * _time_decay_rates[day_diff]; - - // shrink after - auto score = show_click_score(show_right, click_right); - if (score < delete_threshold) { - return true; - } - return false; -} - -void DownpourCtrAccessor::set_day_id(int day_id) { _day_id = day_id; } - -int DownpourCtrAccessor::get_day_id() { return _day_id; } - -bool DownpourCtrAccessor::save_ssd(float* value) { - if (_day_id == 0) { - return true; - } - auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); - if (unseen_days == 0) { - return false; - } - // for the origin load (eg. 
unseen_days = 0-15) - if (unseen_days < _config.ctr_accessor_param().delta_keep_days()) { - unseen_days = _day_id - unseen_days; - } - int16_t day_diff = _day_id - unseen_days; - if (day_diff > _ssd_unseenday_threshold) { - return true; - } - return false; -} - -// bool DownpourCtrAccessor::save_cache( -// float* value, int param, double global_cache_threshold) { -// auto base_threshold = _config.ctr_accessor_param().base_threshold(); -// auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); -// auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); -// int16_t day_diff = _day_id - unseen_days; -// if (show_click_score(DownpourCtrFeatureValue::show(value), -// DownpourCtrFeatureValue::click(value)) >= base_threshold -// && day_diff <= delta_keep_days) { -// return DownpourCtrFeatureValue::show(value) > global_cache_threshold; -// } -// return false; -// } - -bool DownpourCtrAccessor::save(float* value, int param) { - // auto base_threshold = _config.ctr_accessor_param().base_threshold(); - // auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); - // auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); - auto base_threshold = _config.ctr_accessor_param().base_threshold(); - auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); - auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); - if (param == 2) { - delta_threshold = 0; - } - switch (param) { - // save all - case 0: { - return true; - } - // save xbox delta - case 1: - // save xbox base - case 2: { - auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); - int16_t day_diff = _day_id - unseen_days; - - auto show_right = - DownpourCtrFeatureValue::show(value) * _time_decay_rates[day_diff]; - auto click_right = - DownpourCtrFeatureValue::click(value) * _time_decay_rates[day_diff]; - - if (show_click_score(show_right, click_right) >= base_threshold && - DownpourCtrFeatureValue::delta_score(value) >= delta_threshold && - day_diff <= delta_keep_days) { - // do this after save, because it must not be modified when retry - if (param == 2) { - DownpourCtrFeatureValue::delta_score(value) = 0; - } - return true; - } else { - return false; - } - } - // already decayed in shrink - case 3: { - // DownpourCtrFeatureValue::show(value) *= _show_click_decay_rate; - // DownpourCtrFeatureValue::click(value) *= _show_click_decay_rate; - // do this after save, because it must not be modified when retry - // DownpourCtrFeatureValue::unseen_days(value)++; - return true; - } - default: - return true; - }; -} - -void DownpourCtrAccessor::update_stat_after_save(float* value, int param) { - auto base_threshold = _config.ctr_accessor_param().base_threshold(); - auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); - auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); - if (param == 2) { - delta_threshold = 0; - } - switch (param) { - case 1: { - auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); - int16_t day_diff = _day_id - unseen_days; - auto show_right = - DownpourCtrFeatureValue::show(value) * _time_decay_rates[day_diff]; - auto click_right = - DownpourCtrFeatureValue::click(value) * _time_decay_rates[day_diff]; - - if (show_click_score(show_right, click_right) >= base_threshold && - DownpourCtrFeatureValue::delta_score(value) >= delta_threshold && - day_diff <= delta_keep_days) { - DownpourCtrFeatureValue::delta_score(value) = 0; - } - } - return; - // case 3: - // { - // 
DownpourCtrFeatureValue::unseen_days(value)++; - // } - // return; - default: - return; - }; -} - -int32_t DownpourCtrAccessor::create(float** values, size_t num) { - auto embedx_dim = _config.embedx_dim(); - for (size_t value_item = 0; value_item < num; ++value_item) { - float* value = values[value_item]; - value[DownpourCtrFeatureValue::unseen_days_index()] = 0; - value[DownpourCtrFeatureValue::delta_score_index()] = 0; - value[DownpourCtrFeatureValue::show_index()] = 0; - value[DownpourCtrFeatureValue::click_index()] = 0; - value[DownpourCtrFeatureValue::slot_index()] = -1; - _embed_sgd_rule->init_value( - value + DownpourCtrFeatureValue::embed_w_index(), - value + DownpourCtrFeatureValue::embed_g2sum_index(), true); - _embedx_sgd_rule->init_value( - value + DownpourCtrFeatureValue::embedx_w_index(), - value + DownpourCtrFeatureValue::embedx_g2sum_index()); - } - return 0; -} - -bool DownpourCtrAccessor::need_extend_mf(float* value) { - float show = value[DownpourCtrFeatureValue::show_index()]; - float click = value[DownpourCtrFeatureValue::click_index()]; - // float score = (show - click) * _config.ctr_accessor_param().nonclk_coeff() - float score = (show - click) * _config.ctr_accessor_param().nonclk_coeff() + - click * _config.ctr_accessor_param().click_coeff(); - //+ click * _config.ctr_accessor_param().click_coeff(); - return score >= _config.embedx_threshold(); -} - -bool DownpourCtrAccessor::has_mf(size_t size) { - return size > DownpourCtrFeatureValue::embedx_g2sum_index(); -} - -// from DownpourCtrFeatureValue to DownpourCtrPullValue -int32_t DownpourCtrAccessor::select(float** select_values, const float** values, - size_t num) { - auto embedx_dim = _config.embedx_dim(); - for (size_t value_item = 0; value_item < num; ++value_item) { - float* select_value = select_values[value_item]; - float* value = const_cast(values[value_item]); - select_value[DownpourCtrPullValue::show_index()] = - value[DownpourCtrFeatureValue::show_index()]; - select_value[DownpourCtrPullValue::click_index()] = - value[DownpourCtrFeatureValue::click_index()]; - select_value[DownpourCtrPullValue::embed_w_index()] = - value[DownpourCtrFeatureValue::embed_w_index()]; - memcpy(select_value + DownpourCtrPullValue::embedx_w_index(), - value + DownpourCtrFeatureValue::embedx_w_index(), - embedx_dim * sizeof(float)); - } - return 0; -} - -// from DownpourCtrPushValue to DownpourCtrPushValue -// first dim: item -// second dim: field num -int32_t DownpourCtrAccessor::merge(float** update_values, - const float** other_update_values, - size_t num) { - auto embedx_dim = _config.embedx_dim(); - size_t total_dim = DownpourCtrPushValue::dim(embedx_dim); - for (size_t value_item = 0; value_item < num; ++value_item) { - float* update_value = update_values[value_item]; - const float* other_update_value = other_update_values[value_item]; - for (auto i = 0u; i < total_dim; ++i) { - if (i != DownpourCtrPushValue::slot_index()) { - update_value[i] += other_update_value[i]; - } - } - } - return 0; -} - -// from DownpourCtrPushValue to DownpourCtrFeatureValue -// first dim: item -// second dim: field num -int32_t DownpourCtrAccessor::update(float** update_values, - const float** push_values, size_t num) { - auto embedx_dim = _config.embedx_dim(); - for (size_t value_item = 0; value_item < num; ++value_item) { - float* update_value = update_values[value_item]; - const float* push_value = push_values[value_item]; - float push_show = push_value[DownpourCtrPushValue::show_index()]; - float push_click = 
push_value[DownpourCtrPushValue::click_index()]; - float slot = push_value[DownpourCtrPushValue::slot_index()]; - update_value[DownpourCtrFeatureValue::show_index()] += push_show; - update_value[DownpourCtrFeatureValue::click_index()] += push_click; - update_value[DownpourCtrFeatureValue::slot_index()] = slot; - update_value[DownpourCtrFeatureValue::delta_score_index()] += - (push_show - push_click) * _config.ctr_accessor_param().nonclk_coeff() + - push_click * _config.ctr_accessor_param().click_coeff(); - //(push_show - push_click) * _config.ctr_accessor_param().nonclk_coeff() + - // push_click * _config.ctr_accessor_param().click_coeff(); - update_value[DownpourCtrFeatureValue::unseen_days_index()] = 0; - _embed_sgd_rule->update_value( - update_value + DownpourCtrFeatureValue::embed_w_index(), - update_value + DownpourCtrFeatureValue::embed_g2sum_index(), - push_value + DownpourCtrPushValue::embed_g_index(), push_show); - _embedx_sgd_rule->update_value( - update_value + DownpourCtrFeatureValue::embedx_w_index(), - update_value + DownpourCtrFeatureValue::embedx_g2sum_index(), - push_value + DownpourCtrPushValue::embedx_g_index(), push_show); - } - return 0; -} - -bool DownpourCtrAccessor::create_value(int stage, const float* value) { - // stage == 0, pull - // stage == 1, push - if (stage == 0) { - return true; - } else if (stage == 1) { - auto show = DownpourCtrPushValue::show(const_cast(value)); - auto click = DownpourCtrPushValue::click(const_cast(value)); - auto score = show_click_score(show, click); - if (score <= 0) { - return false; - } - if (score >= 1) { - return true; - } - return local_uniform_real_distribution()(local_random_engine()) < - score; - } else { - return true; - } -} - -float DownpourCtrAccessor::show_click_score(float show, float click) { - // auto nonclk_coeff = _config.ctr_accessor_param().nonclk_coeff(); - // auto click_coeff = _config.ctr_accessor_param().click_coeff(); - auto nonclk_coeff = _config.ctr_accessor_param().nonclk_coeff(); - auto click_coeff = _config.ctr_accessor_param().click_coeff(); - return (show - click) * nonclk_coeff + click * click_coeff; -} - -std::string DownpourCtrAccessor::parse_to_string(const float* v, - int param_size) { - thread_local std::ostringstream os; - os.clear(); - os.str(""); - os << v[0] << " " << v[1] << " " << v[2] << " " << v[3] << " " << v[4] << " " - << v[5] << " " << v[6]; - auto show = DownpourCtrFeatureValue::show(const_cast(v)); - auto click = DownpourCtrFeatureValue::click(const_cast(v)); - auto score = show_click_score(show, click); - if (score >= _config.embedx_threshold() && param_size > 7) { - os << " " << v[7]; - for (auto i = 0; i < _config.embedx_dim(); ++i) { - os << " " << v[8 + i]; - } - } - return os.str(); -} - -int DownpourCtrAccessor::parse_from_string(const std::string& str, - float* value) { - int embedx_dim = _config.embedx_dim(); - float data_buff[dim()]; - float* data_buff_ptr = data_buff; - - _embedx_sgd_rule->init_value( - data_buff_ptr + DownpourCtrFeatureValue::embedx_w_index(), - data_buff_ptr + DownpourCtrFeatureValue::embedx_g2sum_index()); - - auto str_len = paddle::string::str_to_float(str.data(), data_buff_ptr); - CHECK(str_len >= 6) << "expect more than 6 real:" << str_len; - // no slot, embedx - int value_dim = dim(); - int embedx_g2sum_index = DownpourCtrFeatureValue::embedx_g2sum_index(); - value[DownpourCtrFeatureValue::slot_index()] = -1; - // other case - if (str_len == (value_dim - 1)) { - memcpy(value, data_buff_ptr, (embedx_g2sum_index - 1) * sizeof(float)); - memcpy(value 
+ embedx_g2sum_index, data_buff_ptr + embedx_g2sum_index - 1, - (embedx_dim + 1) * sizeof(float)); - } else { - memcpy(value, data_buff_ptr, str_len * sizeof(float)); - } - if (str_len == (value_dim - 1) || str_len == 6) { - str_len += 1; - } - return str_len; -} - -void DownpourCtrAccessor::set_time_decay_rates() { - //根据unseen_days的天数来初始化_time_decay_rates大小和对应的衰减率 - auto delete_after_unseen_days = - _config.ctr_accessor_param().delete_after_unseen_days(); - _time_decay_rates.assign(delete_after_unseen_days + 1, 0.0); - for (int i = 0; i <= delete_after_unseen_days; ++i) { - _time_decay_rates[i] = pow(_show_click_decay_rate, i); - } -} - -void DownpourCtrAccessor::update_time_decay(float* value, - bool is_update_seen_day) { - // 根据day_id 来进行show click 衰减和unseen_day 更新;unseen_day - // 为上次出现的dayid - if (_day_id == 0) { - return; - } - auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); - if (unseen_days == 0) { - DownpourCtrFeatureValue::unseen_days(value) = _day_id; - return; - } - // for the origin load (unseenday = 0 -15) - if (unseen_days < _config.ctr_accessor_param().delete_after_unseen_days()) { - // pull - if (is_update_seen_day) { - DownpourCtrFeatureValue::unseen_days(value) = _day_id; - return; - // save 舍弃原始的unseenday,都变为上一天出现,保证show/click不被重复decay - } else { - DownpourCtrFeatureValue::unseen_days(value) = _day_id - 1; - } - } - int16_t day_diff = _day_id - unseen_days; - if (day_diff < 0) { - DownpourCtrFeatureValue::unseen_days(value) = _day_id; - return; - } - if (day_diff >= _config.ctr_accessor_param().delete_after_unseen_days()) { - return; - } - DownpourCtrFeatureValue::show(value) *= _time_decay_rates[day_diff]; - DownpourCtrFeatureValue::click(value) *= _time_decay_rates[day_diff]; - if (is_update_seen_day) { - DownpourCtrFeatureValue::unseen_days(value) = _day_id; - } -} - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/downpour_ctr_accessor.h b/paddle/fluid/distributed/ps/table/downpour_ctr_accessor.h deleted file mode 100644 index 6ff6c0438310e..0000000000000 --- a/paddle/fluid/distributed/ps/table/downpour_ctr_accessor.h +++ /dev/null @@ -1,255 +0,0 @@ -// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
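The removed DownpourCtrAccessor above bases its shrink/save decisions on a time-decayed show/click score: set_time_decay_rates() precomputes pow(_show_click_decay_rate, day_diff) for every possible day gap, and show_click_score() weights non-clicked impressions by nonclk_coeff and clicks by click_coeff. Below is a compact sketch of that arithmetic with made-up coefficients; in the real accessor these come from _config.ctr_accessor_param(), and nothing here is taken from the patch beyond the formulas themselves.

#include <cmath>
#include <cstdio>

int main() {
  // Hypothetical config values standing in for ctr_accessor_param().
  const float nonclk_coeff = 0.1f;
  const float click_coeff = 1.0f;
  const float show_click_decay_rate = 0.98f;
  const float delete_threshold = 0.8f;

  float show = 40.0f, click = 2.0f;
  int day_diff = 7;  // days since the feature was last seen

  // set_time_decay_rates() precomputes pow(rate, i) for each day gap.
  float decay = std::pow(show_click_decay_rate, day_diff);
  float show_right = show * decay;
  float click_right = click * decay;

  // show_click_score(): (show - click) * nonclk_coeff + click * click_coeff
  float score = (show_right - click_right) * nonclk_coeff + click_right * click_coeff;

  // shrink() drops the feature once the decayed score falls below the threshold.
  std::printf("score=%.3f keep=%d\n", score, score >= delete_threshold);
  return 0;
}

The CtrDoubleAccessor kept after this patch applies the same score but tracks unseen_days directly instead of decaying by day id.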
- -#pragma once -#include -#include -#include -#include "paddle/fluid/distributed/common/registerer.h" -#include "paddle/fluid/distributed/ps.pb.h" -#include "paddle/fluid/distributed/ps/table/accessor.h" -#include "paddle/fluid/distributed/ps/table/sparse_sgd_rule.h" - -namespace paddle { -namespace distributed { - -/** - * @brief Accessor for unit - **/ -class DownpourCtrAccessor : public ValueAccessor { - public: - struct DownpourCtrFeatureValue { - /* - float unseen_days; - float delta_score; - float show; - float click; - float embed_w; - float embed_g2sum; - float slot; - float embedx_g2sum; - std::vector embedx_w; - */ - - static int dim(int embedx_dim) { return 8 + embedx_dim; } - static int dim_size(size_t dim, int embedx_dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int unseen_days_index() { return 0; } - static int delta_score_index() { - return DownpourCtrFeatureValue::unseen_days_index() + 1; - } - static int show_index() { - return DownpourCtrFeatureValue::delta_score_index() + 1; - } - static int click_index() { - return DownpourCtrFeatureValue::show_index() + 1; - } - static int embed_w_index() { - return DownpourCtrFeatureValue::click_index() + 1; - } - static int embed_g2sum_index() { - return DownpourCtrFeatureValue::embed_w_index() + 1; - } - static int slot_index() { - return DownpourCtrFeatureValue::embed_g2sum_index() + 1; - } - static int embedx_g2sum_index() { - return DownpourCtrFeatureValue::slot_index() + 1; - } - static int embedx_w_index() { - return DownpourCtrFeatureValue::embedx_g2sum_index() + 1; - } - static float& unseen_days(float* val) { - return val[DownpourCtrFeatureValue::unseen_days_index()]; - } - static float& delta_score(float* val) { - return val[DownpourCtrFeatureValue::delta_score_index()]; - } - static float& show(float* val) { - return val[DownpourCtrFeatureValue::show_index()]; - } - static float& click(float* val) { - return val[DownpourCtrFeatureValue::click_index()]; - } - static float& slot(float* val) { - return val[DownpourCtrFeatureValue::slot_index()]; - } - static float& embed_w(float* val) { - return val[DownpourCtrFeatureValue::embed_w_index()]; - } - static float& embed_g2sum(float* val) { - return val[DownpourCtrFeatureValue::embed_g2sum_index()]; - } - static float& embedx_g2sum(float* val) { - return val[DownpourCtrFeatureValue::embedx_g2sum_index()]; - } - static float* embedx_w(float* val) { - return (val + DownpourCtrFeatureValue::embedx_w_index()); - } - }; - - struct DownpourCtrPushValue { - /* - float slot; - float show; - float click; - float embed_g; - std::vector embedx_g; - */ - - static int dim(int embedx_dim) { return 4 + embedx_dim; } - - static int dim_size(int dim, int embedx_dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int slot_index() { return 0; } - static int show_index() { return DownpourCtrPushValue::slot_index() + 1; } - static int click_index() { return DownpourCtrPushValue::show_index() + 1; } - static int embed_g_index() { - return DownpourCtrPushValue::click_index() + 1; - } - static int embedx_g_index() { - return DownpourCtrPushValue::embed_g_index() + 1; - } - static float& slot(float* val) { return val[0]; } - static float& show(float* val) { return val[1]; } - static float& click(float* val) { return val[2]; } - static float& embed_g(float* val) { return val[3]; } - static float* embedx_g(float* val) { return val + 4; } - }; - - struct 
DownpourCtrPullValue { - /* - float show; - float click; - float embed_w; - std::vector embedx_w; - */ - - static int dim(int embedx_dim) { return 3 + embedx_dim; } - static int dim_size(size_t dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int show_index() { return 0; } - static int click_index() { return 1; } - static int embed_w_index() { return 2; } - static int embedx_w_index() { return 3; } - static float& show(float* val) { - return val[DownpourCtrPullValue::show_index()]; - } - static float& click(float* val) { - return val[DownpourCtrPullValue::click_index()]; - } - static float& embed_w(float* val) { - return val[DownpourCtrPullValue::embed_w_index()]; - } - static float* embedx_w(float* val) { - return val + DownpourCtrPullValue::embedx_w_index(); - } - }; - DownpourCtrAccessor() {} - virtual ~DownpourCtrAccessor() {} - - virtual int initialize(); - virtual void SetTableInfo(AccessorInfo& info); - virtual size_t GetTableInfo(InfoKey key); - // value维度 - virtual size_t dim(); - // value各个维度的size - virtual size_t dim_size(size_t dim); - // value各维度相加总size - virtual size_t size(); - // value中mf动态长度部分总size大小, sparse下生效 - virtual size_t mf_size(); - // pull value维度 - virtual size_t select_dim(); - // pull value各个维度的size - virtual size_t select_dim_size(size_t dim); - // pull value各维度相加总size - virtual size_t select_size(); - // push value维度 - virtual size_t update_dim(); - // push value各个维度的size - virtual size_t update_dim_size(size_t dim); - // push value各维度相加总size - virtual size_t update_size(); - // 判断该value是否进行shrink - virtual bool shrink(float* value); - // 判断该value是否保存到ssd - virtual bool save_ssd(float* value); - virtual bool need_extend_mf(float* value); - virtual bool has_mf(size_t size); - // 判断该value是否在save阶段dump, - // param作为参数用于标识save阶段,如downpour的xbox与batch_model - // param = 0, save all feature - // param = 1, save delta feature - // param = 3, save all feature with time decay - virtual bool save(float* value, int param) override; - // update delta_score and unseen_days after save - virtual void update_stat_after_save(float* value, int param) override; - // virtual bool save_cache(float* value, int param, double - // global_cache_threshold) override; - // keys不存在时,为values生成随机值 - // 要求value的内存由外部调用者分配完毕 - virtual int32_t create(float** value, size_t num); - // 从values中选取到select_values中 - virtual int32_t select(float** select_values, const float** values, - size_t num); - // 将update_values聚合到一起 - virtual int32_t merge(float** update_values, - const float** other_update_values, size_t num); - // 将update_values聚合到一起,通过it.next判定是否进入下一个key - // virtual int32_t merge(float** update_values, iterator it); - // 将update_values更新应用到values中 - virtual int32_t update(float** values, const float** update_values, - size_t num); - - virtual std::string parse_to_string(const float* value, int param) override; - virtual int32_t parse_from_string(const std::string& str, float* v) override; - virtual bool create_value(int type, const float* value); - - //这个接口目前只用来取show - virtual float get_field(float* value, const std::string& name) override { - CHECK(name == "show"); - if (name == "show") { - auto unseen_days = DownpourCtrFeatureValue::unseen_days(value); - int16_t day_diff = _day_id - unseen_days; - auto show_right = - DownpourCtrFeatureValue::show(value) * _time_decay_rates[day_diff]; - return (float)show_right; - } - return 0.0; - } - // DEFINE_GET_INDEX(DownpourCtrFeatureValue, show) - // 
DEFINE_GET_INDEX(DownpourCtrFeatureValue, click) - // DEFINE_GET_INDEX(DownpourCtrFeatureValue, embed_w) - // DEFINE_GET_INDEX(DownpourCtrFeatureValue, embedx_w) - - virtual void update_time_decay(float* value, bool is_update_seen_day); - virtual void set_day_id(int day_id); - virtual int get_day_id(); - bool test_func() { return false; } - - private: - float show_click_score(float show, float click); - void set_time_decay_rates(); - - private: - SparseValueSGDRule* _embed_sgd_rule; - SparseValueSGDRule* _embedx_sgd_rule; - float _show_click_decay_rate; - int32_t _ssd_unseenday_threshold; - std::vector _time_decay_rates; - int _day_id; -}; -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/common_dense_table.cc b/paddle/fluid/distributed/ps/table/memory_dense_table.cc similarity index 82% rename from paddle/fluid/distributed/ps/table/common_dense_table.cc rename to paddle/fluid/distributed/ps/table/memory_dense_table.cc index a462fc50aeb72..58ec8503c8156 100644 --- a/paddle/fluid/distributed/ps/table/common_dense_table.cc +++ b/paddle/fluid/distributed/ps/table/memory_dense_table.cc @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. -#include "paddle/fluid/distributed/ps/table/common_dense_table.h" +#include "paddle/fluid/distributed/ps/table/memory_dense_table.h" #include "paddle/fluid/platform/enforce.h" @@ -21,8 +21,8 @@ namespace distributed { int FLAGS_pslib_table_save_max_retry_dense = 3; -void CommonDenseTable::create_initializer(const std::string& attr, - const std::string& name) { +void MemoryDenseTable::CreateInitializer(const std::string& attr, + const std::string& name) { auto slices = string::split_string(attr, "&"); if (slices[0] == "gaussian_random") { @@ -39,7 +39,7 @@ void CommonDenseTable::create_initializer(const std::string& attr, } } -int32_t CommonDenseTable::initialize() { +int32_t MemoryDenseTable::Initialize() { _shards_task_pool.resize(task_pool_size_); for (int i = 0; i < _shards_task_pool.size(); ++i) { _shards_task_pool[i].reset(new ::ThreadPool(1)); @@ -49,12 +49,12 @@ int32_t CommonDenseTable::initialize() { VLOG(1) << "table " << _config.common().table_name() << " is sync: " << sync; _global_lr = new float(1.0); - initialize_value(); - initialize_optimizer(); + InitializeValue(); + InitializeOptimizer(); return 0; } -int32_t CommonDenseTable::initialize_value() { +int32_t MemoryDenseTable::InitializeValue() { auto common = _config.common(); int size = static_cast(common.params().size()); values_.resize(size); @@ -70,7 +70,7 @@ int32_t CommonDenseTable::initialize_value() { auto& initializer = common.initializers()[x]; total_dim_ += dim; - create_initializer(initializer, varname); + CreateInitializer(initializer, varname); values_[x].resize(dim); names_index_[varname] = x; @@ -92,27 +92,27 @@ int32_t CommonDenseTable::initialize_value() { param_col_ids_.insert(param_col_ids_.begin() + 1, -1); } - VLOG(1) << "CommonDenseTable::initialize_value total dim: " << total_dim_ + VLOG(1) << "MemoryDenseTable::InitializeValue total dim: " << total_dim_ << " fixed_len_params_dim: " << fixed_len_params_dim_; pull_reservoir_ = ReservoirValue(param_dim_); return 0; } -int32_t CommonDenseTable::initialize_optimizer() { +int32_t MemoryDenseTable::InitializeOptimizer() { auto common = _config.common(); auto name = common.name(); auto attrs = common.attributes(); if (name == "sgd") { optimizer_ = std::make_shared(common, &values_); - 
optimizer_->set_global_lr(_global_lr); + optimizer_->SetGlobalLR(_global_lr); } else if (name == "adam") { optimizer_ = std::make_shared(common, &values_); - optimizer_->set_global_lr(_global_lr); + optimizer_->SetGlobalLR(_global_lr); } else if (name == "adam_d2sum") { optimizer_ = std::make_shared(common, &values_); - // optimizer_->set_global_lr(_global_lr); //no use + // optimizer_->SetGlobalLR(_global_lr); //no use } else if (name == "sum") { optimizer_ = std::make_shared(common, &values_); } else if (name == "summary") { @@ -124,34 +124,37 @@ int32_t CommonDenseTable::initialize_optimizer() { return 0; } -int32_t CommonDenseTable::set_global_lr(float* lr) { +int32_t MemoryDenseTable::SetGlobalLR(float* lr) { _global_lr = lr; - optimizer_->set_global_lr(_global_lr); + optimizer_->SetGlobalLR(_global_lr); return 0; } -int32_t CommonDenseTable::Pull(TableContext& context) { +int32_t MemoryDenseTable::Pull(TableContext& context) { CHECK(context.value_type == Dense); float* pull_values = context.pull_context.values; - return pull_dense(pull_values, context.num); + return PullDense(pull_values, context.num); } -int32_t CommonDenseTable::Push(TableContext& context) { +int32_t MemoryDenseTable::Push(TableContext& context) { CHECK(context.value_type == Dense); if (context.push_context.values != nullptr) { - const float* values = context.push_context.values; - return push_dense(values, context.num); + if (!context.push_context.is_param) { + return PushDense(context.push_context.values, context.num); + } else { + return PushDenseParam(context.push_context.values, context.num); + } } return 0; } -int32_t CommonDenseTable::pull_dense(float* pull_values, size_t num) { +int32_t MemoryDenseTable::PullDense(float* pull_values, size_t num) { std::copy(values_[param_idx_].begin(), values_[param_idx_].end(), pull_values); return 0; } -int32_t CommonDenseTable::push_dense_param(const float* values, size_t num) { +int32_t MemoryDenseTable::PushDenseParam(const float* values, size_t num) { PADDLE_ENFORCE_GE( num, param_dim_, paddle::platform::errors::InvalidArgument( @@ -160,14 +163,14 @@ int32_t CommonDenseTable::push_dense_param(const float* values, size_t num) { return 0; } -int32_t CommonDenseTable::pour() { +int32_t MemoryDenseTable::Pour() { pull_reservoir_.avg(); - _push_dense(pull_reservoir_.values.data(), pull_reservoir_.values.size()); + _PushDense(pull_reservoir_.values.data(), pull_reservoir_.values.size()); pull_reservoir_.reset(); return 0; } -int32_t CommonDenseTable::push_dense(const float* values, size_t num) { +int32_t MemoryDenseTable::PushDense(const float* values, size_t num) { if (sync) { std::future task = _shards_task_pool[0]->enqueue([this, &values]() -> int { @@ -176,12 +179,12 @@ int32_t CommonDenseTable::push_dense(const float* values, size_t num) { }); task.wait(); } else { - _push_dense(values, num); + _PushDense(values, num); } return 0; } -int32_t CommonDenseTable::_push_dense(const float* values, size_t num) { +int32_t MemoryDenseTable::_PushDense(const float* values, size_t num) { PADDLE_ENFORCE_GE( num, param_dim_, paddle::platform::errors::InvalidArgument( @@ -195,7 +198,7 @@ int32_t CommonDenseTable::_push_dense(const float* values, size_t num) { [this, shard_id, &buckets, &values]() -> int { auto begin = buckets[shard_id]; auto end = buckets[shard_id + 1]; - optimizer_->update(values, param_dim_, begin, end); + optimizer_->Update(values, param_dim_, begin, end); return 0; }); } @@ -203,16 +206,16 @@ int32_t CommonDenseTable::_push_dense(const float* values, size_t num) 
{ for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { tasks[shard_id].wait(); } - VLOG(2) << "debug CommonDenseTable::_push_dense done"; + VLOG(2) << "debug MemoryDenseTable::_push_dense done"; return 0; } -int32_t CommonDenseTable::load(const std::string& path, +int32_t MemoryDenseTable::Load(const std::string& path, const std::string& param) { if (param_dim_ <= 0) { return 0; } - std::string table_path = table_dir(path); + std::string table_path = TableDir(path); auto file_list = _afs_client.list(table_path); std::sort(file_list.begin(), file_list.end()); for (auto ff : file_list) { @@ -220,7 +223,8 @@ int32_t CommonDenseTable::load(const std::string& path, } size_t dim_num_per_file = _config.accessor().fea_dim() / file_list.size() + 1; // param_dim_ in last node != _config.accesor().fea_dim() / _shard_num + 1 - size_t dim_num_per_shard = _table_info.fea_dim / _shard_num + 1; + size_t dim_num_per_shard = + _value_accesor->GetAccessorInfo().fea_dim / _shard_num + 1; size_t start_dim_idx = dim_num_per_shard * _shard_idx; size_t start_file_idx = start_dim_idx / dim_num_per_file; size_t end_file_idx = (start_dim_idx + param_dim_) / dim_num_per_file; @@ -232,9 +236,9 @@ int32_t CommonDenseTable::load(const std::string& path, int load_param = atoi(param.c_str()); FsChannelConfig channel_config; - channel_config.converter = _value_accesor->converter(load_param).converter; + channel_config.converter = _value_accesor->Converter(load_param).converter; channel_config.deconverter = - _value_accesor->converter(load_param).deconverter; + _value_accesor->Converter(load_param).deconverter; bool is_read_failed = false; int err_no = 0; int retry_num = 0; @@ -277,7 +281,7 @@ int32_t CommonDenseTable::load(const std::string& path, continue; } values_[param_col_ids_[col_idx]][dim_idx] = data_buffer[col_idx]; - VLOG(2) << "CommonDenseTable::load param x: " + VLOG(2) << "MemoryDenseTable::load param x: " << param_col_ids_[col_idx] << " y: " << dim_idx << " value: " << values_[param_col_ids_[col_idx]][dim_idx] << " line " << file_dim_idx; @@ -314,24 +318,24 @@ int32_t CommonDenseTable::load(const std::string& path, return 0; } -int32_t CommonDenseTable::save(const std::string& path, +int32_t MemoryDenseTable::Save(const std::string& path, const std::string& param) { int save_param = atoi(param.c_str()); uint32_t feasign_size; - VLOG(0) << "CommonDenseTable::save path " << path; + VLOG(0) << "MemoryDenseTable::save path " << path; FsChannelConfig channel_config; if (_config.compress_in_save()) { channel_config.path = paddle::string::format_string( - "%s/part-%03d.gz", table_dir(path).c_str(), _shard_idx); + "%s/part-%03d.gz", TableDir(path).c_str(), _shard_idx); } else { channel_config.path = paddle::string::format_string( - "%s/part-%03d", table_dir(path).c_str(), _shard_idx); + "%s/part-%03d", TableDir(path).c_str(), _shard_idx); } _afs_client.remove(channel_config.path); - channel_config.converter = _value_accesor->converter(save_param).converter; + channel_config.converter = _value_accesor->Converter(save_param).converter; channel_config.deconverter = - _value_accesor->converter(save_param).deconverter; + _value_accesor->Converter(save_param).deconverter; bool is_write_failed = false; std::vector> result_buffer_param( @@ -352,7 +356,7 @@ int32_t CommonDenseTable::save(const std::string& path, for (int x = 0; x < size; ++x) { auto& varname = common.params()[x]; auto& dim = common.dims()[x]; - VLOG(3) << "CommonDenseTable::save dim " << x << " size: " << dim; + VLOG(3) << "MemoryDenseTable::save 
dim " << x << " size: " << dim; for (int y = 0; y < dim; ++y) { os.clear(); os.str(""); diff --git a/paddle/fluid/distributed/ps/table/common_dense_table.h b/paddle/fluid/distributed/ps/table/memory_dense_table.h similarity index 62% rename from paddle/fluid/distributed/ps/table/common_dense_table.h rename to paddle/fluid/distributed/ps/table/memory_dense_table.h index cad49a0a449c4..73653fbc2eb57 100644 --- a/paddle/fluid/distributed/ps/table/common_dense_table.h +++ b/paddle/fluid/distributed/ps/table/memory_dense_table.h @@ -30,33 +30,35 @@ namespace distributed { class DenseOptimizer; -class CommonDenseTable : public DenseTable { +class MemoryDenseTable : public Table { public: - CommonDenseTable() {} - virtual ~CommonDenseTable() {} - int32_t initialize() override; - int32_t initialize_shard() override { return 0; } - virtual void create_initializer(const std::string& attr, - const std::string& name); - virtual int32_t initialize_value(); - virtual int32_t initialize_optimizer(); - virtual int32_t Pull(TableContext& context); - virtual int32_t Push(TableContext& context); - int32_t pull_dense(float* pull_values, size_t num) override; - int32_t push_dense_param(const float* values, size_t num) override; - int32_t push_dense(const float* values, size_t num) override; - int32_t pour() override; - int32_t set_global_lr(float* lr) override; + MemoryDenseTable() {} + virtual ~MemoryDenseTable() {} + int32_t Initialize() override; + int32_t InitializeShard() override { return 0; } + void CreateInitializer(const std::string& attr, const std::string& name); + int32_t InitializeValue(); + int32_t InitializeOptimizer(); - int32_t load(const std::string& path, const std::string& param) override; - int32_t save(const std::string& path, const std::string& param) override; + int32_t Pull(TableContext& context) override; + int32_t Push(TableContext& context) override; - int32_t flush() override { return 0; } - int32_t shrink(const std::string& param) override { return 0; } - void clear() override { return; } + int32_t PullDense(float* pull_values, size_t num); + int32_t PushDenseParam(const float* values, size_t num); + int32_t PushDense(const float* values, size_t num); + int32_t Pour() override; + int32_t SetGlobalLR(float* lr) override; + + int32_t Load(const std::string& path, const std::string& param) override; + int32_t Save(const std::string& path, const std::string& param) override; + + int32_t Flush() override { return 0; } + int32_t Shrink(const std::string& param) override { return 0; } + void Clear() override { return; } + void* GetShard(size_t shard_idx) override { return 0; } protected: - int32_t _push_dense(const float* values, size_t num); + int32_t _PushDense(const float* values, size_t num); private: const int task_pool_size_ = 10; diff --git a/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.cc b/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.cc index f16f4fc7f34a5..1567d31d0f3ee 100644 --- a/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.cc +++ b/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.cc @@ -17,11 +17,33 @@ namespace paddle { namespace distributed { -int32_t MemorySparseGeoTable::push_sparse_param(const uint64_t* keys, - const float* values, - size_t num) { - VLOG(5) << "DEBUG MemorySparseGeoTable::push_sparse_param begin " - "push_sparse_param " +int32_t MemorySparseGeoTable::Pull(TableContext& context) { + CHECK(context.value_type == Sparse); + if (context.pull_context.geo_pull_keys != nullptr) { + return 
PullGeoParam(context.trainer_id, + context.pull_context.geo_pull_values, + context.pull_context.geo_pull_keys); + } else { + return PullSparse(context.pull_context.values, + context.pull_context.pull_value); + } +} + +int32_t MemorySparseGeoTable::Push(TableContext& context) { + CHECK(context.value_type == Sparse); + if (!context.push_context.is_param) { + return PushSparse(context.push_context.keys, context.push_context.values, + context.num); + } else { + return PushSparseParam(context.push_context.keys, + context.push_context.values, context.num); + } +} + +int32_t MemorySparseGeoTable::PushSparseParam(const uint64_t* keys, + const float* values, size_t num) { + VLOG(5) << "DEBUG MemorySparseGeoTable::PushSparseParam begin " + "PushSparseParam " << num; auto shard_num = _task_pool_size; std::vector> offset_bucket; @@ -31,8 +53,8 @@ int32_t MemorySparseGeoTable::push_sparse_param(const uint64_t* keys, auto y = keys[x] % shard_num; offset_bucket[y].push_back(x); if (x < 10) { - VLOG(5) << "DEBUG MemorySparseGeoTable::push_sparse_param key: " - << keys[x] << " shard: " << y; + VLOG(5) << "DEBUG MemorySparseGeoTable::PushSparseParam key: " << keys[x] + << " shard: " << y; } } @@ -51,8 +73,8 @@ int32_t MemorySparseGeoTable::push_sparse_param(const uint64_t* keys, feature_value.resize(_dim); std::copy_n(values + _dim * offset, _dim, feature_value.data()); if (i < 10) { - VLOG(5) << "MemorySparseGeoTable::push_sparse_param " - "push_sparse_param key " + VLOG(5) << "MemorySparseGeoTable::PushSparseParam " + "PushSparseParam key " << id << " value[0]: " << (values + _dim * offset)[0] << " data: " << feature_value.data()[0] << " value[-1]: " << (values + _dim * offset)[_dim - 1] @@ -69,9 +91,9 @@ int32_t MemorySparseGeoTable::push_sparse_param(const uint64_t* keys, return 0; } -int32_t MemorySparseGeoTable::pull_geo_param(const uint32_t trainer_id, - std::vector* values, - std::vector* ids) { +int32_t MemorySparseGeoTable::PullGeoParam(const uint32_t trainer_id, + std::vector* values, + std::vector* ids) { _geo_recorder->GetAndClear(trainer_id, ids); VLOG(5) << "DEBUG MemorySparseGeoTable::pull_geo_param pull_geo_param trainer_id " @@ -86,23 +108,23 @@ int32_t MemorySparseGeoTable::pull_geo_param(const uint32_t trainer_id, pull_value.frequencies_ = frequencies.data(); values->resize(ids->size() * _dim); - pull_sparse(values->data(), pull_value); + PullSparse(values->data(), pull_value); return 0; } -int32_t MemorySparseGeoTable::push_sparse(const uint64_t* keys, - const float* values, size_t num) { - VLOG(5) << "DEBUG MemorySparseGeoTable::push_sparse keys[0]" << keys[0] +int32_t MemorySparseGeoTable::PushSparse(const uint64_t* keys, + const float* values, size_t num) { + VLOG(5) << "DEBUG MemorySparseGeoTable::PushSparse keys[0]" << keys[0] << " key_num: " << num; std::vector ids; ids.resize(num); std::copy_n(keys, num, ids.begin()); _geo_recorder->Update(ids); - _push_sparse(keys, values, num); + _PushSparse(keys, values, num); return 0; } -int32_t MemorySparseGeoTable::initialize() { +int32_t MemorySparseGeoTable::Initialize() { if (!_geo_recorder) { auto trainers = _config.common().trainer_num(); _geo_recorder = std::make_shared(trainers); @@ -118,8 +140,9 @@ int32_t MemorySparseGeoTable::initialize() { return 0; } -int32_t MemorySparseGeoTable::pull_sparse(float* pull_values, - const PullSparseValue& pull_value) { +// hash different from MemorySparseTable +int32_t MemorySparseGeoTable::PullSparse(float* pull_values, + const PullSparseValue& pull_value) { auto shard_num = _task_pool_size; 
std::vector> tasks(shard_num); @@ -146,13 +169,13 @@ int32_t MemorySparseGeoTable::pull_sparse(float* pull_values, auto& feature_value = local_shard[key]; feature_value.resize(_dim); memset(feature_value.data(), 0, sizeof(float) * _dim); - VLOG(0) << "MemorySparseGeoTable pull_sparse key not found!!! " + VLOG(0) << "MemorySparseGeoTable PullSparse key not found!!! " << key; itr = local_shard.find(key); } memcpy(select_data, itr.value().data(), _dim * sizeof(float)); - VLOG(5) << "DEBUG MemorySparseGeoTable::pull_sparse key: " << key + VLOG(5) << "DEBUG MemorySparseGeoTable::PullSparse key: " << key << " select_data[0] " << select_data[0] << " value[0]: " << itr.value().data()[0]; } @@ -167,8 +190,8 @@ int32_t MemorySparseGeoTable::pull_sparse(float* pull_values, return 0; } -int32_t MemorySparseGeoTable::_push_sparse(const uint64_t* keys, - const float* values, size_t num) { +int32_t MemorySparseGeoTable::_PushSparse(const uint64_t* keys, + const float* values, size_t num) { auto shard_num = _task_pool_size; std::vector> tasks(shard_num); std::vector>> task_keys(shard_num); diff --git a/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.h b/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.h index 3b43f99543fdd..60ba5d9602e44 100644 --- a/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.h +++ b/paddle/fluid/distributed/ps/table/memory_sparse_geo_table.h @@ -34,40 +34,44 @@ namespace distributed { class GeoRecorder; -class MemorySparseGeoTable : public SparseTable { +class MemorySparseGeoTable : public Table { public: typedef SparseTableShard shard_type; MemorySparseGeoTable() { _geo_recorder = nullptr; } virtual ~MemorySparseGeoTable() {} - virtual int32_t initialize(); - virtual int32_t initialize_shard() { return 0; } - virtual int32_t load(const std::string& path, const std::string& param) { + int32_t Initialize() override; + int32_t InitializeShard() override { return 0; } + int32_t Load(const std::string& path, const std::string& param) override { return 0; } - virtual int32_t save(const std::string& path, const std::string& param) { + int32_t Save(const std::string& path, const std::string& param) override { return 0; } - virtual int32_t Pull(TableContext& context) { return 0; } - virtual int32_t Push(TableContext& context) { return 0; } - virtual int32_t flush() { return 0; } - virtual int32_t shrink(const std::string& param) { return 0; } - virtual void clear() { return; } - virtual int32_t pull_sparse(float* values, const PullSparseValue& pull_value); + int32_t Pull(TableContext& context) override; + int32_t Push(TableContext& context) override; + int32_t Flush() override { return 0; } + int32_t Shrink(const std::string& param) override { return 0; } + void Clear() override { return; } - int32_t push_sparse_param(const uint64_t* keys, const float* values, - size_t num); - // TODO(zhaocaibei123): change to pull_sparse, and rename pull_sparse - int32_t pull_geo_param(const uint32_t trainer_id, std::vector* values, - std::vector* keys); + int32_t PullSparse(float* values, const PullSparseValue& pull_value); - int32_t push_sparse(const uint64_t* keys, const float* values, - size_t num) override; + int32_t PushSparseParam(const uint64_t* keys, const float* values, + size_t num); - int32_t _push_sparse(const uint64_t* keys, const float* values, size_t num); + int32_t PullGeoParam(const uint32_t trainer_id, std::vector* values, + std::vector* keys); + + int32_t PushSparse(const uint64_t* keys, const float* values, size_t num); + + int32_t _PushSparse(const 
uint64_t* keys, const float* values, size_t num); // int32_t _pull_sparse(float* pull_values, const PullSparseValue& // pull_value); + void* GetShard(size_t shard_idx) override { + return &_local_shards[shard_idx]; + } + private: std::shared_ptr _geo_recorder; const int _task_pool_size = 10; diff --git a/paddle/fluid/distributed/ps/table/memory_sparse_table.cc b/paddle/fluid/distributed/ps/table/memory_sparse_table.cc index 3f5c484eab825..e6c52e0b9b0c8 100644 --- a/paddle/fluid/distributed/ps/table/memory_sparse_table.cc +++ b/paddle/fluid/distributed/ps/table/memory_sparse_table.cc @@ -31,7 +31,7 @@ bool FLAGS_pserver_create_value_when_push = true; int FLAGS_pserver_table_save_max_retry = 3; bool FLAGS_pserver_enable_create_feasign_randomly = false; -int32_t MemorySparseTable::initialize() { +int32_t MemorySparseTable::Initialize() { _shards_task_pool.resize(_task_pool_size); for (int i = 0; i < _shards_task_pool.size(); ++i) { _shards_task_pool[i].reset(new ::ThreadPool(1)); @@ -39,15 +39,15 @@ int32_t MemorySparseTable::initialize() { auto& profiler = CostProfiler::instance(); profiler.register_profiler("pserver_sparse_update_all"); profiler.register_profiler("pserver_sparse_select_all"); - initialize_value(); + InitializeValue(); VLOG(0) << "initalize MemorySparseTable succ"; return 0; } -int32_t MemorySparseTable::initialize_value() { +int32_t MemorySparseTable::InitializeValue() { _sparse_table_shard_num = static_cast(_config.shard_num()); _avg_local_shard_num = - SparseTable::sparse_local_shard_num(_sparse_table_shard_num, _shard_num); + sparse_local_shard_num(_sparse_table_shard_num, _shard_num); _real_local_shard_num = _avg_local_shard_num; if (_real_local_shard_num * (_shard_idx + 1) > _sparse_table_shard_num) { _real_local_shard_num = @@ -64,14 +64,14 @@ int32_t MemorySparseTable::initialize_value() { return 0; } -int32_t MemorySparseTable::load(const std::string& path, +int32_t MemorySparseTable::Load(const std::string& path, const std::string& param) { - std::string table_path = table_dir(path); + std::string table_path = TableDir(path); auto file_list = _afs_client.list(table_path); std::sort(file_list.begin(), file_list.end()); for (auto file : file_list) { - VLOG(1) << "MemorySparseTable::load() file list: " << file; + VLOG(1) << "MemorySparseTable::Load() file list: " << file; } int load_param = atoi(param.c_str()); @@ -89,7 +89,7 @@ int32_t MemorySparseTable::load(const std::string& path, size_t file_start_idx = _shard_idx * _avg_local_shard_num; size_t feature_value_size = - _value_accesor->GetTableInfo(SIZE) / sizeof(float); + _value_accesor->GetAccessorInfo().size / sizeof(float); int thread_num = _real_local_shard_num < 15 ? 
_real_local_shard_num : 15; omp_set_num_threads(thread_num); @@ -99,9 +99,9 @@ int32_t MemorySparseTable::load(const std::string& path, channel_config.path = file_list[file_start_idx + i]; VLOG(1) << "MemorySparseTable::load begin load " << channel_config.path << " into local shard " << i; - channel_config.converter = _value_accesor->converter(load_param).converter; + channel_config.converter = _value_accesor->Converter(load_param).converter; channel_config.deconverter = - _value_accesor->converter(load_param).deconverter; + _value_accesor->Converter(load_param).deconverter; bool is_read_failed = false; int retry_num = 0; @@ -119,8 +119,7 @@ int32_t MemorySparseTable::load(const std::string& path, uint64_t key = std::strtoul(line_data.data(), &end, 10); auto& value = shard[key]; value.resize(feature_value_size); - int parse_size = - _value_accesor->parse_from_string(++end, value.data()); + int parse_size = _value_accesor->ParseFromString(++end, value.data()); value.resize(parse_size); // for debug @@ -155,9 +154,9 @@ int32_t MemorySparseTable::load(const std::string& path, return 0; } -int32_t MemorySparseTable::load_local_fs(const std::string& path, - const std::string& param) { - std::string table_path = table_dir(path); +int32_t MemorySparseTable::LoadLocalFS(const std::string& path, + const std::string& param) { + std::string table_path = TableDir(path); auto file_list = paddle::framework::localfs_list(table_path); int load_param = atoi(param.c_str()); @@ -175,7 +174,7 @@ int32_t MemorySparseTable::load_local_fs(const std::string& path, size_t file_start_idx = _shard_idx * _avg_local_shard_num; size_t feature_value_size = - _value_accesor->GetTableInfo(SIZE) / sizeof(float); + _value_accesor->GetAccessorInfo().size / sizeof(float); int thread_num = _real_local_shard_num < 15 ? 
_real_local_shard_num : 15; omp_set_num_threads(thread_num); @@ -196,8 +195,7 @@ int32_t MemorySparseTable::load_local_fs(const std::string& path, uint64_t key = std::strtoul(line_data.data(), &end, 10); auto& value = shard[key]; value.resize(feature_value_size); - int parse_size = - _value_accesor->parse_from_string(++end, value.data()); + int parse_size = _value_accesor->ParseFromString(++end, value.data()); value.resize(parse_size); } file.close(); @@ -227,12 +225,12 @@ int32_t MemorySparseTable::load_local_fs(const std::string& path, return 0; } -int32_t MemorySparseTable::save(const std::string& dirname, +int32_t MemorySparseTable::Save(const std::string& dirname, const std::string& param) { VLOG(0) << "MemorySparseTable::save dirname: " << dirname; int save_param = atoi(param.c_str()); // checkpoint:0 xbox delta:1 xbox base:2 - std::string table_path = table_dir(dirname); + std::string table_path = TableDir(dirname); _afs_client.remove(paddle::string::format_string( "%s/part-%03d-*", table_path.c_str(), _shard_idx)); std::atomic feasign_size_all{0}; @@ -253,9 +251,9 @@ int32_t MemorySparseTable::save(const std::string& dirname, paddle::string::format_string("%s/part-%03d-%05d", table_path.c_str(), _shard_idx, file_start_idx + i); } - channel_config.converter = _value_accesor->converter(save_param).converter; + channel_config.converter = _value_accesor->Converter(save_param).converter; channel_config.deconverter = - _value_accesor->converter(save_param).deconverter; + _value_accesor->Converter(save_param).deconverter; bool is_write_failed = false; int feasign_size = 0; int retry_num = 0; @@ -268,8 +266,8 @@ int32_t MemorySparseTable::save(const std::string& dirname, auto write_channel = _afs_client.open_w(channel_config, 1024 * 1024 * 40, &err_no); for (auto it = shard.begin(); it != shard.end(); ++it) { - if (_value_accesor->save(it.value().data(), save_param)) { - std::string format_value = _value_accesor->parse_to_string( + if (_value_accesor->Save(it.value().data(), save_param)) { + std::string format_value = _value_accesor->ParseToString( it.value().data(), it.value().size()); if (0 != write_channel->write_line(paddle::string::format_string( @@ -302,7 +300,7 @@ int32_t MemorySparseTable::save(const std::string& dirname, } while (is_write_failed); feasign_size_all += feasign_size; for (auto it = shard.begin(); it != shard.end(); ++it) { - _value_accesor->update_stat_after_save(it.value().data(), save_param); + _value_accesor->UpdateStatAfterSave(it.value().data(), save_param); } LOG(INFO) << "MemorySparseTable save prefix success, path: " << channel_config.path; @@ -311,12 +309,12 @@ int32_t MemorySparseTable::save(const std::string& dirname, return 0; } -int32_t MemorySparseTable::save_local_fs(const std::string& dirname, - const std::string& param, - const std::string& prefix) { +int32_t MemorySparseTable::SaveLocalFS(const std::string& dirname, + const std::string& param, + const std::string& prefix) { int save_param = atoi(param.c_str()); // checkpoint:0 xbox delta:1 xbox base:2 - std::string table_path = table_dir(dirname); + std::string table_path = TableDir(dirname); int feasign_cnt = 0; size_t file_start_idx = _avg_local_shard_num * _shard_idx; @@ -334,9 +332,9 @@ int32_t MemorySparseTable::save_local_fs(const std::string& dirname, std::ofstream os; os.open(file_name); for (auto it = shard.begin(); it != shard.end(); ++it) { - if (_value_accesor->save(it.value().data(), save_param)) { - std::string format_value = _value_accesor->parse_to_string( - it.value().data(), 
it.value().size()); + if (_value_accesor->Save(it.value().data(), save_param)) { + std::string format_value = + _value_accesor->ParseToString(it.value().data(), it.value().size()); std::string out_line = paddle::string::format_string( "%lu %s\n", it.key(), format_value.c_str()); // VLOG(2) << out_line.c_str(); @@ -351,7 +349,7 @@ int32_t MemorySparseTable::save_local_fs(const std::string& dirname, return 0; } -int64_t MemorySparseTable::local_size() { +int64_t MemorySparseTable::LocalSize() { int64_t local_size = 0; for (size_t i = 0; i < _real_local_shard_num; ++i) { local_size += _local_shards[i].size(); @@ -359,7 +357,7 @@ int64_t MemorySparseTable::local_size() { return local_size; } -int64_t MemorySparseTable::local_mf_size() { +int64_t MemorySparseTable::LocalMFSize() { std::vector size_arr(_real_local_shard_num, 0); std::vector> tasks(_real_local_shard_num); int64_t ret_size = 0; @@ -370,7 +368,7 @@ int64_t MemorySparseTable::local_mf_size() { auto& local_shard = _local_shards[shard_id]; for (auto it = local_shard.begin(); it != local_shard.end(); ++it) { - if (_value_accesor->has_mf(it.value().size())) { + if (_value_accesor->HasMF(it.value().size())) { size_arr[shard_id] += 1; } } @@ -386,9 +384,9 @@ int64_t MemorySparseTable::local_mf_size() { return ret_size; } -std::pair MemorySparseTable::print_table_stat() { - int64_t feasign_size = local_size(); - int64_t mf_size = local_mf_size(); +std::pair MemorySparseTable::PrintTableStat() { + int64_t feasign_size = LocalSize(); + int64_t mf_size = LocalMFSize(); return {feasign_size, mf_size}; } @@ -397,30 +395,36 @@ int32_t MemorySparseTable::Pull(TableContext& context) { if (context.use_ptr) { char** pull_values = context.pull_context.ptr_values; const uint64_t* keys = context.pull_context.keys; - return pull_sparse_ptr(pull_values, keys, context.num); + return PullSparsePtr(pull_values, keys, context.num); } else { float* pull_values = context.pull_context.values; const PullSparseValue& pull_value = context.pull_context.pull_value; - return pull_sparse(pull_values, pull_value); + return PullSparse(pull_values, pull_value); } } int32_t MemorySparseTable::Push(TableContext& context) { CHECK(context.value_type == Sparse); - - const uint64_t* keys = context.push_context.keys; - return push_sparse(keys, context.push_context.values, context.num); + if (!context.use_ptr) { + return PushSparse(context.push_context.keys, context.push_context.values, + context.num); + } else { + return PushSparse(context.push_context.keys, + context.push_context.ptr_values, context.num); + } } -int32_t MemorySparseTable::pull_sparse(float* pull_values, - const PullSparseValue& pull_value) { +int32_t MemorySparseTable::PullSparse(float* pull_values, + const PullSparseValue& pull_value) { CostTimer timer("pserver_sparse_select_all"); std::vector> tasks(_real_local_shard_num); - const size_t value_size = _value_accesor->GetTableInfo(SIZE) / sizeof(float); - size_t mf_value_size = _value_accesor->GetTableInfo(MF_SIZE) / sizeof(float); + const size_t value_size = + _value_accesor->GetAccessorInfo().size / sizeof(float); + size_t mf_value_size = + _value_accesor->GetAccessorInfo().mf_size / sizeof(float); size_t select_value_size = - _value_accesor->GetTableInfo(SELECT_SIZE) / sizeof(float); + _value_accesor->GetAccessorInfo().select_size / sizeof(float); // std::atomic missed_keys{0}; std::vector>> task_keys( @@ -453,7 +457,7 @@ int32_t MemorySparseTable::pull_sparse(float* pull_values, auto& feature_value = local_shard[key]; feature_value.resize(data_size); 
float* data_ptr = feature_value.data(); - _value_accesor->create(&data_buffer_ptr, 1); + _value_accesor->Create(&data_buffer_ptr, 1); memcpy(data_ptr, data_buffer_ptr, data_size * sizeof(float)); } @@ -467,7 +471,7 @@ int32_t MemorySparseTable::pull_sparse(float* pull_values, } auto offset = keys[i].second; float* select_data = pull_values + select_value_size * offset; - _value_accesor->select(&select_data, + _value_accesor->Select(&select_data, (const float**)&data_buffer_ptr, 1); } @@ -481,11 +485,12 @@ int32_t MemorySparseTable::pull_sparse(float* pull_values, return 0; } -int32_t MemorySparseTable::pull_sparse_ptr(char** pull_values, - const uint64_t* keys, size_t num) { +int32_t MemorySparseTable::PullSparsePtr(char** pull_values, + const uint64_t* keys, size_t num) { CostTimer timer("pscore_sparse_select_all"); - size_t value_size = _value_accesor->size() / sizeof(float); - size_t mf_value_size = _value_accesor->mf_size() / sizeof(float); + size_t value_size = _value_accesor->GetAccessorInfo().size / sizeof(float); + size_t mf_value_size = + _value_accesor->GetAccessorInfo().mf_size / sizeof(float); std::vector> tasks(_real_local_shard_num); std::vector>> task_keys( @@ -514,7 +519,7 @@ int32_t MemorySparseTable::pull_sparse_ptr(char** pull_values, auto& feature_value = local_shard[key]; feature_value.resize(data_size); float* data_ptr = feature_value.data(); - _value_accesor->create(&data_buffer_ptr, 1); + _value_accesor->Create(&data_buffer_ptr, 1); memcpy(data_ptr, data_buffer_ptr, data_size * sizeof(float)); ret = &feature_value; } else { @@ -532,8 +537,8 @@ int32_t MemorySparseTable::pull_sparse_ptr(char** pull_values, return 0; } -int32_t MemorySparseTable::push_sparse(const uint64_t* keys, - const float* values, size_t num) { +int32_t MemorySparseTable::PushSparse(const uint64_t* keys, const float* values, + size_t num) { CostTimer timer("pserver_sparse_update_all"); std::vector> tasks(_real_local_shard_num); std::vector>> task_keys( @@ -543,10 +548,12 @@ int32_t MemorySparseTable::push_sparse(const uint64_t* keys, task_keys[shard_id].push_back({keys[i], i}); } - const size_t value_col = _value_accesor->GetTableInfo(SIZE) / sizeof(float); - size_t mf_value_col = _value_accesor->GetTableInfo(MF_SIZE) / sizeof(float); + const size_t value_col = + _value_accesor->GetAccessorInfo().size / sizeof(float); + size_t mf_value_col = + _value_accesor->GetAccessorInfo().mf_size / sizeof(float); size_t update_value_col = - _value_accesor->GetTableInfo(UPDATE_SIZE) / sizeof(float); + _value_accesor->GetAccessorInfo().update_size / sizeof(float); for (size_t shard_id = 0; shard_id < _real_local_shard_num; ++shard_id) { tasks[shard_id] = _shards_task_pool[shard_id % _task_pool_size]->enqueue( @@ -564,13 +571,13 @@ int32_t MemorySparseTable::push_sparse(const uint64_t* keys, auto itr = local_shard.find(key); if (itr == local_shard.end()) { if (FLAGS_pserver_enable_create_feasign_randomly && - !_value_accesor->create_value(1, update_data)) { + !_value_accesor->CreateValue(1, update_data)) { continue; } auto value_size = value_col - mf_value_col; auto& feature_value = local_shard[key]; feature_value.resize(value_size); - _value_accesor->create(&data_buffer_ptr, 1); + _value_accesor->Create(&data_buffer_ptr, 1); memcpy(feature_value.data(), data_buffer_ptr, value_size * sizeof(float)); itr = local_shard.find(key); @@ -581,16 +588,16 @@ int32_t MemorySparseTable::push_sparse(const uint64_t* keys, size_t value_size = feature_value.size(); if (value_size == value_col) { // 已拓展到最大size, 则就地update - 
_value_accesor->update(&value_data, &update_data, 1); + _value_accesor->Update(&value_data, &update_data, 1); } else { // 拷入buffer区进行update,然后再回填,不需要的mf则回填时抛弃了 memcpy(data_buffer_ptr, value_data, value_size * sizeof(float)); - _value_accesor->update(&data_buffer_ptr, &update_data, 1); + _value_accesor->Update(&data_buffer_ptr, &update_data, 1); - if (_value_accesor->need_extend_mf(data_buffer)) { + if (_value_accesor->NeedExtendMF(data_buffer)) { feature_value.resize(value_col); value_data = feature_value.data(); - _value_accesor->create(&value_data, 1); + _value_accesor->Create(&value_data, 1); } memcpy(value_data, data_buffer_ptr, value_size * sizeof(float)); } @@ -605,14 +612,8 @@ int32_t MemorySparseTable::push_sparse(const uint64_t* keys, return 0; } -int32_t MemorySparseTable::push_sparse(const uint64_t* keys, - const float** values, size_t num) { - _push_sparse(keys, values, num); - return 0; -} - -int32_t MemorySparseTable::_push_sparse(const uint64_t* keys, - const float** values, size_t num) { +int32_t MemorySparseTable::PushSparse(const uint64_t* keys, + const float** values, size_t num) { std::vector> tasks(_real_local_shard_num); std::vector>> task_keys( _real_local_shard_num); @@ -621,10 +622,11 @@ int32_t MemorySparseTable::_push_sparse(const uint64_t* keys, task_keys[shard_id].push_back({keys[i], i}); } - size_t value_col = _value_accesor->GetTableInfo(SIZE) / sizeof(float); - size_t mf_value_col = _value_accesor->GetTableInfo(MF_SIZE) / sizeof(float); + size_t value_col = _value_accesor->GetAccessorInfo().size / sizeof(float); + size_t mf_value_col = + _value_accesor->GetAccessorInfo().mf_size / sizeof(float); size_t update_value_col = - _value_accesor->GetTableInfo(UPDATE_SIZE) / sizeof(float); + _value_accesor->GetAccessorInfo().update_size / sizeof(float); for (int shard_id = 0; shard_id < _real_local_shard_num; ++shard_id) { tasks[shard_id] = _shards_task_pool[shard_id % _task_pool_size]->enqueue( @@ -641,13 +643,13 @@ int32_t MemorySparseTable::_push_sparse(const uint64_t* keys, auto itr = local_shard.find(key); if (itr == local_shard.end()) { if (FLAGS_pserver_enable_create_feasign_randomly && - !_value_accesor->create_value(1, update_data)) { + !_value_accesor->CreateValue(1, update_data)) { continue; } auto value_size = value_col - mf_value_col; auto& feature_value = local_shard[key]; feature_value.resize(value_size); - _value_accesor->create(&data_buffer_ptr, 1); + _value_accesor->Create(&data_buffer_ptr, 1); memcpy(feature_value.data(), data_buffer_ptr, value_size * sizeof(float)); itr = local_shard.find(key); @@ -656,15 +658,15 @@ int32_t MemorySparseTable::_push_sparse(const uint64_t* keys, float* value_data = feature_value.data(); size_t value_size = feature_value.size(); if (value_size == value_col) { // 已拓展到最大size, 则就地update - _value_accesor->update(&value_data, &update_data, 1); + _value_accesor->Update(&value_data, &update_data, 1); } else { // 拷入buffer区进行update,然后再回填,不需要的mf则回填时抛弃了 memcpy(data_buffer_ptr, value_data, value_size * sizeof(float)); - _value_accesor->update(&data_buffer_ptr, &update_data, 1); - if (_value_accesor->need_extend_mf(data_buffer)) { + _value_accesor->Update(&data_buffer_ptr, &update_data, 1); + if (_value_accesor->NeedExtendMF(data_buffer)) { feature_value.resize(value_col); value_data = feature_value.data(); - _value_accesor->create(&value_data, 1); + _value_accesor->Create(&value_data, 1); } memcpy(value_data, data_buffer_ptr, value_size * sizeof(float)); } @@ -679,16 +681,16 @@ int32_t MemorySparseTable::_push_sparse(const 
uint64_t* keys, return 0; } -int32_t MemorySparseTable::flush() { return 0; } +int32_t MemorySparseTable::Flush() { return 0; } -int32_t MemorySparseTable::shrink(const std::string& param) { - VLOG(0) << "MemorySparseTable::shrink"; +int32_t MemorySparseTable::Shrink(const std::string& param) { + VLOG(0) << "MemorySparseTable::Shrink"; // TODO(zhaocaibei123): implement with multi-thread for (int shard_id = 0; shard_id < _real_local_shard_num; ++shard_id) { - // shrink + // Shrink auto& shard = _local_shards[shard_id]; for (auto it = shard.begin(); it != shard.end();) { - if (_value_accesor->shrink(it.value().data())) { + if (_value_accesor->Shrink(it.value().data())) { it = shard.erase(it); } else { ++it; @@ -698,7 +700,7 @@ int32_t MemorySparseTable::shrink(const std::string& param) { return 0; } -void MemorySparseTable::clear() { VLOG(0) << "clear coming soon"; } +void MemorySparseTable::Clear() { VLOG(0) << "clear coming soon"; } } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/memory_sparse_table.h b/paddle/fluid/distributed/ps/table/memory_sparse_table.h index d26c67319760d..87a73bd22fa2f 100644 --- a/paddle/fluid/distributed/ps/table/memory_sparse_table.h +++ b/paddle/fluid/distributed/ps/table/memory_sparse_table.h @@ -34,57 +34,61 @@ namespace paddle { namespace distributed { -class MemorySparseTable : public SparseTable { +class MemorySparseTable : public Table { public: typedef SparseTableShard shard_type; MemorySparseTable() {} virtual ~MemorySparseTable() {} - // unused method begin - virtual int32_t pull_dense(float* pull_values, size_t num) { return 0; } - virtual int32_t push_dense_param(const float* values, size_t num) { - return 0; - } - virtual int32_t push_dense(const float* values, size_t num) { return 0; } // unused method end + static int32_t sparse_local_shard_num(uint32_t shard_num, + uint32_t server_num) { + if (shard_num % server_num == 0) { + return shard_num / server_num; + } + size_t local_shard_num = shard_num / server_num + 1; + return local_shard_num; + } + + static size_t get_sparse_shard(uint32_t shard_num, uint32_t server_num, + uint64_t key) { + return (key % shard_num) / sparse_local_shard_num(shard_num, server_num); + } - virtual int32_t Pull(TableContext& context); - virtual int32_t Push(TableContext& context); + int32_t Pull(TableContext& context) override; + int32_t Push(TableContext& context) override; - virtual int32_t initialize(); - virtual int32_t initialize_shard() { return 0; } - virtual int32_t initialize_value(); + int32_t Initialize() override; + int32_t InitializeShard() override { return 0; } + int32_t InitializeValue(); - virtual int32_t load(const std::string& path, const std::string& param); + int32_t Load(const std::string& path, const std::string& param) override; - virtual int32_t save(const std::string& path, const std::string& param); + int32_t Save(const std::string& path, const std::string& param) override; - int32_t load_local_fs(const std::string& path, const std::string& param); - int32_t save_local_fs(const std::string& path, const std::string& param, - const std::string& prefix); + int32_t LoadLocalFS(const std::string& path, const std::string& param); + int32_t SaveLocalFS(const std::string& path, const std::string& param, + const std::string& prefix); - int64_t local_size(); - int64_t local_mf_size(); + int64_t LocalSize(); + int64_t LocalMFSize(); - virtual std::pair print_table_stat(); - virtual int32_t pull_sparse(float* values, const PullSparseValue& pull_value); + 
std::pair PrintTableStat() override; + int32_t PullSparse(float* values, const PullSparseValue& pull_value); - virtual int32_t pull_sparse_ptr(char** pull_values, const uint64_t* keys, - size_t num); + int32_t PullSparsePtr(char** pull_values, const uint64_t* keys, size_t num); - virtual int32_t push_sparse(const uint64_t* keys, const float* values, - size_t num); + int32_t PushSparse(const uint64_t* keys, const float* values, size_t num); - virtual int32_t push_sparse(const uint64_t* keys, const float** values, - size_t num); + int32_t PushSparse(const uint64_t* keys, const float** values, size_t num); - virtual int32_t flush(); - virtual int32_t shrink(const std::string& param); - virtual void clear(); + int32_t Flush() override; + int32_t Shrink(const std::string& param) override; + void Clear() override; - protected: - virtual int32_t _push_sparse(const uint64_t* keys, const float** values, - size_t num); + void* GetShard(size_t shard_idx) override { + return &_local_shards[shard_idx]; + } protected: const int _task_pool_size = 24; diff --git a/paddle/fluid/distributed/ps/table/sparse_accessor.cc b/paddle/fluid/distributed/ps/table/sparse_accessor.cc index 651ff9d00e49a..bc537880f1c21 100644 --- a/paddle/fluid/distributed/ps/table/sparse_accessor.cc +++ b/paddle/fluid/distributed/ps/table/sparse_accessor.cc @@ -20,92 +20,39 @@ namespace paddle { namespace distributed { -int SparseAccessor::initialize() { +int SparseAccessor::Initialize() { auto name = _config.embed_sgd_param().name(); _embed_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embed_sgd_rule->load_config(_config.embed_sgd_param(), 1); + _embed_sgd_rule->LoadConfig(_config.embed_sgd_param(), 1); name = _config.embedx_sgd_param().name(); _embedx_sgd_rule = CREATE_PSCORE_CLASS(SparseValueSGDRule, name); - _embedx_sgd_rule->load_config(_config.embedx_sgd_param(), - _config.embedx_dim()); + _embedx_sgd_rule->LoadConfig(_config.embedx_sgd_param(), + _config.embedx_dim()); - sparse_feature_value.embed_sgd_dim = _embed_sgd_rule->dim(); + sparse_feature_value.embed_sgd_dim = _embed_sgd_rule->Dim(); sparse_feature_value.embedx_dim = _config.embedx_dim(); - sparse_feature_value.embedx_sgd_dim = _embedx_sgd_rule->dim(); + sparse_feature_value.embedx_sgd_dim = _embedx_sgd_rule->Dim(); _show_click_decay_rate = _config.ctr_accessor_param().show_click_decay_rate(); + InitAccessorInfo(); return 0; } -void SparseAccessor::SetTableInfo(AccessorInfo& info) { - info.dim = dim(); - info.size = size(); - info.select_dim = select_dim(); - info.select_size = select_size(); - info.update_dim = update_dim(); - info.update_size = update_size(); - info.mf_size = mf_size(); - info.fea_dim = fea_dim(); -} - -size_t SparseAccessor::GetTableInfo(InfoKey key) { - switch (key) { - case DIM: - return dim(); - case SIZE: - return size(); - case SELECT_DIM: - return select_dim(); - case SELECT_SIZE: - return select_size(); - case UPDATE_DIM: - return update_dim(); - case UPDATE_SIZE: - return update_size(); - case MF_SIZE: - return mf_size(); - case FEA_DIM: - return fea_dim(); - } - return 0; -} - -size_t SparseAccessor::dim() { return sparse_feature_value.dim(); } - -size_t SparseAccessor::dim_size(size_t dim) { - auto embedx_dim = _config.embedx_dim(); - return sparse_feature_value.dim_size(dim, embedx_dim); -} - -size_t SparseAccessor::size() { return sparse_feature_value.size(); } - -size_t SparseAccessor::mf_size() { - return (_config.embedx_dim() + sparse_feature_value.embedx_sgd_dim) * - sizeof(float); // embedx embedx_g2sum -} - -// pull 
value -size_t SparseAccessor::select_dim() { - auto embedx_dim = _config.embedx_dim(); - return 1 + embedx_dim; -} - -size_t SparseAccessor::select_dim_size(size_t dim) { return sizeof(float); } - -size_t SparseAccessor::select_size() { return select_dim() * sizeof(float); } - -// push value -size_t SparseAccessor::update_dim() { +void SparseAccessor::InitAccessorInfo() { + _accessor_info.dim = sparse_feature_value.Dim(); + _accessor_info.size = sparse_feature_value.Size(); auto embedx_dim = _config.embedx_dim(); - return 4 + embedx_dim; + _accessor_info.select_dim = 1 + embedx_dim; + _accessor_info.select_size = _accessor_info.select_dim * sizeof(float); + ; + _accessor_info.update_dim = 4 + embedx_dim; + _accessor_info.update_size = _accessor_info.update_dim * sizeof(float); + _accessor_info.mf_size = + (embedx_dim + sparse_feature_value.embedx_sgd_dim) * sizeof(float); } -size_t SparseAccessor::update_dim_size(size_t dim) { return sizeof(float); } - -size_t SparseAccessor::update_size() { return update_dim() * sizeof(float); } - -bool SparseAccessor::shrink(float* value) { +bool SparseAccessor::Shrink(float* value) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delete_after_unseen_days = @@ -113,20 +60,20 @@ bool SparseAccessor::shrink(float* value) { auto delete_threshold = _config.ctr_accessor_param().delete_threshold(); // time_decay first - sparse_feature_value.show(value) *= _show_click_decay_rate; - sparse_feature_value.click(value) *= _show_click_decay_rate; + sparse_feature_value.Show(value) *= _show_click_decay_rate; + sparse_feature_value.Click(value) *= _show_click_decay_rate; // shrink after - auto score = show_click_score(sparse_feature_value.show(value), - sparse_feature_value.click(value)); - auto unseen_days = sparse_feature_value.unseen_days(value); + auto score = ShowClickScore(sparse_feature_value.Show(value), + sparse_feature_value.Click(value)); + auto unseen_days = sparse_feature_value.UnseenDays(value); if (score < delete_threshold || unseen_days > delete_after_unseen_days) { return true; } return false; } -bool SparseAccessor::save(float* value, int param) { +bool SparseAccessor::Save(float* value, int param) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); @@ -142,14 +89,13 @@ bool SparseAccessor::save(float* value, int param) { case 1: // save xbox base case 2: { - if (show_click_score(sparse_feature_value.show(value), - sparse_feature_value.click(value)) >= - base_threshold && - sparse_feature_value.delta_score(value) >= delta_threshold && - sparse_feature_value.unseen_days(value) <= delta_keep_days) { + if (ShowClickScore(sparse_feature_value.Show(value), + sparse_feature_value.Click(value)) >= base_threshold && + sparse_feature_value.DeltaScore(value) >= delta_threshold && + sparse_feature_value.UnseenDays(value) <= delta_keep_days) { // do this after save, because it must not be modified when retry if (param == 2) { - sparse_feature_value.delta_score(value) = 0; + sparse_feature_value.DeltaScore(value) = 0; } return true; } else { @@ -159,7 +105,7 @@ bool SparseAccessor::save(float* value, int param) { // already decayed in shrink case 3: { // do this after save, because it must not be modified when retry - // sparse_feature_value.unseen_days(value)++; + // sparse_feature_value.UnseenDays(value)++; 
return true; } // save revert batch_model @@ -171,7 +117,7 @@ bool SparseAccessor::save(float* value, int param) { } } -void SparseAccessor::update_stat_after_save(float* value, int param) { +void SparseAccessor::UpdateStatAfterSave(float* value, int param) { auto base_threshold = _config.ctr_accessor_param().base_threshold(); auto delta_threshold = _config.ctr_accessor_param().delta_threshold(); auto delta_keep_days = _config.ctr_accessor_param().delta_keep_days(); @@ -180,17 +126,16 @@ void SparseAccessor::update_stat_after_save(float* value, int param) { } switch (param) { case 1: { - if (show_click_score(sparse_feature_value.show(value), - sparse_feature_value.click(value)) >= - base_threshold && - sparse_feature_value.delta_score(value) >= delta_threshold && - sparse_feature_value.unseen_days(value) <= delta_keep_days) { - sparse_feature_value.delta_score(value) = 0; + if (ShowClickScore(sparse_feature_value.Show(value), + sparse_feature_value.Click(value)) >= base_threshold && + sparse_feature_value.DeltaScore(value) >= delta_threshold && + sparse_feature_value.UnseenDays(value) <= delta_keep_days) { + sparse_feature_value.DeltaScore(value) = 0; } } return; case 3: { - sparse_feature_value.unseen_days(value)++; + sparse_feature_value.UnseenDays(value)++; } return; default: @@ -198,48 +143,47 @@ void SparseAccessor::update_stat_after_save(float* value, int param) { } } -int32_t SparseAccessor::create(float** values, size_t num) { +int32_t SparseAccessor::Create(float** values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* value = values[value_item]; - value[sparse_feature_value.unseen_days_index()] = 0; - value[sparse_feature_value.delta_score_index()] = 0; - value[sparse_feature_value.show_index()] = 0; - value[sparse_feature_value.click_index()] = 0; - value[sparse_feature_value.slot_index()] = -1; - _embed_sgd_rule->init_value( - value + sparse_feature_value.embed_w_index(), - value + sparse_feature_value.embed_g2sum_index()); - _embedx_sgd_rule->init_value( - value + sparse_feature_value.embedx_w_index(), - value + sparse_feature_value.embedx_g2sum_index(), false); + value[sparse_feature_value.UnseenDaysIndex()] = 0; + value[sparse_feature_value.DeltaScoreIndex()] = 0; + value[sparse_feature_value.ShowIndex()] = 0; + value[sparse_feature_value.ClickIndex()] = 0; + value[sparse_feature_value.SlotIndex()] = -1; + _embed_sgd_rule->InitValue(value + sparse_feature_value.EmbedWIndex(), + value + sparse_feature_value.EmbedG2SumIndex()); + _embedx_sgd_rule->InitValue(value + sparse_feature_value.EmbedxWIndex(), + value + sparse_feature_value.EmbedxG2SumIndex(), + false); } return 0; } -bool SparseAccessor::need_extend_mf(float* value) { - float show = value[sparse_feature_value.show_index()]; - float click = value[sparse_feature_value.click_index()]; +bool SparseAccessor::NeedExtendMF(float* value) { + float show = value[sparse_feature_value.ShowIndex()]; + float click = value[sparse_feature_value.ClickIndex()]; float score = (show - click) * _config.ctr_accessor_param().nonclk_coeff() + click * _config.ctr_accessor_param().click_coeff(); return score >= _config.embedx_threshold(); } -bool SparseAccessor::has_mf(size_t size) { - return size > sparse_feature_value.embedx_g2sum_index(); +bool SparseAccessor::HasMF(size_t size) { + return size > sparse_feature_value.EmbedxG2SumIndex(); } // from SparseFeatureValue to SparsePullValue -int32_t SparseAccessor::select(float** select_values, const float** values, 
+int32_t SparseAccessor::Select(float** select_values, const float** values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* select_value = select_values[value_item]; const float* value = values[value_item]; - select_value[SparsePullValue::embed_w_index()] = - value[sparse_feature_value.embed_w_index()]; - memcpy(select_value + SparsePullValue::embedx_w_index(), - value + sparse_feature_value.embedx_w_index(), + select_value[SparsePullValue::EmbedWIndex()] = + value[sparse_feature_value.EmbedWIndex()]; + memcpy(select_value + SparsePullValue::EmbedxWIndex(), + value + sparse_feature_value.EmbedxWIndex(), embedx_dim * sizeof(float)); } return 0; @@ -248,15 +192,15 @@ int32_t SparseAccessor::select(float** select_values, const float** values, // from SparsePushValue to SparsePushValue // first dim: item // second dim: field num -int32_t SparseAccessor::merge(float** update_values, +int32_t SparseAccessor::Merge(float** update_values, const float** other_update_values, size_t num) { auto embedx_dim = _config.embedx_dim(); - size_t total_dim = SparsePushValue::dim(embedx_dim); + size_t total_dim = SparsePushValue::Dim(embedx_dim); for (size_t value_item = 0; value_item < num; ++value_item) { float* update_value = update_values[value_item]; const float* other_update_value = other_update_values[value_item]; for (auto i = 0u; i < total_dim; ++i) { - if (i != SparsePushValue::slot_index()) { + if (i != SparsePushValue::SlotIndex()) { update_value[i] += other_update_value[i]; } } @@ -267,44 +211,44 @@ int32_t SparseAccessor::merge(float** update_values, // from SparsePushValue to SparseFeatureValue // first dim: item // second dim: field num -int32_t SparseAccessor::update(float** update_values, const float** push_values, +int32_t SparseAccessor::Update(float** update_values, const float** push_values, size_t num) { auto embedx_dim = _config.embedx_dim(); for (size_t value_item = 0; value_item < num; ++value_item) { float* update_value = update_values[value_item]; const float* push_value = push_values[value_item]; - float push_show = push_value[SparsePushValue::show_index()]; - float push_click = push_value[SparsePushValue::click_index()]; - float slot = push_value[SparsePushValue::slot_index()]; - update_value[sparse_feature_value.show_index()] += push_show; - update_value[sparse_feature_value.click_index()] += push_click; - update_value[sparse_feature_value.slot_index()] = slot; - update_value[sparse_feature_value.delta_score_index()] += + float push_show = push_value[SparsePushValue::ShowIndex()]; + float push_click = push_value[SparsePushValue::ClickIndex()]; + float slot = push_value[SparsePushValue::SlotIndex()]; + update_value[sparse_feature_value.ShowIndex()] += push_show; + update_value[sparse_feature_value.ClickIndex()] += push_click; + update_value[sparse_feature_value.SlotIndex()] = slot; + update_value[sparse_feature_value.DeltaScoreIndex()] += (push_show - push_click) * _config.ctr_accessor_param().nonclk_coeff() + push_click * _config.ctr_accessor_param().click_coeff(); - update_value[sparse_feature_value.unseen_days_index()] = 0; - _embed_sgd_rule->update_value( - update_value + sparse_feature_value.embed_w_index(), - update_value + sparse_feature_value.embed_g2sum_index(), - push_value + SparsePushValue::embed_g_index()); - _embedx_sgd_rule->update_value( - update_value + sparse_feature_value.embedx_w_index(), - update_value + sparse_feature_value.embedx_g2sum_index(), - push_value + 
SparsePushValue::embedx_g_index()); + update_value[sparse_feature_value.UnseenDaysIndex()] = 0; + _embed_sgd_rule->UpdateValue( + update_value + sparse_feature_value.EmbedWIndex(), + update_value + sparse_feature_value.EmbedG2SumIndex(), + push_value + SparsePushValue::EmbedGIndex()); + _embedx_sgd_rule->UpdateValue( + update_value + sparse_feature_value.EmbedxWIndex(), + update_value + sparse_feature_value.EmbedxG2SumIndex(), + push_value + SparsePushValue::EmbedxGIndex()); } return 0; } -bool SparseAccessor::create_value(int stage, const float* value) { +bool SparseAccessor::CreateValue(int stage, const float* value) { // stage == 0, pull // stage == 1, push if (stage == 0) { return true; } else if (stage == 1) { // operation - auto show = SparsePushValue::show(const_cast(value)); - auto click = SparsePushValue::click(const_cast(value)); - auto score = show_click_score(show, click); + auto show = SparsePushValue::Show(const_cast(value)); + auto click = SparsePushValue::Click(const_cast(value)); + auto score = ShowClickScore(show, click); if (score <= 0) { return false; } @@ -318,41 +262,40 @@ bool SparseAccessor::create_value(int stage, const float* value) { } } -float SparseAccessor::show_click_score(float show, float click) { +float SparseAccessor::ShowClickScore(float show, float click) { auto nonclk_coeff = _config.ctr_accessor_param().nonclk_coeff(); auto click_coeff = _config.ctr_accessor_param().click_coeff(); return (show - click) * nonclk_coeff + click * click_coeff; } -std::string SparseAccessor::parse_to_string(const float* v, int param) { +std::string SparseAccessor::ParseToString(const float* v, int param) { thread_local std::ostringstream os; os.clear(); os.str(""); os << v[0] << " " << v[1] << " " << v[2] << " " << v[3] << " " << v[4] << " " << v[5]; - for (int i = sparse_feature_value.embed_g2sum_index(); - i < sparse_feature_value.embedx_w_index(); i++) { + for (int i = sparse_feature_value.EmbedG2SumIndex(); + i < sparse_feature_value.EmbedxWIndex(); i++) { os << " " << v[i]; } - auto show = sparse_feature_value.show(const_cast(v)); - auto click = sparse_feature_value.click(const_cast(v)); - auto score = show_click_score(show, click); + auto show = sparse_feature_value.Show(const_cast(v)); + auto click = sparse_feature_value.Click(const_cast(v)); + auto score = ShowClickScore(show, click); if (score >= _config.embedx_threshold() && - param > sparse_feature_value.embedx_w_index()) { - for (auto i = sparse_feature_value.embedx_w_index(); - i < sparse_feature_value.dim(); ++i) { + param > sparse_feature_value.EmbedxWIndex()) { + for (auto i = sparse_feature_value.EmbedxWIndex(); + i < sparse_feature_value.Dim(); ++i) { os << " " << v[i]; } } return os.str(); } -int SparseAccessor::parse_from_string(const std::string& str, float* value) { +int SparseAccessor::ParseFromString(const std::string& str, float* value) { int embedx_dim = _config.embedx_dim(); - _embedx_sgd_rule->init_value( - value + sparse_feature_value.embedx_w_index(), - value + sparse_feature_value.embedx_g2sum_index()); + _embedx_sgd_rule->InitValue(value + sparse_feature_value.EmbedxWIndex(), + value + sparse_feature_value.EmbedxG2SumIndex()); auto ret = paddle::string::str_to_float(str.data(), value); CHECK(ret >= 6) << "expect more than 6 real:" << ret; return ret; diff --git a/paddle/fluid/distributed/ps/table/sparse_accessor.h b/paddle/fluid/distributed/ps/table/sparse_accessor.h index cdc4c1dc6200e..5ca5d21707a2b 100644 --- a/paddle/fluid/distributed/ps/table/sparse_accessor.h +++ 
b/paddle/fluid/distributed/ps/table/sparse_accessor.h @@ -40,28 +40,28 @@ class SparseAccessor : public ValueAccessor { std::float embedx_g2sum; */ - int dim() { return 6 + embed_sgd_dim + embedx_sgd_dim + embedx_dim; } - int dim_size(size_t dim, int embedx_dim) { return sizeof(float); } - int size() { return dim() * sizeof(float); } - int slot_index() { return 0; } - int unseen_days_index() { return slot_index() + 1; } - int delta_score_index() { return unseen_days_index() + 1; } - int show_index() { return delta_score_index() + 1; } - int click_index() { return show_index() + 1; } - int embed_w_index() { return click_index() + 1; } - int embed_g2sum_index() { return embed_w_index() + 1; } - int embedx_w_index() { return embed_g2sum_index() + embed_sgd_dim; } - int embedx_g2sum_index() { return embedx_w_index() + embedx_dim; } - - float& unseen_days(float* val) { return val[unseen_days_index()]; } - float& delta_score(float* val) { return val[delta_score_index()]; } - float& show(float* val) { return val[show_index()]; } - float& click(float* val) { return val[click_index()]; } - float& slot(float* val) { return val[slot_index()]; } - float& embed_w(float* val) { return val[embed_w_index()]; } - float& embed_g2sum(float* val) { return val[embed_g2sum_index()]; } - float& embedx_w(float* val) { return val[embedx_w_index()]; } - float& embedx_g2sum(float* val) { return val[embedx_g2sum_index()]; } + int Dim() { return 6 + embed_sgd_dim + embedx_sgd_dim + embedx_dim; } + int DimSize(size_t dim, int embedx_dim) { return sizeof(float); } + int Size() { return Dim() * sizeof(float); } + int SlotIndex() { return 0; } + int UnseenDaysIndex() { return SlotIndex() + 1; } + int DeltaScoreIndex() { return UnseenDaysIndex() + 1; } + int ShowIndex() { return DeltaScoreIndex() + 1; } + int ClickIndex() { return ShowIndex() + 1; } + int EmbedWIndex() { return ClickIndex() + 1; } + int EmbedG2SumIndex() { return EmbedWIndex() + 1; } + int EmbedxWIndex() { return EmbedG2SumIndex() + embed_sgd_dim; } + int EmbedxG2SumIndex() { return EmbedxWIndex() + embedx_dim; } + + float& UnseenDays(float* val) { return val[UnseenDaysIndex()]; } + float& DeltaScore(float* val) { return val[DeltaScoreIndex()]; } + float& Show(float* val) { return val[ShowIndex()]; } + float& Click(float* val) { return val[ClickIndex()]; } + float& Slot(float* val) { return val[SlotIndex()]; } + float& EmbedW(float* val) { return val[EmbedWIndex()]; } + float& EmbedG2Sum(float* val) { return val[EmbedG2SumIndex()]; } + float& EmbedxW(float* val) { return val[EmbedxWIndex()]; } + float& EmbedxG2Sum(float* val) { return val[EmbedxG2SumIndex()]; } int embed_sgd_dim; int embedx_dim; @@ -77,29 +77,25 @@ class SparseAccessor : public ValueAccessor { std::vector embedx_g; */ - static int dim(int embedx_dim) { return 4 + embedx_dim; } - - static int dim_size(int dim, int embedx_dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int slot_index() { return 0; } - static int show_index() { return SparsePushValue::slot_index() + 1; } - static int click_index() { return SparsePushValue::show_index() + 1; } - static int embed_g_index() { return SparsePushValue::click_index() + 1; } - static int embedx_g_index() { return SparsePushValue::embed_g_index() + 1; } - static float& slot(float* val) { - return val[SparsePushValue::slot_index()]; + static int Dim(int embedx_dim) { return 4 + embedx_dim; } + + static int DimSize(int dim, int embedx_dim) { return sizeof(float); } + static int 
Size(int embedx_dim) { return Dim(embedx_dim) * sizeof(float); } + static int SlotIndex() { return 0; } + static int ShowIndex() { return SparsePushValue::SlotIndex() + 1; } + static int ClickIndex() { return SparsePushValue::ShowIndex() + 1; } + static int EmbedGIndex() { return SparsePushValue::ClickIndex() + 1; } + static int EmbedxGIndex() { return SparsePushValue::EmbedGIndex() + 1; } + static float& Slot(float* val) { return val[SparsePushValue::SlotIndex()]; } + static float& Show(float* val) { return val[SparsePushValue::ShowIndex()]; } + static float& Click(float* val) { + return val[SparsePushValue::ClickIndex()]; } - static float& show(float* val) { - return val[SparsePushValue::show_index()]; + static float& EmbedG(float* val) { + return val[SparsePushValue::EmbedGIndex()]; } - static float& click(float* val) { - return val[SparsePushValue::click_index()]; - } - static float& embed_g(float* val) { - return val[SparsePushValue::embed_g_index()]; - } - static float* embedx_g(float* val) { - return val + SparsePushValue::embedx_g_index(); + static float* EmbedxG(float* val) { + return val + SparsePushValue::EmbedxGIndex(); } }; @@ -109,88 +105,68 @@ class SparseAccessor : public ValueAccessor { std::vector embedx_w; */ - static int dim(int embedx_dim) { return 1 + embedx_dim; } - static int dim_size(size_t dim) { return sizeof(float); } - static int size(int embedx_dim) { return dim(embedx_dim) * sizeof(float); } - static int embed_w_index() { return 0; } - static int embedx_w_index() { return 1; } - static float& embed_w(float* val) { - return val[SparsePullValue::embed_w_index()]; + static int Dim(int embedx_dim) { return 1 + embedx_dim; } + static int DimSize(size_t dim) { return sizeof(float); } + static int Size(int embedx_dim) { return Dim(embedx_dim) * sizeof(float); } + static int EmbedWIndex() { return 0; } + static int EmbedxWIndex() { return 1; } + static float& EmbedW(float* val) { + return val[SparsePullValue::EmbedWIndex()]; } - static float* embedx_w(float* val) { - return val + SparsePullValue::embedx_w_index(); + static float* EmbedxW(float* val) { + return val + SparsePullValue::EmbedxWIndex(); } }; SparseAccessor() {} - virtual int initialize(); - virtual void SetTableInfo(AccessorInfo& info); - virtual size_t GetTableInfo(InfoKey key); virtual ~SparseAccessor() {} - // value维度 - virtual size_t dim(); - // value各个维度的size - virtual size_t dim_size(size_t dim); - // value各维度相加总size - virtual size_t size(); - // value中mf动态长度部分总size大小, sparse下生效 - virtual size_t mf_size(); - // pull value维度 - virtual size_t select_dim(); - // pull value各个维度的size - virtual size_t select_dim_size(size_t dim); - // pull value各维度相加总size - virtual size_t select_size(); - // push value维度 - virtual size_t update_dim(); - // push value各个维度的size - virtual size_t update_dim_size(size_t dim); - // push value各维度相加总size - virtual size_t update_size(); + virtual int Initialize(); + // 初始化AccessorInfo + virtual void InitAccessorInfo(); // 判断该value是否进行shrink - virtual bool shrink(float* value); + virtual bool Shrink(float* value); // 判断该value是否保存到ssd // virtual bool save_ssd(float* value); - virtual bool need_extend_mf(float* value); - virtual bool has_mf(size_t size); + virtual bool NeedExtendMF(float* value); + virtual bool HasMF(size_t size); // 判断该value是否在save阶段dump, // param作为参数用于标识save阶段,如downpour的xbox与batch_model // param = 0, save all feature // param = 1, save delta feature // param = 2, save xbox base feature - bool save(float* value, int param) override; + bool Save(float* value, int 
param) override; // update delta_score and unseen_days after save - void update_stat_after_save(float* value, int param) override; + void UpdateStatAfterSave(float* value, int param) override; // keys不存在时,为values生成随机值 // 要求value的内存由外部调用者分配完毕 - virtual int32_t create(float** value, size_t num); + virtual int32_t Create(float** value, size_t num); // 从values中选取到select_values中 - virtual int32_t select(float** select_values, const float** values, + virtual int32_t Select(float** select_values, const float** values, size_t num); // 将update_values聚合到一起 - virtual int32_t merge(float** update_values, + virtual int32_t Merge(float** update_values, const float** other_update_values, size_t num); // 将update_values聚合到一起,通过it.next判定是否进入下一个key - // virtual int32_t merge(float** update_values, iterator it); + // virtual int32_t Merge(float** update_values, iterator it); // 将update_values更新应用到values中 - virtual int32_t update(float** values, const float** update_values, + virtual int32_t Update(float** values, const float** update_values, size_t num); - std::string parse_to_string(const float* value, int param) override; - int32_t parse_from_string(const std::string& str, float* v) override; - virtual bool create_value(int type, const float* value); + std::string ParseToString(const float* value, int param) override; + int32_t ParseFromString(const std::string& str, float* v) override; + virtual bool CreateValue(int type, const float* value); // 这个接口目前只用来取show - float get_field(float* value, const std::string& name) override { + float GetField(float* value, const std::string& name) override { // CHECK(name == "show"); if (name == "show") { - return sparse_feature_value.show(value); + return sparse_feature_value.Show(value); } return 0.0; } private: - // float show_click_score(float show, float click); + // float ShowClickScore(float show, float click); // SparseValueSGDRule* _embed_sgd_rule; // SparseValueSGDRule* _embedx_sgd_rule; @@ -201,7 +177,7 @@ class SparseAccessor : public ValueAccessor { public: // TODO(zhaocaibei123): it should be private, but we make it public // for unit test SparseFeatureValue sparse_feature_value; - float show_click_score(float show, float click); + float ShowClickScore(float show, float click); SparseValueSGDRule* _embed_sgd_rule; SparseValueSGDRule* _embedx_sgd_rule; }; diff --git a/paddle/fluid/distributed/ps/table/sparse_geo_table.cc b/paddle/fluid/distributed/ps/table/sparse_geo_table.cc deleted file mode 100644 index 6ef4330113e8f..0000000000000 --- a/paddle/fluid/distributed/ps/table/sparse_geo_table.cc +++ /dev/null @@ -1,91 +0,0 @@ -// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
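Note: the SparseFeatureValue layout above packs the per-feature statistics (slot, unseen_days, delta_score, show, click), the embedding weight with its optimizer state, and the extended embedding into one flat float array, and every accessor method is just index arithmetic over it. A minimal standalone sketch of how those offsets resolve, using illustrative dimensions and coefficients that are assumptions here, not values taken from the patch:

// Standalone sketch (not part of the patch) of the offset arithmetic in
// SparseFeatureValue, assuming embed_sgd_dim = 1, embedx_sgd_dim = 1,
// embedx_dim = 8, nonclk_coeff = 0.1f and click_coeff = 1.0f (illustrative).
#include <cstdio>

struct FeatureValueLayout {
  int embed_sgd_dim = 1;   // per-feature optimizer state for the 1-d embedding
  int embedx_sgd_dim = 1;  // per-feature optimizer state for the extended embedding
  int embedx_dim = 8;      // width of the extended embedding

  int Dim() const { return 6 + embed_sgd_dim + embedx_sgd_dim + embedx_dim; }
  int SlotIndex() const { return 0; }
  int UnseenDaysIndex() const { return SlotIndex() + 1; }
  int DeltaScoreIndex() const { return UnseenDaysIndex() + 1; }
  int ShowIndex() const { return DeltaScoreIndex() + 1; }
  int ClickIndex() const { return ShowIndex() + 1; }
  int EmbedWIndex() const { return ClickIndex() + 1; }
  int EmbedG2SumIndex() const { return EmbedWIndex() + 1; }
  int EmbedxWIndex() const { return EmbedG2SumIndex() + embed_sgd_dim; }
  int EmbedxG2SumIndex() const { return EmbedxWIndex() + embedx_dim; }
};

// Same weighting the accessor uses for shrink/save decisions: unclicked
// impressions count with nonclk_coeff, clicks with click_coeff.
float ShowClickScore(float show, float click, float nonclk_coeff,
                     float click_coeff) {
  return (show - click) * nonclk_coeff + click * click_coeff;
}

int main() {
  FeatureValueLayout layout;
  std::printf("total dim = %d\n", layout.Dim());                  // 16
  std::printf("embedx_w starts at %d\n", layout.EmbedxWIndex());  // 7
  std::printf("score(100, 3) = %.2f\n",
              ShowClickScore(100.f, 3.f, 0.1f, 1.0f));            // 12.70
  return 0;
}

With these example dimensions a feature occupies 16 floats, and Select, Merge and Update only touch the slices they need through the *Index() helpers.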
- -#include "paddle/fluid/distributed/ps/table/sparse_geo_table.h" - -namespace paddle { -namespace distributed { - -int32_t SparseGeoTable::pull_geo_param(const uint32_t trainer_id, - std::vector* values, - std::vector* ids) { - geo_recorder->GetAndClear(trainer_id, ids); - auto dim = _config.common().dims()[0]; - - std::vector frequencies; - frequencies.resize(ids->size(), 1); - - auto pull_value = PullSparseValue(ids->size(), dim); - pull_value.is_training_ = true; - pull_value.feasigns_ = ids->data(); - pull_value.frequencies_ = frequencies.data(); - - values->resize(ids->size() * dim); - CommonSparseTable::pull_sparse(values->data(), pull_value); - return 0; -} - -int32_t SparseGeoTable::push_sparse(const uint64_t* keys, const float* values, - size_t num) { - std::vector ids; - ids.resize(num); - std::copy_n(keys, num, ids.begin()); - geo_recorder->Update(ids); - CommonSparseTable::push_sparse(keys, values, num); - return 0; -} - -int32_t SparseGeoTable::initialize_value() { - auto common = _config.common(); - shard_values_.reserve(task_pool_size_); - - for (int x = 0; x < task_pool_size_; ++x) { - auto shard = std::make_shared( - value_names_, value_dims_, value_offsets_, value_idx_, - initializer_attrs_, common.entry()); - - shard_values_.emplace_back(shard); - } - - auto accessor = _config.accessor(); - std::vector feasigns; - - for (size_t x = 0; x < accessor.fea_dim(); ++x) { - if (x % _shard_num == _shard_idx) { - feasigns.push_back(x); - } - } - - VLOG(3) << "has " << feasigns.size() << " ids need to be pre inited"; - - auto buckets = bucket(feasigns.size(), 10); - for (int x = 0; x < 10; ++x) { - auto bucket_feasigns = buckets[x + 1] - buckets[x]; - std::vector ids(bucket_feasigns); - std::copy(feasigns.begin() + buckets[x], feasigns.begin() + buckets[x + 1], - ids.begin()); - - std::vector fres; - fres.resize(ids.size(), 1); - - auto pull_value = PullSparseValue(ids, fres, param_dim_); - std::vector pulls; - pulls.resize(bucket_feasigns * param_dim_); - pull_sparse(pulls.data(), pull_value); - } - return 0; -} - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/sparse_geo_table.h b/paddle/fluid/distributed/ps/table/sparse_geo_table.h deleted file mode 100644 index 1151c9f81ac97..0000000000000 --- a/paddle/fluid/distributed/ps/table/sparse_geo_table.h +++ /dev/null @@ -1,68 +0,0 @@ -// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. 
- -#pragma once - -#include -#include -#include -#include -#include // NOLINT -#include -#include -#include - -#include "Eigen/Dense" -#include "paddle/fluid/distributed/ps/table/accessor.h" -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/common_table.h" -#include "paddle/fluid/distributed/ps/table/depends/geo_recorder.h" -#include "paddle/fluid/distributed/ps/table/depends/initializers.h" -#include "paddle/fluid/distributed/ps/table/depends/large_scale_kv.h" -#include "paddle/fluid/distributed/ps/table/depends/sparse.h" -#include "paddle/fluid/string/string_helper.h" -#include "paddle/phi/core/utils/rw_lock.h" - -namespace paddle { -namespace distributed { - -class GeoRecorder; - -class SparseGeoTable : public CommonSparseTable { - public: - explicit SparseGeoTable() : CommonSparseTable() { geo_recorder = nullptr; } - virtual ~SparseGeoTable() {} - - virtual int32_t initialize_value(); - - int32_t pull_geo_param(const uint32_t trainer_id, std::vector* values, - std::vector* keys); - - int32_t push_sparse(const uint64_t* keys, const float* values, - size_t num) override; - - virtual int32_t initialize_recorder() { - if (!geo_recorder) { - auto trainers = _config.common().trainer_num(); - geo_recorder = std::make_shared(trainers); - } - return 0; - } - - private: - std::shared_ptr geo_recorder; -}; - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/sparse_sgd_rule.cc b/paddle/fluid/distributed/ps/table/sparse_sgd_rule.cc index 3e39d6f976d12..8471b93612828 100644 --- a/paddle/fluid/distributed/ps/table/sparse_sgd_rule.cc +++ b/paddle/fluid/distributed/ps/table/sparse_sgd_rule.cc @@ -21,8 +21,8 @@ DEFINE_bool(enable_show_scale_gradient, true, "enable show scale gradient"); namespace paddle { namespace distributed { -void SparseNaiveSGDRule::load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim) { +void SparseNaiveSGDRule::LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim) { _embedding_dim = emb_dim; auto naive_param = param.naive(); learning_rate_ = naive_param.learning_rate(); @@ -39,17 +39,16 @@ void SparseNaiveSGDRule::load_config(const SparseCommonSGDRuleParameter& param, } } -void SparseNaiveSGDRule::update_value_work(float* w, float* sgd, - const float* push_value, - float scale) { +void SparseNaiveSGDRule::UpdateValueWork(float* w, float* sgd, + const float* push_value, float scale) { for (size_t i = 0; i < _embedding_dim; ++i) { w[i] -= learning_rate_ * push_value[i]; - bound_value(w[i]); + BoundValue(w[i]); } } -void SparseNaiveSGDRule::init_value_work(float* value, float* sgd, - bool zero_init) { +void SparseNaiveSGDRule::InitValueWork(float* value, float* sgd, + bool zero_init) { if (zero_init) { for (size_t i = 0; i < _embedding_dim; ++i) { value[i] = 0; @@ -60,12 +59,12 @@ void SparseNaiveSGDRule::init_value_work(float* value, float* sgd, (local_uniform_real_distribution()(local_random_engine()) * 2 - 1) * _initial_range; - bound_value(value[i]); + BoundValue(value[i]); } } } -void SparseAdaGradSGDRule::load_config( - const SparseCommonSGDRuleParameter& param, size_t emb_dim) { +void SparseAdaGradSGDRule::LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim) { _embedding_dim = emb_dim; auto adagrad_param = param.adagrad(); learning_rate_ = adagrad_param.learning_rate(); @@ -84,42 +83,42 @@ void SparseAdaGradSGDRule::load_config( } } -void SparseAdaGradSGDRule::update_value_work(float* w, float* sgd, - const 
float* grad, float scale) { - float& g2sum = sgd[g2sum_index()]; +void SparseAdaGradSGDRule::UpdateValueWork(float* w, float* sgd, + const float* grad, float scale) { + float& g2sum = sgd[G2SumIndex()]; double add_g2sum = 0; for (int i = 0; i < _embedding_dim; i++) { double scaled_grad = grad[i] / scale; w[i] -= learning_rate_ * scaled_grad * sqrt(_initial_g2sum / (_initial_g2sum + g2sum)); - bound_value(w[i]); + BoundValue(w[i]); add_g2sum += scaled_grad * scaled_grad; } g2sum += add_g2sum / _embedding_dim; } -void SparseAdaGradSGDRule::init_value_work(float* value, float* sgd, - bool zero_init) { +void SparseAdaGradSGDRule::InitValueWork(float* value, float* sgd, + bool zero_init) { for (int i = 0; i < _embedding_dim; ++i) { if (zero_init) { value[i] = 0.0; - bound_value(value[i]); + BoundValue(value[i]); } else { value[i] = (local_uniform_real_distribution()(local_random_engine()) * 2 - 1) * _initial_range; - bound_value(value[i]); + BoundValue(value[i]); } } - sgd[g2sum_index()] = 0; + sgd[G2SumIndex()] = 0; } -void StdAdaGradSGDRule::load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim) { +void StdAdaGradSGDRule::LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim) { _embedding_dim = emb_dim; auto adagrad_param = param.adagrad(); learning_rate_ = adagrad_param.learning_rate(); @@ -138,38 +137,38 @@ void StdAdaGradSGDRule::load_config(const SparseCommonSGDRuleParameter& param, } } -void StdAdaGradSGDRule::update_value_work(float* w, float* sgd, - const float* grad, float scale) { +void StdAdaGradSGDRule::UpdateValueWork(float* w, float* sgd, const float* grad, + float scale) { for (int i = 0; i < _embedding_dim; i++) { - float& g2sum = sgd[g2sum_index() + i]; + float& g2sum = sgd[G2SumIndex() + i]; double scaled_grad = grad[i] / scale; w[i] -= learning_rate_ * scaled_grad * sqrt(_initial_g2sum / (_initial_g2sum + g2sum)); - bound_value(w[i]); + BoundValue(w[i]); g2sum += scaled_grad * scaled_grad; } } -void StdAdaGradSGDRule::init_value_work(float* value, float* sgd, - bool zero_init) { +void StdAdaGradSGDRule::InitValueWork(float* value, float* sgd, + bool zero_init) { for (int i = 0; i < _embedding_dim; ++i) { if (zero_init) { value[i] = 0.0; - bound_value(value[i]); + BoundValue(value[i]); } else { value[i] = (local_uniform_real_distribution()(local_random_engine()) * 2 - 1) * _initial_range; - bound_value(value[i]); + BoundValue(value[i]); } - sgd[g2sum_index() + i] = 0; + sgd[G2SumIndex() + i] = 0; } } -void SparseAdamSGDRule::load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim) { +void SparseAdamSGDRule::LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim) { _embedding_dim = emb_dim; auto adam_param = param.adam(); learning_rate_ = adam_param.learning_rate(); @@ -189,12 +188,12 @@ void SparseAdamSGDRule::load_config(const SparseCommonSGDRuleParameter& param, } } -void SparseAdamSGDRule::update_value_work(float* w, float* sgd, - const float* grad, float scale) { - float* gsum = sgd + gsum_index(); - float* g2sum = sgd + g2sum_index(); - float* beta1_pow = sgd + beta1_pow_index(); - float* beta2_pow = sgd + beta2_pow_index(); +void SparseAdamSGDRule::UpdateValueWork(float* w, float* sgd, const float* grad, + float scale) { + float* gsum = sgd + GSumIndex(); + float* g2sum = sgd + G2SumIndex(); + float* beta1_pow = sgd + Beta1PowIndex(); + float* beta2_pow = sgd + Beta2PowIndex(); const float* g = grad; float lr = learning_rate_; @@ -209,35 +208,35 @@ void SparseAdamSGDRule::update_value_work(float* w, 
float* sgd, g2sum[i] = _beta2_decay_rate * g2sum[i] + (1 - _beta2_decay_rate) * g[i] * g[i]; w[i] = w[i] - lr * (gsum[i] / (sqrt(g2sum[i]) + _ada_epsilon)); - bound_value(w[i]); + BoundValue(w[i]); } // update beta_pow_decay (*beta1_pow) *= _beta1_decay_rate; (*beta2_pow) *= _beta2_decay_rate; } -void SparseAdamSGDRule::init_value_work(float* value, float* sgd, - bool zero_init) { +void SparseAdamSGDRule::InitValueWork(float* value, float* sgd, + bool zero_init) { for (int i = 0; i < _embedding_dim; ++i) { if (zero_init) { value[i] = 0.0; - bound_value(value[i]); + BoundValue(value[i]); } else { value[i] = (local_uniform_real_distribution()(local_random_engine()) * 2 - 1) * _initial_range; - bound_value(value[i]); + BoundValue(value[i]); } } // init rule gsum and g2sum - for (int i = gsum_index(); i < beta1_pow_index(); i++) { + for (int i = GSumIndex(); i < Beta1PowIndex(); i++) { sgd[i] = 0.0; } // init beta1_pow and beta2_pow - *(sgd + beta1_pow_index()) = _beta1_decay_rate; - *(sgd + beta2_pow_index()) = _beta2_decay_rate; + *(sgd + Beta1PowIndex()) = _beta1_decay_rate; + *(sgd + Beta2PowIndex()) = _beta2_decay_rate; } } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/sparse_sgd_rule.h b/paddle/fluid/distributed/ps/table/sparse_sgd_rule.h index ba2baa42f742a..55a37b5941921 100644 --- a/paddle/fluid/distributed/ps/table/sparse_sgd_rule.h +++ b/paddle/fluid/distributed/ps/table/sparse_sgd_rule.h @@ -28,33 +28,33 @@ class SparseValueSGDRule { public: SparseValueSGDRule() {} virtual ~SparseValueSGDRule() {} - virtual void load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim) { + virtual void LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim) { _embedding_dim = emb_dim; _name = param.name(); } - virtual void update_value_work(float* w, float* sgd, const float* push_value, - float scale) = 0; - virtual void init_value_work(float* value, float* sgd, bool zero_init) = 0; - virtual size_t dim() = 0; - const std::string& get_name() const { return _name; } - void init_value(float* value, float* sgd, bool zero_init = true) { - init_value_work(value, sgd, zero_init); + virtual void UpdateValueWork(float* w, float* sgd, const float* push_value, + float scale) = 0; + virtual void InitValueWork(float* value, float* sgd, bool zero_init) = 0; + virtual size_t Dim() = 0; + const std::string& GetName() const { return _name; } + void InitValue(float* value, float* sgd, bool zero_init = true) { + InitValueWork(value, sgd, zero_init); } - void update_value(float* w, float* sgd, const float* push_value, - float scale = 1) { - update_value_work(w, sgd, push_value, scale); + void UpdateValue(float* w, float* sgd, const float* push_value, + float scale = 1) { + UpdateValueWork(w, sgd, push_value, scale); } template - void bound_value(T& w) { // NOLINT + void BoundValue(T& w) { // NOLINT if (!(w >= _min_bound)) { w = (T)_min_bound; } else if (!(w <= _max_bound)) { w = (T)_max_bound; } } - float& min_bound() { return _min_bound; } - float& max_bound() { return _max_bound; } + float& MinBound() { return _min_bound; } + float& MaxBound() { return _max_bound; } protected: float _min_bound; @@ -70,12 +70,12 @@ REGISTER_PSCORE_REGISTERER(SparseValueSGDRule); class SparseNaiveSGDRule : public SparseValueSGDRule { public: - virtual void load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim); - virtual void update_value_work(float* w, float* sgd, const float* push_value, - float scale); - virtual void 
init_value_work(float* value, float* sgd, bool zero_init); - virtual size_t dim() { return 0; } + virtual void LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim); + virtual void UpdateValueWork(float* w, float* sgd, const float* push_value, + float scale); + virtual void InitValueWork(float* value, float* sgd, bool zero_init); + virtual size_t Dim() { return 0; } private: float learning_rate_; @@ -83,13 +83,13 @@ class SparseNaiveSGDRule : public SparseValueSGDRule { class SparseAdaGradSGDRule : public SparseValueSGDRule { public: - virtual void load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim); - virtual void update_value_work(float* w, float* sgd, const float* push_value, - float scale); - virtual void init_value_work(float* value, float* sgd, bool zero_init); - virtual size_t dim() { return 1; } - size_t g2sum_index() { return 0; } + virtual void LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim); + virtual void UpdateValueWork(float* w, float* sgd, const float* push_value, + float scale); + virtual void InitValueWork(float* value, float* sgd, bool zero_init); + virtual size_t Dim() { return 1; } + size_t G2SumIndex() { return 0; } private: float learning_rate_; @@ -98,13 +98,13 @@ class SparseAdaGradSGDRule : public SparseValueSGDRule { class StdAdaGradSGDRule : public SparseValueSGDRule { public: - virtual void load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim); - virtual void update_value_work(float* w, float* sgd, const float* push_value, - float scale); - virtual void init_value_work(float* value, float* sgd, bool zero_init); - virtual size_t dim() { return _embedding_dim; } - size_t g2sum_index() { return 0; } + virtual void LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim); + virtual void UpdateValueWork(float* w, float* sgd, const float* push_value, + float scale); + virtual void InitValueWork(float* value, float* sgd, bool zero_init); + virtual size_t Dim() { return _embedding_dim; } + size_t G2SumIndex() { return 0; } private: float learning_rate_; @@ -113,16 +113,16 @@ class StdAdaGradSGDRule : public SparseValueSGDRule { class SparseAdamSGDRule : public SparseValueSGDRule { public: - virtual void load_config(const SparseCommonSGDRuleParameter& param, - size_t emb_dim); - virtual void update_value_work(float* w, float* sgd, const float* push_value, - float scale); - virtual void init_value_work(float* value, float* sgd, bool zero_init); - virtual size_t dim() { return _embedding_dim * 2 + 2; } - size_t gsum_index() { return 0; } - size_t g2sum_index() { return gsum_index() + _embedding_dim; } - size_t beta1_pow_index() { return g2sum_index() + _embedding_dim; } - size_t beta2_pow_index() { return beta1_pow_index() + 1; } + virtual void LoadConfig(const SparseCommonSGDRuleParameter& param, + size_t emb_dim); + virtual void UpdateValueWork(float* w, float* sgd, const float* push_value, + float scale); + virtual void InitValueWork(float* value, float* sgd, bool zero_init); + virtual size_t Dim() { return _embedding_dim * 2 + 2; } + size_t GSumIndex() { return 0; } + size_t G2SumIndex() { return GSumIndex() + _embedding_dim; } + size_t Beta1PowIndex() { return G2SumIndex() + _embedding_dim; } + size_t Beta2PowIndex() { return Beta1PowIndex() + 1; } protected: float learning_rate_; diff --git a/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc b/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc deleted file mode 100644 index 5bc58bc5a1108..0000000000000 --- 
a/paddle/fluid/distributed/ps/table/ssd_sparse_table.cc +++ /dev/null @@ -1,376 +0,0 @@ -// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#ifdef PADDLE_WITH_HETERPS -#include "paddle/fluid/distributed/ps/table/ssd_sparse_table.h" - -DEFINE_string(rocksdb_path, "database", "path of sparse table rocksdb file"); - -namespace paddle { -namespace distributed { - -int32_t SSDSparseTable::initialize() { - _shards_task_pool.resize(task_pool_size_); - for (int i = 0; i < _shards_task_pool.size(); ++i) { - _shards_task_pool[i].reset(new ::ThreadPool(1)); - } - - sync = _config.common().sync(); - VLOG(1) << "table " << _config.common().table_name() << " is sync: " << sync; - - _global_lr = new float(1.0); - - auto common = _config.common(); - int size = static_cast(common.params().size()); - - size_t offset = 0; - for (int x = 0; x < size; ++x) { - auto& varname = common.params()[x]; - auto& dim = common.dims()[x]; - - value_idx_[varname] = x; - value_names_.push_back(varname); - value_dims_.push_back(dim); - value_offsets_.push_back(offset); - initializer_attrs_.push_back(common.initializers()[x]); - - if (varname == "Param") { - param_dim_ = dim; - param_offset_ = offset; - } - - offset += dim; - } - - initialize_value(); - initialize_optimizer(); - initialize_recorder(); - _db = paddle::distributed::RocksDBHandler::GetInstance(); - _db->initialize(FLAGS_rocksdb_path, task_pool_size_); - return 0; -} - -int32_t SSDSparseTable::Pull(TableContext& context) { - CHECK(context.value_type == Sparse); - if (context.use_ptr) { - char** pull_values = context.pull_context.ptr_values; - const uint64_t* keys = context.pull_context.keys; - return pull_sparse_ptr(pull_values, keys, context.num); - } else { - float* pull_values = context.pull_context.values; - const PullSparseValue& pull_value = context.pull_context.pull_value; - return pull_sparse(pull_values, pull_value); - } -} - -int32_t SSDSparseTable::Push(TableContext& context) { return 0; } - -int32_t SSDSparseTable::pull_sparse(float* pull_values, - const PullSparseValue& pull_value) { - auto shard_num = task_pool_size_; - std::vector> tasks(shard_num); - - for (int shard_id = 0; shard_id < shard_num; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, shard_num, &pull_value, &pull_values]() -> int { - auto& block = shard_values_[shard_id]; - - std::vector offsets; - pull_value.Fission(shard_id, shard_num, &offsets); - - for (auto& offset : offsets) { - auto feasign = pull_value.feasigns_[offset]; - auto frequencie = pull_value.frequencies_[offset]; - float* embedding = nullptr; - auto iter = block->Find(feasign); - // in mem - if (iter == block->end()) { - embedding = iter->second->data_.data(); - if (pull_value.is_training_) { - block->AttrUpdate(iter->second, frequencie); - } - } else { - // need create - std::string tmp_str(""); - if (_db->get(shard_id, (char*)&feasign, sizeof(uint64_t), - tmp_str) > 0) { - embedding = 
block->Init(feasign, true, frequencie); - } else { - // in db - int data_size = tmp_str.size() / sizeof(float); - int value_size = block->value_length_; - float* db_value = (float*)const_cast(tmp_str.c_str()); - VALUE* value = block->InitGet(feasign); - - // copy to mem - memcpy(value->data_.data(), db_value, - value_size * sizeof(float)); - embedding = db_value; - - // param, count, unseen_day - value->count_ = db_value[value_size]; - value->unseen_days_ = db_value[value_size + 1]; - value->is_entry_ = db_value[value_size + 2]; - if (pull_value.is_training_) { - block->AttrUpdate(value, frequencie); - } - } - } - std::copy_n(embedding + param_offset_, param_dim_, - pull_values + param_dim_ * offset); - } - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t SSDSparseTable::pull_sparse_ptr(char** pull_values, - const uint64_t* keys, size_t num) { - auto shard_num = task_pool_size_; - std::vector> tasks(shard_num); - - std::vector> offset_bucket; - offset_bucket.resize(task_pool_size_); - - for (int x = 0; x < num; ++x) { - auto y = keys[x] % task_pool_size_; - offset_bucket[y].push_back(x); - } - - for (int shard_id = 0; shard_id < shard_num; ++shard_id) { - tasks[shard_id] = _shards_task_pool[shard_id]->enqueue( - [this, shard_id, &keys, &pull_values, &offset_bucket]() -> int { - auto& block = shard_values_[shard_id]; - auto& offsets = offset_bucket[shard_id]; - - for (auto& offset : offsets) { - auto feasign = keys[offset]; - auto iter = block->Find(feasign); - VALUE* value = nullptr; - // in mem - if (iter != block->end()) { - value = iter->second; - } else { - // need create - std::string tmp_str(""); - if (_db->get(shard_id, (char*)&feasign, sizeof(uint64_t), - tmp_str) > 0) { - value = block->InitGet(feasign); - } else { - // in db - int data_size = tmp_str.size() / sizeof(float); - int value_size = block->value_length_; - float* db_value = (float*)const_cast(tmp_str.c_str()); - value = block->InitGet(feasign); - - // copy to mem - memcpy(value->data_.data(), db_value, - value_size * sizeof(float)); - - // param, count, unseen_day - value->count_ = db_value[value_size]; - value->unseen_days_ = db_value[value_size + 1]; - value->is_entry_ = db_value[value_size + 2]; - } - } - pull_values[offset] = (char*)value; - } - return 0; - }); - } - - for (size_t shard_id = 0; shard_id < tasks.size(); ++shard_id) { - tasks[shard_id].wait(); - } - return 0; -} - -int32_t SSDSparseTable::shrink(const std::string& param) { return 0; } - -int32_t SSDSparseTable::update_table() { - int count = 0; - int value_size = shard_values_[0]->value_length_; - int db_size = 3 + value_size; - float tmp_value[db_size]; - - for (size_t i = 0; i < task_pool_size_; ++i) { - auto& block = shard_values_[i]; - - for (auto& table : block->values_) { - for (auto iter = table.begin(); iter != table.end();) { - VALUE* value = iter->second; - if (value->unseen_days_ >= 1) { - tmp_value[value_size] = value->count_; - tmp_value[value_size + 1] = value->unseen_days_; - tmp_value[value_size + 2] = value->is_entry_; - memcpy(tmp_value, value->data_.data(), sizeof(float) * value_size); - _db->put(i, (char*)&(iter->first), sizeof(uint64_t), (char*)tmp_value, - db_size * sizeof(float)); - count++; - - butil::return_object(iter->second); - iter = table.erase(iter); - } else { - ++iter; - } - } - } - _db->flush(i); - } - VLOG(1) << "Table>> update count: " << count; - return 0; -} - -int64_t SSDSparseTable::SaveValueToText(std::ostream* os, - 
std::shared_ptr block, - std::shared_ptr<::ThreadPool> pool, - const int mode, int shard_id) { - int64_t save_num = 0; - - for (auto& table : block->values_) { - for (auto& value : table) { - if (mode == SaveMode::delta && !value.second->need_save_) { - continue; - } - - ++save_num; - - std::stringstream ss; - auto* vs = value.second->data_.data(); - - auto id = value.first; - - ss << id << "\t" << value.second->count_ << "\t" - << value.second->unseen_days_ << "\t" << value.second->is_entry_ - << "\t"; - - for (int i = 0; i < block->value_length_ - 1; i++) { - ss << std::to_string(vs[i]) << ","; - } - - ss << std::to_string(vs[block->value_length_ - 1]); - ss << "\n"; - - os->write(ss.str().c_str(), sizeof(char) * ss.str().size()); - - if (mode == SaveMode::base || mode == SaveMode::delta) { - value.second->need_save_ = false; - } - } - } - - if (mode != 1) { - int value_size = block->value_length_; - auto* it = _db->get_iterator(shard_id); - - for (it->SeekToFirst(); it->Valid(); it->Next()) { - float* value = (float*)const_cast(it->value().data()); - std::stringstream ss; - ss << *((uint64_t*)const_cast(it->key().data())) << "\t" - << value[value_size] << "\t" << value[value_size + 1] << "\t" - << value[value_size + 2] << "\t"; - for (int i = 0; i < block->value_length_ - 1; i++) { - ss << std::to_string(value[i]) << ","; - } - - ss << std::to_string(value[block->value_length_ - 1]); - ss << "\n"; - - os->write(ss.str().c_str(), sizeof(char) * ss.str().size()); - } - } - - return save_num; -} - -int32_t SSDSparseTable::load(const std::string& path, - const std::string& param) { - rwlock_->WRLock(); - VLOG(3) << "ssd sparse table load with " << path << " with meta " << param; - LoadFromText(path, param, _shard_idx, _shard_num, task_pool_size_, - &shard_values_); - rwlock_->UNLock(); - return 0; -} - -int64_t SSDSparseTable::LoadFromText( - const std::string& valuepath, const std::string& metapath, - const int pserver_id, const int pserver_num, const int local_shard_num, - std::vector>* blocks) { - Meta meta = Meta(metapath); - - int num_lines = 0; - std::ifstream file(valuepath); - std::string line; - - int value_size = shard_values_[0]->value_length_; - int db_size = 3 + value_size; - float tmp_value[db_size]; - - while (std::getline(file, line)) { - auto values = paddle::string::split_string(line, "\t"); - auto id = std::stoull(values[0]); - - if (id % pserver_num != pserver_id) { - VLOG(3) << "will not load " << values[0] << " from " << valuepath - << ", please check id distribution"; - continue; - } - - auto shard_id = id % local_shard_num; - auto block = blocks->at(shard_id); - - std::vector> kvalues; - ProcessALine(values, meta, id, &kvalues); - - block->Init(id, false); - - VALUE* value_instant = block->GetValue(id); - - if (values.size() == 5) { - value_instant->count_ = std::stoi(values[1]); - value_instant->unseen_days_ = std::stoi(values[2]); - value_instant->is_entry_ = static_cast(std::stoi(values[3])); - } - - std::vector block_values = block->Get(id, meta.names, meta.dims); - auto blas = GetBlas(); - for (int x = 0; x < meta.names.size(); ++x) { - blas.VCOPY(meta.dims[x], kvalues[x].data(), block_values[x]); - } - VLOG(3) << "loading: " << id - << "unseen day: " << value_instant->unseen_days_; - if (value_instant->unseen_days_ >= 1) { - tmp_value[value_size] = value_instant->count_; - tmp_value[value_size + 1] = value_instant->unseen_days_; - tmp_value[value_size + 2] = value_instant->is_entry_; - memcpy(tmp_value, value_instant->data_.data(), - sizeof(float) * 
value_size); - _db->put(shard_id, (char*)&(id), sizeof(uint64_t), (char*)tmp_value, - db_size * sizeof(float)); - block->erase(id); - } - } - - return 0; -} - -} // namespace ps -} // namespace paddle -#endif diff --git a/paddle/fluid/distributed/ps/table/ssd_sparse_table.h b/paddle/fluid/distributed/ps/table/ssd_sparse_table.h deleted file mode 100644 index 3a703d7d966d3..0000000000000 --- a/paddle/fluid/distributed/ps/table/ssd_sparse_table.h +++ /dev/null @@ -1,64 +0,0 @@ -// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -// -// Licensed under the Apache License, Version 2.0 (the "License"); -// you may not use this file except in compliance with the License. -// You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -#pragma once -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/depends/rocksdb_warpper.h" -#ifdef PADDLE_WITH_HETERPS -namespace paddle { -namespace distributed { -class SSDSparseTable : public CommonSparseTable { - public: - SSDSparseTable() {} - virtual ~SSDSparseTable() {} - - virtual int32_t initialize() override; - - void SaveMetaToText(std::ostream* os, const CommonAccessorParameter& common, - const size_t shard_idx, const int64_t total); - - int64_t SaveValueToText(std::ostream* os, std::shared_ptr block, - std::shared_ptr<::ThreadPool> pool, const int mode, - int shard_id); - - virtual int64_t LoadFromText( - const std::string& valuepath, const std::string& metapath, - const int pserver_id, const int pserver_num, const int local_shard_num, - std::vector>* blocks); - - virtual int32_t load(const std::string& path, const std::string& param); - - // exchange data - virtual int32_t update_table(); - - virtual int32_t Pull(TableContext& context); - virtual int32_t Push(TableContext& context); - - virtual int32_t pull_sparse(float* values, const PullSparseValue& pull_value); - - virtual int32_t pull_sparse_ptr(char** pull_values, const uint64_t* keys, - size_t num); - - virtual int32_t flush() override { return 0; } - virtual int32_t shrink(const std::string& param) override; - virtual void clear() override {} - - private: - RocksDBHandler* _db; - int64_t _cache_tk_size; -}; - -} // namespace ps -} // namespace paddle -#endif diff --git a/paddle/fluid/distributed/ps/table/table.cc b/paddle/fluid/distributed/ps/table/table.cc index 6faa3e2632e28..0fbdfb6fcce77 100644 --- a/paddle/fluid/distributed/ps/table/table.cc +++ b/paddle/fluid/distributed/ps/table/table.cc @@ -17,15 +17,11 @@ #include "glog/logging.h" #include "paddle/fluid/distributed/common/registerer.h" -#include "paddle/fluid/distributed/ps/table/common_dense_table.h" #include "paddle/fluid/distributed/ps/table/common_graph_table.h" -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/memory_sparse_geo_table.h" -#include "paddle/fluid/distributed/ps/table/sparse_geo_table.h" -#ifdef PADDLE_WITH_HETERPS -#include "paddle/fluid/distributed/ps/table/ssd_sparse_table.h" -#endif +#include "paddle/fluid/distributed/ps/table/memory_dense_table.h" + #include "paddle/fluid/distributed/ps/table/ctr_accessor.h" +#include 
"paddle/fluid/distributed/ps/table/memory_sparse_geo_table.h" #include "paddle/fluid/distributed/ps/table/memory_sparse_table.h" #include "paddle/fluid/distributed/ps/table/sparse_accessor.h" #include "paddle/fluid/distributed/ps/table/tensor_accessor.h" @@ -34,14 +30,11 @@ namespace paddle { namespace distributed { REGISTER_PSCORE_CLASS(Table, GraphTable); -REGISTER_PSCORE_CLASS(Table, CommonDenseTable); -REGISTER_PSCORE_CLASS(Table, CommonSparseTable); +REGISTER_PSCORE_CLASS(Table, MemoryDenseTable); #ifdef PADDLE_WITH_HETERPS -REGISTER_PSCORE_CLASS(Table, SSDSparseTable); REGISTER_PSCORE_CLASS(GraphSampler, CompleteGraphSampler); REGISTER_PSCORE_CLASS(GraphSampler, BasicBfsGraphSampler); #endif -REGISTER_PSCORE_CLASS(Table, SparseGeoTable); REGISTER_PSCORE_CLASS(Table, BarrierTable); REGISTER_PSCORE_CLASS(Table, TensorTable); REGISTER_PSCORE_CLASS(Table, DenseTensorTable); @@ -56,7 +49,7 @@ REGISTER_PSCORE_CLASS(SparseValueSGDRule, SparseAdamSGDRule); REGISTER_PSCORE_CLASS(SparseValueSGDRule, SparseNaiveSGDRule); REGISTER_PSCORE_CLASS(SparseValueSGDRule, SparseAdaGradSGDRule); -int32_t TableManager::initialize() { +int32_t TableManager::Initialize() { static bool initialized = false; if (initialized) { return 0; @@ -65,10 +58,10 @@ int32_t TableManager::initialize() { return 0; } -int32_t Table::initialize(const TableParameter &config, +int32_t Table::Initialize(const TableParameter &config, const FsClientParameter &fs_config) { _config = config; - if (initialize_accessor() != 0) { + if (InitializeAccessor() != 0) { LOG(WARNING) << "Table accessor initialize failed"; return -1; } @@ -77,10 +70,10 @@ int32_t Table::initialize(const TableParameter &config, LOG(WARNING) << "Table fs_client initialize failed"; // return -1; } - return initialize(); + return Initialize(); } -int32_t Table::initialize_accessor() { +int32_t Table::InitializeAccessor() { if (!_config.has_accessor() || !_config.accessor().has_accessor_class()) { LOG(ERROR) << "missing accessor config in table, table_id:" << _config.table_id(); @@ -97,13 +90,12 @@ int32_t Table::initialize_accessor() { << ", accessor_name:" << _config.accessor().accessor_class(); return -1; } - if (accessor->configure(_config.accessor()) || accessor->initialize() != 0) { + if (accessor->Configure(_config.accessor()) || accessor->Initialize() != 0) { LOG(ERROR) << " accessor initialize failed, table_id:" << _config.table_id() << ", accessor_name:" << _config.accessor().accessor_class(); return -1; } _value_accesor.reset(accessor); - // _value_accesor->SetTableInfo(_table_info); return 0; } diff --git a/paddle/fluid/distributed/ps/table/table.h b/paddle/fluid/distributed/ps/table/table.h index bba34d89377a7..c515e03e3fa48 100644 --- a/paddle/fluid/distributed/ps/table/table.h +++ b/paddle/fluid/distributed/ps/table/table.h @@ -35,126 +35,97 @@ namespace distributed { enum ValueType { Sparse = 0, Dense = 1 }; -struct PullContext { - const uint64_t *keys; +struct TablePullContext { + const uint64_t *keys = nullptr; PullSparseValue pull_value; - float *values; - char **ptr_values; + float *values = nullptr; + char **ptr_values = nullptr; + std::vector *geo_pull_keys = nullptr; // for GEO + std::vector *geo_pull_values = nullptr; // for GEO }; struct TablePushContext { - const uint64_t *keys; - const float *values; - const float **ptr_values; + const uint64_t *keys = nullptr; + const float *values = nullptr; + const float **ptr_values = nullptr; + const int64_t *push_steps = nullptr; // for global step + bool is_param = false; // true: push param, 
false: push gradient }; struct TableContext { ValueType value_type; - PullContext pull_context; + TablePullContext pull_context; TablePushContext push_context; size_t num; bool use_ptr = false; + uint32_t trainer_id; // for GEO and global step }; class Table { public: Table() {} virtual ~Table() {} - virtual int32_t initialize(const TableParameter &config, + virtual int32_t Initialize(const TableParameter &config, const FsClientParameter &fs_config); virtual int32_t Pull(TableContext &context) = 0; virtual int32_t Push(TableContext &context) = 0; - virtual int32_t pull_dense(float *values, size_t num) = 0; - virtual int32_t push_dense(const float *values, size_t num) = 0; - // for push global_step - virtual int32_t push_dense(const int64_t *values, const int32_t trainer_id) { - return 0; - } - virtual int32_t push_dense_param(const float *values, size_t num) { - return 0; - } - - virtual int32_t pull_sparse_ptr(char **pull_values, const uint64_t *keys, - size_t num) { - VLOG(0) << "NOT IMPLEMENT"; - return 0; - } - virtual int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) = 0; - virtual int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) = 0; - virtual int32_t push_sparse(const uint64_t *keys, const float **values, - size_t num) { - return 0; - } - virtual int32_t push_sparse_param(const uint64_t *keys, const float *values, - size_t num) { - return 0; - } - - // only for sparse geo table - virtual int32_t pull_geo_param(const uint32_t trainer_id, - std::vector *values, - std::vector *keys) { - return 0; - } // only for barrier - virtual int32_t barrier(const uint32_t trainer_id, + virtual int32_t Barrier(const uint32_t trainer_id, const std::string barrier_type) { return 0; } // only for barrier table - virtual int32_t set_table_map( + virtual int32_t SetTableMap( std::unordered_map> *table_map) { return 0; } // only for tensor table - virtual int32_t set_program_env( + virtual int32_t SetProgramEnv( framework::Scope *scope, platform::Place place, const std::vector *sub_program) { return 0; } - virtual int32_t set_global_lr(float *lr) { + virtual int32_t SetGlobalLR(float *lr) { _global_lr = lr; return 0; } - virtual int32_t pour() { return 0; } + virtual int32_t Pour() { return 0; } - virtual void clear() = 0; - virtual int32_t flush() = 0; - virtual int32_t shrink(const std::string ¶m) = 0; + virtual void Clear() = 0; + virtual int32_t Flush() = 0; + virtual int32_t Shrink(const std::string ¶m) = 0; // 指定加载路径 - virtual int32_t load(const std::string &path, + virtual int32_t Load(const std::string &path, const std::string &converter) = 0; // 指定保存路径 - virtual int32_t save(const std::string &path, + virtual int32_t Save(const std::string &path, const std::string &converter) = 0; - virtual int32_t set_shard(size_t shard_idx, size_t shard_num) { + virtual int32_t SetShard(size_t shard_idx, size_t shard_num) { _shard_idx = shard_idx; _shard_num = shard_num; - return initialize_shard(); + return InitializeShard(); } - inline std::shared_ptr value_accesor() { + inline std::shared_ptr ValueAccesor() { return _value_accesor; } - virtual void *get_shard(size_t shard_idx) = 0; - virtual std::pair print_table_stat() { return {0, 0}; } + virtual void *GetShard(size_t shard_idx) = 0; + virtual std::pair PrintTableStat() { return {0, 0}; } protected: - virtual int32_t initialize() = 0; - virtual int32_t initialize_accessor(); - virtual int32_t initialize_shard() = 0; - virtual std::string table_dir(const std::string &model_dir) { + virtual int32_t 
Initialize() = 0; + virtual int32_t InitializeAccessor(); + virtual int32_t InitializeShard() = 0; + virtual std::string TableDir(const std::string &model_dir) { return paddle::string::format_string("%s/%03d/", model_dir.c_str(), _config.table_id()); } @@ -164,18 +135,17 @@ class Table { TableParameter _config; float *_global_lr = nullptr; std::shared_ptr _value_accesor; - AccessorInfo _table_info; AfsClient _afs_client; }; REGISTER_PSCORE_REGISTERER(Table); class TableManager { public: - static TableManager &instance() { + static TableManager &Instance() { static TableManager manager; return manager; } - int32_t initialize(); + int32_t Initialize(); private: TableManager() {} diff --git a/paddle/fluid/distributed/ps/table/tensor_accessor.cc b/paddle/fluid/distributed/ps/table/tensor_accessor.cc index 77014141783c3..5d1f69b7463da 100644 --- a/paddle/fluid/distributed/ps/table/tensor_accessor.cc +++ b/paddle/fluid/distributed/ps/table/tensor_accessor.cc @@ -18,86 +18,38 @@ namespace paddle { namespace distributed { -int CommMergeAccessor::initialize() { return 0; } - -void CommMergeAccessor::SetTableInfo(AccessorInfo &info) { - info.dim = dim(); - info.size = size(); - info.select_dim = select_dim(); - info.select_size = select_size(); - info.update_dim = update_dim(); - info.update_size = update_size(); - info.mf_size = mf_size(); - info.fea_dim = fea_dim(); -} - -size_t CommMergeAccessor::GetTableInfo(InfoKey key) { - switch (key) { - case DIM: - return dim(); - case SIZE: - return size(); - case SELECT_DIM: - return select_dim(); - case SELECT_SIZE: - return select_size(); - case UPDATE_DIM: - return update_dim(); - case UPDATE_SIZE: - return update_size(); - case MF_SIZE: - return mf_size(); - case FEA_DIM: - return fea_dim(); - } +int CommMergeAccessor::Initialize() { + InitAccessorInfo(); return 0; } -// value 维度 -size_t CommMergeAccessor::dim() { return 0; } - -// value 各个维度的size -size_t CommMergeAccessor::dim_size(size_t dim) { return 0; } - -// value 各维度相加总size -size_t CommMergeAccessor::size() { return 0; } - -// pull value 维度 -size_t CommMergeAccessor::select_dim() { return _config.embedx_dim(); } - -// pull value 各个维度的size -size_t CommMergeAccessor::select_dim_size(size_t dim) { return sizeof(float); } - -// pull value 各维度相加总size -size_t CommMergeAccessor::select_size() { return select_dim() * sizeof(float); } - -// push value 维度 -size_t CommMergeAccessor::update_dim() { return _config.embedx_dim(); } - -// push value 各个维度的size -size_t CommMergeAccessor::update_dim_size(size_t dim) { return sizeof(float); } - -// push value 各维度相加总size -size_t CommMergeAccessor::update_size() { return update_dim() * sizeof(float); } +void CommMergeAccessor::InitAccessorInfo() { + auto embedx_dim = _config.embedx_dim(); + _accessor_info.select_dim = embedx_dim; + _accessor_info.select_size = _accessor_info.select_dim * sizeof(float); + _accessor_info.update_dim = embedx_dim; + _accessor_info.update_size = _accessor_info.update_dim * sizeof(float); + _accessor_info.fea_dim = _config.fea_dim(); +} // 判断该value 是否进行shrink -bool CommMergeAccessor::shrink(float * /*value*/) { return false; } +bool CommMergeAccessor::Shrink(float * /*value*/) { return false; } // 判断该value 是否在save阶段dump, // param作为参数用于标识save阶段,如downpour的xbox与batch_model -bool CommMergeAccessor::save(float * /*value*/, int /*param*/) { return true; } +bool CommMergeAccessor::Save(float * /*value*/, int /*param*/) { return true; } // keys不存在时,为values生成随机值 -int32_t CommMergeAccessor::create(float **value, size_t num) { return 0; } +int32_t 
CommMergeAccessor::Create(float **value, size_t num) { return 0; } // 从values中选取到select_values中 -int32_t CommMergeAccessor::select(float **select_values, const float **values, +int32_t CommMergeAccessor::Select(float **select_values, const float **values, size_t num) { return 0; } // 将update_values聚合到一起 -int32_t CommMergeAccessor::merge(float **update_values, +int32_t CommMergeAccessor::Merge(float **update_values, const float **other_update_values, size_t num) { Eigen::Map u_mat(update_values[0], 1, num); @@ -109,13 +61,13 @@ int32_t CommMergeAccessor::merge(float **update_values, // 将update_values聚合到一起,通过it.next判定是否进入下一个key // int32_t merge(float** update_values, iterator it); // 将update_values更新应用到values中 -int32_t CommMergeAccessor::update(float **values, const float **update_values, +int32_t CommMergeAccessor::Update(float **values, const float **update_values, size_t num) { return 0; } -int CommMergeAccessor::set_weight(float **values, const float **update_values, - size_t num) { +int CommMergeAccessor::SetWeight(float **values, const float **update_values, + size_t num) { return 0; } diff --git a/paddle/fluid/distributed/ps/table/tensor_accessor.h b/paddle/fluid/distributed/ps/table/tensor_accessor.h index 6f5b69a392bc5..60951598482ad 100644 --- a/paddle/fluid/distributed/ps/table/tensor_accessor.h +++ b/paddle/fluid/distributed/ps/table/tensor_accessor.h @@ -29,53 +29,35 @@ class CommMergeAccessor : public ValueAccessor { public: CommMergeAccessor() {} virtual ~CommMergeAccessor() {} - virtual int initialize(); - virtual void SetTableInfo(AccessorInfo &info); - virtual size_t GetTableInfo(InfoKey key); - // value维度 - virtual size_t dim(); - // value各个维度的size - virtual size_t dim_size(size_t dim); - // value各维度相加总size - virtual size_t size(); - // pull value维度 - virtual size_t select_dim(); - // pull value各个维度的size - virtual size_t select_dim_size(size_t dim); - // pull value各维度相加总size - virtual size_t select_size(); - // push value维度 - virtual size_t update_dim(); - // push value各个维度的size - virtual size_t update_dim_size(size_t dim); - // push value各维度相加总size - virtual size_t update_size(); + virtual int Initialize(); + // 初始化AccessorInfo + virtual void InitAccessorInfo(); // 判断该value是否进行shrink - virtual bool shrink(float * /*value*/); + virtual bool Shrink(float * /*value*/); // 判断该value是否在save阶段dump, // param作为参数用于标识save阶段,如downpour的xbox与batch_model - virtual bool save(float * /*value*/, int /*param*/); + virtual bool Save(float * /*value*/, int /*param*/); // keys不存在时,为values生成随机值 - virtual int32_t create(float **value, size_t num); + virtual int32_t Create(float **value, size_t num); // 从values中选取到select_values中 - virtual int32_t select(float **select_values, const float **values, + virtual int32_t Select(float **select_values, const float **values, size_t num); // 将update_values聚合到一起 - virtual int32_t merge(float **update_values, + virtual int32_t Merge(float **update_values, const float **other_update_values, size_t num); // 将update_values聚合到一起,通过it.next判定是否进入下一个key - // virtual int32_t merge(float** update_values, iterator it); + // virtual int32_t Merge(float** update_values, iterator it); // 将update_values更新应用到values中 - virtual int32_t update(float **values, const float **update_values, + virtual int32_t Update(float **values, const float **update_values, size_t num); - virtual int set_weight(float **values, const float **update_values, - size_t num); - virtual std::string parse_to_string(const float *value, int param) { + virtual int SetWeight(float **values, const float 
**update_values, + size_t num); + virtual std::string ParseToString(const float *value, int param) { return ""; } - virtual int parse_from_string(const std::string &str, float *v) { return 0; } + virtual int ParseFromString(const std::string &str, float *v) { return 0; } }; } // namespace distributed } // namespace paddle diff --git a/paddle/fluid/distributed/ps/table/tensor_table.h b/paddle/fluid/distributed/ps/table/tensor_table.h index e59314923cdbc..7bb236d02c985 100644 --- a/paddle/fluid/distributed/ps/table/tensor_table.h +++ b/paddle/fluid/distributed/ps/table/tensor_table.h @@ -50,44 +50,29 @@ class TensorTable : public Table { TensorTable() {} virtual ~TensorTable() {} - virtual int32_t Pull(TableContext &context) { return 0; } - virtual int32_t Push(TableContext &context) { return 0; } - int32_t pull_dense(float *values, size_t num) override { return 0; } + int32_t Pull(TableContext &context) override { return 0; } + int32_t Push(TableContext &context) override { return 0; } - int32_t push_dense(const float *values, size_t num) override { return 0; } + int32_t Shrink(const std::string ¶m) override { return 0; } - int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) override { - return 0; - } - int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) override { - return 0; - } - int32_t shrink(const std::string ¶m) override { return 0; } - - virtual void *get_shard(size_t shard_idx) { return 0; } + void *GetShard(size_t shard_idx) override { return 0; } - virtual int32_t initialize_shard() { return 0; } + int32_t InitializeShard() override { return 0; } - virtual int32_t flush() { return 0; } + int32_t Flush() override { return 0; } - virtual int32_t load(const std::string &path, const std::string ¶m) { + int32_t Load(const std::string &path, const std::string ¶m) override { return 0; } - virtual int32_t save(const std::string &path, const std::string ¶m) { + int32_t Save(const std::string &path, const std::string ¶m) override { return 0; } - virtual void clear() {} - - int32_t initialize() override { return 0; } + void Clear() override {} - int32_t push_dense(const int64_t *values, const int32_t trainer_id) override { - return 0; - } + int32_t Initialize() override { return 0; } - int32_t set_program_env( + int32_t SetProgramEnv( framework::Scope *scope, platform::Place place, const std::vector *sub_program) override { scope_ = scope; @@ -111,48 +96,31 @@ class DenseTensorTable : public TensorTable { DenseTensorTable() {} virtual ~DenseTensorTable() {} - int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) override { - return 0; - } - int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) override { - return 0; - } - int32_t shrink(const std::string ¶m) override { return 0; } + int32_t Shrink(const std::string ¶m) override { return 0; } - virtual void *get_shard(size_t shard_idx) { return 0; } + void *GetShard(size_t shard_idx) override { return 0; } - virtual int32_t initialize_shard() { return 0; } + int32_t InitializeShard() override { return 0; } - virtual int32_t flush() { return 0; } + int32_t Flush() override { return 0; } - virtual void clear() {} + void Clear() override {} // Todo: Support program Load & Save - virtual int32_t load(const std::string &path, const std::string ¶m) { + int32_t Load(const std::string &path, const std::string ¶m) override { return 0; } - virtual int32_t save(const std::string &path, const std::string ¶m) { + int32_t Save(const std::string &path, const std::string 
¶m) override { return 0; } - // Todo: Support pull dense - int32_t pull_dense(float *values, size_t num) override { return 0; } - /*----------------------------------------------------------------------*/ - int32_t initialize() override { return 0; } - - int32_t push_dense(const float *values, size_t num) override { return 0; } - - int32_t push_dense(const int64_t *values, const int32_t trainer_id) { - return 0; - } + int32_t Initialize() override { return 0; } protected: - virtual int32_t _run_program(const float *values, size_t num, - const uint32_t trainer_id) { + virtual int32_t _RunProgram(const float *values, size_t num, + const uint32_t trainer_id) { return 0; } @@ -167,36 +135,26 @@ class GlobalStepTable : public DenseTensorTable { GlobalStepTable() {} virtual ~GlobalStepTable() {} - int32_t pull_sparse(float *values, - const PullSparseValue &pull_value) override { - return 0; - } - int32_t push_sparse(const uint64_t *keys, const float *values, - size_t num) override { - return 0; - } - int32_t shrink(const std::string ¶m) override { return 0; } + int32_t Shrink(const std::string ¶m) override { return 0; } - virtual void *get_shard(size_t shard_idx) { return 0; } + void *GetShard(size_t shard_idx) override { return 0; } - virtual int32_t initialize_shard() { return 0; } + int32_t InitializeShard() override { return 0; } - virtual int32_t flush() { return 0; } + int32_t Flush() override { return 0; } - virtual void clear() {} + void Clear() override {} - virtual int32_t load(const std::string &path, const std::string ¶m) { + int32_t Load(const std::string &path, const std::string ¶m) override { return 0; } - virtual int32_t save(const std::string &path, const std::string ¶m) { + int32_t Save(const std::string &path, const std::string ¶m) override { return 0; } - int32_t pull_dense(float *values, size_t num) override { return 0; } - /*----------------------------------------------------------------------*/ - int32_t initialize() override { + int32_t Initialize() override { auto _program_config = _config.tensor(); auto trainers_ = _config.common().trainer_num(); FLAGS_eager_delete_tensor_gb = -1; @@ -235,16 +193,17 @@ class GlobalStepTable : public DenseTensorTable { decay_counters_[i] = 0; } } + return 0; } - int32_t push_dense(const float *values, size_t num) override { return 0; } + // int32_t PushDense(const float *values, size_t num) override { return 0; } - int32_t push_dense(const int64_t *values, const int32_t trainer_id) { - return _run_program(values, trainer_id); + virtual int32_t Push(TableContext context) { + return _RunProgram(context.push_context.push_steps, context.trainer_id); } - int32_t set_table_map(std::unordered_map> - *table_map) override { + int32_t SetTableMap(std::unordered_map> + *table_map) override { auto *lr_var = scope_->FindVar(fetch_var_name_); auto *lr_tensor = lr_var->GetMutable(); auto *lr_value = lr_tensor->mutable_data(platform::CPUPlace()); @@ -255,14 +214,14 @@ class GlobalStepTable : public DenseTensorTable { if (table_id == _config.table_id()) { continue; } - iter->second->set_global_lr(lr_value); + iter->second->SetGlobalLR(lr_value); } return 0; } private: - virtual int32_t _run_program(const int64_t *values, - const uint32_t trainer_id) { + virtual int32_t _RunProgram(const int64_t *values, + const uint32_t trainer_id) { FLAGS_eager_delete_tensor_gb = -1; auto counter = decay_counters_.at(trainer_id); counter += int(values[0]); diff --git a/paddle/fluid/distributed/ps/wrapper/fleet.cc b/paddle/fluid/distributed/ps/wrapper/fleet.cc index 
c9093368c693e..7bc50a868104a 100644 --- a/paddle/fluid/distributed/ps/wrapper/fleet.cc +++ b/paddle/fluid/distributed/ps/wrapper/fleet.cc @@ -51,32 +51,6 @@ int32_t FleetWrapper::CopyTableByFeasign( return 0; } -void FleetWrapper::Stop() { StopServer(); } - -void FleetWrapper::Load(WrapperContext& context) { - auto table_id = context.table_id; - if (table_id >= 0 && context.meta != "") { - LoadSparseOnServer(context.path, context.meta, context.table_id); - return; - } - if (table_id < 0) { // laod all - LoadModel(context.path, context.mode); - } else { // load one table - LoadModelOneTable(table_id, context.path, context.mode); - } - return; -} - -void FleetWrapper::Save(WrapperContext& context) { - auto table_id = context.table_id; - if (table_id < 0) { - SaveModel(context.path, context.mode); - } else { - SaveModelOneTable(table_id, context.path, context.mode); - } - return; -} - void FleetWrapper::SetClient2ClientConfig(int request_timeout_ms, int connect_timeout_ms, int max_retry) { @@ -90,7 +64,7 @@ void FleetWrapper::LoadSparseOnServer(const std::string& path, uint32_t table_id) { VLOG(3) << "load sparse table " << table_id << " with " << path << " meta " << meta; - pserver_ptr_->_server_ptr->table(table_id)->load(path, meta); + pserver_ptr_->_server_ptr->GetTable(table_id)->Load(path, meta); } void FleetWrapper::InitServer( @@ -101,8 +75,8 @@ void FleetWrapper::InitServer( VLOG(3) << "Going to init server"; pserver_ptr_ = std::shared_ptr( new paddle::distributed::PSCore()); - pserver_ptr_->init_server(dist_desc, &host_sign_list, host_sign_list.size(), - index, trainers, server_sub_program); + pserver_ptr_->InitServer(dist_desc, &host_sign_list, host_sign_list.size(), + index, trainers, server_sub_program); is_initialized_ = true; } else { VLOG(3) << "Server can be initialized only once"; @@ -143,10 +117,10 @@ void FleetWrapper::InitWorker(const std::string& dist_desc, google::protobuf::TextFormat::ParseFromString(dist_desc, &ps_param); InitGFlag(ps_param.init_gflags()); int servers = host_sign_list.size(); - ps_env_.set_ps_servers(&host_sign_list, servers); + ps_env_.SetPsServers(&host_sign_list, servers); worker_ptr_ = std::shared_ptr( - paddle::distributed::PSClientFactory::create(ps_param)); - worker_ptr_->configure(ps_param, dense_pull_regions, ps_env_, index); + paddle::distributed::PSClientFactory::Create(ps_param)); + worker_ptr_->Configure(ps_param, dense_pull_regions, ps_env_, index); } } else { VLOG(3) << "Client can be initialized only once"; @@ -155,13 +129,13 @@ void FleetWrapper::InitWorker(const std::string& dist_desc, void FleetWrapper::StopServer() { VLOG(3) << "Going to stop server"; - auto status = worker_ptr_->stop_server(); + auto status = worker_ptr_->StopServer(); status.wait(); } void FleetWrapper::FinalizeWorker() { VLOG(3) << "Going to finalize worker"; - worker_ptr_->finalize_worker(); + worker_ptr_->FinalizeWorker(); } void FleetWrapper::BarrierWithTable(uint32_t barrier_type) { @@ -172,13 +146,13 @@ void FleetWrapper::BarrierWithTable(uint32_t barrier_type) { uint64_t FleetWrapper::RunServer(const std::string& ip, uint32_t port) { VLOG(3) << "Going to run server with ip " << ip << " port " << port; - auto ret = pserver_ptr_->run_server(ip, port); + auto ret = pserver_ptr_->RunServer(ip, port); return ret; } std::vector FleetWrapper::GetClientsInfo() { VLOG(3) << "Going to get client info"; - std::vector res = ps_env_.get_client_info(); + std::vector res = ps_env_.GetClientInfo(); for (auto rr : res) { VLOG(2) << "FleetWrapper::GetClientInfo " << rr; } @@ 
-187,14 +161,14 @@ std::vector FleetWrapper::GetClientsInfo() { int FleetWrapper::SetClients(std::vector& host_sign_list) { int node = host_sign_list.size(); - return ps_env_.set_ps_clients(host_sign_list.data(), node); + return ps_env_.SetPsClients(host_sign_list.data(), node); } void FleetWrapper::CreateClient2ClientConnection() { VLOG(1) << "Going to create client2client connection"; - worker_ptr_->create_client2client_connection( - client2client_request_timeout_ms_, client2client_connect_timeout_ms_, - client2client_max_retry_); + worker_ptr_->CreateClient2ClientConnection(client2client_request_timeout_ms_, + client2client_connect_timeout_ms_, + client2client_max_retry_); } std::future FleetWrapper::PullSparseVarsAsync( @@ -230,9 +204,9 @@ std::future FleetWrapper::PullSparseVarsAsync( } bool training = true; - return pserver_ptr_->_worker_ptr->pull_sparse(pull_result_ptr.data(), - table_id, fea_keys->data(), - fea_keys->size(), training); + return pserver_ptr_->_worker_ptr->PullSparse(pull_result_ptr.data(), table_id, + fea_keys->data(), + fea_keys->size(), training); } void FleetWrapper::PullSparseVarsSync( @@ -279,7 +253,7 @@ void FleetWrapper::PullSparseVarsSync( pull_result_ptr.push_back(t.data()); } bool training = true; - auto status = pserver_ptr_->_worker_ptr->pull_sparse( + auto status = pserver_ptr_->_worker_ptr->PullSparse( pull_result_ptr.data(), table_id, fea_keys->data(), fea_keys->size(), training); pull_sparse_status.push_back(std::move(status)); @@ -337,21 +311,10 @@ void FleetWrapper::PullSparseToTensorSync(const uint64_t table_id, int fea_dim, pull_result_ptr.push_back(output_data + output_len); } } - // ps client pull sparse - // construct client request context - RequestContext req_context; - req_context.value_type = Sparse; - req_context.training_mode = Async; - req_context.table = table_id; - req_context.sparse_values = pull_result_ptr.data(); - req_context.keys = fea_keys.data(); - req_context.num = fea_keys.size(); - req_context.is_training = is_training; - auto status = worker_ptr_->Pull(req_context); - // auto status = - // worker_ptr_->pull_sparse(pull_result_ptr.data(), table_id, - // fea_keys.data(), fea_keys.size(), - // is_training); + + auto status = + worker_ptr_->PullSparse(pull_result_ptr.data(), table_id, fea_keys.data(), + fea_keys.size(), is_training); status.wait(); auto ret = status.get(); if (ret != 0) { @@ -364,7 +327,7 @@ void FleetWrapper::PullDenseVarsAsync( const Scope& scope, const uint64_t tid, const std::vector& var_names, std::vector>* pull_dense_status, bool in_cpu) { - auto& regions = _regions[tid]; + auto& regions = regions_[tid]; regions.clear(); regions.resize(var_names.size()); for (auto i = 0u; i < var_names.size(); ++i) { @@ -378,21 +341,15 @@ void FleetWrapper::PullDenseVarsAsync( paddle::distributed::Region reg(w, tensor->numel()); regions[i] = std::move(reg); } - RequestContext req_context; - req_context.value_type = Dense; - req_context.training_mode = Async; - req_context.table = tid; - req_context.dense_values = regions.data(); - req_context.num = regions.size(); - auto status = worker_ptr_->Pull(req_context); - // auto status = worker_ptr_->pull_dense(regions.data(), regions.size(), tid); + + auto status = worker_ptr_->PullDense(regions.data(), regions.size(), tid); pull_dense_status->push_back(std::move(status)); } void FleetWrapper::PullDenseVarsSync( const Scope& scope, const uint64_t tid, const std::vector& var_names) { - auto& regions = _regions[tid]; + auto& regions = regions_[tid]; regions.clear(); 
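// A caller-side sketch (hypothetical names): assuming a FleetWrapper instance
// `fleet`, a trainer Scope that already holds the dense variables, and a dense
// table id, the async variant collects one future per pull that the trainer
// waits on before reading the parameters back from the scope; the sync variant
// being patched here performs that wait internally through PullDense.
//
//   std::vector<std::future<int32_t>> pull_dense_status;
//   fleet->PullDenseVarsAsync(scope, table_id, var_names, &pull_dense_status,
//                             /*in_cpu=*/true);
//   for (auto& status : pull_dense_status) status.wait();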
regions.reserve(var_names.size()); for (auto& t : var_names) { @@ -404,7 +361,7 @@ void FleetWrapper::PullDenseVarsSync( regions.emplace_back(std::move(reg)); } } - auto status = worker_ptr_->pull_dense(regions.data(), regions.size(), tid); + auto status = worker_ptr_->PullDense(regions.data(), regions.size(), tid); status.wait(); } @@ -424,7 +381,7 @@ void FleetWrapper::PushDenseParamSync( } } auto push_status = - worker_ptr_->push_dense_param(regions.data(), regions.size(), table_id); + worker_ptr_->PushDenseParam(regions.data(), regions.size(), table_id); push_status.wait(); auto status = push_status.get(); CHECK(status == 0) << "push dense param failed, status[" << status << "]"; @@ -470,15 +427,8 @@ void FleetWrapper::PushDenseVarsAsync( << g[tensor->numel() - 1]; } - RequestContext req_context; - req_context.value_type = Dense; - req_context.training_mode = Async; - req_context.table = table_id; - req_context.push_context.push_dense_values = regions.data(); - req_context.num = regions.size(); - // auto push_status = - // worker_ptr_->push_dense(regions.data(), regions.size(), table_id); - auto push_status = worker_ptr_->Push(req_context); + auto push_status = + worker_ptr_->PushDense(regions.data(), regions.size(), table_id); } void FleetWrapper::PushSparseVarsAsync( @@ -650,23 +600,13 @@ void FleetWrapper::PushSparseFromTensorAsync( push_g_vec[i] = push_values.at(i).data(); } - // ps client push sparse - // construct request context - RequestContext req_context; - req_context.value_type = Sparse; - req_context.training_mode = Async; - req_context.table = table_id; - req_context.push_context.push_values = (const float**)push_g_vec.data(); - req_context.push_context.keys = push_keys.data(); - req_context.num = push_keys.size(); - auto status = worker_ptr_->Push(req_context); - // auto status = worker_ptr_->push_sparse(table_id, push_keys.data(), - // (const float**)push_g_vec.data(), - // push_keys.size()); + auto status = worker_ptr_->PushSparse(table_id, push_keys.data(), + (const float**)push_g_vec.data(), + push_keys.size()); } void FleetWrapper::LoadModel(const std::string& path, const int mode) { - auto ret = worker_ptr_->load(path, std::to_string(mode)); + auto ret = worker_ptr_->Load(path, std::to_string(mode)); ret.wait(); if (ret.get() != 0) { LOG(ERROR) << "load model from path:" << path << " failed"; @@ -675,7 +615,7 @@ void FleetWrapper::LoadModel(const std::string& path, const int mode) { void FleetWrapper::LoadModelOneTable(const uint64_t table_id, const std::string& path, const int mode) { - auto ret = worker_ptr_->load(table_id, path, std::to_string(mode)); + auto ret = worker_ptr_->Load(table_id, path, std::to_string(mode)); ret.wait(); if (ret.get() != 0) { LOG(ERROR) << "load model of table id: " << table_id @@ -684,7 +624,7 @@ void FleetWrapper::LoadModelOneTable(const uint64_t table_id, } void FleetWrapper::SaveModel(const std::string& path, const int mode) { - auto ret = worker_ptr_->save(path, std::to_string(mode)); + auto ret = worker_ptr_->Save(path, std::to_string(mode)); ret.wait(); int32_t feasign_cnt = ret.get(); if (feasign_cnt == -1) { @@ -694,7 +634,7 @@ void FleetWrapper::SaveModel(const std::string& path, const int mode) { void FleetWrapper::SaveModelOneTable(const uint64_t table_id, const std::string& path, const int mode) { - auto ret = worker_ptr_->save(table_id, path, std::to_string(mode)); + auto ret = worker_ptr_->Save(table_id, path, std::to_string(mode)); ret.wait(); if (ret.get() != 0) { LOG(ERROR) << "save model of table id: " << table_id 
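A minimal usage sketch of the renamed client calls, assuming a configured `worker_ptr_` (PSClient), a hypothetical table id 0, model path "/tmp/ps_model", and mode 0; each call still returns a future that has to be waited on before its status is read:

  // Save one table, then load it back through the CamelCase PSClient API.
  auto save_status = worker_ptr_->Save(0, "/tmp/ps_model", std::to_string(0));
  save_status.wait();
  if (save_status.get() != 0) {
    LOG(ERROR) << "save of table 0 failed";
  }
  auto load_status = worker_ptr_->Load(0, "/tmp/ps_model", std::to_string(0));
  load_status.wait();
  if (load_status.get() != 0) {
    LOG(ERROR) << "load of table 0 failed";
  }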
@@ -704,7 +644,7 @@ void FleetWrapper::SaveModelOneTable(const uint64_t table_id, void FleetWrapper::RecvAndSaveTable(const uint64_t table_id, const std::string& path) { - auto ret = worker_ptr_->recv_and_save_table(table_id, path); + auto ret = worker_ptr_->RecvAndSaveTable(table_id, path); if (ret != 0) { LOG(ERROR) << "save model of table id: " << table_id << ", to path: " << path << " failed"; @@ -712,7 +652,7 @@ void FleetWrapper::RecvAndSaveTable(const uint64_t table_id, } void FleetWrapper::PrintTableStat(const uint64_t table_id) { - auto ret = worker_ptr_->print_table_stat(table_id); + auto ret = worker_ptr_->PrintTableStat(table_id); ret.wait(); int32_t err_code = ret.get(); if (err_code == -1) { @@ -721,7 +661,7 @@ void FleetWrapper::PrintTableStat(const uint64_t table_id) { } void FleetWrapper::ShrinkSparseTable(int table_id, int threshold) { - auto ret = worker_ptr_->shrink(table_id, std::to_string(threshold)); + auto ret = worker_ptr_->Shrink(table_id, std::to_string(threshold)); ret.wait(); int32_t err_code = ret.get(); if (err_code == -1) { @@ -730,12 +670,12 @@ void FleetWrapper::ShrinkSparseTable(int table_id, int threshold) { } void FleetWrapper::ClearModel() { - auto ret = pserver_ptr_->_worker_ptr->clear(); + auto ret = pserver_ptr_->_worker_ptr->Clear(); ret.wait(); } void FleetWrapper::ClearOneTable(const uint64_t table_id) { - auto ret = pserver_ptr_->_worker_ptr->clear(table_id); + auto ret = pserver_ptr_->_worker_ptr->Clear(table_id); ret.wait(); } @@ -774,7 +714,7 @@ void FleetWrapper::ShrinkDenseTable(int table_id, Scope* scope, regions.emplace_back(std::move(reg)); } } - auto push_status = pserver_ptr_->_worker_ptr->push_dense_param( + auto push_status = pserver_ptr_->_worker_ptr->PushDenseParam( regions.data(), regions.size(), table_id); push_status.wait(); auto status = push_status.get(); @@ -791,7 +731,7 @@ void FleetWrapper::ClientFlush() { VLOG(0) << "worker_ptr null, do nothing"; return; } - auto ret = worker_ptr_->flush(); + auto ret = worker_ptr_->Flush(); ret.wait(); int32_t err_code = ret.get(); if (err_code == -1) { @@ -805,13 +745,13 @@ int FleetWrapper::RegisterClientToClientMsgHandler(int msg_type, VLOG(0) << "FleetWrapper::Client is null"; return -1; } else { - return worker_ptr_->registe_client2client_msg_handler(msg_type, handler); + return worker_ptr_->RegisteClient2ClientMsgHandler(msg_type, handler); } } std::future FleetWrapper::SendClientToClientMsg( int msg_type, int to_client_id, const std::string& msg) { - return worker_ptr_->send_client2client_msg(msg_type, to_client_id, msg); + return worker_ptr_->SendClient2ClientMsg(msg_type, to_client_id, msg); } std::default_random_engine& FleetWrapper::LocalRandomEngine() { diff --git a/paddle/fluid/distributed/ps/wrapper/fleet.h b/paddle/fluid/distributed/ps/wrapper/fleet.h index 13b7ea7609ee6..e6ec09a12637d 100644 --- a/paddle/fluid/distributed/ps/wrapper/fleet.h +++ b/paddle/fluid/distributed/ps/wrapper/fleet.h @@ -25,7 +25,6 @@ limitations under the License. 
*/ #include "paddle/fluid/distributed/ps/service/communicator/communicator_common.h" #include "paddle/fluid/distributed/ps/service/ps_service/service.h" -#include "paddle/fluid/distributed/ps/wrapper/ps_wrapper.h" #include "paddle/fluid/framework/archive.h" #include "paddle/fluid/framework/io/fs.h" #include "paddle/fluid/framework/io/shell.h" @@ -55,7 +54,7 @@ using framework::Variable; using RpcCtxMap = std::unordered_map; -class FleetWrapper : public PSWrapper { +class FleetWrapper { public: virtual ~FleetWrapper() {} FleetWrapper() { @@ -69,7 +68,6 @@ class FleetWrapper : public PSWrapper { // pserver request max retry client2client_max_retry_ = 3; } - virtual int32_t Initialize(InitContext& context) { return 0; } // TODO(zhaocaibei123: later) int32_t CopyTable(const uint64_t src_table_id, const uint64_t dest_table_id); @@ -81,12 +79,6 @@ class FleetWrapper : public PSWrapper { typedef std::function HeterCallBackFunc; int RegisterHeterCallback(HeterCallBackFunc handler); - virtual void Stop() override; - - virtual void Load(WrapperContext& context) override; - - virtual void Save(WrapperContext& context) override; - // set client to client communication config void SetClient2ClientConfig(int request_timeout_ms, int connect_timeout_ms, int max_retry); @@ -278,7 +270,7 @@ class FleetWrapper : public PSWrapper { protected: static bool is_initialized_; - std::map> _regions; + std::map> regions_; bool scale_sparse_gradient_with_batch_size_; int32_t sleep_seconds_before_fail_exit_; int client2client_request_timeout_ms_; diff --git a/paddle/fluid/distributed/test/barrier_table_test.cc b/paddle/fluid/distributed/test/barrier_table_test.cc index 0715f777fa5cb..c4c5b22992804 100644 --- a/paddle/fluid/distributed/test/barrier_table_test.cc +++ b/paddle/fluid/distributed/test/barrier_table_test.cc @@ -39,19 +39,19 @@ TEST(BarrierTable, Barrier) { common_config->set_trainer_num(trainers); common_config->set_sync(sync); - auto ret = table->initialize(table_config, fs_config); + auto ret = table->Initialize(table_config, fs_config); std::unordered_map> maps = std::unordered_map>(); - table->set_table_map(&maps); + table->SetTableMap(&maps); std::shared_ptr<::ThreadPool> pool_ = std::make_shared<::ThreadPool>(trainers); std::vector> task_status; for (auto x = 0; x < trainers; x++) { - auto task = [table, x] { table->barrier(x, 0); }; + auto task = [table, x] { table->Barrier(x, 0); }; task_status.push_back(pool_->enqueue(std::move(task))); } diff --git a/paddle/fluid/distributed/test/brpc_service_dense_sgd_test.cc b/paddle/fluid/distributed/test/brpc_service_dense_sgd_test.cc index 19ff50ec2a43b..f9d57be95affe 100644 --- a/paddle/fluid/distributed/test/brpc_service_dense_sgd_test.cc +++ b/paddle/fluid/distributed/test/brpc_service_dense_sgd_test.cc @@ -63,7 +63,7 @@ void InitTensorsOnClient(framework::Scope* scope, platform::CPUPlace* place, void GetDownpourDenseTableProto( ::paddle::distributed::TableParameter* dense_table_proto) { dense_table_proto->set_table_id(0); - dense_table_proto->set_table_class("CommonDenseTable"); + dense_table_proto->set_table_class("MemoryDenseTable"); dense_table_proto->set_shard_num(256); dense_table_proto->set_type(::paddle::distributed::PS_DENSE_TABLE); ::paddle::distributed::TableAccessorParameter* accessor_proto = @@ -155,16 +155,16 @@ void RunServer() { auto _ps_env = paddle::distributed::PaddlePSEnvironment(); LOG(INFO) << "RUN set_ps_servers"; - _ps_env.set_ps_servers(&host_sign_list_, 1); + _ps_env.SetPsServers(&host_sign_list_, 1); pserver_ptr_ = 
std::shared_ptr( - paddle::distributed::PSServerFactory::create(server_proto)); + paddle::distributed::PSServerFactory::Create(server_proto)); LOG(INFO) << "RUN configure"; std::vector empty_vec; framework::ProgramDesc empty_prog; empty_vec.push_back(empty_prog); - pserver_ptr_->configure(server_proto, _ps_env, 0, empty_vec); + pserver_ptr_->Configure(server_proto, _ps_env, 0, empty_vec); LOG(INFO) << "RUN start"; - pserver_ptr_->start(ip_, port_); + pserver_ptr_->Start(ip_, port_); LOG(INFO) << "End start"; } @@ -175,19 +175,19 @@ void RunClient(std::map>& auto servers_ = host_sign_list_.size(); _ps_env = paddle::distributed::PaddlePSEnvironment(); LOG(INFO) << "Run set_ps_servers"; - _ps_env.set_ps_servers(&host_sign_list_, servers_); + _ps_env.SetPsServers(&host_sign_list_, servers_); LOG(INFO) << "Run Create PSClient"; worker_ptr_ = std::shared_ptr( - paddle::distributed::PSClientFactory::create(worker_proto)); + paddle::distributed::PSClientFactory::Create(worker_proto)); LOG(INFO) << "Run configure"; - worker_ptr_->configure(worker_proto, dense_regions, _ps_env, 0); + worker_ptr_->Configure(worker_proto, dense_regions, _ps_env, 0); } void RunBrpcPushDense() { setenv("http_proxy", "", 1); setenv("https_proxy", "", 1); auto ph_host = paddle::distributed::PSHost(ip_, port_, 0); - host_sign_list_.push_back(ph_host.serialize_to_string()); + host_sign_list_.push_back(ph_host.SerializeToString()); // Srart Server std::thread server_thread(RunServer); @@ -218,7 +218,7 @@ void RunBrpcPushDense() { paddle::distributed::Region temp_reg(temp, tensor->numel()); temp_region.emplace_back(std::move(temp_reg)); auto pull_status = - worker_ptr_->pull_dense(temp_region.data(), temp_region.size(), 0); + worker_ptr_->PullDense(temp_region.data(), temp_region.size(), 0); pull_status.wait(); for (size_t idx = 0; idx < tensor->numel(); ++idx) { @@ -229,10 +229,10 @@ void RunBrpcPushDense() { LOG(INFO) << "Run push_dense_param"; auto push_status = - worker_ptr_->push_dense_param(regions.data(), regions.size(), 0); + worker_ptr_->PushDenseParam(regions.data(), regions.size(), 0); push_status.wait(); - pull_status = worker_ptr_->pull_dense(regions.data(), regions.size(), 0); + pull_status = worker_ptr_->PullDense(regions.data(), regions.size(), 0); pull_status.wait(); for (size_t idx = 0; idx < tensor->numel(); ++idx) { @@ -257,11 +257,11 @@ void RunBrpcPushDense() { LOG(INFO) << "Run pull_dense_grad"; auto push_grad_status = - worker_ptr_->push_dense_raw_gradient(0, temp, tensor->numel(), closure); + worker_ptr_->PushDenseRawGradient(0, temp, tensor->numel(), closure); push_grad_status.wait(); auto pull_update_status = - worker_ptr_->pull_dense(regions.data(), regions.size(), 0); + worker_ptr_->PullDense(regions.data(), regions.size(), 0); pull_update_status.wait(); for (size_t idx = 0; idx < tensor->numel(); ++idx) { @@ -269,9 +269,9 @@ void RunBrpcPushDense() { } LOG(INFO) << "Run stop_server"; - worker_ptr_->stop_server(); + worker_ptr_->StopServer(); LOG(INFO) << "Run finalize_worker"; - worker_ptr_->finalize_worker(); + worker_ptr_->FinalizeWorker(); server_thread.join(); } diff --git a/paddle/fluid/distributed/test/brpc_service_sparse_sgd_test.cc b/paddle/fluid/distributed/test/brpc_service_sparse_sgd_test.cc index 633f3b2f3c550..29195d9985728 100644 --- a/paddle/fluid/distributed/test/brpc_service_sparse_sgd_test.cc +++ b/paddle/fluid/distributed/test/brpc_service_sparse_sgd_test.cc @@ -49,6 +49,8 @@ namespace distributed = paddle::distributed; void CreateVarsOnScope(framework::Scope* scope, 
platform::CPUPlace* place) { auto x_var = scope->Var("x"); x_var->GetMutable(); + auto x_g_var = scope->Var("x@GRAD"); + x_g_var->GetMutable(); } void InitTensorsOnClient(framework::Scope* scope, platform::CPUPlace* place, @@ -59,34 +61,49 @@ void InitTensorsOnClient(framework::Scope* scope, platform::CPUPlace* place, float* x_ptr = x_var->mutable_data(framework::DDim({1, rows_numel}), *place); for (int64_t i = 0; i < rows_numel; ++i) x_ptr[i] = 1.0; + + auto g_size = rows_numel + + 30; // hard code here: key_num * (fea_dim + 3), show/clk/slot + auto x_g_var = scope->Var("x@GRAD")->GetMutable(); + float* x_g_ptr = + x_g_var->mutable_data(framework::DDim({1, g_size}), *place); + for (int64_t i = 0; i < g_size; ++i) x_g_ptr[i] = 1.0; } void GetDownpourSparseTableProto( ::paddle::distributed::TableParameter* sparse_table_proto) { sparse_table_proto->set_table_id(0); - sparse_table_proto->set_table_class("CommonSparseTable"); - sparse_table_proto->set_shard_num(256); - sparse_table_proto->set_type(::paddle::distributed::PS_SPARSE_TABLE); - ::paddle::distributed::TableAccessorParameter* accessor_proto = + sparse_table_proto->set_table_class("MemorySparseTable"); + sparse_table_proto->set_shard_num(10); + ::paddle::distributed::TableAccessorParameter* accessor_config = sparse_table_proto->mutable_accessor(); - ::paddle::distributed::CommonAccessorParameter* common_proto = - sparse_table_proto->mutable_common(); - - accessor_proto->set_accessor_class("CommMergeAccessor"); - accessor_proto->set_fea_dim(0); - accessor_proto->set_embedx_dim(10); - - common_proto->set_name("sgd"); - common_proto->set_table_name("MergedDense"); - common_proto->set_trainer_num(1); - common_proto->set_sync(false); - common_proto->set_entry("none"); - common_proto->add_params("Param"); - common_proto->add_dims(10); - common_proto->add_initializers("uniform_random&0&-1.0&1.0"); - common_proto->add_params("LearningRate"); - common_proto->add_dims(1); - common_proto->add_initializers("fill_constant&1.0"); + + accessor_config->set_accessor_class("SparseAccessor"); + accessor_config->set_fea_dim(10); + accessor_config->set_embedx_dim(9); + accessor_config->set_embedx_threshold(0); + accessor_config->mutable_ctr_accessor_param()->set_nonclk_coeff(0.2); + accessor_config->mutable_ctr_accessor_param()->set_click_coeff(1); + accessor_config->mutable_ctr_accessor_param()->set_base_threshold(0.5); + accessor_config->mutable_ctr_accessor_param()->set_delta_threshold(0.2); + accessor_config->mutable_ctr_accessor_param()->set_delta_keep_days(16); + accessor_config->mutable_ctr_accessor_param()->set_show_click_decay_rate( + 0.99); + + accessor_config->mutable_embed_sgd_param()->set_name("SparseNaiveSGDRule"); + auto* naive_param = + accessor_config->mutable_embed_sgd_param()->mutable_naive(); + naive_param->set_learning_rate(1.0); + naive_param->set_initial_range(0.3); + naive_param->add_weight_bounds(-10.0); + naive_param->add_weight_bounds(10.0); + + accessor_config->mutable_embedx_sgd_param()->set_name("SparseNaiveSGDRule"); + naive_param = accessor_config->mutable_embedx_sgd_param()->mutable_naive(); + naive_param->set_learning_rate(1.0); + naive_param->set_initial_range(0.3); + naive_param->add_weight_bounds(-10.0); + naive_param->add_weight_bounds(10.0); } ::paddle::distributed::PSParameter GetServerProto() { @@ -156,14 +173,14 @@ void RunServer() { ::paddle::distributed::PSParameter server_proto = GetServerProto(); auto _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list_, 1); + 
_ps_env.SetPsServers(&host_sign_list_, 1); pserver_ptr_ = std::shared_ptr( - paddle::distributed::PSServerFactory::create(server_proto)); + paddle::distributed::PSServerFactory::Create(server_proto)); std::vector empty_vec; framework::ProgramDesc empty_prog; empty_vec.push_back(empty_prog); - pserver_ptr_->configure(server_proto, _ps_env, 0, empty_vec); - pserver_ptr_->start(ip_, port_); + pserver_ptr_->Configure(server_proto, _ps_env, 0, empty_vec); + pserver_ptr_->Start(ip_, port_); } void RunClient(std::map>& @@ -172,17 +189,17 @@ void RunClient(std::map>& paddle::distributed::PaddlePSEnvironment _ps_env; auto servers_ = host_sign_list_.size(); _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list_, servers_); + _ps_env.SetPsServers(&host_sign_list_, servers_); worker_ptr_ = std::shared_ptr( - paddle::distributed::PSClientFactory::create(worker_proto)); - worker_ptr_->configure(worker_proto, dense_regions, _ps_env, 0); + paddle::distributed::PSClientFactory::Create(worker_proto)); + worker_ptr_->Configure(worker_proto, dense_regions, _ps_env, 0); } void RunBrpcPushSparse() { setenv("http_proxy", "", 1); setenv("https_proxy", "", 1); auto ph_host = paddle::distributed::PSHost(ip_, port_, 0); - host_sign_list_.push_back(ph_host.serialize_to_string()); + host_sign_list_.push_back(ph_host.SerializeToString()); // Srart Server std::thread server_thread(RunServer); @@ -214,45 +231,45 @@ void RunBrpcPushSparse() { /*-----------------------Test Server Init----------------------------------*/ LOG(INFO) << "Run pull_sparse_param"; - auto pull_status = worker_ptr_->pull_sparse( + auto pull_status = worker_ptr_->PullSparse( fea_value_ptr.data(), 0, fea_keys.data(), fea_keys.size(), true); pull_status.wait(); - for (size_t idx = 0; idx < tensor->numel(); ++idx) { - fea_values.data()[idx] *= 2.0; - } - - /*-----------------------Test Push Param----------------------------------*/ - LOG(INFO) << "Run push_sparse_param"; - paddle::distributed::DownpourBrpcClosure* closure_push_param = + /*-----------------------Test Push Grad----------------------------------*/ + // first to expand embedx, init + paddle::distributed::DownpourBrpcClosure* closure_push_grad = new paddle::distributed::DownpourBrpcClosure(1, [&](void* done) { int ret = 0; auto* closure = (paddle::distributed::DownpourBrpcClosure*)done; for (size_t i = 0; i < 1; ++i) { if (closure->check_response( - i, paddle::distributed::PS_PUSH_SPARSE_PARAM) != 0) { + i, paddle::distributed::PS_PUSH_SPARSE_TABLE) != 0) { ret = -1; break; } } closure->set_promise_value(ret); }); - auto push_status = worker_ptr_->push_sparse_param( - 0, fea_keys.data(), (const float**)fea_value_ptr.data(), fea_keys.size(), - closure_push_param); - push_status.wait(); - auto pull_param_status = worker_ptr_->pull_sparse( - fea_temp_value_ptr.data(), 0, fea_keys.data(), fea_keys.size(), true); - pull_param_status.wait(); + framework::Variable* g_var = client_scope.FindVar("x@GRAD"); + framework::LoDTensor* g_tensor = g_var->GetMutable(); - for (size_t idx = 0; idx < tensor->numel(); ++idx) { - EXPECT_FLOAT_EQ(fea_temp_values[idx], fea_values[idx]); + LOG(INFO) << "Run push_sparse_grad"; + std::vector push_g_vec; + for (auto i = 0; i < static_cast(fea_keys.size()); ++i) { + push_g_vec.push_back(g_tensor->data() + i * 13); } + auto push_grad_status = worker_ptr_->PushSparseRawGradient( + 0, fea_keys.data(), (const float**)push_g_vec.data(), fea_keys.size(), + closure_push_grad); + push_grad_status.wait(); - /*-----------------------Test 
Push Grad----------------------------------*/ + // pull + pull_status = worker_ptr_->PullSparse(fea_value_ptr.data(), 0, + fea_keys.data(), fea_keys.size(), true); + pull_status.wait(); - paddle::distributed::DownpourBrpcClosure* closure_push_grad = + paddle::distributed::DownpourBrpcClosure* closure_push_grad1 = new paddle::distributed::DownpourBrpcClosure(1, [&](void* done) { int ret = 0; auto* closure = (paddle::distributed::DownpourBrpcClosure*)done; @@ -266,17 +283,14 @@ void RunBrpcPushSparse() { closure->set_promise_value(ret); }); - LOG(INFO) << "Run pull_sparse_grad"; - std::vector push_g_vec; - for (auto i = 0; i < static_cast(fea_keys.size()); ++i) { - push_g_vec.push_back(tensor->data() + i * 10); - } - auto push_grad_status = worker_ptr_->push_sparse_raw_gradient( + // push again, embedx update this time + push_grad_status = worker_ptr_->PushSparseRawGradient( 0, fea_keys.data(), (const float**)push_g_vec.data(), fea_keys.size(), - closure_push_grad); + closure_push_grad1); push_grad_status.wait(); - auto pull_update_status = worker_ptr_->pull_sparse( + // pull update + auto pull_update_status = worker_ptr_->PullSparse( fea_temp_value_ptr.data(), 0, fea_keys.data(), fea_keys.size(), true); pull_update_status.wait(); @@ -285,9 +299,9 @@ void RunBrpcPushSparse() { } LOG(INFO) << "Run stop_server"; - worker_ptr_->stop_server(); + worker_ptr_->StopServer(); LOG(INFO) << "Run finalize_worker"; - worker_ptr_->finalize_worker(); + worker_ptr_->FinalizeWorker(); server_thread.join(); } diff --git a/paddle/fluid/distributed/test/ctr_accessor_test.cc b/paddle/fluid/distributed/test/ctr_accessor_test.cc index 835b1a361573d..258b4d3326209 100644 --- a/paddle/fluid/distributed/test/ctr_accessor_test.cc +++ b/paddle/fluid/distributed/test/ctr_accessor_test.cc @@ -67,49 +67,49 @@ TableAccessorParameter gen_param() { TEST(downpour_feature_value_accessor_test, test_shrink) { TableAccessorParameter parameter = gen_param(); CtrCommonAccessor* acc = new CtrCommonAccessor(); - ASSERT_EQ(acc->configure(parameter), 0); - ASSERT_EQ(acc->initialize(), 0); + ASSERT_EQ(acc->Configure(parameter), 0); + ASSERT_EQ(acc->Initialize(), 0); VLOG(3) << "size of struct: " << acc->common_feature_value.embed_sgd_dim << " " << acc->common_feature_value.embedx_dim << " " << acc->common_feature_value.embedx_sgd_dim << " " - << acc->common_feature_value.dim() << "\n"; + << acc->common_feature_value.Dim() << "\n"; - float* value = new float[acc->dim()]; - for (auto i = 0u; i < acc->dim(); ++i) { + float* value = new float[acc->GetAccessorInfo().dim]; + for (auto i = 0u; i < acc->GetAccessorInfo().dim; ++i) { value[i] = i * 1.0; } - ASSERT_TRUE(!acc->shrink(value)); + ASSERT_TRUE(!acc->Shrink(value)); // set unseen_days too long value[1] = 1000; // set delta score too small value[2] = 0.001; - ASSERT_TRUE(acc->shrink(value)); + ASSERT_TRUE(acc->Shrink(value)); } TEST(downpour_feature_value_accessor_test, test_save) { TableAccessorParameter parameter = gen_param(); CtrCommonAccessor* acc = new CtrCommonAccessor(); - ASSERT_EQ(acc->configure(parameter), 0); - ASSERT_EQ(acc->initialize(), 0); + ASSERT_EQ(acc->Configure(parameter), 0); + ASSERT_EQ(acc->Initialize(), 0); - float* value = new float[acc->dim()]; - for (auto i = 0u; i < acc->dim(); ++i) { + float* value = new float[acc->GetAccessorInfo().dim]; + for (auto i = 0u; i < acc->GetAccessorInfo().dim; ++i) { value[i] = i * 1.0; } // save all feature - ASSERT_TRUE(acc->save(value, 0)); + ASSERT_TRUE(acc->Save(value, 0)); // save delta feature - 
ASSERT_TRUE(acc->save(value, 1)); + ASSERT_TRUE(acc->Save(value, 1)); // save base feature with time decay - ASSERT_TRUE(acc->save(value, 2)); + ASSERT_TRUE(acc->Save(value, 2)); VLOG(3) << "test_save:"; - for (auto i = 0u; i < acc->dim(); ++i) { + for (auto i = 0u; i < acc->GetAccessorInfo().dim; ++i) { VLOG(3) << value[i]; } } @@ -117,8 +117,8 @@ TEST(downpour_feature_value_accessor_test, test_save) { TEST(downpour_feature_value_accessor_test, test_create) { TableAccessorParameter parameter = gen_param(); CtrCommonAccessor* acc = new CtrCommonAccessor(); - ASSERT_EQ(acc->configure(parameter), 0); - ASSERT_EQ(acc->initialize(), 0); + ASSERT_EQ(acc->Configure(parameter), 0); + ASSERT_EQ(acc->Initialize(), 0); const int field_size = 7 + 8; const int item_size = 10; @@ -127,7 +127,7 @@ TEST(downpour_feature_value_accessor_test, test_create) { for (auto i = 0u; i < item_size; ++i) { value[i] = new float[field_size]; } - ASSERT_EQ(acc->create(value, item_size), 0); + ASSERT_EQ(acc->Create(value, item_size), 0); for (auto i = 0u; i < item_size; ++i) { for (auto j = 0u; j < field_size; ++j) { @@ -141,11 +141,11 @@ TEST(downpour_feature_value_accessor_test, test_create) { TEST(downpour_feature_value_accessor_test, test_update) { TableAccessorParameter parameter = gen_param(); CtrCommonAccessor* acc = new CtrCommonAccessor(); - ASSERT_EQ(acc->configure(parameter), 0); - ASSERT_EQ(acc->initialize(), 0); + ASSERT_EQ(acc->Configure(parameter), 0); + ASSERT_EQ(acc->Initialize(), 0); - VLOG(3) << "dim: " << acc->common_feature_value.dim() << "\n"; - VLOG(3) << "update_dim: " << acc->update_dim() << "\n"; + VLOG(3) << "dim: " << acc->common_feature_value.Dim() << "\n"; + VLOG(3) << "update_dim: " << acc->GetAccessorInfo().update_dim << "\n"; const int field_size = 7 + 8; const int item_size = 10; @@ -162,9 +162,9 @@ TEST(downpour_feature_value_accessor_test, test_update) { typedef const float* const_float_ptr; const_float_ptr* grad = new const_float_ptr[item_size]; for (auto i = 0u; i < item_size; ++i) { - float* p = new float[acc->update_dim()]; - for (auto j = 0u; j < acc->update_dim(); ++j) { - p[j] = i; + float* p = new float[acc->GetAccessorInfo().update_dim]; + for (auto j = 0u; j < acc->GetAccessorInfo().update_dim; ++j) { + p[j] = i + 1; } grad[i] = p; } @@ -244,21 +244,21 @@ TEST(downpour_feature_value_accessor_test, test_update) { v.unseen_days = 0; v.show += push_v.show; v.click += push_v.click; - v.delta_score += acc->show_click_score(push_v.show, push_v.click); + v.delta_score += acc->ShowClickScore(push_v.show, push_v.click); - acc->_embed_sgd_rule->update_value(&v.embed_w, &v.embed_g2sum[0], - &push_v.embed_g); - acc->_embedx_sgd_rule->update_value(&v.embedx_w[0], &v.embedx_g2sum[0], - &push_v.embedx_g[0]); + acc->_embed_sgd_rule->UpdateValue(&v.embed_w, &v.embed_g2sum[0], + &push_v.embed_g, push_v.show); + acc->_embedx_sgd_rule->UpdateValue(&v.embedx_w[0], &v.embedx_g2sum[0], + &push_v.embedx_g[0], push_v.show); - float* ptr = new float[acc->dim()]; + float* ptr = new float[acc->GetAccessorInfo().dim]; v.to_array(ptr, parameter.embedx_dim()); exp_value.push_back(ptr); } - acc->update(value, grad, item_size); + acc->Update(value, grad, item_size); for (auto i = 0u; i < item_size; ++i) { - for (auto j = 0u; j < acc->dim(); ++j) { + for (auto j = 0u; j < acc->GetAccessorInfo().dim; ++j) { VLOG(3) << value[i][j] << ":" << exp_value[i][j] << " "; ASSERT_FLOAT_EQ(value[i][j], exp_value[i][j]); } @@ -268,19 +268,19 @@ TEST(downpour_feature_value_accessor_test, test_update) { 
TEST(downpour_feature_value_accessor_test, test_show_click_score) { TableAccessorParameter parameter = gen_param(); CtrCommonAccessor* acc = new CtrCommonAccessor(); - ASSERT_EQ(acc->configure(parameter), 0); - ASSERT_EQ(acc->initialize(), 0); + ASSERT_EQ(acc->Configure(parameter), 0); + ASSERT_EQ(acc->Initialize(), 0); float show = 10; float click = 6; - ASSERT_FLOAT_EQ(acc->show_click_score(show, click), 6.8); + ASSERT_FLOAT_EQ(acc->ShowClickScore(show, click), 6.8); } TEST(downpour_feature_value_accessor_test, test_string_related) { TableAccessorParameter parameter = gen_param(); CtrCommonAccessor* acc = new CtrCommonAccessor(); - ASSERT_EQ(acc->configure(parameter), 0); - ASSERT_EQ(acc->initialize(), 0); + ASSERT_EQ(acc->Configure(parameter), 0); + ASSERT_EQ(acc->Initialize(), 0); const int field_size = 15; float* value = new float[field_size]; @@ -288,12 +288,12 @@ TEST(downpour_feature_value_accessor_test, test_string_related) { value[i] = i; } - auto str = acc->parse_to_string(value, 0); + auto str = acc->ParseToString(value, 0); VLOG(3) << str << std::endl; str = "0 1 2 3 4 5 6"; - ASSERT_NE(acc->parse_from_string(str, value), 0); + ASSERT_NE(acc->ParseFromString(str, value), 0); // make sure init_zero=true for (auto i = 7; i < 15; ++i) { diff --git a/paddle/fluid/distributed/test/dense_table_test.cc b/paddle/fluid/distributed/test/dense_table_test.cc index c9a038e000e14..9529c776c120e 100644 --- a/paddle/fluid/distributed/test/dense_table_test.cc +++ b/paddle/fluid/distributed/test/dense_table_test.cc @@ -16,22 +16,22 @@ limitations under the License. */ #include #include "gtest/gtest.h" #include "paddle/fluid/distributed/ps.pb.h" -#include "paddle/fluid/distributed/ps/table/common_dense_table.h" +#include "paddle/fluid/distributed/ps/table/memory_dense_table.h" namespace paddle { namespace distributed { -// CommonDenseTable + Adam +// MemoryDenseTable + Adam class Table; -TEST(CommonDenseTable, Adam) { +TEST(MemoryDenseTable, Adam) { int fea_dim = 10; int trainers = 2; TableParameter table_config; - table_config.set_table_class("CommonDenseTable"); + table_config.set_table_class("MemoryDenseTable"); FsClientParameter fs_config; - Table *table = new CommonDenseTable(); + Table *table = new MemoryDenseTable(); TableAccessorParameter *accessor_config = table_config.mutable_accessor(); accessor_config->set_accessor_class("CommMergeAccessor"); CommonAccessorParameter *common_config = table_config.mutable_common(); @@ -63,13 +63,19 @@ TEST(CommonDenseTable, Adam) { common_config->add_params("LearningRate"); common_config->add_dims(1); common_config->add_initializers("fill_constant&5e-6"); - auto ret = table->initialize(table_config, fs_config); + auto ret = table->Initialize(table_config, fs_config); ASSERT_EQ(ret, 0); // pull parameters for create and check std::vector init_values; init_values.resize(fea_dim); - table->pull_dense(init_values.data(), fea_dim); + + TableContext table_context1; + table_context1.value_type = Dense; + table_context1.pull_context.values = init_values.data(); + table_context1.num = fea_dim; + table->Pull(table_context1); + // table->PullDense(init_values.data(), fea_dim); // push gradient std::vector> trainer_gradient_values; @@ -85,12 +91,24 @@ TEST(CommonDenseTable, Adam) { // for adam for (int i = 0; i < trainers; i++) { auto &push_values = trainer_gradient_values[i]; - table->push_dense(push_values.data(), push_values.size()); + + TableContext table_context; + table_context.value_type = Dense; + table_context.push_context.values = push_values.data(); + 
table_context.num = push_values.size(); + table->Push(table_context); + // table->PushDense(push_values.data(), push_values.size()); } std::vector pull_values; pull_values.resize(fea_dim); - table->pull_dense(pull_values.data(), fea_dim); + + TableContext table_context; + table_context.value_type = Dense; + table_context.pull_context.values = pull_values.data(); + table_context.num = fea_dim; + table->Pull(table_context); + // table->PullDense(pull_values.data(), fea_dim); float mom_rate = 0.99; float decay_rate = 0.9999; @@ -118,19 +136,20 @@ TEST(CommonDenseTable, Adam) { } } for (int j = 0; j < fea_dim; j++) { + VLOG(0) << param[j] << " " << pull_values[j]; ASSERT_TRUE(abs(param[j] - pull_values[j]) < 1e-5); } } -// CommonDenseTable + Adam -TEST(CommonDenseTable, SGD) { +// MemoryDenseTable + Adam +TEST(MemoryDenseTable, SGD) { int fea_dim = 10; int trainers = 2; TableParameter table_config; - table_config.set_table_class("CommonDenseTable"); + table_config.set_table_class("MemoryDenseTable"); FsClientParameter fs_config; - Table *table = new CommonDenseTable(); + Table *table = new MemoryDenseTable(); TableAccessorParameter *accessor_config = table_config.mutable_accessor(); accessor_config->set_accessor_class("CommMergeAccessor"); CommonAccessorParameter *common_config = table_config.mutable_common(); @@ -143,13 +162,19 @@ TEST(CommonDenseTable, SGD) { common_config->add_params("LearningRate"); common_config->add_dims(1); common_config->add_initializers("fill_constant&1.0"); - auto ret = table->initialize(table_config, fs_config); + auto ret = table->Initialize(table_config, fs_config); ASSERT_EQ(ret, 0); // pull parameters for create and check std::vector init_values; init_values.resize(fea_dim); - table->pull_dense(init_values.data(), fea_dim); + + TableContext table_context1; + table_context1.value_type = Dense; + table_context1.pull_context.values = init_values.data(); + table_context1.num = fea_dim; + table->Pull(table_context1); + // table->PullDense(init_values.data(), fea_dim); std::vector total_gradients; total_gradients.resize(fea_dim); @@ -172,7 +197,12 @@ TEST(CommonDenseTable, SGD) { for (int i = 0; i < trainers; i++) { auto &push_values = trainer_gradient_values[i]; auto task = [table, &push_values] { - table->push_dense(push_values.data(), push_values.size()); + TableContext table_context; + table_context.value_type = Dense; + table_context.push_context.values = push_values.data(); + table_context.num = push_values.size(); + table->Push(table_context); + // table->PushDense(push_values.data(), push_values.size()); }; task_status.push_back(pool_->enqueue(std::move(task))); } @@ -182,7 +212,13 @@ TEST(CommonDenseTable, SGD) { std::vector pull_values; pull_values.resize(fea_dim); - table->pull_dense(pull_values.data(), fea_dim); + + TableContext table_context; + table_context.value_type = Dense; + table_context.pull_context.values = pull_values.data(); + table_context.num = fea_dim; + table->Pull(table_context); + // table->PullDense(pull_values.data(), fea_dim); for (int j = 0; j < fea_dim; j++) { auto update_val = init_values[j] - 1.0 * total_gradients[j]; ASSERT_TRUE(abs(update_val - pull_values[j]) < 1e-5); diff --git a/paddle/fluid/distributed/test/geo_table_test.cc b/paddle/fluid/distributed/test/geo_table_test.cc deleted file mode 100644 index b148c32f4968c..0000000000000 --- a/paddle/fluid/distributed/test/geo_table_test.cc +++ /dev/null @@ -1,124 +0,0 @@ -/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
- -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include - -#include -#include -#include // NOLINT - -#include "google/protobuf/text_format.h" -#include "gtest/gtest.h" -#include "paddle/fluid/distributed/ps.pb.h" -#include "paddle/fluid/distributed/ps/table/common_dense_table.h" -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/depends/sparse_utils.h" -#include "paddle/fluid/distributed/ps/table/sparse_geo_table.h" -#include "paddle/fluid/distributed/ps/table/table.h" - -namespace paddle { -namespace distributed { - -// SparseGeoTable + SSUM -TEST(SparseGeoTable, SSUM) { - int emb_dim = 10; - int trainers = 2; - - TableParameter table_config; - table_config.set_table_class("SparseGeoTable"); - FsClientParameter fs_config; - Table *table = new SparseGeoTable(); - TableAccessorParameter *accessor_config = table_config.mutable_accessor(); - accessor_config->set_accessor_class("CommMergeAccessor"); - CommonAccessorParameter *common_config = table_config.mutable_common(); - common_config->set_name("sum"); - common_config->set_table_name("ssum_test_table"); - common_config->set_trainer_num(trainers); - common_config->add_params("Param"); - common_config->add_dims(emb_dim); - common_config->add_initializers("fill_constant&1.0"); - - auto ret = table->initialize(table_config, fs_config); - ASSERT_EQ(ret, 0); - - // test push_sparse_param, and create params - std::vector init_keys = {0, 1, 2, 3, 4}; - std::vector init_fres = {1, 1, 1, 1, 1}; - std::vector init_values; - for (size_t i = 0; i < init_keys.size() * emb_dim; i++) { - init_values.push_back(0.0); - } - table->push_sparse_param(init_keys.data(), init_values.data(), - init_keys.size()); - - std::vector pull_values(init_values.size()); - auto value = PullSparseValue(init_keys, init_fres, emb_dim); - table->pull_sparse(pull_values.data(), value); - - for (size_t i = 0; i < init_keys.size() * emb_dim; i++) { - ASSERT_TRUE(abs(pull_values[i] - init_values[i]) < 1e-5); - } - - std::vector> trainer_keys; - std::vector> trainer_values; - trainer_keys.resize(trainers); - trainer_values.resize(trainers); - float start = 0.0; - for (int i = 0; i < trainers; i++) { - trainer_keys[i] = init_keys; - for (size_t j = 0; j < trainer_keys[i].size(); j++) { - auto id = trainer_keys[i][j]; - for (int k = 0; k < emb_dim; k++) { - trainer_values[i].push_back(start); - pull_values[id * emb_dim + k] += start; - start += 0.1; - } - } - } - - std::shared_ptr<::ThreadPool> pool_ = - std::make_shared<::ThreadPool>(trainers); - std::vector> task_status; - for (int i = 0; i < trainers; i++) { - auto &push_keys = trainer_keys[i]; - auto &push_values = trainer_values[i]; - auto task = [table, &push_keys, &push_values] { - table->push_sparse(push_keys.data(), push_values.data(), - push_keys.size()); - }; - task_status.push_back(pool_->enqueue(std::move(task))); - } - for (auto &status : task_status) { - status.wait(); - } - - std::vector> geo_pull_ids; - std::vector> geo_pull_values; - geo_pull_ids.resize(trainers); - 
geo_pull_values.resize(trainers); - for (int i = 0; i < trainers; i++) { - table->pull_geo_param(i, &geo_pull_values[i], &geo_pull_ids[i]); - ASSERT_EQ(geo_pull_values[i].size(), geo_pull_ids[i].size() * emb_dim); - for (size_t j = 0; j < geo_pull_ids[i].size(); ++j) { - auto id = geo_pull_ids[i][j]; - for (int k = 0; k < emb_dim; k++) { - ASSERT_TRUE(abs(geo_pull_values[i][j * emb_dim + k] - - pull_values[id * emb_dim + k]) < 1e-5); - } - } - } -} - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/test/graph_node_split_test.cc b/paddle/fluid/distributed/test/graph_node_split_test.cc index a2f495de3c953..ce4f38f6cec9f 100644 --- a/paddle/fluid/distributed/test/graph_node_split_test.cc +++ b/paddle/fluid/distributed/test/graph_node_split_test.cc @@ -166,16 +166,16 @@ void RunServer() { ::paddle::distributed::PSParameter server_proto = GetServerProto(); auto _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list_, 2); // test + _ps_env.SetPsServers(&host_sign_list_, 2); // test pserver_ptr_ = std::shared_ptr( (paddle::distributed::GraphBrpcServer*) - paddle::distributed::PSServerFactory::create(server_proto)); + paddle::distributed::PSServerFactory::Create(server_proto)); std::vector empty_vec; framework::ProgramDesc empty_prog; empty_vec.push_back(empty_prog); - pserver_ptr_->configure(server_proto, _ps_env, 0, empty_vec); + pserver_ptr_->Configure(server_proto, _ps_env, 0, empty_vec); LOG(INFO) << "first server, run start(ip,port)"; - pserver_ptr_->start(ip_, port_); + pserver_ptr_->Start(ip_, port_); pserver_ptr_->build_peer2peer_connection(0); LOG(INFO) << "init first server Done"; } @@ -185,15 +185,15 @@ void RunServer2() { ::paddle::distributed::PSParameter server_proto2 = GetServerProto(); auto _ps_env2 = paddle::distributed::PaddlePSEnvironment(); - _ps_env2.set_ps_servers(&host_sign_list_, 2); // test + _ps_env2.SetPsServers(&host_sign_list_, 2); // test pserver_ptr2 = std::shared_ptr( (paddle::distributed::GraphBrpcServer*) - paddle::distributed::PSServerFactory::create(server_proto2)); + paddle::distributed::PSServerFactory::Create(server_proto2)); std::vector empty_vec2; framework::ProgramDesc empty_prog2; empty_vec2.push_back(empty_prog2); - pserver_ptr2->configure(server_proto2, _ps_env2, 1, empty_vec2); - pserver_ptr2->start(ip2, port2); + pserver_ptr2->Configure(server_proto2, _ps_env2, 1, empty_vec2); + pserver_ptr2->Start(ip2, port2); pserver_ptr2->build_peer2peer_connection(1); } @@ -204,11 +204,11 @@ void RunClient( paddle::distributed::PaddlePSEnvironment _ps_env; auto servers_ = host_sign_list_.size(); _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list_, servers_); + _ps_env.SetPsServers(&host_sign_list_, servers_); worker_ptr_ = std::shared_ptr( (paddle::distributed::GraphBrpcClient*) - paddle::distributed::PSClientFactory::create(worker_proto)); - worker_ptr_->configure(worker_proto, dense_regions, _ps_env, 0); + paddle::distributed::PSClientFactory::Create(worker_proto)); + worker_ptr_->Configure(worker_proto, dense_regions, _ps_env, 0); worker_ptr_->set_shard_num(127); worker_ptr_->set_local_channel(index); worker_ptr_->set_local_graph_service( @@ -222,11 +222,11 @@ void RunGraphSplit() { prepare_file(node_file_name, nodes); prepare_file(graph_split_file_name, graph_split); auto ph_host = paddle::distributed::PSHost(ip_, port_, 0); - host_sign_list_.push_back(ph_host.serialize_to_string()); + 
host_sign_list_.push_back(ph_host.SerializeToString()); // test-start auto ph_host2 = paddle::distributed::PSHost(ip2, port2, 1); - host_sign_list_.push_back(ph_host2.serialize_to_string()); + host_sign_list_.push_back(ph_host2.SerializeToString()); // test-end // Srart Server std::thread* server_thread = new std::thread(RunServer); @@ -247,7 +247,7 @@ void RunGraphSplit() { 0, std::string(graph_split_file_name)); pull_status.wait(); pull_status = - worker_ptr_->load(0, std::string(edge_file_name), std::string("e>")); + worker_ptr_->Load(0, std::string(edge_file_name), std::string("e>")); srand(time(0)); pull_status.wait(); std::vector> _vs; @@ -266,9 +266,9 @@ void RunGraphSplit() { std::remove(node_file_name); std::remove(graph_split_file_name); LOG(INFO) << "Run stop_server"; - worker_ptr_->stop_server(); + worker_ptr_->StopServer(); LOG(INFO) << "Run finalize_worker"; - worker_ptr_->finalize_worker(); + worker_ptr_->FinalizeWorker(); } TEST(RunGraphSplit, Run) { RunGraphSplit(); } diff --git a/paddle/fluid/distributed/test/graph_node_test.cc b/paddle/fluid/distributed/test/graph_node_test.cc index e55d39cd4834d..b2c741df7a5dd 100644 --- a/paddle/fluid/distributed/test/graph_node_test.cc +++ b/paddle/fluid/distributed/test/graph_node_test.cc @@ -348,16 +348,16 @@ void RunServer() { ::paddle::distributed::PSParameter server_proto = GetServerProto(); auto _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list_, 2); // test + _ps_env.SetPsServers(&host_sign_list_, 2); // test pserver_ptr_ = std::shared_ptr( (paddle::distributed::GraphBrpcServer*) - paddle::distributed::PSServerFactory::create(server_proto)); + paddle::distributed::PSServerFactory::Create(server_proto)); std::vector empty_vec; framework::ProgramDesc empty_prog; empty_vec.push_back(empty_prog); - pserver_ptr_->configure(server_proto, _ps_env, 0, empty_vec); + pserver_ptr_->Configure(server_proto, _ps_env, 0, empty_vec); LOG(INFO) << "first server, run start(ip,port)"; - pserver_ptr_->start(ip_, port_); + pserver_ptr_->Start(ip_, port_); pserver_ptr_->build_peer2peer_connection(0); LOG(INFO) << "init first server Done"; } @@ -367,15 +367,15 @@ void RunServer2() { ::paddle::distributed::PSParameter server_proto2 = GetServerProto(); auto _ps_env2 = paddle::distributed::PaddlePSEnvironment(); - _ps_env2.set_ps_servers(&host_sign_list_, 2); // test + _ps_env2.SetPsServers(&host_sign_list_, 2); // test pserver_ptr2 = std::shared_ptr( (paddle::distributed::GraphBrpcServer*) - paddle::distributed::PSServerFactory::create(server_proto2)); + paddle::distributed::PSServerFactory::Create(server_proto2)); std::vector empty_vec2; framework::ProgramDesc empty_prog2; empty_vec2.push_back(empty_prog2); - pserver_ptr2->configure(server_proto2, _ps_env2, 1, empty_vec2); - pserver_ptr2->start(ip2, port2); + pserver_ptr2->Configure(server_proto2, _ps_env2, 1, empty_vec2); + pserver_ptr2->Start(ip2, port2); pserver_ptr2->build_peer2peer_connection(1); } @@ -386,11 +386,11 @@ void RunClient( paddle::distributed::PaddlePSEnvironment _ps_env; auto servers_ = host_sign_list_.size(); _ps_env = paddle::distributed::PaddlePSEnvironment(); - _ps_env.set_ps_servers(&host_sign_list_, servers_); + _ps_env.SetPsServers(&host_sign_list_, servers_); worker_ptr_ = std::shared_ptr( (paddle::distributed::GraphBrpcClient*) - paddle::distributed::PSClientFactory::create(worker_proto)); - worker_ptr_->configure(worker_proto, dense_regions, _ps_env, 0); + paddle::distributed::PSClientFactory::Create(worker_proto)); + 
worker_ptr_->Configure(worker_proto, dense_regions, _ps_env, 0); worker_ptr_->set_shard_num(127); worker_ptr_->set_local_channel(index); worker_ptr_->set_local_graph_service( @@ -404,11 +404,11 @@ void RunBrpcPushSparse() { prepare_file(edge_file_name, 1); prepare_file(node_file_name, 0); auto ph_host = paddle::distributed::PSHost(ip_, port_, 0); - host_sign_list_.push_back(ph_host.serialize_to_string()); + host_sign_list_.push_back(ph_host.SerializeToString()); // test-start auto ph_host2 = paddle::distributed::PSHost(ip2, port2, 1); - host_sign_list_.push_back(ph_host2.serialize_to_string()); + host_sign_list_.push_back(ph_host2.SerializeToString()); // test-end // Srart Server std::thread* server_thread = new std::thread(RunServer); @@ -424,7 +424,7 @@ void RunBrpcPushSparse() { /*-----------------------Test Server Init----------------------------------*/ auto pull_status = - worker_ptr_->load(0, std::string(edge_file_name), std::string("e>")); + worker_ptr_->Load(0, std::string(edge_file_name), std::string("e>")); srand(time(0)); pull_status.wait(); std::vector> _vs; @@ -438,7 +438,7 @@ void RunBrpcPushSparse() { pull_status.wait(); ASSERT_EQ(0, _vs[0].size()); paddle::distributed::GraphTable* g = - (paddle::distributed::GraphTable*)pserver_ptr_->table(0); + (paddle::distributed::GraphTable*)pserver_ptr_->GetTable(0); size_t ttl = 6; g->make_neighbor_sample_cache(4, ttl); int round = 5; @@ -622,15 +622,15 @@ void RunBrpcPushSparse() { std::remove(node_file_name); testAddNode(worker_ptr_); LOG(INFO) << "Run stop_server"; - worker_ptr_->stop_server(); + worker_ptr_->StopServer(); LOG(INFO) << "Run finalize_worker"; - worker_ptr_->finalize_worker(); + worker_ptr_->FinalizeWorker(); testFeatureNodeSerializeInt(); testFeatureNodeSerializeInt64(); testFeatureNodeSerializeFloat32(); testFeatureNodeSerializeFloat64(); testGraphToBuffer(); - client1.stop_server(); + client1.StopServer(); } void testCache() { @@ -700,4 +700,4 @@ void testGraphToBuffer() { VLOG(0) << s1.get_feature(0); } -TEST(RunBrpcPushSparse, Run) { RunBrpcPushSparse(); } \ No newline at end of file +TEST(RunBrpcPushSparse, Run) { RunBrpcPushSparse(); } diff --git a/paddle/fluid/distributed/test/large_scale_test.cc b/paddle/fluid/distributed/test/large_scale_test.cc deleted file mode 100644 index 13c1d132124eb..0000000000000 --- a/paddle/fluid/distributed/test/large_scale_test.cc +++ /dev/null @@ -1,71 +0,0 @@ -/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
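The graph_node_split_test.cc and graph_node_test.cc hunks above are a mechanical rename of the parameter-server entry points from snake_case to CamelCase: set_ps_servers -> SetPsServers, PSServerFactory/PSClientFactory create -> Create, configure -> Configure, start -> Start, serialize_to_string -> SerializeToString, load -> Load, table -> GetTable, stop_server -> StopServer, finalize_worker -> FinalizeWorker. A minimal sketch of the renamed bring-up/teardown sequence these tests now use follows; it is not a compilable program, all identifiers come from the hunks above, and the template arguments elided by the flattened diff (shared_ptr and vector element types) are assumptions.

// Sketch only: the renamed PS bootstrap/teardown sequence exercised by the graph tests.
auto ph_host = paddle::distributed::PSHost(ip_, port_, 0);
host_sign_list_.push_back(ph_host.SerializeToString());        // was serialize_to_string()

auto _ps_env = paddle::distributed::PaddlePSEnvironment();
_ps_env.SetPsServers(&host_sign_list_, 2);                      // was set_ps_servers()

auto* server = paddle::distributed::PSServerFactory::Create(server_proto);      // was create()
server->Configure(server_proto, _ps_env, 0, empty_vec);                         // was configure()
server->Start(ip_, port_);                                                      // was start()

auto* client = paddle::distributed::PSClientFactory::Create(worker_proto);      // was create()
client->Configure(worker_proto, dense_regions, _ps_env, 0);                     // was configure()
auto status = client->Load(0, std::string(edge_file_name), std::string("e>"));  // was load()
status.wait();

client->StopServer();        // was stop_server()
client->FinalizeWorker();    // was finalize_worker()

Graph-specific helpers such as set_shard_num, set_local_channel and build_peer2peer_connection keep their snake_case names in this change.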
*/ - -#include - -#include -#include -#include // NOLINT - -#include "google/protobuf/text_format.h" -#include "gtest/gtest.h" -#include "paddle/fluid/distributed/ps.pb.h" -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/depends/large_scale_kv.h" -#include "paddle/fluid/distributed/ps/table/table.h" - -namespace paddle { -namespace distributed { - -TEST(BENCHMARK, LargeScaleKV) { - int emb_dim = 10; - int trainers = 2; - float beta1 = 0.9; - float beta2 = 0.999; - float epsilon = 1.0e-8; - - TableParameter table_config; - table_config.set_table_class("CommonSparseTable"); - FsClientParameter fs_config; - Table *table = new CommonSparseTable(); - TableAccessorParameter *accessor_config = table_config.mutable_accessor(); - accessor_config->set_accessor_class("CommMergeAccessor"); - CommonAccessorParameter *common_config = table_config.mutable_common(); - common_config->set_name("adam"); - common_config->set_table_name("adam_test_table"); - common_config->set_trainer_num(trainers); - common_config->add_params("Param"); - common_config->add_dims(emb_dim); - common_config->add_initializers("uniform_random&0&-1.0&1.0"); - common_config->add_params("LearningRate"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); - common_config->add_params("Moment1"); - common_config->add_dims(emb_dim); - common_config->add_initializers("fill_constant&0.0"); - common_config->add_params("Moment2"); - common_config->add_dims(emb_dim); - common_config->add_initializers("fill_constant&0.0"); - common_config->add_params("Beta1Pow"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); - common_config->add_params("Beta2Pow"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); - auto ret = table->initialize(table_config, fs_config); - ASSERT_EQ(ret, 0); -} - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/test/memory_geo_table_test.cc b/paddle/fluid/distributed/test/memory_geo_table_test.cc index fb48b38c76a28..ca3b51fade177 100644 --- a/paddle/fluid/distributed/test/memory_geo_table_test.cc +++ b/paddle/fluid/distributed/test/memory_geo_table_test.cc @@ -48,7 +48,7 @@ TEST(MemorySparseGeoTable, SSUM) { common_config->add_dims(emb_dim); common_config->add_initializers("fill_constant&1.0"); - auto ret = table->initialize(table_config, fs_config); + auto ret = table->Initialize(table_config, fs_config); ASSERT_EQ(ret, 0); // test push_sparse_param, and create params @@ -58,12 +58,26 @@ TEST(MemorySparseGeoTable, SSUM) { for (size_t i = 0; i < init_keys.size() * emb_dim; i++) { init_values.push_back(0.0); } - table->push_sparse_param(init_keys.data(), init_values.data(), - init_keys.size()); + + TableContext table_context1; + table_context1.value_type = Sparse; + table_context1.push_context.keys = init_keys.data(); + table_context1.push_context.values = init_values.data(); + table_context1.push_context.is_param = true; + table_context1.num = init_keys.size(); + + table->Push(table_context1); + // table->PushSparseParam(init_keys.data(), init_values.data(), + // init_keys.size()); std::vector pull_values(init_values.size()); auto value = PullSparseValue(init_keys, init_fres, emb_dim); - table->pull_sparse(pull_values.data(), value); + TableContext table_context; + table_context.value_type = Sparse; + table_context.pull_context.pull_value = value; + table_context.pull_context.values = pull_values.data(); + 
table->Pull(table_context); + // table->PullSparse(pull_values.data(), value); for (size_t i = 0; i < init_keys.size() * emb_dim; i++) { ASSERT_TRUE(abs(pull_values[i] - init_values[i]) < 1e-5); @@ -93,8 +107,14 @@ TEST(MemorySparseGeoTable, SSUM) { auto &push_keys = trainer_keys[i]; auto &push_values = trainer_values[i]; auto task = [table, &push_keys, &push_values] { - table->push_sparse(push_keys.data(), push_values.data(), - push_keys.size()); + TableContext table_context; + table_context.value_type = Sparse; + table_context.push_context.keys = push_keys.data(); + table_context.push_context.values = push_values.data(); + table_context.num = push_keys.size(); + table->Push(table_context); + // table->PushSparse(push_keys.data(), push_values.data(), + // push_keys.size()); }; task_status.push_back(pool_->enqueue(std::move(task))); } @@ -107,7 +127,13 @@ TEST(MemorySparseGeoTable, SSUM) { geo_pull_ids.resize(trainers); geo_pull_values.resize(trainers); for (int i = 0; i < trainers; i++) { - table->pull_geo_param(i, &geo_pull_values[i], &geo_pull_ids[i]); + TableContext table_context; + table_context.value_type = Sparse; + table_context.pull_context.geo_pull_keys = &geo_pull_ids[i]; + table_context.pull_context.geo_pull_values = &geo_pull_values[i]; + table_context.trainer_id = i; + table->Pull(table_context); + // table->PullGeoParam(i, &geo_pull_values[i], &geo_pull_ids[i]); ASSERT_EQ(geo_pull_values[i].size(), geo_pull_ids[i].size() * emb_dim); for (size_t j = 0; j < geo_pull_ids[i].size(); ++j) { auto id = geo_pull_ids[i][j]; diff --git a/paddle/fluid/distributed/test/memory_sparse_table_test.cc b/paddle/fluid/distributed/test/memory_sparse_table_test.cc index aec02e8aec558..68bc50373ffad 100644 --- a/paddle/fluid/distributed/test/memory_sparse_table_test.cc +++ b/paddle/fluid/distributed/test/memory_sparse_table_test.cc @@ -36,7 +36,7 @@ TEST(MemorySparseTable, SGD) { table_config.set_shard_num(10); FsClientParameter fs_config; Table *table = new MemorySparseTable(); - table->set_shard(0, 1); + table->SetShard(0, 1); TableAccessorParameter *accessor_config = table_config.mutable_accessor(); accessor_config->set_accessor_class("CtrCommonAccessor"); @@ -66,7 +66,7 @@ TEST(MemorySparseTable, SGD) { naive_param->add_weight_bounds(-10.0); naive_param->add_weight_bounds(10.0); - auto ret = table->initialize(table_config, fs_config); + auto ret = table->Initialize(table_config, fs_config); ASSERT_EQ(ret, 0); // pull parameters for create and check @@ -76,7 +76,13 @@ TEST(MemorySparseTable, SGD) { std::vector init_values; init_values.resize(init_keys.size() * (emb_dim + 3)); auto value = PullSparseValue(init_keys, init_fres, emb_dim); - table->pull_sparse(init_values.data(), value); + + TableContext table_context; + table_context.value_type = Sparse; + table_context.pull_context.pull_value = value; + table_context.pull_context.values = init_values.data(); + table->Pull(table_context); + // table->PullSparse(init_values.data(), value); // for check std::vector total_gradients; @@ -109,8 +115,14 @@ TEST(MemorySparseTable, SGD) { auto &push_keys = trainer_keys[i]; auto &push_values = trainer_gradient_values[i]; auto task = [table, &push_keys, &push_values] { - table->push_sparse(push_keys.data(), push_values.data(), - push_keys.size()); + TableContext table_context; + table_context.value_type = Sparse; + table_context.push_context.keys = push_keys.data(); + table_context.push_context.values = push_values.data(); + table_context.num = push_keys.size(); + table->Push(table_context); + // 
table->PushSparse(push_keys.data(), push_values.data(), + // push_keys.size()); }; task_status.push_back(pool_->enqueue(std::move(task))); } @@ -120,7 +132,13 @@ TEST(MemorySparseTable, SGD) { std::vector pull_values; pull_values.resize(init_keys.size() * (emb_dim + 3)); - table->pull_sparse(pull_values.data(), value); + + TableContext table_context1; + table_context1.value_type = Sparse; + table_context1.pull_context.pull_value = value; + table_context1.pull_context.values = pull_values.data(); + table->Pull(table_context1); + // table->PullSparse(pull_values.data(), value); for (size_t i = 0; i < init_keys.size(); ++i) { for (size_t j = 2; j < emb_dim + 3; ++j) { @@ -133,7 +151,7 @@ TEST(MemorySparseTable, SGD) { } MemorySparseTable *ctr_table = dynamic_cast(table); - ctr_table->save_local_fs("./work/table.save", "0", "test"); + ctr_table->SaveLocalFS("./work/table.save", "0", "test"); } } // namespace distributed diff --git a/paddle/fluid/distributed/test/sparse_sgd_rule_test.cc b/paddle/fluid/distributed/test/sparse_sgd_rule_test.cc index c895231d93ec5..1a4e16b926619 100644 --- a/paddle/fluid/distributed/test/sparse_sgd_rule_test.cc +++ b/paddle/fluid/distributed/test/sparse_sgd_rule_test.cc @@ -31,22 +31,22 @@ TEST(sparse_value_naive_sgd_test, init_and_update) { naive_param->add_weight_bounds(-10.0); naive_param->add_weight_bounds(10.0); - rule.load_config(param, 10); + rule.LoadConfig(param, 10); // check init_value for zero const int kItemSize = 10; float w[kItemSize]; float grad[kItemSize]; - rule.init_value(w, w + 9, true); + rule.InitValue(w, w + 9, true); for (auto i = 0u; i < kItemSize; ++i) { ASSERT_FLOAT_EQ(w[i], 0); } // check init_value for random - rule.init_value(w, w + 9, false); + rule.InitValue(w, w + 9, false); for (auto i = 0u; i < kItemSize; ++i) { - ASSERT_TRUE(w[i] >= rule.min_bound() && w[i] <= rule.max_bound()); + ASSERT_TRUE(w[i] >= rule.MinBound() && w[i] <= rule.MaxBound()); } // check update_value for one field @@ -59,7 +59,7 @@ TEST(sparse_value_naive_sgd_test, init_and_update) { float label[] = {-0.100000, -0.200000, -0.300000, -0.400000, -0.500000, -0.600000, -0.700000, -0.800000, -0.900000, -1.000000}; const float* ptr_grad = grad; - rule.update_value(w, w + 9, ptr_grad); + rule.UpdateValue(w, w + 9, ptr_grad); for (auto i = 0u; i < kItemSize; ++i) { VLOG(3) << w[i] << "\n"; @@ -78,14 +78,14 @@ TEST(downpour_sparse_adagrad_test, test_init_and_update) { adagrad_param->add_weight_bounds(-10.0); adagrad_param->add_weight_bounds(10.0); - rule.load_config(param, 10); + rule.LoadConfig(param, 10); // check init_value for zero const int kValueSize = 11; int kEmbSize = 10; float w[kValueSize]; - rule.init_value(w, w + 10, true); + rule.InitValue(w, w + 10, true); for (auto i = 0u; i < kEmbSize; ++i) { ASSERT_FLOAT_EQ(w[i], 0); @@ -93,9 +93,9 @@ TEST(downpour_sparse_adagrad_test, test_init_and_update) { ASSERT_FLOAT_EQ(w[kEmbSize], 0); // check init_value for random - rule.init_value(w, w + 10, false); + rule.InitValue(w, w + 10, false); for (auto i = 0u; i < kEmbSize; ++i) { - ASSERT_TRUE(w[i] >= rule.min_bound() && w[i] <= rule.max_bound()); + ASSERT_TRUE(w[i] >= rule.MinBound() && w[i] <= rule.MaxBound()); } ASSERT_FLOAT_EQ(w[kEmbSize], 0); @@ -110,7 +110,7 @@ TEST(downpour_sparse_adagrad_test, test_init_and_update) { } const float* ptr_grad = grad; - rule.update_value(w, w + 10, ptr_grad); + rule.UpdateValue(w, w + 10, ptr_grad); float label[] = {-0.100000, -0.200000, -0.300000, -0.400000, -0.500000, -0.600000, -0.700000, -0.800000, -0.900000, -1.000000, 
38.500000}; @@ -140,33 +140,33 @@ TEST(downpour_sparse_adam_test, test_init_and_update) { SparseAdamSGDRule rule; - rule.load_config(param, embed_dim); + rule.LoadConfig(param, embed_dim); // check init_value for zero const int rule_dim = - rule.dim(); // dims of gsum + g2sum + beta1_pow + beta2_pow in adam + rule.Dim(); // dims of gsum + g2sum + beta1_pow + beta2_pow in adam const int value_dim = embed_dim + rule_dim; // total dims of w + rule float* value = new float[value_dim]; - rule.init_value(value, value + embed_dim, true); - for (auto i = 0u; i < rule.beta1_pow_index(); ++i) { + rule.InitValue(value, value + embed_dim, true); + for (auto i = 0u; i < rule.Beta1PowIndex(); ++i) { ASSERT_FLOAT_EQ(value[i], 0); } - ASSERT_FLOAT_EQ(*(value + embed_dim + rule.beta1_pow_index()), 0.9); - ASSERT_FLOAT_EQ(*(value + embed_dim + rule.beta2_pow_index()), 0.999); + ASSERT_FLOAT_EQ(*(value + embed_dim + rule.Beta1PowIndex()), 0.9); + ASSERT_FLOAT_EQ(*(value + embed_dim + rule.Beta2PowIndex()), 0.999); // check init_value for random - rule.init_value(value, value + embed_dim, false); + rule.InitValue(value, value + embed_dim, false); for (auto i = 0u; i < embed_dim; ++i) { - ASSERT_TRUE(value[i] >= rule.min_bound() && value[i] <= rule.max_bound()); + ASSERT_TRUE(value[i] >= rule.MinBound() && value[i] <= rule.MaxBound()); } - for (auto i = rule.gsum_index(); i < rule.beta1_pow_index(); ++i) { + for (auto i = rule.GSumIndex(); i < rule.Beta1PowIndex(); ++i) { ASSERT_FLOAT_EQ(value[i + embed_dim], 0); } - ASSERT_FLOAT_EQ(*(value + embed_dim + rule.beta1_pow_index()), 0.9); - ASSERT_FLOAT_EQ(*(value + embed_dim + rule.beta2_pow_index()), 0.999); + ASSERT_FLOAT_EQ(*(value + embed_dim + rule.Beta1PowIndex()), 0.9); + ASSERT_FLOAT_EQ(*(value + embed_dim + rule.Beta2PowIndex()), 0.999); // check update_value - rule.init_value(value, value + embed_dim, true); + rule.InitValue(value, value + embed_dim, true); float* grad = new float[embed_dim]; for (auto i = 0u; i < embed_dim; ++i) { grad[i] = (i + 1) * 1.0; @@ -181,7 +181,7 @@ TEST(downpour_sparse_adam_test, test_init_and_update) { 0.0249996781, 0.0359995365, 0.0489993691, 0.063999176, 0.0809989572, 0.0999987125, 0.809999943, 0.998001039}; - rule.update_value(value, value + embed_dim, grad); + rule.UpdateValue(value, value + embed_dim, grad); for (auto i = 0u; i < value_dim; ++i) { // check update ASSERT_FLOAT_EQ(value[i], label[i]) << "i is " << i; diff --git a/paddle/fluid/distributed/test/sparse_table_test.cc b/paddle/fluid/distributed/test/sparse_table_test.cc deleted file mode 100644 index f13bab078a6b0..0000000000000 --- a/paddle/fluid/distributed/test/sparse_table_test.cc +++ /dev/null @@ -1,223 +0,0 @@ -/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
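In the memory_geo_table_test.cc and memory_sparse_table_test.cc hunks further above, the direct table->push_sparse_param / pull_sparse / push_sparse / pull_geo_param calls are replaced by a single Push/Pull pair driven by a TableContext. The sketch below collects the new calling pattern; it is not a standalone program, it uses only the fields that appear in those hunks (value_type, num, trainer_id, push_context.keys/values/is_param, pull_context.pull_value/values/geo_pull_keys/geo_pull_values), and the field types are assumptions inferred from usage.

// Sketch of the TableContext-based Table API used by the updated tests.

// Pushing initial sparse parameters (previously push_sparse_param):
TableContext param_ctx;
param_ctx.value_type = Sparse;
param_ctx.push_context.keys = init_keys.data();
param_ctx.push_context.values = init_values.data();
param_ctx.push_context.is_param = true;   // apparently marks a parameter push rather than a gradient push
param_ctx.num = init_keys.size();
table->Push(param_ctx);

// Pulling sparse values (previously pull_sparse):
TableContext pull_ctx;
pull_ctx.value_type = Sparse;
pull_ctx.pull_context.pull_value = value;          // a PullSparseValue built from keys and frequencies
pull_ctx.pull_context.values = pull_values.data();
table->Pull(pull_ctx);

// Pushing gradients (previously push_sparse):
TableContext grad_ctx;
grad_ctx.value_type = Sparse;
grad_ctx.push_context.keys = push_keys.data();
grad_ctx.push_context.values = push_values.data();
grad_ctx.num = push_keys.size();
table->Push(grad_ctx);

// Pulling GEO deltas for one trainer (previously pull_geo_param):
TableContext geo_ctx;
geo_ctx.value_type = Sparse;
geo_ctx.pull_context.geo_pull_keys = &geo_pull_ids[i];
geo_ctx.pull_context.geo_pull_values = &geo_pull_values[i];
geo_ctx.trainer_id = i;
table->Pull(geo_ctx);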
*/ - -#include - -#include -#include -#include // NOLINT - -#include "google/protobuf/text_format.h" -#include "gtest/gtest.h" -#include "paddle/fluid/distributed/ps.pb.h" -#include "paddle/fluid/distributed/ps/table/common_dense_table.h" -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/sparse_geo_table.h" -#include "paddle/fluid/distributed/ps/table/table.h" - -namespace paddle { -namespace distributed { - -// CommonSparseTable + SSGD -TEST(CommonSparseTable, SGD) { - int emb_dim = 10; - int trainers = 2; - - TableParameter table_config; - table_config.set_table_class("CommonSparseTable"); - FsClientParameter fs_config; - Table *table = new CommonSparseTable(); - TableAccessorParameter *accessor_config = table_config.mutable_accessor(); - accessor_config->set_accessor_class("CommMergeAccessor"); - CommonAccessorParameter *common_config = table_config.mutable_common(); - common_config->set_name("sgd"); - common_config->set_table_name("sgd_test_table"); - common_config->set_trainer_num(trainers); - common_config->add_params("Param"); - common_config->add_dims(emb_dim); - common_config->add_initializers("uniform_random&0&-1.0&1.0"); // param - common_config->add_params("LearningRate"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); // learning_rate - auto ret = table->initialize(table_config, fs_config); - ASSERT_EQ(ret, 0); - - // pull parameters for create and check - std::vector init_keys = {0, 1, 2, 3, 4}; - std::vector init_fres = {1, 1, 1, 1, 1}; - - std::vector init_values; - init_values.resize(init_keys.size() * emb_dim); - - std::vector pull_values(init_values.size()); - auto value = PullSparseValue(init_keys, init_fres, emb_dim); - table->pull_sparse(init_values.data(), value); - - // for check - std::vector total_gradients; - total_gradients.resize(init_keys.size() * emb_dim); - memset(total_gradients.data(), 0, sizeof(float) * total_gradients.size()); - - // push gradient - std::vector> trainer_keys; - std::vector> trainer_gradient_values; - trainer_keys.resize(trainers); - trainer_gradient_values.resize(trainers); - float start = 0.0; - for (int i = 0; i < trainers; i++) { - trainer_keys[i] = init_keys; - for (size_t j = 0; j < trainer_keys[i].size(); j++) { - auto id = trainer_keys[i][j]; - for (int k = 0; k < emb_dim; k++) { - trainer_gradient_values[i].push_back(start); - total_gradients[id * emb_dim + k] += start; - start += 0.1; - } - } - } - - std::shared_ptr<::ThreadPool> pool_ = - std::make_shared<::ThreadPool>(trainers); - std::vector> task_status; - for (int i = 0; i < trainers; i++) { - auto &push_keys = trainer_keys[i]; - auto &push_values = trainer_gradient_values[i]; - auto task = [table, &push_keys, &push_values] { - table->push_sparse(push_keys.data(), push_values.data(), - push_keys.size()); - }; - task_status.push_back(pool_->enqueue(std::move(task))); - } - for (auto &status : task_status) { - status.wait(); - } - - std::vector pull_values; - pull_values.resize(init_keys.size() * emb_dim); - table->pull_sparse(init_values.data(), value); - - for (size_t i = 0; i < init_values.size(); ++i) { - auto update_val = init_values[i] - 1.0 * total_gradients[i]; - ASSERT_TRUE(abs(update_val - pull_values[i]) < 1e-5); - } -} - -// CommonSparseTable + Adam -TEST(CommonSparseTable, Adam) { - int emb_dim = 10; - int trainers = 2; - float beta1 = 0.9; - float beta2 = 0.999; - float epsilon = 1.0e-8; - - TableParameter table_config; - 
table_config.set_table_class("CommonSparseTable"); - FsClientParameter fs_config; - Table *table = new CommonSparseTable(); - TableAccessorParameter *accessor_config = table_config.mutable_accessor(); - accessor_config->set_accessor_class("CommMergeAccessor"); - CommonAccessorParameter *common_config = table_config.mutable_common(); - common_config->set_name("adam"); - common_config->set_table_name("adam_test_table"); - common_config->set_trainer_num(trainers); - common_config->add_params("Param"); - common_config->add_dims(emb_dim); - common_config->add_initializers("uniform_random&0&-1.0&1.0"); - common_config->add_params("LearningRate"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); - common_config->add_params("Moment1"); - common_config->add_dims(emb_dim); - common_config->add_initializers("fill_constant&0.0"); - common_config->add_params("Moment2"); - common_config->add_dims(emb_dim); - common_config->add_initializers("fill_constant&0.0"); - common_config->add_params("Beta1Pow"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); - common_config->add_params("Beta2Pow"); - common_config->add_dims(1); - common_config->add_initializers("fill_constant&1.0"); - auto ret = table->initialize(table_config, fs_config); - ASSERT_EQ(ret, 0); - - // pull parameters for create and check - std::vector init_keys = {0, 1, 2, 3, 4}; - std::vector init_fres = {1, 1, 1, 1, 1}; - - std::vector init_values; - init_values.resize(init_keys.size() * emb_dim); - - auto value = PullSparseValue(init_keys, init_fres, emb_dim); - table->pull_sparse(init_values.data(), value); - - // push gradient - std::vector> trainer_keys; - std::vector> trainer_gradient_values; - trainer_keys.resize(trainers); - trainer_gradient_values.resize(trainers); - float start = 0.0; - for (int i = 0; i < trainers; i++) { - trainer_keys[i] = init_keys; - for (size_t j = 0; j < trainer_keys[i].size(); j++) { - for (int k = 0; k < emb_dim; k++) { - trainer_gradient_values[i].push_back(start); - start += 0.1; - } - } - } - - for (int i = 0; i < trainers; i++) { - auto &push_keys = trainer_keys[i]; - auto &push_values = trainer_gradient_values[i]; - table->push_sparse(push_keys.data(), push_values.data(), push_keys.size()); - } - - std::vector pull_values; - pull_values.resize(init_keys.size() * emb_dim); - table->pull_sparse(pull_values.data(), init_keys.data(), init_keys.size()); - - for (size_t idx = 0; idx < init_keys.size(); idx += emb_dim) { - std::vector beta1_pow, beta2_pow, lr, mom1, mom2, param; - beta1_pow.push_back(beta1); - beta2_pow.push_back(beta2); - lr.push_back(1.0); - for (int i = 0; i < emb_dim; i++) { - mom1.push_back(0.0); - mom2.push_back(0.0); - param.push_back(init_values[idx + i]); - } - for (int i = 0; i < trainers; i++) { - auto lr_ = lr[0] * sqrt(1 - beta2_pow[0]) / (1 - beta1_pow[0]); - for (int j = 0; j < emb_dim; j++) { - mom1[j] = - beta1 * mom1[j] + (1 - beta1) * trainer_gradient_values[i][idx + j]; - mom2[j] = beta2 * mom2[j] + - (1 - beta2) * trainer_gradient_values[i][idx + j] * - trainer_gradient_values[i][idx + j]; - param[j] = param[j] - - lr_ * (mom1[j] / - (sqrt(mom2[j]) + epsilon * sqrt(1 - beta2_pow[0]))); - } - beta1_pow[0] *= beta1; - beta2_pow[0] *= beta2; - } - for (int i = 0; i < emb_dim; i++) { - ASSERT_TRUE(abs(param[i] - pull_values[idx + i]) < 1e-5); - } - } -} - -} // namespace distributed -} // namespace paddle diff --git a/paddle/fluid/distributed/test/table_test.cc b/paddle/fluid/distributed/test/table_test.cc 
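The deleted sparse_table_test.cc above validated CommonSparseTable against host-side reference optimizers; its Adam reference used beta1 = 0.9, beta2 = 0.999, epsilon = 1.0e-8 and a constant learning rate of 1.0. That reference loop can be reproduced standalone. The sketch below mirrors it for a single parameter slot; the embedding dimension, trainer count and gradient values are illustrative, not taken from the test.

#include <cmath>
#include <cstdio>
#include <vector>

// Host-side Adam reference, mirroring the per-trainer loop of the deleted
// CommonSparseTable Adam test: bias-corrected step size, first/second moments,
// and the epsilon term scaled by sqrt(1 - beta2_pow).
int main() {
  const int emb_dim = 10;   // illustrative
  const int trainers = 2;   // illustrative
  const float beta1 = 0.9f, beta2 = 0.999f, epsilon = 1.0e-8f, lr = 1.0f;

  std::vector<float> param(emb_dim, 0.0f);  // parameters start at 0 for this sketch
  std::vector<float> mom1(emb_dim, 0.0f);   // first moment
  std::vector<float> mom2(emb_dim, 0.0f);   // second moment
  float beta1_pow = beta1, beta2_pow = beta2;

  for (int t = 0; t < trainers; ++t) {      // one gradient push per trainer
    const float lr_ = lr * std::sqrt(1.0f - beta2_pow) / (1.0f - beta1_pow);
    for (int j = 0; j < emb_dim; ++j) {
      const float g = 0.1f * (t * emb_dim + j + 1);  // illustrative gradient
      mom1[j] = beta1 * mom1[j] + (1.0f - beta1) * g;
      mom2[j] = beta2 * mom2[j] + (1.0f - beta2) * g * g;
      param[j] -= lr_ * (mom1[j] /
                         (std::sqrt(mom2[j]) + epsilon * std::sqrt(1.0f - beta2_pow)));
    }
    beta1_pow *= beta1;
    beta2_pow *= beta2;
  }

  for (int j = 0; j < emb_dim; ++j) std::printf("param[%d] = %f\n", j, param[j]);
  return 0;
}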
index 6a29781158b83..4f73519ef5e69 100644 --- a/paddle/fluid/distributed/test/table_test.cc +++ b/paddle/fluid/distributed/test/table_test.cc @@ -14,19 +14,19 @@ limitations under the License. */ #include "gtest/gtest.h" #include "paddle/fluid/distributed/ps.pb.h" -#include "paddle/fluid/distributed/ps/table/common_sparse_table.h" -#include "paddle/fluid/distributed/ps/table/sparse_geo_table.h" +#include "paddle/fluid/distributed/ps/table/memory_dense_table.h" +//#include "paddle/fluid/distributed/ps/table/sparse_geo_table.h" namespace paddle { namespace distributed { TEST(Table, Initialize) { TableParameter table_config; - table_config.set_table_class("SparseGeoTable"); + table_config.set_table_class("MemoryDenseTable"); FsClientParameter fs_config; // case 1. no accessor - Table *table = new SparseGeoTable(); - auto ret = table->initialize(table_config, fs_config); + Table *table = new MemoryDenseTable(); + auto ret = table->Initialize(table_config, fs_config); ASSERT_EQ(ret, -1); } } // namespace distributed diff --git a/paddle/fluid/eager/CMakeLists.txt b/paddle/fluid/eager/CMakeLists.txt index d8089bedf924e..da326ff7d76d7 100644 --- a/paddle/fluid/eager/CMakeLists.txt +++ b/paddle/fluid/eager/CMakeLists.txt @@ -13,12 +13,16 @@ add_subdirectory(accumulation) add_subdirectory(custom_operator) if(NOT ((NOT WITH_PYTHON) AND ON_INFER)) add_subdirectory(pylayer) + cc_library(grad_tensor_holder SRCS grad_tensor_holder.cc DEPS grad_node_info gradient_accumulator) + add_dependencies(grad_tensor_holder eager_final_state_codegen) + cc_library(backward SRCS backward.cc DEPS grad_tensor_holder utils autograd_meta grad_node_info) endif() + cc_library(grad_node_info SRCS grad_node_info.cc DEPS phi_api phi_tensor) -cc_library(grad_tensor_holder SRCS grad_tensor_holder.cc DEPS grad_node_info gradient_accumulator) cc_library(autograd_meta SRCS autograd_meta.cc DEPS phi_api phi_tensor) cc_library(utils SRCS utils.cc DEPS phi_api phi_tensor global_utils layer proto_desc operator op_registry variable_helper memcpy scale_op autograd_meta hook_utils) -cc_library(backward SRCS backward.cc DEPS grad_tensor_holder utils autograd_meta grad_node_info) -add_subdirectory(tests) +if(NOT ((NOT WITH_PYTHON) AND ON_INFER)) + add_subdirectory(tests) +endif() diff --git a/paddle/fluid/eager/accumulation/accumulation_node.h b/paddle/fluid/eager/accumulation/accumulation_node.h index 2e38d7e9e91e2..38d5533c3d606 100644 --- a/paddle/fluid/eager/accumulation/accumulation_node.h +++ b/paddle/fluid/eager/accumulation/accumulation_node.h @@ -25,7 +25,10 @@ class GradNodeAccumulation : public GradNodeBase { // Constructor: configure fwd input tensors to grad node explicit GradNodeAccumulation(AutogradMeta* meta) : GradNodeBase(1, 1) { VLOG(6) << "Construct GradNodeAccumulation"; - weak_grad_ = meta->WeakGrad(); + if (meta) { + weak_grad_ = meta->WeakGrad(); + } + SetDefaultGradInOutMeta(); } @@ -40,11 +43,6 @@ class GradNodeAccumulation : public GradNodeBase { void ClearTensorWrappers() override { VLOG(6) << "Do nothing here now"; } - bool IsTensorWrappersCleared() override { - VLOG(6) << "Do nothing here now"; - return false; - } - std::string name() { return "GradNodeAccumulation"; } /** @@ -58,6 +56,11 @@ class GradNodeAccumulation : public GradNodeBase { inline bool ReduceHooksRegistered() { return reduce_hooks_.size() != 0; } void ApplyReduceHooks(); + std::shared_ptr Copy() const override { + return std::shared_ptr( + new GradNodeAccumulation(nullptr)); + } + private: std::weak_ptr weak_grad_; diff --git 
a/paddle/fluid/eager/api/generated/eager_generated/backwards/scale_node.h b/paddle/fluid/eager/api/generated/eager_generated/backwards/scale_node.h index 0b942d2a06707..dd61ddc486eef 100644 --- a/paddle/fluid/eager/api/generated/eager_generated/backwards/scale_node.h +++ b/paddle/fluid/eager/api/generated/eager_generated/backwards/scale_node.h @@ -44,11 +44,6 @@ class GradNodeScale : public GradNodeBase { void ClearTensorWrappers() override { VLOG(6) << "Do nothing here now"; } - bool IsTensorWrappersCleared() override { - VLOG(6) << "Do nothing here now"; - return false; - } - void SetTensorWrappers_X( const std::vector& tensors); @@ -56,6 +51,12 @@ class GradNodeScale : public GradNodeBase { std::string name() override { return ""; } // Members: define fwd input tensors // For Scale there is no fwd input tensor needed + + std::shared_ptr Copy() const override { + auto copied_node = std::make_shared(*this); + return copied_node; + } + private: float scale_{1.0}; }; diff --git a/paddle/fluid/eager/api/utils/hook_utils.cc b/paddle/fluid/eager/api/utils/hook_utils.cc index 9abd7be49d44c..8ee646b718c2f 100644 --- a/paddle/fluid/eager/api/utils/hook_utils.cc +++ b/paddle/fluid/eager/api/utils/hook_utils.cc @@ -76,6 +76,7 @@ void RetainGradForTensor(const paddle::experimental::Tensor& tensor) { VLOG(7) << "Set impl for RetainGrad Hook for tensor: " << t.name(); // Simply Copy impl() to grad_tensor grad_tensor->set_impl(t.impl()); + grad_tensor->set_autograd_meta(t.mutable_autograd_meta()); return *grad_tensor.get(); } else { VLOG(7) << "Retain NULL paddle::experimental::Tensor in Grad Hook"; diff --git a/paddle/fluid/eager/auto_code_generator/eager_generator.cc b/paddle/fluid/eager/auto_code_generator/eager_generator.cc index 9039cf8eba95a..de44a833f6e73 100644 --- a/paddle/fluid/eager/auto_code_generator/eager_generator.cc +++ b/paddle/fluid/eager/auto_code_generator/eager_generator.cc @@ -1824,7 +1824,7 @@ static std::pair GenerateForwardFunctionContents( // Bump inplace version of inplace tensor. 
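The accumulation_node.h and scale_node.h hunks above drop the per-class IsTensorWrappersCleared() overrides (the generator hunks below switch to SetIsTensorWrappersCleared(true), so the flag appears to move into the GradNodeBase base class) and add a Copy() override so that a grad node can be cloned polymorphically: GradNodeScale copies *this, while GradNodeAccumulation reconstructs itself with a null AutogradMeta, which is why its constructor now guards against a null meta. The snippet below is a self-contained illustration of that clone idiom; the type names are illustrative, not Paddle's.

#include <iostream>
#include <memory>

// Each node type overrides Copy() and returns a shared_ptr to a fresh copy of itself.
struct NodeBase {
  virtual ~NodeBase() = default;
  virtual std::shared_ptr<NodeBase> Copy() const = 0;
};

struct ScaleLikeNode : NodeBase {
  float scale{1.0f};
  std::shared_ptr<NodeBase> Copy() const override {
    // mirrors GradNodeScale::Copy(): duplicate the node's own state
    return std::make_shared<ScaleLikeNode>(*this);
  }
};

int main() {
  ScaleLikeNode n;
  n.scale = 2.5f;
  std::shared_ptr<NodeBase> copy = n.Copy();
  std::cout << static_cast<ScaleLikeNode&>(*copy).scale << "\n";  // prints 2.5
  return 0;
}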
auto inplace_input_name = inplace_map[output_name]; const char* FWD_OUT_TENSOR_TEMPLATE = - " egr::EagerUtils::ModifyInplaceInput(outs[\"%s\"][0], &%s);\n" + " egr::EagerUtils::GetOutput(outs[\"%s\"][0], &%s);\n" " %s.bump_inplace_version();\n" " VLOG(3) << \"Tensor(\" << %s.name() << \") uses Inplace " "Strategy.\";\n"; @@ -2479,22 +2479,23 @@ static std::string GenerateGradNodeHeaderContents( "\n" " void ClearTensorWrappers() override { \n" "%s\n" - " is_tensor_wrappers_cleared = true;\n" + " SetIsTensorWrappersCleared(true);\n" " }\n" " std::string name() override { return \" GradNode%s \"; } \n " "\n" + "std::shared_ptr Copy() const override {{\n " + " auto copied_node = std::shared_ptr(new " + "GradNode%s(*this));\n " + " return copied_node;\n " + "}}\n " + "\n" " // SetX, SetY, ...\n" "%s\n" " // SetAttrMap\n" "%s\n" - " bool IsTensorWrappersCleared() override { \n" - " return is_tensor_wrappers_cleared;\n" - " }\n" " private:\n" " // TensorWrappers\n" "%s\n" - " bool is_tensor_wrappers_cleared = false;\n" - "\n" " // Attribute Map\n" "%s\n" "};"; @@ -2601,8 +2602,9 @@ static std::string GenerateGradNodeHeaderContents( std::string grad_node_str = paddle::string::Sprintf( GRAD_NODE_TEMPLATE, op_type, op_type, op_type, op_type, op_type, op_type, - op_type, clear_tensor_wrappers_str, op_type, set_tensor_wrappers_str, - set_attr_map_str, tensor_wrapper_members_str, attr_members_str); + op_type, clear_tensor_wrappers_str, op_type, op_type, op_type, + set_tensor_wrappers_str, set_attr_map_str, tensor_wrapper_members_str, + attr_members_str); return grad_node_str; } @@ -2651,7 +2653,8 @@ static void GenerateForwardDygraphFile(const std::string& forward_cc_path, "#include \"paddle/fluid/eager/api/utils/global_utils.h\"\n" "#include \"paddle/fluid/eager/amp_utils.h\"\n" "#include \"paddle/fluid/eager/amp_auto_cast.h\"\n" - "#include \"paddle/fluid/platform/profiler/event_tracing.h\"\n\n"; + "#include \"paddle/fluid/platform/profiler/event_tracing.h\"\n" + "#pragma GCC diagnostic ignored \"-Wunused-variable\"\n\n"; std::string forward_cc_include_str = paddle::string::Sprintf(FORWARD_INCLUDE_TEMPLATE); std::ofstream forward_cc_stream(forward_cc_path, std::ios::out); @@ -2803,7 +2806,7 @@ static void DygraphCodeGeneration(const std::string& output_dir) { // Inplace Function Generator. // `sum` op has duplicate input. Don't consider adding inplace strategy // for `sum` in temporary. - if (op_type != "sum" && infer_inplace) { + if (infer_inplace && !special_inplace_op_set.count(op_type)) { auto in_to_outs = infer_inplace(true); for (auto& inplace_pair : in_to_outs) { inplace_map[inplace_pair.second] = inplace_pair.first; diff --git a/paddle/fluid/eager/auto_code_generator/final_state_generator/codegen_utils.py b/paddle/fluid/eager/auto_code_generator/final_state_generator/codegen_utils.py index 8b7ea547b2632..21b6b882a6f15 100644 --- a/paddle/fluid/eager/auto_code_generator/final_state_generator/codegen_utils.py +++ b/paddle/fluid/eager/auto_code_generator/final_state_generator/codegen_utils.py @@ -21,7 +21,10 @@ ######################## ### Global Variables ### ######################## -ops_to_fill_zero_for_empty_grads = set(["split", "rnn"]) +ops_to_fill_zero_for_empty_grads = set([ + "split_grad", "rnn_grad", "matmul_double_grad", "matmul_triple_grad", + "sigmoid_triple_grad" +]) # For API dispatch used at python-level # { op_name : [arg_name, ...] 
} @@ -170,10 +173,9 @@ def GetForwardFunctionName(string): return f"{string}_final_state_dygraph_function" -def TransformGradVarNameForDoubleGradGeneration(string): - if IsGradName(string): - string = "grad_" + string[:-5] - return string +def GetIndent(num): + tab = " " + return "".join([tab for i in range(num)]) ###################### diff --git a/paddle/fluid/eager/auto_code_generator/final_state_generator/eager_gen.py b/paddle/fluid/eager/auto_code_generator/final_state_generator/eager_gen.py index 0f78763d6c959..3b4c8f962179e 100644 --- a/paddle/fluid/eager/auto_code_generator/final_state_generator/eager_gen.py +++ b/paddle/fluid/eager/auto_code_generator/final_state_generator/eager_gen.py @@ -31,8 +31,7 @@ from codegen_utils import ParseYamlForward, ParseYamlBackward from codegen_utils import FunctionGeneratorBase, YamlGeneratorBase from codegen_utils import ops_to_fill_zero_for_empty_grads -from codegen_utils import TransformGradVarNameForDoubleGradGeneration -from codegen_utils import AssertMessage +from codegen_utils import AssertMessage, GetIndent ########### @@ -88,9 +87,9 @@ def ParseArguments(): CLEAR_VECTOR_TENSOR_WRAPPERS_TEMPLATE = \ """ - for (auto tw: {}) { + for (auto& tw : {}) {{ tw.clear(); - }; + }} """ SET_ATTR_METHOD_TEMPLATE = \ @@ -112,80 +111,82 @@ def ParseArguments(): NODE_DECLARATION_TEMPLATE = \ """ - class {} : public egr::GradNodeBase {{ - public: - {}() : egr::GradNodeBase() {{}} - {}(size_t bwd_in_slot_num, size_t bwd_out_slot_num) : - egr::GradNodeBase(bwd_in_slot_num, bwd_out_slot_num) {{}} - ~{}() override = default; - - virtual std::vector> operator()( - std::vector>& grads, bool create_graph = false) override; - std::string name() override {{ return \" {} \"; }} - - void ClearTensorWrappers() override {{ - {} - is_tensor_wrappers_cleared = true; - }} - - // SetTensorWrapperX, SetTensorWrapperY, ... - {} - // SetAttributes - {} - - bool IsTensorWrappersCleared() override {{ - return is_tensor_wrappers_cleared; - }} - private: - // TensorWrappers - {} - - bool is_tensor_wrappers_cleared = false; - - // Attributes +class {} : public egr::GradNodeBase {{ + public: + {}() : egr::GradNodeBase() {{}} + {}(size_t bwd_in_slot_num, size_t bwd_out_slot_num) : + egr::GradNodeBase(bwd_in_slot_num, bwd_out_slot_num) {{}} + ~{}() override = default; + + virtual std::vector> operator()( + std::vector>& grads, bool create_graph = false) override; + std::string name() override {{ return \" {} \"; }} + + void ClearTensorWrappers() override {{ {} - }}; + SetIsTensorWrappersCleared(true); + }} + + std::shared_ptr Copy() const override {{ + auto copied_node = std::shared_ptr<{}>(new {}(*this)); + + return copied_node; + }} + + // SetTensorWrapperX, SetTensorWrapperY, ... 
+ {} + // SetAttributes + {} + + private: + // TensorWrappers + {} + + // Attributes + {} +}}; """ GRAD_FUNCTION_TEMPLATE = \ """ - std::vector> {}::operator()(std::vector>& grads, bool create_graph) {{ - // Fill Zero For GradIn Tensors - {} +std::vector> {}::operator()(std::vector>& grads, bool create_graph) {{ + // Fill Zero For GradIn Tensors +{} - // Apply Gradient Hooks - auto hooked_grads = ApplyGradientHooks(grads); - - // Collect GradIn Tensors, Attrs and Recovered TensorWrappers - {} + // Apply Gradient Hooks + auto hooked_grads = ApplyGradientHooks(grads); + + // Collect GradIn Tensors, Attrs and Recovered TensorWrappers +{} - // Call grad_api function - VLOG(3) << \"Final State Running: \" << \"{}\"; - {} + // Call grad_api function + VLOG(3) << \"Final State Running: \" << \"{}\"; +{} - // Get Output - {} + // Get Output +{} - // Get GradIn autograd_meta - {} + // Get GradIn autograd_meta +{} - // Get GradOut autograd_meta - {} - - // Compute Require Grad - {} - - // Create Grad Node - {} + // Get GradOut autograd_meta +{} + + // Compute Require Grad +{} + + // Create Grad Node +{} - // Return - {} - }} + // Return +{} + +}} """ FORWARD_FUNCTION_TEMPLATE = \ """ - {} {}({}) {{ +{} {}({}) {{ // Dygraph Record Event {} // AMP Logic @@ -193,6 +194,16 @@ class {} : public egr::GradNodeBase {{ // Get Input AutoGradMeta {} + // Set Device Id + auto place = egr::Controller::Instance().GetExpectedPlace(); + if (paddle::platform::is_gpu_place(place)) {{ +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) + phi::backends::gpu::SetDeviceId(place.device); +#else + PADDLE_THROW(paddle::platform::errors::PreconditionNotMet( + "PaddlePaddle should compile with GPU if use CUDAPlace.")); +#endif + }} // Forward API Call {} // Get Outputs @@ -208,33 +219,33 @@ class {} : public egr::GradNodeBase {{ // Node Creation {} - // Returns - return {}; - }} + // Returns + return {}; +}} """ FORWARD_BODY_TEMPLATE = \ """ - if(require_any_grad) {{ + if(require_any_grad) {{ {} - egr::EagerUtils::PassStopGradient({}); - - // Node Construction + egr::EagerUtils::PassStopGradient({}); + + // Node Construction {} - // SetAttributes + // SetAttributes {} - // SetTensorWrappers + // SetTensorWrappers {} - // SetGradOutMeta & SetEdges + // SetGradOutMeta & SetEdges {} {} - // SetOutRank & SetHistory & SetGradInMeta & RetainGrad + // SetOutRank & SetHistory & SetGradInMeta & RetainGrad {} {} {} {} - }} + }} """ NAMESPACE_WRAPPER_TEMPLATE = \ @@ -283,6 +294,7 @@ class {} : public egr::GradNodeBase {{ #include "paddle/fluid/platform/profiler/event_tracing.h" #include "paddle/fluid/eager/amp_utils.h" #include "paddle/fluid/eager/eager_amp_auto_cast.h" +#include "paddle/phi/backends/gpu/gpu_info.h" {} {} @@ -318,9 +330,9 @@ class {} : public egr::GradNodeBase {{ CORE_OPS_DECLARATION_TEMPLATE = \ """ - extern std::unordered_map> core_ops_final_state_args_info; - extern std::unordered_map> core_ops_final_state_args_type_info; - extern std::unordered_map> core_ops_final_state_returns_info; +extern std::unordered_map> core_ops_final_state_args_info; +extern std::unordered_map> core_ops_final_state_args_type_info; +extern std::unordered_map> core_ops_final_state_returns_info; """ @@ -352,6 +364,18 @@ class {} : public egr::GradNodeBase {{ }} """ +CREATE_PLAIN_OPTIONAL_TENSOR_TEMPLATE = \ +""" + paddle::optional {}_optional = paddle::none; + if({}.initialized()) {}_optional = paddle::make_optional({}); +""" + +CREATE_RECOVER_OPTIONAL_TENSOR_TEMPLATE = \ +""" + paddle::optional {}_optional = paddle::none; + if( 
{}.impl() ) {}_optional = paddle::make_optional({}); +""" + ####################### ## Generator Helpers ## @@ -469,10 +493,8 @@ def ForwardsValidationCheck(self): orig_forward_returns_list = self.orig_forward_returns_list for i in range(len(forward_inputs_list)): - forward_input_name = forward_inputs_list[i][0] forward_input_type = forward_inputs_list[i][1] forward_input_pos = forward_inputs_list[i][2] - orig_input_name = orig_forward_inputs_list[i][0] orig_input_type = orig_forward_inputs_list[i][1] orig_input_pos = orig_forward_inputs_list[i][2] @@ -482,11 +504,9 @@ def ForwardsValidationCheck(self): forward_input_pos, orig_input_pos) for i in range(len(forward_attrs_list)): - orig_attr_name = orig_forward_attrs_list[i][0] orig_attr_type = orig_forward_attrs_list[i][1] orig_attr_default = orig_forward_attrs_list[i][2] orig_attr_pos = orig_forward_attrs_list[i][3] - forward_attr_name = forward_attrs_list[i][0] forward_attr_type = forward_attrs_list[i][1] forward_attr_default = forward_attrs_list[i][2] forward_attr_pos = forward_attrs_list[i][3] @@ -678,12 +698,15 @@ def GenerateNodeCreationCodes(self): num_backward_inputs = len(forward_outputs_position_map.keys()) num_backward_outputs = len(forward_inputs_position_map.keys()) grad_node_name = GetGradNodeName(forward_api_name) + + # Helper + indent = GetIndent(2) # NOTE(Aurelius74): DO NOT use make_shared here. Because some Node contains experimental::Scalar # which contains "complex128" as data. "complex128" is memory-aligned manually. But make_shared # request MEMALIGN for allocation (Maybe). # See https://stackoverflow.com/questions/31228656/how-can-shared-ptr-disrupt-alignment # and https://github.com/MRtrix3/mrtrix3/issues/957 - node_construction_str = f" auto grad_node = std::shared_ptr<{grad_node_name}>(new {grad_node_name}({num_backward_inputs}, {num_backward_outputs}));" + node_construction_str = f"{indent}auto grad_node = std::shared_ptr<{grad_node_name}>(new {grad_node_name}({num_backward_inputs}, {num_backward_outputs}));" # SetAttributes set_attributes_list = [] @@ -693,9 +716,9 @@ def GenerateNodeCreationCodes(self): for name, _, default_val_attr, _ in backward_attrs_list: if name in forward_attrs_name_set: - set_attributes = f" grad_node->SetAttribute{name}({name});" + set_attributes = f"{indent}grad_node->SetAttribute{name}({name});" else: - set_attributes = f" grad_node->SetAttribute{name}({default_val_attr});" + set_attributes = f"{indent}grad_node->SetAttribute{name}({default_val_attr});" set_attributes_list.append(set_attributes) set_attributes_str = "\n".join(set_attributes_list) @@ -707,10 +730,11 @@ def GenerateNodeCreationCodes(self): is_optional = (name in optional_inputs) if is_fwd_input: + need_input_data = "false" if name in self.no_need_buffers else "true" if is_optional: - set_tensor_wrappers = f" if({name}.get_ptr() != nullptr) grad_node->SetTensorWrapper{name}(*({name}.get_ptr()), true);" + set_tensor_wrappers = f"{indent}if({name}.get_ptr() != nullptr) grad_node->SetTensorWrapper{name}(*({name}.get_ptr()), true);" else: - set_tensor_wrappers = f" grad_node->SetTensorWrapper{name}({name}, true);" + set_tensor_wrappers = f"{indent}grad_node->SetTensorWrapper{name}({name}, {need_input_data});" else: if num_fwd_outputs > 1: # Aligned with forward output position @@ -719,9 +743,9 @@ def GenerateNodeCreationCodes(self): fwd_output_pos = forward_outputs_position_map[name][1] if is_optional: - set_tensor_wrappers = f" if({name}.get_ptr() != nullptr) grad_node->SetTensorWrapper{name}(*({name}.get_ptr()), false);" + 
set_tensor_wrappers = f"{indent}if({name}.get_ptr() != nullptr) grad_node->SetTensorWrapper{name}(*({name}.get_ptr()), false);" else: - set_tensor_wrappers = f" grad_node->SetTensorWrapper{name}({name}, false);" + set_tensor_wrappers = f"{indent}grad_node->SetTensorWrapper{name}({name}, false);" set_tensor_wrappers_list.append(set_tensor_wrappers) set_tensor_wrappers_str = "\n".join(set_tensor_wrappers_list) @@ -732,11 +756,11 @@ def GenerateNodeCreationCodes(self): input_autograd_meta_name = GetAutoGradMetaName(name) is_optional = (name in self.optional_inputs) if is_optional: - set_grad_out_meta = f" if({name}.get_ptr() != nullptr) grad_node->SetGradOutMeta(*({name}.get_ptr()), {pos});" - set_edges = f" if({name}.get_ptr() != nullptr) grad_node->AddEdges({input_autograd_meta_name}, {pos});" + set_grad_out_meta = f"{indent}if({name}.get_ptr() != nullptr) grad_node->SetGradOutMeta(*({name}.get_ptr()), {pos});" + set_edges = f"{indent}if({name}.get_ptr() != nullptr) grad_node->AddEdges({input_autograd_meta_name}, {pos});" else: - set_grad_out_meta = f" grad_node->SetGradOutMeta({name}, {pos});" - set_edges = f" grad_node->AddEdges({input_autograd_meta_name}, {pos});" + set_grad_out_meta = f"{indent}grad_node->SetGradOutMeta({name}, {pos});" + set_edges = f"{indent}grad_node->AddEdges({input_autograd_meta_name}, {pos});" set_grad_out_meta_list.append(set_grad_out_meta) set_edges_list.append(set_edges) @@ -751,11 +775,11 @@ def GenerateNodeCreationCodes(self): num_outputs = len(forward_outputs_position_map.keys()) for name, (_, pos) in forward_outputs_position_map.items(): output_autograd_meta_name = GetAutoGradMetaName(name) - set_out_rank = f" egr::EagerUtils::SetOutRankWithSlot({output_autograd_meta_name}, {pos});" - set_history = f" egr::EagerUtils::SetHistory({output_autograd_meta_name}, grad_node);" + set_out_rank = f"{indent}egr::EagerUtils::SetOutRankWithSlot({output_autograd_meta_name}, {pos});" + set_history = f"{indent}egr::EagerUtils::SetHistory({output_autograd_meta_name}, grad_node);" - set_retain_grad = f" egr::EagerUtils::CheckAndRetainGrad({name});" - set_grad_in_meta = f" grad_node->SetGradInMeta({name}, {pos});" + set_retain_grad = f"{indent}egr::EagerUtils::CheckAndRetainGrad({name});" + set_grad_in_meta = f"{indent}grad_node->SetGradInMeta({name}, {pos});" set_out_rank_list.append(set_out_rank) set_history_list.append(set_history) set_grad_in_meta_list.append(set_grad_in_meta) @@ -767,7 +791,7 @@ def GenerateNodeCreationCodes(self): set_retain_grad_str = "\n".join(set_retain_grad_list) node_event_name = forward_api_name + " node_creation" - node_creation_event_str = f"paddle::platform::RecordEvent node_creation_record_event(\"{node_event_name}\", paddle::platform::TracerEventType::Operator, 1);\n" + node_creation_event_str = f"{indent}paddle::platform::RecordEvent node_creation_record_event(\"{node_event_name}\", paddle::platform::TracerEventType::Operator, 1);\n" self.node_creation_str = FORWARD_BODY_TEMPLATE.format( node_creation_event_str, pass_stop_gradient_args_str, @@ -845,6 +869,7 @@ def GenerateForwardDefinition(self, is_inplaced): optional_inputs = self.optional_inputs intermediate_outputs = self.intermediate_outputs inplace_map = self.inplace_map if is_inplaced else {} + indent = GetIndent(1) # Get Function Args num_inputs = len(forward_attrs_list) + len( @@ -918,7 +943,7 @@ def GenerateForwardDefinition(self, is_inplaced): else: function_name = GetIntermediateAPIFunctionName(function_name) - forward_call_str = f"auto api_result = 
paddle::experimental::{namespace}{function_name}({inputs_call_args_str});" + forward_call_str = f"{indent}auto api_result = paddle::experimental::{namespace}{function_name}({inputs_call_args_str});" num_outputs = len(forward_outputs_position_map.keys()) - len( intermediate_outputs) @@ -926,9 +951,9 @@ def GenerateForwardDefinition(self, is_inplaced): get_outputs_str = "" for name, (rtype, pos) in forward_outputs_position_map.items(): if num_outputs == 1 and len(intermediate_outputs) == 0: - get_outputs_str += f"auto& {name} = api_result;\n" + get_outputs_str += f"{indent}auto& {name} = api_result;\n" else: - get_outputs_str += f"auto& {name} = std::get<{pos}>(api_result);\n" + get_outputs_str += f"{indent}auto& {name} = std::get<{pos}>(api_result);\n" # Get return type list & outputs returns_type_list = ["" for i in range(num_outputs)] @@ -961,12 +986,12 @@ def GenerateForwardDefinition(self, is_inplaced): for name, (ttype, pos) in forward_inputs_position_map.items(): input_autograd_meta_name = GetAutoGradMetaName(name) if IsPlainTensorType(ttype): - input_autograd_meta = f" egr::AutogradMeta* {input_autograd_meta_name} = egr::EagerUtils::nullable_autograd_meta({name});" + input_autograd_meta = f"{indent}egr::AutogradMeta* {input_autograd_meta_name} = egr::EagerUtils::nullable_autograd_meta({name});" else: assert IsVectorTensorType(ttype) input_autograd_meta_vec_name = GetAutoGradMetaVectorName(name) - input_autograd_meta = f" std::vector {input_autograd_meta_vec_name} = egr::EagerUtils::nullable_autograd_meta({name});\n" - input_autograd_meta += f" std::vector* {input_autograd_meta_name} = &{input_autograd_meta_vec_name};" + input_autograd_meta = f"{indent}std::vector {input_autograd_meta_vec_name} = egr::EagerUtils::nullable_autograd_meta({name});\n" + input_autograd_meta += f"{indent}std::vector* {input_autograd_meta_name} = &{input_autograd_meta_vec_name};" inputs_autograd_meta_list.append(input_autograd_meta) compute_require_grad_args_list.append(input_autograd_meta_name) @@ -981,19 +1006,19 @@ def GenerateForwardDefinition(self, is_inplaced): output_autograd_meta_vec_name = GetAutoGradMetaVectorName(name) if num_fwd_outputs == 1: if IsPlainTensorType(rtype): - output_autograd_meta = f" egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{name});" + output_autograd_meta = f"{indent}egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{name});" else: assert IsVectorTensorType(rtype) - output_autograd_meta = f" std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{name});\n" - output_autograd_meta += f" std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" + output_autograd_meta = f"{indent}std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{name});\n" + output_autograd_meta += f"{indent}std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" else: # Tuple api_result if IsPlainTensorType(rtype): - output_autograd_meta = f" egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{name});" + output_autograd_meta = f"{indent}egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{name});" else: assert IsVectorTensorType(rtype) - output_autograd_meta = f" std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{name});\n" - output_autograd_meta += f" std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" + output_autograd_meta = 
f"{indent}std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{name});\n" + output_autograd_meta += f"{indent}std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" outputs_autograd_meta_list.append(output_autograd_meta) outputs_autograd_meta_str = "\n".join(outputs_autograd_meta_list) @@ -1012,7 +1037,7 @@ def GenerateForwardDefinition(self, is_inplaced): self.GenerateNodeCreationCodes() node_creation_str = self.node_creation_str - dygraph_event_str = f"paddle::platform::RecordEvent dygraph_entrance_record_event(\"{forward_api_name} dygraph\", paddle::platform::TracerEventType::Operator, 1);" + dygraph_event_str = f"{indent}paddle::platform::RecordEvent dygraph_entrance_record_event(\"{forward_api_name} dygraph\", paddle::platform::TracerEventType::Operator, 1);" forward_function_name = GetDygraphForwardFunctionName(forward_api_name) # Forward amp logic @@ -1114,11 +1139,66 @@ def __init__(self, DygraphFunctionGeneratorBase.__init__(self, forward_api_contents, grad_api_contents, namespace) + # Record name mapping from forward_api_name to grad_api_names + self.to_next_grad_name_mapping = {} # {name : name} + # Generated Results self.node_declaration_str = "" self.node_definition_str = "" self.next_grad_api_contents = next_grad_api_contents + def TransformToNextGradName(self, string): + name_mapping = self.to_next_grad_name_mapping + if string in name_mapping.keys(): + return name_mapping[string] + return string + + def ResetOptionalInputs(self): + namespace = self.namespace + grad_api_contents = self.grad_api_contents + + base_generator = FunctionGeneratorBase(grad_api_contents, namespace) + base_generator.ParseDispensable() + + self.optional_inputs = base_generator.optional_inputs + + def RecordGrad2NextGradNameMapping(self, next_node_generator): + next_orig_inputs_list = next_node_generator.orig_forward_inputs_list + next_orig_returns_list = next_node_generator.orig_forward_returns_list + + next_forward_inputs_list = next_node_generator.forward_inputs_list + next_forward_returns_list = next_node_generator.forward_returns_list + for i in range(len(next_orig_inputs_list)): + grad_name = next_orig_inputs_list[i][0] + next_forward_name = next_forward_inputs_list[i][0] + self.to_next_grad_name_mapping[grad_name] = next_forward_name + + for i in range(len(next_orig_returns_list)): + grad_ret_name = next_orig_returns_list[i][0] + next_ret_name = next_forward_returns_list[i][0] + self.to_next_grad_name_mapping[grad_ret_name] = next_ret_name + + def GenerateHigherOrderNodeCreationCode(self): + namespace = self.namespace + grad_api_contents = self.grad_api_contents + next_grad_api_contents = self.next_grad_api_contents + + grad_node_creation_str = "" + if next_grad_api_contents: + forward_api_contents = grad_api_contents + forward_api_contents['api'] = forward_api_contents['backward_api'] + backward_api_contents = next_grad_api_contents + + next_node_generator = DygraphFunctionGeneratorBase( + forward_api_contents, backward_api_contents, namespace) + next_node_generator.run() + next_node_generator.GenerateNodeCreationCodes() + grad_node_creation_str = next_node_generator.node_creation_str + + self.RecordGrad2NextGradNameMapping(next_node_generator) + + return grad_node_creation_str + def GenerateNodeDeclaration(self): forward_op_name = self.forward_api_name backward_forward_inputs_map = self.backward_forward_inputs_map @@ -1173,9 +1253,10 @@ def GenerateNodeDeclaration(self): grad_node_name = GetGradNodeName(forward_op_name) 
self.node_declaration_str = NODE_DECLARATION_TEMPLATE.format( grad_node_name, grad_node_name, grad_node_name, grad_node_name, - grad_node_name, clear_tensor_wrapper_str, - set_tensor_wrapper_methods_str, set_attribute_methods_str, - tensor_wrapper_members_str, attribute_members_str) + grad_node_name, clear_tensor_wrapper_str, grad_node_name, + grad_node_name, set_tensor_wrapper_methods_str, + set_attribute_methods_str, tensor_wrapper_members_str, + attribute_members_str) logging.info(f"Generated Node Declaration: {self.node_declaration_str}") @@ -1187,6 +1268,7 @@ def GenerateNodeDefinition(self, grad_node_creation_str): backward_grad_inputs_map = self.backward_grad_inputs_map backward_grad_outputs_map = self.backward_grad_outputs_map backward_attrs_list = self.backward_attrs_list + indent = GetIndent(1) # Construct grad_api function args # Order: TensorWrappers, GradTensors, Attributes @@ -1197,42 +1279,58 @@ def GenerateNodeDefinition(self, grad_node_creation_str): # Fill Grad Ins with Zero fill_zero_str = "" - if forward_api_name in ops_to_fill_zero_for_empty_grads: - fill_zero_str = "egr::EagerUtils::FillZeroForEmptyGradInputs(&grads, this->InputMeta());\n" + if backward_api_name in ops_to_fill_zero_for_empty_grads: + fill_zero_str = f"{indent}egr::EagerUtils::FillZeroForEmptyGradInputs(&grads, this->InputMeta());\n" # Grad Ins from TensorWrappers for name, (_, is_fwd_input, grad_api_position), in backward_forward_inputs_map.items(): tensor_wrapper_name = GetSavedName(name) - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) is_optional = (name in self.optional_inputs) + tensor_wrapper_recover_str = f"{indent}auto {transformed_tensor_name} = egr::EagerUtils::RecoverTensorWrapper(&this->{tensor_wrapper_name}, this->shared_from_this());" if is_optional: - tensor_wrapper_recover_str = f"auto {transformed_tensor_name} = egr::EagerUtils::RecoverOptionalTensorWrapper(&this->{tensor_wrapper_name}, nullptr);" + tensor_wrapper_recover_str += "\n" + CREATE_RECOVER_OPTIONAL_TENSOR_TEMPLATE.format( + transformed_tensor_name, transformed_tensor_name, + transformed_tensor_name, transformed_tensor_name) + + grad_api_args[ + grad_api_position] = transformed_tensor_name + "_optional" + else: - tensor_wrapper_recover_str = f"auto {transformed_tensor_name} = egr::EagerUtils::RecoverTensorWrapper(&this->{tensor_wrapper_name}, nullptr);" - grad_api_args[grad_api_position] = transformed_tensor_name + grad_api_args[grad_api_position] = transformed_tensor_name + get_grad_in_args_list.append(tensor_wrapper_recover_str) # Grad Ins from grads for name, (ttype, fwd_position, grad_api_position) in backward_grad_inputs_map.items(): - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) + is_optional = (name in self.optional_inputs) if IsPlainTensorType(ttype): - get_tensor_str = f"auto& {transformed_tensor_name} = hooked_grads[{fwd_position}][0];" + get_tensor_str = f"{indent}auto& {transformed_tensor_name} = hooked_grads[{fwd_position}][0];" + + if is_optional: + get_tensor_str += "\n" + CREATE_PLAIN_OPTIONAL_TENSOR_TEMPLATE.format( + transformed_tensor_name, transformed_tensor_name, + transformed_tensor_name, transformed_tensor_name) + grad_api_args[ + grad_api_position] = f"{transformed_tensor_name}_optional" + else: + grad_api_args[grad_api_position] = transformed_tensor_name else: assert IsVectorTensorType(ttype) - get_tensor_str = 
f"auto& {transformed_tensor_name} = hooked_grads[{fwd_position}];" - grad_api_args[grad_api_position] = transformed_tensor_name + get_tensor_str = f"{indent}auto& {transformed_tensor_name} = hooked_grads[{fwd_position}];" + grad_api_args[grad_api_position] = transformed_tensor_name + get_grad_in_args_list.append(get_tensor_str) # Grad Attrs for name, _, _, grad_api_position in backward_attrs_list: saved_attribute_name = GetSavedName(name) - get_attr_str = f"auto& {name} = this->{saved_attribute_name};" + get_attr_str = f"{indent}auto& {name} = this->{saved_attribute_name};" grad_api_args[grad_api_position] = name get_grad_in_args_list.append(get_attr_str) @@ -1242,20 +1340,23 @@ def GenerateNodeDefinition(self, grad_node_creation_str): # Grad Function Call String grad_api_namespace = f"paddle::experimental::{namespace}" - grad_function_call_str = f"auto grad_api_result = {grad_api_namespace}{backward_api_name}({grad_api_args_str});" + grad_function_call_str = f"{indent}auto grad_api_result = {grad_api_namespace}{backward_api_name}({grad_api_args_str});" # Get Grad Outputs get_outputs_str = "" num_outputs = len(backward_grad_outputs_map.keys()) for name, (ttype, fwd_position, grad_api_position) in backward_grad_outputs_map.items(): - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) if num_outputs == 1: - get_tensor_str = f"auto& {transformed_tensor_name} = grad_api_result;" + get_tensor_str = f"{indent}auto& {transformed_tensor_name} = grad_api_result;" else: - get_tensor_str = f"auto& {transformed_tensor_name} = grad_api_result[{grad_api_position}];" + if IsPlainTensorType(ttype): + get_tensor_str = f"{indent}auto& {transformed_tensor_name} = grad_api_result[{grad_api_position}][0];" + else: + assert IsVectorTensorType(ttype) + get_tensor_str = f"{indent}auto& {transformed_tensor_name} = grad_api_result[{grad_api_position}];" get_outputs_str += get_tensor_str + "\n" # Prepare for Node Creation if Necessary @@ -1268,38 +1369,36 @@ def GenerateNodeDefinition(self, grad_node_creation_str): compute_require_grad_args_list = ["trace_backward"] for name, (ttype, pos, grad_api_position) in backward_grad_inputs_map.items(): - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) input_autograd_meta_name = GetAutoGradMetaName( transformed_tensor_name) if IsPlainTensorType(ttype): - input_autograd_meta = f" egr::AutogradMeta* {input_autograd_meta_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});" + input_autograd_meta = f"{indent}egr::AutogradMeta* {input_autograd_meta_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});" else: assert IsVectorTensorType(ttype) input_autograd_meta_vec_name = GetAutoGradMetaVectorName( transformed_tensor_name) - input_autograd_meta = f" std::vector {input_autograd_meta_vec_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});\n" - input_autograd_meta += f" std::vector* {input_autograd_meta_name} = &{input_autograd_meta_vec_name};" + input_autograd_meta = f"{indent}std::vector {input_autograd_meta_vec_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});\n" + input_autograd_meta += f"{indent}std::vector* {input_autograd_meta_name} = &{input_autograd_meta_vec_name};" inputs_autograd_meta_list.append(input_autograd_meta) compute_require_grad_args_list.append(input_autograd_meta_name) # 2. 
Get TensorWrapper AutoGradMeta for name, (ttype, _, pos), in backward_forward_inputs_map.items(): - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) input_autograd_meta_name = GetAutoGradMetaName( transformed_tensor_name) if IsPlainTensorType(ttype): - input_autograd_meta = f" egr::AutogradMeta* {input_autograd_meta_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});" + input_autograd_meta = f"{indent}egr::AutogradMeta* {input_autograd_meta_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});" else: assert IsVectorTensorType(ttype) input_autograd_meta_vec_name = GetAutoGradMetaVectorName( transformed_tensor_name) - input_autograd_meta = f" std::vector {input_autograd_meta_vec_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});\n" - input_autograd_meta += f" std::vector* {input_autograd_meta_name} = &{input_autograd_meta_vec_name};" + input_autograd_meta = f"{indent}std::vector {input_autograd_meta_vec_name} = egr::EagerUtils::nullable_autograd_meta({transformed_tensor_name});\n" + input_autograd_meta += f"{indent}std::vector* {input_autograd_meta_name} = &{input_autograd_meta_vec_name};" inputs_autograd_meta_list.append(input_autograd_meta) compute_require_grad_args_list.append(input_autograd_meta_name) @@ -1311,8 +1410,7 @@ def GenerateNodeDefinition(self, grad_node_creation_str): outputs_autograd_meta_list = [] num_fwd_outputs = len(backward_grad_outputs_map.keys()) for name, (rtype, pos, _) in backward_grad_outputs_map.items(): - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) output_autograd_meta_name = GetAutoGradMetaName( transformed_tensor_name) @@ -1320,51 +1418,58 @@ def GenerateNodeDefinition(self, grad_node_creation_str): transformed_tensor_name) if num_fwd_outputs == 1: if IsPlainTensorType(rtype): - output_autograd_meta = f" egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});" + output_autograd_meta = f"{indent}egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});" else: assert IsVectorTensorType(rtype) - output_autograd_meta = f" std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});\n" - output_autograd_meta += f" std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" + output_autograd_meta = f"{indent}std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});\n" + output_autograd_meta += f"{indent}std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" else: # Tuple api_result if IsPlainTensorType(rtype): - output_autograd_meta = f" egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});" + output_autograd_meta = f"{indent}egr::AutogradMeta* {output_autograd_meta_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});" else: assert IsVectorTensorType(rtype) - output_autograd_meta = f" std::vector {output_autograd_meta_vec_name} = egr::EagerUtils::autograd_meta(&{transformed_tensor_name});\n" - output_autograd_meta += f" std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" + output_autograd_meta = f"{indent}std::vector {output_autograd_meta_vec_name} = 
egr::EagerUtils::autograd_meta(&{transformed_tensor_name});\n" + output_autograd_meta += f"{indent}std::vector* {output_autograd_meta_name} = &{output_autograd_meta_vec_name};" outputs_autograd_meta_list.append(output_autograd_meta) outputs_autograd_meta_str = "\n".join(outputs_autograd_meta_list) - compute_require_grad_str = "bool trace_backward = egr::Controller::Instance().HasGrad() && create_graph;\n" - compute_require_grad_str += f"bool require_any_grad = egr::EagerUtils::ComputeRequireGrad({compute_require_grad_args_str});" + compute_require_grad_str = f"{indent}bool trace_backward = egr::Controller::Instance().HasGrad() && create_graph;\n" + compute_require_grad_str += f"{indent}bool require_any_grad = egr::EagerUtils::ComputeRequireGrad({compute_require_grad_args_str});" # Construct grad_api returns num_bwd_outputs = len(backward_grad_outputs_map.keys()) slot_num_bwd_outputs = len(self.forward_inputs_position_map.keys()) - returns_str = f"std::vector> returns({slot_num_bwd_outputs});\n" + returns_str = f"{indent}std::vector> returns({slot_num_bwd_outputs});\n" for name, (ttype, fwd_position, grad_api_position) in backward_grad_outputs_map.items(): - transformed_tensor_name = TransformGradVarNameForDoubleGradGeneration( - name) + transformed_tensor_name = self.TransformToNextGradName(name) # Infer Grad API Return Type if num_bwd_outputs == 1: # Single tensor output, return as is if IsPlainTensorType(ttype): - returns_str += f"returns[0] = {{ {transformed_tensor_name} }};\n" + returns_str += f"{indent}returns[0] = {{ {transformed_tensor_name} }};\n" else: assert IsVectorTensorType(ttype) - returns_str += f"returns[0] = {transformed_tensor_name};\n" + returns_str += f"{indent}returns[0] = {transformed_tensor_name};\n" else: # Rearrange output order accordingly - returns_str += f"returns[{fwd_position}] = {transformed_tensor_name};\n" - returns_str += f"if(NeedComplexToRealConversion()) HandleComplexGradToRealGrad(&returns);\n" - returns_str += f"return returns;\n" + if IsPlainTensorType(ttype): + returns_str += f"{indent}returns[{fwd_position}] = {{ {transformed_tensor_name} }};\n" + else: + assert IsVectorTensorType(ttype) + returns_str += f"{indent}returns[{fwd_position}] = {transformed_tensor_name};\n" + + returns_str += f"{indent}if(NeedComplexToRealConversion()) HandleComplexGradToRealGrad(&returns);\n" + returns_str += f"{indent}return returns;\n" grad_node_name = GetGradNodeName(forward_api_name) + if len(grad_node_creation_str) == 0: + grad_node_creation_str = f"if(create_graph) VLOG(3) << \"Higher order grad node for {grad_node_name} has not been implemented yet.\";" + self.node_definition_str = GRAD_FUNCTION_TEMPLATE.format( grad_node_name, fill_zero_str, get_grad_in_args_str, grad_node_name, grad_function_call_str, get_outputs_str, inputs_autograd_meta_str, @@ -1376,24 +1481,15 @@ def GenerateNodeDefinition(self, grad_node_creation_str): def run(self): super().run() + self.ResetOptionalInputs() + ##################### ## Code Generation ## ##################### - self.GenerateNodeDeclaration() + # Higher-order GradNode generation + grad_node_creation_str = self.GenerateHigherOrderNodeCreationCode() - namespace = self.namespace - grad_node_creation_str = "" - next_grad_api_contents = self.next_grad_api_contents - if next_grad_api_contents: - forward_api_contents = self.grad_api_contents - forward_api_contents['api'] = forward_api_contents['backward_api'] - backward_api_contents = next_grad_api_contents - - next_node_generator = DygraphFunctionGeneratorBase( - 
forward_api_contents, backward_api_contents, namespace) - next_node_generator.run() - next_node_generator.GenerateNodeCreationCodes() - grad_node_creation_str = next_node_generator.node_creation_str + self.GenerateNodeDeclaration() self.GenerateNodeDefinition(grad_node_creation_str) diff --git a/paddle/fluid/eager/auto_code_generator/final_state_generator/python_c_gen.py b/paddle/fluid/eager/auto_code_generator/final_state_generator/python_c_gen.py index 463c50658cd32..8075b65b1945b 100644 --- a/paddle/fluid/eager/auto_code_generator/final_state_generator/python_c_gen.py +++ b/paddle/fluid/eager/auto_code_generator/final_state_generator/python_c_gen.py @@ -23,7 +23,7 @@ ########################### ## Global Configurations ## ########################### -skipped_forward_api_names = set(["scale"]) +skipped_forward_api_names = set([]) def SkipAPIGeneration(forward_api_name): diff --git a/paddle/fluid/eager/backward.cc b/paddle/fluid/eager/backward.cc index 0ce2f17cb45be..d5397e20e7d68 100644 --- a/paddle/fluid/eager/backward.cc +++ b/paddle/fluid/eager/backward.cc @@ -50,7 +50,16 @@ class GeneralGrad { for (size_t i = 0; i < num_inputs; i++) { AutogradMeta* auto_grad_meta = EagerUtils::unsafe_autograd_meta(inputs[i]); - auto target_node = auto_grad_meta->GetMutableGradNode().get(); + auto* target_node = auto_grad_meta->GetMutableGradNode().get(); + + if (orig_to_copied_node_mapping_.count(target_node)) { + target_node = orig_to_copied_node_mapping_[target_node]; + } else { + VLOG(6) << "Unable to find target node in " + "orig_to_copied_node_mapping_, likely indicating an " + "unused input"; + } + PADDLE_ENFORCE_NOT_NULL(target_node, paddle::platform::errors::Fatal( "There is no grad op for %s:[%d] or it's" @@ -249,7 +258,15 @@ class GeneralGrad { for (size_t i = 0; i < inputs.size(); ++i) { auto& input = inputs[i]; AutogradMeta* auto_grad_meta = EagerUtils::unsafe_autograd_meta(input); - auto target_node = auto_grad_meta->GetMutableGradNode().get(); + + auto* target_node = auto_grad_meta->GetMutableGradNode().get(); + if (orig_to_copied_node_mapping_.count(target_node)) { + target_node = orig_to_copied_node_mapping_[target_node]; + } else { + VLOG(6) << "Unable to find target node in " + "orig_to_copied_node_mapping_, likely indicating an unused " + "input"; + } auto iter = results_map.find(target_node); if (iter != results_map.end()) { @@ -326,6 +343,78 @@ class GeneralGrad { potential_stop_nodes.clear(); depending_nodes.clear(); results_map.clear(); + copied_grad_nodes_.clear(); + orig_to_copied_node_mapping_.clear(); + } + + GradNodeBase* CopyGradNode(const std::shared_ptr& orig_node) { + if (orig_to_copied_node_mapping_.count(orig_node.get())) { + return orig_to_copied_node_mapping_[orig_node.get()]; + } + std::shared_ptr copied_node = orig_node->Copy(); + + // Save node and update mapping + orig_to_copied_node_mapping_[orig_node.get()] = copied_node.get(); + copied_grad_nodes_.push_back(copied_node); + + return copied_node.get(); + } + + void ReconstructBackwardGraph( + const std::queue& orig_init_queue) { + std::queue queue = orig_init_queue; + std::unordered_set visited; + + // BFS and recursively copy the grad nodes + while (!queue.empty()) { + GradNodeBase* orig_node = queue.front(); + queue.pop(); + if (visited.count(orig_node)) { + continue; + } + visited.insert(orig_node); + + PADDLE_ENFORCE( + orig_to_copied_node_mapping_.count(orig_node), + paddle::platform::errors::Fatal( + "Cannot reconstruct backward graph," + "unable to find copied target for certain grad node.")); + 
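For context on the reconstruction step here: ReconstructBackwardGraph applies the standard clone-a-DAG technique, that is, BFS over the original grad nodes, copy each node exactly once into an original-to-copy map, and re-point every copied edge at the copied successor so the duplicated graph never references the user-visible one. A minimal self-contained sketch of the same technique, using a hypothetical Node type rather than Paddle's GradNodeBase:

#include <cstddef>
#include <memory>
#include <queue>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct Node {
  std::vector<std::shared_ptr<Node>> next;  // edges to successor nodes
};

// Clone the subgraph reachable from `root`, preserving sharing of nodes.
std::shared_ptr<Node> CloneGraph(const std::shared_ptr<Node>& root) {
  std::unordered_map<Node*, std::shared_ptr<Node>> copied;  // orig -> copy
  std::queue<Node*> queue;
  std::unordered_set<Node*> visited;

  copied[root.get()] = std::make_shared<Node>(*root);
  queue.push(root.get());
  while (!queue.empty()) {
    Node* orig = queue.front();
    queue.pop();
    if (!visited.insert(orig).second) continue;  // already processed

    std::shared_ptr<Node>& copy = copied[orig];
    for (std::size_t i = 0; i < orig->next.size(); ++i) {
      Node* orig_next = orig->next[i].get();
      if (!copied.count(orig_next)) {
        copied[orig_next] = std::make_shared<Node>(*orig->next[i]);
      }
      copy->next[i] = copied[orig_next];  // re-point the copied edge
      queue.push(orig_next);
    }
  }
  return copied[root.get()];
}

int main() {
  auto a = std::make_shared<Node>();
  auto b = std::make_shared<Node>();
  a->next = {b, b};  // two edges to the same successor
  auto a_copy = CloneGraph(a);
  // Sharing is preserved: both copied edges point at the same copied node.
  return a_copy->next[0] == a_copy->next[1] ? 0 : 1;
}

Copying up front means any later pruning or edge rewriting operates on the copies and leaves the original backward graph untouched.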
GradNodeBase* copied_node = orig_to_copied_node_mapping_[orig_node]; + + const std::vector>& orig_edges = orig_node->GetEdges(); + std::vector>& copied_edges = + copied_node->GetMutableEdges(); + for (size_t i = 0; i < orig_edges.size(); i++) { + for (size_t j = 0; j < orig_edges[i].size(); j++) { + const Edge& orig_edge = orig_edges[i][j]; + Edge& copied_edge = copied_edges[i][j]; + + std::shared_ptr orig_next_node = + orig_edge.GetMutableGradNode(); + if (!orig_next_node) continue; + + // Copy Next Node + std::shared_ptr copied_next_node; + if (orig_to_copied_node_mapping_.count(orig_next_node.get())) { + copied_next_node = + orig_to_copied_node_mapping_[orig_next_node.get()] + ->shared_from_this(); + + } else { + copied_next_node = orig_next_node->Copy(); + orig_to_copied_node_mapping_[orig_next_node.get()] = + copied_next_node.get(); + copied_grad_nodes_.push_back(copied_next_node); + } + + // Update Edge's Grad Node + copied_edge.SetGradNode(copied_next_node); + + // Update BFS queue + queue.push(orig_next_node.get()); + } + } + } } private: @@ -345,6 +434,10 @@ class GeneralGrad { std::unordered_set /* pre nodes */> depending_nodes; std::unordered_map results_map; + + std::vector> copied_grad_nodes_; + std::unordered_map orig_to_copied_node_mapping_; + DISABLE_COPY_AND_ASSIGN(GeneralGrad); }; @@ -444,6 +537,7 @@ std::vector RunBackward( // 1. Init queue with starting nodes // 2. Prepare initial input buffers std::queue queue; + std::queue orig_queue; std::unordered_map> node_input_buffers_dict; for (size_t i = 0; i < tensors.size(); i++) { @@ -466,7 +560,18 @@ std::vector RunBackward( continue; } + // TODO(zhanlve): Copy and Modify GradNode if is_general_grad GradNodeBase* grad_node = shared_grad_node.get(); + if (is_general_grad) { + // Save orig grad node + orig_queue.push(grad_node); + + // Replace grad_node with copied grad_node + grad_node = GeneralGrad::Instance().CopyGradNode(shared_grad_node); + + // Record potential startup grad node + GeneralGrad::Instance().GetPotentialStartupNodes()->insert(grad_node); + } // Prepare GradTensorHolder if (!node_input_buffers_dict.count(grad_node)) { @@ -475,8 +580,9 @@ std::vector RunBackward( node_input_buffers_dict[grad_node] = std::make_unique(grad_node->InputMeta()); } - - if (grad_tensors.size() > 0) { + bool copy_from_grad_t = + grad_tensors.size() > 0 && grad_tensors[i].initialized(); + if (copy_from_grad_t) { PADDLE_ENFORCE( grad_tensors.size() == tensors.size(), paddle::platform::errors::Fatal( @@ -486,17 +592,9 @@ std::vector RunBackward( // Feed given tensor if it's provided VLOG(6) << "Fill grad input tensor " << i << "with give grad tensor"; - if (grad_tensors[i].is_initialized()) { - // Deep copy - paddle::experimental::Tensor tmp_tensor; - tmp_tensor.copy_(grad_tensors[i], grad_tensors[i].inner_place(), false); - node_input_buffers_dict[grad_node]->add(input_info.first, - input_info.second, tmp_tensor); - } else { - node_input_buffers_dict[grad_node]->add( - input_info.first, input_info.second, grad_tensors[i]); - } - + // Deep copy + node_input_buffers_dict[grad_node]->CopyValueFromTensor( + input_info.first, input_info.second, grad_tensors[i]); } else { VLOG(6) << "Fill grad input tensor " << i << " with 1.0"; // Initialize tensor with 1.0 @@ -504,15 +602,17 @@ std::vector RunBackward( // dims // GradTensorHolder will initialize another tensor with same tensortype, // datatype and dims but filled with 1.0 - node_input_buffers_dict[grad_node]->add( + node_input_buffers_dict[grad_node]->CopyValueFromTensor( input_info.first, 
input_info.second, tensor, true /*fill_one=true*/); } // Prepare queue, potential startup_nodes queue.push(grad_node); - if (is_general_grad) { - GeneralGrad::Instance().GetPotentialStartupNodes()->emplace(grad_node); - } + } + + if (is_general_grad) { + // Copy Backward Graph + GeneralGrad::Instance().ReconstructBackwardGraph(orig_queue); } VLOG(6) << "Update In degree Map for backward"; @@ -686,6 +786,7 @@ std::vector RunBackward( } } } + if (!is_general_grad) return {}; return GeneralGrad::Instance().GetResults(inputs, allow_unused, create_graph); } diff --git a/paddle/fluid/eager/custom_operator/custom_operator_node.h b/paddle/fluid/eager/custom_operator/custom_operator_node.h index 33b56fc8c863a..c483dc0ebd177 100644 --- a/paddle/fluid/eager/custom_operator/custom_operator_node.h +++ b/paddle/fluid/eager/custom_operator/custom_operator_node.h @@ -36,9 +36,10 @@ class RunCustomOpNode : public GradNodeBase { } // Functor: perform backward computations - virtual std::vector> operator()( - std::vector>& grads, - bool create_graph = false) // NOLINT + virtual std::vector> + operator()( // NOLINT + std::vector>& grads, // NOLINT + bool create_graph = false) // NOLINT override; std::string name() { @@ -64,13 +65,15 @@ class RunCustomOpNode : public GradNodeBase { } void ClearTensorWrappers() override { VLOG(6) << "Do nothing here now"; } - bool IsTensorWrappersCleared() override { - VLOG(6) << "Do nothing here now"; - return false; - } void SetAttrs(const std::vector& attr) { attrs_ = attr; } + std::shared_ptr Copy() const override { + auto copied_node = + std::shared_ptr(new RunCustomOpNode(*this)); + return copied_node; + } + public: std::unordered_map> fwd_outs; std::unordered_map> fwd_ins; diff --git a/paddle/fluid/eager/grad_node_info.cc b/paddle/fluid/eager/grad_node_info.cc index 5f3dfe8e513ed..23c7ea7c5e9b4 100644 --- a/paddle/fluid/eager/grad_node_info.cc +++ b/paddle/fluid/eager/grad_node_info.cc @@ -19,6 +19,7 @@ #include "paddle/phi/common/data_type.h" #include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/sparse_coo_tensor.h" #include "paddle/fluid/framework/convert_utils.h" #include "paddle/fluid/framework/data_type.h" @@ -124,29 +125,32 @@ void GradNodeBase::SetGradInMeta(const paddle::experimental::Tensor& fwd_out, return; } + phi::DenseTensor* dense_tensor = nullptr; // Record TensorMeta if (phi::DenseTensor::classof(fwd_out.impl().get())) { // Only Copy Meta - phi::DenseTensor* dense_tensor = - static_cast(fwd_out.impl().get()); - - PADDLE_ENFORCE_NE( - dense_tensor->meta().dtype, phi::DataType::UNDEFINED, - paddle::platform::errors::Fatal( - "Attempting to copy DenseTensorMeta with phi::DataType::UNDEFINED," - "which is illegal.")); - - meta.SetTensorMeta(dense_tensor->meta()); - meta.SetPlace(fwd_out.inner_place()); - - if (paddle::framework::IsComplexType( - paddle::framework::TransToProtoVarType(dense_tensor->type()))) { - need_complex_to_real_ = true; - } + dense_tensor = static_cast(fwd_out.impl().get()); + } else if (phi::SparseCooTensor::classof(fwd_out.impl().get())) { + phi::SparseCooTensor* coo_tensor = + static_cast(fwd_out.impl().get()); + dense_tensor = coo_tensor->mutable_non_zero_elements(); } else { VLOG(6) << "Unable to initialize the DenseTensorMeta of GradSlotMeta with " "non-DenseTensor argument."; } + PADDLE_ENFORCE_NE( + dense_tensor->meta().dtype, phi::DataType::UNDEFINED, + paddle::platform::errors::Fatal( + "Attempting to copy DenseTensorMeta with phi::DataType::UNDEFINED," + "which is illegal.")); + + 
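The this->shared_from_this() call used earlier when recovering tensor wrappers relies on GradNodeBase now inheriting from std::enable_shared_from_this, as the grad_node_info.h hunk below shows. A quick standalone reminder of how that facility behaves, with a hypothetical Widget type:

#include <iostream>
#include <memory>

struct Widget : std::enable_shared_from_this<Widget> {
  std::shared_ptr<Widget> self() { return shared_from_this(); }
};

int main() {
  auto w = std::make_shared<Widget>();
  std::shared_ptr<Widget> alias = w->self();  // shares ownership with w
  std::cout << alias.use_count() << "\n";     // prints 2
  // Caveat: shared_from_this() is only valid on an object already owned by
  // a shared_ptr; since C++17 it throws std::bad_weak_ptr otherwise.
  return 0;
}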
meta.SetTensorMeta(dense_tensor->meta()); + meta.SetPlace(fwd_out.inner_place()); + + if (paddle::framework::IsComplexType( + paddle::framework::TransToProtoVarType(dense_tensor->type()))) { + need_complex_to_real_ = true; + } } void GradNodeBase::SetGradInMeta( @@ -322,6 +326,10 @@ const std::vector>& GradNodeBase::GetEdges() const { return adj_edges_; } +std::vector>& GradNodeBase::GetMutableEdges() { + return adj_edges_; +} + std::vector> GradNodeBase::ApplyGradientHooks( const std::vector>& tensors) { diff --git a/paddle/fluid/eager/grad_node_info.h b/paddle/fluid/eager/grad_node_info.h index ff4445f4261e3..6a70a16a2416f 100644 --- a/paddle/fluid/eager/grad_node_info.h +++ b/paddle/fluid/eager/grad_node_info.h @@ -14,6 +14,8 @@ #pragma once +#include + #include "paddle/fluid/eager/eager_tensor.h" #include "paddle/fluid/eager/hooks.h" #include "paddle/phi/api/all.h" @@ -85,7 +87,7 @@ class GradSlotMeta { std::shared_ptr meta_ = nullptr; }; -class GradNodeBase { +class GradNodeBase : public std::enable_shared_from_this { public: GradNodeBase() { VLOG(6) << "Construct GradNodeBase"; } GradNodeBase(size_t bwd_in_slot_num, size_t bwd_out_slot_num); @@ -111,7 +113,11 @@ class GradNodeBase { virtual void ClearTensorWrappers() = 0; - virtual bool IsTensorWrappersCleared() = 0; + /** + * Self-Copy interface designed for use in DoubleGrad + * **/ + virtual std::shared_ptr Copy() const = 0; + /** * AddEdges is designed to set input tensors' backward Node as current * node's Edges. @@ -189,6 +195,16 @@ class GradNodeBase { /** * GetEdges is designed to get all edges of current node**/ const std::vector>& GetEdges() const; + std::vector>& GetMutableEdges(); + + /** + * The following interfaces are designed for no_need_buffer + * **/ + bool IsTensorWrappersCleared() { return is_tensor_wrappers_cleared_; } + + void SetIsTensorWrappersCleared(bool is_tensor_wrappers_cleared) { + is_tensor_wrappers_cleared_ = is_tensor_wrappers_cleared; + } private: // TODO(zhanlve): Merge adj_edges_ into GradOutMeta @@ -216,6 +232,7 @@ class GradNodeBase { // We handle complex to real conversion only if any complex GradIn is involved bool need_complex_to_real_ = false; int64_t next_hook_id_{0}; + bool is_tensor_wrappers_cleared_ = false; }; class Edge { @@ -244,6 +261,11 @@ class Edge { return grad_node_; } + void SetGradNode(const std::shared_ptr& node) { + VLOG(6) << "Reseting Edge's Grad Node"; + grad_node_ = node; + } + std::pair GetEdgeRankInfo() const { return std::make_pair(in_slot_id_, in_rank_); } diff --git a/paddle/fluid/eager/grad_tensor_holder.cc b/paddle/fluid/eager/grad_tensor_holder.cc index 038ad09aa4d8b..2dacb588ff847 100644 --- a/paddle/fluid/eager/grad_tensor_holder.cc +++ b/paddle/fluid/eager/grad_tensor_holder.cc @@ -15,6 +15,7 @@ #include "paddle/fluid/eager/grad_tensor_holder.h" #include "paddle/fluid/imperative/gradient_accumulator.h" +#include "paddle/fluid/eager/api/generated/eager_generated/forwards/dygraph_functions.h" #include "paddle/fluid/framework/convert_utils.h" #include "paddle/fluid/framework/var_type.h" #include "paddle/phi/kernels/funcs/math_function.h" @@ -26,9 +27,9 @@ void GradTensorHolder::SetBufferSlotRankZeros(size_t slot_id, size_t rank) { paddle::experimental::zeros_like(buffer_[slot_id][rank]); } -void GradTensorHolder::add(size_t slot_id, size_t rank, - const paddle::experimental::Tensor& t, - bool fill_one) { +void GradTensorHolder::CopyValueFromTensor( + size_t slot_id, size_t rank, const paddle::experimental::Tensor& t, + bool fill_one) { // TODO(jiabin): We need to 
deal with empty input_buffer with slot size not // empty; PADDLE_ENFORCE(slot_id < buffer_.size(), @@ -50,50 +51,90 @@ void GradTensorHolder::add(size_t slot_id, size_t rank, slot_id, buffer_[slot_id].size(), rank)); if (!fill_one) { paddle::experimental::Tensor& buffer_tensor = buffer_[slot_id][rank]; - // TODO(jiabin): Code bellow is ugly to divide which inner var we used, - // remove framework::Variable - // related code later. - // This if statement is trying to test neither phi::Tensor nor - // framework::Variable is initialized. if ((!buffer_tensor.defined() || !buffer_tensor.initialized())) { - // Simply copy tensor->impl - buffer_tensor = t; + // Perform deep copy here + buffer_tensor.copy_(t, t.inner_place(), false); + buffer_tensor.set_autograd_meta(t.mutable_autograd_meta()); + } else { - // Accumulation - PADDLE_ENFORCE_EQ(t.initialized(), true, - paddle::platform::errors::Fatal( - "We can only accumulate initialized tensor, but we " - "got tensor: %s is empty please check you network " - "and make sure it creates grads.", - t.name())); - if (t.is_dense_tensor()) { - if (buffer_tensor.is_dense_tensor()) { - paddle::imperative::TensorAdd( - t, &buffer_tensor); - } else { - // TODO(jiabin): Support Other TensorBase later - paddle::experimental::Tensor new_buffer( - std::make_shared(), "tmp_accumulator"); - paddle::imperative::SelectedRowsAddTensor(buffer_tensor, t, - &new_buffer); - buffer_tensor.set_impl(new_buffer.impl()); - } - } else { - // TODO(jiabin): Support Other TensorBase later - if (buffer_tensor.is_dense_tensor()) { - paddle::imperative::SelectedRowsAddToTensor(t, &buffer_tensor); - } else { - buffer_tensor = - std::move(*paddle::imperative::SelectedRowsMerge< - paddle::experimental::Tensor>(t, buffer_tensor)); - } - } + PADDLE_THROW(paddle::platform::errors::Fatal( + "Cannot copy grad_tensors' value to grad tensor holders," + "input buffer has already been initialized.")); } } else { // Create new tensor->impl and fill it with 1.0 if (t.defined()) { - // Fill 1.0 - buffer_[slot_id][rank] = paddle::experimental::ones_like(t, t.dtype()); + // Fill 1.0, use full to support complex, one_like don't support it. + buffer_[slot_id][rank] = + paddle::experimental::full(t.shape(), 1, t.dtype(), t.inner_place()); + } + } +} + +void GradTensorHolder::add(size_t slot_id, size_t rank, + const paddle::experimental::Tensor& t) { + // TODO(jiabin): We need to deal with empty input_buffer with slot size not + // empty; + PADDLE_ENFORCE(slot_id < buffer_.size(), + paddle::platform::errors::Fatal( + "Invalid slot_id for GradTensorHolder::add() " + "which exceeds size of buffer")); + VLOG(6) << "Add Tensor for buffer_ slot: " << slot_id + << ", size: " << buffer_[slot_id].size(); + if (buffer_[slot_id].empty()) { + VLOG(6) << "Pass add Tensor for buffer_ slot: " << slot_id + << " since its buffer_ is empty "; + return; + } + PADDLE_ENFORCE( + rank < buffer_[slot_id].size(), + paddle::platform::errors::Fatal( + "Invalid rank for GradTensorHolder::add() which exceeds size " + "of buffer slot %d, got slot size is: %d rank is: %d", + slot_id, buffer_[slot_id].size(), rank)); + + paddle::experimental::Tensor& buffer_tensor = buffer_[slot_id][rank]; + // TODO(jiabin): Code bellow is ugly to divide which inner var we used, + // remove framework::Variable + // related code later. + // This if statement is trying to test neither phi::Tensor nor + // framework::Variable is initialized. 
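The GradTensorHolder rework in this hunk splits the old add(..., fill_one) entry point in two: CopyValueFromTensor seeds a slot from a caller-provided gradient (deep copy, or a fill of ones), while add is reserved for accumulating gradients produced during the backward walk. A rough sketch of that split on plain float buffers, with a hypothetical Slot type standing in for the real tensor buffers:

#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

struct Slot {
  std::optional<std::vector<float>> value;

  // Seed the slot from a user-provided gradient, or fill it with ones.
  void CopyValueFromTensor(const std::vector<float>& t, bool fill_one = false) {
    if (fill_one) {
      value = std::vector<float>(t.size(), 1.0f);
    } else if (!value.has_value()) {
      value = t;  // deep copy of the provided gradient
    } else {
      throw std::runtime_error("slot already initialized");
    }
  }

  // Accumulate a gradient produced during the backward pass.
  void add(const std::vector<float>& t) {
    if (!value.has_value()) {
      value = t;
    } else {
      for (std::size_t i = 0; i < value->size(); ++i) (*value)[i] += t[i];
    }
  }
};

int main() {
  Slot slot;
  slot.add({1.0f, 2.0f});
  slot.add({0.5f, 0.5f});  // slot.value is now {1.5, 2.5}
  return 0;
}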
+ if ((!buffer_tensor.defined() || !buffer_tensor.initialized())) { + // Simply copy tensor->impl + buffer_tensor = t; + } else { + // Accumulation + PADDLE_ENFORCE_EQ(t.initialized(), true, + paddle::platform::errors::Fatal( + "We can only accumulate initialized tensor, but we " + "got tensor: %s is empty please check you network " + "and make sure it creates grads.", + t.name())); + if (t.is_dense_tensor()) { + if (buffer_tensor.is_dense_tensor()) { + buffer_tensor = add_final_state_dygraph_function(t, buffer_tensor); + + } else { + // TODO(jiabin): Support Other TensorBase later + // TODO(zhanlve): Replace SelectedRowsAddTensor with + // add_dygraph_function once it's supported + paddle::experimental::Tensor new_buffer( + std::make_shared(), "tmp_accumulator"); + paddle::imperative::SelectedRowsAddTensor(buffer_tensor, t, + &new_buffer); + buffer_tensor.set_impl(new_buffer.impl()); + } + } else { + // TODO(jiabin): Support Other TensorBase later + // TODO(zhanlve): Replace SelectedRowsAddTensor with add_dygraph_function + // once it's supported + if (buffer_tensor.is_dense_tensor()) { + paddle::imperative::SelectedRowsAddToTensor(t, &buffer_tensor); + } else { + buffer_tensor = + std::move(*paddle::imperative::SelectedRowsMerge< + paddle::experimental::Tensor>(t, buffer_tensor)); + } } } } diff --git a/paddle/fluid/eager/grad_tensor_holder.h b/paddle/fluid/eager/grad_tensor_holder.h index db03789ea7632..a4f2507728c64 100644 --- a/paddle/fluid/eager/grad_tensor_holder.h +++ b/paddle/fluid/eager/grad_tensor_holder.h @@ -45,8 +45,10 @@ class GradTensorHolder { GradTensorHolder& operator=(const GradTensorHolder& other) = default; // Create new tensor and copy tensor->impl - void add(size_t slot_id, size_t rank, const paddle::experimental::Tensor& t, - bool fill_one = false); + void add(size_t slot_id, size_t rank, const paddle::experimental::Tensor& t); + void CopyValueFromTensor(size_t slot_id, size_t rank, + const paddle::experimental::Tensor& t, + bool fill_one = false); const std::vector& operator[]( const size_t& pos) { diff --git a/paddle/fluid/eager/pylayer/py_layer_node.h b/paddle/fluid/eager/pylayer/py_layer_node.h index cd0a517afbf0f..f2e50494467c7 100644 --- a/paddle/fluid/eager/pylayer/py_layer_node.h +++ b/paddle/fluid/eager/pylayer/py_layer_node.h @@ -40,11 +40,6 @@ class GradNodePyLayer : public GradNodeBase { void ClearTensorWrappers() override { VLOG(6) << "Do nothing here now"; } - bool IsTensorWrappersCleared() override { - VLOG(6) << "Do nothing here now"; - return false; - } - std::string name() { return "GradNodePyLayer_" + std::string(Py_TYPE(ctx_)->tp_name); } @@ -72,6 +67,12 @@ class GradNodePyLayer : public GradNodeBase { } } + std::shared_ptr Copy() const override { + auto copied_node = + std::shared_ptr(new GradNodePyLayer(*this)); + return copied_node; + } + private: PyObject* ctx_{nullptr}; PyObject* outputs_{nullptr}; diff --git a/paddle/fluid/eager/tensor_wrapper.h b/paddle/fluid/eager/tensor_wrapper.h index 8da27f3bb8a13..3d5d3139de14c 100644 --- a/paddle/fluid/eager/tensor_wrapper.h +++ b/paddle/fluid/eager/tensor_wrapper.h @@ -51,6 +51,7 @@ class TensorWrapper { * to avoid recursive depends on GradNodeBase * **/ full_reserved_ = full_reserved; + no_need_buffer_ = no_need_buffer; if (full_reserved_) { VLOG(6) << "Fully reserved tensor: " << tensor.name(); intermidiate_tensor_ = tensor; @@ -58,7 +59,6 @@ class TensorWrapper { } // shallow copy tensor_impl here - no_need_buffer_ = no_need_buffer; if (no_need_buffer) { if 
(phi::DenseTensor::classof(tensor.impl().get())) { // Only Copy Meta @@ -94,20 +94,20 @@ class TensorWrapper { return paddle::experimental::Tensor(); } + check_inplace_version(); + // if it's full_reserved just return the full copy of tensor - if (full_reserved_) { - check_inplace_version(); - return intermidiate_tensor_; - } else { + paddle::experimental::Tensor recovered_tensor = intermidiate_tensor_; + if (!full_reserved_) { std::shared_ptr new_grad_node = grad_node; auto p_ab_autograd_meta = std::make_shared(Edge(new_grad_node, out_rank_info_)); - intermidiate_tensor_.set_autograd_meta( + recovered_tensor.set_autograd_meta( std::static_pointer_cast( p_ab_autograd_meta)); - check_inplace_version(); - return intermidiate_tensor_; } + + return recovered_tensor; } void check_inplace_version() { @@ -122,10 +122,10 @@ class TensorWrapper { static_cast(intermidiate_tensor_.impl().get()); auto& inplace_version_counter = dense_tensor->InplaceVersionCounter(); - uint32_t current_inplace_version = - inplace_version_counter.CurrentVersion(); + uint32_t wrapper_version_snapshot = inplace_version_snapshot_; + uint32_t tensor_version = inplace_version_counter.CurrentVersion(); PADDLE_ENFORCE_EQ( - current_inplace_version, inplace_version_snapshot_, + tensor_version, wrapper_version_snapshot, paddle::platform::errors::PermissionDenied( "Tensor '%s' used in gradient computation has been " "modified by an inplace operation. " @@ -133,14 +133,14 @@ class TensorWrapper { "Please fix your code to void calling an inplace operator " "after using the Tensor which will used in gradient " "computation.", - intermidiate_tensor_.name(), current_inplace_version, - inplace_version_snapshot_)); - VLOG(6) << " The inplace_version_snapshot_ of Tensor '" - << intermidiate_tensor_.name() << "' is [ " - << inplace_version_snapshot_ << " ]"; - VLOG(6) << " The current_inplace_version of Tensor '" + intermidiate_tensor_.name(), tensor_version, + wrapper_version_snapshot)); + VLOG(6) << " The wrapper_version_snapshot of Tensor '" << intermidiate_tensor_.name() << "' is [ " - << current_inplace_version << " ]"; + << wrapper_version_snapshot << " ]"; + VLOG(6) << " The tensor_version of Tensor '" + << intermidiate_tensor_.name() << "' is [ " << tensor_version + << " ]"; } } diff --git a/paddle/fluid/eager/tests/CMakeLists.txt b/paddle/fluid/eager/tests/CMakeLists.txt index 2bfb9937c8c91..6bcd34262c8ab 100644 --- a/paddle/fluid/eager/tests/CMakeLists.txt +++ b/paddle/fluid/eager/tests/CMakeLists.txt @@ -1,6 +1,3 @@ add_subdirectory(data_structure_tests) add_subdirectory(task_tests) - -if(NOT ((NOT WITH_PYTHON) AND ON_INFER)) - add_subdirectory(performance_tests) -endif() +add_subdirectory(performance_tests) diff --git a/paddle/fluid/eager/tests/data_structure_tests/CMakeLists.txt b/paddle/fluid/eager/tests/data_structure_tests/CMakeLists.txt index e1cd9939aca77..76c59561fc0bb 100644 --- a/paddle/fluid/eager/tests/data_structure_tests/CMakeLists.txt +++ b/paddle/fluid/eager/tests/data_structure_tests/CMakeLists.txt @@ -1,6 +1,9 @@ cc_test(test_egr_ds_eager_tensor SRCS eager_tensor_test.cc DEPS ${eager_deps}) cc_test(test_egr_ds_auotgrad_meta SRCS autograd_meta_test.cc DEPS ${eager_deps}) cc_test(test_egr_ds_grad_node_info SRCS grad_node_info_test.cc DEPS ${eager_deps}) -cc_test(test_egr_ds_grad_tensor_holder SRCS grad_tensor_holder_test.cc DEPS ${eager_deps}) cc_test(test_egr_ds_accumulation_node SRCS accumulation_node_test.cc DEPS ${eager_deps}) cc_test(test_egr_ds_tensor_wrapper SRCS tensor_wrapper_test.cc DEPS ${eager_deps}) 
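The TensorWrapper hunk above renames the locals but keeps the underlying scheme: record an inplace-version snapshot when the tensor is captured for backward, then on recover() compare the tensor's current counter against that snapshot and fail loudly if they differ, since an in-place write after capture would silently corrupt gradients. A simplified self-contained sketch of that check, with toy Tensor and Wrapper types rather than the real phi::DenseTensor counter:

#include <cstdint>
#include <stdexcept>

struct Tensor {
  uint32_t inplace_version = 0;
  void MutateInplace() { ++inplace_version; }  // e.g. an in-place add
};

struct Wrapper {
  Tensor* captured;
  uint32_t version_snapshot;

  explicit Wrapper(Tensor* t)
      : captured(t), version_snapshot(t->inplace_version) {}

  Tensor* Recover() const {
    if (captured->inplace_version != version_snapshot) {
      throw std::runtime_error(
          "tensor needed for gradient computation was modified in place");
    }
    return captured;
  }
};

int main() {
  Tensor t;
  Wrapper w(&t);
  t.MutateInplace();  // simulate an in-place op after capture
  try {
    w.Recover();      // throws: current version 1 != snapshot 0
  } catch (const std::runtime_error&) {
    return 0;
  }
  return 1;
}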
+ +if(NOT ((NOT WITH_PYTHON) AND ON_INFER)) + cc_test(test_egr_ds_grad_tensor_holder SRCS grad_tensor_holder_test.cc DEPS ${eager_deps} ${generated_deps}) +endif() diff --git a/paddle/fluid/eager/tests/data_structure_tests/grad_node_test.h b/paddle/fluid/eager/tests/data_structure_tests/grad_node_test.h index dff12fdfc34a1..8500ec79ef9ba 100644 --- a/paddle/fluid/eager/tests/data_structure_tests/grad_node_test.h +++ b/paddle/fluid/eager/tests/data_structure_tests/grad_node_test.h @@ -32,7 +32,7 @@ class GradTestNode : public egr::GradNodeBase { GradTestNode() : GradNodeBase() { val_ = 1.0; } std::string name() override { return "GradTestNode"; } std::vector> operator()( - std::vector>& grads, + std::vector>& grads, // NOLINT bool create_graph = false) override { val_ = std::dynamic_pointer_cast(grads[0][0].impl()) ->data()[0]; @@ -50,10 +50,14 @@ class GradTestNode : public egr::GradNodeBase { return res; } void ClearTensorWrappers() override { VLOG(6) << "Do nothing here now"; } - bool IsTensorWrappersCleared() override { - VLOG(6) << "Do nothing here now"; - return false; + + std::shared_ptr Copy() const override { + { + auto copied_node = std::shared_ptr(new GradTestNode(*this)); + return copied_node; + } } + float val_; }; } // namespace eager_test diff --git a/paddle/fluid/eager/tests/data_structure_tests/grad_tensor_holder_test.cc b/paddle/fluid/eager/tests/data_structure_tests/grad_tensor_holder_test.cc index 645eac06ddda5..7d2aafc63628e 100644 --- a/paddle/fluid/eager/tests/data_structure_tests/grad_tensor_holder_test.cc +++ b/paddle/fluid/eager/tests/data_structure_tests/grad_tensor_holder_test.cc @@ -25,6 +25,7 @@ #include "paddle/phi/core/kernel_registry.h" PD_DECLARE_KERNEL(full_like, CPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(add, CPU, ALL_LAYOUT); // TODO(jiabin): remove nolint here!!! 
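GradTestNode's Copy() above follows the same virtual-clone idiom as the production nodes in this patch (RunCustomOpNode, GradNodePyLayer, GradNodeRunProgram): the base class declares a pure virtual Copy(), and each concrete node duplicates itself through its own copy constructor so the dynamic type is preserved. A minimal standalone illustration with hypothetical Base and Derived types:

#include <iostream>
#include <memory>

struct Base {
  virtual ~Base() = default;
  virtual std::shared_ptr<Base> Copy() const = 0;
};

struct Derived : Base {
  int payload = 42;
  std::shared_ptr<Base> Copy() const override {
    // Duplicate via the derived copy constructor so members and type survive.
    return std::shared_ptr<Derived>(new Derived(*this));
  }
};

int main() {
  std::shared_ptr<Base> node = std::make_shared<Derived>();
  std::shared_ptr<Base> copy = node->Copy();
  std::cout << (copy != node) << "\n";  // prints 1: a distinct object
  return 0;
}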
using namespace egr; // NOLINT @@ -77,11 +78,11 @@ TEST(GradTensorHolder, Interfaces) { // add(): // fill one - grad_tensor_holder.add(0, 0, et0, true); + grad_tensor_holder.CopyValueFromTensor(0, 0, et0, true); // accumulation - grad_tensor_holder.add(1, 0, et0, false); - grad_tensor_holder.add(1, 0, et1, false); + grad_tensor_holder.add(1, 0, et0); + grad_tensor_holder.add(1, 0, et1); // Buffers() const auto& buffers = grad_tensor_holder.Buffers(); @@ -141,8 +142,8 @@ TEST(GradTensorHolder, SelectedRowsMergeAdd) { GradTensorHolder({slot_meta, slot_meta}); // accumulation - grad_tensor_holder.add(0, 0, t1, false); - grad_tensor_holder.add(0, 0, t2, false); + grad_tensor_holder.add(0, 0, t1); + grad_tensor_holder.add(0, 0, t2); // Buffers() const auto& buffers = grad_tensor_holder.Buffers(); diff --git a/paddle/fluid/eager/tests/task_tests/CMakeLists.txt b/paddle/fluid/eager/tests/task_tests/CMakeLists.txt index 52dba6b9218c7..5a09ffd6a1e5f 100644 --- a/paddle/fluid/eager/tests/task_tests/CMakeLists.txt +++ b/paddle/fluid/eager/tests/task_tests/CMakeLists.txt @@ -1,13 +1,13 @@ cc_test(test_egr_task_tensor_utils SRCS tensor_utils_test.cc DEPS ${eager_deps}) cc_test(test_egr_task_eager_utils SRCS eager_utils_test.cc DEPS ${eager_deps}) cc_test(test_egr_task_forward_autograd SRCS forward_autograd_test.cc DEPS ${eager_deps} ${fluid_deps} eager_scale scale_node) -cc_test(test_egr_task_backward SRCS backward_test.cc DEPS ${eager_deps} ${fluid_deps} eager_scale scale_node) -cc_test(test_egr_task_hook SRCS hook_test.cc DEPS ${eager_deps} ${fluid_deps} eager_scale scale_node) -cc_test(test_egr_task_cross_batch SRCS cross_batch_accumulation_test.cc DEPS ${eager_deps} ${fluid_deps} eager_scale scale_node) -cc_test(test_egr_task_fwd_bwd_joint SRCS fwd_bwd_joint_test.cc DEPS ${eager_deps} ${fluid_deps} eager_scale scale_node) -cc_test(test_egr_task_grad SRCS grad_test.cc DEPS ${eager_deps} ${fluid_deps} eager_scale scale_node) if(NOT ((NOT WITH_PYTHON) AND ON_INFER)) + cc_test(test_egr_task_hook SRCS hook_test.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps} eager_scale scale_node) + cc_test(test_egr_task_backward SRCS backward_test.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps} eager_scale scale_node) + cc_test(test_egr_task_grad SRCS grad_test.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps} eager_scale scale_node) + cc_test(test_egr_task_fwd_bwd_joint SRCS fwd_bwd_joint_test.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps} eager_scale scale_node) + cc_test(test_egr_task_cross_batch SRCS cross_batch_accumulation_test.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps} eager_scale scale_node) cc_test(test_egr_task_hook_intermidiate SRCS hook_test_intermidiate.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps} dygraph_node) cc_test(test_egr_task_autocodegen SRCS generated_test.cc DEPS ${eager_deps} ${fluid_deps} ${generated_deps}) endif() diff --git a/paddle/fluid/eager/tests/task_tests/backward_test.cc b/paddle/fluid/eager/tests/task_tests/backward_test.cc index 87f8f6eca1f88..8c127efa4f7f3 100644 --- a/paddle/fluid/eager/tests/task_tests/backward_test.cc +++ b/paddle/fluid/eager/tests/task_tests/backward_test.cc @@ -34,6 +34,7 @@ PD_DECLARE_KERNEL(full, CPU, ALL_LAYOUT); PD_DECLARE_KERNEL(copy, CPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(add, CPU, ALL_LAYOUT); namespace egr { diff --git a/paddle/fluid/eager/tests/task_tests/fwd_bwd_joint_test.cc b/paddle/fluid/eager/tests/task_tests/fwd_bwd_joint_test.cc index 882695e98d109..d2bef100ca2b5 100644 --- 
a/paddle/fluid/eager/tests/task_tests/fwd_bwd_joint_test.cc +++ b/paddle/fluid/eager/tests/task_tests/fwd_bwd_joint_test.cc @@ -33,8 +33,10 @@ #include "paddle/phi/core/kernel_registry.h" PD_DECLARE_KERNEL(full, CPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(add, CPU, ALL_LAYOUT); #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) PD_DECLARE_KERNEL(full, GPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(add, GPU, ALL_LAYOUT); #endif namespace egr { diff --git a/paddle/fluid/eager/tests/task_tests/grad_test.cc b/paddle/fluid/eager/tests/task_tests/grad_test.cc index 6b03799c48659..7e64c65d8205e 100644 --- a/paddle/fluid/eager/tests/task_tests/grad_test.cc +++ b/paddle/fluid/eager/tests/task_tests/grad_test.cc @@ -33,6 +33,8 @@ PD_DECLARE_KERNEL(full, CPU, ALL_LAYOUT); PD_DECLARE_KERNEL(copy, CPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(add, CPU, ALL_LAYOUT); + namespace egr { TEST(Grad, SingleNodeEmptyGrad) { diff --git a/paddle/fluid/eager/to_static/run_program_op_node.h b/paddle/fluid/eager/to_static/run_program_op_node.h index 79703ce06dc9b..46f48778a9656 100644 --- a/paddle/fluid/eager/to_static/run_program_op_node.h +++ b/paddle/fluid/eager/to_static/run_program_op_node.h @@ -407,10 +407,6 @@ class GradNodeRunProgram : public egr::GradNodeBase { } void ClearTensorWrappers() override { VLOG(6) << "Do nothing here now"; } - bool IsTensorWrappersCleared() override { - VLOG(6) << "Do nothing here now"; - return false; - } // SetAttrMap void SetAttrMap(const paddle::framework::AttributeMap &attrs) { @@ -468,6 +464,12 @@ class GradNodeRunProgram : public egr::GradNodeBase { } } + std::shared_ptr Copy() const override { + auto copied_node = + std::shared_ptr(new GradNodeRunProgram(*this)); + return copied_node; + } + private: // TensorWrappers std::vector x_; diff --git a/paddle/fluid/eager/utils.cc b/paddle/fluid/eager/utils.cc index 5328033fc749b..bcf4a4627bb76 100644 --- a/paddle/fluid/eager/utils.cc +++ b/paddle/fluid/eager/utils.cc @@ -271,27 +271,6 @@ void EagerUtils::HandleViewBetweenInputAndOutput( } } -void EagerUtils::ModifyInplaceInput( - const std::shared_ptr& inplace_variable, - paddle::experimental::Tensor* inplace_tensor) { - // Only modify the meta information of the inplace tensor, because - // EagerVariable cannot modify Tensor's meta information after inplace - // op (such as ``reshape``) is executed. - PADDLE_ENFORCE_NOT_NULL(inplace_tensor, - paddle::platform::errors::Fatal( - "Inplace Tensor is null and cannot be modified. 
" - "We are tring to Modify Inplace Input from its " - "shared_ptr, this error may indicate the inplace " - " input is nullptr")); - if (phi::DenseTensor::classof(inplace_variable->GetTensorBase().get())) { - phi::DenseTensor* variable_dense_tensor = - static_cast(inplace_variable->GetTensorBase().get()); - phi::DenseTensor* tensor_dense_tensor = - static_cast(inplace_tensor->impl().get()); - tensor_dense_tensor->set_meta(variable_dense_tensor->meta()); - } -} - std::vector EagerUtils::GetOutputs( const std::vector>& outs) { std::vector res; @@ -385,22 +364,6 @@ paddle::experimental::Tensor EagerUtils::RecoverTensorWrapper( return tw->recover(grad_node); } -paddle::optional -EagerUtils::RecoverOptionalTensorWrapper( - TensorWrapper* tw, const std::shared_ptr& grad_node) { - PADDLE_ENFORCE_NOT_NULL( - tw, phi::errors::InvalidArgument("TensorWrapper in " - "RecoverOptionalTensorWrapper function " - "should not be null")); - auto tmp = tw->recover(grad_node); - - paddle::optional res{paddle::none}; - if (tmp.initialized()) { - res = tmp; - } - return res; -} - std::vector EagerUtils::RecoverTensorWrapper( std::vector* tw, const std::shared_ptr& grad_node) { diff --git a/paddle/fluid/eager/utils.h b/paddle/fluid/eager/utils.h index 4c3f5c88e4c93..be534d4440561 100644 --- a/paddle/fluid/eager/utils.h +++ b/paddle/fluid/eager/utils.h @@ -179,9 +179,6 @@ class EagerUtils { static std::vector RecoverTensorWrapper( std::vector* tw, const std::shared_ptr& grad_node); - static paddle::optional - RecoverOptionalTensorWrapper(TensorWrapper* tw, - const std::shared_ptr& grad_node); // Intermidate needed remove this once we don't need legacy // Inner Method @@ -203,9 +200,6 @@ class EagerUtils { static std::vector> CreateVars( const size_t num); // Construct Tensor From var - static void ModifyInplaceInput( - const std::shared_ptr& inplace_variable, - paddle::experimental::Tensor* inplace_tensor); static std::vector GetOutputs( const std::vector>& outs); static paddle::experimental::Tensor GetOutput( diff --git a/paddle/fluid/framework/CMakeLists.txt b/paddle/fluid/framework/CMakeLists.txt index 09ced6bd0d5ce..fb4c9937611e7 100755 --- a/paddle/fluid/framework/CMakeLists.txt +++ b/paddle/fluid/framework/CMakeLists.txt @@ -117,12 +117,14 @@ endif() cc_test(var_type_traits_test SRCS var_type_traits_test.cc DEPS var_type_traits) set(BRPC_DEPS "") -if(WITH_PSLIB OR WITH_PSCORE) - if(NOT WITH_HETERPS) - set(BRPC_DEPS brpc ssl crypto) - endif() +if(WITH_PSCORE) + set(BRPC_DEPS brpc ssl crypto) +endif() +if(WITH_PSLIB) if(WITH_PSLIB_BRPC) set(BRPC_DEPS pslib_brpc) + elseif(NOT WITH_HETERPS) + set(BRPC_DEPS brpc ssl crypto) endif() endif() @@ -300,7 +302,7 @@ if(WITH_DISTRIBUTE) lod_rank_table feed_fetch_method collective_helper ${GLOB_DISTRIBUTE_DEPS} graph_to_program_pass variable_helper data_feed_proto timer monitor heter_service_proto fleet_executor ${BRPC_DEP}) - set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") + set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor -Wno-error=parentheses") if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0) set(DISTRIBUTE_COMPILE_FLAGS "${DISTRIBUTE_COMPILE_FLAGS} -faligned-new") @@ -320,7 +322,7 @@ if(WITH_DISTRIBUTE) index_sampler index_wrapper sampler index_dataset_proto lod_rank_table fs shell fleet_wrapper heter_wrapper box_wrapper metrics lodtensor_printer feed_fetch_method graph_to_program_pass variable_helper timer monitor 
heter_service_proto fleet heter_server brpc fleet_executor) - set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") + set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor -Wno-error=parentheses") if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0) set(DISTRIBUTE_COMPILE_FLAGS "${DISTRIBUTE_COMPILE_FLAGS} -faligned-new") diff --git a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc index 1b2b24762894c..ce471d55b24a1 100644 --- a/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc +++ b/paddle/fluid/framework/details/fast_threaded_ssa_graph_executor.cc @@ -132,6 +132,9 @@ FetchResultType FastThreadedSSAGraphExecutor::Run( } // Wait FetchOps. if (!fetch_ops.empty()) { + platform::RecordEvent record_wait( + "FastThreadedSSAGraphExecutor::WaitFetchOps", + platform::TracerEventType::Operator, 1); ClearFetchOp(graph_, &fetch_ops); for (auto &place : places_) { @@ -231,8 +234,9 @@ void FastThreadedSSAGraphExecutor::RunOpAsync( OpHandleBase *op, const std::shared_ptr> &complete_q) { ++remaining_; - platform::RecordEvent("WorkQueue::AddTask", - platform::TracerEventType::UserDefined, 10 /*level*/); + platform::RecordEvent record("WorkQueue::AddTask", + platform::TracerEventType::UserDefined, + 10 /*level*/); this->pool_->enqueue([=] { std::deque op_queue; op_queue.push_front(op); diff --git a/paddle/fluid/framework/executor.cc b/paddle/fluid/framework/executor.cc index f951b5d0f5070..06ce9712f5c52 100644 --- a/paddle/fluid/framework/executor.cc +++ b/paddle/fluid/framework/executor.cc @@ -172,6 +172,8 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id, bool create_local_scope, bool create_vars, const std::vector& skip_ref_cnt_vars, bool force_disable_gc, bool keep_kid_scopes) { + platform::RecordEvent record_run("Executor::Run", + platform::TracerEventType::UserDefined, 1); platform::RecordBlock b(block_id); if (FLAGS_use_mkldnn) EnableMKLDNN(pdesc); auto ctx = Prepare(pdesc, block_id, skip_ref_cnt_vars, force_disable_gc); @@ -301,6 +303,8 @@ void Executor::Run(const ProgramDesc& program, Scope* scope, bool create_local_scope, bool create_vars, const std::string& feed_holder_name, const std::string& fetch_holder_name) { + platform::RecordEvent record_run("Executor::Run", + platform::TracerEventType::UserDefined, 1); platform::RecordBlock b(kProgramId); if (FLAGS_use_mkldnn) EnableMKLDNN(program); #ifdef PADDLE_WITH_MKLDNN @@ -428,6 +432,8 @@ void Executor::RunPartialPreparedContext(ExecutorPrepareContext* ctx, int64_t end_op_index, bool create_local_scope, bool create_vars, bool keep_kids) { + platform::RecordEvent record_run("Executor::RunPartialPreparedContext", + platform::TracerEventType::UserDefined, 1); platform::RecordBlock b(kProgramId); PADDLE_ENFORCE_NOT_NULL( scope, platform::errors::InvalidArgument("Scope shouldn't be null")); @@ -518,6 +524,8 @@ void Executor::RunPartialPreparedContext(ExecutorPrepareContext* ctx, auto& op = ctx->ops_[i]; op->Run(*local_scope, place_); if (gc) { + platform::RecordEvent record("CheckGC", + platform::TracerEventType::UserDefined, 10); DeleteUnusedTensors(*local_scope, op.get(), ctx->unused_vars_, gc.get()); } } diff --git a/paddle/fluid/framework/fleet/heter_ps/CMakeLists.txt b/paddle/fluid/framework/fleet/heter_ps/CMakeLists.txt index ead6dd7e6898d..983208c0608ae 100644 --- 
a/paddle/fluid/framework/fleet/heter_ps/CMakeLists.txt +++ b/paddle/fluid/framework/fleet/heter_ps/CMakeLists.txt @@ -10,12 +10,14 @@ IF(WITH_GPU) nv_library(heter_comm SRCS heter_comm.h feature_value.h heter_resource.cc heter_resource.h hashtable.h mem_pool.h DEPS ${HETERPS_DEPS}) nv_test(test_heter_comm SRCS feature_value.h DEPS heter_comm) nv_library(heter_ps SRCS heter_ps.cu DEPS heter_comm) - nv_library(graph_gpu_ps SRCS graph_gpu_ps_table.h DEPS heter_comm table) - nv_test(test_graph_comm SRCS test_graph.cu DEPS graph_gpu_ps) - nv_test(test_cpu_graph_sample SRCS test_cpu_graph_sample.cu DEPS graph_gpu_ps) - #nv_test(test_sample_rate SRCS test_sample_rate.cu DEPS graph_gpu_ps) - # ADD_EXECUTABLE(test_sample_rate test_sample_rate.cu) - # target_link_libraries(test_sample_rate graph_gpu_ps) + if(WITH_PSCORE) + nv_library(graph_gpu_ps SRCS graph_gpu_ps_table.h DEPS heter_comm table) + nv_test(test_graph_comm SRCS test_graph.cu DEPS graph_gpu_ps) + nv_test(test_cpu_graph_sample SRCS test_cpu_graph_sample.cu DEPS graph_gpu_ps) + #nv_test(test_sample_rate SRCS test_sample_rate.cu DEPS graph_gpu_ps) + # ADD_EXECUTABLE(test_sample_rate test_sample_rate.cu) + # target_link_libraries(test_sample_rate graph_gpu_ps) + endif() ENDIF() IF(WITH_ROCM) hip_library(heter_comm SRCS heter_comm.h feature_value.h heter_resource.cc heter_resource.h hashtable.h DEPS cub device_context) diff --git a/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc b/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc index 72f998a772764..c7852de00a18e 100755 --- a/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc +++ b/paddle/fluid/framework/fleet/ps_gpu_wrapper.cc @@ -37,6 +37,43 @@ limitations under the License. */ namespace paddle { namespace framework { +#ifdef PADDLE_WITH_PSLIB +void AfsWrapper::init(const std::string& fs_name, const std::string& fs_user, + const std::string& pass_wd, const std::string& conf) { + int ret = afs_handler_.init(fs_name.c_str(), fs_user.c_str(), pass_wd.c_str(), + conf.c_str()); + if (ret != 0) { + LOG(ERROR) << "AFS Init Error"; + } +} + +int AfsWrapper::remove(const std::string& path) { + return afs_handler_.remove(path); +} + +int AfsWrapper::mkdir(const std::string& path) { + return afs_handler_.mkdir(path); +} + +std::vector AfsWrapper::list(const std::string& path) { + return afs_handler_.list(path); +} + +int AfsWrapper::exist(const std::string& path) { + return afs_handler_.exist(path); +} + +int AfsWrapper::upload(const std::string& local_file, + const std::string& afs_file) { + return afs_handler_.upload_file(local_file, afs_file); +} + +int AfsWrapper::download(const std::string& local_file, + const std::string& afs_file) { + return afs_handler_.download_file(local_file, afs_file); +} +#endif + std::shared_ptr PSGPUWrapper::s_instance_ = NULL; bool PSGPUWrapper::is_initialized_ = false; #ifdef PADDLE_WITH_PSLIB @@ -343,7 +380,7 @@ void PSGPUWrapper::BuildPull(std::shared_ptr gpu_task) { #ifdef PADDLE_WITH_PSCORE int32_t cnt = 0; while (true) { - auto tt = fleet_ptr->worker_ptr_->pull_sparse_ptr( + auto tt = fleet_ptr->worker_ptr_->PullSparsePtr( reinterpret_cast(local_ptr[i].data()), this->table_id_, local_keys[i].data(), key_size); bool flag = true; diff --git a/paddle/fluid/framework/fleet/ps_gpu_wrapper.h b/paddle/fluid/framework/fleet/ps_gpu_wrapper.h index d9d29cc072dd7..9b7d6de082d1c 100755 --- a/paddle/fluid/framework/fleet/ps_gpu_wrapper.h +++ b/paddle/fluid/framework/fleet/ps_gpu_wrapper.h @@ -55,6 +55,27 @@ namespace framework { #define TYPEALIGN(ALIGNVAL, LEN) \ (((uint64_t)(LEN) 
+ ((ALIGNVAL)-1)) & ~((uint64_t)((ALIGNVAL)-1))) +#ifdef PADDLE_WITH_PSLIB +class AfsWrapper { + public: + AfsWrapper() {} + virtual ~AfsWrapper() {} + void init(const std::string& fs_name, const std::string& fs_user, + const std::string& pass_wd, const std::string& conf); + int remove(const std::string& path); + int mkdir(const std::string& path); + std::vector list(const std::string& path); + + int exist(const std::string& path); + int upload(const std::string& local_file, const std::string& afs_file); + + int download(const std::string& local_file, const std::string& afs_file); + + private: + paddle::ps::AfsApiWrapper afs_handler_; +}; +#endif + class PSGPUWrapper { public: virtual ~PSGPUWrapper() { delete HeterPs_; } diff --git a/paddle/fluid/framework/ir/CMakeLists.txt b/paddle/fluid/framework/ir/CMakeLists.txt index 7aaaef712a6e9..16a95b2ccf7f1 100755 --- a/paddle/fluid/framework/ir/CMakeLists.txt +++ b/paddle/fluid/framework/ir/CMakeLists.txt @@ -86,6 +86,8 @@ pass_library(quant_conv2d_dequant_fuse_pass inference) pass_library(shuffle_channel_detect_pass inference) pass_library(delete_quant_dequant_op_pass inference) pass_library(delete_quant_dequant_filter_op_pass inference) +pass_library(delete_weight_dequant_linear_op_pass inference) +pass_library(delete_quant_dequant_linear_op_pass inference) pass_library(delete_dropout_op_pass inference) pass_library(simplify_with_basic_ops_pass base) pass_library(fc_elementwise_layernorm_fuse_pass base) @@ -148,7 +150,7 @@ if(WITH_IPU) pass_library(ipu_runtime_replacer_pass base DIR ipu) pass_library(inference_process_pass base DIR ipu) pass_library(inference_postprocess_pass base DIR ipu) - pass_library(popart_canonicalization_pass base DIR ipu DEPS paddle_ipu) + pass_library(popart_canonicalization_pass base DIR ipu) pass_library(ipu_inplace_pass base DIR ipu) pass_library(infer_shape_pass base DIR ipu) pass_library(delete_scale_op_pass base DIR ipu) diff --git a/paddle/fluid/framework/ir/add_support_int8_pass.cc b/paddle/fluid/framework/ir/add_support_int8_pass.cc index d157d2e934ace..3a3f5c3741f4d 100644 --- a/paddle/fluid/framework/ir/add_support_int8_pass.cc +++ b/paddle/fluid/framework/ir/add_support_int8_pass.cc @@ -1,4 +1,4 @@ -// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. @@ -19,11 +19,7 @@ namespace framework { namespace ir { #define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); -#define GET_NODES \ - GET_IR_NODE(prev_op); \ - GET_IR_NODE(prev_out); \ - GET_IR_NODE(quant_op); \ - GET_IR_NODE(quant_out); +#define GET_NODES GET_IR_NODE(quant_op); void AddSupportInt8Pass::ApplyImpl(ir::Graph* graph) const { const std::string pattern_name = "add_support_int8"; @@ -37,10 +33,57 @@ void AddSupportInt8Pass::ApplyImpl(ir::Graph* graph) const { auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* g) { GET_NODES; - if (prev_op->Op()->HasAttr("out_threshold") && - quant_op->Op()->HasAttr("out_threshold")) { - quant_op->Op()->SetAttr("support_int8", true); + + bool inscale_flag = false; + bool outscale_flag = false; + auto* quanted_op_desc = quant_op->Op(); + // If inputs'tensors have the inputs_scale, then save it's index in + // input_quant_tensor_index + // OP'Attr hasn't std::vector>. 
To do: Support multi-tensor + // scale for one input + for (size_t i = 0; i < quanted_op_desc->InputNames().size(); i++) { + if (quanted_op_desc->Input(quanted_op_desc->InputNames()[i]).size() > 0 && + quanted_op_desc->HasAttr( + "Input_scale_" + + quanted_op_desc->Input(quanted_op_desc->InputNames()[i])[0])) { + inscale_flag = true; + quanted_op_desc->SetAttr( + quanted_op_desc->InputNames()[i], + quanted_op_desc->GetAttr( + "Input_scale_" + + quanted_op_desc->Input(quanted_op_desc->InputNames()[i])[0])); + } + } + + // If outputs'tensors have the outputs_scale, then save it's index in + // output_quant_tensor_index + // OP'Attr hasn't std::vector>. To do: Support multi-tensor + // scale for one output + for (auto out_node : quant_op->outputs) { + for (auto out_op_node : out_node->outputs) { + for (auto name : out_op_node->Op()->InputNames()) { + for (auto input_name : out_op_node->Op()->Input(name)) { + if (out_op_node->Op()->HasAttr("Input_scale_" + input_name)) { + for (size_t i = 0; i < quanted_op_desc->OutputNames().size(); + i++) { + if (quanted_op_desc->Output(quanted_op_desc->OutputNames()[i]) + .size() > 0 && + input_name == + quanted_op_desc->Output( + quanted_op_desc->OutputNames()[i])[0]) { + outscale_flag = true; + quanted_op_desc->SetAttr( + quanted_op_desc->OutputNames()[i], + out_op_node->Op()->GetAttr("Input_scale_" + input_name)); + } + } + } + } + } + } } + quanted_op_desc->SetAttr("support_int8", inscale_flag && outscale_flag); + quanted_op_desc->Flush(); found_count++; }; gpd(graph, handler); diff --git a/paddle/fluid/framework/ir/cost_model.cc b/paddle/fluid/framework/ir/cost_model.cc index 5027c50103a52..6086409ffd971 100644 --- a/paddle/fluid/framework/ir/cost_model.cc +++ b/paddle/fluid/framework/ir/cost_model.cc @@ -44,6 +44,19 @@ double CostData::GetWholeMemoryBytes() const { return whole_memory_bytes_; } const Graph* CostData::GetGraph() const { return graph_; } const ProgramDesc* CostData::GetProgram() const { return program_; } +static bool StringHasEnding(const std::string& full, + const std::string& ending) { + if (full.length() < ending.length()) { + return false; + } + if (full.length() == ending.length()) { + return full == ending; + } + size_t prefix_len = full.length() - ending.length(); + return 0 == full.compare(prefix_len, ending.length(), ending) && + full[prefix_len - 1] == '/'; +} + bool CostData::SetCostData(const ProgramDesc& program, const std::vector>& time_events) { // TODO(zhhsplendid): Make a copy so that CostData can be available even if @@ -77,7 +90,7 @@ bool CostData::SetCostData(const ProgramDesc& program, std::string op_type = op_desc->Type(); while (event_index < main_thread_events.size()) { - if (main_thread_events[event_index].name() == op_type && + if (StringHasEnding(main_thread_events[event_index].name(), op_type) && main_thread_events[event_index].type() == platform::EventType::kPushRange) { break; @@ -97,7 +110,7 @@ bool CostData::SetCostData(const ProgramDesc& program, // ControlFlow Op can be like that, but this version only support global // block // TODO(zhhsplendid): make a more strict mapping between push and pop - if (main_thread_events[event_index].name() == op_type && + if (StringHasEnding(main_thread_events[event_index].name(), op_type) && main_thread_events[event_index].type() == platform::EventType::kPopRange) { break; diff --git a/paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.cc b/paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.cc new file mode 100644 index 
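// The StringHasEnding helper added to cost_model.cc above relaxes the event-name
// match: an op type now also matches profiler event names that carry a
// path-style prefix ending in '/', while accidental suffix matches are still
// rejected. A standalone sketch of that behaviour (event names are hypothetical):
#include <cassert>
#include <string>

static bool StringHasEnding(const std::string& full, const std::string& ending) {
  if (full.length() < ending.length()) return false;
  if (full.length() == ending.length()) return full == ending;
  size_t prefix_len = full.length() - ending.length();
  return 0 == full.compare(prefix_len, ending.length(), ending) &&
         full[prefix_len - 1] == '/';
}

int main() {
  assert(StringHasEnding("matmul", "matmul"));          // exact match still works
  assert(StringHasEnding("thread0/matmul", "matmul"));  // prefixed event name
  assert(!StringHasEnding("fused_matmul", "matmul"));   // no '/' before the suffix
  assert(!StringHasEnding("mul", "matmul"));            // shorter than the op type
  return 0;
}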
0000000000000..8f2b58ed51b99 --- /dev/null +++ b/paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.cc @@ -0,0 +1,148 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.h" + +#include +#include +#include +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(quantize_linear_op_x); \ + GET_IR_NODE(quantize_linear_op_scale); \ + GET_IR_NODE(quantize_linear_op); \ + GET_IR_NODE(quantize_linear_op_out); \ + GET_IR_NODE(dequantize_linear_op); \ + GET_IR_NODE(dequantize_linear_op_out); \ + GET_IR_NODE(any_op2); + +DeleteQuantDequantLinearOpPass::DeleteQuantDequantLinearOpPass() { + AddOpCompat(OpCompat("quantize_linear")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Scale") + .IsTensor() + .End() + .AddInput("ZeroPoint") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Y") + .IsTensor() + .End() + .AddAttr("bit_length") + .IsType() + .End() + .AddAttr("quant_axis") + .IsType() + .End(); + AddOpCompat(OpCompat("dequantize_linear")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Scale") + .IsTensor() + .End() + .AddInput("ZeroPoint") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Y") + .IsTensor() + .End() + .AddAttr("bit_length") + .IsType() + .End() + .AddAttr("quant_axis") + .IsType() + .End(); +} +// Delete quantize_linear_op dequantize_linear_op, then add input_scales +void DeleteQuantDequantLinearOpPass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = "delete_quantdequant_linear_op_pattern"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* scope = param_scope(); + PADDLE_ENFORCE_NOT_NULL( + scope, + platform::errors::InvalidArgument( + "Scope in DeleteQuantDequantLinearOpPass should not be null.")); + // Create pattern + patterns::DeleteQuantDequantLinearOpPattern pattern(gpd.mutable_pattern(), + pattern_name); + pattern(); + int found_count = 0; + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + /* + if (!IsCompat(subgraph, g)) { + LOG(WARNING) << "delete_quant_dequant_linear_op_pass " + "compat check failed."; + return; + } + */ + std::unordered_set nodes2rm = {}; + int bit_length = + BOOST_GET_CONST(int, quantize_linear_op->Op()->GetAttr("bit_length")); + int range = ((1 << (bit_length - 1)) - 1); + + // Get input scale from tensor + const LoDTensor& input_scale_tensor = + scope->GetVar(quantize_linear_op_scale->Name())->Get(); + PADDLE_ENFORCE_EQ( + paddle::platform::is_cpu_place(input_scale_tensor.place()), true, + platform::errors::InvalidArgument( + "Input scale tensor's place should be CPU.")); + const float* input_scale_data = input_scale_tensor.data(); + float input_scale = input_scale_data[0] / range; + + auto* any_op2_desc = any_op2->Op(); + 
any_op2_desc->SetAttr("Input_scale_" + quantize_linear_op_x->Var()->Name(), + input_scale); + + nodes2rm.insert(quantize_linear_op_scale); + nodes2rm.insert(quantize_linear_op); + nodes2rm.insert(quantize_linear_op_out); + nodes2rm.insert(dequantize_linear_op); + nodes2rm.insert(dequantize_linear_op_out); + + // link x to any_op2 + any_op2_desc->RenameInput(dequantize_linear_op_out->Var()->Name(), + quantize_linear_op_x->Var()->Name()); + any_op2_desc->Flush(); + IR_NODE_LINK_TO(quantize_linear_op_x, any_op2); + GraphSafeRemoveNodes(graph, nodes2rm); + found_count++; + }; + gpd(graph, handler); + AddStatis(found_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(delete_quant_dequant_linear_op_pass, + paddle::framework::ir::DeleteQuantDequantLinearOpPass); diff --git a/paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.h b/paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.h new file mode 100644 index 0000000000000..b00e3cb5c468b --- /dev/null +++ b/paddle/fluid/framework/ir/delete_quant_dequant_linear_op_pass.h @@ -0,0 +1,35 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +class DeleteQuantDequantLinearOpPass : public FusePassBase { + public: + DeleteQuantDequantLinearOpPass(); + virtual ~DeleteQuantDequantLinearOpPass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/delete_quant_dequant_op_pass.cc b/paddle/fluid/framework/ir/delete_quant_dequant_op_pass.cc index 63d68bd04b5f0..e2bb62dba7cf0 100644 --- a/paddle/fluid/framework/ir/delete_quant_dequant_op_pass.cc +++ b/paddle/fluid/framework/ir/delete_quant_dequant_op_pass.cc @@ -61,7 +61,6 @@ void DeleteQuantDequantOpPass::ApplyImpl(ir::Graph* graph) const { GET_NODES; int bit_length = BOOST_GET_CONST(int, quant_dequant_op->Op()->GetAttr("bit_length")); - int range = ((1 << (bit_length - 1)) - 1); // Get input scale from tensor std::string input_scale_var_name = @@ -76,7 +75,7 @@ void DeleteQuantDequantOpPass::ApplyImpl(ir::Graph* graph) const { platform::errors::InvalidArgument( "Input scale tensor's place should be CPU.")); const float* input_scale_data = input_scale_tensor.data(); - float input_scale = input_scale_data[0] / range; + float input_scale = input_scale_data[0]; // Set input scale in attr, and relink nodes std::string input_name = input->Var()->Name(); @@ -85,12 +84,7 @@ void DeleteQuantDequantOpPass::ApplyImpl(ir::Graph* graph) const { for (auto* quantized_node : outlinks) { auto op_desc = quantized_node->Op(); std::string quantized_op_type = op_desc->Type(); - if (quantized_op_type == "mul" || quantized_op_type == "matmul" || - quantized_op_type == "matmul_v2") { - op_desc->SetAttr("X_scale", input_scale); 
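// In the new delete_quant_dequant_linear_op_pass, the value written into the
// "Input_scale_<x>" attribute is the stored scale divided by the integer range
// implied by bit_length, range = (1 << (bit_length - 1)) - 1, which is 127 for
// int8. A standalone sketch of that arithmetic with hypothetical numbers:
#include <cassert>
#include <cmath>

int main() {
  int bit_length = 8;                        // taken from the op attribute
  int range = (1 << (bit_length - 1)) - 1;   // 127 for signed 8-bit quantization
  float stored_scale = 12.7f;                // hypothetical Scale tensor value
  float input_scale = stored_scale / range;  // value attached to the consumer op
  assert(range == 127);
  assert(std::fabs(input_scale - 0.1f) < 1e-6f);
  return 0;
}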
- } else { - op_desc->SetAttr("Input_scale", input_scale); - } + op_desc->SetAttr("Input_scale", input_scale); op_desc->SetAttr("bit_length", bit_length); op_desc->RenameInput(quant_dequant_output_name, input_name); op_desc->Flush(); diff --git a/paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.cc b/paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.cc new file mode 100644 index 0000000000000..8ebea231e7a2a --- /dev/null +++ b/paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.cc @@ -0,0 +1,415 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.h" + +#include +#include +#include +#include +#include + +namespace paddle { +namespace framework { +namespace ir { + +#define GET_IR_NODE(node__) GET_IR_NODE_FROM_SUBGRAPH(node__, node__, pattern); +#define GET_NODES \ + GET_IR_NODE(weight_dequantize_linear_op_x); \ + GET_IR_NODE(weight_dequantize_linear_op_scale); \ + GET_IR_NODE(weight_dequantize_linear_op); \ + GET_IR_NODE(weight_dequantize_linear_op_out); \ + GET_IR_NODE(any_op2); + +DeleteWeightQuantDequantLinearOpPass::DeleteWeightQuantDequantLinearOpPass() { + AddOpCompat(OpCompat("quantize_linear")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Scale") + .IsTensor() + .End() + .AddInput("ZeroPoint") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Y") + .IsTensor() + .End() + .AddAttr("bit_length") + .IsType() + .End() + .AddAttr("quant_axis") + .IsType() + .End(); + AddOpCompat(OpCompat("dequantize_linear")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Scale") + .IsTensor() + .End() + .AddInput("ZeroPoint") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Y") + .IsTensor() + .End() + .AddAttr("bit_length") + .IsType() + .End() + .AddAttr("quant_axis") + .IsType() + .End(); + AddOpCompat(OpCompat("conv2d")) + .AddInput("Input") + .IsTensor() + .End() + .AddInput("Filter") + .IsTensor() + .End() + .AddInput("Bias") + .IsTensor() + .IsOptional() + .End() + .AddInput("ResidualData") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Output") + .IsTensor() + .End() + .AddAttr("strides") + .IsType>() + .End() + .AddAttr("paddings") + .IsType>() + .End() + .AddAttr("padding_algorithm") + .IsOptional() + .IsStringIn({"EXPLICIT", "SAME", "VALID"}) + .End() + .AddAttr("groups") + .IsNumGE(1) + .End() + .AddAttr("dilations") + .IsType>() + .End() + .AddAttr("data_format") + .IsStringIn({"NCHW", "NHWC", "AnyLayout"}) + .End(); + AddOpCompat(OpCompat("depthwise_conv2d")) + .AddInput("Input") + .IsTensor() + .End() + .AddInput("Filter") + .IsTensor() + .End() + .AddInput("Bias") + .IsTensor() + .IsOptional() + .End() + .AddInput("ResidualData") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Output") + .IsTensor() + .End() + .AddAttr("strides") + .IsType>() + .End() + .AddAttr("paddings") + .IsType>() + .End() + .AddAttr("padding_algorithm") + .IsOptional() + .IsStringIn({"EXPLICIT", 
"SAME", "VALID"}) + .End() + .AddAttr("groups") + .IsNumGE(1) + .End() + .AddAttr("dilations") + .IsType>() + .End() + .AddAttr("data_format") + .IsStringIn({"NCHW", "NHWC", "AnyLayout"}) + .End(); + AddOpCompat(OpCompat("mul")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Y") + .IsTensor() + .End() + .AddOutput("Out") + .IsTensor() + .End() + .AddAttr("x_num_col_dims") + .IsNumGE(1) + .End() + .AddAttr("y_num_col_dims") + .IsNumEQ(1) + .End(); + AddOpCompat(OpCompat("matmul_v2")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Y") + .IsTensor() + .End() + .AddOutput("Out") + .IsTensor() + .End() + .AddAttr("trans_x") + .IsBoolEQ(false) + .End() + .AddAttr("trans_y") + .IsBoolEQ(false) + .End(); + AddOpCompat(OpCompat("matmul")) + .AddInput("X") + .IsTensor() + .End() + .AddInput("Y") + .IsTensor() + .End() + .AddOutput("Out") + .IsTensor() + .End() + .AddAttr("alpha") + .IsNumGE(0.99f) + .IsNumLE(1.01f) + .End() + .AddAttr("transpose_X") + .IsBoolEQ(false) + .End() + .AddAttr("transpose_Y") + .IsBoolEQ(false) + .End(); + AddOpCompat(OpCompat("fc")) + .AddInput("Input") + .IsTensor() + .End() + .AddInput("W") + .IsTensor() + .End() + .AddInput("Bias") + .IsTensor() + .End() + .AddOutput("Out") + .IsTensor() + .End() + .AddAttr("in_num_col_dims") + .IsNumGE(1) + .End() + .AddAttr("activation_type") + .IsStringIn({"relu", ""}) + .End(); + AddOpCompat(OpCompat("conv2d_transpose")) + .AddInput("Input") + .IsTensor() + .End() + .AddInput("Filter") + .IsTensor() + .End() + .AddInput("Bias") + .IsTensor() + .IsOptional() + .End() + .AddOutput("Output") + .IsTensor() + .End() + .AddAttr("output_padding") + .IsType>() + .IsOptional() + .End() + .AddAttr("output_size") + .IsType>() + .IsOptional() + .End() + .AddAttr("groups") + .IsNumGE(1) + .End() + .AddAttr("dilations") + .IsType>() + .End() + .AddAttr("strides") + .IsType>() + .End() + .AddAttr("paddings") + .IsType>() + .End() + .AddAttr("padding_algorithm") + .IsOptional() + .IsStringIn({"EXPLICIT", "SAME", "VALID"}) + .End() + .AddAttr("data_format") + .IsStringIn({"NCHW", "NHWC", "AnyLayout"}) + .End(); +} +// Delete dequantize_linear_op, then dequantize weight +void DeleteWeightQuantDequantLinearOpPass::ApplyImpl(ir::Graph* graph) const { + const std::string pattern_name = + "delete_weight_quantdequant_linear_op_pattern"; + FusePassBase::Init(pattern_name, graph); + + GraphPatternDetector gpd; + auto* scope = param_scope(); + PADDLE_ENFORCE_NOT_NULL( + scope, + platform::errors::InvalidArgument( + "Scope in DeleteWeightQuantDequantLinearOpPass should not be null.")); + // Create pattern + patterns::DeleteWeightQuantDequantLinearOpPattern pattern( + gpd.mutable_pattern(), pattern_name); + pattern(); + int found_count = 0; + + auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, + Graph* g) { + GET_NODES; + /* + if (!IsCompat(subgraph, g)) { + LOG(WARNING) << "delete_weight_dequant_linear_op_pass " + "compat check failed."; + return; + } + */ + std::unordered_set nodes2rm = {}; + int bit_length = BOOST_GET_CONST( + int, weight_dequantize_linear_op->Op()->GetAttr("bit_length")); + int range = ((1 << (bit_length - 1)) - 1); + + auto* any_op2_desc = any_op2->Op(); + + // get weight tensor + auto* weight_tensor = scope->GetVar(weight_dequantize_linear_op_x->Name()) + ->GetMutable(); + int8_t* quantized_weight_data = + weight_tensor->mutable_data(platform::CPUPlace()); + auto w_dims = weight_tensor->dims(); + + // Get weight scale + std::vector weight_scale; + auto* weight_scale_tensor = + 
scope->GetVar(weight_dequantize_linear_op_scale->Name()) + ->GetMutable(); + float* weight_scale_data = + weight_scale_tensor->mutable_data(platform::CPUPlace()); + + auto weight_scale_nums = weight_scale_tensor->numel(); + for (int i = 0; i < weight_scale_nums; i++) { + weight_scale.push_back(weight_scale_data[i] / range); + } + + // dequant weight + std::vector weight_data_tmp; + weight_data_tmp.reserve(weight_tensor->numel()); + + int quant_axis = BOOST_GET_CONST( + int, weight_dequantize_linear_op->Op()->GetAttr("quant_axis")); + if (quant_axis == -1) { // per_layer quant_dequant: all OP + PADDLE_ENFORCE_EQ(weight_scale_nums, 1, + platform::errors::InvalidArgument( + "When quant_axis == -1 means use per_layer " + "quant_dequant, weight_scale'number should be 1.")); + + // float(weight) * scale + for (int i = 0; i < weight_tensor->numel(); i++) { + weight_data_tmp[i] = + static_cast(quantized_weight_data[i]) * weight_scale[0]; + } + } else if (quant_axis == 0) { // per_channel quant_dequant: conv2d, + // depthwise_conv2d, conv2d_fusion + PADDLE_ENFORCE_EQ( + weight_scale_nums, w_dims[quant_axis], + platform::errors::InvalidArgument( + "When quant_axis == 0 means use per_channel quant_dequant, " + "weight_scale'numbers should be equal channels.")); + PADDLE_ENFORCE_EQ(w_dims.size(), 4, + platform::errors::InvalidArgument( + "When quant_axis == 0 means use per_channel " + "quant_dequant, (conv2d, depthwise_conv2d, " + "conv2d_fusion)'s weight dims should be 4.")); + + for (int i = 0; i < weight_tensor->numel(); i++) { + int inner_size = w_dims[1] * w_dims[2] * w_dims[3]; + weight_data_tmp[i] = static_cast(quantized_weight_data[i]) * + weight_scale[i / inner_size]; + } + } else if (quant_axis == 1) { + PADDLE_ENFORCE_EQ( + weight_scale_nums, w_dims[quant_axis], + platform::errors::InvalidArgument( + "When quant_axis == 1 means use per_channel quant_dequant, " + "weight_scale'numbers should be equal channels.")); + + if (w_dims.size() == 4) { // conv2d_transpose + std::string quantized_op_type = any_op2->Op()->Type(); + PADDLE_ENFORCE_EQ( + quantized_op_type, "conv2d_transpose", + platform::errors::InvalidArgument( + "When quant_axis == 1 means use per_channel quant_dequant, " + "only conv2d_transpose weight dims equal 4.")); + for (int i = 0; i < weight_tensor->numel(); i++) { + int inner_size = w_dims[2] * w_dims[3]; + weight_data_tmp[i] = static_cast(quantized_weight_data[i]) * + weight_scale[(i / inner_size) % w_dims[1]]; + } + } else if (w_dims.size() == 2) { + for (int i = 0; i < weight_tensor->numel(); i++) { + weight_data_tmp[i] = static_cast(quantized_weight_data[i]) * + weight_scale[i % w_dims[1]]; + } + } else { + PADDLE_THROW(platform::errors::InvalidArgument( + "When quant_axis == 1 , weight dims should be 2 or 4, please check " + "your model ")); + } + } else { + PADDLE_THROW(platform::errors::InvalidArgument( + "quant_axis should be -1 or 0 or 1, please check your model " + "OP'attribute ")); + } + weight_tensor->clear(); // clear int weight + weight_tensor->Resize(phi::make_ddim(phi::vectorize(w_dims))); + float* new_quantized_weight_data = + weight_tensor->mutable_data(platform::CPUPlace()); + memcpy(new_quantized_weight_data, weight_data_tmp.data(), + weight_tensor->numel() * sizeof(float)); + + nodes2rm.insert(weight_dequantize_linear_op_scale); + nodes2rm.insert(weight_dequantize_linear_op); + nodes2rm.insert(weight_dequantize_linear_op_out); + + // relink weight to any_op2 + any_op2_desc->RenameInput(weight_dequantize_linear_op_out->Var()->Name(), + 
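// The weight dequantization above maps each int8 weight back to float with one
// scale per layer (quant_axis == -1) or one scale per channel (quant_axis == 0
// or 1). A self-contained sketch of the per-channel case for a 2-D weight and
// quant_axis == 1, with hypothetical sizes; the scales are assumed to already
// carry the 1/range factor applied earlier in the pass, and the destination
// buffer is resized (not merely reserved) before it is indexed.
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

int main() {
  const int in_dim = 2, out_dim = 3;  // hypothetical w_dims = {2, 3}
  std::vector<int8_t> quantized = {127, 64, 32, -127, -64, -32};
  std::vector<float> channel_scale = {0.01f, 0.02f, 0.04f};  // one per column

  std::vector<float> dequantized;
  dequantized.resize(quantized.size());  // resize so operator[] below is valid
  for (int i = 0; i < in_dim * out_dim; ++i) {
    // for quant_axis == 1 and a 2-D weight, column i % out_dim selects the scale
    dequantized[i] =
        static_cast<float>(quantized[i]) * channel_scale[i % out_dim];
  }
  assert(std::fabs(dequantized[0] - 1.27f) < 1e-4f);    // 127 * 0.01
  assert(std::fabs(dequantized[4] + 1.28f) < 1e-4f);    // -64 * 0.02
  return 0;
}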
weight_dequantize_linear_op_x->Var()->Name()); + any_op2_desc->Flush(); + IR_NODE_LINK_TO(weight_dequantize_linear_op_x, any_op2); + GraphSafeRemoveNodes(graph, nodes2rm); + found_count++; + }; + gpd(graph, handler); + AddStatis(found_count); +} + +} // namespace ir +} // namespace framework +} // namespace paddle + +REGISTER_PASS(delete_weight_dequant_linear_op_pass, + paddle::framework::ir::DeleteWeightQuantDequantLinearOpPass); diff --git a/paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.h b/paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.h new file mode 100644 index 0000000000000..e240b6212b84a --- /dev/null +++ b/paddle/fluid/framework/ir/delete_weight_dequant_linear_op_pass.h @@ -0,0 +1,35 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include + +#include "paddle/fluid/framework/ir/fuse_pass_base.h" + +namespace paddle { +namespace framework { +namespace ir { + +class DeleteWeightQuantDequantLinearOpPass : public FusePassBase { + public: + DeleteWeightQuantDequantLinearOpPass(); + virtual ~DeleteWeightQuantDequantLinearOpPass() {} + + protected: + void ApplyImpl(ir::Graph* graph) const override; +}; + +} // namespace ir +} // namespace framework +} // namespace paddle diff --git a/paddle/fluid/framework/ir/fc_fuse_pass.cc b/paddle/fluid/framework/ir/fc_fuse_pass.cc index e246a10961c0c..1e25b21483b82 100644 --- a/paddle/fluid/framework/ir/fc_fuse_pass.cc +++ b/paddle/fluid/framework/ir/fc_fuse_pass.cc @@ -226,23 +226,34 @@ int FCFusePass::ApplyFCPattern(Graph* graph, bool with_relu) const { // For anakin subgraph int8 // When in anakin subgraph int8 mode, the pattern like "fake_quant + mul + // fake_dequant" can be detected by the quant_dequant_fuse_pass. This pass - // will add "input_scale", "weight_scale" which are extracted from + // will add "input_scale" which are extracted from // fake_quant op and fake_dequant op to mul op, and then delete the // fake_quant op and fake_dequant op in the graph. If the mul op has the // scale info, we should add those to the fused fc. 
auto* mul_op_desc = mul->Op(); + auto* elementwise_add_op_desc = elementwise_add->Op(); + if (mul_op_desc->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", mul_op_desc->GetAttr("enable_int8")); - desc.SetAttr("Input_scale", mul_op_desc->GetAttr("X_scale")); - desc.SetAttr("weight_scale", mul_op_desc->GetAttr("weight_scale")); - if (mul_op_desc->HasAttr("out_scale")) - desc.SetAttr("out_scale", mul_op_desc->GetAttr("out_scale")); - auto elementwise_desc = elementwise_add->Op(); - if (elementwise_desc->HasAttr("out_scale")) - desc.SetAttr("out_scale", elementwise_desc->GetAttr("out_scale")); } - auto* elementwise_add_op_desc = elementwise_add->Op(); + if (mul_op_desc->HasAttr("Input_scale")) { + desc.SetAttr("Input_scale", mul_op_desc->GetAttr("Input_scale")); + } + + bool inscale_flag = false; + bool outscale_flag = false; + + if (mul_op_desc->HasAttr("X")) { + desc.SetAttr("X", mul_op_desc->GetAttr("X")); + inscale_flag = true; + } + if (elementwise_add_op_desc->HasAttr("Out")) { + desc.SetAttr("Out", elementwise_add_op_desc->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag && outscale_flag); + // if we can find out_threshold in elementwise_add, then set it as the // out_thrshold of fc auto out_threshold_attr = diff --git a/paddle/fluid/framework/ir/gpu_cpu_map_matmul_to_mul_pass.cc b/paddle/fluid/framework/ir/gpu_cpu_map_matmul_to_mul_pass.cc index 1759d18761da3..ac580b99b5c95 100644 --- a/paddle/fluid/framework/ir/gpu_cpu_map_matmul_to_mul_pass.cc +++ b/paddle/fluid/framework/ir/gpu_cpu_map_matmul_to_mul_pass.cc @@ -298,8 +298,7 @@ void GpuCpuMapMatmul2MulPass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } @@ -372,9 +371,7 @@ void GpuCpuMapMatmulV2ToMulPass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_v2_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_v2_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_v2_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", - matmul_v2_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_v2_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_v2_op->Op()->GetAttr("out_threshold")); } @@ -451,8 +448,7 @@ void GpuCpuMapMatmulV2ToMatmulPass::ApplyImpl(ir::Graph* graph) const { } if (matmul_v2_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_v2_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_v2_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_v2_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_v2_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_v2_op->Op()->GetAttr("out_threshold")); } @@ -532,8 +528,7 @@ void GpuCpuSqueeze2MatmulFusePass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + 
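// Several fuse passes in this patch stop copying the "X_scale" / "weight_scale"
// pair and instead forward a single "Input_scale" attribute onto the op they
// create. A minimal stand-in for that bookkeeping, using a plain map instead of
// OpDesc (attribute names and values here are hypothetical):
#include <cassert>
#include <map>
#include <string>

int main() {
  std::map<std::string, float> matmul_attrs = {{"Input_scale", 0.05f},
                                               {"out_threshold", 6.0f}};
  std::map<std::string, float> mul_attrs;  // attrs of the newly created mul op

  // Mirrors: if (op->HasAttr("Input_scale")) desc.SetAttr("Input_scale", ...)
  auto it = matmul_attrs.find("Input_scale");
  if (it != matmul_attrs.end()) {
    mul_attrs["Input_scale"] = it->second;
  }

  assert(mul_attrs.count("Input_scale") == 1);
  assert(mul_attrs.count("X_scale") == 0);  // the old attribute name is not set
  return 0;
}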
desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } @@ -677,8 +672,7 @@ void GpuCpuReshape2MatmulFusePass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } @@ -765,8 +759,7 @@ void GpuCpuFlatten2MatmulFusePass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } diff --git a/paddle/fluid/framework/ir/graph_pattern_detector.cc b/paddle/fluid/framework/ir/graph_pattern_detector.cc index 164a13d1560f4..8eb1b64a2763a 100644 --- a/paddle/fluid/framework/ir/graph_pattern_detector.cc +++ b/paddle/fluid/framework/ir/graph_pattern_detector.cc @@ -2069,6 +2069,29 @@ PDNode *patterns::Elementwise::operator()(PDNode *x_var, PDNode *y_var, return out_var; } +PDNode *patterns::ResidualElementwise::operator()( + PDNode *op_var, PDNode *residual_var, const std::string elementwise_type, + bool as_x) { + auto elementwise_op = + pattern->NewNode(elementwise_op_repr())->assert_is_op(elementwise_type); + + if (as_x) { + op_var->AsInput()->assert_is_op_input(elementwise_type, "X"); + residual_var->AsInput()->assert_is_op_input(elementwise_type, "Y"); + } else { + op_var->AsInput()->assert_is_op_input(elementwise_type, "Y"); + residual_var->AsInput()->assert_is_op_input(elementwise_type, "X"); + } + auto out_var = pattern->NewNode(elementwise_out_repr()) + ->AsOutput() + ->assert_is_op_output(elementwise_type, "Out"); + + elementwise_op->LinksFrom({op_var, residual_var}); + elementwise_op->LinksTo({out_var}); + + return out_var; +} + PDNode *patterns::Concat::operator()() { auto concat_op = pattern->NewNode(concat_op_repr())->assert_is_op("concat"); @@ -2949,6 +2972,84 @@ void patterns::DeleteQuantDequantFilterOpPattern::operator()() { any_op2->LinksFrom({quant_dequant_out}); } +void patterns::DeleteWeightQuantDequantLinearOpPattern::operator()() { + auto weight_dequantize_linear_op_x = + pattern->NewNode(weight_dequantize_linear_op_x_repr()) + ->AsInput() + ->assert_is_op_input("dequantize_linear", "X") + ->assert_is_persistable_var(); + + auto weight_dequantize_linear_op_scale = + pattern->NewNode(weight_dequantize_linear_op_scale_repr()) + ->AsInput() + ->assert_is_op_input("dequantize_linear", "Scale") + ->assert_is_persistable_var(); + + auto weight_dequantize_linear_op = + pattern->NewNode(weight_dequantize_linear_op_repr()) + ->assert_is_op("dequantize_linear"); + + auto weight_dequantize_linear_op_out = + pattern->NewNode(weight_dequantize_linear_op_out_repr()) + ->AsIntermediate() + ->assert_is_op_output("dequantize_linear", "Y"); + + auto any_op2 = pattern->NewNode(any_op2_repr())->assert_is_op()->AsOutput(); + + weight_dequantize_linear_op + ->LinksFrom( + 
{weight_dequantize_linear_op_x, weight_dequantize_linear_op_scale}) + .LinksTo({weight_dequantize_linear_op_out}); + any_op2->LinksFrom({weight_dequantize_linear_op_out}); +} + +void patterns::DeleteQuantDequantLinearOpPattern::operator()() { + auto quantize_linear_op_x = pattern->NewNode(quantize_linear_op_x_repr()) + ->AsInput() + ->assert_is_op_input("quantize_linear", "X"); + + auto quantize_linear_op_scale = + pattern->NewNode(quantize_linear_op_scale_repr()) + ->AsInput() + ->assert_is_op_input("quantize_linear", "Scale") + ->assert_is_persistable_var(); + + auto quantize_linear_op = pattern->NewNode(quantize_linear_op_repr()) + ->assert_is_op("quantize_linear"); + + auto quantize_linear_op_out = + pattern->NewNode(quantize_linear_op_out_repr()) + ->AsIntermediate() + ->assert_is_op_output("quantize_linear", "Y") + ->assert_is_op_input("dequantize_linear", "X") + ->assert_var_not_persistable(); + + // Can not add this node. Todo: Wangzheee + /* + auto dequantize_linear_op_scale = + pattern->NewNode(dequantize_linear_op_scale_repr()) + ->assert_is_op_input("dequantize_linear", "Scale") + ->AsIntermediate(); + */ + + auto dequantize_linear_op = pattern->NewNode(dequantize_linear_op_repr()) + ->assert_is_op("dequantize_linear"); + + auto dequantize_linear_op_out = + pattern->NewNode(dequantize_linear_op_out_repr()) + ->AsIntermediate() + ->assert_is_op_output("dequantize_linear", "Y"); + + auto any_op2 = pattern->NewNode(any_op2_repr())->assert_is_op()->AsOutput(); + + quantize_linear_op + ->LinksFrom({quantize_linear_op_x, quantize_linear_op_scale}) + .LinksTo({quantize_linear_op_out}); + dequantize_linear_op->LinksFrom({quantize_linear_op_out}) + .LinksTo({dequantize_linear_op_out}); + any_op2->LinksFrom({dequantize_linear_op_out}); +} + PDNode *patterns::ReshapeTransposeMatmulPattern::operator()( const std::string &op_name, bool with_reshape_xshape, bool with_transpose_xshape) { @@ -3311,25 +3412,14 @@ PDNode *patterns::LayerNorm::operator()() { return shift_out; } -// Add support int8 flag +// Add support int8 flag and out_threshold PDNode *patterns::AddSupportInt8::operator()() { - auto prev_op = - pattern->NewNode(prev_op_repr()) - ->assert_is_op() - ->assert_more([&](Node *node) { - return node->Op()->HasAttr("out_threshold") ? true : false; - }); - auto prev_out = pattern->NewNode(prev_out_repr())->assert_is_var(); - auto quant_op = - pattern->NewNode(quant_op_repr()) - ->assert_is_op() - ->assert_more([&](Node *node) { - return node->Op()->HasAttr("out_threshold") ? 
true : false; - }); + auto quant_op = pattern->NewNode(quant_op_repr())->assert_is_op(); auto quant_out = - pattern->NewNode(quant_out_repr())->assert_is_var()->AsOutput(); - prev_op->LinksTo({prev_out}); - prev_out->LinksTo({quant_op}); + pattern->NewNode(quant_out_repr()) + ->assert_is_var() + ->assert_more([&](Node *node) { return node->outputs.size() > 0; }) + ->AsOutput(); quant_op->LinksTo({quant_out}); return quant_out; } diff --git a/paddle/fluid/framework/ir/graph_pattern_detector.h b/paddle/fluid/framework/ir/graph_pattern_detector.h index 17c70ace301d3..434ede6cf7a3b 100644 --- a/paddle/fluid/framework/ir/graph_pattern_detector.h +++ b/paddle/fluid/framework/ir/graph_pattern_detector.h @@ -1032,6 +1032,22 @@ struct Elementwise : public PatternBase { PATTERN_DECL_NODE(elementwise_out); }; +// Residual Elementwise ops +// This pattern allows operator output to be X or Y +// and residual data Y or X, based on as_x flag +struct ResidualElementwise : public PatternBase { + ResidualElementwise(PDPattern* pattern, const std::string& name_scope, + bool as_x) + : PatternBase(pattern, name_scope, "residual_elementwise") {} + PDNode* operator()(PDNode* op_var, PDNode* residual_var, + const std::string elementwise_type, bool as_x); + + PATTERN_DECL_NODE(operator_output); + PATTERN_DECL_NODE(residual_data); + PATTERN_DECL_NODE(elementwise_op); + PATTERN_DECL_NODE(elementwise_out); +}; + // Transpose op // Forward pass for transpose. // transpose_out is a result of the operator. @@ -1702,6 +1718,40 @@ struct DeleteQuantDequantFilterOpPattern : public PatternBase { PATTERN_DECL_NODE(any_op2); }; +struct DeleteWeightQuantDequantLinearOpPattern : public PatternBase { + DeleteWeightQuantDequantLinearOpPattern(PDPattern* pattern, + const std::string& name_scope) + : PatternBase(pattern, name_scope, + "delete_weight_quant_dequant_linear_op_pattern") {} + + void operator()(); + + PATTERN_DECL_NODE(weight_dequantize_linear_op_x); + PATTERN_DECL_NODE(weight_dequantize_linear_op_scale); + PATTERN_DECL_NODE(weight_dequantize_linear_op); + PATTERN_DECL_NODE(weight_dequantize_linear_op_out); + PATTERN_DECL_NODE(any_op2); +}; + +struct DeleteQuantDequantLinearOpPattern : public PatternBase { + DeleteQuantDequantLinearOpPattern(PDPattern* pattern, + const std::string& name_scope) + : PatternBase(pattern, name_scope, + "delete_quant_dequant_linear_op_pattern") {} + + void operator()(); + + PATTERN_DECL_NODE(quantize_linear_op_x); + PATTERN_DECL_NODE(quantize_linear_op_scale); + PATTERN_DECL_NODE(quantize_linear_op); + PATTERN_DECL_NODE(quantize_linear_op_out); + PATTERN_DECL_NODE(dequantize_linear_op); + // PATTERN_DECL_NODE(dequantize_linear_op_scale); // Can not add this node. + // Todo: Wangzheee + PATTERN_DECL_NODE(dequantize_linear_op_out); + PATTERN_DECL_NODE(any_op2); +}; + // Reshape + Transpose + Matmul // named nodes: // reshape_op, reshape_out, reshape_xshape, @@ -1887,8 +1937,6 @@ struct AddSupportInt8 : public PatternBase { : PatternBase(pattern, name_scope, "Add_support_int8") {} PDNode* operator()(); - PATTERN_DECL_NODE(prev_op); - PATTERN_DECL_NODE(prev_out); PATTERN_DECL_NODE(quant_op); PATTERN_DECL_NODE(quant_out); }; diff --git a/paddle/fluid/framework/ir/graph_traits.cc b/paddle/fluid/framework/ir/graph_traits.cc index 262a523bd8e0e..b06314563025a 100644 --- a/paddle/fluid/framework/ir/graph_traits.cc +++ b/paddle/fluid/framework/ir/graph_traits.cc @@ -12,6 +12,9 @@ // See the License for the specific language governing permissions and // limitations under the License. 
+#include +#include + #include "paddle/fluid/framework/ir/graph_traits.h" namespace paddle { @@ -23,6 +26,51 @@ namespace ir { // class Node; +bool IsReachable(ir::Graph *graph, Node *from, Node *to) { + if (from == to) { + return true; + } + + std::map visited; + + for (auto &node : GraphTraits::DFS(*graph)) { + visited[&node] = false; + } + + visited[from] = true; + + std::list queue; + queue.push_back(from); + + while (!queue.empty()) { + auto cur = FindNode(graph, queue.front()); + queue.pop_front(); + + if (!cur) return false; + + for (const auto &n : cur->outputs) { + if (n == to) { + return true; + } + + if (!visited[n]) { + visited[n] = true; + queue.push_back(n); + } + } + } + return false; +} + +Node *FindNode(ir::Graph *graph, const Node *node) { + for (const auto &n : graph->Nodes()) { + if (n == node) { + return n; + } + } + return nullptr; +} + NodesDFSIterator::NodesDFSIterator(const std::vector &source) { for (auto *x : source) stack_.push(x); } diff --git a/paddle/fluid/framework/ir/graph_traits.h b/paddle/fluid/framework/ir/graph_traits.h index a54cc61a63fde..7e313e17f422e 100644 --- a/paddle/fluid/framework/ir/graph_traits.h +++ b/paddle/fluid/framework/ir/graph_traits.h @@ -29,6 +29,9 @@ namespace ir { class Graph; class Node; +bool IsReachable(ir::Graph *graph, Node *from, Node *to); +Node *FindNode(ir::Graph *graph, const Node *node); + template class iterator_range { IteratorT begin_, end_; diff --git a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc index fc2758c273450..16c4f251e0bde 100644 --- a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc +++ b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.cc @@ -14,12 +14,6 @@ #include "paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h" -#include -#include -#include -#include -#include - #include "paddle/fluid/framework/ir/graph_traits.h" #include "paddle/fluid/framework/op_version_registry.h" #include "paddle/fluid/string/pretty_log.h" @@ -28,60 +22,6 @@ namespace paddle { namespace framework { namespace ir { -bool IsReachable(ir::Graph* graph, Node* from, Node* to) { - auto find_node = [](ir::Graph* graph, const Node* node) -> Node* { - for (auto n : graph->Nodes()) { - if (n == node) { - return n; - } - } - - return nullptr; - }; - - if (from == to) { - return true; - } - - std::map visited; - - for (auto& node : GraphTraits::DFS(*graph)) { - visited[&node] = false; - } - - visited[from] = true; - - std::list queue; - queue.push_back(from); - - while (!queue.empty()) { - auto cur = find_node(graph, queue.front()); - queue.pop_front(); - - if (!cur) return false; - - for (auto n : cur->outputs) { - if (n == to) { - return true; - } - - if (!visited[n]) { - visited[n] = true; - queue.push_back(n); - } - } - } - return false; -} - -template -paddle::optional HasAttribute(const Node& op, const std::string& attr) { - if (op.Op()->HasAttr(attr)) - return BOOST_GET_CONST(T, op.Op()->GetAttr(attr)); - else - return paddle::none; -} - ResidualConnectionMKLDNNFusePass::ResidualConnectionMKLDNNFusePass() { AddOpCompat(OpCompat("conv2d")) .AddInput("Input") @@ -136,89 +76,22 @@ ResidualConnectionMKLDNNFusePass::ResidualConnectionMKLDNNFusePass() { .End(); } -GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsX( - const std::string& name_scope, - const GraphWithStats& graph_with_stats) const { - GraphPatternDetector gpd; - auto pattern = 
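// IsReachable, now shared via graph_traits, is a plain breadth-first search over
// the directed op/variable graph: it only answers whether `to` can be reached by
// following output edges from `from`. A self-contained sketch of the same
// traversal on an adjacency-list graph (node ids are hypothetical):
#include <cassert>
#include <list>
#include <map>
#include <vector>

static bool Reachable(const std::map<int, std::vector<int>>& outputs,
                      int from, int to) {
  if (from == to) return true;
  std::map<int, bool> visited;
  visited[from] = true;
  std::list<int> queue = {from};
  while (!queue.empty()) {
    int cur = queue.front();
    queue.pop_front();
    auto it = outputs.find(cur);
    if (it == outputs.end()) continue;  // node has no outgoing edges
    for (int n : it->second) {
      if (n == to) return true;
      if (!visited[n]) {
        visited[n] = true;
        queue.push_back(n);
      }
    }
  }
  return false;
}

int main() {
  // 0 -> 1 -> 2, plus an unreachable node 3
  std::map<int, std::vector<int>> g = {{0, {1}}, {1, {2}}};
  assert(Reachable(g, 0, 2));
  assert(!Reachable(g, 2, 0));  // edges are directed
  assert(!Reachable(g, 0, 3));
  return 0;
}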
gpd.mutable_pattern(); - - patterns::Conv conv_pattern{pattern, name_scope}; - auto conv_output = conv_pattern(); - - patterns::Elementwise elementwise_pattern{pattern, name_scope}; - elementwise_pattern( - conv_output, pattern->NewNode(elementwise_pattern.elementwise_y_repr()), - "elementwise_add"); - conv_output->AsIntermediate(); - - int found_conv_as_x_count = 0; - - auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, - Graph* g) { - GET_IR_NODE_FROM_SUBGRAPH(conv_op, conv_op, conv_pattern); - GET_IR_NODE_FROM_SUBGRAPH(conv_input, conv_input, conv_pattern); - GET_IR_NODE_FROM_SUBGRAPH(conv_filter, conv_filter, conv_pattern); - GET_IR_NODE_FROM_SUBGRAPH(conv_output, conv_output, conv_pattern); - - GET_IR_NODE_FROM_SUBGRAPH(elementwise_op, elementwise_op, - elementwise_pattern); - GET_IR_NODE_FROM_SUBGRAPH(elementwise_identity, elementwise_y, - elementwise_pattern); - GET_IR_NODE_FROM_SUBGRAPH(elementwise_out, elementwise_out, - elementwise_pattern); - - if (FindFuseOption(*conv_op, *elementwise_op) != FUSE_MKLDNN) return; - - if (!IsReachable(g, elementwise_identity, conv_output)) return; - - if (HasFusedActivation(conv_op)) return; - - if (!IsCompat(subgraph, g)) { - LOG(WARNING) - << "conv_elementwise_add_mkldnn_fuse_pass in op compat failed."; - return; - } - - conv_op->Op()->SetInput("ResidualData", {elementwise_identity->Name()}); - conv_op->Op()->SetOutput("Output", {elementwise_out->Name()}); - conv_op->Op()->SetAttr("fuse_residual_connection", true); - - GraphSafeRemoveNodes(g, {conv_output, elementwise_op}); - - IR_NODE_LINK_TO(elementwise_identity, conv_op); - IR_NODE_LINK_TO(conv_op, elementwise_out); - - found_conv_as_x_count++; - }; - - gpd(graph_with_stats.first, handler); - if (!Has("disable_logs") || !Get("disable_logs")) { - std::stringstream msg_ss; - msg_ss << "--- Fused " << found_conv_as_x_count - << " conv (as x) + elementwise_add patterns"; - paddle::string::PrettyLogDetail(msg_ss.str().c_str()); - } - - return std::make_pair(graph_with_stats.first, - found_conv_as_x_count + graph_with_stats.second); -} - -GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsY( - const std::string& name_scope, - const GraphWithStats& graph_with_stats) const { +GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConv( + const std::string& name_scope, const GraphWithStats& graph_with_stats, + bool as_x) const { GraphPatternDetector gpd; auto pattern = gpd.mutable_pattern(); patterns::Conv conv_pattern{pattern, name_scope}; auto conv_output = conv_pattern(); - patterns::Elementwise elementwise_pattern{pattern, name_scope}; + patterns::ResidualElementwise elementwise_pattern{pattern, name_scope, as_x}; elementwise_pattern( - pattern->NewNode(elementwise_pattern.elementwise_x_repr()), conv_output, - "elementwise_add"); + conv_output, pattern->NewNode(elementwise_pattern.residual_data_repr()), + "elementwise_add", as_x); conv_output->AsIntermediate(); - int found_conv_as_y_count = 0; + int found_conv_count = 0; auto handler = [&](const GraphPatternDetector::subgraph_t& subgraph, Graph* g) { @@ -229,15 +102,13 @@ GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsY( GET_IR_NODE_FROM_SUBGRAPH(elementwise_op, elementwise_op, elementwise_pattern); - GET_IR_NODE_FROM_SUBGRAPH(elementwise_x, elementwise_x, + GET_IR_NODE_FROM_SUBGRAPH(residual_data, residual_data, elementwise_pattern); GET_IR_NODE_FROM_SUBGRAPH(elementwise_out, elementwise_out, elementwise_pattern); if (FindFuseOption(*conv_op, *elementwise_op) != FUSE_MKLDNN) return; - - if (!IsReachable(g, 
elementwise_x, conv_output)) return; - + if (!IsReachable(g, residual_data, conv_output)) return; if (HasFusedActivation(conv_op)) return; if (!IsCompat(subgraph, g)) { @@ -246,28 +117,29 @@ GraphWithStats ResidualConnectionMKLDNNFusePass::FuseConvAsY( return; } - conv_op->Op()->SetInput("ResidualData", {elementwise_x->Name()}); + conv_op->Op()->SetInput("ResidualData", {residual_data->Name()}); conv_op->Op()->SetOutput("Output", {elementwise_out->Name()}); conv_op->Op()->SetAttr("fuse_residual_connection", true); GraphSafeRemoveNodes(g, {conv_output, elementwise_op}); - IR_NODE_LINK_TO(elementwise_x, conv_op); + IR_NODE_LINK_TO(residual_data, conv_op); IR_NODE_LINK_TO(conv_op, elementwise_out); - found_conv_as_y_count++; + found_conv_count++; }; gpd(graph_with_stats.first, handler); if (!Has("disable_logs") || !Get("disable_logs")) { std::stringstream msg_ss; - msg_ss << "--- Fused " << found_conv_as_y_count - << " conv (as y) + elementwise_add patterns"; + std::string fusionMode = as_x ? "x" : "y"; + msg_ss << "--- Fused " << found_conv_count << " conv (as " << fusionMode + << ") + elementwise_add patterns"; paddle::string::PrettyLogDetail(msg_ss.str().c_str()); } return std::make_pair(graph_with_stats.first, - found_conv_as_y_count + graph_with_stats.second); + found_conv_count + graph_with_stats.second); } GraphWithStats ResidualConnectionMKLDNNFusePass::FuseProjectionConv( @@ -308,7 +180,7 @@ GraphWithStats ResidualConnectionMKLDNNFusePass::FuseProjectionConv( if (!IsCompat(subgraph, g)) { LOG(WARNING) - << "conv_elementwise_add_mkldnn_fuse_pass in op compat failed."; + << "op compat for conv_elementwise_add_mkldnn_fuse_pass failed."; return; } @@ -361,8 +233,8 @@ void ResidualConnectionMKLDNNFusePass::ApplyImpl(ir::Graph* graph) const { FusePassBase::Init(name_scope_, graph); auto graph_with_stats = FuseProjectionConv(name_scope_, std::make_pair(graph, 0)); - graph_with_stats = FuseConvAsX(name_scope_, graph_with_stats); - graph_with_stats = FuseConvAsY(name_scope_, graph_with_stats); + graph_with_stats = FuseConv(name_scope_, graph_with_stats, true); + graph_with_stats = FuseConv(name_scope_, graph_with_stats, false); AddStatis(graph_with_stats.second); } diff --git a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h index c4351b382187d..7c6e9927163c7 100644 --- a/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h +++ b/paddle/fluid/framework/ir/mkldnn/conv_elementwise_add_mkldnn_fuse_pass.h @@ -14,30 +14,20 @@ #pragma once -#include -#include -#include -#include #include "paddle/fluid/framework/ir/fuse_pass_base.h" -#include "paddle/fluid/framework/ir/graph.h" #include "paddle/fluid/framework/ir/graph_pattern_detector.h" -#include - namespace paddle { namespace framework { namespace ir { using GraphWithStats = std::pair; -bool IsReachable(ir::Graph* graph, Node* from, Node* to); - class ResidualConnectionMKLDNNFusePass : public FusePassBase { private: - GraphWithStats FuseConvAsX(const std::string& name_scope, - const GraphWithStats& graph_with_stats) const; - GraphWithStats FuseConvAsY(const std::string& name_scope, - const GraphWithStats& graph_with_stats) const; + GraphWithStats FuseConv(const std::string& name_scope, + const GraphWithStats& graph_with_stats, + bool as_x) const; GraphWithStats FuseProjectionConv( const std::string& name_scope, const GraphWithStats& graph_with_stats) const; diff --git a/paddle/fluid/framework/ir/multihead_matmul_fuse_pass.cc 
b/paddle/fluid/framework/ir/multihead_matmul_fuse_pass.cc index 989b5460743b0..a8595d55b31b0 100644 --- a/paddle/fluid/framework/ir/multihead_matmul_fuse_pass.cc +++ b/paddle/fluid/framework/ir/multihead_matmul_fuse_pass.cc @@ -862,43 +862,30 @@ int MultiHeadMatmulV2FusePass::BuildFusionV2(Graph* graph, multihead_op_desc.SetAttr("head_number", head_number); auto* mul0_op_desc = mul0->Op(); - auto* mul1_op_desc = mul1->Op(); - auto* mul2_op_desc = mul2->Op(); - if (mul0_op_desc->HasAttr("enable_int8")) { - multihead_op_desc.SetAttr("enable_int8", - mul0_op_desc->GetAttr("enable_int8")); - // all mul op has same input. + + // all mul op has same input. + if (multihead_op_desc.HasAttr("Input_scale")) { multihead_op_desc.SetAttr("Input_scale", - mul0_op_desc->GetAttr("X_scale")); - auto weight_scale0 = BOOST_GET_CONST( - std::vector, mul0_op_desc->GetAttr("weight_scale")); - auto weight_scale1 = BOOST_GET_CONST( - std::vector, mul1_op_desc->GetAttr("weight_scale")); - auto weight_scale2 = BOOST_GET_CONST( - std::vector, mul2_op_desc->GetAttr("weight_scale")); - auto weight_max = std::max(weight_scale0, weight_scale1); - weight_max = std::max(weight_max, weight_scale2); - multihead_op_desc.SetAttr("weight_scale", weight_max); - - auto* add0_op_desc = eltadd0->Op(); - auto* add1_op_desc = eltadd1->Op(); - auto* add2_op_desc = eltadd2->Op(); - if (add0_op_desc->HasAttr("out_threshold")) { - auto out_scale0 = - BOOST_GET_CONST(float, add0_op_desc->GetAttr("out_threshold")); - auto out_scale1 = - BOOST_GET_CONST(float, add1_op_desc->GetAttr("out_threshold")); - auto out_scale2 = - BOOST_GET_CONST(float, add2_op_desc->GetAttr("out_threshold")); - auto out_scale_max = std::max(out_scale0, out_scale1); - out_scale_max = std::max(out_scale_max, out_scale2); - multihead_op_desc.SetAttr("fc_out_threshold", out_scale_max); - } + mul0_op_desc->GetAttr("Input_scale")); + } + auto* add0_op_desc = eltadd0->Op(); + auto* add1_op_desc = eltadd1->Op(); + auto* add2_op_desc = eltadd2->Op(); + if (add0_op_desc->HasAttr("out_threshold")) { + auto out_scale0 = + BOOST_GET_CONST(float, add0_op_desc->GetAttr("out_threshold")); + auto out_scale1 = + BOOST_GET_CONST(float, add1_op_desc->GetAttr("out_threshold")); + auto out_scale2 = + BOOST_GET_CONST(float, add2_op_desc->GetAttr("out_threshold")); + auto out_scale_max = std::max(out_scale0, out_scale1); + out_scale_max = std::max(out_scale_max, out_scale2); + multihead_op_desc.SetAttr("fc_out_threshold", out_scale_max); } auto* softmax_qk_op_desc = softmax_qk->Op(); auto* matmul_qk_op_desc = matmul_qk->Op(); - if (matmul_qk_op_desc->HasAttr("X_scale")) { + if (matmul_qk_op_desc->HasAttr("Input_scale")) { multihead_op_desc.SetAttr("qkv2context_plugin_int8", true); if (softmax_qk_op_desc->HasAttr("out_threshold")) { auto qkv_plugin_scale = BOOST_GET_CONST( diff --git a/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc b/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc index 619fe7ab4f738..281e0b9910619 100644 --- a/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc +++ b/paddle/fluid/framework/ir/quant_conv2d_dequant_fuse_pass.cc @@ -341,7 +341,6 @@ void QuantDequantFusePass::DeleteQuant(ir::Graph* graph, Scope* scope, Node* output_scale = subgraph.at(pattern.GetPDNode("output_scale_node")); Node* output_act = subgraph.at(pattern.GetPDNode("output_act_node")); int bit_length = BOOST_GET_CONST(int, quant->Op()->GetAttr("bit_length")); - int range = ((1 << (bit_length - 1)) - 1); // Get input scale from tensor std::string input_scale_var_name = 
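// In the multihead fuse above, each of the three eltadd branches carries its own
// "out_threshold"; the fused op keeps the largest of them as "fc_out_threshold".
// A one-line sketch of that reduction with hypothetical threshold values:
#include <algorithm>
#include <cassert>

int main() {
  float out_scale0 = 4.5f, out_scale1 = 6.0f, out_scale2 = 5.25f;
  float fc_out_threshold = std::max(std::max(out_scale0, out_scale1), out_scale2);
  assert(fc_out_threshold == 6.0f);
  return 0;
}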
quant->Op()->Input("InScale").front(); @@ -356,7 +355,7 @@ void QuantDequantFusePass::DeleteQuant(ir::Graph* graph, Scope* scope, "Input scale tensor's place should be CPU.")); const float* input_scale_data = input_scale_tensor.data(); float in_scale = input_scale_data[0]; - float scale_value = in_scale / range; + float scale_value = in_scale; // Set input scale in attr, and relink nodes std::string input_act_name = input_act->Var()->Name(); @@ -369,11 +368,10 @@ void QuantDequantFusePass::DeleteQuant(ir::Graph* graph, Scope* scope, quantized_op_type == "conv2d_fusion" || quantized_op_type == "depthwise_conv2d" || quantized_op_type == "fc" || - quantized_op_type == "conv2d_transpose") { + quantized_op_type == "conv2d_transpose" || + quantized_op_type == "mul" || quantized_op_type == "matmul" || + quantized_op_type == "matmul_v2") { op_desc->SetAttr("Input_scale", scale_value); - } else if (quantized_op_type == "mul" || quantized_op_type == "matmul" || - quantized_op_type == "matmul_v2") { - op_desc->SetAttr("X_scale", scale_value); } else { PADDLE_THROW(platform::errors::Unimplemented( "Unsupported quantized op type %s.", quantized_op_type)); @@ -619,7 +617,6 @@ void QuantDequantFusePass::FuseDequant(ir::Graph* graph, Scope* scope, new_op_desc.SetInput("X", {new_input}); new_op_desc.SetOutput("Out", {new_output}); } - new_op_desc.SetAttr("weight_scale", weight_scale); new_op_desc.Flush(); auto* new_op = graph->CreateOpNode(&new_op_desc); IR_NODE_LINK_TO(quantized_op_input_node, new_op); diff --git a/paddle/fluid/framework/ir/trt_map_matmul_to_mul_pass.cc b/paddle/fluid/framework/ir/trt_map_matmul_to_mul_pass.cc index 3caaf08dc9cb5..d3211c0841416 100644 --- a/paddle/fluid/framework/ir/trt_map_matmul_to_mul_pass.cc +++ b/paddle/fluid/framework/ir/trt_map_matmul_to_mul_pass.cc @@ -297,11 +297,24 @@ void TrtMapMatmul2MulPass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("transpose_Y", matmul_op->Op()->GetAttr("transpose_Y")); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } + + bool inscale_flag = false; + bool outscale_flag = false; + + if (matmul_op->Op()->HasAttr("X")) { + desc.SetAttr("X", matmul_op->Op()->GetAttr("X")); + inscale_flag = true; + } + if (matmul_op->Op()->HasAttr("Out")) { + desc.SetAttr("Out", matmul_op->Op()->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag && outscale_flag); + auto mul_node = g->CreateOpNode(&desc); IR_NODE_LINK_TO(matmul_in_x, mul_node); IR_NODE_LINK_TO(matmul_in_y, mul_node); @@ -370,12 +383,23 @@ void TrtMapMatmulV2ToMulPass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("transpose_Y", matmul_v2_op->Op()->GetAttr("trans_y")); if (matmul_v2_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_v2_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_v2_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", - matmul_v2_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_v2_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_v2_op->Op()->GetAttr("out_threshold")); } + + bool inscale_flag = false; + bool outscale_flag = false; + if (matmul_v2_op->Op()->HasAttr("X")) { + 
desc.SetAttr("X", matmul_v2_op->Op()->GetAttr("X")); + inscale_flag = true; + } + if (matmul_v2_op->Op()->HasAttr("Out")) { + desc.SetAttr("Out", matmul_v2_op->Op()->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag && outscale_flag); + auto mul_node = g->CreateOpNode(&desc); IR_NODE_LINK_TO(matmul_v2_in_x, mul_node); IR_NODE_LINK_TO(matmul_v2_in_y, mul_node); @@ -448,11 +472,23 @@ void TrtMapMatmulV2ToMatmulPass::ApplyImpl(ir::Graph* graph) const { } if (matmul_v2_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_v2_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_v2_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_v2_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_v2_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_v2_op->Op()->GetAttr("out_threshold")); } + + bool inscale_flag = false; + bool outscale_flag = false; + if (matmul_v2_op->Op()->HasAttr("X")) { + desc.SetAttr("X", matmul_v2_op->Op()->GetAttr("X")); + inscale_flag = true; + } + if (matmul_v2_op->Op()->HasAttr("Out")) { + desc.SetAttr("Out", matmul_v2_op->Op()->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag && outscale_flag); + auto matmul_node = g->CreateOpNode(&desc); IR_NODE_LINK_TO(matmul_v2_in_x, matmul_node); IR_NODE_LINK_TO(matmul_v2_in_y, matmul_node); @@ -530,11 +566,24 @@ void TrtSqueeze2MatmulFusePass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } + + bool inscale_flag_x = false; + bool outscale_flag = false; + + if (squeeze2_op->Op()->HasAttr("X")) { + desc.SetAttr("X", squeeze2_op->Op()->GetAttr("X")); + inscale_flag_x = true; + } + if (matmul_op->Op()->HasAttr("Out")) { + desc.SetAttr("Out", matmul_op->Op()->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag_x && outscale_flag); + auto mul_node = g->CreateOpNode(&desc); IR_NODE_LINK_TO(squeeze2_in_x, mul_node); IR_NODE_LINK_TO(matmul_in_y, mul_node); @@ -675,11 +724,24 @@ void TrtReshape2MatmulFusePass::ApplyImpl(ir::Graph* graph) const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } + + bool inscale_flag_x = false; + bool outscale_flag = false; + + if (reshape2_op->Op()->HasAttr("X")) { + desc.SetAttr("X", reshape2_op->Op()->GetAttr("X")); + inscale_flag_x = true; + } + if (matmul_op->Op()->HasAttr("Out")) { + desc.SetAttr("Out", matmul_op->Op()->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag_x && outscale_flag); + if (!IsCompat(desc)) { LOG(WARNING) << "TrtReshape2MatmulFusePass in out mul op compat failed."; @@ -763,11 +825,24 @@ void TrtFlatten2MatmulFusePass::ApplyImpl(ir::Graph* graph) 
const { desc.SetAttr("y_num_col_dims", 1); if (matmul_op->Op()->HasAttr("enable_int8")) { desc.SetAttr("enable_int8", matmul_op->Op()->GetAttr("enable_int8")); - desc.SetAttr("X_scale", matmul_op->Op()->GetAttr("X_scale")); - desc.SetAttr("weight_scale", matmul_op->Op()->GetAttr("weight_scale")); + desc.SetAttr("Input_scale", matmul_op->Op()->GetAttr("Input_scale")); desc.SetAttr("out_threshold", matmul_op->Op()->GetAttr("out_threshold")); } + + bool inscale_flag_x = false; + bool outscale_flag = false; + + if (flatten2_op->Op()->HasAttr("X")) { + desc.SetAttr("X", flatten2_op->Op()->GetAttr("X")); + inscale_flag_x = true; + } + if (matmul_op->Op()->HasAttr("Out")) { + desc.SetAttr("Out", matmul_op->Op()->GetAttr("Out")); + outscale_flag = true; + } + desc.SetAttr("support_int8", inscale_flag_x && outscale_flag); + auto mul_node = g->CreateOpNode(&desc); IR_NODE_LINK_TO(flatten2_in_x, mul_node); IR_NODE_LINK_TO(matmul_in_y, mul_node); diff --git a/paddle/fluid/framework/multi_trainer.cc b/paddle/fluid/framework/multi_trainer.cc index 83926336cbec8..61cd7ad01696e 100644 --- a/paddle/fluid/framework/multi_trainer.cc +++ b/paddle/fluid/framework/multi_trainer.cc @@ -276,7 +276,7 @@ void MultiTrainer::Finalize() { if (communicator == nullptr) { VLOG(0) << "MultiTrainer::Finalize communicator is null!"; } else { - communicator->_worker_ptr->flush(); + communicator->_worker_ptr->Flush(); VLOG(1) << "MultiTrainer::Finalize ps client flush done"; } #endif diff --git a/paddle/fluid/framework/new_executor/data_transfer.cc b/paddle/fluid/framework/new_executor/data_transfer.cc index 1d0727b80baf7..d0e5565139c54 100644 --- a/paddle/fluid/framework/new_executor/data_transfer.cc +++ b/paddle/fluid/framework/new_executor/data_transfer.cc @@ -319,6 +319,7 @@ void ApplyDataTransform(const OpKernelType& expected_kernel_key, } } + bool transfered = false; DataTranferHelper data_transfer_helper(place, var_scope); for (auto& var_name_item : *ins_map_temp) { bool should_skip_input = @@ -334,6 +335,9 @@ void ApplyDataTransform(const OpKernelType& expected_kernel_key, if (var->IsType() || var->IsType()) { tensor_in = GetLoDTensorOrSelectedRowsValueFromVar(*var); } else if (var->IsType()) { + if (var->Get().size() == 0) { + continue; + } tensor_in = static_cast(&(var->Get()[0])); } else { @@ -389,6 +393,7 @@ void ApplyDataTransform(const OpKernelType& expected_kernel_key, } if (is_transferred) { + transfered = true; // update RuntimeContext.inputs and original op_func_node inputs op_func_node->input_index[var_name_item.first][i] = var_scope->VarId(new_var_name); @@ -426,11 +431,13 @@ void ApplyDataTransform(const OpKernelType& expected_kernel_key, } } - // NOTE(zhiqiu): UPDATE the corresponding OeratorBase to make it consistent - // with instruction. (hot fix, it is not good design here) - op_func_node->operator_base_ = - std::shared_ptr(framework::OpRegistry::CreateOp( - op_base->Type(), new_ins, new_outs, op_base->Attrs())); + if (transfered) { + // NOTE(zhiqiu): UPDATE the corresponding OeratorBase to make it consistent + // with instruction. 
(hot fix, it is not good design here) + op_func_node->operator_base_ = + std::shared_ptr(framework::OpRegistry::CreateOp( + op_base->Type(), new_ins, new_outs, op_base->Attrs())); + } op_func_node->no_data_transform_index = std::move(no_data_transform_index); } diff --git a/paddle/fluid/framework/new_executor/event_manager.cc b/paddle/fluid/framework/new_executor/event_manager.cc index cc6fd6e3ed0f9..bca2264b66afc 100644 --- a/paddle/fluid/framework/new_executor/event_manager.cc +++ b/paddle/fluid/framework/new_executor/event_manager.cc @@ -13,6 +13,7 @@ // limitations under the License. #include "paddle/fluid/framework/new_executor/event_manager.h" +#include "paddle/fluid/platform/profiler/event_tracing.h" namespace paddle { namespace framework { @@ -24,6 +25,8 @@ void WaitEvent(const Instruction& instruction, const platform::Place& place) { VLOG(3) << "Deal StreamWaitEventOrSync for " << instruction.OpBase()->Type(); for (auto& event_iter : instruction.InputEvents()) { + platform::RecordEvent record("WaitStreamEvent", + platform::TracerEventType::UserDefined, 10); VLOG(3) << "wait var_id: " << event_iter.var_id_ << " 's event with waiter_type: " << event_iter.waiter_type_; event_iter.event_->Wait(event_iter.waiter_type_, @@ -36,6 +39,8 @@ void RecordEvent(const Instruction& instruction, const platform::Place& place) { if (platform::is_cpu_place(place)) return; for (auto& event : instruction.OutputEvents()) { + platform::RecordEvent record("RecordStreamEvent", + platform::TracerEventType::UserDefined, 10); VLOG(3) << "Record event in out_var_id: " << event.var_id_; event.event_->Record(&instruction.DeviceContext()); } @@ -46,6 +51,8 @@ void RecordEvent(const Instruction& instruction) { if (platform::is_cpu_place(instruction.DeviceContext().GetPlace())) return; for (auto& event : instruction.OutputEvents()) { + platform::RecordEvent record("RecordStreamEvent", + platform::TracerEventType::UserDefined, 10); VLOG(3) << "Record event in out_var_id: " << event.var_id_; event.event_->Record(&instruction.DeviceContext()); } diff --git a/paddle/fluid/framework/new_executor/interpretercore.cc b/paddle/fluid/framework/new_executor/interpretercore.cc index e30dd21fc5c0e..20a6e53479323 100644 --- a/paddle/fluid/framework/new_executor/interpretercore.cc +++ b/paddle/fluid/framework/new_executor/interpretercore.cc @@ -300,8 +300,16 @@ void InterpreterCore::Convert( gc_event_.emplace_back(vec_instruction_[i].DeviceContext().GetPlace(), platform::GenerateDeviceEventFlag()); } + bool inplaced = false; + for (auto inst : vec_instruction_) { + if (inst.OpBase()->Type() == "share_buffer" || + inst.OpBase()->Type() == "share_data") { + VLOG(4) << "Already inplaced, skip inplace now."; + inplaced = true; + } + } - if (FLAGS_new_executor_use_inplace) { + if (FLAGS_new_executor_use_inplace && !inplaced) { BuildInplace(); } @@ -425,13 +433,18 @@ void InterpreterCore::RunInstruction(const Instruction& instr_node) { : global_scope_->GetMutableScope(); auto op_with_kernel = dynamic_cast(op); { + // If it is OperatorBase, InferShape do nothing. if (op_with_kernel != nullptr) { platform::RecordEvent infershape_event( "infer_shape", platform::TracerEventType::OperatorInner, 1, platform::EventRole::kInnerOp); - // If it is OperatorBase, InferShape do nothing. 
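The replacement lines that follow give the new executor the same escape hatch that OperatorWithKernel::RunImpl already has: an op that declares kAllKernelsMustComputeRuntimeShape computes its output shapes inside the kernel, so running its static InferShape at this point would be redundant. Stripped of the framework types, the guard reduces to the sketch below; OpWithKernelStub and the string value of the constant are illustrative assumptions, only the attribute-name constant comes from the diff.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Placeholder value; the real constant is defined by the framework.
const char kAllKernelsMustComputeRuntimeShape[] =
    "ALL_KERNELS_MUST_COMPUTE_RUNTIME_SHAPE";

// Simplified stand-in for an operator with boolean attributes and an
// InferShape routine; not the real OperatorWithKernel interface.
struct OpWithKernelStub {
  std::unordered_map<std::string, bool> bool_attrs;
  std::function<void()> infer_shape;
  bool HasAttr(const std::string& name) const {
    return bool_attrs.count(name) > 0;
  }
  bool Attr(const std::string& name) const { return bool_attrs.at(name); }
};

void MaybeInferShape(const OpWithKernelStub& op) {
  // Skip static InferShape when the op promises that every kernel computes
  // its output shapes at run time anyway.
  if (!(op.HasAttr(kAllKernelsMustComputeRuntimeShape) &&
        op.Attr(kAllKernelsMustComputeRuntimeShape))) {
    op.infer_shape();
  } else {
    std::cout << "InferShape skipped; kernel computes runtime shape\n";
  }
}

int main() {
  OpWithKernelStub op;
  op.bool_attrs[kAllKernelsMustComputeRuntimeShape] = true;
  op.infer_shape = [] { std::cout << "InferShape ran\n"; };
  MaybeInferShape(op);
  return 0;
}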
- op_with_kernel->Info().infer_shape_( - instr_node.InnerInferShapeContext().get()); + + // see OperatorWithKernel::RunImpl in operator.cc for why + if (!(op_with_kernel->HasAttr(kAllKernelsMustComputeRuntimeShape) && + op_with_kernel->Attr(kAllKernelsMustComputeRuntimeShape))) { + op_with_kernel->Info().infer_shape_( + instr_node.InnerInferShapeContext().get()); + } } } @@ -476,6 +489,8 @@ void InterpreterCore::RunInstruction(const Instruction& instr_node) { VLOG(4) << "End run " << place << " " << op->DebugStringEx(global_scope_); if (!instr_node.InplaceBackMap().empty()) { + platform::RecordEvent inplaceback_event( + "InplaceVarsBack", platform::TracerEventType::UserDefined, 10); auto& m = instr_node.InplaceBackMap(); // NOTE(zhiqiu): same logic as TransferInplaceVarsBack() in operator.cc for (auto& p : m) { @@ -501,7 +516,7 @@ void InterpreterCore::RunInstruction(const Instruction& instr_node) { } // for debug nan/inf - if (FLAGS_check_nan_inf) { + if (op_with_kernel != nullptr && FLAGS_check_nan_inf) { VLOG(4) << "Check nan/inf"; framework::details::CheckOpHasNanOrInf( *op, *global_scope_, @@ -511,6 +526,14 @@ void InterpreterCore::RunInstruction(const Instruction& instr_node) { void InterpreterCore::ExecuteInstructionList( const std::vector& vec_instr) { + unfinished_op_numer_ = vec_instr.size(); + if (unfinished_op_numer_ == 0) { + VLOG(4) << "No op to run, return"; + return; + } + + platform::RecordEvent record_prepare( + "PrepareAtomic", platform::TracerEventType::UserDefined, 1); // NOTE(zhiqiu): get the prepared deps from std::future, and async prepare // those for the next step auto atomic_deps = async_work_queue_->AtomicDeps(); @@ -518,8 +541,7 @@ void InterpreterCore::ExecuteInstructionList( async_work_queue_->PrepareAtomicDeps(dependecy_count_); async_work_queue_->PrepareAtomicVarRef(global_scope_->VecMetaInfo()); - - unfinished_op_numer_ = vec_instr.size(); + record_prepare.End(); exception_holder_.Clear(); @@ -542,10 +564,12 @@ void InterpreterCore::ExecuteInstructionList( if (exception_holder_.Type() != "EOF") { async_work_queue_->Cancel(); } + VLOG(4) << "Cancel ok"; PADDLE_ENFORCE_EQ( main_thread_blocker_.Clear(), 0, platform::errors::PreconditionNotMet( "main_thread_blocker_.Clear() return -1, clear failed")); + VLOG(4) << "clear ok"; exception_holder_.ReThrow(); } } @@ -554,12 +578,14 @@ void InterpreterCore::RunNextInstructions( const Instruction& instr, std::queue* reserved_next_ops, std::vector>* atomic_deps, std::vector>* atomic_var_ref) { + platform::RecordEvent record("RunNextInstructions", + platform::TracerEventType::UserDefined, 10); VLOG(4) << "atomic 1:" << atomic_deps; auto& next_instr = instr.NextInstructions(); auto IsReady = [atomic_deps](size_t next_id) { - VLOG(4) << "atomic:" << atomic_deps << " " << &(*atomic_deps)[next_id] - << " " << next_id; + VLOG(4) << "atomic:" << atomic_deps << " op_id: " << next_id + << ", remain deps: " << (*atomic_deps)[next_id]; return (*atomic_deps)[next_id].fetch_sub(1, std::memory_order_relaxed) == 1; }; @@ -637,15 +663,18 @@ void InterpreterCore::RunInstructionAsync( auto* op = instr_node.OpBase(); platform::RecordEvent instruction_event( op->Type(), platform::TracerEventType::Operator, 1); - interpreter::WaitEvent(instr_node, place_); try { + interpreter::WaitEvent(instr_node, place_); + RunInstruction(instr_node); #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) RecordStreamForGC(instr_node); #endif CheckGC(instr_node, atomic_var_ref); + + interpreter::RecordEvent(instr_node, place_); } catch 
(platform::EnforceNotMet& ex) { framework::InsertCallStackInfo(op->Type(), op->Attrs(), &ex); exception_holder_.Catch(std::make_exception_ptr(std::move(ex))); @@ -677,8 +706,6 @@ void InterpreterCore::RunInstructionAsync( } } - interpreter::RecordEvent(instr_node, place_); - RunNextInstructions(instr_node, &ready_ops, atomic_deps, atomic_var_ref); } } @@ -689,6 +716,8 @@ void InterpreterCore::RecordStreamForGC(const Instruction& instr) { instr.KernelType() != OpFuncType::kQueueAsync) { return; } + platform::RecordEvent record("RecordStreamForGC", + platform::TracerEventType::UserDefined, 10); gpuStream_t stream = reinterpret_cast( instr.DeviceContext()) @@ -780,6 +809,8 @@ void InterpreterCore::RecordStreamForGC(const Instruction& instr) { void InterpreterCore::CheckGC( const Instruction& instr, std::vector>* atomic_var_ref) { + platform::RecordEvent record("CheckGC", + platform::TracerEventType::UserDefined, 10); size_t instr_id = instr.Id(); auto& var_scope = *global_scope_; diff --git a/paddle/fluid/framework/new_executor/interpretercore_util.cc b/paddle/fluid/framework/new_executor/interpretercore_util.cc index d56082a91a61f..a704411f3bb71 100644 --- a/paddle/fluid/framework/new_executor/interpretercore_util.cc +++ b/paddle/fluid/framework/new_executor/interpretercore_util.cc @@ -29,6 +29,8 @@ PADDLE_DEFINE_EXPORTED_bool( new_executor_sequential_run, false, "Enable sequential execution for standalone executor, used for debug"); +DECLARE_bool(use_mkldnn); + namespace paddle { namespace framework { namespace interpreter { @@ -192,6 +194,7 @@ void create_all_ops(const framework::BlockDesc& block, const VariableNameMap& inputs_names = op->Inputs(); const VariableNameMap& outputs_names = op->Outputs(); + AttributeMap op_attr_map = op->GetAttrMap(); if (info.Checker() != nullptr) { @@ -199,6 +202,16 @@ void create_all_ops(const framework::BlockDesc& block, } auto op_base = info.Creator()(op->Type(), inputs_names, outputs_names, op_attr_map); + +#ifdef PADDLE_WITH_MKLDNN + if (FLAGS_use_mkldnn) { + if (op->HasAttr("use_mkldnn")) { + VLOG(4) << "Set use_mkldnn=True for " << op_base->Type(); + op_base->SetAttr("use_mkldnn", true); + } + } +#endif + ops->emplace_back(std::unique_ptr(op_base)); } } @@ -415,19 +428,19 @@ void build_op_func_list(const platform::Place& place, op_func_node.dev_ctx_ = dev_ctx; VLOG(3) << op_with_kernel->Type() << " : expected_kernel_key : " << expected_kernel_key; - auto exec_ctx = - ExecutionContext(*op_with_kernel, scope, *dev_ctx, runtime_context); // see OperatorWithKernel::RunImpl in operator.cc for why if (!(op->HasAttr(kAllKernelsMustComputeRuntimeShape) && op->Attr(kAllKernelsMustComputeRuntimeShape))) { InterpretercoreInferShapeContext infer_shape_ctx(*op, runtime_context); // TODO(Aurelius84): In case of control flow ops, they are NOT - // inheritted - // from OperatorWithKernel. + // inheritted from OperatorWithKernel. 
op_with_kernel->Info().infer_shape_(&infer_shape_ctx); } + auto exec_ctx = + ExecutionContext(*op_with_kernel, scope, *dev_ctx, runtime_context); + auto run_phi_kernel = false; if (phi::KernelFactory::Instance().HasCompatiblePhiKernel( op_with_kernel->Type())) { @@ -463,7 +476,6 @@ void build_op_func_list(const platform::Place& place, op_with_kernel->BuildPhiKernelContext(runtime_context, dev_ctx, &pt_kernel_context); op_func_node.pt_kernel_ = op_with_kernel->PhiKernel(); - (*op_func_node.pt_kernel_)(&pt_kernel_context); } else { auto kernels_iter = all_op_kernels.find(op->Type()); @@ -698,6 +710,7 @@ std::map> build_op_downstream_map( const std::set random_op_set = { "bernoulli", "poisson", "multinomial", "gaussian_random", "uniform_random", "randint", "randperm", "exponential"}; + int dependence_op_idx = -1; for (size_t op_idx = 0; op_idx < vec_instruction.size(); ++op_idx) { if (random_op_set.count(vec_instruction[op_idx].OpBase()->Type())) { @@ -708,6 +721,147 @@ std::map> build_op_downstream_map( } } + // add dependency for communication op + const std::string communication_op_prefix = "c_"; + dependence_op_idx = -1; + for (size_t op_idx = 0; op_idx < vec_instruction.size(); ++op_idx) { + if (vec_instruction[op_idx].OpBase()->Type().find( + communication_op_prefix) != std::string::npos) { + if (dependence_op_idx != -1) { + op2dependences[op_idx].insert(dependence_op_idx); + } + dependence_op_idx = op_idx; + } + } + + // TODO(zhiqiu): there still some cases not handled + // add dependency for c_sync_comm_stream + + // in program, we can add only one c_sync_comm_stream to sync all + // communication ops. + // c_allreduce_sum(a) + // c_allreduce_sum(b) + // c_allreduce_sum(c) + // c_sync_comm_stream(a) + const std::string kSyncComm = "c_sync_comm_stream"; + dependence_op_idx = -1; + for (size_t op_idx = 0; op_idx < vec_instruction.size(); ++op_idx) { + if (vec_instruction[op_idx].OpBase()->Type() == kSyncComm) { + dependence_op_idx = op_idx; + } else { + if (dependence_op_idx != -1) { + VLOG(4) << "Add depend from " + << vec_instruction[dependence_op_idx].OpBase()->Type() << " to " + << vec_instruction[op_idx].OpBase()->Type(); + op2dependences[op_idx].insert(dependence_op_idx); + } + } + } + + // add dependency for coalesce_tensor + const std::string kCoalesceTensor = "coalesce_tensor"; + for (size_t op_idx = 0; op_idx < vec_instruction.size(); ++op_idx) { + if (vec_instruction[op_idx].OpBase()->Type() == kCoalesceTensor) { + VLOG(4) << "Add depend for " << kCoalesceTensor << " " << op_idx; + auto fused_out = vec_instruction[op_idx].Outputs().at("FusedOutput")[0]; + auto outputs = vec_instruction[op_idx].Outputs().at("Output"); + + auto is_read = [](const Instruction& inst, int var_id) -> bool { + for (auto pair : inst.Inputs()) { + for (auto item : pair.second) { + if (item == var_id) { + return true; + } + } + } + return false; + }; + + auto is_write = [](const Instruction& inst, int var_id) -> bool { + for (auto pair : inst.Outputs()) { + for (auto item : pair.second) { + if (item == var_id) { + return true; + } + } + } + return false; + }; + + // find first op that reads fused_out + auto first_read_fused_out_op = -1; + for (auto j = op_idx + 1; j < vec_instruction.size(); ++j) { + if (is_read(vec_instruction[j], fused_out)) { + first_read_fused_out_op = j; + break; + } + } + + if (UNLIKELY(first_read_fused_out_op == -1)) { + VLOG(4) << "No op read FusedOutput"; + continue; + } + + // find ops that write 'outputs' between (op_index, + // first_read_fused_out_op) + // add depend: 
them->first_read_fused_out_op + for (auto j = op_idx + 1; + j < static_cast(first_read_fused_out_op); ++j) { + for (auto var_id : outputs) { + if (is_write(vec_instruction[j], var_id)) { + op2dependences[first_read_fused_out_op].insert(j); + VLOG(4) << j << " -> " << first_read_fused_out_op; + VLOG(4) + << "Add depend from " << vec_instruction[j].OpBase()->Type() + << " to " + << vec_instruction[first_read_fused_out_op].OpBase()->Type(); + } + } + } + + // find first op read 'outputs' between (first_read_fused_out_op, end) + // add depned: first_read_fused_out_op -> first op that reads 'outputs' + + // special case for consecutive communication ops, for example, + // FusedOutput = c_sync_calc_stream(FusedOutput) + // FusedOutput= c_allreduce_sum(FusedOutput) + // FusedOutput = c_sync_comm_stream(FusedOutput) + // we should take the last one to add depned instead of + // 'first_read_fused_out_op' + size_t target = first_read_fused_out_op; + for (size_t j = first_read_fused_out_op + 1; j < vec_instruction.size(); + ++j) { + if (j == target + 1 && + vec_instruction[target].OpBase()->Type().find( + communication_op_prefix) != std::string::npos && + vec_instruction[j].OpBase()->Type().find(communication_op_prefix) != + std::string::npos) { + VLOG(4) << "Found consecutive communication ops, " + << vec_instruction[target].OpBase()->Type() << " -> " + << vec_instruction[j].OpBase()->Type(); + target = j; + continue; + } + + for (auto var_id : outputs) { + if (is_read(vec_instruction[j], var_id)) { + op2dependences[j].insert(target); + VLOG(4) << target << " -> " << j; + VLOG(4) << "Add depend from " + << vec_instruction[target].OpBase()->Type() << " to " + << vec_instruction[j].OpBase()->Type(); + } + } + } + } + } + for (auto pair : op2dependences) { + VLOG(10) << pair.first << " Depends on " << pair.second.size(); + std::ostringstream oss; + std::copy(pair.second.begin(), pair.second.end(), + std::ostream_iterator(oss, " ")); + VLOG(10) << oss.str(); + } return std::move(get_downstream_map(op2dependences)); } diff --git a/paddle/fluid/framework/new_executor/standalone_executor_test.cc b/paddle/fluid/framework/new_executor/standalone_executor_test.cc index 8a3b40bbd76ef..fbcbb2ca23bcb 100644 --- a/paddle/fluid/framework/new_executor/standalone_executor_test.cc +++ b/paddle/fluid/framework/new_executor/standalone_executor_test.cc @@ -35,7 +35,7 @@ USE_OP_ITSELF(elementwise_add); USE_OP_ITSELF(sigmoid); USE_OP_ITSELF(tanh); USE_OP_ITSELF(elementwise_mul); -USE_OP(softmax_with_cross_entropy); +USE_OP_ITSELF(softmax_with_cross_entropy); USE_OP_ITSELF(reduce_mean); USE_OP_ITSELF(reduce_sum); USE_OP_ITSELF(reduce_sum_grad); @@ -46,7 +46,7 @@ USE_OP_ITSELF(elementwise_add_grad); USE_OP_ITSELF(matmul_grad); USE_OP_ITSELF(square); USE_OP_ITSELF(transpose2_grad); -USE_OP(concat_grad); +USE_OP_ITSELF(concat_grad); USE_OP_ITSELF(elementwise_mul_grad); USE_OP_ITSELF(sigmoid_grad); USE_OP_ITSELF(tanh_grad); @@ -67,6 +67,7 @@ PD_DECLARE_KERNEL(transpose, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(reshape, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(split, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(concat, GPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(concat_grad, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(matmul, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(add_raw, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(add, GPU, ALL_LAYOUT); @@ -83,6 +84,8 @@ PD_DECLARE_KERNEL(max_raw, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(sgd, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(slice, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(slice_grad, GPU, ALL_LAYOUT); 
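build_op_downstream_map above gains three kinds of implicit edges on top of the data dependencies: consecutive c_* communication ops are serialized against each other, every op after a c_sync_comm_stream is made to depend on it, and coalesce_tensor gets extra edges so writes to its fused outputs cannot be reordered around the first reader of FusedOutput. The serialization part reduces to the scan below; instructions are represented only by their op-type strings, and op2dependences is keyed by instruction index as in the diff.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Each instruction is represented only by its op type.
using Instructions = std::vector<std::string>;

// Chain every op whose type starts with the given prefix after the previous
// one, so ops that share a stream stay in program order even without a data
// dependency between them.
std::map<int, std::set<int>> SerializeByPrefix(const Instructions& instrs,
                                               const std::string& prefix) {
  std::map<int, std::set<int>> op2dependences;
  int dependence_op_idx = -1;
  for (int op_idx = 0; op_idx < static_cast<int>(instrs.size()); ++op_idx) {
    if (instrs[op_idx].rfind(prefix, 0) == 0) {  // type starts with prefix
      if (dependence_op_idx != -1) {
        op2dependences[op_idx].insert(dependence_op_idx);
      }
      dependence_op_idx = op_idx;
    }
  }
  return op2dependences;
}

int main() {
  Instructions program = {"c_allreduce_sum", "elementwise_add",
                          "c_allreduce_sum", "c_sync_comm_stream"};
  for (const auto& entry : SerializeByPrefix(program, "c_")) {
    for (int dep : entry.second) {
      std::cout << "op " << entry.first << " depends on op " << dep << "\n";
    }
  }
  return 0;
}

For the program in main(), the second c_allreduce_sum depends on the first, and c_sync_comm_stream depends on the second c_allreduce_sum, which is exactly the ordering the pass wants to preserve.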
+PD_DECLARE_KERNEL(cross_entropy_with_softmax, GPU, ALL_LAYOUT); +PD_DECLARE_KERNEL(cross_entropy_with_softmax_grad, GPU, ALL_LAYOUT); PD_DECLARE_KERNEL(sqrt, GPU, ALL_LAYOUT); DECLARE_double(eager_delete_tensor_gb); diff --git a/paddle/fluid/framework/new_executor/workqueue/nonblocking_threadpool.h b/paddle/fluid/framework/new_executor/workqueue/nonblocking_threadpool.h index bc65231abe737..384498584c66a 100644 --- a/paddle/fluid/framework/new_executor/workqueue/nonblocking_threadpool.h +++ b/paddle/fluid/framework/new_executor/workqueue/nonblocking_threadpool.h @@ -408,8 +408,8 @@ class ThreadPoolTempl { ec_.Notify(true); return false; } - platform::RecordEvent("SleepWaitForWork", - platform::TracerEventType::UserDefined, 10); + platform::RecordEvent record("WaitForWork", + platform::TracerEventType::UserDefined, 10); ec_.CommitWait(waiter); blocked_--; return true; diff --git a/paddle/fluid/framework/new_executor/workqueue/thread_data_registry.h b/paddle/fluid/framework/new_executor/workqueue/thread_data_registry.h index ffdddc39a31e3..98ed2c1ffc4b3 100644 --- a/paddle/fluid/framework/new_executor/workqueue/thread_data_registry.h +++ b/paddle/fluid/framework/new_executor/workqueue/thread_data_registry.h @@ -60,18 +60,51 @@ class ThreadDataRegistry { } private: - // types +// types +// Lock types +#if defined(__clang__) || defined(__GNUC__) // CLANG or GCC +#ifndef __APPLE__ +#if __cplusplus >= 201703L + using LockType = std::shared_mutex; + using SharedLockGuardType = std::shared_lock; +#elif __cplusplus >= 201402L using LockType = std::shared_timed_mutex; + using SharedLockGuardType = std::shared_lock; +#else + using LockType = std::mutex; + using SharedLockGuardType = std::lock_guard; +#endif +// Special case : mac. https://github.com/facebook/react-native/issues/31250 +#else + using LockType = std::mutex; + using SharedLockGuardType = std::lock_guard; +#endif +#elif defined(_MSC_VER) // MSVC +#if _MSVC_LANG >= 201703L + using LockType = std::shared_mutex; + using SharedLockGuardType = std::shared_lock; +#elif _MSVC_LANG >= 201402L + using LockType = std::shared_timed_mutex; + using SharedLockGuardType = std::shared_lock; +#else + using LockType = std::mutex; + using SharedLockGuardType = std::lock_guard; +#endif +#else // other compilers + using LockType = std::mutex; + using SharedLockGuardType = std::lock_guard; +#endif + class ThreadDataHolder; class ThreadDataRegistryImpl { public: void RegisterData(uint64_t tid, ThreadDataHolder* tls_obj) { - std::lock_guard lock(lock_); + std::lock_guard guard(lock_); tid_map_[tid] = tls_obj; } void UnregisterData(uint64_t tid) { - std::lock_guard lock(lock_); + std::lock_guard guard(lock_); tid_map_.erase(tid); } @@ -79,7 +112,7 @@ class ThreadDataRegistry { std::is_copy_constructible::value>> std::unordered_map GetAllThreadDataByValue() { std::unordered_map data_copy; - std::shared_lock lock(lock_); + SharedLockGuardType guard(lock_); data_copy.reserve(tid_map_.size()); for (auto& kv : tid_map_) { data_copy.emplace(kv.first, kv.second->GetData()); @@ -90,7 +123,7 @@ class ThreadDataRegistry { std::unordered_map> GetAllThreadDataByRef() { std::unordered_map> data_ref; - std::shared_lock lock(lock_); + SharedLockGuardType guard(lock_); data_ref.reserve(tid_map_.size()); for (auto& kv : tid_map_) { data_ref.emplace(kv.first, std::ref(kv.second->GetData())); diff --git a/paddle/fluid/framework/new_executor/workqueue/workqueue.cc b/paddle/fluid/framework/new_executor/workqueue/workqueue.cc index 881878ebb12a7..b8dfcad187ca0 100644 --- 
a/paddle/fluid/framework/new_executor/workqueue/workqueue.cc +++ b/paddle/fluid/framework/new_executor/workqueue/workqueue.cc @@ -55,8 +55,9 @@ class WorkQueueImpl : public WorkQueue { } void AddTask(std::function fn) override { - platform::RecordEvent("WorkQueue::AddTask", - platform::TracerEventType::UserDefined, 10 /*level*/); + platform::RecordEvent record("WorkQueue::AddTask", + platform::TracerEventType::UserDefined, + 10 /*level*/); if (tracker_ != nullptr) { fn = [ task = std::move(fn), raii = CounterGuard(tracker_) @@ -146,8 +147,9 @@ WorkQueueGroupImpl::~WorkQueueGroupImpl() { } void WorkQueueGroupImpl::AddTask(size_t queue_idx, std::function fn) { - platform::RecordEvent("WorkQueue::AddTask", - platform::TracerEventType::UserDefined, 10 /*level*/); + platform::RecordEvent record("WorkQueue::AddTask", + platform::TracerEventType::UserDefined, + 10 /*level*/); assert(queue_idx < queues_.size()); if (queues_options_.at(queue_idx).track_task) { fn = [ diff --git a/paddle/fluid/framework/op_desc.cc b/paddle/fluid/framework/op_desc.cc index a02466c04e913..f31fefcfade89 100644 --- a/paddle/fluid/framework/op_desc.cc +++ b/paddle/fluid/framework/op_desc.cc @@ -777,10 +777,17 @@ void OpDesc::CheckAttrs() { checker->Check(&attrs_); } -void OpDesc::InferShape(const BlockDesc &block) const { +void OpDesc::InferShape(const BlockDesc &block) { try { VLOG(3) << "CompileTime infer shape on " << Type(); - auto &infer_shape = OpInfoMap::Instance().Get(this->Type()).infer_shape_; + auto &op_info = OpInfoMap::Instance().Get(this->Type()); + auto *checker = op_info.Checker(); + if (checker != nullptr) { + // set dafault value here + VLOG(10) << "begin to check attribute of " << Type(); + checker->Check(&attrs_); + } + auto &infer_shape = op_info.infer_shape_; PADDLE_ENFORCE_EQ( static_cast(infer_shape), true, platform::errors::NotFound( diff --git a/paddle/fluid/framework/op_desc.h b/paddle/fluid/framework/op_desc.h index 82e15d40bee78..0afe6796dad7a 100644 --- a/paddle/fluid/framework/op_desc.h +++ b/paddle/fluid/framework/op_desc.h @@ -142,7 +142,7 @@ class OpDesc { void CheckAttrs(); - void InferShape(const BlockDesc &block) const; + void InferShape(const BlockDesc &block); void InferVarType(BlockDesc *block) const; diff --git a/paddle/fluid/framework/operator.cc b/paddle/fluid/framework/operator.cc index 4183360f655a7..6af07caaf88b2 100644 --- a/paddle/fluid/framework/operator.cc +++ b/paddle/fluid/framework/operator.cc @@ -1120,6 +1120,56 @@ static void CheckTensorNANOrInf(const std::string& op_type, op_type, name)); } +bool OperatorWithKernel::SupportGPU() const { + auto phi_kernels = phi::KernelFactory::Instance().SelectKernelMap( + phi::TransToPhiKernelName(type_)); + auto has_phi_kernel = + std::any_of(phi_kernels.begin(), phi_kernels.end(), + [](phi::KernelKeyMap::const_reference kern_pair) { + return kern_pair.first.backend() == phi::Backend::GPU; + }); + if (has_phi_kernel) { + return true; + } else { + auto kernel_iter = OperatorWithKernel::AllOpKernels().find(type_); + if (kernel_iter == OperatorWithKernel::AllOpKernels().end()) { + return false; + } else { + auto& op_kernels = kernel_iter->second; + return std::any_of( + op_kernels.begin(), op_kernels.end(), + [](OpKernelMap::const_reference kern_pair) { + return platform::is_gpu_place(kern_pair.first.place_); + }); + } + } +} + +bool OperatorWithKernel::SupportNPU() const { + auto phi_kernels = phi::KernelFactory::Instance().SelectKernelMap( + phi::TransToPhiKernelName(type_)); + auto has_phi_kernel = + 
std::any_of(phi_kernels.begin(), phi_kernels.end(), + [](phi::KernelKeyMap::const_reference kern_pair) { + return kern_pair.first.backend() == phi::Backend::NPU; + }); + if (has_phi_kernel) { + return true; + } else { + auto kernel_iter = OperatorWithKernel::AllOpKernels().find(type_); + if (kernel_iter == OperatorWithKernel::AllOpKernels().end()) { + return false; + } else { + auto& op_kernels = kernel_iter->second; + return std::any_of( + op_kernels.begin(), op_kernels.end(), + [](OpKernelMap::const_reference kern_pair) { + return platform::is_npu_place(kern_pair.first.place_); + }); + } + } +} + bool OperatorWithKernel::SupportsMKLDNN( const proto::VarType::Type data_type) const { auto op_kernel_iter = OperatorWithKernel::AllOpKernels().find(type_); @@ -1243,16 +1293,54 @@ void OperatorWithKernel::RunImpl(const Scope& scope, } } else { pt_kernel_name = pt_kernel_signature_->name; +// NOTE(Liu-xiandong): The register kernel used KP have library_type[KP], +// But the default library_type is Plain, so we need to modify the +// library_type here, otherwise it can't work. +#ifdef PADDLE_WITH_XPU_KP + if (paddle::platform::is_xpu_place(kernel_type_->place_)) { + bool use_xpu_kp_kernel_rt = + FLAGS_run_kp_kernel && + paddle::platform::is_xpu_kp_support_op(type_, *kernel_type_); + bool use_xpu_kp_kernel_debug = + paddle::platform::is_in_xpu_kpwhite_list(type_); + if (use_xpu_kp_kernel_rt) { + VLOG(3) << "phi xpu_kp using rt mode in static graph"; + } + if (use_xpu_kp_kernel_debug) { + VLOG(3) << "phi xpu_kp using debug mode in static graph"; + } + bool is_xpu_kp_support = + (use_xpu_kp_kernel_rt || use_xpu_kp_kernel_debug); + if (is_xpu_kp_support) { + auto expected_kernel_key_library_type = kernel_type_->library_type_; + kernel_type_->library_type_ = LibraryType::kKP; + VLOG(3) << "modifing XPU KP kernel in static graph: " << type_ + << ", using_kernel_key:" << *kernel_type_.get(); + auto try_pt_kernel_key = + TransOpKernelTypeToPhiKernelKey(*kernel_type_.get()); + if (!phi::KernelFactory::Instance().IsSelectKernelValid( + pt_kernel_name, try_pt_kernel_key)) { + kernel_type_->library_type_ = expected_kernel_key_library_type; + VLOG(3) << "modify XPU KP kernel in static graph: " << type_ + << " is failed " << *kernel_type_.get(); + } + } + } +#endif pt_kernel_key = TransOpKernelTypeToPhiKernelKey(*kernel_type_.get()); } -#ifdef PADDLE_WITH_XPU + +// NOTE(Liu-xiandong): Determine whether the selected kernel is valid +// If not, use the kernel registered in fluid. And if the fluid do not +// contains the related heterogeneous kernel, use phi CPU kernel. +#if defined(PADDLE_WITH_XPU) && !defined(PADDLE_WITH_XPU_KP) bool is_xpu_unsupport = paddle::platform::is_xpu_place(kernel_type_->place_) && !paddle::platform::is_xpu_support_op(type_, *kernel_type_.get()) || paddle::platform::is_in_xpu_black_list(type_); #endif if (pt_kernel_->IsValid() -#ifdef PADDLE_WITH_XPU +#if defined(PADDLE_WITH_XPU) && !defined(PADDLE_WITH_XPU_KP) && !is_xpu_unsupport #endif ) { @@ -1260,10 +1348,29 @@ void OperatorWithKernel::RunImpl(const Scope& scope, } else { auto& all_op_kernels = AllOpKernels(); auto kernels_iter = all_op_kernels.find(type_); + +// NOTE(Liu-xiandong): If we can't find heterogeneous kernel in phi, +// we need to select the heterogeneous kernel in fluid, but the kernel +// registered in KP use library_type[KP], we need to modify it. 
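SupportGPU and SupportNPU above move out of operator.h and now share a two-step probe: ask the phi kernel registry whether any kernel for this op targets the backend, and only if not, scan the fluid OpKernel map for a matching place. Generalized over the backend, the check is the sketch below; the registries here are simplified stand-ins keyed by op type, not the phi KernelFactory or the fluid AllOpKernels map.

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

enum class Backend { CPU, GPU, NPU };

// Stand-in registries: op type -> backends that have a registered kernel.
using KernelRegistry = std::unordered_map<std::string, std::vector<Backend>>;

bool SupportsBackend(const std::string& op_type, Backend backend,
                     const KernelRegistry& phi_kernels,
                     const KernelRegistry& fluid_kernels) {
  auto has_kernel_for = [&](const KernelRegistry& registry) {
    auto it = registry.find(op_type);
    if (it == registry.end()) return false;
    return std::any_of(it->second.begin(), it->second.end(),
                       [&](Backend b) { return b == backend; });
  };
  // Ask phi first, then fall back to the fluid kernel map, matching the
  // order used by SupportGPU/SupportNPU above.
  return has_kernel_for(phi_kernels) || has_kernel_for(fluid_kernels);
}

int main() {
  KernelRegistry phi = {{"matmul", {Backend::CPU, Backend::GPU}}};
  KernelRegistry fluid = {{"lookup_table", {Backend::CPU, Backend::NPU}}};
  std::cout << SupportsBackend("matmul", Backend::GPU, phi, fluid) << "\n";        // 1
  std::cout << SupportsBackend("lookup_table", Backend::GPU, phi, fluid) << "\n";  // 0
  return 0;
}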
+#ifdef PADDLE_WITH_XPU_KP + bool use_xpu_kp_kernel_rt = + paddle::platform::is_xpu_place(kernel_type_->place_) && + FLAGS_run_kp_kernel && + paddle::platform::is_xpu_kp_support_op(type_, *kernel_type_); + bool use_xpu_kp_kernel_debug = + paddle::platform::is_xpu_place(kernel_type_->place_) && + paddle::platform::is_in_xpu_kpwhite_list(type_); + bool is_xpu_kp_support = + (use_xpu_kp_kernel_rt || use_xpu_kp_kernel_debug); + if (is_xpu_kp_support) { + kernel_type_->library_type_ = LibraryType::kKP; + } +#endif + if (kernels_iter == all_op_kernels.end() || kernels_iter->second.find(*kernel_type_.get()) == kernels_iter->second.end() -#ifdef PADDLE_WITH_XPU +#if defined(PADDLE_WITH_XPU) && !defined(PADDLE_WITH_XPU_KP) || is_xpu_unsupport #endif ) { @@ -1502,10 +1609,22 @@ void OperatorWithKernel::ChooseKernel(const ExecutionContext& ctx) const { } bool is_xpu_kp_support = (use_xpu_kp_kernel_rt || use_xpu_kp_kernel_debug); if (is_xpu_kp_support) { + auto cache_expected_kernel_key_library_type = + expected_kernel_key.library_type_; expected_kernel_key.library_type_ = LibraryType::kKP; kernel_iter = kernels.find(expected_kernel_key); - VLOG(3) << "using XPU KP kernel: " << type_ - << ", using_kernel_key:" << expected_kernel_key; + // if can't find corresponding kernel when is_xpu_kp_support is on + // if the fluid do not register related kernel, it can't work and hava + // error as before + if (kernel_iter == kernels.end()) { + expected_kernel_key.library_type_ = + cache_expected_kernel_key_library_type; + expected_kernel_key.place_ = platform::CPUPlace(); + kernel_iter = kernels.find(expected_kernel_key); + } else { + VLOG(3) << "using XPU KP kernel: " << type_ + << ", using_kernel_key:" << expected_kernel_key; + } } bool is_xpu_unsupport = (!paddle::platform::is_xpu_support_op(type_, expected_kernel_key) || @@ -1550,6 +1669,17 @@ void OperatorWithKernel::ChooseKernel(const ExecutionContext& ctx) const { expected_kernel_key.place_ = platform::CPUPlace(); kernel_iter = kernels.find(expected_kernel_key); } +#endif +#ifdef PADDLE_WITH_CUSTOM_DEVICE + if (kernel_iter == kernels.end() && + platform::is_custom_place(expected_kernel_key.place_)) { + VLOG(3) << "missing " << expected_kernel_key.place_.GetDeviceType() + << " kernel: " << type_ + << ", expected_kernel_key:" << expected_kernel_key + << ", fallbacking to CPU one!"; + expected_kernel_key.place_ = platform::CPUPlace(); + kernel_iter = kernels.find(expected_kernel_key); + } #endif PADDLE_ENFORCE_NE(kernel_iter, kernels.end(), platform::errors::NotFound( @@ -2046,6 +2176,16 @@ Scope* OperatorWithKernel::PreparePhiData( if (!new_scope) { new_scope = &scope.NewScope(); } + // For inference, if a gpu model has an op which could only run on CPU, + // each result of different input will be the same with the first one. + // The reason is that if a gpu tensor is the input of a cpu kernel, + // we will create a new cpu tensor in new scope. + // However, if enable_cache_runtime_context_, we get the cpu tensor each + // time, not the gpu tensor. Thus, we set pre_scope_ = nullptr + // to trigger `new RuntimeContext()` in RunImpl(). 
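The ChooseKernel change above and the new PADDLE_WITH_CUSTOM_DEVICE branch implement the same recovery: when the kernel map has no entry for the expected device key, rewrite the place to CPUPlace and look again instead of failing outright (FallBackToCpu in phi_utils.cc, further down, does the equivalent for phi kernels). A condensed version of that retry, using a hypothetical registry keyed by (op type, place):

#include <iostream>
#include <map>
#include <string>
#include <utility>

enum class Place { CPU, GPU, CustomDevice };

// Hypothetical registry: (op type, place) -> kernel name.
using KernelMap = std::map<std::pair<std::string, Place>, std::string>;

// Try the expected place first; when nothing is registered there, retry on
// CPU instead of failing. Returns an empty string if no kernel exists at all.
std::string ChooseKernel(const KernelMap& kernels, const std::string& op_type,
                         Place* expected_place) {
  auto it = kernels.find({op_type, *expected_place});
  if (it == kernels.end() && *expected_place != Place::CPU) {
    std::cout << "missing " << op_type
              << " kernel on the expected device, falling back to CPU\n";
    *expected_place = Place::CPU;  // data transfer will follow the new place
    it = kernels.find({op_type, *expected_place});
  }
  return it == kernels.end() ? std::string() : it->second;
}

int main() {
  KernelMap kernels = {{{"lod_reset", Place::CPU}, "lod_reset_cpu_kernel"}};
  Place place = Place::CustomDevice;
  std::cout << ChooseKernel(kernels, "lod_reset", &place) << "\n";
  return 0;
}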
+ if (enable_cache_runtime_context_) { + pre_scope_ = nullptr; + } // Create new var with the same name in transfer scopes auto* trans_var = new_scope->Var(name_vec[offset]); @@ -2106,7 +2246,11 @@ void OperatorWithKernel::BuildPhiKernelContext( typeid(paddle::optional)) || input_defs[i].type_index == std::type_index( - typeid(paddle::optional)))) { + typeid(paddle::optional)) || + input_defs[i].type_index == + std::type_index( + typeid(paddle::optional< + const std::vector>)))) { pt_kernel_context->EmplaceBackInputWithoutSetRange(nullptr); auto end_idx = start_idx + 1; pt_kernel_context->AssignInputRange(std::make_pair(start_idx, end_idx), @@ -2368,6 +2512,10 @@ void OperatorWithKernel::BuildPhiKernelContext( std::type_index(typeid(std::vector))) { pt_kernel_context->EmplaceBackAttr( BOOST_GET_CONST(std::vector, attr_it->second)); + } else if (attr_defs[i].type_index == + std::type_index(typeid(std::vector))) { + pt_kernel_context->EmplaceBackAttr( + BOOST_GET_CONST(std::vector, attr_it->second)); } else { PADDLE_THROW(platform::errors::Unimplemented( "Unsupported cast op attribute `%s` when construct " diff --git a/paddle/fluid/framework/operator.h b/paddle/fluid/framework/operator.h index ce22f09944778..f7fc83f1d6d30 100644 --- a/paddle/fluid/framework/operator.h +++ b/paddle/fluid/framework/operator.h @@ -560,39 +560,10 @@ class OperatorWithKernel : public OperatorBase { return g_all_op_kernels; } - bool SupportGPU() const override { - auto phi_kernels = phi::KernelFactory::Instance().SelectKernelMap( - phi::TransToPhiKernelName(type_)); - auto has_phi_kernel = - std::any_of(phi_kernels.begin(), phi_kernels.end(), - [](phi::KernelKeyMap::const_reference kern_pair) { - return kern_pair.first.backend() == phi::Backend::GPU; - }); - if (has_phi_kernel) { - return true; - } else { - auto kernel_iter = OperatorWithKernel::AllOpKernels().find(type_); - if (kernel_iter == OperatorWithKernel::AllOpKernels().end()) { - return false; - } else { - auto& op_kernels = kernel_iter->second; - return std::any_of( - op_kernels.begin(), op_kernels.end(), - [](OpKernelMap::const_reference kern_pair) { - return platform::is_gpu_place(kern_pair.first.place_); - }); - } - } - } + bool SupportGPU() const override; + + bool SupportNPU() const override; - bool SupportNPU() const override { - // TODO(zhiqiu): support phi if needed? - auto& op_kernels = OperatorWithKernel::AllOpKernels().at(type_); - return std::any_of(op_kernels.begin(), op_kernels.end(), - [](OpKernelMap::const_reference kern_pair) { - return platform::is_npu_place(kern_pair.first.place_); - }); - } bool SupportMLU() const override { // TODO(zhiqiu): support phi if needed? 
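The BuildPhiKernelContext hunks above extend two type_index-driven dispatch chains, one for optional inputs and one for attributes, with additional std::vector and paddle::optional specializations. The mechanism itself is a plain comparison of std::type_index values, as in this toy version; Attribute and AttrContext are illustrative stand-ins, not the fluid Attribute variant or the phi KernelContext.

#include <iostream>
#include <string>
#include <typeindex>
#include <variant>
#include <vector>

// Toy attribute storage and kernel-context stand-in.
using Attribute = std::variant<int, float, std::vector<int>>;

struct AttrContext {
  void EmplaceBackAttr(int v) { std::cout << "int attr " << v << "\n"; }
  void EmplaceBackAttr(float v) { std::cout << "float attr " << v << "\n"; }
  void EmplaceBackAttr(const std::vector<int>& v) {
    std::cout << "vector<int> attr of size " << v.size() << "\n";
  }
};

// Dispatch on the type the kernel signature expects, converting the stored
// attribute accordingly; this mirrors the shape of the if/else chain above.
void PushAttr(const std::type_index& expected_type, const Attribute& attr,
              AttrContext* ctx) {
  if (expected_type == std::type_index(typeid(int))) {
    ctx->EmplaceBackAttr(std::get<int>(attr));
  } else if (expected_type == std::type_index(typeid(float))) {
    ctx->EmplaceBackAttr(std::get<float>(attr));
  } else if (expected_type == std::type_index(typeid(std::vector<int>))) {
    ctx->EmplaceBackAttr(std::get<std::vector<int>>(attr));
  } else {
    std::cout << "unsupported attribute type\n";  // the real code throws
  }
}

int main() {
  AttrContext ctx;
  PushAttr(std::type_index(typeid(std::vector<int>)),
           Attribute(std::vector<int>{3, 4, 5}), &ctx);
  return 0;
}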
auto& op_kernels = OperatorWithKernel::AllOpKernels().at(type_); diff --git a/paddle/fluid/framework/parallel_executor.cc b/paddle/fluid/framework/parallel_executor.cc index 5b913ff2d21de..b088a535a1232 100644 --- a/paddle/fluid/framework/parallel_executor.cc +++ b/paddle/fluid/framework/parallel_executor.cc @@ -916,6 +916,8 @@ void ParallelExecutor::BCastParamsToDevices( FetchResultType ParallelExecutor::Run( const std::vector &fetch_tensors, bool return_merged) { + platform::RecordEvent record_run("ParallelExecutor::Run", + platform::TracerEventType::UserDefined, 1); VLOG(3) << "enter ParallelExecutor Run"; #ifdef PADDLE_WITH_CUDA if (platform::IsCUDAGraphCapturing()) { diff --git a/paddle/fluid/framework/phi_utils.cc b/paddle/fluid/framework/phi_utils.cc index 14997dd961013..8e6f082da1026 100644 --- a/paddle/fluid/framework/phi_utils.cc +++ b/paddle/fluid/framework/phi_utils.cc @@ -87,7 +87,7 @@ phi::KernelKey TransOpKernelTypeToPhiKernelKey( } else if (kernel_type.library_type_ == LibraryType::kKP) { backend = phi::Backend::KPS; } else { - // do + // do nothing } paddle::experimental::DataLayout layout = kernel_type.data_layout_; paddle::experimental::DataType dtype = @@ -102,7 +102,7 @@ phi::KernelKey FallBackToCpu(const OpKernelType& expected_kernel_key, if (platform::is_xpu_place(expected_kernel_key.place_) || paddle::platform::is_in_xpu_black_list(op.Type())) { VLOG(3) << "phi missing XPU kernel: " << op.Type() - << "phipected_kernel_key:" << expected_kernel_key + << ", phipected_kernel_key:" << expected_kernel_key << ", fallbacking to CPU one!"; return phi::KernelKey(phi::Backend::CPU, kernel_key.layout(), kernel_key.dtype()); @@ -111,7 +111,7 @@ phi::KernelKey FallBackToCpu(const OpKernelType& expected_kernel_key, #ifdef PADDLE_WITH_ASCEND_CL if (platform::is_npu_place(expected_kernel_key.place_)) { VLOG(3) << "phi missing NPU kernel: " << op.Type() - << "phipected_kernel_key:" << expected_kernel_key + << ", phipected_kernel_key:" << expected_kernel_key << ", fallbacking to CPU one!"; return phi::KernelKey(phi::Backend::CPU, kernel_key.layout(), kernel_key.dtype()); @@ -120,7 +120,7 @@ phi::KernelKey FallBackToCpu(const OpKernelType& expected_kernel_key, #ifdef PADDLE_WITH_MLU if (platform::is_mlu_place(expected_kernel_key.place_)) { VLOG(3) << "phi missing MLU kernel: " << op.Type() - << "phipected_kernel_key:" << expected_kernel_key + << ", phipected_kernel_key:" << expected_kernel_key << ", fallbacking to CPU one!"; return phi::KernelKey(phi::Backend::CPU, kernel_key.layout(), kernel_key.dtype()); @@ -128,8 +128,18 @@ phi::KernelKey FallBackToCpu(const OpKernelType& expected_kernel_key, #endif #ifdef PADDLE_WITH_IPU if (platform::is_ipu_place(expected_kernel_key.place_)) { - VLOG(3) << "pten missing IPU kernel: " << op.Type() - << ", expected_kernel_key:" << expected_kernel_key + VLOG(3) << "phi missing IPU kernel: " << op.Type() + << ", phipected_kernel_key:" << expected_kernel_key + << ", fallbacking to CPU one!"; + return phi::KernelKey(phi::Backend::CPU, kernel_key.layout(), + kernel_key.dtype()); + } +#endif +#ifdef PADDLE_WITH_CUSTOM_DEVICE + if (platform::is_custom_place(expected_kernel_key.place_)) { + VLOG(3) << "phi missing " << expected_kernel_key.place_.GetDeviceType() + << " kernel: " << op.Type() + << ", phipected_kernel_key:" << expected_kernel_key << ", fallbacking to CPU one!"; return phi::KernelKey(phi::Backend::CPU, kernel_key.layout(), kernel_key.dtype()); diff --git a/paddle/fluid/imperative/basic_engine.cc b/paddle/fluid/imperative/basic_engine.cc index 
d7478b18dba06..ce3c5dd2fe562 100644 --- a/paddle/fluid/imperative/basic_engine.cc +++ b/paddle/fluid/imperative/basic_engine.cc @@ -30,6 +30,7 @@ #include "paddle/fluid/imperative/op_base.h" #include "paddle/fluid/imperative/tracer.h" #include "paddle/fluid/platform/profiler.h" +#include "paddle/phi/kernels/autotune/switch_autotune.h" #include "paddle/phi/kernels/funcs/math_function.h" DECLARE_bool(sort_sum_gradient); @@ -645,6 +646,8 @@ void BasicEngine::Execute() { Clear(); VLOG(1) << "Backward op number: " << op_num; + + phi::autotune::AutoTuneStatus::Instance().Update(); } void BasicEngine::Clear() { diff --git a/paddle/fluid/imperative/partial_grad_engine.cc b/paddle/fluid/imperative/partial_grad_engine.cc index 56ddbf3386198..f2f64d92a23fc 100644 --- a/paddle/fluid/imperative/partial_grad_engine.cc +++ b/paddle/fluid/imperative/partial_grad_engine.cc @@ -826,6 +826,8 @@ std::vector> PartialGradTask::Run() { } void PartialGradTask::RunEachOp(OpBase *op) { + platform::RecordEvent op_type_record_event( + op->Type() + " grad trace_op", platform::TracerEventType::Operator, 1); // Prepare new inputs NameVarMap tmp_ins; for (auto &input_pair : op->GetInsMap()) { @@ -908,7 +910,6 @@ void PartialGradTask::RunEachOp(OpBase *op) { // Run op OpBase::Run(op->InnerOp(), tmp_ins, tmp_outs, op->Attrs(), op->DefaultAttrsMap(), op->place()); - if (create_graph_) { auto double_grad_node = CreateGradOpNode(op->InnerOp(), tmp_ins, tmp_outs, op->Attrs(), diff --git a/paddle/fluid/imperative/prepared_operator.cc b/paddle/fluid/imperative/prepared_operator.cc index f7d2ef1bf5d42..b56d113937d69 100644 --- a/paddle/fluid/imperative/prepared_operator.cc +++ b/paddle/fluid/imperative/prepared_operator.cc @@ -174,7 +174,9 @@ PreparedOp PrepareImpl(const NameVarMap& ins, VLOG(6) << pt_kernel_signature; pt_kernel_name = pt_kernel_signature.name; -// modify the expected_kernel_key for KP in phi +// NOTE(Liu-xiandong): The register kernel used KP have library_type[KP], +// But the default library_type is Plain, so we need to modify the +// library_type here, otherwise it can't work. 
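The note that ends this hunk, and the code that follows it, repeat in the dygraph path what RunImpl does in operator.cc: kernels registered through the KP route carry LibraryType::kKP, so the expected kernel key is temporarily switched to kKP, validated against the kernel factory, and restored when no such kernel actually exists. Distilled, the probe-and-revert looks like this; KernelKey, the registry, and IsSelectKernelValid are simplified stand-ins for the real key and factory types.

#include <iostream>
#include <set>
#include <string>
#include <tuple>

enum class LibraryType { kPlain, kMKLDNN, kKP };

// Simplified kernel key: op type plus library type.
struct KernelKey {
  std::string op_type;
  LibraryType library;
  bool operator<(const KernelKey& other) const {
    return std::tie(op_type, library) < std::tie(other.op_type, other.library);
  }
};

// Stand-in for "is a kernel registered under this key?".
bool IsSelectKernelValid(const std::set<KernelKey>& registry,
                         const KernelKey& key) {
  return registry.count(key) > 0;
}

// Switch the key to the KP library type, but keep it only when a KP kernel
// really exists; otherwise revert to the original library type.
KernelKey MaybeUseKPKernel(const std::set<KernelKey>& registry,
                           KernelKey expected, bool kp_supported) {
  if (!kp_supported) return expected;
  const LibraryType original = expected.library;
  expected.library = LibraryType::kKP;
  if (!IsSelectKernelValid(registry, expected)) {
    expected.library = original;  // fall back to the plain kernel
  }
  return expected;
}

int main() {
  std::set<KernelKey> registry = {{"elementwise_add", LibraryType::kKP},
                                  {"reshape2", LibraryType::kPlain}};
  KernelKey key = MaybeUseKPKernel(registry, {"reshape2", LibraryType::kPlain},
                                   /*kp_supported=*/true);
  std::cout << (key.library == LibraryType::kKP ? "KP" : "Plain") << "\n";
  return 0;
}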
#ifdef PADDLE_WITH_XPU_KP if (paddle::platform::is_xpu_place(expected_kernel_key.place_)) { bool use_xpu_kp_kernel_rt = @@ -191,12 +193,23 @@ PreparedOp PrepareImpl(const NameVarMap& ins, bool is_xpu_kp_support = (use_xpu_kp_kernel_rt || use_xpu_kp_kernel_debug); if (is_xpu_kp_support) { + auto expected_kernel_key_library_type = + expected_kernel_key.library_type_; expected_kernel_key.library_type_ = paddle::framework::LibraryType::kKP; - VLOG(3) << "modify XPU KP kernel: " << op.Type() + VLOG(3) << "modifing XPU KP kernel: " << op.Type() << ", using_kernel_key:" << expected_kernel_key; + phi::KernelKey try_pt_kernel_key = + TransOpKernelTypeToPhiKernelKey(expected_kernel_key); + if (!phi::KernelFactory::Instance().IsSelectKernelValid( + pt_kernel_name, try_pt_kernel_key)) { + expected_kernel_key.library_type_ = expected_kernel_key_library_type; + VLOG(3) << "modify XPU KP kernel: " << op.Type() << " is failed " + << expected_kernel_key; + } } } #endif + pt_kernel_key = TransOpKernelTypeToPhiKernelKey(expected_kernel_key); auto pt_kernel = phi::KernelFactory::Instance().SelectKernel(pt_kernel_name, pt_kernel_key); @@ -227,6 +240,23 @@ PreparedOp PrepareImpl(const NameVarMap& ins, auto& all_op_kernels = op.AllOpKernels(); auto kernels_iter = all_op_kernels.find(op.Type()); +// NOTE(Liu-xiandong): If we can't find heterogeneous kernel in phi, +// we need to select the heterogeneous kernel in fluid, but the kernel +// registered in KP use library_type[KP], we need to modify it. +#ifdef PADDLE_WITH_XPU_KP + bool use_xpu_kp_kernel_rt = + paddle::platform::is_xpu_place(expected_kernel_key.place_) && + FLAGS_run_kp_kernel && + paddle::platform::is_xpu_kp_support_op(op.Type(), expected_kernel_key); + bool use_xpu_kp_kernel_debug = + paddle::platform::is_xpu_place(expected_kernel_key.place_) && + paddle::platform::is_in_xpu_kpwhite_list(op.Type()); + bool is_xpu_kp_support = (use_xpu_kp_kernel_rt || use_xpu_kp_kernel_debug); + if (is_xpu_kp_support) { + expected_kernel_key.library_type_ = paddle::framework::LibraryType::kKP; + } +#endif + if ((kernels_iter == all_op_kernels.end() || kernels_iter->second.find(expected_kernel_key) == kernels_iter->second.end()) @@ -255,6 +285,7 @@ PreparedOp PrepareImpl(const NameVarMap& ins, platform::errors::NotFound( "There are no kernels which are registered in the %s operator.", op.Type())); + auto& kernels = kernels_iter->second; auto kernel_iter = kernels.find(expected_kernel_key); @@ -271,18 +302,12 @@ PreparedOp PrepareImpl(const NameVarMap& ins, #ifdef PADDLE_WITH_XPU_KP if (paddle::platform::is_xpu_place(expected_kernel_key.place_)) { - bool use_xpu_kp_kernel_rt = - FLAGS_run_kp_kernel && - paddle::platform::is_xpu_kp_support_op(op.Type(), expected_kernel_key); - bool use_xpu_kp_kernel_debug = - paddle::platform::is_in_xpu_kpwhite_list(op.Type()); if (use_xpu_kp_kernel_rt) { VLOG(3) << "xpu_kp using rt mode "; } if (use_xpu_kp_kernel_debug) { VLOG(3) << "xpu_kp using debug mode "; } - bool is_xpu_kp_support = (use_xpu_kp_kernel_rt || use_xpu_kp_kernel_debug); if (is_xpu_kp_support) { expected_kernel_key.library_type_ = paddle::framework::LibraryType::kKP; kernel_iter = kernels.find(expected_kernel_key); @@ -464,6 +489,11 @@ static void PreparedOpRunPtImpl( pt_kernel(&pt_kernel_context); } + if (FLAGS_check_nan_inf) { + framework::details::CheckOpHasNanOrInfInDygraph( + op.Type(), outs, dev_ctx->GetPlace()); + } + if (FLAGS_benchmark) { dev_ctx->Wait(); #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) diff --git 
a/paddle/fluid/imperative/prepared_operator.h b/paddle/fluid/imperative/prepared_operator.h index 43bed5fd35e2f..04d0b4ca7a5db 100644 --- a/paddle/fluid/imperative/prepared_operator.h +++ b/paddle/fluid/imperative/prepared_operator.h @@ -272,6 +272,14 @@ void BuildDygraphPhiKernelContext( auto end_idx = start_idx + 1; kernel_ctx->AssignInputRange(std::make_pair(start_idx, end_idx), i); continue; + } else if (input_defs[i].type_index == + std::type_index( + typeid(paddle::optional< + const std::vector>))) { + kernel_ctx->EmplaceBackInputWithoutSetRange(nullptr); + auto end_idx = start_idx + 1; + kernel_ctx->AssignInputRange(std::make_pair(start_idx, end_idx), i); + continue; } else { PADDLE_THROW(phi::errors::NotFound( "Can not find input variable '%s' for %s OP, please check whether " @@ -545,6 +553,9 @@ void BuildDygraphPhiKernelContext( std::type_index(typeid(std::vector))) { kernel_ctx->EmplaceBackAttr( BOOST_GET_CONST(std::vector, attr)); + } else if (attr_defs[i].type_index == + std::type_index(typeid(std::vector))) { + kernel_ctx->EmplaceBackAttr(BOOST_GET_CONST(std::vector, attr)); } else { PADDLE_THROW(platform::errors::Unimplemented( "Unsupported cast op attribute `%s` when construct " diff --git a/paddle/fluid/imperative/tracer.cc b/paddle/fluid/imperative/tracer.cc index 3d4cfa2df3179..6b20b9b393869 100644 --- a/paddle/fluid/imperative/tracer.cc +++ b/paddle/fluid/imperative/tracer.cc @@ -313,7 +313,7 @@ void Tracer::TraceOpImpl(const std::string& type, { platform::RecordEvent node_creation_record_event( - type + " node_creation", platform::TracerEventType::Operator, 1); + type + " node_creation", platform::TracerEventType::OperatorInner, 1); if (ComputeRequiredGrad(new_ins, outs, trace_backward)) { PADDLE_ENFORCE_EQ( diff --git a/paddle/fluid/inference/CMakeLists.txt b/paddle/fluid/inference/CMakeLists.txt index 8cc4260289ad6..bdf364aa9adcd 100644 --- a/paddle/fluid/inference/CMakeLists.txt +++ b/paddle/fluid/inference/CMakeLists.txt @@ -53,8 +53,6 @@ endif() #TODO(wilber, T8T9): Do we still need to support windows gpu static library? if(WIN32 AND WITH_GPU) cc_library(paddle_inference DEPS ${fluid_modules} ${phi_modules} ${STATIC_INFERENCE_API} ${utils_modules}) -elseif(WITH_IPU) - cc_library(paddle_inference DEPS ${fluid_modules} ${phi_modules} ${STATIC_INFERENCE_API} ${utils_modules} paddle_ipu) else() create_static_lib(paddle_inference ${fluid_modules} ${phi_modules} ${STATIC_INFERENCE_API} ${utils_modules}) endif() diff --git a/paddle/fluid/inference/api/analysis_predictor.cc b/paddle/fluid/inference/api/analysis_predictor.cc index 6388cfc4b2dae..b7e811e4c64d6 100644 --- a/paddle/fluid/inference/api/analysis_predictor.cc +++ b/paddle/fluid/inference/api/analysis_predictor.cc @@ -74,6 +74,10 @@ #include "paddle/fluid/inference/tensorrt/trt_int8_calibrator.h" #endif +#ifdef PADDLE_WITH_IPU +#include "paddle/fluid/platform/device/ipu/paddle_ipu_handler.h" +#endif + namespace paddle { using inference::Singleton; @@ -1054,16 +1058,8 @@ std::unique_ptr CreatePaddlePredictor< std::to_string(fraction_of_gpu_memory); VLOG(3) << "set flag: " << flag; gflags.push_back(flag); - gflags.push_back("--cudnn_deterministic=True"); } -// TODO(wilber): jetson tx2 may fail to run the model due to insufficient memory -// under the native_best_fit strategy. Modify the default allocation strategy to -// auto_growth. todo, find a more appropriate way to solve the problem. 
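Several hunks in this patch adjust platform::RecordEvent usage: the thread pool and work queue changes bind the event to a named local, and tracer.cc above reclassifies node_creation as TracerEventType::OperatorInner. The named local matters because an unnamed temporary is destroyed at the end of its own statement and therefore times nothing; the finer-grained type lets nested spans be reported as stages inside the enclosing operator rather than as separate operator events. A self-contained imitation of the RAII pattern, with a stand-in ScopedEvent instead of the real profiler API:

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Minimal stand-in for an RAII profiling range (not the Paddle profiler API).
class ScopedEvent {
 public:
  explicit ScopedEvent(std::string name)
      : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
  ~ScopedEvent() {
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start_)
                  .count();
    std::cout << name_ << " took " << us << " us\n";
  }

 private:
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

int main() {
  ScopedEvent op_event("matmul (operator-level span)");
  {
    // Named local: the range covers the whole nested block.
    ScopedEvent node_creation("matmul node_creation (inner span)");
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
  }
  // The bug being fixed in the diff: an unnamed temporary such as
  //   ScopedEvent("WaitForWork");
  // is destroyed at the end of that statement and therefore times nothing.
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  return 0;
}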
-#ifdef WITH_NV_JETSON - gflags.push_back("--allocator_strategy=auto_growth"); -#endif - // TODO(Shixiaowei02): Add a mandatory scheme to use the thread local // allocator when multi-stream is enabled. if (config.thread_local_stream_enabled()) { diff --git a/paddle/fluid/inference/api/demo_ci/CMakeLists.txt b/paddle/fluid/inference/api/demo_ci/CMakeLists.txt index df98a7b05cf3f..c02fcd0781321 100644 --- a/paddle/fluid/inference/api/demo_ci/CMakeLists.txt +++ b/paddle/fluid/inference/api/demo_ci/CMakeLists.txt @@ -84,13 +84,15 @@ if(WITH_GPU) if(NOT WIN32) set(CUDA_LIB "/usr/local/cuda/lib64/" CACHE STRING "CUDA Library") else() - if(NOT DEFINED CUDA_LIB) + set(CUDA_LIB "" CACHE STRING "CUDA_LIB") + if("${TENSORRT_ROOT}" STREQUAL "") if(DEFINED ENV{CUDA_PATH}) set(CUDA_LIB "$ENV{CUDA_PATH}\\lib\\x64") else() set(CUDA_LIB "C:\\Program\ Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2\\lib\\x64") endif() endif() + message(STATUS "Current CUDA lib path: ${CUDA_LIB}") endif(NOT WIN32) endif() @@ -208,41 +210,47 @@ endif() add_executable(${DEMO_NAME} ${DEMO_NAME}.cc) target_link_libraries(${DEMO_NAME} ${DEPS}) if(WIN32) + if("${CMAKE_GENERATOR}" MATCHES "Ninja") + set(LIB_PATH ${CMAKE_BINARY_DIR}) + else() + set(LIB_PATH ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE}) + endif() + if(USE_TENSORRT) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy ${TENSORRT_LIB_DIR}/nvinfer${CMAKE_SHARED_LIBRARY_SUFFIX} - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} COMMAND ${CMAKE_COMMAND} -E copy ${TENSORRT_LIB_DIR}/nvinfer_plugin${CMAKE_SHARED_LIBRARY_SUFFIX} - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} ) if(${TENSORRT_MAJOR_VERSION} GREATER_EQUAL 7) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy ${TENSORRT_LIB_DIR}/myelin64_1${CMAKE_SHARED_LIBRARY_SUFFIX} - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE}) + ${LIB_PATH}) endif() endif() if(WITH_MKL) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD - COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/mklml.dll ${CMAKE_BINARY_DIR}/Release - COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/libiomp5md.dll ${CMAKE_BINARY_DIR}/Release - COMMAND ${CMAKE_COMMAND} -E copy ${MKLDNN_PATH}/lib/mkldnn.dll ${CMAKE_BINARY_DIR}/Release + COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/mklml.dll ${LIB_PATH} + COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/libiomp5md.dll ${LIB_PATH} + COMMAND ${CMAKE_COMMAND} -E copy ${MKLDNN_PATH}/lib/mkldnn.dll ${LIB_PATH} ) else() add_custom_command(TARGET ${DEMO_NAME} POST_BUILD - COMMAND ${CMAKE_COMMAND} -E copy ${OPENBLAS_LIB_PATH}/lib/openblas.dll ${CMAKE_BINARY_DIR}/Release + COMMAND ${CMAKE_COMMAND} -E copy ${OPENBLAS_LIB_PATH}/lib/openblas.dll ${LIB_PATH} ) endif() if(WITH_ONNXRUNTIME) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy ${PADDLE_LIB_THIRD_PARTY_PATH}onnxruntime/lib/onnxruntime.dll - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} COMMAND ${CMAKE_COMMAND} -E copy ${PADDLE_LIB_THIRD_PARTY_PATH}paddle2onnx/lib/paddle2onnx.dll - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} ) endif() if(NOT WITH_STATIC_LIB) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD - COMMAND ${CMAKE_COMMAND} -E copy "${PADDLE_LIB}/paddle/lib/paddle_inference.dll" ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + COMMAND ${CMAKE_COMMAND} -E copy "${PADDLE_LIB}/paddle/lib/paddle_inference.dll" ${LIB_PATH} ) endif() endif() diff --git a/paddle/fluid/inference/api/demo_ci/run.sh 
b/paddle/fluid/inference/api/demo_ci/run.sh index 2c0945cd5b386..290c547c98691 100755 --- a/paddle/fluid/inference/api/demo_ci/run.sh +++ b/paddle/fluid/inference/api/demo_ci/run.sh @@ -23,6 +23,7 @@ USE_TENSORRT=$5 TENSORRT_ROOT_DIR=$6 # TensorRT root dir, default to /usr WITH_ONNXRUNTIME=$7 MSVC_STATIC_CRT=$8 +CUDA_LIB=$9/lib/x64 inference_install_dir=${PADDLE_ROOT}/build/paddle_inference_install_dir WIN_DETECT=$(echo `uname` | grep "Win") # detect current platform @@ -112,16 +113,18 @@ for WITH_STATIC_LIB in ON OFF; do continue fi # -----simple_on_word2vec on windows----- - cmake .. -G "Visual Studio 15 2017" -A x64 -T host=x64 -DPADDLE_LIB=${inference_install_dir} \ + cmake .. -GNinja -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=simple_on_word2vec \ -DWITH_GPU=$TEST_GPU_CPU \ -DWITH_STATIC_LIB=$WITH_STATIC_LIB \ -DMSVC_STATIC_CRT=$MSVC_STATIC_CRT \ - -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME - msbuild /maxcpucount /property:Configuration=Release cpp_inference_demo.sln + -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDA_LIB="$CUDA_LIB" + ninja for use_gpu in $use_gpu_list; do - Release/simple_on_word2vec.exe \ + ./simple_on_word2vec.exe \ --dirname=$DATA_DIR/word2vec/word2vec.inference.model \ --use_gpu=$use_gpu if [ $? -ne 0 ]; then @@ -132,17 +135,19 @@ for WITH_STATIC_LIB in ON OFF; do # -----vis_demo on windows----- rm -rf * - cmake .. -G "Visual Studio 15 2017" -A x64 -T host=x64 -DPADDLE_LIB=${inference_install_dir} \ + cmake .. -GNinja -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=vis_demo \ -DWITH_GPU=$TEST_GPU_CPU \ -DWITH_STATIC_LIB=$WITH_STATIC_LIB \ -DMSVC_STATIC_CRT=$MSVC_STATIC_CRT \ - -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME - msbuild /maxcpucount /property:Configuration=Release cpp_inference_demo.sln + -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDA_LIB="$CUDA_LIB" + ninja for use_gpu in $use_gpu_list; do for vis_demo_name in $vis_demo_list; do - Release/vis_demo.exe \ + ./vis_demo.exe \ --modeldir=$DATA_DIR/$vis_demo_name/model \ --data=$DATA_DIR/$vis_demo_name/data.txt \ --refer=$DATA_DIR/$vis_demo_name/result.txt \ @@ -153,11 +158,11 @@ for WITH_STATIC_LIB in ON OFF; do fi done done - + # --------tensorrt mobilenet on windows------ if [ $USE_TENSORRT == ON -a $TEST_GPU_CPU == ON ]; then rm -rf * - cmake .. -G "Visual Studio 15 2017" -A x64 -T host=x64 -DPADDLE_LIB=${inference_install_dir} \ + cmake .. 
-GNinja -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=trt_mobilenet_demo \ -DWITH_GPU=$TEST_GPU_CPU \ @@ -165,9 +170,11 @@ for WITH_STATIC_LIB in ON OFF; do -DMSVC_STATIC_CRT=$MSVC_STATIC_CRT \ -DUSE_TENSORRT=$USE_TENSORRT \ -DTENSORRT_ROOT=$TENSORRT_ROOT_DIR \ - -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME - msbuild /maxcpucount /property:Configuration=Release cpp_inference_demo.sln - Release/trt_mobilenet_demo.exe \ + -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME \ + -DCMAKE_BUILD_TYPE=Release \ + -DCUDA_LIB="$CUDA_LIB" + ninja + ./trt_mobilenet_demo.exe \ --modeldir=$DATA_DIR/mobilenet/model \ --data=$DATA_DIR/mobilenet/data.txt \ --refer=$DATA_DIR/mobilenet/result.txt diff --git a/paddle/fluid/inference/api/paddle_pass_builder.cc b/paddle/fluid/inference/api/paddle_pass_builder.cc index 95975d8f2a892..20418e37a7b94 100644 --- a/paddle/fluid/inference/api/paddle_pass_builder.cc +++ b/paddle/fluid/inference/api/paddle_pass_builder.cc @@ -76,10 +76,13 @@ void PaddlePassBuilder::ClearPasses() { passes_.clear(); } const std::vector kTRTSubgraphPasses({ "adaptive_pool2d_convert_global_pass", - "shuffle_channel_detect_pass", // - "quant_conv2d_dequant_fuse_pass", // - "delete_quant_dequant_op_pass", // - "delete_quant_dequant_filter_op_pass", // + "shuffle_channel_detect_pass", // + "quant_conv2d_dequant_fuse_pass", // + "delete_quant_dequant_op_pass", // + "delete_quant_dequant_filter_op_pass", // + "delete_weight_dequant_linear_op_pass", // + "delete_quant_dequant_linear_op_pass", // + "add_support_int8_pass", // // "fc_fuse_pass", // "simplify_with_basic_ops_pass", // "embedding_eltwise_layernorm_fuse_pass", // @@ -98,9 +101,8 @@ const std::vector kTRTSubgraphPasses({ "trt_map_matmul_to_mul_pass", // "fc_fuse_pass", // "conv_elementwise_add_fuse_pass", // - "add_support_int8_pass", - "tensorrt_subgraph_pass", // - "conv_bn_fuse_pass", // + "tensorrt_subgraph_pass", // + "conv_bn_fuse_pass", // #if CUDNN_VERSION >= 7100 // To run conv_fusion, the version of cudnn must be // guaranteed at least v7 // cudnn8.0 has memory leak problem in conv + eltwise + act, so we diff --git a/paddle/fluid/inference/tensorrt/convert/activation_op.cc b/paddle/fluid/inference/tensorrt/convert/activation_op.cc index e6a0ecf4aecec..b86351e394bd1 100644 --- a/paddle/fluid/inference/tensorrt/convert/activation_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/activation_op.cc @@ -68,12 +68,6 @@ class ActivationOpConverter : public OpConverter { auto output_name = op_desc.Output("Out")[0]; RreplenishLayerAndOutput(layer, op_type_, {output_name}, test_mode); - if (op_desc.HasAttr("out_scale")) { -#if IS_TRT_VERSION_GE(5130) - float out_scale = BOOST_GET_CONST(float, op_desc.GetAttr("out_scale")); - engine_->SetTensorDynamicRange(layer->getOutput(0), out_scale); -#endif - } } protected: diff --git a/paddle/fluid/inference/tensorrt/convert/affine_channel_op.cc b/paddle/fluid/inference/tensorrt/convert/affine_channel_op.cc index eba67c3c098ca..cc06f82ae3901 100644 --- a/paddle/fluid/inference/tensorrt/convert/affine_channel_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/affine_channel_op.cc @@ -49,11 +49,11 @@ class AffineChannelOpConverter : public OpConverter { auto* scale_v = scope.FindVar(scale_name); auto* scale_t = scale_v->GetMutable(); - float* scale_ptr = engine_->GetWeightCPUData(scale_name, scale_t, false); + float* scale_ptr = engine_->GetWeightCPUData(scale_name, scale_t); auto* bias_v = scope.FindVar(bias_name); auto* bias_t = bias_v->GetMutable(); - float* bias_ptr = 
engine_->GetWeightCPUData(bias_name, bias_t, false); + float* bias_ptr = engine_->GetWeightCPUData(bias_name, bias_t); // tensorrt scalend layer only support spatial dims >= 2, // so nhwc is not availabe (spatial dims == 0) diff --git a/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc b/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc index a296a2641db65..1b2abeac6c19f 100644 --- a/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/conv2d_op.cc @@ -49,18 +49,11 @@ void ConvertConv2d(TensorRTEngine* engine, const framework::proto::OpDesc& op, if (enable_int8) { #if IS_TRT_VERSION_GE(5000) - float in_scale = - BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")) * 127; - auto weight_scale = - BOOST_GET_CONST(std::vector, op_desc.GetAttr("weight_scale")); - weight_data = engine->GetWeightCPUData(op_desc.Input("Filter").front(), Y_t, - true, weight_scale); + float in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); engine->SetTensorDynamicRange(X, in_scale); #endif - } else { - weight_data = - engine->GetWeightCPUData(op_desc.Input("Filter").front(), Y_t, false); } + weight_data = engine->GetWeightCPUData(op_desc.Input("Filter").front(), Y_t); PADDLE_ENFORCE_EQ(Y_t->dims().size(), 4UL, platform::errors::InvalidArgument( @@ -115,7 +108,7 @@ void ConvertConv2d(TensorRTEngine* engine, const framework::proto::OpDesc& op, auto* bias_tensor = scope.GetVar(op_desc.Input("Bias").front()); auto* bias_tensor_data = bias_tensor->GetMutable(); bias_data = engine->GetWeightCPUData(op_desc.Input("Bias").front(), - bias_tensor_data, false); + bias_tensor_data); bias_size = static_cast(bias_tensor_data->numel()); } diff --git a/paddle/fluid/inference/tensorrt/convert/conv3d_op.cc b/paddle/fluid/inference/tensorrt/convert/conv3d_op.cc index dae92264d2c3e..dbb2786ed78ab 100644 --- a/paddle/fluid/inference/tensorrt/convert/conv3d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/conv3d_op.cc @@ -48,17 +48,10 @@ void ConvertConv3d(TensorRTEngine* engine, const framework::proto::OpDesc& op, bool enable_int8 = op_desc.HasAttr("enable_int8"); if (enable_int8) { - float in_scale = - BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")) * 127; - auto weight_scale = - BOOST_GET_CONST(std::vector, op_desc.GetAttr("weight_scale")); - weight_data = engine->GetWeightCPUData(op_desc.Input("Filter").front(), Y_t, - true, weight_scale); + float in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); engine->SetTensorDynamicRange(X, in_scale); - } else { - weight_data = - engine->GetWeightCPUData(op_desc.Input("Filter").front(), Y_t, false); } + weight_data = engine->GetWeightCPUData(op_desc.Input("Filter").front(), Y_t); PADDLE_ENFORCE_EQ(Y_t->dims().size(), 5UL, platform::errors::InvalidArgument( diff --git a/paddle/fluid/inference/tensorrt/convert/deformable_conv_op.cc b/paddle/fluid/inference/tensorrt/convert/deformable_conv_op.cc index d8534a4183bdd..2bbe6ea3d2fa8 100644 --- a/paddle/fluid/inference/tensorrt/convert/deformable_conv_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/deformable_conv_op.cc @@ -47,8 +47,7 @@ class DeformableConvOpConverter : public OpConverter { auto* filter_var = scope.FindVar(filter_name); auto* filter_tensor = filter_var->GetMutable(); - float* filter_data = - engine_->GetWeightCPUData(filter_name, filter_tensor, false); + float* filter_data = engine_->GetWeightCPUData(filter_name, filter_tensor); const int c_o = filter_tensor->dims()[0]; const int c_i = filter_tensor->dims()[1]; diff --git 
a/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc b/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc index a66a97b4be9da..8fd0e1bbd068d 100644 --- a/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/elementwise_op.cc @@ -51,8 +51,7 @@ class ElementwiseWeightOpConverter : public OpConverter { auto* Y_t = Y_v->GetMutable(); float* weight_data = nullptr; auto output_name = op_desc.Output("Out")[0]; - weight_data = - engine_->GetWeightCPUData(op_desc.Input("Y").front(), Y_t, false); + weight_data = engine_->GetWeightCPUData(op_desc.Input("Y").front(), Y_t); nvinfer1::Dims dims_x = X->getDimensions(); auto regist_eltwise_weight = [&](nvinfer1::ScaleMode scale_mode) { @@ -112,13 +111,6 @@ class ElementwiseWeightOpConverter : public OpConverter { RreplenishLayerAndOutput(layer, "elementwise_" + op_type_, {output_name}, test_mode); } - if (op_desc.HasAttr("enable_int8")) { -#if IS_TRT_VERSION_GE(5000) - CHECK(op_desc.HasAttr("X_scale")); - float x_scale = BOOST_GET_CONST(float, op_desc.GetAttr("X_scale")); - engine_->SetTensorDynamicRange(X, x_scale); -#endif - } }; if (engine_->with_dynamic_shape()) { @@ -222,16 +214,6 @@ class ElementwiseTensorOpConverter : public OpConverter { auto common_func = [&](nvinfer1::ILayer* layer) { RreplenishLayerAndOutput(layer, "elementwise", {output_name}, test_mode); - if (op_desc.HasAttr("enable_int8")) { -#if IS_TRT_VERSION_GE(5000) - CHECK(op_desc.HasAttr("X_scale")); - CHECK(op_desc.HasAttr("Y_scale")); - float x_scale = BOOST_GET_CONST(float, op_desc.GetAttr("X_scale")); - float y_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Y_scale")); - engine_->SetTensorDynamicRange(X, x_scale); - engine_->SetTensorDynamicRange(Y, y_scale); -#endif - } }; if (dims_x.nbDims == dims_y.nbDims) { diff --git a/paddle/fluid/inference/tensorrt/convert/emb_eltwise_layernorm.cc b/paddle/fluid/inference/tensorrt/convert/emb_eltwise_layernorm.cc index 9741aab32dea5..7a494860e6fa1 100644 --- a/paddle/fluid/inference/tensorrt/convert/emb_eltwise_layernorm.cc +++ b/paddle/fluid/inference/tensorrt/convert/emb_eltwise_layernorm.cc @@ -77,7 +77,7 @@ class EmbEltwiseLayerNormOpConverter : public OpConverter { auto* temp_tensor = temp_var->GetMutable(); (*dims) = temp_tensor->dims(); - auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor, false); + auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor); return temp_data; }; diff --git a/paddle/fluid/inference/tensorrt/convert/fc_op.cc b/paddle/fluid/inference/tensorrt/convert/fc_op.cc index bdea14c9e9f89..a631332dae360 100644 --- a/paddle/fluid/inference/tensorrt/convert/fc_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/fc_op.cc @@ -113,22 +113,20 @@ class FcOpConverter : public OpConverter { // assigned from CPU memory, which can't be avoided. 
float* weight_data = nullptr; bool enable_int8 = op_desc.HasAttr("enable_int8"); - float in_scale = 0.; - if (enable_int8) { -#if IS_TRT_VERSION_GE(5000) - CHECK(op_desc.HasAttr(i_name + "_scale")); - in_scale = - BOOST_GET_CONST(float, op_desc.GetAttr(i_name + "_scale")) * 127; - auto weight_scale = - BOOST_GET_CONST(std::vector, op_desc.GetAttr("weight_scale")); - weight_data = engine_->GetWeightCPUData(op_desc.Input(w_name).front(), - Y_t, true, weight_scale); + bool support_int8 = false; + if (op_desc.HasAttr("support_int8")) { + support_int8 = BOOST_GET_CONST(bool, op_desc.GetAttr("support_int8")); + } + float in_scale = 0; + if (enable_int8 || support_int8) { + if (enable_int8) { + in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); + } else { + in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("X")); + } engine_->SetTensorDynamicRange(X, in_scale); -#endif - } else { - weight_data = - engine_->GetWeightCPUData(op_desc.Input(w_name).front(), Y_t, false); } + weight_data = engine_->GetWeightCPUData(op_desc.Input(w_name).front(), Y_t); PADDLE_ENFORCE_EQ(Y_t->dims().size(), 2UL, platform::errors::InvalidArgument( @@ -148,14 +146,18 @@ class FcOpConverter : public OpConverter { auto regist_fc = [&](nvinfer1::ITensor* inputs, int n_output, TensorRTEngine::Weight& weight, TensorRTEngine::Weight& bias) { - if (enable_int8) { + if (enable_int8 || support_int8) { // add conv layer - PADDLE_ENFORCE_EQ( - op_desc.HasAttr("out_threshold"), true, - platform::errors::InvalidArgument( - "must have out threshold in fc layers in int8 mode")); - float out_scale = - BOOST_GET_CONST(float, op_desc.GetAttr("out_threshold")); + float out_scale = 0; + if (enable_int8) { + PADDLE_ENFORCE_EQ( + op_desc.HasAttr("out_threshold"), true, + platform::errors::InvalidArgument( + "must have out threshold in fc layers in int8 mode")); + out_scale = BOOST_GET_CONST(float, op_desc.GetAttr("out_threshold")); + } else { + out_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Out")); + } nvinfer1::DimsHW nv_ksize(1, 1); auto* fc_layer_int8 = TRT_ENGINE_ADD_LAYER(engine_, Convolution, *inputs, n_output, @@ -235,8 +237,7 @@ class FcOpConverter : public OpConverter { if (with_bias) { auto* b_v = scope.GetVar(op_desc.Input("Bias").front()); auto* b_t = b_v->GetMutable(); - bias_data = - engine_->GetWeightCPUData(op_desc.Input("Bias").front(), b_t, false); + bias_data = engine_->GetWeightCPUData(op_desc.Input("Bias").front(), b_t); bias_num = b_t->numel(); } TensorRTEngine::Weight bias{nvinfer1::DataType::kFLOAT, @@ -251,7 +252,7 @@ class FcOpConverter : public OpConverter { // not add Shuffle layer in ernie's multihead. 
if (engine_->use_oss() && engine_->with_ernie() && x_dim.nbDims == 4 && x_dim.d[3] == 1 && x_num_col_dims == 2) { - if (enable_int8) { + if (enable_int8 || support_int8) { // add conv1x1 layer nvinfer1::DimsHW nv_ksize(1, 1); auto* fc_layer_int8 = @@ -265,8 +266,13 @@ class FcOpConverter : public OpConverter { op_desc.HasAttr("out_threshold"), true, platform::errors::InvalidArgument( "must have out threshold in fc layers in int8 mode")); - float out_scale = - BOOST_GET_CONST(float, op_desc.GetAttr("out_threshold")); + float out_scale = 0; + if (enable_int8) { + out_scale = + BOOST_GET_CONST(float, op_desc.GetAttr("out_threshold")); + } else { + out_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Out")); + } engine_->SetTensorDynamicRange(fc_layer_int8->getOutput(0), out_scale); nvinfer1::IActivationLayer* relu_layer_int8 = TRT_ENGINE_ADD_LAYER( @@ -308,7 +314,7 @@ class FcOpConverter : public OpConverter { auto* reshape_before_fc_layer = reshape_before_fc(X, x_dim, x_num_col_dims, output_name); auto* reshape_itensor = reshape_before_fc_layer->getOutput(0); - if (enable_int8) { + if (enable_int8 || support_int8) { engine_->SetTensorDynamicRange(reshape_itensor, in_scale); } regist_fc(reshape_itensor, n_output, weight, bias); diff --git a/paddle/fluid/inference/tensorrt/convert/group_norm_op.cc b/paddle/fluid/inference/tensorrt/convert/group_norm_op.cc index b3c1f986aa030..910a807d3626a 100644 --- a/paddle/fluid/inference/tensorrt/convert/group_norm_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/group_norm_op.cc @@ -48,7 +48,7 @@ class GroupNormOpConverter : public OpConverter { auto* temp_tensor = temp_var->GetMutable(); (*dims) = temp_tensor->dims(); - auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor, false); + auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor); return temp_data; }; diff --git a/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc b/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc index c6dbfc832201b..c7a551b7436db 100644 --- a/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/leaky_relu_op.cc @@ -49,8 +49,8 @@ class LeakyReluOpConverter : public OpConverter { bool enable_int8 = op_desc.HasAttr("enable_int8"); if (enable_int8) { - CHECK(op_desc.HasAttr("X_scale")); - float in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("X_scale")); + CHECK(op_desc.HasAttr("Input_scale")); + float in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); engine_->SetTensorDynamicRange(input, in_scale); } #else diff --git a/paddle/fluid/inference/tensorrt/convert/matmul_op.cc b/paddle/fluid/inference/tensorrt/convert/matmul_op.cc index b2e76b9a0e61b..7568f67d64d04 100644 --- a/paddle/fluid/inference/tensorrt/convert/matmul_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/matmul_op.cc @@ -64,7 +64,9 @@ class MatMulOpConverter : public OpConverter { : nvinfer1::MatrixOperation::kNONE; if (op_desc.HasAttr("support_int8") && - engine_->precision() == AnalysisConfig::Precision::kInt8) { + BOOST_GET_CONST(bool, op_desc.GetAttr("support_int8")) && + engine_->precision() == AnalysisConfig::Precision::kInt8 && + platform::GetGPUComputeCapability(0) >= 75) { if (engine_->with_dynamic_shape()) { VLOG(3) << "Convert a fluid matmul_op_int8_dynamic to TensorRT " "MatmulPluginLayer"; diff --git a/paddle/fluid/inference/tensorrt/convert/multiclass_nms3_op.cc b/paddle/fluid/inference/tensorrt/convert/multiclass_nms3_op.cc index 00f1419f082d1..a968ea2a2c484 100644 --- 
a/paddle/fluid/inference/tensorrt/convert/multiclass_nms3_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/multiclass_nms3_op.cc @@ -1,4 +1,4 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at @@ -38,6 +38,7 @@ class MultiClassNMS3OpConverter : public OpConverter { std::string scores = op_desc.Input("Scores").front(); std::string output_name = op_desc.Output("Out").front(); std::string rois_num_name = op_desc.Output("NmsRoisNum").front(); + std::string index_name = op_desc.Output("Index").front(); auto* bboxes_tensor = engine_->GetITensor(bboxes); auto* scores_tensor = engine_->GetITensor(scores); @@ -122,10 +123,20 @@ class MultiClassNMS3OpConverter : public OpConverter { engine_, Concatenation, concat_inputs.data(), concat_inputs.size()); nms_concat_layer->setAxis(1); + // add fake index as output to be consistent with the outputs of + // multiclass_nms3 + std::vector index(1, 0); + auto constant_layer = TRT_ENGINE_ADD_LAYER( + engine_, Constant, nvinfer1::Dims2(1, 1), + nvinfer1::Weights{nvinfer1::DataType::kINT32, + static_cast(index.data()), 1}); + RreplenishLayerAndOutput(batch_nms_layer, "multiclass_nms3", {rois_num_name}, test_mode); RreplenishLayerAndOutput(nms_concat_layer, "multiclass_nms3", {output_name}, test_mode); + RreplenishLayerAndOutput(constant_layer, "multiclass_nms3", {index_name}, + test_mode); } }; diff --git a/paddle/fluid/inference/tensorrt/convert/multihead_matmul_op.cc b/paddle/fluid/inference/tensorrt/convert/multihead_matmul_op.cc index f19b21d3e6326..21c79f0edd27f 100644 --- a/paddle/fluid/inference/tensorrt/convert/multihead_matmul_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/multihead_matmul_op.cc @@ -40,22 +40,16 @@ class MultiheadMatMulOpConverter : public OpConverter { auto* bias_t = bias_v->GetMutable(); float* weight_data = nullptr; - bool enable_int8 = op_desc.HasAttr("enable_int8"); bool qkv2context_plugin_int8 = op_desc.HasAttr("qkv2context_plugin_int8"); float in_scale = 0.; - if (enable_int8) { - in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")) * 127; - auto weight_scale = - BOOST_GET_CONST(std::vector, op_desc.GetAttr("weight_scale")); - weight_data = - engine_->GetWeightCPUData(weight_name, weight_t, true, weight_scale); + if (op_desc.HasAttr("Input_scale")) { + in_scale = BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); engine_->SetTensorDynamicRange(input, in_scale); - } else { - weight_data = engine_->GetWeightCPUData(weight_name, weight_t, false); } + weight_data = engine_->GetWeightCPUData(weight_name, weight_t); - float* bias_data = engine_->GetWeightCPUData(bias_name, bias_t, false); + float* bias_data = engine_->GetWeightCPUData(bias_name, bias_t); std::vector weight_data_tmp; weight_data_tmp.reserve(weight_t->numel()); memcpy(weight_data_tmp.data(), weight_data, @@ -85,6 +79,10 @@ class MultiheadMatMulOpConverter : public OpConverter { if (engine_->with_dynamic_shape()) { if (engine_->use_oss()) { + if (engine_->precision() == AnalysisConfig::Precision::kFloat32) { + PADDLE_THROW(platform::errors::Fatal( + "use use_oss must be int8 or half, not float32.")); + } nvinfer1::Weights weight{nvinfer1::DataType::kFLOAT, static_cast(weight_data), static_cast(weight_t->numel())}; @@ -93,7 +91,7 @@ class MultiheadMatMulOpConverter : public OpConverter { 
static_cast(bias_t->numel())}; if (engine_->with_interleaved()) { VLOG(4) << "fused multihead_matmul op: use_oss and with_interleaved"; - if (!enable_int8) { + if (!op_desc.HasAttr("Input_scale")) { PADDLE_THROW( platform::errors::Fatal("use with_interleaved must be int8.")); } @@ -213,7 +211,7 @@ class MultiheadMatMulOpConverter : public OpConverter { nvinfer1::ILayer* fc_layer = nullptr; float dp_probs = 1.0 / 127.0; - if (enable_int8) { + if (op_desc.HasAttr("Input_scale")) { nvinfer1::DimsHW nv_ksize(1, 1); fc_layer = TRT_ENGINE_ADD_LAYER(engine_, Convolution, *input, n, nv_ksize, weight, bias); @@ -222,7 +220,7 @@ class MultiheadMatMulOpConverter : public OpConverter { weight, bias); } - if (enable_int8) { + if (op_desc.HasAttr("fc_out_threshold")) { PADDLE_ENFORCE_EQ(op_desc.HasAttr("fc_out_threshold"), true, platform::errors::InvalidArgument( "must have out threshold in multihead layers " @@ -241,14 +239,10 @@ class MultiheadMatMulOpConverter : public OpConverter { auto creator = GetPluginRegistry()->getPluginCreator( "CustomQKVToContextPluginDynamic", "2"); assert(creator != nullptr); - int type = static_cast((engine_->WithFp16() == 1) - ? nvinfer1::DataType::kHALF - : nvinfer1::DataType::kFLOAT); - if (enable_int8) { - type = static_cast(nvinfer1::DataType::kHALF); - if (qkv2context_plugin_int8) { - type = static_cast(nvinfer1::DataType::kINT8); - } + int type = static_cast(nvinfer1::DataType::kHALF); + if (qkv2context_plugin_int8 && + (engine_->precision() == AnalysisConfig::Precision::kInt8)) { + type = static_cast(nvinfer1::DataType::kINT8); } bool has_mask = true; int var_seqlen = 1; @@ -335,7 +329,7 @@ class MultiheadMatMulOpConverter : public OpConverter { reshape_before_fc_dim.d[4] = 1; auto* reshape_before_fc_layer = TRT_ENGINE_ADD_LAYER(engine_, Shuffle, *input); - if (enable_int8) { + if (op_desc.HasAttr("Input_scale")) { engine_->SetTensorDynamicRange(reshape_before_fc_layer->getOutput(0), in_scale); } @@ -346,7 +340,7 @@ class MultiheadMatMulOpConverter : public OpConverter { // add layer fc nvinfer1::ILayer* fc_layer = nullptr; - if (enable_int8) { + if (op_desc.HasAttr("Input_scale")) { nvinfer1::DimsHW nv_ksize(1, 1); fc_layer = TRT_ENGINE_ADD_LAYER( engine_, Convolution, *reshape_before_fc_layer->getOutput(0), n, @@ -357,7 +351,7 @@ class MultiheadMatMulOpConverter : public OpConverter { n, weight.get(), bias.get()); } - if (enable_int8) { + if (op_desc.HasAttr("fc_out_threshold")) { PADDLE_ENFORCE_EQ( op_desc.HasAttr("fc_out_threshold"), true, platform::errors::InvalidArgument( @@ -382,8 +376,8 @@ class MultiheadMatMulOpConverter : public OpConverter { bool with_fp16 = engine_->WithFp16() && !engine_->disable_trt_plugin_fp16(); - if (enable_int8) { - with_fp16 = 1; + if (engine_->precision() == AnalysisConfig::Precision::kInt8) { + with_fp16 = true; } plugin::DynamicPluginTensorRT* plugin = new plugin::QkvToContextPluginDynamic(hidden_in, head_number, diff --git a/paddle/fluid/inference/tensorrt/convert/op_converter.h b/paddle/fluid/inference/tensorrt/convert/op_converter.h index 7e0c8bf1da177..f7eb7f859afaa 100644 --- a/paddle/fluid/inference/tensorrt/convert/op_converter.h +++ b/paddle/fluid/inference/tensorrt/convert/op_converter.h @@ -145,42 +145,68 @@ class OpConverter { (*it)(op, scope, test_mode); size_t output_num = op_desc.OutputNames().size(); - if (output_num == 1) { // The number of output is 1 - if (op_desc.HasAttr("out_threshold")) { - float out_scale = - BOOST_GET_CONST(float, op_desc.GetAttr("out_threshold")); - std::string output_name = ""; - if 
(op_desc.HasOutput("Output")) { - output_name = op_desc.Output("Output").front(); - } else if (op_desc.HasOutput("Out")) { - output_name = op_desc.Output("Out").front(); - } else if (op_desc.HasOutput("Y")) { - output_name = op_desc.Output("Y").front(); - } else { - PADDLE_THROW( - platform::errors::NotFound("Op %s has out threshold but doesn't " - "have an output named \"Output\", " - "\"Out\" or \"Y\".", - op_desc.Type())); - } + // only one out settensordynamicRange + if (op_desc.HasAttr("out_threshold")) { + float out_scale = + BOOST_GET_CONST(float, op_desc.GetAttr("out_threshold")); + std::string output_name = ""; + if (op_desc.HasOutput("Output")) { + output_name = op_desc.Output("Output").front(); + } else if (op_desc.HasOutput("Out")) { + output_name = op_desc.Output("Out").front(); + } else if (op_desc.HasOutput("Y")) { + output_name = op_desc.Output("Y").front(); + } else { + PADDLE_THROW( + platform::errors::NotFound("Op %s has out threshold but doesn't " + "have an output named \"Output\", " + "\"Out\" or \"Y\".", + op_desc.Type())); + } + auto* output_itensor = engine->GetITensor(output_name); + engine->SetTensorDynamicRange(output_itensor, out_scale); + VLOG(1) << "Set out scale = " << out_scale << " for tensor " + << output_name << "."; + } + // outs settensordynamicRange + for (size_t i = 0; i < output_num; ++i) { + if (op_desc.HasAttr("out_" + std::to_string(i) + "_threshold")) { + float out_scale = BOOST_GET_CONST( + float, op_desc.GetAttr("out_" + std::to_string(i) + "_threshold")); + std::string output_name = + op_desc.Output(op_desc.OutputNames()[i]).front(); auto* output_itensor = engine->GetITensor(output_name); engine->SetTensorDynamicRange(output_itensor, out_scale); VLOG(1) << "Set out scale = " << out_scale << " for tensor " << output_name << "."; } - } else if (output_num > 1) { // The number of outputs greater than 1 - for (size_t i = 0; i < output_num; ++i) { - if (op_desc.HasAttr("out_" + std::to_string(i) + "_threshold")) { - float out_scale = BOOST_GET_CONST( - float, - op_desc.GetAttr("out_" + std::to_string(i) + "_threshold")); - std::string output_name = - op_desc.Output(op_desc.OutputNames()[i]).front(); - auto* output_itensor = engine->GetITensor(output_name); - engine->SetTensorDynamicRange(output_itensor, out_scale); - VLOG(1) << "Set out scale = " << out_scale << " for tensor " - << output_name << "."; - } + } + + // quant_dequant_linear support for paddle trt + + std::vector inputs_name = op_desc.InputNames(); + std::vector outputs_name = op_desc.OutputNames(); + + for (size_t i = 0; i < inputs_name.size(); i++) { + if (op_desc.HasAttr(inputs_name[i])) { + std::string input_tensor_name = op_desc.Input(inputs_name[i])[0]; + auto* input_itensor = engine->GetITensor(input_tensor_name); + float input_scale = + BOOST_GET_CONST(float, op_desc.GetAttr(inputs_name[i])); + engine->SetTensorDynamicRange(input_itensor, input_scale); + VLOG(1) << "Set input tensor scale = " << input_scale + << " for tensor: " << input_tensor_name << "."; + } + } + for (size_t i = 0; i < outputs_name.size(); i++) { + if (op_desc.HasAttr(outputs_name[i])) { + std::string output_tensor_name = op_desc.Output(outputs_name[i])[0]; + auto* output_itensor = engine->GetITensor(output_tensor_name); + float output_scale = + BOOST_GET_CONST(float, op_desc.GetAttr(outputs_name[i])); + engine->SetTensorDynamicRange(output_itensor, output_scale); + VLOG(1) << "Set output tensor scale = " << output_scale + << " for tensor: " << output_tensor_name << "."; } } } diff --git 
a/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc b/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc index 7b65d2d7c97cc..7824a0f1e29f4 100644 --- a/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/pool2d_op.cc @@ -132,11 +132,10 @@ class Pool2dOpConverter : public OpConverter { } if (op_desc.HasAttr("enable_int8")) { -#if IS_TRT_VERSION_GE(5000) - CHECK(op_desc.HasAttr("X_scale")); - float input_scale = BOOST_GET_CONST(float, op_desc.GetAttr("X_scale")); + CHECK(op_desc.HasAttr("Input_scale")); + float input_scale = + BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); engine_->SetTensorDynamicRange(input1, input_scale); -#endif } std::vector real_paddings = paddings; diff --git a/paddle/fluid/inference/tensorrt/convert/pool3d_op.cc b/paddle/fluid/inference/tensorrt/convert/pool3d_op.cc index 5a306f622adbe..665bf9c8d22ed 100644 --- a/paddle/fluid/inference/tensorrt/convert/pool3d_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/pool3d_op.cc @@ -123,8 +123,9 @@ class Pool3dOpConverter : public OpConverter { nvinfer1::Dims3 nv_paddings(paddings[0], paddings[1], paddings[2]); nvinfer1::ILayer *layer = nullptr; if (op_desc.HasAttr("enable_int8")) { - CHECK(op_desc.HasAttr("X_scale")); - float input_scale = BOOST_GET_CONST(float, op_desc.GetAttr("X_scale")); + CHECK(op_desc.HasAttr("Input_scale")); + float input_scale = + BOOST_GET_CONST(float, op_desc.GetAttr("Input_scale")); engine_->SetTensorDynamicRange(input1, input_scale); } diff --git a/paddle/fluid/inference/tensorrt/convert/preln_emb_eltwise_layernorm.cc b/paddle/fluid/inference/tensorrt/convert/preln_emb_eltwise_layernorm.cc index daa3b186ab4c4..87fdbb71a3faf 100644 --- a/paddle/fluid/inference/tensorrt/convert/preln_emb_eltwise_layernorm.cc +++ b/paddle/fluid/inference/tensorrt/convert/preln_emb_eltwise_layernorm.cc @@ -70,7 +70,7 @@ class PrelnEmbEltwiseLayerNormOpConverter : public OpConverter { auto* temp_tensor = temp_var->GetMutable(); (*dims) = temp_tensor->dims(); - auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor, false); + auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor); return temp_data; }; diff --git a/paddle/fluid/inference/tensorrt/convert/preln_skip_layernorm.cc b/paddle/fluid/inference/tensorrt/convert/preln_skip_layernorm.cc index d9eca65fc45dc..8053135cc452c 100644 --- a/paddle/fluid/inference/tensorrt/convert/preln_skip_layernorm.cc +++ b/paddle/fluid/inference/tensorrt/convert/preln_skip_layernorm.cc @@ -48,7 +48,7 @@ class PrelnSkipLayerNormOpConverter : public OpConverter { auto* temp_tensor = temp_var->GetMutable(); (*dims) = temp_tensor->dims(); - auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor, false); + auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor); return temp_data; }; diff --git a/paddle/fluid/inference/tensorrt/convert/prelu_op.cc b/paddle/fluid/inference/tensorrt/convert/prelu_op.cc index 9e81d1177cfe1..d5b5d9bc81b6a 100644 --- a/paddle/fluid/inference/tensorrt/convert/prelu_op.cc +++ b/paddle/fluid/inference/tensorrt/convert/prelu_op.cc @@ -57,8 +57,8 @@ class PReluOpConverter : public OpConverter { layer = engine_->AddDynamicPlugin(&input, input_num, plugin); } else { #if IS_TRT_VERSION_GE(7000) - float* alpha_weight_data = engine_->GetWeightCPUData( - op_desc.Input("Alpha")[0], alpha_tensor, false); + float* alpha_weight_data = + engine_->GetWeightCPUData(op_desc.Input("Alpha")[0], alpha_tensor); TensorRTEngine::Weight alpha_weight{ nvinfer1::DataType::kFLOAT, 
static_cast(alpha_weight_data), static_cast(alpha_tensor->numel())}; diff --git a/paddle/fluid/inference/tensorrt/convert/skip_layernorm.cc b/paddle/fluid/inference/tensorrt/convert/skip_layernorm.cc index 753cd70727643..831e117311771 100644 --- a/paddle/fluid/inference/tensorrt/convert/skip_layernorm.cc +++ b/paddle/fluid/inference/tensorrt/convert/skip_layernorm.cc @@ -40,7 +40,7 @@ class SkipLayerNormOpConverter : public OpConverter { auto* temp_tensor = temp_var->GetMutable(); (*dims) = temp_tensor->dims(); - auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor, false); + auto* temp_data = engine_->GetWeightCPUData(var_name, temp_tensor); return temp_data; }; diff --git a/paddle/fluid/inference/tensorrt/engine.cc b/paddle/fluid/inference/tensorrt/engine.cc index 794475dfc10ca..33386c746ae5a 100644 --- a/paddle/fluid/inference/tensorrt/engine.cc +++ b/paddle/fluid/inference/tensorrt/engine.cc @@ -356,9 +356,7 @@ void TensorRTEngine::SetRuntimeBatch(size_t batch_size) { } float *TensorRTEngine::GetWeightCPUData(const std::string &name, - framework::Tensor *weight_tensor, - bool enable_int8, - const std::vector &scale) { + framework::Tensor *weight_tensor) { static int name_suffix_counter = 0; std::string name_suffix = std::to_string(name_suffix_counter); std::string splitter = "__"; diff --git a/paddle/fluid/inference/tensorrt/engine.h b/paddle/fluid/inference/tensorrt/engine.h index d53a8923af612..f781cd0cb3a8d 100644 --- a/paddle/fluid/inference/tensorrt/engine.h +++ b/paddle/fluid/inference/tensorrt/engine.h @@ -389,8 +389,7 @@ class TensorRTEngine { } float* GetWeightCPUData(const std::string& name, - framework::Tensor* weight_tensor, bool enable_int8, - const std::vector& scale = {}); + framework::Tensor* weight_tensor); // A pointer to CPU memory is needed of the TRT weight. // Before TRT runs, fluid loads weight into GPU storage. 
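The engine.h/engine.cc hunks above, together with the converter changes earlier in this diff, converge on one pattern: GetWeightCPUData() no longer takes an enable_int8 flag or a weight_scale vector, and a converter sets a tensor's dynamic range only when the corresponding scale attribute ("Input_scale", "out_threshold", or a per-output "out_<i>_threshold") is present on the op. The sketch below is illustrative only and is not part of the patch; FakeOpDesc and FakeEngine are hypothetical stand-ins for framework::OpDesc and TensorRTEngine that model just the calls exercised by these hunks.

// Minimal, self-contained sketch of the post-patch converter control flow
// (illustrative only; FakeOpDesc/FakeEngine are hypothetical stand-ins).
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct FakeOpDesc {
  std::map<std::string, float> float_attrs;
  bool HasAttr(const std::string& name) const { return float_attrs.count(name) > 0; }
  float GetAttr(const std::string& name) const { return float_attrs.at(name); }
};

struct FakeEngine {
  // After this patch the weight fetch is unconditional: no enable_int8 flag
  // and no weight_scale vector are passed in.
  std::vector<float> GetWeightCPUData(const std::string& name) {
    std::cout << "fetch CPU weights for " << name << "\n";
    return std::vector<float>(16, 0.5f);
  }
  void SetTensorDynamicRange(const std::string& tensor, float scale) {
    std::cout << "set dynamic range of " << tensor << " to " << scale << "\n";
  }
};

// Mirrors what the conv2d/conv3d/fc/multihead converters now do: weights are
// always read the same way, and dynamic ranges are set only when the scale
// attributes exist on the op.
void ConvertLikePatchedConverter(const FakeOpDesc& op, FakeEngine* engine) {
  if (op.HasAttr("Input_scale")) {
    engine->SetTensorDynamicRange("input", op.GetAttr("Input_scale"));
  }
  std::vector<float> weights = engine->GetWeightCPUData("filter");
  if (op.HasAttr("out_threshold")) {
    engine->SetTensorDynamicRange("output", op.GetAttr("out_threshold"));
  }
  (void)weights;  // a real converter would wrap these into a TensorRT layer
}

int main() {
  FakeEngine engine;
  FakeOpDesc int8_op{{{"Input_scale", 0.12f}, {"out_threshold", 0.34f}}};
  FakeOpDesc fp32_op{};  // no scale attributes: dynamic ranges are skipped
  ConvertLikePatchedConverter(int8_op, &engine);
  ConvertLikePatchedConverter(fp32_op, &engine);
  return 0;
}

In the real converters the same flow operates on nvinfer1::ITensor* inputs and TensorRT layers, and the scale attributes are produced by the delete_weight_dequant_linear_op_pass / delete_quant_dequant_linear_op_pass entries added to kTRTSubgraphPasses earlier in this diff.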
diff --git a/paddle/fluid/inference/tensorrt/op_teller.cc b/paddle/fluid/inference/tensorrt/op_teller.cc index fe0332025ed4a..85c5dc7107fec 100644 --- a/paddle/fluid/inference/tensorrt/op_teller.cc +++ b/paddle/fluid/inference/tensorrt/op_teller.cc @@ -1007,10 +1007,33 @@ bool OpTeller::Tell(const framework::ir::Node* node, bool use_no_calib_int8, auto* y_var_desc = block->FindVar(desc.Input("Y")[0]); const auto x_shape = x_var_desc->GetShape(); const auto y_shape = y_var_desc->GetShape(); + if (op_type == "elementwise_add" && y_var_desc->Persistable()) { + if (y_shape.size() != 1) { + return false; + } + if (y_shape[0] != x_shape[1]) { + return false; + } + } if (x_shape.size() == 1 && y_shape.size() == 1) { VLOG(3) << "Now trt may not support two 1d tensor elementwise op."; return false; } + if (op_type == "elementwise_add" || op_type == "elementwise_mul") { + if (x_var_desc->Persistable()) { + VLOG(3) << "Input X is a parameter which is not supported for " + "elementwise_add/elementwise_mul in tensorrt, swap x and " + "y will work"; + return false; + } + } + if (op_type == "elementwise_sub" || op_type == "elementwise_div") { + if (x_var_desc->Persistable() || y_var_desc->Persistable()) { + VLOG(3) << "Input X or Input Y is a parameter which is not supported " + "for elementwise_sub/elementwise_div in tensorrt"; + return false; + } + } } if (op_type == "stack") { @@ -1479,8 +1502,27 @@ bool OpTeller::Tell(const framework::ir::Node* node, bool use_no_calib_int8, std::vector shape = BOOST_GET_CONST(std::vector, desc.GetAttr("shape")); if (shape.size() >= nvinfer1::Dims::MAX_DIMS) return false; - if (!with_dynamic_shape && (shape[0] == -1 || shape.size() == 1)) + if (!with_dynamic_shape) { + if (shape.size() == 1) { + return false; + } + if (shape[0] == 0) { + return true; + } else { + auto* block = desc.Block(); + auto x_var_name = desc.Input("X")[0]; + auto* x_var_desc = block->FindVar(x_var_name); + const auto x_shape = x_var_desc->GetShape(); + int input_num = std::accumulate(x_shape.begin() + 1, x_shape.end(), 1, + std::multiplies()); + int shape_num = std::accumulate(shape.begin() + 1, shape.end(), 1, + std::multiplies()); + if (input_num == shape_num) { + return true; + } + } return false; + } } if (op_type == "clip") { diff --git a/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu index d6a1cdb9e68a6..c9163e62a2e19 100644 --- a/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu +++ b/paddle/fluid/inference/tensorrt/plugin/elementwise_op_plugin.cu @@ -30,6 +30,11 @@ template struct Mul { __device__ T operator()(const T &a, const T &b) const { return a * b; } }; + +template +struct Div { + __device__ T operator()(const T &a, const T &b) const { return a / b; } +}; } // namespace details template @@ -130,6 +135,10 @@ int ElementWisePlugin::enqueue(int batch_size, const void *const *inputs, elementwise_kernel<<>>( num, x, y, out, prev_size_, batch_size * midd_size_, post_size_, details::Mul()); + } else if (type_ == "div") { + elementwise_kernel<<>>( + num, x, y, out, prev_size_, batch_size * midd_size_, post_size_, + details::Div()); } else { PADDLE_THROW(platform::errors::Fatal( "The %s type elementwise is not implemented in trt plugin.", type_)); @@ -242,11 +251,15 @@ int ElementwisePluginDynamic::enqueue( } else if (type_ == "mul") { elementwise_kernel<<>>( num, x, y, out, prev_size, midd_size, post_size, details::Mul()); + } else if (type_ == "div") { + elementwise_kernel<<>>( + num, x, y, 
out, prev_size, midd_size, post_size, details::Div()); } else { - PADDLE_THROW(platform::errors::Unimplemented( - "Paddle-TRT only support elementwise operation: {add, mul} currently, " - "but got %s.", - type_)); + PADDLE_THROW( + platform::errors::Unimplemented("Paddle-TRT only support elementwise " + "operation: {add, mul, div} currently, " + "but got %s.", + type_)); } return cudaGetLastError() != cudaSuccess; diff --git a/paddle/fluid/inference/tensorrt/test_engine.cc b/paddle/fluid/inference/tensorrt/test_engine.cc index 5a546ddd38bec..347d48f6eb116 100644 --- a/paddle/fluid/inference/tensorrt/test_engine.cc +++ b/paddle/fluid/inference/tensorrt/test_engine.cc @@ -38,6 +38,10 @@ class TensorRTEngineTest : public ::testing::Test { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(platform::CUDAPlace(0)) .get()); + ctx_->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); ctx_->PartialInitWithAllocator(); engine_ = new TensorRTEngine(10, 1 << 10); diff --git a/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc b/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc index 2570325c24abc..8f7e51009223a 100644 --- a/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc +++ b/paddle/fluid/inference/tests/api/analyzer_bert_tester.cc @@ -156,6 +156,8 @@ void profile(bool use_mkldnn = false) { if (use_mkldnn) { config.EnableMKLDNN(); + config.pass_builder()->AppendPass("fc_mkldnn_pass"); + config.pass_builder()->AppendPass("fc_act_mkldnn_fuse_pass"); } std::vector> outputs; diff --git a/paddle/fluid/inference/tests/infer_ut/CMakeLists.txt b/paddle/fluid/inference/tests/infer_ut/CMakeLists.txt index f376cbd4fb302..5c17e2d62d37d 100644 --- a/paddle/fluid/inference/tests/infer_ut/CMakeLists.txt +++ b/paddle/fluid/inference/tests/infer_ut/CMakeLists.txt @@ -83,13 +83,15 @@ if(WITH_GPU) if(NOT WIN32) set(CUDA_LIB "/usr/local/cuda/lib64/" CACHE STRING "CUDA Library") else() - if(NOT DEFINED CUDA_LIB) + set(CUDA_LIB "" CACHE STRING "CUDA_LIB") + if("${TENSORRT_ROOT}" STREQUAL "") if(DEFINED ENV{CUDA_PATH}) set(CUDA_LIB "$ENV{CUDA_PATH}\\lib\\x64") else() set(CUDA_LIB "C:\\Program\ Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2\\lib\\x64") endif() endif() + message(STATUS "Current CUDA lib path: ${CUDA_LIB}") endif(NOT WIN32) endif() @@ -236,47 +238,54 @@ if(WITH_GTEST) include_directories(${GTEST_INSTALL_DIR}/include) add_dependencies(${DEMO_NAME} thirdparty_gtest) IF(WIN32) - target_link_libraries(${DEMO_NAME} ${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest${CMAKE_STATIC_LIBRARY_SUFFIX}) + target_link_libraries(${DEMO_NAME} ${GTEST_LIBRARIES}) ELSE() target_link_libraries(${DEMO_NAME} ${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/libgtest${CMAKE_STATIC_LIBRARY_SUFFIX}) ENDIF(WIN32) endif() if(WIN32) + if("${CMAKE_GENERATOR}" MATCHES "Ninja") + set(LIB_PATH ${CMAKE_BINARY_DIR}) + else() + set(LIB_PATH ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE}) + endif() + if(USE_TENSORRT) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy ${TENSORRT_LIB_DIR}/nvinfer${CMAKE_SHARED_LIBRARY_SUFFIX} - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} COMMAND ${CMAKE_COMMAND} -E copy ${TENSORRT_LIB_DIR}/nvinfer_plugin${CMAKE_SHARED_LIBRARY_SUFFIX} - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} ) if(${TENSORRT_MAJOR_VERSION} EQUAL 7) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy 
${TENSORRT_LIB_DIR}/myelin64_1${CMAKE_SHARED_LIBRARY_SUFFIX} - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE}) + ${LIB_PATH}) endif() endif() if(WITH_MKL) + message("LIB_PATH IS ${LIB_PATH}") add_custom_command(TARGET ${DEMO_NAME} POST_BUILD - COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/mklml.dll ${CMAKE_BINARY_DIR}/Release - COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/libiomp5md.dll ${CMAKE_BINARY_DIR}/Release - COMMAND ${CMAKE_COMMAND} -E copy ${MKLDNN_PATH}/lib/mkldnn.dll ${CMAKE_BINARY_DIR}/Release + COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/mklml.dll ${LIB_PATH} + COMMAND ${CMAKE_COMMAND} -E copy ${MATH_LIB_PATH}/lib/libiomp5md.dll ${LIB_PATH} + COMMAND ${CMAKE_COMMAND} -E copy ${MKLDNN_PATH}/lib/mkldnn.dll ${LIB_PATH} ) else() add_custom_command(TARGET ${DEMO_NAME} POST_BUILD - COMMAND ${CMAKE_COMMAND} -E copy ${OPENBLAS_LIB_PATH}/lib/openblas.dll ${CMAKE_BINARY_DIR}/Release + COMMAND ${CMAKE_COMMAND} -E copy ${OPENBLAS_LIB_PATH}/lib/openblas.dll ${LIB_PATH} ) endif() if(WITH_ONNXRUNTIME) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD COMMAND ${CMAKE_COMMAND} -E copy ${PADDLE_LIB_THIRD_PARTY_PATH}onnxruntime/lib/onnxruntime.dll - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} COMMAND ${CMAKE_COMMAND} -E copy ${PADDLE_LIB_THIRD_PARTY_PATH}paddle2onnx/lib/paddle2onnx.dll - ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + ${LIB_PATH} ) endif() if(NOT WITH_STATIC_LIB) add_custom_command(TARGET ${DEMO_NAME} POST_BUILD - COMMAND ${CMAKE_COMMAND} -E copy "${PADDLE_LIB}/paddle/lib/paddle_inference.dll" ${CMAKE_BINARY_DIR}/${CMAKE_BUILD_TYPE} + COMMAND ${CMAKE_COMMAND} -E copy "${PADDLE_LIB}/paddle/lib/paddle_inference.dll" ${LIB_PATH} ) endif() endif() diff --git a/paddle/fluid/inference/tests/infer_ut/external-cmake/gtest-cpp.cmake b/paddle/fluid/inference/tests/infer_ut/external-cmake/gtest-cpp.cmake index 3e83a21e386b5..b38984314ec85 100644 --- a/paddle/fluid/inference/tests/infer_ut/external-cmake/gtest-cpp.cmake +++ b/paddle/fluid/inference/tests/infer_ut/external-cmake/gtest-cpp.cmake @@ -8,10 +8,12 @@ set(GTEST_REPOSITORY https://github.com/google/googletest.git) set(GTEST_TAG release-1.8.1) INCLUDE_DIRECTORIES(${GTEST_INCLUDE_DIR}) IF(WIN32) + # if use CMAKE_INSTALL_LIBDIR, the path of lib actually is install/gtest/lib/gtest.lib but GTEST_LIBRARIES + # is install/gtest/gtest.lib set(GTEST_LIBRARIES - "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest.lib" CACHE FILEPATH "gtest libraries." FORCE) + "${GTEST_INSTALL_DIR}/lib/gtest.lib" CACHE FILEPATH "gtest libraries." FORCE) set(GTEST_MAIN_LIBRARIES - "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/gtest_main.lib" CACHE FILEPATH "gtest main libraries." FORCE) + "${GTEST_INSTALL_DIR}/lib/gtest_main.lib" CACHE FILEPATH "gtest main libraries." FORCE) ELSE() set(GTEST_LIBRARIES "${GTEST_INSTALL_DIR}/${CMAKE_INSTALL_LIBDIR}/libgtest.a" CACHE FILEPATH "gtest libraries." 
FORCE) diff --git a/paddle/fluid/inference/tests/infer_ut/run.sh b/paddle/fluid/inference/tests/infer_ut/run.sh index 8123d37850034..331608a2cbc01 100755 --- a/paddle/fluid/inference/tests/infer_ut/run.sh +++ b/paddle/fluid/inference/tests/infer_ut/run.sh @@ -22,6 +22,7 @@ DATA_DIR=$4 # dataset TENSORRT_ROOT_DIR=$5 # TensorRT ROOT dir, default to /usr/local/TensorRT WITH_ONNXRUNTIME=$6 MSVC_STATIC_CRT=$7 +CUDA_LIB=$8/lib/x64 inference_install_dir=${PADDLE_ROOT}/build/paddle_inference_install_dir EXIT_CODE=0 # init default exit code WIN_DETECT=$(echo `uname` | grep "Win") # detect current platform @@ -135,7 +136,7 @@ function compile_test() { cd ${build_dir} TEST_NAME=$1 if [ $WIN_DETECT != "" ]; then - cmake .. -G "Visual Studio 15 2017" -A x64 -T host=x64 -DPADDLE_LIB=${inference_install_dir} \ + cmake .. -GNinja -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ -DDEMO_NAME=${TEST_NAME} \ -DWITH_GPU=$TEST_GPU_CPU \ @@ -146,8 +147,9 @@ function compile_test() { -DWITH_GTEST=ON \ -DCMAKE_CXX_FLAGS='/std:c++17' \ -DCMAKE_BUILD_TYPE=Release \ - -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME - msbuild /maxcpucount /property:Configuration=Release ALL_BUILD.vcxproj + -DWITH_ONNXRUNTIME=$WITH_ONNXRUNTIME \ + -DCUDA_LIB="$CUDA_LIB" + ninja else cmake .. -DPADDLE_LIB=${inference_install_dir} \ -DWITH_MKL=$TURN_ON_MKL \ @@ -171,11 +173,7 @@ mkdir -p ${log_dir} cd ${build_dir} rm -rf * -if [ $WIN_DETECT != "" ]; then - exe_dir=${build_dir}/Release -else - exe_dir=${build_dir} -fi; +exe_dir=${build_dir} printf "${YELLOW} start test_resnet50 ${NC} \n"; compile_test "test_resnet50" diff --git a/paddle/fluid/memory/allocation/allocator_facade.cc b/paddle/fluid/memory/allocation/allocator_facade.cc index 7619767123f84..e2730a1b825e9 100644 --- a/paddle/fluid/memory/allocation/allocator_facade.cc +++ b/paddle/fluid/memory/allocation/allocator_facade.cc @@ -85,7 +85,7 @@ PADDLE_DEFINE_EXPORTED_bool(use_virtual_memory_auto_growth, false, // NOTE(Ruibiao): This FLAGS is just to be compatibled with // the old single-stream CUDA allocator. It will be removed // after StreamSafeCudaAllocator has been fully tested. 
-PADDLE_DEFINE_EXPORTED_bool(use_stream_safe_cuda_allocator, false, +PADDLE_DEFINE_EXPORTED_bool(use_stream_safe_cuda_allocator, true, "Enable StreamSafeCUDAAllocator"); PADDLE_DEFINE_EXPORTED_bool(use_cuda_managed_memory, false, @@ -354,8 +354,7 @@ class AllocatorFacadePrivate { } #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) - bool HasCUDAAllocator(const platform::CUDAPlace& place, - const gpuStream_t& stream) { + bool HasCUDAAllocator(const platform::CUDAPlace& place, gpuStream_t stream) { auto it = cuda_allocators_.find(place); if (it == cuda_allocators_.end()) { return false; @@ -366,7 +365,7 @@ class AllocatorFacadePrivate { } const std::shared_ptr& GetAllocator( - const platform::CUDAPlace& place, const gpuStream_t& stream, + const platform::CUDAPlace& place, gpuStream_t stream, bool create_if_not_found = false) { if (LIKELY(!IsCUDAGraphCapturing())) { if (stream == GetDefaultStream(place)) { @@ -407,14 +406,13 @@ class AllocatorFacadePrivate { return iter->second; } - const gpuStream_t& GetDefaultStream(const platform::CUDAPlace& place) const { + gpuStream_t GetDefaultStream(const platform::CUDAPlace& place) const { const std::shared_ptr& allocator = GetDefaultStreamSafeCUDAAllocator(place); return allocator->GetDefaultStream(); } - void SetDefaultStream(const platform::CUDAPlace& place, - const gpuStream_t& stream) { + void SetDefaultStream(const platform::CUDAPlace& place, gpuStream_t stream) { const std::shared_ptr& allocator = GetDefaultStreamSafeCUDAAllocator(place); allocator->SetDefaultStream(stream); @@ -424,7 +422,7 @@ class AllocatorFacadePrivate { } void RecordStream(std::shared_ptr allocation, - const gpuStream_t& stream) { + gpuStream_t stream) { std::shared_ptr stream_safe_cuda_allocation = std::dynamic_pointer_cast(allocation); if (stream_safe_cuda_allocation != nullptr) { @@ -434,7 +432,7 @@ class AllocatorFacadePrivate { } } - const gpuStream_t GetStream( + gpuStream_t GetStream( const std::shared_ptr& allocation) const { const std::shared_ptr stream_safe_cuda_allocation = @@ -1044,7 +1042,7 @@ bool AllocatorFacade::IsStreamSafeCUDAAllocatorUsed() { #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) uint64_t AllocatorFacade::Release(const platform::CUDAPlace& place, - const gpuStream_t& stream) { + gpuStream_t stream) { AllocatorFacadePrivate* m = GetPrivate(); if (!m->IsStreamSafeCUDAAllocatorUsed()) { VLOG(6) << "Warning: StreamSafeCUDAAllocator is not used!"; @@ -1055,12 +1053,12 @@ uint64_t AllocatorFacade::Release(const platform::CUDAPlace& place, } void AllocatorFacade::RecordStream(std::shared_ptr allocation, - const gpuStream_t& stream) { + gpuStream_t stream) { GetPrivate()->RecordStream(allocation, stream); } const std::shared_ptr& AllocatorFacade::GetAllocator( - const platform::Place& place, const gpuStream_t& stream) { + const platform::Place& place, gpuStream_t stream) { AllocatorFacadePrivate* m = GetPrivate(); if (!m->IsStreamSafeCUDAAllocatorUsed()) { @@ -1075,13 +1073,13 @@ const std::shared_ptr& AllocatorFacade::GetAllocator( return m->GetAllocator(place, /* A non-zero num to choose allocator_ */ 1); } -const gpuStream_t AllocatorFacade::GetStream( +gpuStream_t AllocatorFacade::GetStream( const std::shared_ptr& allocation) const { return GetPrivate()->GetStream(allocation); } void AllocatorFacade::SetDefaultStream(const platform::CUDAPlace& place, - const gpuStream_t& stream) { + gpuStream_t stream) { if (m_->IsStreamSafeCUDAAllocatorUsed()) { m_->SetDefaultStream(place, stream); } diff --git 
a/paddle/fluid/memory/allocation/allocator_facade.h b/paddle/fluid/memory/allocation/allocator_facade.h index d5c1e7c908c79..1dea50edccf2e 100644 --- a/paddle/fluid/memory/allocation/allocator_facade.h +++ b/paddle/fluid/memory/allocation/allocator_facade.h @@ -80,15 +80,12 @@ class AllocatorFacade { #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) // TODO(zhiqiu): change gpuStream_t to phi::Stream if needed. - uint64_t Release(const platform::CUDAPlace& place, const gpuStream_t& stream); - void RecordStream(std::shared_ptr allocation, - const gpuStream_t& stream); + uint64_t Release(const platform::CUDAPlace& place, gpuStream_t stream); + void RecordStream(std::shared_ptr allocation, gpuStream_t stream); const std::shared_ptr& GetAllocator(const platform::Place& place, - const gpuStream_t& stream); - const gpuStream_t GetStream( - const std::shared_ptr& allocation) const; - void SetDefaultStream(const platform::CUDAPlace& place, - const gpuStream_t& stream); + gpuStream_t stream); + gpuStream_t GetStream(const std::shared_ptr& allocation) const; + void SetDefaultStream(const platform::CUDAPlace& place, gpuStream_t stream); #endif #ifdef PADDLE_WITH_CUDA diff --git a/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc b/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc index f5e4941d78709..782062283e985 100644 --- a/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc +++ b/paddle/fluid/memory/allocation/auto_growth_best_fit_allocator.cc @@ -48,8 +48,9 @@ AutoGrowthBestFitAllocator::AutoGrowthBestFitAllocator( phi::Allocation *AutoGrowthBestFitAllocator::AllocateImpl( size_t unaligned_size) { - platform::RecordEvent("AutoGrowthBestFitAllocator::Allocate", - platform::TracerEventType::UserDefined, 9 /*level*/); + platform::RecordEvent record("AutoGrowthBestFitAllocator::Allocate", + platform::TracerEventType::UserDefined, + 9 /*level*/); size_t size = AlignedSize(unaligned_size, alignment_); VLOG(10) << "Allocate " << unaligned_size << " bytes, aligned to " << size; @@ -111,8 +112,9 @@ phi::Allocation *AutoGrowthBestFitAllocator::AllocateImpl( } void AutoGrowthBestFitAllocator::FreeImpl(phi::Allocation *allocation) { - platform::RecordEvent("AutoGrowthBestFitAllocator::Free", - platform::TracerEventType::UserDefined, 9 /*level*/); + platform::RecordEvent record("AutoGrowthBestFitAllocator::Free", + platform::TracerEventType::UserDefined, + 9 /*level*/); VLOG(10) << "Free " << allocation->size() << " bytes, ptr = " << allocation->ptr(); std::lock_guard guard(spinlock_); diff --git a/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc b/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc index 7e47d35176bac..80877cb670ba9 100644 --- a/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc +++ b/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.cc @@ -33,7 +33,7 @@ StreamSafeCUDAAllocation::StreamSafeCUDAAllocation( owning_stream_(std::move(owning_stream)), allocator_(allocator->shared_from_this()) {} -void StreamSafeCUDAAllocation::RecordStream(const gpuStream_t& stream) { +void StreamSafeCUDAAllocation::RecordStream(gpuStream_t stream) { VLOG(8) << "Try record stream " << stream << " for address " << ptr(); if (stream == owning_stream_) { return; @@ -90,7 +90,7 @@ bool StreamSafeCUDAAllocation::CanBeFreed() { return true; } -const gpuStream_t& StreamSafeCUDAAllocation::GetOwningStream() const { +gpuStream_t StreamSafeCUDAAllocation::GetOwningStream() const { return owning_stream_; } @@ -102,7 +102,7 @@ void 
StreamSafeCUDAAllocation::RecordGraphCapturingStreams() { } void StreamSafeCUDAAllocation::RecordStreamWithNoGraphCapturing( - const gpuStream_t& stream) { + gpuStream_t stream) { gpuEvent_t record_event; auto it = outstanding_event_map_.find(stream); if (it == outstanding_event_map_.end()) { @@ -154,17 +154,18 @@ StreamSafeCUDAAllocator::~StreamSafeCUDAAllocator() { bool StreamSafeCUDAAllocator::IsAllocThreadSafe() const { return true; } -const gpuStream_t& StreamSafeCUDAAllocator::GetDefaultStream() const { +gpuStream_t StreamSafeCUDAAllocator::GetDefaultStream() const { return default_stream_; } -void StreamSafeCUDAAllocator::SetDefaultStream(const gpuStream_t& stream) { +void StreamSafeCUDAAllocator::SetDefaultStream(gpuStream_t stream) { default_stream_ = stream; } phi::Allocation* StreamSafeCUDAAllocator::AllocateImpl(size_t size) { - platform::RecordEvent("StreamSafeCUDAAllocator::Allocate", - platform::TracerEventType::UserDefined, 9 /*level*/); + platform::RecordEvent record("StreamSafeCUDAAllocator::Allocate", + platform::TracerEventType::UserDefined, + 9 /*level*/); ProcessUnfreedAllocations(); VLOG(8) << "Try allocate " << size << " bytes"; AllocationPtr underlying_allocation; @@ -192,8 +193,9 @@ phi::Allocation* StreamSafeCUDAAllocator::AllocateImpl(size_t size) { } void StreamSafeCUDAAllocator::FreeImpl(phi::Allocation* allocation) { - platform::RecordEvent("StreamSafeCUDAAllocator::Free", - platform::TracerEventType::UserDefined, 9 /*level*/); + platform::RecordEvent record("StreamSafeCUDAAllocator::Free", + platform::TracerEventType::UserDefined, + 9 /*level*/); StreamSafeCUDAAllocation* stream_safe_cuda_allocation = static_cast(allocation); diff --git a/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.h b/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.h index 65af32c701b75..32d3896e66bbf 100644 --- a/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.h +++ b/paddle/fluid/memory/allocation/stream_safe_cuda_allocator.h @@ -39,13 +39,13 @@ class StreamSafeCUDAAllocation : public Allocation { gpuStream_t owning_stream, StreamSafeCUDAAllocator *allocator); - void RecordStream(const gpuStream_t &stream); + void RecordStream(gpuStream_t stream); bool CanBeFreed(); - const gpuStream_t &GetOwningStream() const; + gpuStream_t GetOwningStream() const; private: void RecordGraphCapturingStreams(); - void RecordStreamWithNoGraphCapturing(const gpuStream_t &stream); + void RecordStreamWithNoGraphCapturing(gpuStream_t stream); DecoratedAllocationPtr underlying_allocation_; std::set graph_capturing_stream_set_; std::map outstanding_event_map_; @@ -66,8 +66,8 @@ class StreamSafeCUDAAllocator ~StreamSafeCUDAAllocator(); bool IsAllocThreadSafe() const override; - const gpuStream_t &GetDefaultStream() const; - void SetDefaultStream(const gpuStream_t &stream); + gpuStream_t GetDefaultStream() const; + void SetDefaultStream(gpuStream_t stream); protected: phi::Allocation *AllocateImpl(size_t size) override; diff --git a/paddle/fluid/memory/malloc.cc b/paddle/fluid/memory/malloc.cc index f3de317dd1df5..50180b4b6a1a6 100644 --- a/paddle/fluid/memory/malloc.cc +++ b/paddle/fluid/memory/malloc.cc @@ -57,17 +57,16 @@ void* GetBasePtr(const std::shared_ptr& allocation) { } #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) -uint64_t Release(const platform::CUDAPlace& place, const gpuStream_t& stream) { +uint64_t Release(const platform::CUDAPlace& place, gpuStream_t stream) { return allocation::AllocatorFacade::Instance().Release(place, stream); } -void 
RecordStream(std::shared_ptr allocation, - const gpuStream_t& stream) { +void RecordStream(std::shared_ptr allocation, gpuStream_t stream) { return allocation::AllocatorFacade::Instance().RecordStream(allocation, stream); } -const gpuStream_t GetStream(const std::shared_ptr& allocation) { +gpuStream_t GetStream(const std::shared_ptr& allocation) { return allocation::AllocatorFacade::Instance().GetStream(allocation); } diff --git a/paddle/fluid/memory/malloc.h b/paddle/fluid/memory/malloc.h index e6d910579ba95..796bdcf0ec2f6 100644 --- a/paddle/fluid/memory/malloc.h +++ b/paddle/fluid/memory/malloc.h @@ -50,13 +50,11 @@ extern bool InSameStream(const std::shared_ptr& allocation, extern void* GetBasePtr(const std::shared_ptr& allocation); #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) -extern uint64_t Release(const platform::CUDAPlace& place, - const gpuStream_t& stream); +extern uint64_t Release(const platform::CUDAPlace& place, gpuStream_t stream); -void RecordStream(std::shared_ptr allocation, - const gpuStream_t& stream); +void RecordStream(std::shared_ptr allocation, gpuStream_t stream); -const gpuStream_t GetStream(const std::shared_ptr& allocation); +gpuStream_t GetStream(const std::shared_ptr& allocation); #endif } // namespace memory } // namespace paddle diff --git a/paddle/fluid/memory/malloc_test.cu b/paddle/fluid/memory/malloc_test.cu index 9837d3e4fab6e..2a98727e4b662 100644 --- a/paddle/fluid/memory/malloc_test.cu +++ b/paddle/fluid/memory/malloc_test.cu @@ -120,6 +120,10 @@ TEST(Malloc, CUDADeviceContextMultiStream) { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(place) .get()); + ctx->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); ctx->PartialInitWithAllocator(); dev_ctx.emplace_back(std::move(ctx)); MultiStreamCompute(&data[i], &second_data[i], *dev_ctx[i]); @@ -172,6 +176,10 @@ TEST(Malloc, CUDADeviceContextMultiThreadMultiStream) { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(place) .get()); + ctx->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); ctx->PartialInitWithAllocator(); dev_ctx.emplace_back(std::move(ctx)); threads.push_back(std::thread(MultiStreamCompute, &data[i], &second_data[i], diff --git a/paddle/fluid/operators/CMakeLists.txt b/paddle/fluid/operators/CMakeLists.txt index 3901226216f4d..68eaf1a0ed469 100644 --- a/paddle/fluid/operators/CMakeLists.txt +++ b/paddle/fluid/operators/CMakeLists.txt @@ -102,10 +102,11 @@ endif() set(OP_HEADER_DEPS ${OP_HEADER_DEPS} phi phi_api_utils gather_scatter_kernel) -register_operators(EXCLUDES py_layer_op py_func_op warpctc_op dgc_op load_combine_op lstm_op run_program_op eye_op +register_operators(EXCLUDES py_layer_op py_func_op warpctc_op dgc_op load_combine_op lstm_op run_program_op eye_op quantize_linear_op recurrent_op save_combine_op sparse_attention_op sync_batch_norm_op spectral_op ${OP_MKL_DEPS} DEPS ${OP_HEADER_DEPS}) op_library(run_program_op SRCS run_program_op.cc run_program_op.cu.cc DEPS executor_cache ${OP_HEADER_DEPS}) +op_library(quantize_linear_op DEPS cast_kernel) op_library(save_combine_op DEPS string_array) op_library(load_combine_op DEPS string_array) diff --git a/paddle/fluid/operators/cast_op.cu b/paddle/fluid/operators/cast_op.cu index eb51215790bbc..0afe09ec028e3 100644 --- a/paddle/fluid/operators/cast_op.cu +++ 
b/paddle/fluid/operators/cast_op.cu @@ -19,15 +19,15 @@ namespace ops = paddle::operators; namespace plat = paddle::platform; using CUDA = paddle::platform::CUDADeviceContext; -#define REGISTER_CAST_CUDA_BASE(op_name, ...) \ - REGISTER_OP_CUDA_KERNEL( \ - op_name, ops::CastOpKernel, \ - ops::CastOpKernel, ops::CastOpKernel, \ - ops::CastOpKernel, ops::CastOpKernel, \ - ops::CastOpKernel, ops::CastOpKernel, \ - ops::CastOpKernel, \ - ops::CastOpKernel>, \ - ops::CastOpKernel>, ##__VA_ARGS__); - // See [ why register transfer_dtype_op alias with cast_op? ] in cast_op.cc -REGISTER_CAST_CUDA_BASE(transfer_dtype, ops::CastOpKernel) +REGISTER_OP_CUDA_KERNEL(transfer_dtype, ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel, + ops::CastOpKernel>, + ops::CastOpKernel>, + ops::CastOpKernel); diff --git a/paddle/fluid/operators/clip_op.cc b/paddle/fluid/operators/clip_op.cc index 436d1edcedf1e..6e898d31663fa 100644 --- a/paddle/fluid/operators/clip_op.cc +++ b/paddle/fluid/operators/clip_op.cc @@ -1,21 +1,23 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#include "paddle/fluid/operators/clip_op.h" #include +#include "paddle/fluid/framework/infershape_utils.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/op_version_registry.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/unary.h" namespace paddle { namespace operators { @@ -23,15 +25,6 @@ namespace operators { class ClipOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext* ctx) const override { - OP_INOUT_CHECK(ctx->HasInput("X"), "Input", "X", "clip"); - OP_INOUT_CHECK(ctx->HasOutput("Out"), "Output", "Out", "clip"); - auto x_dims = ctx->GetInputDim("X"); - ctx->SetOutputDim("Out", x_dims); - ctx->ShareLoD("X", /*->*/ "Out"); - } - framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { auto input_data_type = @@ -176,23 +169,15 @@ class ClipDoubleGradOpMaker : public framework::SingleGradOpMaker { } // namespace paddle namespace ops = paddle::operators; +DECLARE_INFER_SHAPE_FUNCTOR(clip, ClipInferShapeFunctor, + PD_INFER_META(phi::UnchangedInferMeta)); REGISTER_OPERATOR(clip, ops::ClipOp, ops::ClipOpMaker, ops::ClipGradOpMaker, ops::ClipGradOpMaker, - ops::ClipInplaceInferer); + ops::ClipInplaceInferer, ClipInferShapeFunctor); REGISTER_OPERATOR(clip_grad, ops::ClipOpGrad, ops::ClipGradInplaceInferer, ops::ClipDoubleGradOpMaker, ops::ClipDoubleGradOpMaker); -REGISTER_OP_CPU_KERNEL( - clip, ops::ClipKernel, - ops::ClipKernel, - ops::ClipKernel, - ops::ClipKernel); -REGISTER_OP_CPU_KERNEL( - clip_grad, ops::ClipGradKernel, - ops::ClipGradKernel, - ops::ClipGradKernel, - ops::ClipGradKernel); REGISTER_OP_VERSION(clip) .AddCheckpoint( diff --git a/paddle/fluid/operators/clip_op.cu b/paddle/fluid/operators/clip_op.cu deleted file mode 100644 index 846354fcb81c5..0000000000000 --- a/paddle/fluid/operators/clip_op.cu +++ /dev/null @@ -1,32 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/clip_op.h" - -namespace ops = paddle::operators; -REGISTER_OP_CUDA_KERNEL( - clip, ops::ClipKernel, - ops::ClipKernel, - ops::ClipKernel, - ops::ClipKernel, - ops::ClipKernel); - -REGISTER_OP_CUDA_KERNEL( - clip_grad, ops::ClipGradKernel, - ops::ClipGradKernel, - ops::ClipGradKernel, - ops::ClipGradKernel, - ops::ClipGradKernel); diff --git a/paddle/fluid/operators/clip_op.h b/paddle/fluid/operators/clip_op.h deleted file mode 100644 index 3b815cd1fa74a..0000000000000 --- a/paddle/fluid/operators/clip_op.h +++ /dev/null @@ -1,196 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. 
-You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#pragma once - -#include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/math/selected_rows_functor.h" -#include "paddle/fluid/platform/transform.h" -#if defined(__NVCC__) || defined(__HIPCC__) -#include "paddle/fluid/operators/elementwise/elementwise_op_impl.cu.h" -#endif - -namespace paddle { -namespace operators { - -using framework::Tensor; -using platform::Transform; - -template -class ClipFunctor { - public: - explicit ClipFunctor(const T min, const T max) : min_(min), max_(max) {} - HOSTDEVICE T operator()(const T x) const { - return x < min_ ? min_ : x > max_ ? max_ : x; - } - - private: - T min_; - T max_; -}; - -template -class ClipGradFunctor { - public: - explicit ClipGradFunctor(const T min, const T max) : min_(min), max_(max) {} - HOSTDEVICE T operator()(const T x, const T y) const { - return (y > min_ && y < max_) ? x : static_cast(0); - } - - private: - T min_; - T max_; -}; - -template -class ClipKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - auto max = static_cast(context.Attr("max")); - Tensor max_cpu; - if (context.HasInput("Max")) { - auto* max_t = context.Input("Max"); - auto* max_data = max_t->data(); - if (platform::is_gpu_place(max_t->place())) { - paddle::framework::TensorCopySync(*max_t, platform::CPUPlace(), - &max_cpu); - max_data = max_cpu.data(); - } - max = max_data[0]; - } - max = static_cast(max); - - auto min = static_cast(context.Attr("min")); - Tensor min_cpu; - if (context.HasInput("Min")) { - auto* min_t = context.Input("Min"); - auto* min_data = min_t->data(); - if (platform::is_gpu_place(min_t->place())) { - paddle::framework::TensorCopySync(*min_t, platform::CPUPlace(), - &min_cpu); - min_data = min_cpu.data(); - } - min = min_data[0]; - } - - PADDLE_ENFORCE_LE(min, max, - platform::errors::InvalidArgument( - "max should be greater than or equal to min. 
" - "But received min = %f, max = %f", - static_cast(min), static_cast(max))); - - auto* x_var = context.InputVar("X"); - if (x_var->IsType()) { - auto* x = context.Input("X"); - auto* out = context.Output("Out"); - T* out_data = out->mutable_data(context.GetPlace()); - const T* x_data = x->data(); - int64_t numel = x->numel(); - if (platform::is_gpu_place(context.GetPlace())) { -#if defined(__NVCC__) || defined(__HIPCC__) - std::vector ins = {x}; - std::vector outs = {out}; - auto functor = ClipFunctor(min, max); - paddle::operators::LaunchSameDimsElementwiseCudaKernel( - context.template device_context(), ins, - &outs, functor); -#endif - } else { - Transform trans; - trans(context.template device_context(), x_data, - x_data + numel, out_data, ClipFunctor(min, max)); - } - } else if (x_var->IsType()) { - auto* x = context.Input("X"); - auto* out = context.Output("Out"); - PADDLE_ENFORCE_NE(x, out, platform::errors::InvalidArgument( - "Inplace clip is not allowed " - "when x is SelectedRows")); - math::scatter::MergeAdd merge_func; - merge_func(context.template device_context(), *x, out); - auto* out_tensor = out->mutable_value(); - auto* out_data = out_tensor->data(); - int64_t numel = out_tensor->numel(); - Transform trans; - trans(context.template device_context(), out_data, - out_data + numel, out_data, ClipFunctor(min, max)); - } else { - PADDLE_THROW(platform::errors::Unavailable( - "ClipOp only supports LoDTensor and SelectedRows.")); - } - } -}; - -template -class ClipGradKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - auto max = static_cast(context.Attr("max")); - Tensor max_cpu; - if (context.HasInput("Max")) { - auto* max_t = context.Input("Max"); - auto* max_data = max_t->data(); - if (platform::is_gpu_place(max_t->place())) { - paddle::framework::TensorCopySync(*max_t, platform::CPUPlace(), - &max_cpu); - max_data = max_cpu.data(); - } - max = max_data[0]; - } - max = static_cast(max); - - auto min = static_cast(context.Attr("min")); - Tensor min_cpu; - if (context.HasInput("Min")) { - auto* min_t = context.Input("Min"); - auto* min_data = min_t->data(); - if (platform::is_gpu_place(min_t->place())) { - paddle::framework::TensorCopySync(*min_t, platform::CPUPlace(), - &min_cpu); - min_data = min_cpu.data(); - } - min = min_data[0]; - } - min = static_cast(min); - - auto* d_out = - context.Input(framework::GradVarName("Out")); - auto* d_x = - context.Output(framework::GradVarName("X")); - if (d_x != nullptr) { - auto* x = context.Input("X"); -#if defined(__NVCC__) || defined(__HIPCC__) - std::vector ins = {d_out, x}; - std::vector outs = {d_x}; - auto functor = ClipGradFunctor(min, max); - d_x->mutable_data(context.GetPlace()); - LaunchSameDimsElementwiseCudaKernel( - context.template device_context(), ins, - &outs, functor); -#else - int64_t numel = d_out->numel(); - auto* d_x_data = d_x->mutable_data(context.GetPlace()); - const T* d_out_data = d_out->data(); - const T* x_data = x->data(); - Transform trans; - trans(context.template device_context(), d_out_data, - d_out_data + numel, x_data, d_x_data, ClipGradFunctor(min, max)); -#endif - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/clip_op_npu.cc b/paddle/fluid/operators/clip_op_npu.cc index 372ba707329bb..17d7ad9796504 100644 --- a/paddle/fluid/operators/clip_op_npu.cc +++ b/paddle/fluid/operators/clip_op_npu.cc @@ -1,18 +1,18 @@ -/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
- -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/clip_op.h" +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/platform/device/npu/npu_op_runner.h" namespace paddle { diff --git a/paddle/fluid/operators/clip_op_xpu.cc b/paddle/fluid/operators/clip_op_xpu.cc index c53bb2d9e4d0c..c551312837274 100644 --- a/paddle/fluid/operators/clip_op_xpu.cc +++ b/paddle/fluid/operators/clip_op_xpu.cc @@ -1,20 +1,19 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
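The clip_op hunks above and in this file drop the operator's hand-written InferShape (which, as the removed code shows, simply copied the input's shape and LoD to the output) in favor of an InferShapeFunctor bound to phi::UnchangedInferMeta, and remove the CPU/GPU ClipKernel/ClipGradKernel registrations along with clip_op.h, now that the kernels come from the phi library (later hunks in this diff switch fake_quantize_op.cc to phi's clip_kernel_impl.h). As a reference for the math the removed ClipFunctor/ClipGradFunctor encoded, here is a minimal illustrative C++ sketch; it is not the phi implementation.

// Illustrative sketch of the clip forward/backward math from the removed
// ClipFunctor / ClipGradFunctor (not the phi kernels).
template <typename T>
T Clip(T x, T min, T max) {
  // Forward: clamp x into [min, max].
  return x < min ? min : (x > max ? max : x);
}

template <typename T>
T ClipGrad(T dout, T x, T min, T max) {
  // Backward: the gradient passes through only where x was strictly inside (min, max).
  return (x > min && x < max) ? dout : static_cast<T>(0);
}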
#ifdef PADDLE_WITH_XPU -#include "paddle/fluid/operators/clip_op.h" #include "paddle/fluid/framework/op_registry.h" namespace paddle { diff --git a/paddle/fluid/operators/collective/c_broadcast_op.cu.cc b/paddle/fluid/operators/collective/c_broadcast_op.cu.cc index 0ad61bb16b51e..7bdf5f0c46ca6 100644 --- a/paddle/fluid/operators/collective/c_broadcast_op.cu.cc +++ b/paddle/fluid/operators/collective/c_broadcast_op.cu.cc @@ -37,7 +37,6 @@ class CBroadcastOpCUDAKernel : public framework::OpKernel { int rid = ctx.Attr("ring_id"); auto place = ctx.GetPlace(); - auto comm = platform::NCCLCommContext::Instance().Get(rid, place); auto map = distributed::ProcessGroupMapFromGid::getInstance(); if (map->has(rid)) { // Use ProcessGroup @@ -46,6 +45,7 @@ class CBroadcastOpCUDAKernel : public framework::OpKernel { return; } + auto comm = platform::NCCLCommContext::Instance().Get(rid, place); gpuStream_t stream = nullptr; if (ctx.Attr("use_calc_stream")) { auto dev_ctx = platform::DeviceContextPool::Instance().Get(place); diff --git a/paddle/fluid/operators/compat/dequantize_linear.pbtxt b/paddle/fluid/operators/compat/dequantize_linear.pbtxt new file mode 100644 index 0000000000000..73b61f8bc29fb --- /dev/null +++ b/paddle/fluid/operators/compat/dequantize_linear.pbtxt @@ -0,0 +1,25 @@ +type: "dequantize_linear" +def { + inputs { + name: "X" + } + inputs { + name: "Scale" + } + inputs { + name: "ZeroPoint" + } + outputs { + name: "Y" + } + attrs { + name: "bit_length" + type: INT + } + attrs { + name: "quant_axis" + type: INT + } +} +extra { +} diff --git a/paddle/fluid/operators/compat/mul.pbtxt b/paddle/fluid/operators/compat/mul.pbtxt index 617775eaaae9e..056f799c6c49c 100644 --- a/paddle/fluid/operators/compat/mul.pbtxt +++ b/paddle/fluid/operators/compat/mul.pbtxt @@ -60,15 +60,7 @@ extra { type: BOOLEAN } attrs { - name: "X_scale" - type: FLOAT - } - attrs { - name: "weight_scale" - type: FLOAT - } - attrs { - name: "out_scale" + name: "Input_scale" type: FLOAT } attrs { diff --git a/paddle/fluid/operators/compat/quantize_linear.pbtxt b/paddle/fluid/operators/compat/quantize_linear.pbtxt new file mode 100644 index 0000000000000..7a3ca515029c3 --- /dev/null +++ b/paddle/fluid/operators/compat/quantize_linear.pbtxt @@ -0,0 +1,25 @@ +type: "quantize_linear" +def { + inputs { + name: "X" + } + inputs { + name: "Scale" + } + inputs { + name: "ZeroPoint" + } + outputs { + name: "Y" + } + attrs { + name: "bit_length" + type: INT + } + attrs { + name: "quant_axis" + type: INT + } +} +extra { +} diff --git a/paddle/fluid/operators/concat_op.cc b/paddle/fluid/operators/concat_op.cc index 059fafa3e7f4d..a467f2dbee7c9 100644 --- a/paddle/fluid/operators/concat_op.cc +++ b/paddle/fluid/operators/concat_op.cc @@ -216,18 +216,3 @@ REGISTER_OPERATOR(concat_grad, ops::ConcatOpGrad, ops::ConcatDoubleGradOpMaker, ops::ConcatDoubleGradOpMaker, ops::ConcatOpGradNoNeedBufferVarInferer); - -REGISTER_OP_CPU_KERNEL( - concat_grad, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel>, - ops::ConcatGradKernel>); diff --git a/paddle/fluid/operators/concat_op.cu.cc b/paddle/fluid/operators/concat_op.cu.cc deleted file mode 100644 index f7b64f16e2d8b..0000000000000 --- a/paddle/fluid/operators/concat_op.cu.cc +++ /dev/null @@ -1,36 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. 
- -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "paddle/fluid/operators/concat_op.h" -#include "paddle/fluid/platform/bfloat16.h" -#include "paddle/fluid/platform/complex.h" -#include "paddle/fluid/platform/float16.h" - -namespace ops = paddle::operators; -namespace plat = paddle::platform; - -REGISTER_OP_CUDA_KERNEL( - concat_grad, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel, - ops::ConcatGradKernel>, - ops::ConcatGradKernel>); diff --git a/paddle/fluid/operators/concat_op.h b/paddle/fluid/operators/concat_op.h index ec43e2ad374db..50aca54c12dec 100644 --- a/paddle/fluid/operators/concat_op.h +++ b/paddle/fluid/operators/concat_op.h @@ -39,62 +39,6 @@ static inline int64_t ComputeAxis(int64_t axis, int64_t rank) { } return axis > 0 ? axis : 0; } -template -class ConcatGradKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const { - auto* out_grad = - ctx.Input(framework::GradVarName("Out")); - auto ins = ctx.MultiInput("X"); - auto out_var_names = ctx.OutputNames(framework::GradVarName("X")); - auto outs = - ctx.MultiOutput(framework::GradVarName("X")); - - { - auto dx = outs; - auto x = ins; - for (size_t i = 0; i < dx.size(); ++i) { - if (dx[i] != nullptr) { - dx[i]->set_lod(x[i]->lod()); - } - } - } - PADDLE_ENFORCE_NOT_NULL(ins[0], - platform::errors::NotFound( - "The first input tensor is not initalized.")); - - auto axis = ctx.Attr("axis"); - if (ctx.HasInput("AxisTensor")) { - auto* axis_tensor = ctx.Input("AxisTensor"); - axis = GetDataFromTensor(axis_tensor)[0]; - } - axis = ComputeAxis(static_cast(axis), - static_cast(ins[0]->dims().size())); - // get output tensor that the name is not kEmptyVarName - std::vector outputs; - for (size_t j = 0; j < outs.size(); ++j) { - if (out_var_names[j] != framework::kEmptyVarName && - outs[j]->numel() != 0UL) { - outs[j]->mutable_data(ctx.GetPlace()); - outputs.push_back(outs[j]); - } else { - outputs.push_back(nullptr); - } - } - auto& dev_ctx = ctx.template device_context(); - - // Sometimes direct copies will be faster, this maybe need deeply analysis. 
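      // When axis == 0 and there are fewer than ten outputs, each input's
      // gradient is a contiguous block of out_grad, so a direct strided copy
      // along axis 0 is used; otherwise the generic SplitFunctor splits
      // out_grad back along `axis` into one gradient tensor per input.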
- if (axis == 0 && outs.size() < 10) { - std::vector ref_shape; - ref_shape.insert(ref_shape.begin(), ins.begin(), ins.end()); - StridedMemcpyWithAxis0(dev_ctx, *out_grad, ref_shape, &outputs); - } else { - math::SplitFunctor split_functor; - split_functor(dev_ctx, *out_grad, ctx.MultiInput("X"), - static_cast(axis), &outputs); - } - } -}; } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/conv_op.cc b/paddle/fluid/operators/conv_op.cc index 9be63a85fc0de..405794783812b 100644 --- a/paddle/fluid/operators/conv_op.cc +++ b/paddle/fluid/operators/conv_op.cc @@ -844,8 +844,6 @@ framework::OpKernelType ConvOpDoubleGrad::GetExpectedKernelType( } // namespace paddle namespace ops = paddle::operators; -DECLARE_INFER_SHAPE_FUNCTOR(conv2d, Conv2dInferShapeFunctor, - PD_INFER_META(phi::ConvInferMeta)); REGISTER_OPERATOR(conv2d, ops::ConvOp, ops::Conv2DOpMaker, ops::ConvOpInferVarType, ops::Conv2DGradMaker, @@ -856,8 +854,6 @@ REGISTER_OPERATOR(conv2d_grad, ops::ConvOpGrad, REGISTER_OPERATOR(conv2d_grad_grad, ops::ConvOpDoubleGrad); // depthwise convolution op -DECLARE_INFER_SHAPE_FUNCTOR(depthwise_conv2d, DepthwiseConv2dInferShapeFunctor, - PD_INFER_META(phi::ConvInferMeta)); REGISTER_OPERATOR(depthwise_conv2d, ops::ConvOp, ops::Conv2DOpMaker, ops::ConvOpInferVarType, ops::Conv2DGradMaker, @@ -867,8 +863,6 @@ REGISTER_OPERATOR(depthwise_conv2d_grad, ops::ConvOpGrad, ops::Conv2DDoubleGradMaker); REGISTER_OPERATOR(depthwise_conv2d_grad_grad, ops::ConvOpDoubleGrad); -DECLARE_INFER_SHAPE_FUNCTOR(conv3d, Conv3dInferShapeFunctor, - PD_INFER_META(phi::ConvInferMeta)); REGISTER_OPERATOR(conv3d, ops::ConvOp, ops::Conv3DOpMaker, ops::ConvOpInferVarType, ops::Conv3DGradMaker, diff --git a/paddle/fluid/operators/detection/CMakeLists.txt b/paddle/fluid/operators/detection/CMakeLists.txt index 568c7982cfc7c..f10c801919999 100644 --- a/paddle/fluid/operators/detection/CMakeLists.txt +++ b/paddle/fluid/operators/detection/CMakeLists.txt @@ -66,6 +66,7 @@ detection_library(yolo_box_op SRCS yolo_box_op.cc) detection_library(box_decoder_and_assign_op SRCS box_decoder_and_assign_op.cc box_decoder_and_assign_op.cu) detection_library(sigmoid_focal_loss_op SRCS sigmoid_focal_loss_op.cc sigmoid_focal_loss_op.cu) detection_library(retinanet_detection_output_op SRCS retinanet_detection_output_op.cc) +detection_library(nms_op SRCS nms_op.cc nms_op.cu) if(WITH_GPU OR WITH_ROCM) set(TMPDEPS memory) diff --git a/paddle/fluid/operators/detection/nms_op.cc b/paddle/fluid/operators/detection/nms_op.cc new file mode 100644 index 0000000000000..f6dc44eb5fc2d --- /dev/null +++ b/paddle/fluid/operators/detection/nms_op.cc @@ -0,0 +1,147 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/operators/detection/nms_op.h" +#include + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class NMSOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Boxes", + "(Tensor) " + "Boxes is a Tensor with shape [N, 4] " + "N is the number of boxes " + "in last dimension in format [x1, x2, y1, y2] " + "the relation should be ``0 <= x1 < x2 && 0 <= y1 < y2``."); + + AddOutput("KeepBoxesIdxs", + "(Tensor) " + "KeepBoxesIdxs is a Tensor with shape [N] "); + AddAttr( + "iou_threshold", + "iou_threshold is a threshold value used to compress similar boxes " + "boxes with IoU > iou_threshold will be considered as overlapping " + "and just one of them can be kept.") + .SetDefault(1.0f) + .AddCustomChecker([](const float& iou_threshold) { + PADDLE_ENFORCE_LE(iou_threshold, 1.0f, + platform::errors::InvalidArgument( + "iou_threshold should less equal than 1.0 " + "but got %f", + iou_threshold)); + PADDLE_ENFORCE_GE(iou_threshold, 0.0f, + platform::errors::InvalidArgument( + "iou_threshold should greater equal than 0.0 " + "but got %f", + iou_threshold)); + }); + AddComment(R"DOC( + NMS Operator. + This Operator is used to perform Non-Maximum Compress for input boxes. + Indices of boxes kept by NMS will be sorted by scores and output. + )DOC"); + } +}; + +class NMSOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + OP_INOUT_CHECK(ctx->HasInput("Boxes"), "Input", "Boxes", "NMS"); + OP_INOUT_CHECK(ctx->HasOutput("KeepBoxesIdxs"), "Output", "KeepBoxesIdxs", + "NMS"); + + auto boxes_dim = ctx->GetInputDim("Boxes"); + PADDLE_ENFORCE_EQ(boxes_dim.size(), 2, + platform::errors::InvalidArgument( + "The Input Boxes must be 2-dimention " + "whose shape must be [N, 4] " + "N is the number of boxes " + "in last dimension in format [x1, x2, y1, y2]. 
")); + auto num_boxes = boxes_dim[0]; + + ctx->SetOutputDim("KeepBoxesIdxs", {num_boxes}); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + OperatorWithKernel::IndicateVarDataType(ctx, "Boxes"), ctx.GetPlace()); + } +}; + +template +static void NMS(const T* boxes_data, int64_t* output_data, float threshold, + int64_t num_boxes) { + auto num_masks = CeilDivide(num_boxes, 64); + std::vector masks(num_masks, 0); + + for (int64_t i = 0; i < num_boxes; ++i) { + if (masks[i / 64] & 1ULL << (i % 64)) continue; + T box_1[4]; + for (int k = 0; k < 4; ++k) { + box_1[k] = boxes_data[i * 4 + k]; + } + for (int64_t j = i + 1; j < num_boxes; ++j) { + if (masks[j / 64] & 1ULL << (j % 64)) continue; + T box_2[4]; + for (int k = 0; k < 4; ++k) { + box_2[k] = boxes_data[j * 4 + k]; + } + bool is_overlap = CalculateIoU(box_1, box_2, threshold); + if (is_overlap) { + masks[j / 64] |= 1ULL << (j % 64); + } + } + } + + int64_t output_data_idx = 0; + for (int64_t i = 0; i < num_boxes; ++i) { + if (masks[i / 64] & 1ULL << (i % 64)) continue; + output_data[output_data_idx++] = i; + } + + for (; output_data_idx < num_boxes; ++output_data_idx) { + output_data[output_data_idx] = 0; + } +} + +template +class NMSKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + const Tensor* boxes = context.Input("Boxes"); + Tensor* output = context.Output("KeepBoxesIdxs"); + int64_t* output_data = output->mutable_data(context.GetPlace()); + auto threshold = context.template Attr("iou_threshold"); + NMS(boxes->data(), output_data, threshold, boxes->dims()[0]); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +REGISTER_OPERATOR( + nms, ops::NMSOp, ops::NMSOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(nms, ops::NMSKernel, ops::NMSKernel); diff --git a/paddle/fluid/operators/detection/nms_op.cu b/paddle/fluid/operators/detection/nms_op.cu new file mode 100644 index 0000000000000..b6027e67d6ced --- /dev/null +++ b/paddle/fluid/operators/detection/nms_op.cu @@ -0,0 +1,108 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include "paddle/fluid/operators/detection/nms_op.h" +#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" + +static const int64_t threadsPerBlock = sizeof(int64_t) * 8; + +namespace paddle { +namespace operators { + +using framework::Tensor; + +template +static __global__ void NMS(const T* boxes_data, float threshold, + int64_t num_boxes, uint64_t* masks) { + auto raw_start = blockIdx.y; + auto col_start = blockIdx.x; + if (raw_start > col_start) return; + + const int raw_last_storage = + min(num_boxes - raw_start * threadsPerBlock, threadsPerBlock); + const int col_last_storage = + min(num_boxes - col_start * threadsPerBlock, threadsPerBlock); + + if (threadIdx.x < raw_last_storage) { + uint64_t mask = 0; + auto current_box_idx = raw_start * threadsPerBlock + threadIdx.x; + const T* current_box = boxes_data + current_box_idx * 4; + for (int i = 0; i < col_last_storage; ++i) { + const T* target_box = boxes_data + (col_start * threadsPerBlock + i) * 4; + if (CalculateIoU(current_box, target_box, threshold)) { + mask |= 1ULL << i; + } + } + const int blocks_per_line = CeilDivide(num_boxes, threadsPerBlock); + masks[current_box_idx * blocks_per_line + col_start] = mask; + } +} + +template +class NMSCudaKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + const Tensor* boxes = context.Input("Boxes"); + Tensor* output = context.Output("KeepBoxesIdxs"); + auto* output_data = output->mutable_data(context.GetPlace()); + + auto threshold = context.template Attr("iou_threshold"); + const int64_t num_boxes = boxes->dims()[0]; + const auto blocks_per_line = CeilDivide(num_boxes, threadsPerBlock); + + dim3 block(threadsPerBlock); + dim3 grid(blocks_per_line, blocks_per_line); + + auto mask_data = + memory::Alloc(context.cuda_device_context(), + num_boxes * blocks_per_line * sizeof(uint64_t)); + uint64_t* mask_dev = reinterpret_cast(mask_data->ptr()); + NMS<<>>( + boxes->data(), threshold, num_boxes, mask_dev); + + std::vector mask_host(num_boxes * blocks_per_line); + memory::Copy(platform::CPUPlace(), mask_host.data(), context.GetPlace(), + mask_dev, num_boxes * blocks_per_line * sizeof(uint64_t), + context.cuda_device_context().stream()); + + std::vector remv(blocks_per_line); + + std::vector keep_boxes_idxs(num_boxes); + int64_t* output_host = keep_boxes_idxs.data(); + + int64_t last_box_num = 0; + for (int64_t i = 0; i < num_boxes; ++i) { + auto remv_element_id = i / threadsPerBlock; + auto remv_bit_id = i % threadsPerBlock; + if (!(remv[remv_element_id] & 1ULL << remv_bit_id)) { + output_host[last_box_num++] = i; + uint64_t* current_mask = mask_host.data() + i * blocks_per_line; + for (auto j = remv_element_id; j < blocks_per_line; ++j) { + remv[j] |= current_mask[j]; + } + } + } + memory::Copy(context.GetPlace(), output_data, platform::CPUPlace(), + output_host, sizeof(int64_t) * num_boxes, + context.cuda_device_context().stream()); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(nms, ops::NMSCudaKernel, + ops::NMSCudaKernel); diff --git a/paddle/fluid/operators/detection/nms_op.h b/paddle/fluid/operators/detection/nms_op.h new file mode 100644 index 0000000000000..dce8f47f0174e --- /dev/null +++ b/paddle/fluid/operators/detection/nms_op.h @@ -0,0 +1,51 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/operator.h" + +namespace paddle { +namespace operators { + +HOSTDEVICE static inline int64_t CeilDivide(int64_t n, int64_t m) { + return (n + m - 1) / m; +} + +template +HOSTDEVICE inline bool CalculateIoU(const T* const box_1, const T* const box_2, + const float threshold) { + auto box_1_x0 = box_1[0], box_1_y0 = box_1[1]; + auto box_1_x1 = box_1[2], box_1_y1 = box_1[3]; + auto box_2_x0 = box_2[0], box_2_y0 = box_2[1]; + auto box_2_x1 = box_2[2], box_2_y1 = box_2[3]; + + auto inter_box_x0 = box_1_x0 > box_2_x0 ? box_1_x0 : box_2_x0; + auto inter_box_y0 = box_1_y0 > box_2_y0 ? box_1_y0 : box_2_y0; + auto inter_box_x1 = box_1_x1 < box_2_x1 ? box_1_x1 : box_2_x1; + auto inter_box_y1 = box_1_y1 < box_2_y1 ? box_1_y1 : box_2_y1; + + auto inter_width = + inter_box_x1 - inter_box_x0 > 0 ? inter_box_x1 - inter_box_x0 : 0; + auto inter_height = + inter_box_y1 - inter_box_y0 > 0 ? inter_box_y1 - inter_box_y0 : 0; + auto inter_area = inter_width * inter_height; + auto union_area = (box_1_x1 - box_1_x0) * (box_1_y1 - box_1_y0) + + (box_2_x1 - box_2_x0) * (box_2_y1 - box_2_y0) - inter_area; + return inter_area / union_area > threshold; +} + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/dropout_impl.cu.h b/paddle/fluid/operators/dropout_impl.cu.h index 83ca9ace20d05..6af8c925ff580 100644 --- a/paddle/fluid/operators/dropout_impl.cu.h +++ b/paddle/fluid/operators/dropout_impl.cu.h @@ -38,43 +38,9 @@ limitations under the License. */ #include "paddle/phi/kernels/funcs/distribution_helper.h" #include "paddle/phi/kernels/funcs/functors.h" -DECLARE_bool(use_curand); - namespace paddle { namespace operators { -template -struct DstMaskGenerator { - const float dropout_prob_; - const bool is_upscale_in_train_; - using MT = typename details::MPTypeTrait::Type; - MT factor; - HOSTDEVICE inline DstMaskGenerator(const float dropout_prob, - const bool is_upscale_in_train) - : dropout_prob_(dropout_prob), is_upscale_in_train_(is_upscale_in_train) { - factor = static_cast(1.0f / (1.0f - dropout_prob_)); - } - - HOSTDEVICE inline void operator()(OutT* dst, const T1* src_val, - const T2* rand, int num) const { - static constexpr int kCount = - phi::funcs::uniform_distribution::kReturnsCount; -// 0 ~ kCount -1 is dist , kCount ~ 2 * kCount - 1 is mask -#pragma unroll - for (int i = 0; i < kCount; i++) { - if (rand[i] < dropout_prob_) { - dst[i] = static_cast(0); - dst[i + kCount] = dst[i]; - } else { - dst[i] = is_upscale_in_train_ - ? 
static_cast(static_cast(src_val[i]) * factor) - : static_cast(src_val[i]); - dst[i + kCount] = static_cast(1); - } - } - } -}; - template struct DstMaskFunctor { const float retain_prob_; @@ -113,7 +79,7 @@ __global__ void VectorizedRandomGenerator(const size_t n, uint64_t seed, const T* src, MaskType* mask, T* dst, bool is_upscale_in_train, uint64_t increment, - size_t main_offset, bool use_curand) { + size_t main_offset) { size_t idx = static_cast(BLOCK_ID_X * BLOCK_NUM_X); static constexpr int kCount = phi::funcs::uniform_distribution::kReturnsCount; @@ -135,76 +101,41 @@ __global__ void VectorizedRandomGenerator(const size_t n, uint64_t seed, int deal_size = BLOCK_NUM_X * kCount; size_t fix = idx * kCount; - if (use_curand) { - auto dst_functor = - DstMaskFunctor(1.0f - dropout_prob, is_upscale_in_train); - for (; fix < main_offset; fix += stride) { - kps::ReadData(&dst_mask[0], src + fix, deal_size); - kps::ElementwiseRandom(&rands[0], Rand(), - &state); - // dst - kps::OperatorTernary>( - &dst_mask[0], &dst_mask[0], &rands[0], dst_functor, kCount); - kps::WriteData(dst + fix, &dst_mask[0], - deal_size); - // mask - kps::ElementwiseUnary( - &mask_result[0], &dst_mask[kCount], Cast()); - kps::WriteData(mask + fix, &mask_result[0], - deal_size); - if (fix > idx * kCount + 1) { - __syncthreads(); - } - } - int remainder = n - fix; - if (remainder > 0) { - kps::ReadData(&dst_mask[0], src + fix, remainder); - kps::ElementwiseRandom(&rands[0], Rand(), - &state); - // dst - kps::OperatorTernary>( - &dst_mask[0], &dst_mask[0], &rands[0], dst_functor, kCount); - kps::WriteData(dst + fix, &dst_mask[0], remainder); - // mask - kps::ElementwiseUnary( - &mask_result[0], &dst_mask[kCount], Cast()); - kps::WriteData(mask + fix, &mask_result[0], - remainder); + + auto dst_functor = + DstMaskFunctor(1.0f - dropout_prob, is_upscale_in_train); + for (; fix < main_offset; fix += stride) { + kps::ReadData(&dst_mask[0], src + fix, deal_size); + kps::ElementwiseRandom(&rands[0], Rand(), + &state); + // dst + kps::OperatorTernary>( + &dst_mask[0], &dst_mask[0], &rands[0], dst_functor, kCount); + kps::WriteData(dst + fix, &dst_mask[0], deal_size); + // mask + kps::ElementwiseUnary( + &mask_result[0], &dst_mask[kCount], Cast()); + kps::WriteData(mask + fix, &mask_result[0], + deal_size); + if (fix > idx * kCount + 1) { __syncthreads(); } - } else { - auto dst_functor = - DstMaskGenerator(dropout_prob, is_upscale_in_train); - for (; fix < main_offset; fix += stride) { - kps::ReadData(&dst_mask[0], src + fix, deal_size); - kps::ElementwiseRandom(&rands[0], Rand(), - &state); - // dst - kps::OperatorTernary>( - &dst_mask[0], &dst_mask[0], &rands[0], dst_functor, kCount); - kps::WriteData(dst + fix, &dst_mask[0], - deal_size); - // mask - kps::ElementwiseUnary( - &mask_result[0], &dst_mask[kCount], Cast()); - kps::WriteData(mask + fix, &mask_result[0], - deal_size); - } - int remainder = n - fix; - if (remainder > 0) { - kps::ReadData(&dst_mask[0], src + fix, remainder); - kps::ElementwiseRandom(&rands[0], Rand(), - &state); - // dst - kps::OperatorTernary>( - &dst_mask[0], &dst_mask[0], &rands[0], dst_functor, kCount); - kps::WriteData(dst + fix, &dst_mask[0], remainder); - // mask - kps::ElementwiseUnary( - &mask_result[0], &dst_mask[kCount], Cast()); - kps::WriteData(mask + fix, &mask_result[0], - remainder); - } + } + int remainder = n - fix; + if (remainder > 0) { + kps::ReadData(&dst_mask[0], src + fix, remainder); + kps::ElementwiseRandom(&rands[0], Rand(), + &state); + // dst + kps::OperatorTernary>( + 
&dst_mask[0], &dst_mask[0], &rands[0], dst_functor, kCount); + kps::WriteData(dst + fix, &dst_mask[0], remainder); + // mask + kps::ElementwiseUnary( + &mask_result[0], &dst_mask[kCount], Cast()); + kps::WriteData(mask + fix, &mask_result[0], + remainder); + __syncthreads(); } } @@ -251,13 +182,11 @@ void DropoutFwGPUKernelDriver(const phi::GPUContext& dev_ctx, bool is_test, size_t grid_size = gpu_config.GetGridSize(); size_t block_size = gpu_config.GetBlockSize(); - if (FLAGS_use_curand) { - int64_t device_id = dev_ctx.GetPlace().GetDeviceId(); - const auto& prop = platform::GetDeviceProperties(device_id); - size_t max_grid_size = prop.maxThreadsPerMultiProcessor * - prop.multiProcessorCount / block_size; - grid_size = std::min(grid_size, max_grid_size); - } + int64_t device_id = dev_ctx.GetPlace().GetDeviceId(); + const auto& prop = platform::GetDeviceProperties(device_id); + size_t max_grid_size = prop.maxThreadsPerMultiProcessor * + prop.multiProcessorCount / block_size; + grid_size = std::min(grid_size, max_grid_size); auto offset = ((x_numel - 1) / (grid_size * block_size * kVecSize) + 1) * kVecSize; @@ -268,7 +197,7 @@ void DropoutFwGPUKernelDriver(const phi::GPUContext& dev_ctx, bool is_test, VectorizedRandomGenerator<<>>( size, seed_data, dropout_prob, x_data, mask_data, y_data, - upscale_in_train, increment, main_offset, FLAGS_use_curand); + upscale_in_train, increment, main_offset); } else { if (upscale_in_train) { // todo: can y share with data with x directly? diff --git a/paddle/fluid/operators/dropout_op.cc b/paddle/fluid/operators/dropout_op.cc index 3d9950902acfe..8d033ea3194b9 100644 --- a/paddle/fluid/operators/dropout_op.cc +++ b/paddle/fluid/operators/dropout_op.cc @@ -16,7 +16,7 @@ limitations under the License. */ #include #include "paddle/fluid/framework/infershape_utils.h" #include "paddle/fluid/framework/op_registry.h" -#include "paddle/phi/infermeta/unary.h" +#include "paddle/phi/infermeta/binary.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/dropout_op_xpu.cc b/paddle/fluid/operators/dropout_op_xpu.cc index 7d8660f238abc..851f26ee0e717 100644 --- a/paddle/fluid/operators/dropout_op_xpu.cc +++ b/paddle/fluid/operators/dropout_op_xpu.cc @@ -42,7 +42,13 @@ class DropoutXPUKernel : public framework::OpKernel { if (!context.Attr("is_test")) { int seed_data = 0; if (seed) { - seed_data = *(seed->data()); + if (platform::is_xpu_place(seed->place())) { + memory::Copy(platform::CPUPlace(), &seed_data, seed->place(), + seed->data(), sizeof(int)); + } else { + seed_data = *(seed->data()); + } + } else { seed_data = context.Attr("fix_seed") ? context.Attr("seed") : 0; diff --git a/paddle/fluid/operators/fake_dequantize_op.cu b/paddle/fluid/operators/fake_dequantize_op.cu index c0ec44909a5f3..582f0627b2044 100644 --- a/paddle/fluid/operators/fake_dequantize_op.cu +++ b/paddle/fluid/operators/fake_dequantize_op.cu @@ -12,142 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ +#include "paddle/fluid/operators/fake_dequantize_op.cu.h" #include "paddle/fluid/operators/fake_dequantize_op.h" -namespace paddle { -namespace operators { - -template -__global__ void KeDequantize(const T* in, const T* scale, T max_range, int num, - T* out) { - const int idx = threadIdx.x + blockIdx.x * blockDim.x; - if (idx < num) { - out[idx] = in[idx] * scale[0] / max_range; - } -} - -template -struct DequantizeFunctor { - void operator()(const platform::CUDADeviceContext& dev_ctx, - const framework::Tensor* in, const framework::Tensor* scale, - T max_range, framework::Tensor* out) { - const T* in_data = in->data(); - const T* scale_factor = scale->data(); - T* out_data = out->mutable_data(dev_ctx.GetPlace()); - - int num = in->numel(); - int block = 512; - int grid = (num + block - 1) / block; - - KeDequantize<<>>( - in_data, scale_factor, max_range, num, out_data); - } -}; - -template -__global__ void DequantizeOneScaleQuantAxis0(const T* in, const T* scale, - T max_range, int num, int channel, - T* out) { - int tid = threadIdx.x; - int channel_size = num / channel; - const T* in_c = in + blockIdx.x * channel_size; - T* out_c = out + blockIdx.x * channel_size; - for (int i = tid; i < channel_size; i += blockDim.x) { - out_c[i] = in_c[i] * scale[blockIdx.x] / max_range; - } -} - -template -__global__ void DequantizeOneScaleQuantAxisN(const T* in, const T* scale, - const T max_range, - const int64_t num, - const int n_scales, - const int quant_stride, T* out) { - int64_t idx = blockDim.x * blockIdx.x + threadIdx.x; - for (int64_t i = idx; i < num; i += blockDim.x * gridDim.x) { - T s = scale[(i / quant_stride) % n_scales]; - out[i] = in[i] * s / max_range; - } -} - -template -__global__ void DequantizeTwoScale(const T* in, const T* scale_one, - const T* scale_two, T max_range, int num, - int iter_size, int channel, T* out) { - int tid = threadIdx.x; - int channel_size = num / (iter_size * channel); - int scale_index = blockIdx.x % channel; - const T* in_c = in + blockIdx.x * channel_size; - T* out_c = out + blockIdx.x * channel_size; - for (int i = tid; i < channel_size; i += blockDim.x) { - out_c[i] = in_c[i] * scale_one[scale_index] * scale_two[0] / max_range; - } -} - -template -struct ChannelDequantizeFunctor { - void operator()(const platform::CUDADeviceContext& dev_ctx, - const framework::Tensor* in, const framework::Tensor** scales, - const int scale_num, T max_range, const int quant_axis, - const int x_num_col_dims, framework::Tensor* out) { - auto in_dims = in->dims(); - const T* in_data = in->data(); - T* out_data = out->mutable_data(dev_ctx.GetPlace()); - if (scale_num == 1) { - int64_t num = in->numel(); - const T* scale_factor = scales[0]->data(); - if (quant_axis == 0) { - int grid = in_dims[0]; - int block = 1024; - DequantizeOneScaleQuantAxis0<<>>( - in_data, scale_factor, max_range, num, in_dims[0], out_data); - } else { - int quant_stride = 1; - for (int i = quant_axis + 1; i < in_dims.size(); i++) { - quant_stride *= in_dims[i]; - } - - int64_t block_size = std::min( - num, static_cast(dev_ctx.GetMaxThreadsPerBlock() / 4)); - int64_t max_threads = - dev_ctx.GetMaxPhysicalThreadCount(); // SM * block_per_SM - const int64_t max_blocks = std::max( - ((max_threads - 1) / block_size + 1), static_cast(1)); - const int64_t grid_size = - std::min(max_blocks, (num + block_size - 1) / block_size); - - DequantizeOneScaleQuantAxisN< - T><<>>( - in_data, scale_factor, max_range, num, in_dims[quant_axis], - quant_stride, out_data); - } - } else if (scale_num == 2) { - // Not 
need to consider quant_axis - int num = in->numel(); - int iter_size = 1; - for (int i = 0; i < x_num_col_dims; i++) { - iter_size *= in->dims()[i]; - } - int channel = in->dims()[x_num_col_dims]; - const T* scale_one = scales[0]->data(); - const T* scale_two = scales[1]->data(); - int block = 1024; - int grid = iter_size * channel; - DequantizeTwoScale<<>>( - in_data, scale_one, scale_two, max_range, num, iter_size, channel, - out_data); - } - } -}; - -template struct DequantizeFunctor; -template struct DequantizeFunctor; -template struct ChannelDequantizeFunctor; -template struct ChannelDequantizeFunctor; - -} // namespace operators -} // namespace paddle - namespace ops = paddle::operators; using CUDA = paddle::platform::CUDADeviceContext; REGISTER_OP_CUDA_KERNEL(fake_dequantize_max_abs, diff --git a/paddle/fluid/operators/fake_dequantize_op.cu.h b/paddle/fluid/operators/fake_dequantize_op.cu.h new file mode 100644 index 0000000000000..9859dd4607c15 --- /dev/null +++ b/paddle/fluid/operators/fake_dequantize_op.cu.h @@ -0,0 +1,151 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#ifndef PADDLE_FLUID_OPERATORS_FAKE_DEQUANTIZE_OP_CU_H_ +#define PADDLE_FLUID_OPERATORS_FAKE_DEQUANTIZE_OP_CU_H_ +#endif // PADDLE_FLUID_OPERATORS_FAKE_DEQUANTIZE_OP_CU_H_ + +#include "paddle/fluid/operators/fake_dequantize_op.h" + +namespace paddle { +namespace operators { + +template +__global__ void KeDequantize(const T* in, const T* scale, T max_range, + int64_t num, T* out) { + int64_t idx = threadIdx.x + blockIdx.x * blockDim.x; + for (int64_t i = idx; i < num; i += blockDim.x * gridDim.x) { + out[i] = in[i] * scale[0] / max_range; + } +} + +template +struct DequantizeFunctor { + void operator()(const platform::CUDADeviceContext& dev_ctx, + const framework::Tensor* in, const framework::Tensor* scale, + T max_range, framework::Tensor* out) { + const T* in_data = in->data(); + const T* scale_factor = scale->data(); + T* out_data = out->mutable_data(dev_ctx.GetPlace()); + + int64_t num = in->numel(); + int64_t block_size = std::min( + num, static_cast(dev_ctx.GetMaxThreadsPerBlock() / 4)); + int64_t max_threads = + dev_ctx.GetMaxPhysicalThreadCount(); // SM * block_per_SM + const int64_t max_blocks = + std::max(((max_threads - 1) / block_size + 1), static_cast(1)); + const int64_t grid_size = + std::min(max_blocks, (num + block_size - 1) / block_size); + KeDequantize<<>>( + in_data, scale_factor, max_range, num, out_data); + } +}; + +template +__global__ void DequantizeOneScaleQuantAxis0(const T* in, const T* scale, + T max_range, int num, int channel, + T* out) { + int tid = threadIdx.x; + int channel_size = num / channel; + const T* in_c = in + blockIdx.x * channel_size; + T* out_c = out + blockIdx.x * channel_size; + for (int i = tid; i < channel_size; i += blockDim.x) { + out_c[i] = in_c[i] * scale[blockIdx.x] / max_range; + } +} + +template +__global__ void DequantizeOneScaleQuantAxisN(const T* in, const T* scale, + const T max_range, + const int64_t num, 
+ const int n_scales, + const int quant_stride, T* out) { + int64_t idx = blockDim.x * blockIdx.x + threadIdx.x; + for (int64_t i = idx; i < num; i += blockDim.x * gridDim.x) { + T s = scale[(i / quant_stride) % n_scales]; + out[i] = in[i] * s / max_range; + } +} + +template +__global__ void DequantizeTwoScale(const T* in, const T* scale_one, + const T* scale_two, T max_range, int num, + int iter_size, int channel, T* out) { + int tid = threadIdx.x; + int channel_size = num / (iter_size * channel); + int scale_index = blockIdx.x % channel; + const T* in_c = in + blockIdx.x * channel_size; + T* out_c = out + blockIdx.x * channel_size; + for (int i = tid; i < channel_size; i += blockDim.x) { + out_c[i] = in_c[i] * scale_one[scale_index] * scale_two[0] / max_range; + } +} + +template +struct ChannelDequantizeFunctor { + void operator()(const platform::CUDADeviceContext& dev_ctx, + const framework::Tensor* in, const framework::Tensor** scales, + const int scale_num, T max_range, const int quant_axis, + const int x_num_col_dims, framework::Tensor* out) { + auto in_dims = in->dims(); + const T* in_data = in->data(); + T* out_data = out->mutable_data(dev_ctx.GetPlace()); + if (scale_num == 1) { + int64_t num = in->numel(); + const T* scale_factor = scales[0]->data(); + int64_t block_size = std::min( + num, static_cast(dev_ctx.GetMaxThreadsPerBlock() / 4)); + int64_t max_threads = + dev_ctx.GetMaxPhysicalThreadCount(); // SM * block_per_SM + const int64_t max_blocks = std::max(((max_threads - 1) / block_size + 1), + static_cast(1)); + const int64_t grid_size = + std::min(max_blocks, (num + block_size - 1) / block_size); + + int quant_stride = 1; + for (int i = quant_axis + 1; i < in_dims.size(); i++) { + quant_stride *= in_dims[i]; + } + + DequantizeOneScaleQuantAxisN< + T><<>>( + in_data, scale_factor, max_range, num, in_dims[quant_axis], + quant_stride, out_data); + } else if (scale_num == 2) { + // Not need to consider quant_axis + int num = in->numel(); + int iter_size = 1; + for (int i = 0; i < x_num_col_dims; i++) { + iter_size *= in->dims()[i]; + } + int channel = in->dims()[x_num_col_dims]; + const T* scale_one = scales[0]->data(); + const T* scale_two = scales[1]->data(); + int block = 1024; + int grid = iter_size * channel; + DequantizeTwoScale<<>>( + in_data, scale_one, scale_two, max_range, num, iter_size, channel, + out_data); + } + } +}; + +template struct DequantizeFunctor; +template struct DequantizeFunctor; +template struct ChannelDequantizeFunctor; +template struct ChannelDequantizeFunctor; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/fake_quantize_op.cc b/paddle/fluid/operators/fake_quantize_op.cc index 4544386718813..ac72f23d46ea8 100644 --- a/paddle/fluid/operators/fake_quantize_op.cc +++ b/paddle/fluid/operators/fake_quantize_op.cc @@ -17,8 +17,8 @@ limitations under the License. 
*/ #include #include "paddle/fluid/framework/eigen.h" #include "paddle/fluid/framework/op_version_registry.h" -#include "paddle/fluid/operators/clip_op.h" #include "paddle/fluid/platform/transform.h" +#include "paddle/phi/kernels/impl/clip_kernel_impl.h" namespace paddle { namespace operators { @@ -91,7 +91,7 @@ struct ClipAndFakeQuantFunctor { T inv_s = inverse(s); platform::Transform trans; trans(ctx, in.data(), in.data() + in.numel(), - out->mutable_data(ctx.GetPlace()), ClipFunctor(-s, s)); + out->mutable_data(ctx.GetPlace()), phi::ClipFunctor(-s, s)); auto out_e = framework::EigenVector::Flatten(*out); out_e.device(*ctx.eigen_device()) = (bin_cnt * inv_s * out_e).round(); } @@ -109,7 +109,7 @@ struct ClipAndFakeQuantDequantFunctor { platform::Transform trans; trans(ctx, in.data(), in.data() + in.numel(), - out->mutable_data(ctx.GetPlace()), ClipFunctor(-s, s)); + out->mutable_data(ctx.GetPlace()), phi::ClipFunctor(-s, s)); auto out_e = framework::EigenVector::Flatten(*out); out_e.device(*ctx.eigen_device()) = (bin_cnt * inv_s * out_e).round() * s / static_cast(bin_cnt); @@ -144,7 +144,7 @@ struct ChannelClipAndFakeQuantFunctor { auto* start = in_data + i * channel_size; auto* end = in_data + (i + 1) * channel_size; trans(ctx, start, end, out_data + i * channel_size, - ClipFunctor(-s, s)); + phi::ClipFunctor(-s, s)); } for (int64_t i = 0; i < channel; i++) { T s = scale_data[i]; @@ -163,7 +163,7 @@ struct ChannelClipAndFakeQuantFunctor { auto* start = in_data + i * step_i + j * step_j; auto* end = in_data + i * step_i + (j + 1) * step_j; auto* cur_out_data = out_data + i * step_i + j * step_j; - trans(ctx, start, end, cur_out_data, ClipFunctor(-s, s)); + trans(ctx, start, end, cur_out_data, phi::ClipFunctor(-s, s)); for (int k = 0; k < step_j; k++) { cur_out_data[k] = std::round(bin_cnt * inv_s * cur_out_data[k]); } @@ -200,7 +200,7 @@ struct ChannelClipFakeQuantDequantFunctor { auto* start = in_data + i * channel_size; auto* end = in_data + (i + 1) * channel_size; trans(ctx, start, end, out_data + i * channel_size, - ClipFunctor(-s, s)); + phi::ClipFunctor(-s, s)); } for (int i = 0; i < channel; i++) { T s = scale_data[i]; @@ -220,7 +220,7 @@ struct ChannelClipFakeQuantDequantFunctor { auto* start = in_data + i * step_i + j * step_j; auto* end = in_data + i * step_i + (j + 1) * step_j; auto* cur_out_data = out_data + i * step_i + j * step_j; - trans(ctx, start, end, cur_out_data, ClipFunctor(-s, s)); + trans(ctx, start, end, cur_out_data, phi::ClipFunctor(-s, s)); for (int k = 0; k < step_j; k++) { cur_out_data[k] = std::round(bin_cnt * inv_s * cur_out_data[k]) * s / static_cast(bin_cnt); diff --git a/paddle/fluid/operators/fake_quantize_op.cu b/paddle/fluid/operators/fake_quantize_op.cu index 01384a6cafef9..5416ae11c2b56 100644 --- a/paddle/fluid/operators/fake_quantize_op.cu +++ b/paddle/fluid/operators/fake_quantize_op.cu @@ -12,531 +12,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include -#include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/operators/fake_quantize_op.cu.h" #include "paddle/fluid/operators/fake_quantize_op.h" -#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" - -namespace paddle { -namespace operators { - -template -__global__ void FindAbsMaxKernel(const T* in, const int n, T* out) { - int bid = threadIdx.x + blockIdx.x * blockDim.x; - int tid = threadIdx.x; - - extern __shared__ char* shared_max_data_tmp[]; - auto shared_max_data = reinterpret_cast(shared_max_data_tmp); - if (gridDim.x > 1) { - T local_max_data = T(0); - for (int i = bid; i < n; i += blockDim.x * gridDim.x) { - T tmp = abs(in[i]); - if (tmp > local_max_data) { - local_max_data = tmp; - } - } - shared_max_data[tid] = local_max_data; - } else { - if (bid < n) { - shared_max_data[tid] = abs(in[bid]); - } else { - shared_max_data[tid] = T(0); - } - } - __syncthreads(); - - for (int i = blockDim.x / 2; i > 0; i >>= 1) { - if (tid < i && (shared_max_data[tid] < shared_max_data[tid + i])) { - shared_max_data[tid] = shared_max_data[tid + i]; - } - __syncthreads(); - } - if (tid == 0) { - out[blockIdx.x] = shared_max_data[0]; - } -} - -template -struct FindAbsMaxFunctor { - void operator()(const platform::CUDADeviceContext& ctx, const T* in, - const int num, T* out) { - int block = 1024; - int grid = (block - 1 + num) / block; - grid = (grid > block) ? block : grid; - - framework::Tensor max; - T* max_data = max.mutable_data(phi::make_ddim({grid}), ctx.GetPlace()); - FindAbsMaxKernel<<>>( - in, num, max_data); - FindAbsMaxKernel<<<1, block, 1024 * sizeof(T), ctx.stream()>>>( - max_data, grid, out); - } -}; - -template struct FindAbsMaxFunctor; -template struct FindAbsMaxFunctor; - -template -__global__ void FindChannelAbsMaxKernelQuantAxis0(const T* in, const int n, - const int c, T* out) { - int tid = threadIdx.x; - int channel_size = n / c; - const T* in_c = in + blockIdx.x * channel_size; - extern __shared__ T shared_max_data[]; - T local_max_data = T(0); - for (int i = tid; i < channel_size; i += blockDim.x) { - T tmp = fabs(in_c[i]); - if (tmp > local_max_data) { - local_max_data = tmp; - } - } - shared_max_data[tid] = local_max_data; - __syncthreads(); - for (int i = blockDim.x / 2; i > 0; i >>= 1) { - if (tid < i && (shared_max_data[tid] < shared_max_data[tid + i])) { - shared_max_data[tid] = shared_max_data[tid + i]; - } - __syncthreads(); - } - if (tid == 0) { - out[blockIdx.x] = shared_max_data[0]; - } -} - -template -__global__ void FindChannelAbsMaxKernelQuantAxis1(const T* in, const int n, - const int cin, const int cout, - T* out) { - extern __shared__ T shared_max_data[]; - int cout_wh_size = n / cin; - int wh_size = n / (cin * cout); - - int tid = threadIdx.x; - int bid = blockIdx.x; - const T* in_current = in + tid * cout_wh_size + bid * wh_size; - T local_max_data = T(0); - for (int i = 0; i < wh_size; i++) { - T tmp = fabs(in_current[i]); - if (tmp > local_max_data) { - local_max_data = tmp; - } - } - shared_max_data[tid] = local_max_data; - __syncthreads(); - - int len = blockDim.x; - for (int i = (len + 1) / 2; i > 0; len = i, i = (i + 1) / 2) { - if (tid < i && tid + i < len && - shared_max_data[tid] < shared_max_data[tid + i]) { - shared_max_data[tid] = shared_max_data[tid + i]; - } - if (i == 1) { - i = 0; // break the loop - } - __syncthreads(); - } - if (tid == 0 && shared_max_data[0] > out[bid]) { - out[bid] = shared_max_data[0]; - } -} - -template -struct FindChannelAbsMaxFunctor { - void operator()(const platform::CUDADeviceContext& 
ctx, - const framework::Tensor& in_tensor, const int quant_axis, - T* out_abs_max) { - PADDLE_ENFORCE_EQ( - quant_axis == 0 || quant_axis == 1, true, - platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but " - "the received is %d", - quant_axis)); - const int num = in_tensor.numel(); - auto in_dims = in_tensor.dims(); - const T* in_data = in_tensor.data(); - if (quant_axis == 0) { - int cout = in_dims[0]; - int grid = cout; - int block = 1024; - FindChannelAbsMaxKernelQuantAxis0< - T><<>>( - in_data, num, cout, out_abs_max); - } else if (quant_axis == 1) { - int cin = in_dims[0]; - int cout = in_dims[1]; - int grid = cout; - int max_threads = 1024; - -#ifdef PADDLE_WITH_HIP - hipMemset(out_abs_max, 0, sizeof(T) * cout); -#else - cudaMemset(out_abs_max, 0, sizeof(T) * cout); -#endif - - for (int i = 0; i < cin / max_threads; i++) { - int block = max_threads; - FindChannelAbsMaxKernelQuantAxis1< - T><<>>( - in_data, num, cin, cout, out_abs_max); - in_data += num / cin; - } - - int block = cin % max_threads; - if (block > 0) { - FindChannelAbsMaxKernelQuantAxis1< - T><<>>( - in_data, num, in_dims[0], in_dims[1], out_abs_max); - } - } - } -}; - -template struct FindChannelAbsMaxFunctor; - -template -__global__ void ClipAndQuantKernel(const T* in, const T* scale, - const int bin_cnt, const int n, T* out) { - int bid = threadIdx.x + blockIdx.x * blockDim.x; - int tid = threadIdx.x; - - T s = scale[0]; - T inv_s = inverse(s); - for (int i = bid; i < n; i += blockDim.x * gridDim.x) { - T x = in[i]; - T v = x > s ? s : x; - v = v < -s ? -s : v; - v = bin_cnt * inv_s * v; - out[i] = round(v); - } -} - -template -__global__ void ClipAndQuantDequantKernel(const T* in, const T* scale, - const int bin_cnt, const int n, - T* out) { - int bid = threadIdx.x + blockIdx.x * blockDim.x; - int tid = threadIdx.x; - - T s = scale[0]; - T inv_s = inverse(s); - T bin_cnt_t = static_cast(bin_cnt); - - for (int i = bid; i < n; i += blockDim.x * gridDim.x) { - T x = in[i]; - x = x > s ? s : x; - x = x < -s ? 
-s : x; - x = bin_cnt_t * inv_s * x; - x = static_cast(round(static_cast(x))); - out[i] = (x * s) / bin_cnt_t; - } -} - -template -struct ClipAndFakeQuantFunctor { - void operator()(const platform::CUDADeviceContext& ctx, - const framework::Tensor& in, const framework::Tensor& scale, - const int bin_cnt, framework::Tensor* out) { - int num = in.numel(); - int block = 1024; - int grid = (block - 1 + num) / block; - - const T* in_data = in.data(); - const T* scale_data = scale.data(); - T* out_data = out->mutable_data(ctx.GetPlace()); - - ClipAndQuantKernel<<>>( - in_data, scale_data, bin_cnt, num, out_data); - } -}; - -template struct ClipAndFakeQuantFunctor; - -template -struct ClipAndFakeQuantDequantFunctor { - void operator()(const platform::CUDADeviceContext& ctx, - const framework::Tensor& in, const framework::Tensor& scale, - const int bin_cnt, framework::Tensor* out) { - int num = in.numel(); - int block = 1024; - int grid = (block - 1 + num) / block; - - const T* in_data = in.data(); - const T* scale_data = scale.data(); - T* out_data = out->mutable_data(ctx.GetPlace()); - - ClipAndQuantDequantKernel<<>>( - in_data, scale_data, bin_cnt, num, out_data); - } -}; - -// ChannelClipAndQuantKernel for quant_axis is 0 -template -__global__ void ChannelClipAndQuantKernelQuantAxis0(const T* in, const T* scale, - const int bin_cnt, - const int64_t n, - const int c, T* out) { - int tid = threadIdx.x; - - int64_t channel_size = n / c; - const T* in_c = in + blockIdx.x * channel_size; - T* out_c = out + blockIdx.x * channel_size; - - T s = scale[blockIdx.x]; - T inv_s = inverse(s); - - for (int64_t i = tid; i < channel_size; i += blockDim.x) { - T x = in_c[i]; - T v = x > s ? s : x; - v = v < -s ? -s : v; - v = bin_cnt * inv_s * v; - out_c[i] = round(v); - } -} - -// ChannelClipAndQuantKernel for quant_axis is N -template -__global__ void ChannelClipAndQuantKernelQuantAxisN( - const T* in, const T* scale, const int bin_cnt, const int64_t n, - const int nScale, const int quant_stride, T* out) { - int64_t idx = blockDim.x * blockIdx.x + threadIdx.x; - for (int64_t i = idx; i < n; i += blockDim.x * gridDim.x) { - T s = scale[(i / quant_stride) % nScale]; - T inv_s = 1.0 / s; - T x = in[i]; - T v = x > s ? s : x; - v = v < -s ? 
-s : v; - v = bin_cnt * inv_s * v; - out[i] = round(v); - } -} - -template -struct ChannelClipAndFakeQuantFunctor { - void operator()(const platform::CUDADeviceContext& ctx, - const framework::Tensor& in, const framework::Tensor& scale, - const int bin_cnt, const int quant_axis, - framework::Tensor* out) { - PADDLE_ENFORCE_EQ( - quant_axis == 0 || quant_axis == 1, true, - platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but " - "the received is %d", - quant_axis)); - - int64_t num = in.numel(); - auto in_dims = in.dims(); - const T* in_data = in.data(); - const T* scale_data = scale.data(); - T* out_data = out->mutable_data(ctx.GetPlace()); - - if (quant_axis == 0) { - int grid = in_dims[0]; - int block = 1024; - ChannelClipAndQuantKernelQuantAxis0<<>>( - in_data, scale_data, bin_cnt, num, in_dims[0], out_data); - } else { - int quant_stride = 1; - for (int i = quant_axis + 1; i < in_dims.size(); i++) { - quant_stride *= in_dims[i]; - } - int64_t block_size = - std::min(num, static_cast(ctx.GetMaxThreadsPerBlock() / 4)); - int64_t max_threads = - ctx.GetMaxPhysicalThreadCount(); // SM * block_per_SM - const int64_t max_blocks = std::max(((max_threads - 1) / block_size + 1), - static_cast(1)); - - const int64_t grid_size = - std::min(max_blocks, (num + block_size - 1) / block_size); - - ChannelClipAndQuantKernelQuantAxisN<<>>( - in_data, scale_data, bin_cnt, num, in_dims[quant_axis], quant_stride, - out_data); - } - } -}; - -template struct ChannelClipAndFakeQuantFunctor; - -template -__global__ void FindRangeAbsMaxAndFillArray(const T* cur_scale, - const T* last_scale, - const int64_t* iter, - const int window_size, T* scale_arr, - T* out_scale, int* need_find_max, - int* out_size) { - int it = iter[0]; - int idx = it % window_size; - T removed = scale_arr[idx]; - T cur = cur_scale[0]; - scale_arr[idx] = cur; - T max = last_scale[0]; - out_scale[0] = max < cur ? cur : max; - if (fabs(removed - max) < 1e-6) { - need_find_max[0] = 1; - out_size[0] = it > window_size ? 
window_size : it; - } else { - need_find_max[0] = 0; - } -} - -template -struct FindRangeAbsMaxFunctor { - void operator()(const platform::CUDADeviceContext& ctx, - const framework::Tensor& cur_scale, - const framework::Tensor& last_scale, - const framework::Tensor& iter, const int window_size, - framework::Tensor* scales_arr, framework::Tensor* out_scale) { - const auto gpu_place = ctx.GetPlace(); - - T* scale_arr = scales_arr->mutable_data(gpu_place); - T* out_scale_data = out_scale->mutable_data(gpu_place); - - framework::Tensor need_find_max, out_size; - int* find_max = need_find_max.mutable_data({1}, gpu_place); - int* out_size_data = out_size.mutable_data({1}, gpu_place); - - FindRangeAbsMaxAndFillArray<<<1, 1, 0, ctx.stream()>>>( - cur_scale.data(), last_scale.data(), iter.data(), - window_size, scale_arr, out_scale_data, find_max, out_size_data); - - int g_find_max; - memory::Copy(platform::CPUPlace(), &g_find_max, gpu_place, find_max, - sizeof(int), ctx.stream()); - ctx.Wait(); - if (g_find_max) { - int len; - memory::Copy(platform::CPUPlace(), &len, gpu_place, out_size_data, - sizeof(int), ctx.stream()); - ctx.Wait(); - FindAbsMaxFunctor()(ctx, scale_arr, len, - out_scale_data); - } - } -}; - -template -__global__ void FindMovingAverageAbsMaxKernel(const T* in_state, - const T* in_accum, - const T* cur_scale, const T rate, - T* out_state, T* out_accum, - T* out_scale) { - T state = rate * (*in_state) + T(1.0f); - T accum = rate * (*in_accum) + (*cur_scale); - *out_state = state; - *out_accum = accum; - *out_scale = accum / state; -} - -template struct FindRangeAbsMaxFunctor; - -template -struct FindMovingAverageAbsMaxFunctor { - void operator()(const platform::CUDADeviceContext& ctx, - const framework::Tensor& in_accum, - const framework::Tensor& in_state, const T* cur_scale, - const float rate, framework::Tensor* out_state, - framework::Tensor* out_accum, framework::Tensor* out_scale) { - const auto gpu_place = ctx.GetPlace(); - - T rate_t = static_cast(rate); - T* out_state_data = out_state->mutable_data(gpu_place); - T* out_accum_data = out_accum->mutable_data(gpu_place); - T* out_scale_data = out_scale->mutable_data(gpu_place); - - FindMovingAverageAbsMaxKernel<<<1, 1, 0, ctx.stream()>>>( - in_state.data(), in_accum.data(), cur_scale, rate_t, - out_state_data, out_accum_data, out_scale_data); - } -}; - -// ChannelClipAndQuantDequantKernel for quant_axis is 0 -template -__global__ void ChannelClipAndQuantDequantKernelQuantAxis0( - const T* in, const T* scale, const int bin_cnt, const int n, const int c, - T* out) { - int tid = threadIdx.x; - - int channel_size = n / c; - const T* in_c = in + blockIdx.x * channel_size; - T* out_c = out + blockIdx.x * channel_size; - - T s = scale[blockIdx.x]; - T inv_s = inverse(s); - - for (int i = tid; i < channel_size; i += blockDim.x) { - T x = in_c[i]; - T v = x > s ? s : x; - v = v < -s ? -s : v; - v = bin_cnt * inv_s * v; - out_c[i] = round(v) * s / bin_cnt; - } -} - -// ChannelClipAndQuantDequantKernel for quant_axis is 1 -template -__global__ void ChannelClipAndQuantDequantKernelQuantAxis1( - const T* in, const T* scale, const int bin_cnt, const int n, const int cin, - const int cout, T* out) { - T s = scale[blockIdx.x % cout]; - T inv_s = inverse(s); - - int wh_size = n / (cin * cout); - const T* in_c = in + blockIdx.x * wh_size; - T* out_c = out + blockIdx.x * wh_size; - - for (int i = threadIdx.x; i < wh_size; i += blockDim.x) { - T x = in_c[i]; - T v = x > s ? s : x; - v = v < -s ? 
-s : v; - v = bin_cnt * inv_s * v; - out_c[i] = round(v) * s / bin_cnt; - } -} - -template -struct ChannelClipFakeQuantDequantFunctor { - void operator()(const platform::CUDADeviceContext& ctx, - const framework::Tensor& in, const framework::Tensor& scale, - const int bin_cnt, const int quant_axis, - framework::Tensor* out) { - // At present, channelwise quantization supports conv2d, depthwise_conv2d - // conv2d_transpose and mul - PADDLE_ENFORCE_EQ( - quant_axis == 0 || quant_axis == 1, true, - platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but " - "the received is %d", - quant_axis)); - - int num = in.numel(); - auto in_dims = in.dims(); - - const T* in_data = in.data(); - const T* scale_data = scale.data(); - T* out_data = out->mutable_data(ctx.GetPlace()); - - if (quant_axis == 0) { - int grid = in_dims[0]; - int block = 1024; - ChannelClipAndQuantDequantKernelQuantAxis0< - T><<>>(in_data, scale_data, bin_cnt, - num, in_dims[0], out_data); - } else if (quant_axis == 1) { - int grid = in_dims[0] * in_dims[1]; - int block = 1024; - - ChannelClipAndQuantDequantKernelQuantAxis1< - T><<>>( - in_data, scale_data, bin_cnt, num, in_dims[0], in_dims[1], out_data); - } - } -}; - -template struct ChannelClipFakeQuantDequantFunctor; - -} // namespace operators -} // namespace paddle namespace ops = paddle::operators; using CUDA = paddle::platform::CUDADeviceContext; diff --git a/paddle/fluid/operators/fake_quantize_op.cu.h b/paddle/fluid/operators/fake_quantize_op.cu.h new file mode 100644 index 0000000000000..d85d47f546131 --- /dev/null +++ b/paddle/fluid/operators/fake_quantize_op.cu.h @@ -0,0 +1,543 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#ifndef PADDLE_FLUID_OPERATORS_FAKE_QUANTIZE_OP_CU_H_ +#define PADDLE_FLUID_OPERATORS_FAKE_QUANTIZE_OP_CU_H_ +#endif // PADDLE_FLUID_OPERATORS_FAKE_QUANTIZE_OP_CU_H_ + +#include +#include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/operators/fake_quantize_op.h" +#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" + +namespace paddle { +namespace operators { + +template +__global__ void FindAbsMaxKernel(const T* in, const int n, T* out) { + int bid = threadIdx.x + blockIdx.x * blockDim.x; + int tid = threadIdx.x; + + extern __shared__ char* shared_max_data_tmp[]; + auto shared_max_data = reinterpret_cast(shared_max_data_tmp); + if (gridDim.x > 1) { + T local_max_data = T(0); + for (int i = bid; i < n; i += blockDim.x * gridDim.x) { + T tmp = abs(in[i]); + if (tmp > local_max_data) { + local_max_data = tmp; + } + } + shared_max_data[tid] = local_max_data; + } else { + if (bid < n) { + shared_max_data[tid] = abs(in[bid]); + } else { + shared_max_data[tid] = T(0); + } + } + __syncthreads(); + + for (int i = blockDim.x / 2; i > 0; i >>= 1) { + if (tid < i && (shared_max_data[tid] < shared_max_data[tid + i])) { + shared_max_data[tid] = shared_max_data[tid + i]; + } + __syncthreads(); + } + if (tid == 0) { + out[blockIdx.x] = shared_max_data[0]; + } +} + +template +struct FindAbsMaxFunctor { + void operator()(const platform::CUDADeviceContext& ctx, const T* in, + const int num, T* out) { + int block = 1024; + int grid = (block - 1 + num) / block; + grid = (grid > block) ? block : grid; + + framework::Tensor max; + T* max_data = max.mutable_data(phi::make_ddim({grid}), ctx.GetPlace()); + FindAbsMaxKernel<<>>( + in, num, max_data); + FindAbsMaxKernel<<<1, block, 1024 * sizeof(T), ctx.stream()>>>( + max_data, grid, out); + } +}; + +template struct FindAbsMaxFunctor; +template struct FindAbsMaxFunctor; + +template +__global__ void FindChannelAbsMaxKernelQuantAxis0(const T* in, const int n, + const int c, T* out) { + int tid = threadIdx.x; + int channel_size = n / c; + const T* in_c = in + blockIdx.x * channel_size; + extern __shared__ T shared_max_data[]; + T local_max_data = T(0); + for (int i = tid; i < channel_size; i += blockDim.x) { + T tmp = fabs(in_c[i]); + if (tmp > local_max_data) { + local_max_data = tmp; + } + } + shared_max_data[tid] = local_max_data; + __syncthreads(); + for (int i = blockDim.x / 2; i > 0; i >>= 1) { + if (tid < i && (shared_max_data[tid] < shared_max_data[tid + i])) { + shared_max_data[tid] = shared_max_data[tid + i]; + } + __syncthreads(); + } + if (tid == 0) { + out[blockIdx.x] = shared_max_data[0]; + } +} + +template +__global__ void FindChannelAbsMaxKernelQuantAxis1(const T* in, const int n, + const int cin, const int cout, + T* out) { + extern __shared__ T shared_max_data[]; + int cout_wh_size = n / cin; + int wh_size = n / (cin * cout); + + int tid = threadIdx.x; + int bid = blockIdx.x; + const T* in_current = in + tid * cout_wh_size + bid * wh_size; + T local_max_data = T(0); + for (int i = 0; i < wh_size; i++) { + T tmp = fabs(in_current[i]); + if (tmp > local_max_data) { + local_max_data = tmp; + } + } + shared_max_data[tid] = local_max_data; + __syncthreads(); + + int len = blockDim.x; + for (int i = (len + 1) / 2; i > 0; len = i, i = (i + 1) / 2) { + if (tid < i && tid + i < len && + shared_max_data[tid] < shared_max_data[tid + i]) { + shared_max_data[tid] = shared_max_data[tid + i]; + } + if (i == 1) { + i = 0; // break the loop + } + __syncthreads(); + } + if (tid == 0 && shared_max_data[0] > out[bid]) { + out[bid] = 
shared_max_data[0]; + } +} + +template +struct FindChannelAbsMaxFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& in_tensor, const int quant_axis, + T* out_abs_max) { + PADDLE_ENFORCE_EQ( + quant_axis == 0 || quant_axis == 1, true, + platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but " + "the received is %d", + quant_axis)); + const int num = in_tensor.numel(); + auto in_dims = in_tensor.dims(); + const T* in_data = in_tensor.data(); + if (quant_axis == 0) { + int cout = in_dims[0]; + int grid = cout; + int block = 1024; + FindChannelAbsMaxKernelQuantAxis0< + T><<>>( + in_data, num, cout, out_abs_max); + } else if (quant_axis == 1) { + int cin = in_dims[0]; + int cout = in_dims[1]; + int grid = cout; + int max_threads = 1024; + +#ifdef PADDLE_WITH_HIP + hipMemset(out_abs_max, 0, sizeof(T) * cout); +#else + cudaMemset(out_abs_max, 0, sizeof(T) * cout); +#endif // PADDLE_FLUID_OPERATORS_FAKE_QUANTIZE_OP_CU_H_ + + for (int i = 0; i < cin / max_threads; i++) { + int block = max_threads; + FindChannelAbsMaxKernelQuantAxis1< + T><<>>( + in_data, num, cin, cout, out_abs_max); + in_data += num / cin; + } + + int block = cin % max_threads; + if (block > 0) { + FindChannelAbsMaxKernelQuantAxis1< + T><<>>( + in_data, num, in_dims[0], in_dims[1], out_abs_max); + } + } + } +}; + +template struct FindChannelAbsMaxFunctor; + +template +__global__ void ClipAndQuantKernel(const T* in, const T* scale, + const int bin_cnt, const int n, T* out) { + int bid = threadIdx.x + blockIdx.x * blockDim.x; + int tid = threadIdx.x; + + T s = scale[0]; + T inv_s = inverse(s); + for (int i = bid; i < n; i += blockDim.x * gridDim.x) { + T x = in[i]; + T v = x > s ? s : x; + v = v < -s ? -s : v; + v = bin_cnt * inv_s * v; + out[i] = round(v); + } +} + +template +__global__ void ClipAndQuantDequantKernel(const T* in, const T* scale, + const int bin_cnt, const int n, + T* out) { + int bid = threadIdx.x + blockIdx.x * blockDim.x; + int tid = threadIdx.x; + + T s = scale[0]; + T inv_s = inverse(s); + T bin_cnt_t = static_cast(bin_cnt); + + for (int i = bid; i < n; i += blockDim.x * gridDim.x) { + T x = in[i]; + x = x > s ? s : x; + x = x < -s ? 
-s : x; + x = bin_cnt_t * inv_s * x; + x = static_cast(round(static_cast(x))); + out[i] = (x * s) / bin_cnt_t; + } +} + +template +struct ClipAndFakeQuantFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& in, const framework::Tensor& scale, + const int bin_cnt, framework::Tensor* out) { + int num = in.numel(); + int block = 1024; + int grid = (block - 1 + num) / block; + + const T* in_data = in.data(); + const T* scale_data = scale.data(); + T* out_data = out->mutable_data(ctx.GetPlace()); + + ClipAndQuantKernel<<>>( + in_data, scale_data, bin_cnt, num, out_data); + } +}; + +template struct ClipAndFakeQuantFunctor; + +template +struct ClipAndFakeQuantDequantFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& in, const framework::Tensor& scale, + const int bin_cnt, framework::Tensor* out) { + int num = in.numel(); + int block = 1024; + int grid = (block - 1 + num) / block; + + const T* in_data = in.data(); + const T* scale_data = scale.data(); + T* out_data = out->mutable_data(ctx.GetPlace()); + + ClipAndQuantDequantKernel<<>>( + in_data, scale_data, bin_cnt, num, out_data); + } +}; + +// ChannelClipAndQuantKernel for quant_axis is 0 +template +__global__ void ChannelClipAndQuantKernelQuantAxis0(const T* in, const T* scale, + const int bin_cnt, + const int64_t n, + const int c, T* out) { + int tid = threadIdx.x; + + int64_t channel_size = n / c; + const T* in_c = in + blockIdx.x * channel_size; + T* out_c = out + blockIdx.x * channel_size; + + T s = scale[blockIdx.x]; + T inv_s = inverse(s); + + for (int64_t i = tid; i < channel_size; i += blockDim.x) { + T x = in_c[i]; + T v = x > s ? s : x; + v = v < -s ? -s : v; + v = bin_cnt * inv_s * v; + out_c[i] = round(v); + } +} + +// ChannelClipAndQuantKernel for quant_axis is N +template +__global__ void ChannelClipAndQuantKernelQuantAxisN( + const T* in, const T* scale, const int bin_cnt, const int64_t n, + const int nScale, const int quant_stride, T* out) { + int64_t idx = blockDim.x * blockIdx.x + threadIdx.x; + for (int64_t i = idx; i < n; i += blockDim.x * gridDim.x) { + T s = scale[(i / quant_stride) % nScale]; + T inv_s = 1.0 / s; + T x = in[i]; + T v = x > s ? s : x; + v = v < -s ? 
-s : v; + v = bin_cnt * inv_s * v; + out[i] = round(v); + } +} + +template +struct ChannelClipAndFakeQuantFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& in, const framework::Tensor& scale, + const int bin_cnt, const int quant_axis, + framework::Tensor* out) { + PADDLE_ENFORCE_EQ( + quant_axis == 0 || quant_axis == 1, true, + platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but " + "the received is %d", + quant_axis)); + + int64_t num = in.numel(); + auto in_dims = in.dims(); + const T* in_data = in.data(); + const T* scale_data = scale.data(); + T* out_data = out->mutable_data(ctx.GetPlace()); + + if (quant_axis == 0) { + int grid = in_dims[0]; + int block = 1024; + ChannelClipAndQuantKernelQuantAxis0<<>>( + in_data, scale_data, bin_cnt, num, in_dims[0], out_data); + } else { + int quant_stride = 1; + for (int i = quant_axis + 1; i < in_dims.size(); i++) { + quant_stride *= in_dims[i]; + } + int64_t block_size = + std::min(num, static_cast(ctx.GetMaxThreadsPerBlock() / 4)); + int64_t max_threads = + ctx.GetMaxPhysicalThreadCount(); // SM * block_per_SM + const int64_t max_blocks = std::max(((max_threads - 1) / block_size + 1), + static_cast(1)); + + const int64_t grid_size = + std::min(max_blocks, (num + block_size - 1) / block_size); + + ChannelClipAndQuantKernelQuantAxisN<<>>( + in_data, scale_data, bin_cnt, num, in_dims[quant_axis], quant_stride, + out_data); + } + } +}; + +template struct ChannelClipAndFakeQuantFunctor; + +template +__global__ void FindRangeAbsMaxAndFillArray(const T* cur_scale, + const T* last_scale, + const int64_t* iter, + const int window_size, T* scale_arr, + T* out_scale, int* need_find_max, + int* out_size) { + int it = iter[0]; + int idx = it % window_size; + T removed = scale_arr[idx]; + T cur = cur_scale[0]; + scale_arr[idx] = cur; + T max = last_scale[0]; + out_scale[0] = max < cur ? cur : max; + if (fabs(removed - max) < 1e-6) { + need_find_max[0] = 1; + out_size[0] = it > window_size ? 
window_size : it; + } else { + need_find_max[0] = 0; + } +} + +template +struct FindRangeAbsMaxFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& cur_scale, + const framework::Tensor& last_scale, + const framework::Tensor& iter, const int window_size, + framework::Tensor* scales_arr, framework::Tensor* out_scale) { + const auto gpu_place = ctx.GetPlace(); + + T* scale_arr = scales_arr->mutable_data(gpu_place); + T* out_scale_data = out_scale->mutable_data(gpu_place); + + framework::Tensor need_find_max, out_size; + int* find_max = need_find_max.mutable_data({1}, gpu_place); + int* out_size_data = out_size.mutable_data({1}, gpu_place); + + FindRangeAbsMaxAndFillArray<<<1, 1, 0, ctx.stream()>>>( + cur_scale.data(), last_scale.data(), iter.data(), + window_size, scale_arr, out_scale_data, find_max, out_size_data); + + int g_find_max; + memory::Copy(platform::CPUPlace(), &g_find_max, gpu_place, find_max, + sizeof(int), ctx.stream()); + ctx.Wait(); + if (g_find_max) { + int len; + memory::Copy(platform::CPUPlace(), &len, gpu_place, out_size_data, + sizeof(int), ctx.stream()); + ctx.Wait(); + FindAbsMaxFunctor()(ctx, scale_arr, len, + out_scale_data); + } + } +}; + +template +__global__ void FindMovingAverageAbsMaxKernel(const T* in_state, + const T* in_accum, + const T* cur_scale, const T rate, + T* out_state, T* out_accum, + T* out_scale) { + T state = rate * (*in_state) + T(1.0f); + T accum = rate * (*in_accum) + (*cur_scale); + *out_state = state; + *out_accum = accum; + *out_scale = accum / state; +} + +template struct FindRangeAbsMaxFunctor; + +template +struct FindMovingAverageAbsMaxFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& in_accum, + const framework::Tensor& in_state, const T* cur_scale, + const float rate, framework::Tensor* out_state, + framework::Tensor* out_accum, framework::Tensor* out_scale) { + const auto gpu_place = ctx.GetPlace(); + + T rate_t = static_cast(rate); + T* out_state_data = out_state->mutable_data(gpu_place); + T* out_accum_data = out_accum->mutable_data(gpu_place); + T* out_scale_data = out_scale->mutable_data(gpu_place); + + FindMovingAverageAbsMaxKernel<<<1, 1, 0, ctx.stream()>>>( + in_state.data(), in_accum.data(), cur_scale, rate_t, + out_state_data, out_accum_data, out_scale_data); + } +}; + +// ChannelClipAndQuantDequantKernel for quant_axis is 0 +template +__global__ void ChannelClipAndQuantDequantKernelQuantAxis0( + const T* in, const T* scale, const int bin_cnt, const int n, const int c, + T* out) { + int tid = threadIdx.x; + + int channel_size = n / c; + const T* in_c = in + blockIdx.x * channel_size; + T* out_c = out + blockIdx.x * channel_size; + + T s = scale[blockIdx.x]; + T inv_s = inverse(s); + + for (int i = tid; i < channel_size; i += blockDim.x) { + T x = in_c[i]; + T v = x > s ? s : x; + v = v < -s ? -s : v; + v = bin_cnt * inv_s * v; + out_c[i] = round(v) * s / bin_cnt; + } +} + +// ChannelClipAndQuantDequantKernel for quant_axis is 1 +template +__global__ void ChannelClipAndQuantDequantKernelQuantAxis1( + const T* in, const T* scale, const int bin_cnt, const int n, const int cin, + const int cout, T* out) { + T s = scale[blockIdx.x % cout]; + T inv_s = inverse(s); + + int wh_size = n / (cin * cout); + const T* in_c = in + blockIdx.x * wh_size; + T* out_c = out + blockIdx.x * wh_size; + + for (int i = threadIdx.x; i < wh_size; i += blockDim.x) { + T x = in_c[i]; + T v = x > s ? s : x; + v = v < -s ? 
-s : v; + v = bin_cnt * inv_s * v; + out_c[i] = round(v) * s / bin_cnt; + } +} + +template +struct ChannelClipFakeQuantDequantFunctor { + void operator()(const platform::CUDADeviceContext& ctx, + const framework::Tensor& in, const framework::Tensor& scale, + const int bin_cnt, const int quant_axis, + framework::Tensor* out) { + // At present, channelwise quantization supports conv2d, depthwise_conv2d + // conv2d_transpose and mul + PADDLE_ENFORCE_EQ( + quant_axis == 0 || quant_axis == 1, true, + platform::errors::InvalidArgument("'quant_axis' should be 0 or 1, but " + "the received is %d", + quant_axis)); + + int num = in.numel(); + auto in_dims = in.dims(); + + const T* in_data = in.data(); + const T* scale_data = scale.data(); + T* out_data = out->mutable_data(ctx.GetPlace()); + + if (quant_axis == 0) { + int grid = in_dims[0]; + int block = 1024; + ChannelClipAndQuantDequantKernelQuantAxis0< + T><<>>(in_data, scale_data, bin_cnt, + num, in_dims[0], out_data); + } else if (quant_axis == 1) { + int grid = in_dims[0] * in_dims[1]; + int block = 1024; + + ChannelClipAndQuantDequantKernelQuantAxis1< + T><<>>( + in_data, scale_data, bin_cnt, num, in_dims[0], in_dims[1], out_data); + } + } +}; + +template struct ChannelClipFakeQuantDequantFunctor; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/feed_forward_test.cu b/paddle/fluid/operators/feed_forward_test.cu index 27a235765227f..e5ebdad1e4434 100644 --- a/paddle/fluid/operators/feed_forward_test.cu +++ b/paddle/fluid/operators/feed_forward_test.cu @@ -292,6 +292,10 @@ class TestFeedForward { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(place_) .get()); + ctx_->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); ctx_->PartialInitWithAllocator(); size_src_ = bsz_seq_ * dim_embed_; // src: [bs, seq_len, em_dim] diff --git a/paddle/fluid/operators/fill_constant_op_xpu.cc b/paddle/fluid/operators/fill_constant_op_xpu.cc index a70f9e2c3b337..ddc28986995fa 100644 --- a/paddle/fluid/operators/fill_constant_op_xpu.cc +++ b/paddle/fluid/operators/fill_constant_op_xpu.cc @@ -21,7 +21,6 @@ REGISTER_OP_XPU_KERNEL( ops::FillConstantKernel, ops::FillConstantKernel, ops::FillConstantKernel, ops::FillConstantKernel, ops::FillConstantKernel, - ops::FillConstantKernel, ops::FillConstantKernel>, ops::FillConstantKernel>); #endif diff --git a/paddle/fluid/operators/gaussian_random_op.cu b/paddle/fluid/operators/gaussian_random_op.cu index 00ce10bfe3bcc..552649279e911 100644 --- a/paddle/fluid/operators/gaussian_random_op.cu +++ b/paddle/fluid/operators/gaussian_random_op.cu @@ -11,21 +11,14 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include -#include #include -#include #include "paddle/fluid/framework/generator.h" #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/operator.h" #include "paddle/fluid/operators/amp/fp16_type_traits.h" #include "paddle/fluid/operators/fill_constant_op.h" - -#include "paddle/phi/kernels/funcs/distribution_helper.h" #include "paddle/phi/kernels/funcs/index_impl.cu.h" -DECLARE_bool(use_curand); - namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/graph_reindex_op.cc b/paddle/fluid/operators/graph_reindex_op.cc new file mode 100644 index 0000000000000..593de659c7608 --- /dev/null +++ b/paddle/fluid/operators/graph_reindex_op.cc @@ -0,0 +1,77 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + +namespace paddle { +namespace operators { + +class GraphReindexOP : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + OperatorWithKernel::IndicateVarDataType(ctx, "X"), + ctx.device_context()); + } +}; + +class GraphReindexOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "The destination nodes of the input graph."); + AddInput("Neighbors", "The neighbor nodes of the destination nodes `X`."); + AddInput("Count", "The number of neighbor nodes of each destination node."); + // Note(daisiming): If using buffer hashtable, we must ensure the number of + // nodes of the input graph should be no larger than maximum(int32). + AddInput("HashTable_Value", + "One of the buffer tensor of hashtable for reindex") + .AsDispensable(); + AddInput("HashTable_Index", + "One of the buffer tensor of hashtable for reindex") + .AsDispensable(); + AddAttr("flag_buffer_hashtable", + "Define whether using the buffer hashtable.") + .SetDefault(false); + AddOutput("Reindex_Src", + "The source node index of graph edges after reindex."); + AddOutput("Reindex_Dst", + "The destination node index of graph edges after reindex."); + AddOutput("Out_Nodes", "The original index of graph nodes before reindex"); + + AddComment(R"DOC( +Graph Reindex operator. 
+)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +DECLARE_INFER_SHAPE_FUNCTOR(graph_reindex, GraphReindexInferShapeFunctor, + PD_INFER_META(phi::GraphReindexInferMeta)); + +REGISTER_OPERATOR( + graph_reindex, ops::GraphReindexOP, ops::GraphReindexOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker, + GraphReindexInferShapeFunctor); diff --git a/paddle/fluid/operators/graph_sample_neighbors_op.cc b/paddle/fluid/operators/graph_sample_neighbors_op.cc new file mode 100644 index 0000000000000..5ac9e2d4e4519 --- /dev/null +++ b/paddle/fluid/operators/graph_sample_neighbors_op.cc @@ -0,0 +1,82 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + +namespace paddle { +namespace operators { + +class GraphSampleNeighborsOP : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + OperatorWithKernel::IndicateVarDataType(ctx, "Row"), + ctx.device_context()); + } +}; + +class GraphSampleNeighborsOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("Row", + "One of the components of the CSC format of the input graph."); + AddInput("Col_Ptr", + "One of the components of the CSC format of the input graph."); + AddInput("X", "The input center nodes index tensor."); + AddInput("Eids", "The edge ids of the input graph.").AsDispensable(); + AddInput("Perm_Buffer", "Permutation buffer for fisher-yates sampling.") + .AsDispensable(); + AddOutput("Out", "The neighbors of input nodes X after sampling."); + AddOutput("Out_Count", + "The number of sample neighbors of input nodes respectively."); + AddOutput("Out_Eids", "The eids of the sample edges"); + AddAttr( + "sample_size", "The sample size of graph sample neighbors method. ", + "Set default value as -1, means return all neighbors of nodes.") + .SetDefault(-1); + AddAttr("return_eids", + "Whether to return the eid of the sample edges.") + .SetDefault(false); + AddAttr("flag_perm_buffer", + "Using the permutation for fisher-yates sampling in GPU" + "Set default value as false, means not using it.") + .SetDefault(false); + AddComment(R"DOC( +Graph Learning Sampling Neighbors operator, for graphsage sampling method. 
+)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; + +DECLARE_INFER_SHAPE_FUNCTOR(graph_sample_neighbors, + GraphSampleNeighborsInferShapeFunctor, + PD_INFER_META(phi::GraphSampleNeighborsInferMeta)); + +REGISTER_OPERATOR( + graph_sample_neighbors, ops::GraphSampleNeighborsOP, + ops::GraphSampleNeighborsOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker, + GraphSampleNeighborsInferShapeFunctor); diff --git a/paddle/fluid/operators/inplace_abn_op.cc b/paddle/fluid/operators/inplace_abn_op.cc index 77951ff394e74..344b104b5948c 100644 --- a/paddle/fluid/operators/inplace_abn_op.cc +++ b/paddle/fluid/operators/inplace_abn_op.cc @@ -312,8 +312,8 @@ class InplaceABNGradKernel : public framework::OpKernel { phi::BatchNormGradRawKernel( static_cast::TYPE&>(dev_ctx), - *d_y, *y, *scale, *bias, *saved_mean, *saved_variance, space_opt, - mean_opt, variance_opt, momentum, epsilon, data_layout, is_test, + *y, *scale, *bias, mean_opt, variance_opt, *saved_mean, *saved_variance, + space_opt, *d_y, momentum, epsilon, data_layout, is_test, use_global_stats, trainable_statistics, fuse_with_relu, true, d_x, scale_grad, bias_grad); } @@ -324,10 +324,12 @@ class InplaceABNGradKernel : public framework::OpKernel { namespace ops = paddle::operators; +DECLARE_INPLACE_OP_INFERER(InplaceAbnOpInplaceInferer, {"X", "Y"}); REGISTER_OPERATOR(inplace_abn, ops::InplaceABNOp, ops::InplaceABNOpMaker, ops::BatchNormOpInferVarType, ops::InplaceABNOpGradMaker, - ops::InplaceABNOpGradMaker) + ops::InplaceABNOpGradMaker, + InplaceAbnOpInplaceInferer) REGISTER_OPERATOR(inplace_abn_grad, ops::InplaceABNGradOp) REGISTER_OP_CPU_KERNEL( diff --git a/paddle/fluid/operators/inplace_abn_op.cu b/paddle/fluid/operators/inplace_abn_op.cu index db8f8c72d13f8..6c16210ced022 100644 --- a/paddle/fluid/operators/inplace_abn_op.cu +++ b/paddle/fluid/operators/inplace_abn_op.cu @@ -140,10 +140,10 @@ class InplaceABNGradKernel phi::BatchNormGradRawKernel( static_cast::TYPE&>(dev_ctx), - *d_y, *y, *scale, *bias, *saved_mean, *saved_variance, space_opt, - mean_opt, variance_opt, momentum, epsilon, data_layout, is_test, - use_global_stats, trainable_statistics, fuse_with_relu, true, d_x, - scale_grad, bias_grad); + *y, *scale, *bias, mean_opt, variance_opt, *saved_mean, + *saved_variance, space_opt, *d_y, momentum, epsilon, data_layout, + is_test, use_global_stats, trainable_statistics, fuse_with_relu, true, + d_x, scale_grad, bias_grad); } } }; diff --git a/paddle/fluid/operators/interpolate_v2_op.cc b/paddle/fluid/operators/interpolate_v2_op.cc index 4b5a18141d5aa..d0d7b7694fc3a 100644 --- a/paddle/fluid/operators/interpolate_v2_op.cc +++ b/paddle/fluid/operators/interpolate_v2_op.cc @@ -9,11 +9,15 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/interpolate_v2_op.h" #include #include #include + +#include "paddle/fluid/framework/infershape_utils.h" #include "paddle/fluid/framework/op_registry.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + #ifdef PADDLE_WITH_MKLDNN #include "paddle/fluid/platform/mkldnn_helper.h" #endif @@ -722,64 +726,51 @@ DECLARE_NO_NEED_BUFFER_VARS_INFERER(InterpolateV2GradNoNeedBufferVarsInferer, // not // compatible with interp_op, so a new one is added in paddle2.0 namespace ops = paddle::operators; + +DECLARE_INFER_SHAPE_FUNCTOR(bilinear_interp_v2, BilinearInterpInferShapeFunctor, + PD_INFER_META(phi::InterpolateInferMeta)); +DECLARE_INFER_SHAPE_FUNCTOR(nearest_interp_v2, NearestInterpInferShapeFunctor, + PD_INFER_META(phi::InterpolateInferMeta)); +DECLARE_INFER_SHAPE_FUNCTOR(trilinear_interp_v2, + TrilinearInterpInferShapeFunctor, + PD_INFER_META(phi::InterpolateInferMeta)); +DECLARE_INFER_SHAPE_FUNCTOR(bicubic_interp_v2, BicubicInterpInferShapeFunctor, + PD_INFER_META(phi::InterpolateInferMeta)); +DECLARE_INFER_SHAPE_FUNCTOR(linear_interp_v2, LinearInterpInferShapeFunctor, + PD_INFER_META(phi::InterpolateInferMeta)); + REGISTER_OPERATOR(bilinear_interp_v2, ops::InterpolateV2Op, ops::InterpolateV2OpMaker, ops::InterpolateV2GradMaker, - ops::InterpolateV2GradMaker); + ops::InterpolateV2GradMaker, + BilinearInterpInferShapeFunctor); REGISTER_OPERATOR(bilinear_interp_v2_grad, ops::InterpolateV2OpGrad, ops::InterpolateV2GradNoNeedBufferVarsInferer); REGISTER_OPERATOR(nearest_interp_v2, ops::InterpolateV2Op, ops::InterpolateV2OpMaker, ops::InterpolateV2GradMaker, - ops::InterpolateV2GradMaker); + ops::InterpolateV2GradMaker, + NearestInterpInferShapeFunctor); REGISTER_OPERATOR(nearest_interp_v2_grad, ops::InterpolateV2OpGrad, ops::InterpolateV2GradNoNeedBufferVarsInferer); REGISTER_OPERATOR(trilinear_interp_v2, ops::InterpolateV2Op, ops::InterpolateV2OpMaker, ops::InterpolateV2GradMaker, - ops::InterpolateV2GradMaker); + ops::InterpolateV2GradMaker, + TrilinearInterpInferShapeFunctor); REGISTER_OPERATOR(trilinear_interp_v2_grad, ops::InterpolateV2OpGrad, ops::InterpolateV2GradNoNeedBufferVarsInferer); REGISTER_OPERATOR(bicubic_interp_v2, ops::InterpolateV2Op, ops::InterpolateV2OpMaker, ops::InterpolateV2GradMaker, - ops::InterpolateV2GradMaker); + ops::InterpolateV2GradMaker, + BicubicInterpInferShapeFunctor); REGISTER_OPERATOR(bicubic_interp_v2_grad, ops::InterpolateV2OpGrad, ops::InterpolateV2GradNoNeedBufferVarsInferer); -REGISTER_OP_CPU_KERNEL(bilinear_interp_v2, ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel); -REGISTER_OP_CPU_KERNEL(bilinear_interp_v2_grad, - ops::InterpolateV2GradKernel, - ops::InterpolateV2GradKernel); -REGISTER_OP_CPU_KERNEL(nearest_interp_v2, ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel); -REGISTER_OP_CPU_KERNEL(nearest_interp_v2_grad, - ops::InterpolateV2GradKernel, - ops::InterpolateV2GradKernel); -REGISTER_OP_CPU_KERNEL(trilinear_interp_v2, ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel); -REGISTER_OP_CPU_KERNEL(trilinear_interp_v2_grad, - ops::InterpolateV2GradKernel, - ops::InterpolateV2GradKernel); REGISTER_OPERATOR(linear_interp_v2, ops::InterpolateV2Op, ops::InterpolateV2OpMaker, ops::InterpolateV2GradMaker, - ops::InterpolateV2GradMaker); + ops::InterpolateV2GradMaker, + LinearInterpInferShapeFunctor); 
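// Note: every interpolate_v2 forward op above is registered the same way --
// the hand-written CPU kernels are removed and shape inference is delegated
// to phi::InterpolateInferMeta through an infer-shape functor. A minimal
// sketch of the pattern, using a purely illustrative op name that is not
// part of this diff:
//
//   DECLARE_INFER_SHAPE_FUNCTOR(my_interp_v2, MyInterpInferShapeFunctor,
//                               PD_INFER_META(phi::InterpolateInferMeta));
//   REGISTER_OPERATOR(my_interp_v2, ops::InterpolateV2Op,
//                     ops::InterpolateV2OpMaker,
//                     MyInterpInferShapeFunctor);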
REGISTER_OPERATOR(linear_interp_v2_grad, ops::InterpolateV2OpGrad, ops::InterpolateV2GradNoNeedBufferVarsInferer); -REGISTER_OP_CPU_KERNEL(linear_interp_v2, ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel); -REGISTER_OP_CPU_KERNEL(linear_interp_v2_grad, - ops::InterpolateV2GradKernel, - ops::InterpolateV2GradKernel); -REGISTER_OP_CPU_KERNEL(bicubic_interp_v2, ops::InterpolateV2Kernel, - ops::InterpolateV2Kernel); -REGISTER_OP_CPU_KERNEL(bicubic_interp_v2_grad, - ops::InterpolateV2GradKernel, - ops::InterpolateV2GradKernel); diff --git a/paddle/fluid/operators/interpolate_v2_op.cu b/paddle/fluid/operators/interpolate_v2_op.cu deleted file mode 100644 index cd297c53f89a0..0000000000000 --- a/paddle/fluid/operators/interpolate_v2_op.cu +++ /dev/null @@ -1,2210 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#include -#include -#include "paddle/fluid/operators/interpolate_v2_op.h" -#include "paddle/fluid/platform/device/gpu/gpu_device_function.h" -#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h" -#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" -#include "paddle/fluid/platform/fast_divmod.h" -#include "paddle/phi/kernels/funcs/math_cuda_utils.h" - -namespace paddle { -namespace operators { - -using framework::Tensor; -using platform::FastDivMod; -using DataLayout = framework::DataLayout; - -static inline int GetLastPow2(int n) { - n |= (n >> 1); - n |= (n >> 2); - n |= (n >> 4); - n |= (n >> 8); - n |= (n >> 16); - return std::max(1, n - (n >> 1)); -} - -inline platform::GpuLaunchConfig GetGpuLaunchConfig3D( - const platform::CUDADeviceContext& context, int num_img, int height, - int width) { - const int kThreadsPerBlock = 256; - int max_threads_per_block = context.GetMaxThreadsPerBlock(); // 1024 - int max_threads = std::min(kThreadsPerBlock, max_threads_per_block); - - int block_x = std::min(GetLastPow2(width), max_threads); - int block_y = std::min(GetLastPow2(height), max_threads / block_x); - int block_z = std::min(num_img, max_threads / block_x / block_y); - - auto max_grid_dim = context.GetCUDAMaxGridDimSize(); - int grid_x = std::min(max_grid_dim[0], platform::DivUp(width, block_x)); - int grid_y = std::min(max_grid_dim[1], platform::DivUp(height, block_y)); - int grid_z = - std::min(max_grid_dim[2], platform::DivUp(num_img, block_z * 4)); - - const int capability = context.GetComputeCapability(); - platform::GpuLaunchConfig config; - config.compute_capability = capability; - config.thread_per_block = dim3(block_x, block_y, block_z); - config.block_per_grid = dim3(grid_x, grid_y, grid_z); - return config; -} - -template -__forceinline__ __device__ void PreCalculatorForLinearInterpInputIndex( - int* in_img_idx, int* x_id, T* lambda1, T* lambda2, T src_x, - const int in_img_x) { - src_x = (src_x > 0) ? src_x : 0.f; - *in_img_idx = static_cast(src_x); - *x_id = (*in_img_idx < in_img_x - 1) ? 
1 : 0; - *lambda1 = src_x - *in_img_idx; - *lambda2 = 1.f - *lambda1; -} - -struct FastDivModForInterpolate { - public: - FastDivMod channels_div; - FastDivMod output_w_div; - FastDivMod output_wc_div; - - explicit HOSTDEVICE FastDivModForInterpolate(const int channels, - const int output_w, - const int outout_wc) - : channels_div(FastDivMod(channels)), - output_w_div(FastDivMod(output_w)), - output_wc_div(FastDivMod(outout_wc)) {} -}; - -template -__global__ void KeNearestNeighborInterpNCHWFw( - const T* in, const size_t in_img_h, const size_t in_img_w, T* out, - const size_t out_img_h, const size_t out_img_w, const size_t nc, - const float ratio_h, const float ratio_w, const bool align_corners) { - int out_img_idx = threadIdx.x + blockIdx.x * blockDim.x; - int out_img_idy = threadIdx.y + blockIdx.y * blockDim.y; - int nc_id = threadIdx.z + blockIdx.z * blockDim.z; - int nc_stride = blockDim.z * gridDim.z; - - // nearest_sampling by multiple read in_addr and write to out_addr - int in_img_idx = (align_corners) - ? static_cast(ratio_w * out_img_idx + 0.5) - : static_cast(ratio_w * out_img_idx); - int in_img_idy = (align_corners) - ? static_cast(ratio_h * out_img_idy + 0.5) - : static_cast(ratio_h * out_img_idy); - - int in_index = (nc_id * in_img_h + in_img_idy) * in_img_w + in_img_idx; - int in_index_stride = nc_stride * in_img_h * in_img_w; - - int out_index = (nc_id * out_img_h + out_img_idy) * out_img_w + out_img_idx; - int out_index_stride = nc_stride * out_img_h * out_img_w; - - // prevent from multiple threads writing - if (out_img_idx < out_img_w && out_img_idy < out_img_h) { - while (nc_id < nc) { - out[out_index] = in[in_index]; - in_index += in_index_stride; - out_index += out_index_stride; - nc_id += nc_stride; - } - } -} - -template -__global__ void KeNearestNeighborInterpFw( - const T* in, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const float ratio_h, const float ratio_w, - const bool align_corners, FastDivModForInterpolate divmods) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - int in_img_size = in_img_h * in_img_w; - int out_img_size = out_img_h * out_img_w; - - for (; tid < nthreads; tid += stride) { - auto out_id_divmod = divmods.output_w_div.Divmod(tid); - int out_id_h = out_id_divmod.val[0]; - int out_id_w = out_id_divmod.val[1]; - - int channel_id = divmods.channels_div.Divmod(tid).val[1]; - auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); - int out_img_idy = outimg_id_divmod.val[0]; - int out_img_idx = - divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; - - int in_img_idy = (align_corners) - ? static_cast(ratio_h * out_img_idy + 0.5) - : static_cast(ratio_h * out_img_idy); - int in_img_idx = (align_corners) - ? 
static_cast(ratio_w * out_img_idx + 0.5) - : static_cast(ratio_w * out_img_idx); - - out[tid] = in[out_id_h * input_w + in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id]; - } -} - -template -__global__ void KeNearestNeighbor3DInterpFw( - const T* in, const size_t in_img_d, const size_t in_img_h, - const size_t in_img_w, const size_t input_h, const size_t input_w, T* out, - const size_t out_img_d, const size_t out_img_h, const size_t out_img_w, - const size_t output_h, const size_t output_w, const size_t num_channels, - const float ratio_d, const float ratio_h, const float ratio_w, - const bool align_corners, const DataLayout data_layout) { - int nthreads = output_h * output_w; // ncdhw - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idt, out_img_idy, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; - out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; - out_img_idx = tid % out_img_w; - } else { - out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); - out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / - (out_img_w * num_channels); - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - int in_img_idt = (align_corners) - ? static_cast(ratio_d * out_img_idt + 0.5) - : static_cast(ratio_d * out_img_idt); - - int in_img_idy = (align_corners) - ? static_cast(ratio_h * out_img_idy + 0.5) - : static_cast(ratio_h * out_img_idy); - int in_img_idx = (align_corners) - ? static_cast(ratio_w * out_img_idx + 0.5) - : static_cast(ratio_w * out_img_idx); - - if (data_layout == DataLayout::kNCHW) { - out[tid] = in[out_id_h * input_w + channel_id * in_img_size + - in_img_idt * in_img_h * in_img_w + in_img_idy * in_img_w + - in_img_idx]; - } else { - out[tid] = in[out_id_h * input_w + - in_img_idt * in_img_h * in_img_w * num_channels + - in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id]; - } - } -} - -template -__global__ void KeNearestNeighborInterpNCHWBw( - T* in, const size_t in_img_h, const size_t in_img_w, const T* out, - const size_t out_img_h, const size_t out_img_w, const size_t nc, - const float ratio_h, const float ratio_w, const bool align_corners) { - int out_img_idx = threadIdx.x + blockIdx.x * blockDim.x; - int out_img_idy = threadIdx.y + blockIdx.y * blockDim.y; - int nc_id = threadIdx.z + blockIdx.z * blockDim.z; - int nc_stride = blockDim.z * gridDim.z; - - // nearest_sampling by multiple read in_addr and write to out_addr - int in_img_idx = (align_corners) - ? static_cast(ratio_w * out_img_idx + 0.5) - : static_cast(ratio_w * out_img_idx); - int in_img_idy = (align_corners) - ? 
static_cast(ratio_h * out_img_idy + 0.5) - : static_cast(ratio_h * out_img_idy); - - int in_index = (nc_id * in_img_h + in_img_idy) * in_img_w + in_img_idx; - int in_index_stride = nc_stride * in_img_h * in_img_w; - - int out_index = (nc_id * out_img_h + out_img_idy) * out_img_w + out_img_idx; - int out_index_stride = nc_stride * out_img_h * out_img_w; - - // prevent from multiple threads writing - if (out_img_idx < out_img_w && out_img_idy < out_img_h) { - while (nc_id < nc) { - T* in_pos = &in[in_index]; - const T out_pos = out[out_index]; - platform::CudaAtomicAdd(in_pos, out_pos); - in_index += in_index_stride; - out_index += out_index_stride; - nc_id += nc_stride; - } - } -} - -template -__global__ void KeNearestNeighborInterpBw( - T* in, const size_t in_img_h, const size_t in_img_w, const size_t input_h, - const size_t input_w, const T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const float ratio_h, const float ratio_w, - const bool align_corners, FastDivModForInterpolate divmods) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - int in_img_size = in_img_h * in_img_w; - int out_img_size = out_img_h * out_img_w; - - for (; tid < nthreads; tid += stride) { - auto out_id_divmod = divmods.output_w_div.Divmod(tid); - int out_id_h = out_id_divmod.val[0]; - int out_id_w = out_id_divmod.val[1]; - - int channel_id = divmods.channels_div.Divmod(tid).val[1]; - auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); - int out_img_idy = outimg_id_divmod.val[0]; - int out_img_idx = - divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; - - int in_img_idy = (align_corners) - ? static_cast(ratio_h * out_img_idy + 0.5) - : static_cast(ratio_h * out_img_idy); - int in_img_idx = (align_corners) - ? 
static_cast(ratio_w * out_img_idx + 0.5) - : static_cast(ratio_w * out_img_idx); - - T* in_pos = &in[out_id_h * input_w + in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id]; - - const T out_pos = out[tid]; - platform::CudaAtomicAdd(in_pos, out_pos); - } -} - -template -__global__ void KeNearestNeighbor3DInterpBw( - T* in, const size_t in_img_d, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, const T* out, - const size_t out_img_d, const size_t out_img_h, const size_t out_img_w, - const size_t output_h, const size_t output_w, const size_t num_channels, - const float ratio_d, const float ratio_h, const float ratio_w, - const bool align_corners, const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idt, out_img_idy, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; - out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; - out_img_idx = tid % out_img_w; - } else { - out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); - out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / - (out_img_w * num_channels); - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - int in_img_idt = (align_corners) - ? static_cast(ratio_d * out_img_idt + 0.5) - : static_cast(ratio_d * out_img_idt); - int in_img_idy = (align_corners) - ? static_cast(ratio_h * out_img_idy + 0.5) - : static_cast(ratio_h * out_img_idy); - int in_img_idx = (align_corners) - ? static_cast(ratio_w * out_img_idx + 0.5) - : static_cast(ratio_w * out_img_idx); - - T* in_pos; - if (data_layout == DataLayout::kNCHW) { - in_pos = &in[out_id_h * input_w + channel_id * in_img_size + - in_img_idt * in_img_h * in_img_w + in_img_idy * in_img_w + - in_img_idx]; - } else { - in_pos = &in[out_id_h * input_w + - in_img_idt * in_img_h * in_img_w * num_channels + - in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id]; - } - const T out_pos = out[out_id_h * output_w + out_id_w]; - platform::CudaAtomicAdd(in_pos, out_pos); - } -} - -template -__global__ void KeLinearInterpFw(const T* in, const size_t in_img_w, - const size_t input_w, T* out, - const size_t out_img_w, const size_t output_h, - const size_t output_w, - const size_t num_channels, const float ratio_w, - const bool align_corners, const int align_mode, - const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - bool align_flag = (align_mode == 0 && !align_corners); - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idy, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idx = tid % out_img_w; - } else { - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - int in_img_idx = align_flag - ? 
static_cast(ratio_w * (out_img_idx + 0.5) - 0.5) - : static_cast(ratio_w * out_img_idx); - in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; // w - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; // w_id - - T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; - src_w = (src_w > 0) ? src_w : 0; - T w1lambda = - align_flag ? src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - if (data_layout == DataLayout::kNCHW) { - const T* in_pos = - &in[out_id_h * out_id_w + channel_id * in_img_size + in_img_idx]; - // linear interpolation - out[out_id_h * output_w + out_id_w] = - w2lambda * in_pos[0] + w1lambda * in_pos[w_id]; - - } else { - const T* in_pos = - &in[out_id_h * input_w + in_img_idx * num_channels + channel_id]; - // linear interpolation - out[out_id_h * output_w + out_id_w] = - w2lambda * in_pos[0] + w1lambda * in_pos[w_id * num_channels]; - } - } -} - -template -__global__ void KeLinearInterpBw(T* in, const size_t in_img_w, - const size_t input_w, const T* out, - const size_t out_img_w, const size_t output_h, - const size_t output_w, - const size_t num_channels, const T ratio_w, - const bool align_corners, const int align_mode, - const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - bool align_flag = (align_mode == 0 && !align_corners); - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idx = tid % out_img_w; - } else { - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - int in_img_idx = align_flag ? ratio_w * (out_img_idx + 0.5) - 0.5 - : ratio_w * out_img_idx; - in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; // w - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; // w_id - - T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; - src_w = (src_w > 0) ? src_w : 0; - T w1lambda = - align_flag ? 
src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - T* in_pos; - if (data_layout == DataLayout::kNCHW) { - in_pos = &in[out_id_h * input_w + channel_id * in_img_size + in_img_idx]; - } else { - in_pos = &in[out_id_h * input_w + in_img_idx * num_channels + channel_id]; - } - const T* out_pos = &out[out_id_w]; - - if (data_layout == DataLayout::kNCHW) { - platform::CudaAtomicAdd(&in_pos[0], w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos[w_id], w1lambda * out_pos[0]); - } else { - platform::CudaAtomicAdd(&in_pos[0], w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos[w_id * num_channels], - w1lambda * out_pos[0]); - } - } -} - -template -__global__ void KeBilinearInterpNCHWFw(const T* in, const size_t in_img_h, - const size_t in_img_w, T* out, - const size_t out_img_h, - const size_t out_img_w, const size_t nc, - const float ratio_h, const float ratio_w, - const T align_type_value) { - int out_img_idx = threadIdx.x + blockIdx.x * blockDim.x; - int out_img_idy = threadIdx.y + blockIdx.y * blockDim.y; - int nc_id = threadIdx.z + blockIdx.z * blockDim.z; - int nc_stride = blockDim.z * gridDim.z; - - int in_img_idx, in_img_idy, h_id, w_id; - T h1lambda, w1lambda, h2lambda, w2lambda; - T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; - T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; - - PreCalculatorForLinearInterpInputIndex(&in_img_idx, &w_id, &w1lambda, - &w2lambda, src_w, in_img_w); - PreCalculatorForLinearInterpInputIndex(&in_img_idy, &h_id, &h1lambda, - &h2lambda, src_h, in_img_h); - - int in_index = (nc_id * in_img_h + in_img_idy) * in_img_w + in_img_idx; - int in_index_stride = nc_stride * in_img_h * in_img_w; - - int out_index = (nc_id * out_img_h + out_img_idy) * out_img_w + out_img_idx; - int out_index_stride = nc_stride * out_img_h * out_img_w; - - // prevent from multiple threads writing - if (out_img_idx < out_img_w && out_img_idy < out_img_h) { - while (nc_id < nc) { - const T* in_pos = &in[in_index]; - out[out_index] = - h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[w_id]) + - h1lambda * (w2lambda * in_pos[h_id * in_img_w] + - w1lambda * in_pos[h_id * in_img_w + w_id]); - - in_index += in_index_stride; - out_index += out_index_stride; - nc_id += nc_stride; - } - } -} - -template -__global__ void KeBilinearInterpFw( - const T* in, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const float ratio_h, const float ratio_w, - const T align_type_value, FastDivModForInterpolate divmods) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - - for (; tid < nthreads; tid += stride) { - auto out_id_divmod = divmods.output_w_div.Divmod(tid); - int out_id_h = out_id_divmod.val[0]; - int out_id_w = out_id_divmod.val[1]; - - int channel_id = divmods.channels_div.Divmod(tid).val[1]; - auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); - int out_img_idy = outimg_id_divmod.val[0]; - int out_img_idx = - divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; - - int in_img_idx, in_img_idy, h_id, w_id; - T h1lambda, w1lambda, h2lambda, w2lambda; - T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; - T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; - - 
PreCalculatorForLinearInterpInputIndex(&in_img_idx, &w_id, &w1lambda, - &w2lambda, src_w, in_img_w); - PreCalculatorForLinearInterpInputIndex(&in_img_idy, &h_id, &h1lambda, - &h2lambda, src_h, in_img_h); - - // bilinear interpolation - const T* in_pos = - &in[out_id_h * input_w + in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id]; - out[tid] = - h2lambda * - (w2lambda * in_pos[0] + w1lambda * in_pos[w_id * num_channels]) + - h1lambda * - (w2lambda * in_pos[h_id * in_img_w * num_channels] + - w1lambda * - in_pos[h_id * in_img_w * num_channels + w_id * num_channels]); - } -} - -/* Calculate the minimum of partial elements in a block */ -template -__inline__ __device__ T PartialBlockMin(T val, size_t threads_num_in_block, - unsigned mask) { - __shared__ T shared[WARP_SIZE]; - __shared__ T shared_last_val; - __shared__ int shared_last_idx; - int lane = threadIdx.x & 0x1f; - int wid = threadIdx.x >> 5; - int threshold = (threads_num_in_block & (-WARP_SIZE)); - - if (threadIdx.x < threshold) { - shared_last_idx = (threshold >> 5) - 1; - val = phi::funcs::warpReduceMin(val, mask); - if (lane == 0) { - shared[wid] = val; - } - } else { - shared_last_val = std::numeric_limits::max(); - platform::CudaAtomicMin(&shared_last_val, val); - shared[wid] = shared_last_val; - shared_last_idx = wid; - } - __syncthreads(); - - if (threadIdx.x < threshold) { - val = (lane <= shared_last_idx) ? shared[lane] - : std::numeric_limits::max(); - val = phi::funcs::warpReduceMin(val, mask); - shared_last_val = val; - } - __syncthreads(); - if (threadIdx.x >= threshold) { - val = shared_last_val; - } - return val; -} - -template -__global__ void KeBilinearInterpBwShareMemory( - T* in, const int in_h, const int in_w, const T* __restrict__ out, - const int out_h, const int out_w, const int n, const int num_channels, - float ratio_h, float ratio_w, const T align_type_value, bool is_nchw) { - __shared__ T s_data[2][1024]; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - int in_chw = in_h * in_w * num_channels; - int out_chw = num_channels * out_h * out_w; - int nthreads = n * out_chw; - - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / out_chw; - int out_id_w = tid % out_chw; - const int in_img_size = in_h * in_w; - const int out_img_size = out_h * out_w; - T value = out[out_id_h * out_chw + out_id_w]; - - int channel_id = out_id_w / out_img_size; - int out_img_idy = (out_id_w % out_img_size) / out_w; - int out_img_idx = tid % out_w; - - int in_img_idx, in_img_idy, w_id, h_id; - T w1lambda, h1lambda, w2lambda, h2lambda; - T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; - T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; - - PreCalculatorForLinearInterpInputIndex(&in_img_idx, &w_id, &w1lambda, - &w2lambda, src_w, in_w); - PreCalculatorForLinearInterpInputIndex(&in_img_idy, &h_id, &h1lambda, - &h2lambda, src_h, in_h); - - // top_left_index is just input_index. 
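    // The remainder of this loop stages the gradient through shared memory
    // rather than issuing four global atomics per output element: each thread
    // derives the flat indices of its four bilinear taps (top-left is
    // input_index, plus top-right, bottom-left and bottom-right), the block
    // agrees on the smallest and largest index it touches per row
    // (blockReduceMin/Max, with PartialBlockMin handling a partially filled
    // last block), the weighted gradients are accumulated into
    // s_data[0]/s_data[1] with shared-memory atomics, and each covered input
    // element is then flushed to global memory with one CudaAtomicAdd per row.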
- int input_index = out_id_h * in_chw + channel_id * in_img_size + - in_img_idy * in_w + in_img_idx; - int top_right_index = input_index + w_id; - int bot_left_index = input_index + h_id * in_w; - int bot_right_index = input_index + h_id * in_w + w_id; - int in_top_min_index, in_bot_min_index; - - s_data[0][threadIdx.x] = 0.f; - s_data[1][threadIdx.x] = 0.f; - int remain = nthreads - (tid & (-blockDim.x)); - int in_top_max_index = - phi::funcs::blockReduceMax(top_right_index, FINAL_MASK); - int in_bot_max_index = - phi::funcs::blockReduceMax(bot_right_index, FINAL_MASK); - - if (remain > blockDim.x) { - in_top_min_index = phi::funcs::blockReduceMin(input_index, FINAL_MASK); - in_bot_min_index = phi::funcs::blockReduceMin(bot_left_index, FINAL_MASK); - } else { - in_top_min_index = PartialBlockMin(input_index, remain, FINAL_MASK); - in_bot_min_index = PartialBlockMin(bot_left_index, remain, FINAL_MASK); - } - int upper_limit_share_idx = (in_top_max_index - in_top_min_index) > - (in_bot_max_index - in_bot_min_index) - ? (in_top_max_index - in_top_min_index) - : (in_bot_max_index - in_bot_min_index); - if (h_id != 0) { - platform::CudaAtomicAdd(&s_data[0][input_index - in_top_min_index], - h2lambda * w2lambda * value); - platform::CudaAtomicAdd(&s_data[0][top_right_index - in_top_min_index], - h2lambda * w1lambda * value); - platform::CudaAtomicAdd(&s_data[1][bot_left_index - in_bot_min_index], - h1lambda * w2lambda * value); - platform::CudaAtomicAdd(&s_data[1][bot_right_index - in_bot_min_index], - h1lambda * w1lambda * value); - } else { - platform::CudaAtomicAdd(&s_data[0][top_right_index - in_top_min_index], - (h2lambda + h1lambda) * w1lambda * value); - platform::CudaAtomicAdd(&s_data[1][bot_left_index - in_bot_min_index], - (h1lambda + h2lambda) * w2lambda * value); - } - __syncthreads(); - - if (threadIdx.x <= upper_limit_share_idx) { - platform::CudaAtomicAdd(&in[in_top_min_index + threadIdx.x], - s_data[0][threadIdx.x]); - platform::CudaAtomicAdd(&in[in_bot_min_index + threadIdx.x], - s_data[1][threadIdx.x]); - } - } -} - -__device__ __forceinline__ int GetInputIndex(const size_t nc, const int height, - const int width, const int h, - const int w) { - return (nc * height + h) * width + w; -} - -template -__global__ void KeBilinearInterpNCHWBw(T* in, const int in_h, const int in_w, - const int out_h, const int out_w, - const int n, const int num_channels, - float ratio_h, float ratio_w, - const T* __restrict__ out, - const T align_type_value) { - int index = threadIdx.x + blockDim.x * blockIdx.x; - int stride = blockDim.x * gridDim.x; - int num_out = n * num_channels * out_h * out_w; - int num_in = n * num_channels * in_h * in_w; - - for (; index < num_out; index += stride) { - int index_tmp = index; - int w2 = index_tmp % out_w; - index_tmp /= out_w; - int h2 = index_tmp % out_h; - int nc = index_tmp / out_h; - - int h1, y_id; - T h1lambda, h0lambda; - T src_y = ratio_h * (h2 + align_type_value) - align_type_value; - - PreCalculatorForLinearInterpInputIndex(&h1, &y_id, &h1lambda, &h0lambda, - src_y, in_h); - int w1, x_id; - T w1lambda, w0lambda; - T src_x = ratio_w * (w2 + align_type_value) - align_type_value; - PreCalculatorForLinearInterpInputIndex(&w1, &x_id, &w1lambda, &w0lambda, - src_x, in_w); - - T d2val = out[index]; - - platform::CudaAtomicAdd(in + GetInputIndex(nc, in_h, in_w, h1, w1), - h0lambda * w0lambda * d2val); - platform::CudaAtomicAdd(in + GetInputIndex(nc, in_h, in_w, h1, w1 + x_id), - h0lambda * w1lambda * d2val); - platform::CudaAtomicAdd(in + GetInputIndex(nc, 
in_h, in_w, h1 + y_id, w1), - h1lambda * w0lambda * d2val); - platform::CudaAtomicAdd( - in + GetInputIndex(nc, in_h, in_w, h1 + y_id, w1 + x_id), - h1lambda * w1lambda * d2val); - } -} - -template -__global__ void KeBilinearInterpBw(T* in, const int in_h, const int in_w, - const T* __restrict__ out, const int out_h, - const int out_w, const int n, - const int out_chw, const int num_channels, - float ratio_h, float ratio_w, - const T align_type_value, - FastDivModForInterpolate divmods) { - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - int in_chw = in_h * in_w * num_channels; - int nthreads = n * out_chw; - - for (; tid < nthreads; tid += stride) { - auto out_id_divmod = divmods.output_w_div.Divmod(tid); - int out_id_h = out_id_divmod.val[0]; - int out_id_w = out_id_divmod.val[1]; - - int channel_id = divmods.channels_div.Divmod(tid).val[1]; - auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); - int out_img_idy = outimg_id_divmod.val[0]; - int out_img_idx = - divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; - - int in_img_idx, in_img_idy, w_id, h_id; - T w1lambda, h1lambda, w2lambda, h2lambda; - T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; - T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; - - PreCalculatorForLinearInterpInputIndex(&in_img_idx, &w_id, &w1lambda, - &w2lambda, src_w, in_w); - PreCalculatorForLinearInterpInputIndex(&in_img_idy, &h_id, &h1lambda, - &h2lambda, src_h, in_h); - - T value = out[tid]; - T* in_pos = &in[out_id_h * in_chw + in_img_idy * in_w * num_channels + - in_img_idx * num_channels + channel_id]; - platform::CudaAtomicAdd(&in_pos[0], h2lambda * w2lambda * value); - platform::CudaAtomicAdd(&in_pos[w_id * num_channels], - h2lambda * w1lambda * value); - platform::CudaAtomicAdd(&in_pos[h_id * in_w * num_channels], - h1lambda * w2lambda * value); - platform::CudaAtomicAdd( - &in_pos[h_id * in_w * num_channels + w_id * num_channels], - h1lambda * w1lambda * value); - } -} - -template -__global__ void KeTrilinearInterpFw( - const T* in, const size_t in_img_d, const size_t in_img_h, - const size_t in_img_w, const size_t input_h, const size_t input_w, T* out, - const size_t out_img_d, const size_t out_img_h, const size_t out_img_w, - const size_t output_h, const size_t output_w, const size_t num_channels, - const float ratio_d, const float ratio_h, const float ratio_w, - const bool align_corners, const int align_mode, - const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - bool align_flag = (align_mode == 0 && !align_corners); - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idt, out_img_idy, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; - out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; - out_img_idx = tid % out_img_w; - } else { - out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); - out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / - (out_img_w * num_channels); - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - int in_img_idt = align_flag - ? 
static_cast(ratio_d * (out_img_idt + 0.5) - 0.5) - : static_cast(ratio_d * out_img_idt); - in_img_idt = (in_img_idt > 0) ? in_img_idt : 0; - int d_id = (in_img_idt < in_img_d - 1) ? 1 : 0; - T src_d = ratio_d * (out_img_idt + 0.5) - 0.5; - src_d = (src_d > 0) ? src_d : 0; - T d1lambda = - align_flag ? src_d - in_img_idt : ratio_d * out_img_idt - in_img_idt; - T d2lambda = 1.f - d1lambda; - - int in_img_idy = align_flag - ? static_cast(ratio_h * (out_img_idy + 0.5) - 0.5) - : static_cast(ratio_h * out_img_idy); - in_img_idy = (in_img_idy > 0) ? in_img_idy : 0; - int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; - T src_h = ratio_h * (out_img_idy + 0.5) - 0.5; - src_h = (src_h > 0) ? src_h : 0; - T h1lambda = - align_flag ? src_h - in_img_idy : ratio_h * out_img_idy - in_img_idy; - T h2lambda = 1.f - h1lambda; - - int in_img_idx = align_flag - ? static_cast(ratio_w * (out_img_idx + 0.5) - 0.5) - : static_cast(ratio_w * out_img_idx); - in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; - T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; - src_w = (src_w > 0) ? src_w : 0; - T w1lambda = - align_flag ? src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - if (data_layout == DataLayout::kNCHW) { - int in_pos1_idx = out_id_h * input_w + channel_id * in_img_size + - (in_img_idt * in_img_h + in_img_idy) * in_img_w + - in_img_idx; - const T* in_pos1 = &in[in_pos1_idx]; - int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w; - const T* in_pos2 = &in[in_pos2_idx]; - - // trilinear interpolation - out[out_id_h * output_w + out_id_w] = - d2lambda * - (h2lambda * (w2lambda * in_pos1[0] + w1lambda * in_pos1[w_id]) + - h1lambda * (w2lambda * in_pos1[h_id * in_img_w] + - w1lambda * in_pos1[h_id * in_img_w + w_id])) + - d1lambda * - (h2lambda * (w2lambda * in_pos2[0] + w1lambda * in_pos2[w_id]) + - h1lambda * (w2lambda * in_pos2[h_id * in_img_w] + - w1lambda * in_pos2[h_id * in_img_w + w_id])); - - } else { - int in_pos1_idx = out_id_h * input_w + - in_img_idt * in_img_h * in_img_w * num_channels + - in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id; - const T* in_pos1 = &in[in_pos1_idx]; - int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w * num_channels; - const T* in_pos2 = &in[in_pos2_idx]; - - // trilinear interpolation - out[out_id_h * output_w + out_id_w] = - d2lambda * - (h2lambda * (w2lambda * in_pos1[0] + - w1lambda * in_pos1[w_id * num_channels]) + - h1lambda * (w2lambda * in_pos1[h_id * in_img_w * num_channels] + - w1lambda * in_pos1[h_id * in_img_w * num_channels + - w_id * num_channels])) + - d1lambda * - (h2lambda * (w2lambda * in_pos2[0] + - w1lambda * in_pos2[w_id * num_channels]) + - h1lambda * (w2lambda * in_pos2[h_id * in_img_w * num_channels] + - w1lambda * in_pos2[h_id * in_img_w * num_channels + - w_id * num_channels])); - } - } -} - -template -__global__ void KeTrilinearInterpBw( - T* in, const size_t in_img_d, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, const T* out, - const size_t out_img_d, const size_t out_img_h, const size_t out_img_w, - const size_t output_h, const size_t output_w, const size_t num_channels, - const T ratio_d, const T ratio_h, const T ratio_w, const bool align_corners, - const int align_mode, const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - bool align_flag = (align_mode == 0 && 
!align_corners); - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idt, out_img_idy, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; - out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; - out_img_idx = tid % out_img_w; - } else { - out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); - out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / - (out_img_w * num_channels); - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - int in_img_idt = align_flag - ? static_cast(ratio_d * (out_img_idt + 0.5) - 0.5) - : static_cast(ratio_d * out_img_idt); - in_img_idt = (in_img_idt > 0) ? in_img_idt : 0; - int d_id = (in_img_idt < in_img_d - 1) ? 1 : 0; - T src_d = ratio_d * (out_img_idt + 0.5) - 0.5; - src_d = (src_d > 0) ? src_d : 0; - T d1lambda = - align_flag ? src_d - in_img_idt : ratio_d * out_img_idt - in_img_idt; - T d2lambda = 1.f - d1lambda; - - int in_img_idy = align_flag - ? static_cast(ratio_h * (out_img_idy + 0.5) - 0.5) - : static_cast(ratio_h * out_img_idy); - in_img_idy = (in_img_idy > 0) ? in_img_idy : 0; - int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; - T src_h = ratio_h * (out_img_idy + 0.5) - 0.5; - src_h = (src_h > 0) ? src_h : 0; - T h1lambda = - align_flag ? src_h - in_img_idy : ratio_h * out_img_idy - in_img_idy; - T h2lambda = 1.f - h1lambda; - - int in_img_idx = align_flag - ? static_cast(ratio_w * (out_img_idx + 0.5) - 0.5) - : static_cast(ratio_w * out_img_idx); - in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; - int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; - T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; - src_w = (src_w > 0) ? src_w : 0; - T w1lambda = - align_flag ? 
src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; - T w2lambda = 1.f - w1lambda; - - if (data_layout == DataLayout::kNCHW) { - int in_pos1_idx = out_id_h * input_w + channel_id * in_img_size + - (in_img_idt * in_img_h + in_img_idy) * in_img_w + - in_img_idx; - T* in_pos1 = &in[in_pos1_idx]; - int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w; - T* in_pos2 = &in[in_pos2_idx]; - - const T* out_pos = &out[out_id_h * output_w + out_id_w]; - - // trilinear interpolation grad - platform::CudaAtomicAdd(&in_pos1[0], - d2lambda * h2lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos1[w_id], - d2lambda * h2lambda * w1lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos1[h_id * in_img_w], - d2lambda * h1lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos1[h_id * in_img_w + w_id], - d2lambda * h1lambda * w1lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[0], - d1lambda * h2lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[w_id], - d1lambda * h2lambda * w1lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[h_id * in_img_w], - d1lambda * h1lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[h_id * in_img_w + w_id], - d1lambda * h1lambda * w1lambda * out_pos[0]); - } else { - int in_pos1_idx = out_id_h * input_w + - in_img_idt * in_img_h * in_img_w * num_channels + - in_img_idy * in_img_w * num_channels + - in_img_idx * num_channels + channel_id; - T* in_pos1 = &in[in_pos1_idx]; - int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w * num_channels; - T* in_pos2 = &in[in_pos2_idx]; - - const T* out_pos = &out[out_id_h * output_w + out_id_w]; - - // trilinear interpolation grad - platform::CudaAtomicAdd(&in_pos1[0], - d2lambda * h2lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos1[w_id * num_channels], - d2lambda * h2lambda * w1lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos1[h_id * in_img_w * num_channels], - d2lambda * h1lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd( - &in_pos1[h_id * in_img_w * num_channels + w_id * num_channels], - d2lambda * h1lambda * w1lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[0], - d1lambda * h2lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[w_id * num_channels], - d1lambda * h2lambda * w1lambda * out_pos[0]); - platform::CudaAtomicAdd(&in_pos2[h_id * in_img_w * num_channels], - d1lambda * h1lambda * w2lambda * out_pos[0]); - platform::CudaAtomicAdd( - &in_pos2[h_id * in_img_w * num_channels + w_id * num_channels], - d1lambda * h1lambda * w1lambda * out_pos[0]); - } - } -} - -template -__device__ __forceinline__ static T Kecubic_interp(const T x0, const T x1, - const T x2, const T x3, - T t) { - T coeffs[4]; - T a = -0.75; - T x_1 = t; - T x_2 = 1.0 - t; - coeffs[0] = cubic_convolution2(x_1 + 1.0, a); - coeffs[1] = cubic_convolution1(x_1, a); - coeffs[2] = cubic_convolution1(x_2, a); - coeffs[3] = cubic_convolution2(x_2 + 1.0, a); - return x0 * coeffs[0] + x1 * coeffs[1] + x2 * coeffs[2] + x3 * coeffs[3]; -} - -template -__global__ void KeBicubicInterpFw( - const T* in, const size_t in_img_h, const size_t in_img_w, - const size_t input_h, const size_t input_w, T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const float ratio_h, const float ratio_w, - const bool align_corners, const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * 
gridDim.x; - - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idy, out_img_idx; - - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idy = (out_id_w % out_img_size) / out_img_w; - out_img_idx = tid % out_img_w; - } else { - out_img_idy = out_id_w / (out_img_w * num_channels); - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - T in_img_idy = align_corners - ? static_cast(ratio_h * out_img_idy) - : static_cast(ratio_h * (out_img_idy + 0.5) - 0.5); - int input_y = floorf(in_img_idy); - const T y_t = in_img_idy - input_y; - - T in_img_idx = align_corners - ? static_cast(ratio_w * out_img_idx) - : static_cast(ratio_w * (out_img_idx + 0.5) - 0.5); - int input_x = floorf(in_img_idx); - const T x_t = in_img_idx - input_x; - - T coefficients[4]; - const T* in_pos_0; - const T* in_pos_1; - const T* in_pos_2; - const T* in_pos_3; - int access_x_0; - if (data_layout == DataLayout::kNCHW) { - for (int k = 0; k < 4; k++) { - int access_y = - max(min(input_y - 1 + k, static_cast(in_img_h - 1)), 0); - access_x_0 = max(min(input_x - 1, static_cast(in_img_w - 1)), 0); - int access_x_1 = - max(min(input_x + 0, static_cast(in_img_w - 1)), 0); - int access_x_2 = - max(min(input_x + 1, static_cast(in_img_w - 1)), 0); - int access_x_3 = - max(min(input_x + 2, static_cast(in_img_w - 1)), 0); - - in_pos_0 = &in[out_id_h * input_w + channel_id * in_img_size + - access_y * in_img_w + access_x_0]; - in_pos_1 = &in[out_id_h * input_w + channel_id * in_img_size + - access_y * in_img_w + access_x_1]; - in_pos_2 = &in[out_id_h * input_w + channel_id * in_img_size + - access_y * in_img_w + access_x_2]; - in_pos_3 = &in[out_id_h * input_w + channel_id * in_img_size + - access_y * in_img_w + access_x_3]; - - coefficients[k] = Kecubic_interp(in_pos_0[0], in_pos_1[0], - in_pos_2[0], in_pos_3[0], x_t); - } - - out[out_id_h * output_w + out_id_w] = - Kecubic_interp(coefficients[0], coefficients[1], coefficients[2], - coefficients[3], y_t); - - } else { - for (int k = 0; k < 4; k++) { - int access_y = - max(min(input_y - 1 + k, static_cast((in_img_h - 1))), 0); - int access_x_0 = - max(min(input_x - 1, static_cast((in_img_w - 1))), 0); - int access_x_1 = - max(min(input_x + 0, static_cast((in_img_w - 1))), 0); - int access_x_2 = - max(min(input_x + 1, static_cast((in_img_w - 1))), 0); - int access_x_3 = - max(min(input_x + 2, static_cast((in_img_w - 1))), 0); - - const T* in_pos_0 = - &in[out_id_h * input_w + access_y * in_img_w * num_channels + - access_x_0 * num_channels + channel_id]; - const T* in_pos_1 = - &in[out_id_h * input_w + access_y * in_img_w * num_channels + - access_x_1 * num_channels + channel_id]; - const T* in_pos_2 = - &in[out_id_h * input_w + access_y * in_img_w * num_channels + - access_x_2 * num_channels + channel_id]; - const T* in_pos_3 = - &in[out_id_h * input_w + access_y * in_img_w * num_channels + - access_x_3 * num_channels + channel_id]; - - coefficients[k] = Kecubic_interp(in_pos_0[0], in_pos_1[0], in_pos_2[0], - in_pos_3[0], x_t); - } - - out[out_id_h * output_w + out_id_w] = - static_cast(Kecubic_interp(coefficients[0], coefficients[1], - coefficients[2], coefficients[3], y_t)); - } - } -} - -template -__global__ void KeBicubicInterpBw( - T* in, const size_t in_img_h, const size_t in_img_w, const size_t input_h, - const 
size_t input_w, const T* out, const size_t out_img_h, - const size_t out_img_w, const size_t output_h, const size_t output_w, - const size_t num_channels, const float ratio_h, const float ratio_w, - const bool align_corners, const DataLayout data_layout) { - int nthreads = output_h * output_w; - int tid = blockIdx.x * blockDim.x + threadIdx.x; - int stride = blockDim.x * gridDim.x; - - for (; tid < nthreads; tid += stride) { - int out_id_h = tid / output_w; - int out_id_w = tid % output_w; - int in_img_size = input_w / num_channels; - int out_img_size = output_w / num_channels; - - int channel_id, out_img_idy, out_img_idx; - if (data_layout == DataLayout::kNCHW) { - channel_id = out_id_w / out_img_size; - out_img_idy = (out_id_w % out_img_size) / out_img_w; - out_img_idx = tid % out_img_w; - } else { - out_img_idy = out_id_w / (out_img_w * num_channels); - out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; - channel_id = tid % num_channels; - } - - T in_img_idy = align_corners - ? static_cast(ratio_h * out_img_idy) - : static_cast(ratio_h * (out_img_idy + 0.5) - 0.5); - int input_y = floorf(in_img_idy); - const T y_t = in_img_idy - input_y; - - T in_img_idx = align_corners - ? static_cast(ratio_w * out_img_idx) - : static_cast(ratio_w * (out_img_idx + 0.5) - 0.5); - int input_x = floorf(in_img_idx); - - const T x_t = in_img_idx - input_x; - - T x_coeffs[4]; - T y_coeffs[4]; - - get_cubic_upsample_coefficients(x_coeffs, x_t); - get_cubic_upsample_coefficients(y_coeffs, y_t); - - const T* out_pos = &out[out_id_h * output_w + out_id_w]; - T* in_pos; - - for (int i = 0; i < 4; i++) { - for (int j = 0; j < 4; j++) { - int access_y = max(min(static_cast(input_y - 1 + j), - static_cast(in_img_h - 1)), - 0); - int access_x = max(min(static_cast(input_x - 1 + i), - static_cast(in_img_w - 1)), - 0); - if (data_layout == DataLayout::kNCHW) { - in_pos = &in[out_id_h * input_w + channel_id * in_img_size + - access_y * in_img_w + access_x]; - } else { - in_pos = &in[out_id_h * input_w + access_y * in_img_w * num_channels + - access_x * num_channels + channel_id]; - } - platform::CudaAtomicAdd(&in_pos[0], - (out_pos[0] * y_coeffs[j] * x_coeffs[i])); - } - } - } -} - -template -static void Interpolate1DCUDAFwd(const framework::ExecutionContext& ctx, - const Tensor& input, Tensor* output) { - auto* input_data = input.data(); - - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_w = ctx.Attr("out_w"); - - auto list_new_shape_tensor = ctx.MultiInput("SizeTensor"); - float scale_w = -1; - if (list_new_shape_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_shape_tensor); - out_w = new_size[0]; - } else { - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - scale_w = scale_data[0]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } else { - if (scale.size() > 0) { - scale_w = scale[0]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - 
platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } - } - if (scale_w > 0.) { - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - Tensor sizes; - framework::TensorCopySync(*out_size, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_w = size_data[0]; - } - } - PADDLE_ENFORCE_GT(out_w, 0, platform::errors::InvalidArgument( - "out_w in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - framework::DDim dim_out; - if (data_layout == DataLayout::kNCHW) { - dim_out = {n, c, out_w}; - } else { - dim_out = {n, out_w, c}; - } - auto output_data = output->mutable_data(dim_out, ctx.GetPlace()); - - if (in_w == out_w) { - framework::TensorCopy(input, ctx.GetPlace(), output); - return; - } - - float ratio_w = 0.f; - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? static_cast(in_w - 1.0) / (out_w - 1.0) - : static_cast(new_scale_w); - } - - int64_t in_cw = c * in_w; - int64_t out_cw = c * out_w; - auto pixelNum = n * out_cw; - - platform::GpuLaunchConfig config = - platform::GetGpuLaunchConfig1D(ctx.cuda_device_context(), pixelNum); - - if ("linear" == interp_method) { - KeLinearInterpFw<<>>( - input_data, in_w, in_cw, output_data, out_w, n, out_cw, c, ratio_w, - align_corners, align_mode, data_layout); - } -} - -template -static void Interpolate2DCUDAFwd(const framework::ExecutionContext& ctx, - const Tensor& input, Tensor* output) { - auto* input_data = input.data(); - - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - auto list_new_shape_tensor = ctx.MultiInput("SizeTensor"); - float scale_w = -1; - float scale_h = -1; - if (list_new_shape_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_shape_tensor); - out_h = new_size[0]; - out_w = new_size[1]; - } else { - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_h = scale_data[0]; - scale_w = scale_data[1]; - } else { - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } else { - if (scale.size() > 1) { - scale_w = scale[1]; - scale_h = scale[0]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - 
platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } - } - if (scale_w > 0. && scale_h > 0.) { - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - Tensor sizes; - framework::TensorCopySync(*out_size, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_h = size_data[0]; - out_w = size_data[1]; - } - } - PADDLE_ENFORCE_GT(out_h, 0, platform::errors::InvalidArgument( - "out_h in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - PADDLE_ENFORCE_GT(out_w, 0, platform::errors::InvalidArgument( - "out_w in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - - framework::DDim dim_out; - if (data_layout == DataLayout::kNCHW) { - dim_out = {n, c, out_h, out_w}; - } else { - dim_out = {n, out_h, out_w, c}; - } - auto output_data = output->mutable_data(dim_out, ctx.GetPlace()); - - if (in_h == out_h && in_w == out_w) { - framework::TensorCopy(input, ctx.GetPlace(), output); - return; - } - - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - int64_t in_hw = in_h * in_w; - int64_t out_hw = out_h * out_w; - int64_t in_chw = c * in_hw; - int64_t out_chw = c * out_hw; - - auto pixelNum = n * out_chw; - - platform::GpuLaunchConfig config = - platform::GetGpuLaunchConfig1D(ctx.cuda_device_context(), pixelNum); - - if ("nearest" == interp_method) { - if (data_layout == DataLayout::kNCHW) { - // get launch 3D config - int nc = n * c; - platform::GpuLaunchConfig config_3d = - GetGpuLaunchConfig3D(ctx.cuda_device_context(), nc, out_h, out_w); - KeNearestNeighborInterpNCHWFw< - T><<>>( - input_data, in_h, in_w, output_data, out_h, out_w, nc, ratio_h, - ratio_w, align_corners); - } else { - int64_t cw = c * out_w; - auto interp_divmods = FastDivModForInterpolate(c, out_chw, cw); - KeNearestNeighborInterpFw< - T><<>>( - input_data, in_h, in_w, n, in_chw, output_data, out_h, out_w, n, - out_chw, c, ratio_h, ratio_w, align_corners, interp_divmods); - } - } else if ("bilinear" == interp_method) { - dim3 thread_num = config.thread_per_block; -#ifdef WITH_NV_JETSON - if (config.compute_capability == 53 || config.compute_capability == 62) { - thread_num = 512; - } -#endif - const T align_type_value = (align_mode == 0 && !align_corners) ? 
0.5f : 0; - if (data_layout == DataLayout::kNCHW) { - // get launch 3D config - int nc = n * c; - platform::GpuLaunchConfig config_3d = - GetGpuLaunchConfig3D(ctx.cuda_device_context(), nc, out_h, out_w); - KeBilinearInterpNCHWFw< - T><<>>( - input_data, in_h, in_w, output_data, out_h, out_w, nc, ratio_h, - ratio_w, align_type_value); - } else { - int64_t cw = c * out_w; - auto interp_divmods = FastDivModForInterpolate(c, out_chw, cw); - KeBilinearInterpFw<<>>( - input_data, in_h, in_w, n, in_chw, output_data, out_h, out_w, n, - out_chw, c, ratio_h, ratio_w, align_type_value, interp_divmods); - } - } else if ("bicubic" == interp_method) { -#ifdef __HIPCC__ - constexpr int thread_per_block = 256; -#else - constexpr int thread_per_block = 512; -#endif - KeBicubicInterpFw<<>>( - input_data, in_h, in_w, n, in_chw, output_data, out_h, out_w, n, - out_chw, c, ratio_h, ratio_w, align_corners, data_layout); - } -} - -template -static void Interpolate3DCUDAFwd(const framework::ExecutionContext& ctx, - const Tensor& input, Tensor* output) { - auto* input_data = input.data(); - - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_d = ctx.Attr("out_d"); - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - auto list_new_shape_tensor = ctx.MultiInput("SizeTensor"); - float scale_w = -1; - float scale_d = -1; - float scale_h = -1; - if (list_new_shape_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_shape_tensor); - out_d = new_size[0]; - out_h = new_size[1]; - out_w = new_size[2]; - } else { - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_d = scale_data[0]; - scale_h = scale_data[1]; - scale_w = scale_data[2]; - } else { - scale_d = scale_data[0]; - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } else { - if (scale.size() > 1) { - scale_d = scale[0]; - scale_h = scale[1]; - scale_w = scale[2]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in 
Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } - } - if (scale_d > 0. && scale_h > 0. && scale_w > 0.) { - out_d = static_cast(in_d * scale_d); - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - Tensor sizes; - framework::TensorCopySync(*out_size, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_d = size_data[0]; - out_h = size_data[1]; - out_w = size_data[2]; - } - } - PADDLE_ENFORCE_GT(out_d, 0, platform::errors::InvalidArgument( - "out_d in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - PADDLE_ENFORCE_GT(out_h, 0, platform::errors::InvalidArgument( - "out_h in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - PADDLE_ENFORCE_GT(out_w, 0, platform::errors::InvalidArgument( - "out_w in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - - framework::DDim dim_out; - if (data_layout == DataLayout::kNCHW) { - dim_out = {n, c, out_d, out_h, out_w}; - } else { - dim_out = {n, out_d, out_h, out_w, c}; - } - auto output_data = output->mutable_data(dim_out, ctx.GetPlace()); - - if (in_d == out_d && in_h == out_h && in_w == out_w) { - framework::TensorCopy(input, ctx.GetPlace(), output); - return; - } - - float ratio_d = 0.f; - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_d > 1) { - float new_scale_d = 0.f; - new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) - : static_cast(in_d) / out_d; - ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) - : static_cast(new_scale_d); - } - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - int64_t in_dhw = in_d * in_h * in_w; - int64_t out_dhw = out_d * out_h * out_w; - int64_t in_cdhw = c * in_dhw; - int64_t out_cdhw = c * out_dhw; - - auto pixelNum = n * out_cdhw; - - platform::GpuLaunchConfig config = - platform::GetGpuLaunchConfig1D(ctx.cuda_device_context(), pixelNum); - - if ("trilinear" == interp_method) { - KeTrilinearInterpFw<<>>( - input_data, in_d, in_h, in_w, n, in_cdhw, output_data, out_d, out_h, - out_w, n, out_cdhw, c, ratio_d, ratio_h, ratio_w, align_corners, - align_mode, data_layout); - } else if ("nearest" == interp_method) { - KeNearestNeighbor3DInterpFw< - T><<>>( - input_data, in_d, in_h, in_w, n, in_cdhw, output_data, out_d, out_h, - out_w, n, out_cdhw, c, ratio_d, ratio_h, ratio_w, align_corners, - data_layout); - } -} - -template -static void Interpolate1DCUDABwd(const framework::ExecutionContext& ctx, - Tensor* input_grad, const Tensor output_grad) { - auto* input = ctx.Input("X"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_w = ctx.Attr("out_w"); - float scale_w = -1; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - scale_w = scale_data[0]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } else { - if (scale.size() > 0) { - scale_w = scale[0]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } - } - if (scale_w > 0.) { - out_w = static_cast(in_w * scale_w); - } - - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - Tensor sizes; - framework::TensorCopySync(*out_size, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_w = size_data[0]; - } - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_w = new_size[0]; - } - - auto* output_grad_data = output_grad.data(); - framework::DDim dim_grad; - if (data_layout == DataLayout::kNCHW) { - dim_grad = {n, c, in_w}; - } else { - dim_grad = {n, in_w, c}; - } - input_grad->mutable_data(dim_grad, ctx.GetPlace()); - auto* input_grad_data = input_grad->mutable_data(dim_grad, ctx.GetPlace()); - auto& device_ctx = ctx.template device_context(); - phi::funcs::SetConstant zero; - zero(device_ctx, input_grad, static_cast(0.0)); - - if (in_w == out_w) { - framework::TensorCopy(output_grad, ctx.GetPlace(), input_grad); - return; - } - - float ratio_w = 0.f; - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - int64_t in_cw = c * in_w; - int64_t out_cw = c * out_w; - auto pixelNum = n * out_cw; - - platform::GpuLaunchConfig config = - platform::GetGpuLaunchConfig1D(ctx.cuda_device_context(), pixelNum); - - if ("linear" == interp_method) { - KeLinearInterpBw<<>>( - input_grad_data, in_w, in_cw, output_grad_data, out_w, n, out_cw, c, - ratio_w, align_corners, align_mode, data_layout); - } -} - -template -static void Interpolate2DCUDABwd(const framework::ExecutionContext& ctx, - Tensor* input_grad, const Tensor output_grad) { - auto* input = ctx.Input("X"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - float scale_h = -1; - float scale_w = -1; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_h = scale_data[0]; - scale_w = scale_data[1]; - } else { - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } else { - if (scale.size() > 1) { - scale_w = scale[1]; - scale_h = scale[0]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } - } - if (scale_w > 0. && scale_h > 0.) 
{ - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - Tensor sizes; - framework::TensorCopySync(*out_size, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_h = size_data[0]; - out_w = size_data[1]; - } - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_h = new_size[0]; - out_w = new_size[1]; - } - - auto* output_grad_data = output_grad.data(); - framework::DDim dim_grad; - if (data_layout == DataLayout::kNCHW) { - dim_grad = {n, c, in_h, in_w}; - } else { - dim_grad = {n, in_h, in_w, c}; - } - input_grad->mutable_data(dim_grad, ctx.GetPlace()); - auto* input_grad_data = input_grad->mutable_data(dim_grad, ctx.GetPlace()); - auto& device_ctx = ctx.template device_context(); - phi::funcs::SetConstant zero; - zero(device_ctx, input_grad, static_cast(0.0)); - - if (in_h == out_h && in_w == out_w) { - framework::TensorCopy(output_grad, ctx.GetPlace(), input_grad); - return; - } - - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - int64_t in_hw = in_h * in_w; - int64_t out_hw = out_h * out_w; - int64_t in_chw = c * in_hw; - int64_t out_chw = c * out_hw; - auto pixelNum = n * out_chw; - - platform::GpuLaunchConfig config = - platform::GetGpuLaunchConfig1D(ctx.cuda_device_context(), pixelNum); - - if ("nearest" == interp_method) { - if (data_layout == DataLayout::kNCHW) { - // get launch 3D config - int nc = n * c; - platform::GpuLaunchConfig config_3d = - GetGpuLaunchConfig3D(ctx.cuda_device_context(), nc, out_h, out_w); - KeNearestNeighborInterpNCHWBw< - T><<>>( - input_grad_data, in_h, in_w, output_grad_data, out_h, out_w, nc, - ratio_h, ratio_w, align_corners); - } else { - int64_t cw = c * out_w; - auto interp_divmods = FastDivModForInterpolate(c, out_chw, cw); - KeNearestNeighborInterpBw< - T><<>>( - input_grad_data, in_h, in_w, n, in_chw, output_grad_data, out_h, - out_w, n, out_chw, c, ratio_h, ratio_w, align_corners, - interp_divmods); - } - } else if ("bilinear" == interp_method) { - const T align_type_value = (align_mode == 0 && !align_corners) ? 0.5f : 0; - bool is_nchw = (data_layout == DataLayout::kNCHW) ? true : false; - bool optimize_flag = false; -#ifndef __HIPCC__ - optimize_flag = (in_h < (out_h >> 6) && in_w < (out_w >> 6)) - ? true - : ((in_h == 1 && in_w == 1) ? 
true : false); -#endif - - if (optimize_flag & is_nchw) { - KeBilinearInterpBwShareMemory< - T><<>>( - input_grad_data, in_h, in_w, output_grad_data, out_h, out_w, n, c, - ratio_h, ratio_w, align_type_value, is_nchw); - } else if (!optimize_flag & is_nchw) { - // - const int num_kernels = n * c * out_h * out_w; - const int num_threads = - std::min(ctx.cuda_device_context().GetMaxThreadsPerBlock(), 1024); - KeBilinearInterpNCHWBw< - T><<>>( - input_grad_data, in_h, in_w, out_h, out_w, n, c, ratio_h, ratio_w, - output_grad_data, align_type_value); - } else { - int64_t cw = c * out_w; - auto interp_divmods = FastDivModForInterpolate(c, out_chw, cw); - KeBilinearInterpBw<<>>( - input_grad_data, in_h, in_w, output_grad_data, out_h, out_w, n, - out_chw, c, ratio_h, ratio_w, align_type_value, interp_divmods); - } - } else if ("bicubic" == interp_method) { -#ifdef __HIPCC__ - constexpr int thread_per_block = 256; -#else - constexpr int thread_per_block = 512; -#endif - KeBicubicInterpBw<<>>( - input_grad_data, in_h, in_w, n, in_chw, output_grad_data, out_h, out_w, - n, out_chw, c, ratio_h, ratio_w, align_corners, data_layout); - } -} - -template -static void Interpolate3DCUDABwd(const framework::ExecutionContext& ctx, - Tensor* input_grad, - const Tensor& output_grad) { - auto* input = ctx.Input("X"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_d = ctx.Attr("out_d"); - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - float scale_d = -1; - float scale_h = -1; - float scale_w = -1; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_d = scale_data[0]; - scale_h = scale_data[1]; - scale_w = scale_data[2]; - } else { - scale_d = scale_data[0]; - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } else { - if (scale.size() > 1) { - scale_d = scale[0]; - scale_h = scale[1]; - scale_w = scale[2]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in Attr(scale) of Operator(interpolate) " - 
"should be greater than 0, but received value is %d.", - scale_d)); - } - } - if (scale_d > 0. && scale_h > 0. && scale_w > 0.) { - out_d = static_cast(in_d * scale_d); - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - Tensor sizes; - framework::TensorCopySync(*out_size, platform::CPUPlace(), &sizes); - auto size_data = sizes.data(); - out_d = size_data[0]; - out_h = size_data[1]; - out_w = size_data[2]; - } - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_d = new_size[0]; - out_h = new_size[1]; - out_w = new_size[2]; - } - - auto* output_grad_data = output_grad.data(); - framework::DDim dim_grad; - if (data_layout == DataLayout::kNCHW) { - dim_grad = {n, c, in_d, in_h, in_w}; - } else { - dim_grad = {n, in_d, in_h, in_w, c}; - } - auto* input_grad_data = input_grad->mutable_data(dim_grad, ctx.GetPlace()); - auto& device_ctx = ctx.template device_context(); - phi::funcs::SetConstant zero; - zero(device_ctx, input_grad, static_cast(0.0)); - - if (in_d == out_d && in_h == out_h && in_w == out_w) { - framework::TensorCopy(output_grad, ctx.GetPlace(), input_grad); - return; - } - - float ratio_d = 0.f; - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_d > 1) { - float new_scale_d = 0.f; - new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) - : static_cast(in_d) / out_d; - ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) - : static_cast(new_scale_d); - } - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - int64_t in_dhw = in_d * in_h * in_w; - int64_t out_dhw = out_d * out_h * out_w; - int64_t in_cdhw = c * in_dhw; - int64_t out_cdhw = c * out_dhw; - - auto pixelNum = n * out_cdhw; - - platform::GpuLaunchConfig config = - platform::GetGpuLaunchConfig1D(ctx.cuda_device_context(), pixelNum); - - if ("trilinear" == interp_method) { - KeTrilinearInterpBw<<>>( - input_grad_data, in_d, in_h, in_w, n, in_cdhw, output_grad_data, out_d, - out_h, out_w, n, out_cdhw, c, ratio_d, ratio_h, ratio_w, align_corners, - align_mode, data_layout); - } else if ("nearest" == interp_method) { - KeNearestNeighbor3DInterpBw< - T><<>>( - input_grad_data, in_d, in_h, in_w, n, in_cdhw, output_grad_data, out_d, - out_h, out_w, n, out_cdhw, c, ratio_d, ratio_h, ratio_w, align_corners, - data_layout); - } -} - -template -class InterpolateOpV2CUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE_EQ( - platform::is_gpu_place(ctx.GetPlace()), true, - platform::errors::NotFound("This kernel only runs on GPU device.")); - auto* input = ctx.Input("X"); - auto* output = ctx.Output("Out"); - - auto input_dims = input->dims(); - if (input_dims.size() == 3) { // 1D interpolation - Interpolate1DCUDAFwd(ctx, *input, output); - } else if (input_dims.size() == 4) { // 2D interpolation - Interpolate2DCUDAFwd(ctx, *input, output); - } else if (input_dims.size() == 5) { // 3D interpolation - Interpolate3DCUDAFwd(ctx, *input, output); - } - } -}; - -template -class InterpolateV2GradOpCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - PADDLE_ENFORCE_EQ( - platform::is_gpu_place(ctx.GetPlace()), true, - platform::errors::NotFound("This kernel only runs on GPU device.")); - auto* input_grad = ctx.Output(framework::GradVarName("X")); - auto* output_grad = ctx.Input(framework::GradVarName("Out")); - - auto output_grad_dims = output_grad->dims(); - if (output_grad_dims.size() == 3) { // 1D interpolation - Interpolate1DCUDABwd(ctx, input_grad, *output_grad); - } else if (output_grad_dims.size() == 4) { // 2D interpolation - Interpolate2DCUDABwd(ctx, input_grad, *output_grad); - } else if (output_grad_dims.size() == 5) { // 3D interpolation - Interpolate3DCUDABwd(ctx, input_grad, *output_grad); - } - } -}; - -} // namespace operators -} // namespace paddle - -namespace ops = paddle::operators; -namespace plat = paddle::platform; -REGISTER_OP_CUDA_KERNEL(bilinear_interp_v2, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel); -REGISTER_OP_CUDA_KERNEL(bilinear_interp_v2_grad, - ops::InterpolateV2GradOpCUDAKernel, - ops::InterpolateV2GradOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(nearest_interp_v2, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel); -REGISTER_OP_CUDA_KERNEL(nearest_interp_v2_grad, - ops::InterpolateV2GradOpCUDAKernel, - ops::InterpolateV2GradOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(trilinear_interp_v2, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel); -REGISTER_OP_CUDA_KERNEL(trilinear_interp_v2_grad, - ops::InterpolateV2GradOpCUDAKernel, - ops::InterpolateV2GradOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(linear_interp_v2, ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel); 
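Every kernel registered here relies on the same output-to-input coordinate mapping: with align_corners the ratio is (in - 1) / (out - 1), otherwise it is 1 / scale (or in / out when no explicit scale is given), and align_mode == 0 additionally shifts sampling by half a pixel through align_type_value. The helper below is a minimal host-side sketch of that mapping and of the two linear-interpolation weights derived from it for one spatial dimension; the struct and function names are illustrative only and do not come from this file.

#include <algorithm>

// Illustrative sketch: mirrors the ratio / align_type_value / lambda logic
// used by the interpolate kernels, for a single spatial axis.
struct SrcCoord {
  int idx;        // left (or top) input index, clamped to be non-negative
  int id;         // 1 if a right (or bottom) neighbour exists, else 0
  float lambda1;  // weight of the right (or bottom) neighbour
  float lambda2;  // weight of the left (or top) neighbour
};

inline SrcCoord MapOutputToInput(int out_idx, int in_size, int out_size,
                                 bool align_corners, int align_mode) {
  float ratio = 0.f;
  if (out_size > 1) {
    // The kernels above prefer 1 / scale when an explicit scale is set;
    // in_size / out_size is the fallback shown here.
    ratio = align_corners ? static_cast<float>(in_size - 1) / (out_size - 1)
                          : static_cast<float>(in_size) / out_size;
  }
  // align_mode == 0 (and not align_corners) treats pixels as cell centres,
  // i.e. the half-pixel shift carried by align_type_value above.
  float align_type_value = (align_mode == 0 && !align_corners) ? 0.5f : 0.f;
  float src = ratio * (out_idx + align_type_value) - align_type_value;
  src = std::max(src, 0.f);

  SrcCoord c;
  c.idx = static_cast<int>(src);  // floor; src is already non-negative
  c.id = (c.idx < in_size - 1) ? 1 : 0;
  c.lambda1 = src - static_cast<float>(c.idx);
  c.lambda2 = 1.f - c.lambda1;
  return c;
}

With align_corners the first and last input samples map exactly onto the first and last output samples; with the half-pixel shift the grids are centre-aligned instead, which is why align_type_value appears in both the forward and backward kernels above.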
-REGISTER_OP_CUDA_KERNEL(linear_interp_v2_grad, - ops::InterpolateV2GradOpCUDAKernel, - ops::InterpolateV2GradOpCUDAKernel); -REGISTER_OP_CUDA_KERNEL(bicubic_interp_v2, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel, - ops::InterpolateOpV2CUDAKernel); -REGISTER_OP_CUDA_KERNEL(bicubic_interp_v2_grad, - ops::InterpolateV2GradOpCUDAKernel, - ops::InterpolateV2GradOpCUDAKernel); diff --git a/paddle/fluid/operators/interpolate_v2_op.h b/paddle/fluid/operators/interpolate_v2_op.h deleted file mode 100644 index f99d3f6c32442..0000000000000 --- a/paddle/fluid/operators/interpolate_v2_op.h +++ /dev/null @@ -1,1618 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. */ - -#pragma once -#include -#include -#include -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/phi/core/hostdevice.h" -#include "paddle/phi/kernels/funcs/math_function.h" - -namespace paddle { -namespace operators { - -template -using EigenTensor = framework::EigenTensor; -using Tensor = framework::Tensor; -using DataLayout = framework::DataLayout; - -inline std::vector get_new_shape( - const std::vector& list_new_shape_tensor) { - // get tensor from - std::vector vec_new_shape; - for (size_t i = 0; i < list_new_shape_tensor.size(); ++i) { - auto tensor = list_new_shape_tensor[i]; - PADDLE_ENFORCE_EQ(tensor->dims(), phi::make_ddim({1}), - platform::errors::InvalidArgument( - "The shape of dimension tensor should be [1]," - "but received d%.", - tensor->dims())); - if (platform::is_gpu_place(tensor->place())) { - framework::Tensor temp; - paddle::framework::TensorCopySync(*tensor, platform::CPUPlace(), &temp); - vec_new_shape.push_back(static_cast(*temp.data())); - } else { - vec_new_shape.push_back(static_cast(*tensor->data())); - } - } - - return vec_new_shape; -} - -template -inline std::vector get_new_data_from_tensor(const Tensor* new_data_tensor) { - std::vector vec_new_data; - auto* new_data = new_data_tensor->data(); - framework::Tensor cpu_starts_tensor; - if (platform::is_gpu_place(new_data_tensor->place())) { - paddle::framework::TensorCopySync(*new_data_tensor, platform::CPUPlace(), - &cpu_starts_tensor); - new_data = cpu_starts_tensor.data(); - } -#ifdef PADDLE_WITH_ASCEND_CL - if (platform::is_npu_place(new_data_tensor->place())) { - paddle::framework::TensorCopySync(*new_data_tensor, platform::CPUPlace(), - &cpu_starts_tensor); - new_data = cpu_starts_tensor.data(); - } -#endif -#ifdef PADDLE_WITH_XPU - if (platform::is_xpu_place(new_data_tensor->place())) { - paddle::framework::TensorCopySync(*new_data_tensor, platform::CPUPlace(), - &cpu_starts_tensor); - new_data = cpu_starts_tensor.data(); - } -#endif - vec_new_data = std::vector(new_data, new_data + new_data_tensor->numel()); - return vec_new_data; -} - -inline void ExtractNCDWH(const framework::DDim& dims, - const DataLayout& data_layout, int* N, int* C, int* D, - int* H, int* W) { - *N = dims[0]; - - if (dims.size() == 3) { - *C = data_layout == DataLayout::kNCHW ? 
dims[1] : dims[2]; - *D = 1; - *H = 1; - *W = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; - } else if (dims.size() == 4) { - *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[3]; - *D = 1; - *H = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; - *W = data_layout == DataLayout::kNCHW ? dims[3] : dims[2]; - } else { - *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[4]; - *D = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; - *H = data_layout == DataLayout::kNCHW ? dims[3] : dims[2]; - *W = data_layout == DataLayout::kNCHW ? dims[4] : dims[3]; - } -} - -template -static void NearestNeighborInterpolate(const Tensor& input, Tensor* output, - const float ratio_h, const float ratio_w, - const int n, const int c, - const int out_h, const int out_w, - const bool align_corners, - const DataLayout& data_layout) { - auto input_t = EigenTensor::From(input); - auto output_t = EigenTensor::From(*output); - for (int k = 0; k < out_h; k++) { // loop for images - int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) - : static_cast(ratio_h * k); - - for (int l = 0; l < out_w; l++) { - int in_l = (align_corners) ? static_cast(ratio_w * l + 0.5) - : static_cast(ratio_w * l); - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - if (data_layout == DataLayout::kNCHW) { - output_t(i, j, k, l) = input_t(i, j, in_k, in_l); - } else { - output_t(i, k, l, j) = input_t(i, in_k, in_l, j); - } - } - } - } - } -} - -template -static void NearestNeighbor3DInterpolate( - const Tensor& input, Tensor* output, const float ratio_d, - const float ratio_h, const float ratio_w, const int n, const int c, - const int out_d, const int out_h, const int out_w, const bool align_corners, - const DataLayout& data_layout) { - auto input_t = EigenTensor::From(input); - auto output_t = EigenTensor::From(*output); - for (int d = 0; d < out_d; d++) { // loop for images - int in_d = (align_corners) ? static_cast(ratio_d * d + 0.5) - : static_cast(ratio_d * d); - for (int k = 0; k < out_h; k++) { - int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) - : static_cast(ratio_h * k); - - for (int l = 0; l < out_w; l++) { - int in_l = (align_corners) ? static_cast(ratio_w * l + 0.5) - : static_cast(ratio_w * l); - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - if (data_layout == DataLayout::kNCHW) { - output_t(i, j, d, k, l) = input_t(i, j, in_d, in_k, in_l); - } else { // NDHWC - output_t(i, d, k, l, j) = input_t(i, in_d, in_k, in_l, j); - } - } - } - } - } - } -} - -template -static void LinearInterpolation(const Tensor& input, Tensor* output, - const float ratio_w, const int in_w, - const int n, const int c, const int out_w, - const bool align_corners, const bool align_mode, - const DataLayout data_layout) { - auto input_t = EigenTensor::From(input); - auto output_t = EigenTensor::From(*output); - bool align_flag = (align_mode == 0 && !align_corners); - - std::vector vx_w, vx_e; - std::vector vd_w, vd_e; - vx_w.reserve(out_w); - vx_e.reserve(out_w); - vd_w.reserve(out_w); - vd_e.reserve(out_w); -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for -#endif - for (int l = 0; l < out_w; l++) { - int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) - : static_cast(ratio_w * l); - x_w = (x_w > 0) ? x_w : 0; // w - int x_e = (x_w < (in_w - 1)) ? (x_w + 1) : x_w; // w_id - - float idx_src_x = ratio_w * (l + 0.5) - 0.5; - idx_src_x = (idx_src_x > 0) ? 
idx_src_x : 0; - float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; // w1lambda - float d_e = 1.f - d_w; // w2lambda - { - vx_w[l] = x_w; - vx_e[l] = x_e; - vd_w[l] = d_w; - vd_e[l] = d_e; - } - } - -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for collapse(3) -#endif - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - for (int l = 0; l < out_w; l++) { - // linear interpolation - T out_t; - if (data_layout == DataLayout::kNCHW) { - out_t = input_t(i, j, vx_w[l]) * vd_e[l] + - input_t(i, j, vx_e[l]) * vd_w[l]; - output_t(i, j, l) = out_t; - } else { - out_t = input_t(i, vx_w[l], j) * vd_e[l] + - input_t(i, vx_e[l], j) * vd_w[l]; - output_t(i, l, j) = out_t; - } - } - } - } -} - -template -static void LinearInterpolationGrad(const Tensor& output_grad, - Tensor* input_grad, const float ratio_w, - const int in_w, const int n, const int c, - const int out_w, const bool align_corners, - const int align_mode, - const DataLayout data_layout) { - auto input_grad_t = EigenTensor::From(*input_grad); - auto output_grad_t = EigenTensor::From(output_grad); - bool align_flag = (align_mode == 0 && !align_corners); - for (int l = 0; l < out_w; l++) { - int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) - : static_cast(ratio_w * l); - x_w = (x_w > 0) ? x_w : 0; // w - int x_e = (x_w < (in_w - 1)) ? (x_w + 1) : x_w; // w_id - - float idx_src_x = ratio_w * (l + 0.5) - 0.5; - idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; - float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; // w1lambda - float d_e = 1.f - d_w; // w2lambda - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - // linear interpolation grad - if (data_layout == DataLayout::kNCHW) { - const T grad = output_grad_t(i, j, l); - input_grad_t(i, j, x_w) += static_cast(grad * d_e); - input_grad_t(i, j, x_e) += static_cast(grad * d_w); - } else { - const T grad = output_grad_t(i, l, j); - input_grad_t(i, x_w, j) += static_cast(grad * d_e); - input_grad_t(i, x_e, j) += static_cast(grad * d_w); - } - } - } - } -} - -template -static void BilinearInterpolation(const Tensor& input, Tensor* output, - const float ratio_h, const float ratio_w, - const int in_h, const int in_w, const int n, - const int c, const int out_h, const int out_w, - const bool align_corners, - const bool align_mode, - const DataLayout data_layout) { - auto input_t = EigenTensor::From(input); - auto output_t = EigenTensor::From(*output); - bool align_flag = (align_mode == 0 && !align_corners); - - std::vector vy_n, vy_s; - std::vector vd_n, vd_s; - vy_n.reserve(out_h); - vy_s.reserve(out_h); - vd_n.reserve(out_h); - vd_s.reserve(out_h); -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for -#endif - for (int k = 0; k < out_h; k++) { - int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) - : static_cast(ratio_h * k); - y_n = (y_n > 0) ? y_n : 0; - int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); - float idx_src_y = ratio_h * (k + 0.5) - 0.5; - idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; - float d_n = align_flag ? 
idx_src_y - y_n : ratio_h * k - y_n; - float d_s = 1.f - d_n; - { - vy_n[k] = y_n; - vy_s[k] = y_s; - vd_n[k] = d_n; - vd_s[k] = d_s; - } - } - - std::vector vx_w, vx_e; - std::vector vd_w, vd_e; - vx_w.reserve(out_w); - vx_e.reserve(out_w); - vd_w.reserve(out_w); - vd_e.reserve(out_w); -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for -#endif - for (int l = 0; l < out_w; l++) { - int x_w = (align_mode == 0 && !align_corners) - ? static_cast(ratio_w * (l + 0.5) - 0.5) - : static_cast(ratio_w * l); - x_w = (x_w > 0) ? x_w : 0; - int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); - float idx_src_x = ratio_w * (l + 0.5) - 0.5; - idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; - float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; - float d_e = 1.f - d_w; - { - vx_w[l] = x_w; - vx_e[l] = x_e; - vd_w[l] = d_w; - vd_e[l] = d_e; - } - } - -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for collapse(4) -#endif - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - for (int k = 0; k < out_h; k++) { // loop for images - for (int l = 0; l < out_w; l++) { - // bilinear interpolation - T out_t; - if (data_layout == DataLayout::kNCHW) { - out_t = input_t(i, j, vy_n[k], vx_w[l]) * vd_s[k] * vd_e[l] + - input_t(i, j, vy_s[k], vx_w[l]) * vd_n[k] * vd_e[l] + - input_t(i, j, vy_n[k], vx_e[l]) * vd_s[k] * vd_w[l] + - input_t(i, j, vy_s[k], vx_e[l]) * vd_n[k] * vd_w[l]; - output_t(i, j, k, l) = out_t; - - } else { - out_t = input_t(i, vy_n[k], vx_w[l], j) * vd_s[k] * vd_e[l] + - input_t(i, vy_s[k], vx_w[l], j) * vd_n[k] * vd_e[l] + - input_t(i, vy_n[k], vx_e[l], j) * vd_s[k] * vd_w[l] + - input_t(i, vy_s[k], vx_e[l], j) * vd_n[k] * vd_w[l]; - output_t(i, k, l, j) = out_t; - } - } - } - } - } -} - -template -static void TrilinearInterpolation( - const Tensor& input, Tensor* output, const float ratio_d, - const float ratio_h, const float ratio_w, const int in_d, const int in_h, - const int in_w, const int n, const int c, const int out_d, const int out_h, - const int out_w, const bool align_corners, const bool align_mode, - const DataLayout& data_layout) { - auto input_t = EigenTensor::From(input); - auto output_t = EigenTensor::From(*output); - bool align_flag = (align_mode == 0 && !align_corners); - - std::vector vt_f, vt_b; - std::vector vd_f, vd_b; - vt_f.reserve(out_d); - vt_b.reserve(out_d); - vd_f.reserve(out_d); - vd_b.reserve(out_d); -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for -#endif - for (int j = 0; j < out_d; j++) { - int t_f = align_flag ? static_cast(ratio_d * (j + 0.5) - 0.5) - : static_cast(ratio_d * j); - t_f = (t_f > 0) ? t_f : 0; - int t_b = (t_f + 1) < (in_d - 1) ? (t_f + 1) : (in_d - 1); - float idx_src_t = ratio_d * (j + 0.5) - 0.5; - idx_src_t = (idx_src_t > 0) ? idx_src_t : 0; - float d_f = align_flag ? idx_src_t - t_f : ratio_d * j - t_f; - float d_b = 1.f - d_f; - { - vt_f[j] = t_f; - vt_b[j] = t_b; - vd_f[j] = d_f; - vd_b[j] = d_b; - } - } - - std::vector vy_n, vy_s; - std::vector vd_n, vd_s; - vy_n.reserve(out_h); - vy_s.reserve(out_h); - vd_n.reserve(out_h); - vd_s.reserve(out_h); -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for -#endif - for (int k = 0; k < out_h; k++) { - int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) - : static_cast(ratio_h * k); - y_n = (y_n > 0) ? y_n : 0; - int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); - float idx_src_y = ratio_h * (k + 0.5) - 0.5; - idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; - float d_n = align_flag ? 
idx_src_y - y_n : ratio_h * k - y_n; - float d_s = 1.f - d_n; - { - vy_n[k] = y_n; - vy_s[k] = y_s; - vd_n[k] = d_n; - vd_s[k] = d_s; - } - } - - std::vector vx_w, vx_e; - std::vector vd_w, vd_e; - vx_w.reserve(out_w); - vx_e.reserve(out_w); - vd_w.reserve(out_w); - vd_e.reserve(out_w); -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for -#endif - for (int l = 0; l < out_w; l++) { - int x_w = (align_mode == 0 && !align_corners) - ? static_cast(ratio_w * (l + 0.5) - 0.5) - : static_cast(ratio_w * l); - x_w = (x_w > 0) ? x_w : 0; - int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); - float idx_src_x = ratio_w * (l + 0.5) - 0.5; - idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; - float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; - float d_e = 1.f - d_w; - { - vx_w[l] = x_w; - vx_e[l] = x_e; - vd_w[l] = d_w; - vd_e[l] = d_e; - } - } - -#ifdef PADDLE_WITH_MKLML -#pragma omp parallel for collapse(5) -#endif - for (int b = 0; b < n; b++) { // loop for batches - for (int i = 0; i < c; i++) { // loop for channels - for (int j = 0; j < out_d; j++) { // loop for D, H, W - for (int k = 0; k < out_h; k++) { - for (int l = 0; l < out_w; l++) { - // trilinear interpolation - if (data_layout == DataLayout::kNCHW) { - T out_t = input_t(b, i, vt_f[j], vy_n[k], vx_w[l]) * vd_b[j] * - vd_s[k] * vd_e[l] + - input_t(b, i, vt_f[j], vy_n[k], vx_e[l]) * vd_b[j] * - vd_s[k] * vd_w[l] + - input_t(b, i, vt_f[j], vy_s[k], vx_w[l]) * vd_b[j] * - vd_n[k] * vd_e[l] + - input_t(b, i, vt_f[j], vy_s[k], vx_e[l]) * vd_b[j] * - vd_n[k] * vd_w[l] + - input_t(b, i, vt_b[j], vy_n[k], vx_w[l]) * vd_f[j] * - vd_s[k] * vd_e[l] + - input_t(b, i, vt_b[j], vy_n[k], vx_e[l]) * vd_f[j] * - vd_s[k] * vd_w[l] + - input_t(b, i, vt_b[j], vy_s[k], vx_w[l]) * vd_f[j] * - vd_n[k] * vd_e[l] + - input_t(b, i, vt_b[j], vy_s[k], vx_e[l]) * vd_f[j] * - vd_n[k] * vd_w[l]; - output_t(b, i, j, k, l) = out_t; - } else { - T out_t = input_t(b, vt_f[j], vy_n[k], vx_w[l], i) * vd_b[j] * - vd_s[k] * vd_e[l] + - input_t(b, vt_f[j], vy_n[k], vx_e[l], i) * vd_b[j] * - vd_s[k] * vd_w[l] + - input_t(b, vt_f[j], vy_s[k], vx_w[l], i) * vd_b[j] * - vd_n[k] * vd_e[l] + - input_t(b, vt_f[j], vy_s[k], vx_e[l], i) * vd_b[j] * - vd_n[k] * vd_w[l] + - input_t(b, vt_b[j], vy_n[k], vx_w[l], i) * vd_f[j] * - vd_s[k] * vd_e[l] + - input_t(b, vt_b[j], vy_n[k], vx_e[l], i) * vd_f[j] * - vd_s[k] * vd_w[l] + - input_t(b, vt_b[j], vy_s[k], vx_w[l], i) * vd_f[j] * - vd_n[k] * vd_e[l] + - input_t(b, vt_b[j], vy_s[k], vx_e[l], i) * vd_f[j] * - vd_n[k] * vd_w[l]; - output_t(b, j, k, l, i) = out_t; - } - } - } - } - } - } -} - -template -HOSTDEVICE inline T cubic_convolution1(T x, T A) { - return ((A + 2) * x - (A + 3)) * x * x + 1; -} - -template -HOSTDEVICE inline T cubic_convolution2(T x, T A) { - return ((A * x - 5 * A) * x + 8 * A) * x - 4 * A; -} - -template -HOSTDEVICE inline void get_cubic_upsample_coefficients(T coeffs[4], T t) { - T A = -0.75; - - T x1 = t; - coeffs[0] = cubic_convolution2(x1 + 1.0, A); - coeffs[1] = cubic_convolution1(x1, A); - - // opposite coefficients - T x2 = 1.0 - t; - coeffs[2] = cubic_convolution1(x2, A); - coeffs[3] = cubic_convolution2(x2 + 1.0, A); -} - -template -static inline T cubic_interp(T x0, T x1, T x2, T x3, T t) { - T coeffs[4]; - get_cubic_upsample_coefficients(coeffs, t); - - return x0 * coeffs[0] + x1 * coeffs[1] + x2 * coeffs[2] + x3 * coeffs[3]; -} - -template -static void BicubicInterpolation(const Tensor& input, Tensor* output, - const float ratio_h, const float ratio_w, - const int in_h, const int in_w, const int 
n, - const int c, const int out_h, const int out_w, - const bool align_corners, - const DataLayout data_layout) { - auto input_t = EigenTensor::From(input); - auto output_t = EigenTensor::From(*output); - - for (int k = 0; k < out_h; k++) { // loop for images - T y_n = align_corners ? static_cast(ratio_h * k) - : static_cast(ratio_h * (k + 0.5) - 0.5); - int input_y = floorf(y_n); - const T y_t = y_n - input_y; - - for (int l = 0; l < out_w; l++) { - T x_n = align_corners ? static_cast(ratio_w * l) - : static_cast(ratio_w * (l + 0.5) - 0.5); - int input_x = floorf(x_n); - const T x_t = x_n - input_x; - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - T coefficients[4]; - // interp 4 times in x direction - for (int ii = 0; ii < 4; ii++) { - int access_y = std::max(std::min(input_y - 1 + ii, in_h - 1), - static_cast(0)); - int access_x_0 = - std::max(std::min(input_x - 1, in_w - 1), static_cast(0)); - int access_x_1 = - std::max(std::min(input_x + 0, in_w - 1), static_cast(0)); - int access_x_2 = - std::max(std::min(input_x + 1, in_w - 1), static_cast(0)); - int access_x_3 = - std::max(std::min(input_x + 2, in_w - 1), static_cast(0)); - if (data_layout == DataLayout::kNCHW) { - coefficients[ii] = - cubic_interp(input_t(i, j, access_y, access_x_0), - input_t(i, j, access_y, access_x_1), - input_t(i, j, access_y, access_x_2), - input_t(i, j, access_y, access_x_3), x_t); - } else { - coefficients[ii] = - cubic_interp(input_t(i, access_y, access_x_0, j), - input_t(i, access_y, access_x_1, j), - input_t(i, access_y, access_x_2, j), - input_t(i, access_y, access_x_3, j), x_t); - } - } - - // interp y direction - if (data_layout == DataLayout::kNCHW) { - output_t(i, j, k, l) = - cubic_interp(coefficients[0], coefficients[1], - coefficients[2], coefficients[3], y_t); - } else { - output_t(i, k, l, j) = - cubic_interp(coefficients[0], coefficients[1], - coefficients[2], coefficients[3], y_t); - } - } - } - } - } -} - -template -static void NearestNeighborInterpolateGrad( - const Tensor& output_grad, Tensor* input_grad, const float ratio_h, - const float ratio_w, const int n, const int c, const int out_h, - const int out_w, const bool align_corners, const DataLayout data_layout) { - auto input_grad_t = EigenTensor::From(*input_grad); - auto output_grad_t = EigenTensor::From(output_grad); - - for (int k = 0; k < out_h; k++) { // loop for images - int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) - : static_cast(ratio_h * k); - - for (int l = 0; l < out_w; l++) { - int in_l = (align_corners) ? static_cast(ratio_w * l + 0.5) - : static_cast(ratio_w * l); - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - if (data_layout == DataLayout::kNCHW) { - input_grad_t(i, j, in_k, in_l) += output_grad_t(i, j, k, l); - } else { - input_grad_t(i, in_k, in_l, j) += output_grad_t(i, k, l, j); - } - } - } - } - } -} - -template -static void NearestNeighbor3DInterpolateGrad( - const Tensor& output_grad, Tensor* input_grad, const float ratio_d, - const float ratio_h, const float ratio_w, const int n, const int c, - const int out_d, const int out_h, const int out_w, const bool align_corners, - const DataLayout data_layout) { - auto input_grad_t = EigenTensor::From(*input_grad); - auto output_grad_t = EigenTensor::From(output_grad); - - for (int d = 0; d < out_d; d++) { - int in_d = (align_corners) ? 
static_cast(ratio_d * d + 0.5) - : static_cast(ratio_d * d); - for (int k = 0; k < out_h; k++) { // loop for images - int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) - : static_cast(ratio_h * k); - - for (int l = 0; l < out_w; l++) { - int in_l = (align_corners) ? static_cast(ratio_w * l + 0.5) - : static_cast(ratio_w * l); - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - if (data_layout == DataLayout::kNCHW) { - input_grad_t(i, j, in_d, in_k, in_l) += - output_grad_t(i, j, d, k, l); - } else { - input_grad_t(i, in_d, in_k, in_l, j) += - output_grad_t(i, d, k, l, j); - } - } - } - } - } - } -} - -template -static void BilinearInterpolationGrad( - const Tensor& output_grad, Tensor* input_grad, const float ratio_h, - const float ratio_w, const int in_h, const int in_w, const int n, - const int c, const int out_h, const int out_w, const bool align_corners, - const int align_mode, const DataLayout data_layout) { - auto input_grad_t = EigenTensor::From(*input_grad); - auto output_grad_t = EigenTensor::From(output_grad); - bool align_flag = (align_mode == 0 && !align_corners); - for (int k = 0; k < out_h; k++) { // loop for images - int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) - : static_cast(ratio_h * k); - y_n = (y_n > 0) ? y_n : 0; - int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); - float idx_src_y = ratio_h * (k + 0.5) - 0.5; - idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; - float d_n = align_flag ? idx_src_y - y_n : ratio_h * k - y_n; - float d_s = 1.f - d_n; - - for (int l = 0; l < out_w; l++) { - int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) - : static_cast(ratio_w * l); - x_w = (x_w > 0) ? x_w : 0; - int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); - float idx_src_x = ratio_w * (l + 0.5) - 0.5; - idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; - float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; - float d_e = 1.f - d_w; - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - // bilinear interpolation grad - if (data_layout == DataLayout::kNCHW) { - const T grad = output_grad_t(i, j, k, l); - input_grad_t(i, j, y_n, x_w) += static_cast(grad * d_s * d_e); - input_grad_t(i, j, y_s, x_w) += static_cast(grad * d_n * d_e); - input_grad_t(i, j, y_n, x_e) += static_cast(grad * d_s * d_w); - input_grad_t(i, j, y_s, x_e) += static_cast(grad * d_n * d_w); - } else { - const T grad = output_grad_t(i, k, l, j); - input_grad_t(i, y_n, x_w, j) += static_cast(grad * d_s * d_e); - input_grad_t(i, y_s, x_w, j) += static_cast(grad * d_n * d_e); - input_grad_t(i, y_n, x_e, j) += static_cast(grad * d_s * d_w); - input_grad_t(i, y_s, x_e, j) += static_cast(grad * d_n * d_w); - } - } - } - } - } -} - -template -static void TrilinearInterpolationGrad( - const Tensor& output_grad, Tensor* input_grad, const float ratio_d, - const float ratio_h, const float ratio_w, const int in_d, const int in_h, - const int in_w, const int n, const int c, const int out_d, const int out_h, - const int out_w, const bool align_corners, const int align_mode, - const DataLayout data_layout) { - auto input_grad_t = EigenTensor::From(*input_grad); - auto output_grad_t = EigenTensor::From(output_grad); - bool align_flag = (align_mode == 0 && !align_corners); - for (int j = 0; j < out_d; j++) { // loop for D - int t_f = align_flag ? static_cast(ratio_d * (j + 0.5) - 0.5) - : static_cast(ratio_d * j); - t_f = (t_f > 0) ? 
t_f : 0; - int t_b = (t_f + 1) < (in_d - 1) ? (t_f + 1) : (in_d - 1); - float idx_src_t = ratio_d * (j + 0.5) - 0.5; - idx_src_t = (idx_src_t > 0) ? idx_src_t : 0; - float d_f = align_flag ? idx_src_t - t_f : ratio_d * j - t_f; - float d_b = 1.f - d_f; - - for (int k = 0; k < out_h; k++) { // loop for H - int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) - : static_cast(ratio_h * k); - y_n = (y_n > 0) ? y_n : 0; - int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); - float idx_src_y = ratio_h * (k + 0.5) - 0.5; - idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; - float d_n = align_flag ? idx_src_y - y_n : ratio_h * k - y_n; - float d_s = 1.f - d_n; - - for (int l = 0; l < out_w; l++) { // loop for W - int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) - : static_cast(ratio_w * l); - x_w = (x_w > 0) ? x_w : 0; - int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); - float idx_src_x = ratio_w * (l + 0.5) - 0.5; - idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; - float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; - float d_e = 1.f - d_w; - - for (int b = 0; b < n; b++) { // loop for batches - for (int i = 0; i < c; i++) { // loop for channels - // trilinear interpolation grad - if (data_layout == DataLayout::kNCHW) { - const T grad = output_grad_t(b, i, j, k, l); - input_grad_t(b, i, t_f, y_n, x_w) += - static_cast(grad * d_b * d_s * d_e); - input_grad_t(b, i, t_f, y_n, x_e) += - static_cast(grad * d_b * d_s * d_w); - input_grad_t(b, i, t_f, y_s, x_w) += - static_cast(grad * d_b * d_n * d_e); - input_grad_t(b, i, t_f, y_s, x_e) += - static_cast(grad * d_b * d_n * d_w); - input_grad_t(b, i, t_b, y_n, x_w) += - static_cast(grad * d_f * d_s * d_e); - input_grad_t(b, i, t_b, y_n, x_e) += - static_cast(grad * d_f * d_s * d_w); - input_grad_t(b, i, t_b, y_s, x_w) += - static_cast(grad * d_f * d_n * d_e); - input_grad_t(b, i, t_b, y_s, x_e) += - static_cast(grad * d_f * d_n * d_w); - } else { - const T grad = output_grad_t(b, j, k, l, i); - input_grad_t(b, t_f, y_n, x_w, i) += - static_cast(grad * d_b * d_s * d_e); - input_grad_t(b, t_f, y_n, x_e, i) += - static_cast(grad * d_b * d_s * d_w); - input_grad_t(b, t_f, y_s, x_w, i) += - static_cast(grad * d_b * d_n * d_e); - input_grad_t(b, t_f, y_s, x_e, i) += - static_cast(grad * d_b * d_n * d_w); - input_grad_t(b, t_b, y_n, x_w, i) += - static_cast(grad * d_f * d_s * d_e); - input_grad_t(b, t_b, y_n, x_e, i) += - static_cast(grad * d_f * d_s * d_w); - input_grad_t(b, t_b, y_s, x_w, i) += - static_cast(grad * d_f * d_n * d_e); - input_grad_t(b, t_b, y_s, x_e, i) += - static_cast(grad * d_f * d_n * d_w); - } - } - } - } - } - } -} - -template -static void BicubicInterpolationGrad(const Tensor& output_grad, - Tensor* input_grad, const float ratio_h, - const float ratio_w, const int in_h, - const int in_w, const int n, const int c, - const int out_h, const int out_w, - const bool align_corners, - const DataLayout data_layout) { - auto input_grad_t = EigenTensor::From(*input_grad); - auto output_grad_t = EigenTensor::From(output_grad); - - for (int k = 0; k < out_h; k++) { // loop for images - T y_n = align_corners ? static_cast(ratio_h * k) - : static_cast(ratio_h * (k + 0.5) - 0.5); - int input_y = floorf(y_n); - T y_t = y_n - input_y; - - for (int l = 0; l < out_w; l++) { - T x_n = align_corners ? 
static_cast(ratio_w * l) - : static_cast(ratio_w * (l + 0.5) - 0.5); - int input_x = floorf(x_n); - T x_t = x_n - input_x; - - T x_coeffs[4]; - T y_coeffs[4]; - - get_cubic_upsample_coefficients(x_coeffs, x_t); - get_cubic_upsample_coefficients(y_coeffs, y_t); - - for (int i = 0; i < n; i++) { // loop for batches - for (int j = 0; j < c; j++) { // loop for channels - // bicubic interpolation grad - for (int ii = 0; ii < 4; ii++) { - for (int jj = 0; jj < 4; jj++) { - int access_x = std::max(std::min(input_x - 1 + ii, in_w - 1), - static_cast(0)); - int access_y = std::max(std::min(input_y - 1 + jj, in_h - 1), - static_cast(0)); - if (data_layout == DataLayout::kNCHW) { - T grad = output_grad_t(i, j, k, l); - input_grad_t(i, j, access_y, access_x) += - grad * y_coeffs[jj] * x_coeffs[ii]; - } else { - T grad = output_grad_t(i, k, l, j); - input_grad_t(i, access_y, access_x, j) += - grad * y_coeffs[jj] * x_coeffs[ii]; - } - } - } - } - } - } - } -} - -template -static void Interpolate1DCPUFwd(const framework::ExecutionContext& ctx, - const Tensor& input, Tensor* output) { - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_w = ctx.Attr("out_w"); - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - float scale_w = -1.; - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_w = new_size[0]; - } else { - // float scale_w = -1; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - scale_w = scale_data[0]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } else { - if (scale.size() > 0) { - scale_w = scale[0]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } - } - if (scale_w > 0.) { - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); - out_w = out_size_data[0]; - } - } - PADDLE_ENFORCE_GT(out_w, 0, platform::errors::InvalidArgument( - "out_w in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - framework::DDim dim_out; - if (data_layout == DataLayout::kNCHW) { - dim_out = {n, c, out_w}; - } else { - dim_out = {n, out_w, c}; - } - output->mutable_data(dim_out, ctx.GetPlace()); - - if (in_w == out_w) { - framework::TensorCopy(input, ctx.GetPlace(), output); - return; - } - - float ratio_w = 0.f; - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - if ("linear" == interp_method) { - LinearInterpolation(input, output, ratio_w, in_w, n, c, out_w, - align_corners, align_mode, data_layout); - } -} - -template -static void Interpolate2DCPUFwd(const framework::ExecutionContext& ctx, - const Tensor& input, Tensor* output) { - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - float scale_h = -1; - float scale_w = -1; - - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_h = new_size[0]; - out_w = new_size[1]; - } else { - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_h = scale_data[0]; - scale_w = scale_data[1]; - } else { - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } else { - if (scale.size() > 1) { - scale_h = scale[0]; - scale_w = scale[1]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } - } - if (scale_h > 0. && scale_w > 0.) { - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); - out_h = out_size_data[0]; - out_w = out_size_data[1]; - } - } - PADDLE_ENFORCE_GT(out_h, 0, platform::errors::InvalidArgument( - "out_h in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - PADDLE_ENFORCE_GT(out_w, 0, platform::errors::InvalidArgument( - "out_w in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - framework::DDim dim_out; - if (data_layout == DataLayout::kNCHW) { - dim_out = {n, c, out_h, out_w}; - } else { - dim_out = {n, out_h, out_w, c}; - } - output->mutable_data(dim_out, ctx.GetPlace()); - - if (in_h == out_h && in_w == out_w) { - framework::TensorCopy(input, ctx.GetPlace(), output); - return; - } - - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? 
static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - if ("bilinear" == interp_method) { - BilinearInterpolation(input, output, ratio_h, ratio_w, in_h, in_w, n, c, - out_h, out_w, align_corners, align_mode, - data_layout); - } else if ("nearest" == interp_method) { - NearestNeighborInterpolate(input, output, ratio_h, ratio_w, n, c, out_h, - out_w, align_corners, data_layout); - } else if ("bicubic" == interp_method) { - BicubicInterpolation(input, output, ratio_h, ratio_w, in_h, in_w, n, c, - out_h, out_w, align_corners, data_layout); - } -} - -template -static void Interpolate3DCPUFwd(const framework::ExecutionContext& ctx, - const Tensor& input, Tensor* output) { - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_d = ctx.Attr("out_d"); - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - - float scale_d = -1; - float scale_h = -1; - float scale_w = -1; - - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_d = new_size[0]; - out_h = new_size[1]; - out_w = new_size[2]; - } else { - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_d = scale_data[0]; - scale_h = scale_data[1]; - scale_w = scale_data[2]; - } else { - scale_d = scale_data[0]; - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } else { - if (scale.size() > 1) { - scale_d = scale[0]; - scale_h = scale[1]; - scale_w = scale[2]; - - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } - } - if (scale_w > 0. && scale_h > 0. && scale_d > 0.) 
{ - out_d = static_cast(in_d * scale_d); - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); - out_d = out_size_data[0]; - out_h = out_size_data[1]; - out_w = out_size_data[2]; - } - } - PADDLE_ENFORCE_GT(out_d, 0, platform::errors::InvalidArgument( - "out_d in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - PADDLE_ENFORCE_GT(out_h, 0, platform::errors::InvalidArgument( - "out_h in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - PADDLE_ENFORCE_GT(out_w, 0, platform::errors::InvalidArgument( - "out_w in Attr(out_shape) of Op(interpolate) " - "should be greater than 0.")); - - framework::DDim dim_out; - if (data_layout == DataLayout::kNCHW) { - dim_out = {n, c, out_d, out_h, out_w}; - } else { - dim_out = {n, out_d, out_h, out_w, c}; - } - - output->mutable_data(dim_out, ctx.GetPlace()); - - if (in_d == out_d && in_h == out_h && in_w == out_w) { - framework::TensorCopy(input, ctx.GetPlace(), output); - return; - } - - float ratio_d = 0.f; - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_d > 1) { - float new_scale_d = 0.f; - new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) - : static_cast(in_d) / out_d; - ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) - : static_cast(new_scale_d); - } - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - if ("trilinear" == interp_method) { - TrilinearInterpolation(input, output, ratio_d, ratio_h, ratio_w, in_d, - in_h, in_w, n, c, out_d, out_h, out_w, - align_corners, align_mode, data_layout); - } else if ("nearest" == interp_method) { - NearestNeighbor3DInterpolate(input, output, ratio_d, ratio_h, ratio_w, n, - c, out_d, out_h, out_w, align_corners, - data_layout); - } -} - -template -static void Interpolate1DCPUBwd(const framework::ExecutionContext& ctx, - Tensor* input_grad, const Tensor& output_grad) { - auto* input = ctx.Input("X"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_w = ctx.Attr("out_w"); - float scale_w = -1.0; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - scale_w = scale_data[0]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } else { - if (scale.size() > 0) { - scale_w = scale[0]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - } - } - if (scale_w > 0.) { - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); - out_w = out_size_data[0]; - } - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_w = new_size[0]; - } - - framework::DDim dim_grad; - if (data_layout == DataLayout::kNCHW) { - dim_grad = {n, c, in_w}; - } else { - dim_grad = {n, in_w, c}; - } - input_grad->mutable_data(dim_grad, ctx.GetPlace()); - - auto& device_ctx = ctx.template device_context(); - phi::funcs::SetConstant zero; - zero(device_ctx, input_grad, static_cast(0.0)); - - if (in_w == out_w) { - framework::TensorCopy(output_grad, ctx.GetPlace(), input_grad); - return; - } - - float ratio_w = 0.f; - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - if ("linear" == interp_method) { - LinearInterpolationGrad(output_grad, input_grad, ratio_w, in_w, n, c, - out_w, align_corners, align_mode, data_layout); - } -} - -template -static void Interpolate2DCPUBwd(const framework::ExecutionContext& ctx, - Tensor* input_grad, const Tensor& output_grad) { - auto* input = ctx.Input("X"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - float scale_h = -1; - float scale_w = -1; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_h = scale_data[0]; - scale_w = scale_data[1]; - } else { - scale_w = scale_data[0]; - scale_h = scale_data[0]; - } - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } else { - if (scale.size() > 1) { - scale_h = scale[0]; - scale_w = scale[1]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - } - } - if (scale_h > 0. && scale_w > 0.) { - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); - out_h = out_size_data[0]; - out_w = out_size_data[1]; - } - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_h = new_size[0]; - out_w = new_size[1]; - } - - framework::DDim dim_grad; - if (data_layout == DataLayout::kNCHW) { - dim_grad = {n, c, in_h, in_w}; - } else { - dim_grad = {n, in_h, in_w, c}; - } - input_grad->mutable_data(dim_grad, ctx.GetPlace()); - - auto& device_ctx = ctx.template device_context(); - phi::funcs::SetConstant zero; - zero(device_ctx, input_grad, static_cast(0.0)); - - if (in_h == out_h && in_w == out_w) { - framework::TensorCopy(output_grad, ctx.GetPlace(), input_grad); - return; - } - - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. 
/ scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - if ("bilinear" == interp_method) { - BilinearInterpolationGrad(output_grad, input_grad, ratio_h, ratio_w, - in_h, in_w, n, c, out_h, out_w, align_corners, - align_mode, data_layout); - } else if ("nearest" == interp_method) { - NearestNeighborInterpolateGrad(output_grad, input_grad, ratio_h, ratio_w, - n, c, out_h, out_w, align_corners, - data_layout); - } else if ("bicubic" == interp_method) { - BicubicInterpolationGrad(output_grad, input_grad, ratio_h, ratio_w, in_h, - in_w, n, c, out_h, out_w, align_corners, - data_layout); - } -} - -template -static void Interpolate3DCPUBwd(const framework::ExecutionContext& ctx, - Tensor* input_grad, const Tensor output_grad) { - auto* input = ctx.Input("X"); - const std::string data_layout_str = ctx.Attr("data_layout"); - const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); - int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); - - auto interp_method = ctx.Attr("interp_method"); - bool align_corners = ctx.Attr("align_corners"); - int align_mode = ctx.Attr("align_mode"); - - int out_d = ctx.Attr("out_d"); - int out_h = ctx.Attr("out_h"); - int out_w = ctx.Attr("out_w"); - float scale_d = -1; - float scale_h = -1; - float scale_w = -1; - auto scale_tensor = ctx.Input("Scale"); - auto scale = ctx.Attr>("scale"); - if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); - if (scale_data.size() > 1) { - scale_d = scale_data[0]; - scale_h = scale_data[1]; - scale_w = scale_data[2]; - } else { - scale_d = scale_data[0]; - scale_h = scale_data[0]; - scale_w = scale_data[0]; - } - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in input 'Scale' Tensor of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } else { - if (scale.size() > 1) { - scale_d = scale[0]; - scale_h = scale[1]; - scale_w = scale[2]; - PADDLE_ENFORCE_EQ( - scale_w > 0, true, - platform::errors::InvalidArgument( - "The scale_w in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_w)); - PADDLE_ENFORCE_EQ( - scale_h > 0, true, - platform::errors::InvalidArgument( - "The scale_h in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_h)); - PADDLE_ENFORCE_EQ( - scale_d > 0, true, - platform::errors::InvalidArgument( - "The scale_d in Attr(scale) of Operator(interpolate) " - "should be greater than 0, but received value is %d.", - scale_d)); - } - } - if (scale_d > 0. && scale_h > 0. && scale_w > 0.) 
{ - out_d = static_cast(in_d * scale_d); - out_h = static_cast(in_h * scale_h); - out_w = static_cast(in_w * scale_w); - } - auto out_size = ctx.Input("OutSize"); - if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); - out_d = out_size_data[0]; - out_h = out_size_data[1]; - out_w = out_size_data[2]; - } - auto list_new_size_tensor = ctx.MultiInput("SizeTensor"); - if (list_new_size_tensor.size() > 0) { - // have size tensor - auto new_size = get_new_shape(list_new_size_tensor); - out_d = new_size[0]; - out_h = new_size[1]; - out_w = new_size[2]; - } - - framework::DDim dim_grad; - if (data_layout == DataLayout::kNCHW) { - dim_grad = {n, c, in_d, in_h, in_w}; - } else { - dim_grad = {n, in_d, in_h, in_w, c}; - } - input_grad->mutable_data(dim_grad, ctx.GetPlace()); - auto& device_ctx = ctx.template device_context(); - phi::funcs::SetConstant zero; - zero(device_ctx, input_grad, static_cast(0.0)); - - if (in_d == out_d && in_h == out_h && in_w == out_w) { - framework::TensorCopy(output_grad, ctx.GetPlace(), input_grad); - return; - } - - float ratio_d = 0.f; - float ratio_h = 0.f; - float ratio_w = 0.f; - if (out_d > 1) { - float new_scale_d = 0.f; - new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) - : static_cast(in_d) / out_d; - ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) - : static_cast(new_scale_d); - } - if (out_h > 1) { - float new_scale_h = 0.f; - new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) - : static_cast(in_h) / out_h; - ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) - : static_cast(new_scale_h); - } - if (out_w > 1) { - float new_scale_w = 0.f; - new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) - : static_cast(in_w) / out_w; - ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) - : static_cast(new_scale_w); - } - - if ("trilinear" == interp_method) { - TrilinearInterpolationGrad( - output_grad, input_grad, ratio_d, ratio_h, ratio_w, in_d, in_h, in_w, n, - c, out_d, out_h, out_w, align_corners, align_mode, data_layout); - } else if ("nearest" == interp_method) { - NearestNeighbor3DInterpolateGrad(output_grad, input_grad, ratio_d, - ratio_h, ratio_w, n, c, out_d, out_h, - out_w, align_corners, data_layout); - } -} - -template -class InterpolateV2Kernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* input = ctx.Input("X"); - auto* output = ctx.Output("Out"); - - auto input_dims = input->dims(); - if (input_dims.size() == 3) { // 1D interpolation - Interpolate1DCPUFwd(ctx, *input, output); - } else if (input_dims.size() == 4) { // 2D interpolation - Interpolate2DCPUFwd(ctx, *input, output); - } else if (input_dims.size() == 5) { // 3D interpolation - Interpolate3DCPUFwd(ctx, *input, output); - } - } -}; - -template -class InterpolateV2GradKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& ctx) const override { - auto* input_grad = ctx.Output(framework::GradVarName("X")); - auto* output_grad = ctx.Input(framework::GradVarName("Out")); - - auto output_grad_dims = output_grad->dims(); - if (output_grad_dims.size() == 3) { // 1D interpolation grad - Interpolate1DCPUBwd(ctx, input_grad, *output_grad); - } else if (output_grad_dims.size() == 4) { // 2D interpolation grad - Interpolate2DCPUBwd(ctx, input_grad, *output_grad); - } else if (output_grad_dims.size() == 5) { // 3D interpolation grad - Interpolate3DCPUBwd(ctx, input_grad, *output_grad); - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/interpolate_v2_op_npu.cc b/paddle/fluid/operators/interpolate_v2_op_npu.cc index bf29c2aabb801..615b5ea142b58 100644 --- a/paddle/fluid/operators/interpolate_v2_op_npu.cc +++ b/paddle/fluid/operators/interpolate_v2_op_npu.cc @@ -12,9 +12,11 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/interpolate_v2_op.h" #include "paddle/fluid/platform/device/npu/npu_op_runner.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/phi/kernels/funcs/interpolate_function.h" + namespace paddle { namespace operators { @@ -401,7 +403,8 @@ class InterpolateV2NPUKernel : public framework::OpKernel { const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); int n, c, in_d, in_h, in_w; - ExtractNCDWH(input_dims, data_layout, &n, &c, &in_d, &in_h, &in_w); + phi::funcs::ExtractNCDWH(input_dims, data_layout, &n, &c, &in_d, &in_h, + &in_w); auto interp_method = ctx.Attr("interp_method"); bool align_corners = ctx.Attr("align_corners"); @@ -431,14 +434,15 @@ class InterpolateV2NPUKernel : public framework::OpKernel { out_w = output_w[0]; } else if (ctx.HasInput("OutSize")) { auto out_size = ctx.Input("OutSize"); - auto out_size_data = get_new_data_from_tensor(out_size); + auto out_size_data = phi::funcs::get_new_data_from_tensor(out_size); out_h = out_size_data[0]; out_w = out_size_data[1]; } else { auto scale_tensor = ctx.Input("Scale"); auto scale = ctx.Attr>("scale"); if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); + auto scale_data = + phi::funcs::get_new_data_from_tensor(scale_tensor); if (scale_data.size() > 1) { scale_h = scale_data[0]; scale_w = scale_data[1]; @@ -538,7 +542,8 @@ class InterpolateV2NPUGradKernel : public framework::OpKernel { const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + phi::funcs::ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, + &in_w); auto interp_method = ctx.Attr("interp_method"); bool align_corners = ctx.Attr("align_corners"); @@ -567,14 +572,15 @@ class InterpolateV2NPUGradKernel : public framework::OpKernel { out_w = output_w[0]; } else if (ctx.HasInput("OutSize")) { auto out_size = ctx.Input("OutSize"); - auto out_size_data = get_new_data_from_tensor(out_size); + auto out_size_data = phi::funcs::get_new_data_from_tensor(out_size); out_h = out_size_data[0]; out_w = out_size_data[1]; } else { auto scale_tensor = ctx.Input("Scale"); auto scale = ctx.Attr>("scale"); if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); + auto scale_data = + phi::funcs::get_new_data_from_tensor(scale_tensor); if (scale_data.size() > 1) { scale_h = scale_data[0]; scale_w = scale_data[1]; diff --git a/paddle/fluid/operators/interpolate_v2_op_xpu.cc b/paddle/fluid/operators/interpolate_v2_op_xpu.cc index 850dbe025b9cb..9cbfc95158348 100644 --- a/paddle/fluid/operators/interpolate_v2_op_xpu.cc +++ b/paddle/fluid/operators/interpolate_v2_op_xpu.cc @@ -14,8 +14,7 @@ #include #include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/interpolate_v2_op.h" - +#include "paddle/phi/kernels/funcs/interpolate_function.h" #ifdef PADDLE_WITH_XPU namespace paddle { @@ -57,7 +56,8 @@ class InterpolateV2XPUKernel : public framework::OpKernel { const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); int n, c, in_d, in_h, in_w; - ExtractNCDWH(input_dims, data_layout, &n, &c, &in_d, &in_h, &in_w); + phi::funcs::ExtractNCDWH(input_dims, data_layout, &n, &c, &in_d, &in_h, + &in_w); auto interp_method = ctx.Attr("interp_method"); bool align_corners = ctx.Attr("align_corners"); @@ -78,7 +78,8 @@ class InterpolateV2XPUKernel : public framework::OpKernel { auto 
scale_tensor = ctx.Input("Scale"); auto scale = ctx.Attr>("scale"); if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); + auto scale_data = + phi::funcs::get_new_data_from_tensor(scale_tensor); if (scale_data.size() > 1) { scale_h = scale_data[0]; scale_w = scale_data[1]; @@ -107,7 +108,8 @@ class InterpolateV2XPUKernel : public framework::OpKernel { } auto out_size = ctx.Input("OutSize"); if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); + auto out_size_data = + phi::funcs::get_new_data_from_tensor(out_size); out_h = out_size_data[0]; out_w = out_size_data[1]; } @@ -169,7 +171,8 @@ class InterpolateV2GradXPUKernel : public framework::OpKernel { const DataLayout data_layout = framework::StringToDataLayout(data_layout_str); int n, c, in_d, in_h, in_w; - ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + phi::funcs::ExtractNCDWH(input->dims(), data_layout, &n, &c, &in_d, &in_h, + &in_w); auto interp_method = ctx.Attr("interp_method"); bool align_corners = ctx.Attr("align_corners"); @@ -190,7 +193,8 @@ class InterpolateV2GradXPUKernel : public framework::OpKernel { auto scale_tensor = ctx.Input("Scale"); auto scale = ctx.Attr>("scale"); if (scale_tensor != nullptr) { - auto scale_data = get_new_data_from_tensor(scale_tensor); + auto scale_data = + phi::funcs::get_new_data_from_tensor(scale_tensor); if (scale_data.size() > 1) { scale_h = scale_data[0]; scale_w = scale_data[1]; @@ -219,7 +223,8 @@ class InterpolateV2GradXPUKernel : public framework::OpKernel { } auto out_size = ctx.Input("OutSize"); if (out_size != nullptr) { - auto out_size_data = get_new_data_from_tensor(out_size); + auto out_size_data = + phi::funcs::get_new_data_from_tensor(out_size); out_h = out_size_data[0]; out_w = out_size_data[1]; } diff --git a/paddle/fluid/operators/linspace_op.cc b/paddle/fluid/operators/linspace_op.cc index 378c7573d6129..1cd59672f97fc 100644 --- a/paddle/fluid/operators/linspace_op.cc +++ b/paddle/fluid/operators/linspace_op.cc @@ -38,7 +38,8 @@ class LinspaceOp : public framework::OperatorWithKernel { framework::OpKernelType GetKernelTypeForVar( const std::string &var_name, const framework::Tensor &tensor, const framework::OpKernelType &expected_kernel_type) const override { - return expected_kernel_type; + return framework::OpKernelType(expected_kernel_type.data_type_, + tensor.place(), tensor.layout()); } }; @@ -66,7 +67,7 @@ class LinspaceOpMaker : public framework::OpProtoAndCheckerMaker { namespace ops = paddle::operators; DECLARE_INFER_SHAPE_FUNCTOR(linspace, LinspaceInferShapeFunctor, - PD_INFER_META(phi::LinspaceInferMeta)); + PD_INFER_META(phi::LinspaceRawInferMeta)); REGISTER_OPERATOR( linspace, ops::LinspaceOp, ops::LinspaceOpMaker, paddle::framework::EmptyGradOpMaker, diff --git a/paddle/fluid/operators/math/cross_entropy.cc b/paddle/fluid/operators/math/cross_entropy.cc index 0b0584608a300..cb2f59182c111 100644 --- a/paddle/fluid/operators/math/cross_entropy.cc +++ b/paddle/fluid/operators/math/cross_entropy.cc @@ -14,6 +14,7 @@ limitations under the License. 
*/ #include "paddle/fluid/operators/math/cross_entropy.h" #include "paddle/fluid/framework/convert_utils.h" +#include "paddle/phi/backends/cpu/cpu_context.h" namespace paddle { namespace platform { @@ -89,38 +90,38 @@ struct HardLabelCrossEntropyCPUFunctorImpl { const int axis_dim_; }; -template -class CrossEntropyFunctor { - public: - void operator()(const platform::CPUDeviceContext& ctx, framework::Tensor* out, - const framework::Tensor* prob, - const framework::Tensor* labels, const bool softLabel, - const int ignore_index, const int axis_dim) { - if (softLabel) { - const int batch_size = prob->dims()[0]; - const int num_classes = prob->dims()[1]; - const int num_remain = num_classes / axis_dim; - - Eigen::DSizes batch_axis_remain(batch_size, axis_dim, num_remain); - auto in = EigenMatrix::From(*prob); - auto lbl = EigenMatrix::From(*labels); - auto loss = EigenMatrix::From(*out); - - loss.device(*ctx.eigen_device()) = - -((lbl * in.log().unaryExpr(math::TolerableValue())) - .reshape(batch_axis_remain) - .sum(Eigen::DSizes(1))); - } else { - HardLabelCrossEntropyCPUFunctorImpl functor_impl( - out, prob, labels, ignore_index, axis_dim); - framework::VisitIntDataType( - framework::TransToProtoVarType(labels->dtype()), functor_impl); - } +template +void CrossEntropyFunctor::operator()( + const DeviceContext& ctx, framework::Tensor* out, + const framework::Tensor* prob, const framework::Tensor* labels, + const bool softLabel, const int ignore_index, const int axis_dim) { + if (softLabel) { + const int batch_size = prob->dims()[0]; + const int num_classes = prob->dims()[1]; + const int num_remain = num_classes / axis_dim; + + Eigen::DSizes batch_axis_remain(batch_size, axis_dim, num_remain); + auto in = EigenMatrix::From(*prob); + auto lbl = EigenMatrix::From(*labels); + auto loss = EigenMatrix::From(*out); + + loss.device(*ctx.eigen_device()) = + -((lbl * in.log().unaryExpr(math::TolerableValue())) + .reshape(batch_axis_remain) + .sum(Eigen::DSizes(1))); + } else { + HardLabelCrossEntropyCPUFunctorImpl functor_impl(out, prob, labels, + ignore_index, axis_dim); + framework::VisitIntDataType(framework::TransToProtoVarType(labels->dtype()), + functor_impl); } -}; +} template class CrossEntropyFunctor; template class CrossEntropyFunctor; + +template class CrossEntropyFunctor; +template class CrossEntropyFunctor; } // namespace math } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/math/cross_entropy.cu b/paddle/fluid/operators/math/cross_entropy.cu index 829ac9fb55964..80e06d4b7f688 100644 --- a/paddle/fluid/operators/math/cross_entropy.cu +++ b/paddle/fluid/operators/math/cross_entropy.cu @@ -17,6 +17,7 @@ limitations under the License. 
*/ #include "paddle/fluid/operators/math/cross_entropy.h" #include "paddle/fluid/platform/device/gpu/gpu_device_function.h" #include "paddle/fluid/platform/device/gpu/gpu_primitives.h" +#include "paddle/phi/backends/gpu/gpu_context.h" namespace paddle { namespace operators { @@ -93,46 +94,48 @@ struct HardLabelCrossEntropyCUDAFunctorImpl { gpuStream_t stream_; }; -template -class CrossEntropyFunctor { - public: - void operator()(const platform::CUDADeviceContext& ctx, - framework::Tensor* out, const framework::Tensor* prob, - const framework::Tensor* labels, const bool softLabel, - const int ignore_index, const int axis_dim) { - const T* prob_data = prob->data(); - T* loss_data = out->mutable_data(ctx.GetPlace()); - - int batch_size = prob->dims()[0]; - int class_num = prob->dims()[1]; +template +void CrossEntropyFunctor::operator()( + const DeviceContext& ctx, framework::Tensor* out, + const framework::Tensor* prob, const framework::Tensor* labels, + const bool softLabel, const int ignore_index, const int axis_dim) { + const T* prob_data = prob->data(); + T* loss_data = out->mutable_data(ctx.GetPlace()); + + int batch_size = prob->dims()[0]; + int class_num = prob->dims()[1]; #ifdef __HIPCC__ - constexpr int kMaxBlockDim = 256; + constexpr int kMaxBlockDim = 256; #else - constexpr int kMaxBlockDim = 512; + constexpr int kMaxBlockDim = 512; #endif - if (softLabel) { - const T* label_data = labels->data(); - int block = class_num > kMaxBlockDim - ? kMaxBlockDim - : pow(2, static_cast(std::log2(class_num))); - - SoftCrossEntropyKernel<<>>( - loss_data, prob_data, label_data, class_num); - } else { - HardLabelCrossEntropyCUDAFunctorImpl functor( - loss_data, prob_data, labels->data(), batch_size, class_num, - ignore_index, kMaxBlockDim, ctx.stream()); - framework::VisitDataType(framework::TransToProtoVarType(labels->dtype()), - functor); - } + if (softLabel) { + const T* label_data = labels->data(); + int block = class_num > kMaxBlockDim + ? 
kMaxBlockDim + : pow(2, static_cast(std::log2(class_num))); + + SoftCrossEntropyKernel<<>>( + loss_data, prob_data, label_data, class_num); + } else { + HardLabelCrossEntropyCUDAFunctorImpl functor( + loss_data, prob_data, labels->data(), batch_size, class_num, + ignore_index, kMaxBlockDim, ctx.stream()); + framework::VisitDataType(framework::TransToProtoVarType(labels->dtype()), + functor); } -}; +} template class CrossEntropyFunctor; template class CrossEntropyFunctor; template class CrossEntropyFunctor; + +template class CrossEntropyFunctor; +template class CrossEntropyFunctor; +template class CrossEntropyFunctor; + } // namespace math } // namespace operators } // namespace paddle diff --git a/paddle/fluid/operators/math/softmax.cu b/paddle/fluid/operators/math/softmax.cu index 83b124902ebb7..e960dc8a60832 100644 --- a/paddle/fluid/operators/math/softmax.cu +++ b/paddle/fluid/operators/math/softmax.cu @@ -29,9 +29,9 @@ using DataLayout = platform::DataLayout; template using CudnnDataType = platform::CudnnDataType; -template -void SoftmaxCUDNNFunctor::operator()( - const platform::CUDADeviceContext& context, const framework::Tensor* X, +template +void SoftmaxCUDNNFunctor::operator()( + const DeviceContext& context, const framework::Tensor* X, framework::Tensor* Y) { // ------------------- cudnn descriptors --------------------- ScopedTensorDescriptor xDesc; @@ -69,9 +69,9 @@ void SoftmaxCUDNNFunctor::operator()( #endif } -template -void SoftmaxGradCUDNNFunctor::operator()( - const platform::CUDADeviceContext& context, const framework::Tensor* Y, +template +void SoftmaxGradCUDNNFunctor::operator()( + const DeviceContext& context, const framework::Tensor* Y, const framework::Tensor* YGrad, framework::Tensor* XGrad) { // ------------------- cudnn descriptors --------------------- ScopedTensorDescriptor yDesc; @@ -116,19 +116,31 @@ void SoftmaxGradCUDNNFunctor::operator()( #endif } -template class SoftmaxCUDNNFunctor; -template class SoftmaxCUDNNFunctor; -template class SoftmaxGradCUDNNFunctor; -template class SoftmaxGradCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; #if CUDNN_VERSION_MIN(8, 1, 0) -template class SoftmaxCUDNNFunctor; -template class SoftmaxGradCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; #endif // MIOPEN do not support double #ifndef PADDLE_WITH_HIP -template class SoftmaxCUDNNFunctor; -template class SoftmaxGradCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; +template class SoftmaxCUDNNFunctor; +template class SoftmaxGradCUDNNFunctor; #endif template class SoftmaxFunctor +template class SoftmaxCUDNNFunctor { public: - void operator()(const platform::CUDADeviceContext& context, - const framework::Tensor* X, framework::Tensor* Y); + void operator()(const DeviceContext& context, const framework::Tensor* X, + framework::Tensor* Y); }; -template +template class SoftmaxGradCUDNNFunctor { public: - void operator()(const platform::CUDADeviceContext& context, - const framework::Tensor* Y, const framework::Tensor* y_grad, - framework::Tensor* x_grad); + void operator()(const DeviceContext& context, const framework::Tensor* Y, + const 
framework::Tensor* y_grad, framework::Tensor* x_grad); }; #endif diff --git a/paddle/fluid/operators/meshgrid_op.cc b/paddle/fluid/operators/meshgrid_op.cc index 103169fedb90e..5a6862f380da1 100644 --- a/paddle/fluid/operators/meshgrid_op.cc +++ b/paddle/fluid/operators/meshgrid_op.cc @@ -19,6 +19,10 @@ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + namespace paddle { namespace operators { @@ -28,30 +32,6 @@ class MeshgridOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - protected: - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE_GE( - ctx->Inputs("X").size(), 1UL, - platform::errors::InvalidArgument("Input(X) should not be empty.")); - PADDLE_ENFORCE_GE( - ctx->Outputs("Out").size(), 1UL, - platform::errors::InvalidArgument("Output(Out) should not be empty.")); - - auto inputs_dims = ctx->GetInputsDim("X"); - const size_t inputs_num = inputs_dims.size(); - auto outs_names = ctx->Outputs("Out"); - const size_t outputs_num = outs_names.size(); - - auto out_shape = std::vector(inputs_num); - - for (size_t i = 0; i < inputs_num; i++) { - out_shape[i] = inputs_dims[i][0]; - } - auto out_dims = phi::make_ddim(std::vector(out_shape)); - std::vector outs_dims(outputs_num, out_dims); - ctx->SetOutputsDim("Out", outs_dims); - } - protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { @@ -142,7 +122,10 @@ class MeshgridGradOpMaker : public framework::SingleGradOpMaker { } // namespace paddle namespace ops = paddle::operators; +DECLARE_INFER_SHAPE_FUNCTOR(meshgrid, MeshgridInferShapeFunctor, + PD_INFER_META(phi::MeshgridInferMeta)); REGISTER_OPERATOR(meshgrid, ops::MeshgridOp, ops::MeshgridOpMaker, ops::MeshgridGradOpMaker, - ops::MeshgridGradOpMaker); + ops::MeshgridGradOpMaker, + MeshgridInferShapeFunctor); REGISTER_OPERATOR(meshgrid_grad, ops::MeshgridGradOp); diff --git a/paddle/fluid/operators/mul_op_xpu.cc b/paddle/fluid/operators/mul_op_xpu.cc index 6ef41e059c7d9..7410b3b607c82 100644 --- a/paddle/fluid/operators/mul_op_xpu.cc +++ b/paddle/fluid/operators/mul_op_xpu.cc @@ -19,6 +19,8 @@ limitations under the License. 
*/ #include #include #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/operators/xpu_api_wrapper.h" +#include "paddle/fluid/platform/device/device_wrapper.h" namespace paddle { namespace operators { @@ -28,6 +30,8 @@ using framework::Tensor; template class MulXPUKernel : public framework::OpKernel { + using XPUType = typename XPUTypeTrait::Type; + public: void Compute(const framework::ExecutionContext& context) const override { const Tensor* x = context.Input("X"); @@ -62,14 +66,15 @@ class MulXPUKernel : public framework::OpKernel { const T* data_b = y_matrix.data(); T* data_c = z->data(); auto& dev_ctx = context.template device_context(); - int ret = xpu::fc_int16(dev_ctx.x_context(), trans_a, trans_b, m, n, k, - alpha, data_a, data_b, beta, data_c); - PADDLE_ENFORCE_EQ( - ret, XPU_SUCCESS, - platform::errors::External( - "XPU API return wrong value[%d], please check whether " - "Baidu Kunlun Card is properly installed.", - ret)); + + int ret = xpu_fc_wrapper( + dev_ctx.x_context(), reinterpret_cast(data_a), + reinterpret_cast(data_b), + reinterpret_cast(data_c), m, n, k, trans_a, trans_b, nullptr, + nullptr, nullptr, k, n, n, alpha, beta, nullptr, + xpu::Activation_t::LINEAR); + PADDLE_ENFORCE_XDNN_SUCCESS(ret, "xpu_fc_wrapper"); + if (z_dim.size() != 2) { z->Resize(z_dim); } @@ -78,6 +83,8 @@ class MulXPUKernel : public framework::OpKernel { template class MulGradXPUKernel : public framework::OpKernel { + using XPUType = typename XPUTypeTrait::Type; + public: void Compute(const framework::ExecutionContext& ctx) const override { int x_num_col_dims = ctx.template Attr("x_num_col_dims"); @@ -126,14 +133,14 @@ class MulGradXPUKernel : public framework::OpKernel { const T* data_a = dout->data(); const T* data_b = y_matrix.data(); T* data_c = dx_matrix.data(); - int ret = - xpu::gemm_int16(dev_ctx.x_context(), trans_a, trans_b, m, n, k, alpha, - data_a, lda, data_b, ldb, beta, data_c, ldc); - PADDLE_ENFORCE_EQ(ret, XPU_SUCCESS, - platform::errors::External( - "XPU API return wrong value[%d], please check " - "where Baidu Kunlun Card is properly installed.", - ret)); + + int ret = xpu_fc_wrapper( + dev_ctx.x_context(), reinterpret_cast(data_a), + reinterpret_cast(data_b), + reinterpret_cast(data_c), m, n, k, trans_a, trans_b, + nullptr, nullptr, nullptr, lda, ldb, ldc, alpha, beta, nullptr, + xpu::Activation_t::LINEAR); + PADDLE_ENFORCE_XDNN_SUCCESS(ret, "xpu_fc_wrapper"); } if (dy) { @@ -159,14 +166,14 @@ class MulGradXPUKernel : public framework::OpKernel { const T* data_a = x_matrix.data(); const T* data_b = dout->data(); T* data_c = dy_matrix.data(); - int ret = - xpu::gemm_int16(dev_ctx.x_context(), trans_a, trans_b, m, n, k, alpha, - data_a, lda, data_b, ldb, beta, data_c, ldc); - PADDLE_ENFORCE_EQ(ret, XPU_SUCCESS, - platform::errors::External( - "XPU API return wrong value[%d], please check " - "where Baidu Kunlun Card is properly installed.", - ret)); + + int ret = xpu_fc_wrapper( + dev_ctx.x_context(), reinterpret_cast(data_a), + reinterpret_cast(data_b), + reinterpret_cast(data_c), m, n, k, trans_a, trans_b, + nullptr, nullptr, nullptr, lda, ldb, ldc, alpha, beta, nullptr, + xpu::Activation_t::LINEAR); + PADDLE_ENFORCE_XDNN_SUCCESS(ret, "xpu_fc_wrapper"); } } }; @@ -175,9 +182,12 @@ class MulGradXPUKernel : public framework::OpKernel { } // namespace paddle namespace ops = paddle::operators; +namespace plat = paddle::platform; REGISTER_OP_XPU_KERNEL( - mul, ops::MulXPUKernel); + mul, ops::MulXPUKernel, + ops::MulXPUKernel); REGISTER_OP_XPU_KERNEL( - mul_grad, 
ops::MulGradXPUKernel) + mul_grad, ops::MulGradXPUKernel, + ops::MulGradXPUKernel) #endif diff --git a/paddle/fluid/operators/nll_loss_op.cc b/paddle/fluid/operators/nll_loss_op.cc index a4e1f7b3091a9..8f14bc10d5094 100644 --- a/paddle/fluid/operators/nll_loss_op.cc +++ b/paddle/fluid/operators/nll_loss_op.cc @@ -16,6 +16,7 @@ limitations under the License. */ #include #include "paddle/fluid/framework/infershape_utils.h" #include "paddle/fluid/framework/op_registry.h" +#include "paddle/phi/infermeta/backward.h" #include "paddle/phi/infermeta/ternary.h" namespace paddle { @@ -94,68 +95,6 @@ class NLLLossGradOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - void InferShape(framework::InferShapeContext* ctx) const override { - OP_INOUT_CHECK(ctx->HasInput("X"), "Input", "X", "NLLLoss"); - OP_INOUT_CHECK(ctx->HasInput("Label"), "Input", "Label", "NLLLoss"); - OP_INOUT_CHECK(ctx->HasInput(framework::GradVarName("Out")), "Input", - framework::GradVarName("Out"), "NLLLoss"); - OP_INOUT_CHECK(ctx->HasOutput(framework::GradVarName("X")), "Output", - framework::GradVarName("X"), "NLLLoss"); - - auto reduction = ctx->Attrs().Get("reduction"); - auto x_dims = ctx->GetInputDim("X"); - auto label_dims = ctx->GetInputDim("Label"); - auto dout_dims = ctx->GetInputDim(framework::GradVarName("Out")); - bool contain_unknown_dim = - phi::contain_unknown_dim(x_dims) || phi::contain_unknown_dim(dout_dims); - bool check = ctx->IsRuntime() || !contain_unknown_dim; - - if (check) { - auto batch_size = x_dims[0]; - if (x_dims.size() == 2) { - PADDLE_ENFORCE_EQ(dout_dims.size(), 1, - platform::errors::InvalidArgument( - "The dimensions of Input(Out@Grad) must be 1")); - if (reduction == "none") { - PADDLE_ENFORCE_EQ( - dout_dims[0], batch_size, - platform::errors::InvalidArgument( - "The unreduced size ofInput(Out@Grad) must be the " - "same as batch_size.")); - } else { - PADDLE_ENFORCE_EQ( - dout_dims[0], 1, - platform::errors::InvalidArgument( - "The reduced size of Input(Out@Grad) must be 1")); - } - } else if (x_dims.size() == 4) { - if (reduction == "none") { - PADDLE_ENFORCE_EQ( - dout_dims.size(), 3, - platform::errors::InvalidArgument( - "The dimensions of Input(Out@Grad) must be 3,But got [%s].", - dout_dims.size())); - PADDLE_ENFORCE_EQ( - dout_dims[0] == label_dims[0] && dout_dims[1] == label_dims[1] && - dout_dims[2] == label_dims[2], - true, platform::errors::InvalidArgument( - "The dimensions of Input(Out@Grad) must be match " - "to Input(Label) dimensions.")); - } else { - PADDLE_ENFORCE_EQ( - dout_dims[0], 1, - platform::errors::InvalidArgument( - "The reduced size of Input(Out@Grad) must be 1")); - } - } - } - - auto x_grad_name = framework::GradVarName("X"); - if (ctx->HasOutput(x_grad_name)) { - ctx->SetOutputDim(x_grad_name, x_dims); - } - } - protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { @@ -192,9 +131,12 @@ class NLLLossGradMaker : public framework::SingleGradOpMaker { DECLARE_INFER_SHAPE_FUNCTOR(nll_loss, NllLossRawInferShapeFunctor, PD_INFER_META(phi::NllLossRawInferMeta)); +DECLARE_INFER_SHAPE_FUNCTOR(nll_loss_grad, NllLossGradInferShapeFunctor, + PD_INFER_META(phi::NllLossGradInferMeta)); namespace ops = paddle::operators; REGISTER_OPERATOR(nll_loss, ops::NLLLossOp, ops::NLLLossOpMaker, ops::NLLLossGradMaker, ops::NLLLossGradMaker, NllLossRawInferShapeFunctor); -REGISTER_OPERATOR(nll_loss_grad, ops::NLLLossGradOp); +REGISTER_OPERATOR(nll_loss_grad, 
ops::NLLLossGradOp, + NllLossGradInferShapeFunctor); diff --git a/paddle/fluid/operators/optimizers/adagrad_op.cc b/paddle/fluid/operators/optimizers/adagrad_op.cc index 33c4cf94cf25a..91bad1430615f 100644 --- a/paddle/fluid/operators/optimizers/adagrad_op.cc +++ b/paddle/fluid/operators/optimizers/adagrad_op.cc @@ -19,6 +19,10 @@ limitations under the License. */ #include "paddle/fluid/operators/math/selected_rows_functor.h" #include "paddle/phi/kernels/funcs/math_function.h" +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + namespace paddle { namespace operators { @@ -27,39 +31,6 @@ class AdagradOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - void InferShape(framework::InferShapeContext* ctx) const override { - OP_INOUT_CHECK(ctx->HasInput("Param"), "Input", "Param", "Adagrad"); - OP_INOUT_CHECK(ctx->HasInput("Grad"), "Input", "Grad", "Adagrad"); - OP_INOUT_CHECK(ctx->HasInput("Moment"), "Input", "Moment", "Adagrad"); - OP_INOUT_CHECK(ctx->HasInput("LearningRate"), "Input", "LearningRate", - "Adagrad"); - OP_INOUT_CHECK(ctx->HasOutput("ParamOut"), "Output", "ParamOut", "Adagrad"); - OP_INOUT_CHECK(ctx->HasOutput("MomentOut"), "Output", "MomentOut", - "Adagrad"); - - auto lr_dims = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_NE(phi::product(lr_dims), 0, - platform::errors::InvalidArgument( - "Maybe the Input variable LearningRate has not " - "been initialized. You may need to confirm " - "if you put exe.run(startup_program) " - "after optimizer.minimize function.")); - PADDLE_ENFORCE_EQ(phi::product(lr_dims), 1, - platform::errors::InvalidArgument( - "LearningRate should have one element")); - auto param_dims = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Grad"), - platform::errors::InvalidArgument("Param and Grad input of AdagradOp " - "should have the same dimension.")); - PADDLE_ENFORCE_EQ( - param_dims, ctx->GetInputDim("Moment"), - platform::errors::InvalidArgument("Param and Moment input of AdagradOp " - "should have the same dimension.")); - - ctx->SetOutputDim("ParamOut", param_dims); - ctx->SetOutputDim("MomentOut", param_dims); - } framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { return framework::OpKernelType( @@ -105,4 +76,7 @@ for numerical stability to avoid the division by zero error. } // namespace paddle namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(adagrad, ops::AdagradOp, ops::AdagradOpMaker); +DECLARE_INFER_SHAPE_FUNCTOR(adagrad, AdagradInferShapeFunctor, + PD_INFER_META(phi::AdagradInferMeta)); +REGISTER_OP_WITHOUT_GRADIENT(adagrad, ops::AdagradOp, ops::AdagradOpMaker, + AdagradInferShapeFunctor); diff --git a/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cc b/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cc index efec50efa92ea..95b45934ea6d2 100644 --- a/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cc +++ b/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cc @@ -94,6 +94,7 @@ class DistributedFusedLambInitOpMaker AddOutput("GradOut", "The output gradient list.").AsDuplicable(); AddOutput("GlobalScale", "The global scale. 
It is usually the scale factor for AMP."); + AddOutput("Step", "The global step which excludes the NaN/Inf step."); AddAttr("beta1", "The initial value of Beta1Pow."); AddAttr("beta2", "The initial value of Beta2Pow."); diff --git a/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cu b/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cu index 7d8a7186d58b4..3688b8067c231 100644 --- a/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cu +++ b/paddle/fluid/operators/optimizers/distributed_fused_lamb_init_op.cu @@ -698,6 +698,10 @@ class DistributedFusedLambInitOpKernel TensorFillConstant(dev_ctx, global_scale, {1}, 1.0f); } VLOG(10) << "Init global scale ends"; + + TensorFillConstant(dev_ctx, ctx.Output("Step"), + {1}, static_cast(0)); + dev_ctx.Wait(); VLOG(10) << "Wait for H2D copy"; } diff --git a/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cc b/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cc index 8f7c87912e93a..161483c3420fc 100644 --- a/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cc +++ b/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cc @@ -110,6 +110,7 @@ class DistributedFusedLambOpMaker : public framework::OpProtoAndCheckerMaker { .AsDuplicable(); AddOutput("FoundInf", "Whether there is NaN/Inf"); + AddOutput("Step", "The global step which excludes the NaN/Inf step."); AddAttr("beta1", "The initial Beta1Pow value."); AddAttr("beta2", "The initial Beta2Pow value."); diff --git a/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cu b/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cu index 5b60f65442b55..f445a140f27a3 100644 --- a/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cu +++ b/paddle/fluid/operators/optimizers/distributed_fused_lamb_op.cu @@ -381,8 +381,9 @@ static __global__ void UpdateLambMomentAndTrustRatioDivCUDAKernel( const T *__restrict__ square_grad_norm_p, const T *__restrict__ global_scale, const T *__restrict__ beta1pow_p, const T *__restrict__ beta2pow_p, T *__restrict__ mom1_p, - T *__restrict__ mom2_p, T *__restrict__ trust_ratio_div_p, bool *found_inf, - T weight_decay, int weight_decay_end_numel, T beta1, T beta2, T epsilon, + T *__restrict__ mom2_p, T *__restrict__ trust_ratio_div_p, + bool *__restrict__ found_inf, int64_t *__restrict__ step, T weight_decay, + int weight_decay_end_numel, T beta1, T beta2, T epsilon, T max_global_grad_norm, int num, T rescale_grad) { T square_grad_norm = *square_grad_norm_p; bool need_update_found_inf = @@ -392,6 +393,7 @@ static __global__ void UpdateLambMomentAndTrustRatioDivCUDAKernel( return; } else if (need_update_found_inf) { *found_inf = false; + ++(*step); } T scale = rescale_grad / global_scale[0]; @@ -467,8 +469,8 @@ static void MultiTensorUpdateLambMomentAndTrustRatioDiv( const platform::CUDADeviceContext &dev_ctx, const int *offsets, int n, const T *param_p, const GradT *grad_p, const T *square_grad_norm_p, const T *global_scale, const T *beta1pow_p, const T *beta2pow_p, T *mom1_p, - T *mom2_p, T *trust_ratio_div_p, bool *found_inf_p, T weight_decay, - int weight_decay_end_idx, T beta1, T beta2, T epsilon, + T *mom2_p, T *trust_ratio_div_p, bool *found_inf_p, int64_t *step, + T weight_decay, int weight_decay_end_idx, T beta1, T beta2, T epsilon, T max_global_grad_norm, T rescale_grad) { if (n <= 0) return; int numel = offsets[n] - offsets[0]; @@ -496,15 +498,24 @@ static void MultiTensorUpdateLambMomentAndTrustRatioDiv( auto stream = dev_ctx.stream(); auto config = 
platform::GetGpuLaunchConfig1D(dev_ctx, numel, vec_size); + if (found_inf_p == nullptr) { + PADDLE_ENFORCE_EQ( + step, nullptr, + platform::errors::InvalidArgument( + "Output(Step) cannot be updated twice in one mini-batch.")); + } else { + PADDLE_ENFORCE_NOT_NULL(step, platform::errors::InvalidArgument( + "Output(Step) cannot be nullptr.")); + } -#define PD_LAUNCH_LAMB_MOM_TRUST_RATIO_DIV_KERNEL \ - do { \ - UpdateLambMomentAndTrustRatioDivCUDAKernel<<< \ - config.block_per_grid, config.thread_per_block, 0, stream>>>( \ - param_p, grad_p, square_grad_norm_p, global_scale, beta1pow_p, \ - beta2pow_p, mom1_p, mom2_p, trust_ratio_div_p, found_inf_p, \ - weight_decay, weight_decay_end_numel, beta1, beta2, epsilon, \ - max_global_grad_norm, numel, rescale_grad); \ +#define PD_LAUNCH_LAMB_MOM_TRUST_RATIO_DIV_KERNEL \ + do { \ + UpdateLambMomentAndTrustRatioDivCUDAKernel<<< \ + config.block_per_grid, config.thread_per_block, 0, stream>>>( \ + param_p, grad_p, square_grad_norm_p, global_scale, beta1pow_p, \ + beta2pow_p, mom1_p, mom2_p, trust_ratio_div_p, found_inf_p, step, \ + weight_decay, weight_decay_end_numel, beta1, beta2, epsilon, \ + max_global_grad_norm, numel, rescale_grad); \ } while (0) PD_VEC_LAUNCH_KERNEL(vec_size, PD_LAUNCH_LAMB_MOM_TRUST_RATIO_DIV_KERNEL); @@ -1315,6 +1326,8 @@ class DistributedFusedLambOpKernel const auto *fp16_partial_fused_offsets = fp16_partial_fused_offsets_t->data(); + auto *step = ctx.Output("Step")->data(); + VLOG(1) << "FusedParamOffsets: " << FlattenToString(fused_offsets, fused_offsets_t->numel(), fused_offsets_t->place()); @@ -1337,8 +1350,8 @@ class DistributedFusedLambOpKernel dev_ctx, fp32_partial_fused_offsets, fp32_local_param_num, fp32_param + fp32_offset, fp32_sum_grad, fp32_square_grad_norm, global_scale, beta1pow, beta2pow, moment1, moment2, trust_ratio_div, - found_inf, weight_decay, fp32_weight_decay_end_idx, beta1, beta2, - epsilon, max_global_grad_norm, rescale_grad); + found_inf, step, weight_decay, fp32_weight_decay_end_idx, beta1, + beta2, epsilon, max_global_grad_norm, rescale_grad); VLOG(10) << "Update FP32 Moment and TrustRatioDiv done"; } float *master_param = nullptr; @@ -1346,13 +1359,14 @@ class DistributedFusedLambOpKernel master_param = fp32_param + fp32_numel; VLOG(10) << "Update FP16 Moment and TrustRatioDiv starts"; auto tmp_found_inf = has_fp32_param ? nullptr : found_inf; + auto tmp_step = has_fp32_param ? nullptr : step; MultiTensorUpdateLambMomentAndTrustRatioDiv( dev_ctx, fp16_partial_fused_offsets, fp16_local_param_num, master_param + fp16_offset, fp16_sum_grad, fp32_square_grad_norm, global_scale, beta1pow, beta2pow, moment1 + fp32_numel_each_device, moment2 + fp32_numel_each_device, - trust_ratio_div + fp32_numel_each_device, tmp_found_inf, weight_decay, - fp16_weight_decay_end_idx, beta1, beta2, epsilon, + trust_ratio_div + fp32_numel_each_device, tmp_found_inf, tmp_step, + weight_decay, fp16_weight_decay_end_idx, beta1, beta2, epsilon, max_global_grad_norm, rescale_grad); VLOG(10) << "Update FP16 Moment and TrustRatioDiv done"; } diff --git a/paddle/fluid/operators/optimizers/momentum_op_xpu.cc b/paddle/fluid/operators/optimizers/momentum_op_xpu.cc index 5624312d9a728..6897213c91a34 100644 --- a/paddle/fluid/operators/optimizers/momentum_op_xpu.cc +++ b/paddle/fluid/operators/optimizers/momentum_op_xpu.cc @@ -14,6 +14,7 @@ limitations under the License. 
*/ #ifdef PADDLE_WITH_XPU #include #include "paddle/fluid/operators/optimizers/sgd_op.h" +#include "paddle/fluid/platform/device/device_wrapper.h" namespace paddle { namespace operators { @@ -33,6 +34,13 @@ class MomentumOpXPUKernel : public framework::OpKernel { velocity_out->mutable_data(ctx.GetPlace()); auto* lr = learning_rate->data(); + auto regularization_method = ctx.Attr("regularization_method"); + auto regularization_coeff = ctx.Attr("regularization_coeff"); + if (regularization_method != "l2_decay") { + // only support l2_decay + regularization_coeff = 0.0f; + } + auto* grad_var = ctx.InputVar("Grad"); PADDLE_ENFORCE_EQ(grad_var->IsType(), true, platform::errors::PermissionDenied( @@ -44,28 +52,16 @@ class MomentumOpXPUKernel : public framework::OpKernel { auto grad = ctx.Input("Grad"); auto& dev_ctx = ctx.template device_context(); + + // int momentum(Context* ctx, const T* param, const T* velocity, const T* + // grad, T* param_out, T* velocity_out, int len, const float* lr, int + // use_nesterov, float mu, float l2_weight_decay); int r = xpu::momentum(dev_ctx.x_context(), param->data(), velocity->data(), grad->data(), param_out->data(), velocity_out->data(), - param_out->numel(), lr, use_nesterov, mu); - if (r == xpu::Error_t::INVALID_PARAM) { - PADDLE_ENFORCE_EQ( - r, xpu::Error_t::SUCCESS, - platform::errors::InvalidArgument( - "XPU kernel error of MomentumOp, error message: INVALID_PARAM, " - "please check your input & output.")); - } else if (r == xpu::Error_t::RUNTIME_ERROR) { - PADDLE_ENFORCE_EQ( - r, xpu::Error_t::SUCCESS, - platform::errors::Unavailable( - "XPU kernel error of MomentumOp, error message: RUNTIME_ERROR, " - "please check whether Baidu Kunlun card is properly installed.")); - } else if (r == xpu::Error_t::NO_ENOUGH_WORKSPACE) { - PADDLE_ENFORCE_EQ(r, xpu::Error_t::SUCCESS, - platform::errors::ResourceExhausted( - "XPU kernel error of MomentumOp, error message: " - "NO_ENOUGH_WORKSPACE, XPU has no enough memory.")); - } + param_out->numel(), lr, use_nesterov, mu, + regularization_coeff); + PADDLE_ENFORCE_XDNN_SUCCESS(r, "momentum"); } }; } // namespace operators diff --git a/paddle/fluid/operators/optimizers/rmsprop_op.cc b/paddle/fluid/operators/optimizers/rmsprop_op.cc index cd6fdcf34e95f..b3458724482e9 100644 --- a/paddle/fluid/operators/optimizers/rmsprop_op.cc +++ b/paddle/fluid/operators/optimizers/rmsprop_op.cc @@ -14,91 +14,16 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + namespace paddle { namespace operators { class RmspropOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE_EQ(ctx->HasInput("Param"), true, - platform::errors::NotFound( - "Input(Param) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ( - ctx->HasInput("MeanSquare"), true, - platform::errors::NotFound( - "Input(MeanSquare) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ( - ctx->HasInput("LearningRate"), true, - platform::errors::NotFound( - "Input(LearningRate) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ(ctx->HasInput("Grad"), true, - platform::errors::NotFound( - "Input(Grad) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ(ctx->HasInput("Moment"), true, - platform::errors::NotFound( - "Input(Moment) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ(ctx->GetInputsVarType("Param").front(), - framework::proto::VarType::LOD_TENSOR, - platform::errors::InvalidArgument( - "The input var's type in RmspropOp should be " - "LoDTensor, but the received is %s", - ctx->GetInputsVarType("Param").front())); - - PADDLE_ENFORCE_EQ( - ctx->HasOutput("ParamOut"), true, - platform::errors::NotFound( - "Output(param_out) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ( - ctx->HasOutput("MomentOut"), true, - platform::errors::NotFound( - "Output(MomentOut) of RmspropOp should not be null.")); - PADDLE_ENFORCE_EQ( - ctx->HasOutput("MeanSquareOut"), true, - platform::errors::NotFound( - "Output(MeanSquareOut) of RmspropOp should not be null.")); - if (ctx->Attrs().Get("centered")) { - PADDLE_ENFORCE_EQ( - ctx->HasOutput("MeanGradOut"), true, - platform::errors::NotFound( - "Output(MeanGradOut) of RmspropOp should not be null.")); - } - - auto param_dim = ctx->GetInputDim("Param"); - PADDLE_ENFORCE_EQ( - param_dim, ctx->GetInputDim("Grad"), - platform::errors::InvalidArgument( - "Param and grad input of RmspropOp should have the same dimension. " - "But received Param's dim [%s] and Grad's dim [%s].", - param_dim, ctx->GetInputDim("Grad"))); - PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("Moment"), - platform::errors::InvalidArgument( - "Param and Momentum input of RmspropOp " - "should have the same dimension. But received " - "Param's dim [%s] and Moment [%s]", - param_dim, ctx->GetInputDim("Moment"))); - PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("MeanSquare"), - platform::errors::InvalidArgument( - "Param and Momentum input of RmspropOp " - "should have the same dimension. But received " - "Param's dim [%s] and MeanSquare [%s]", - param_dim, ctx->GetInputDim("MeanSquare"))); - - auto lr_dim = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_EQ(phi::product(lr_dim), 1, - platform::errors::InvalidArgument( - "Learning Rate of RmspropOp should be a scalar. 
But " - "received LearningRate's dim [%s]", - phi::product(lr_dim))); - - ctx->SetOutputDim("ParamOut", param_dim); - ctx->SetOutputDim("MomentOut", param_dim); - ctx->SetOutputDim("MeanSquareOut", param_dim); - if (ctx->Attrs().Get("centered")) { - ctx->SetOutputDim("MeanGradOut", param_dim); - } - } }; class RmspropOpMaker : public framework::OpProtoAndCheckerMaker { @@ -169,4 +94,7 @@ The original slides that proposed Rmsprop: Slide 29 of } // namespace paddle namespace ops = paddle::operators; -REGISTER_OP_WITHOUT_GRADIENT(rmsprop, ops::RmspropOp, ops::RmspropOpMaker); +DECLARE_INFER_SHAPE_FUNCTOR(rmsprop, RmspropInferShapeFunctor, + PD_INFER_META(phi::RmspropInferMeta)); +REGISTER_OP_WITHOUT_GRADIENT(rmsprop, ops::RmspropOp, ops::RmspropOpMaker, + RmspropInferShapeFunctor); diff --git a/paddle/fluid/operators/optimizers/sgd_op.cc b/paddle/fluid/operators/optimizers/sgd_op.cc index 0e3f895d276af..f51d776d7195c 100644 --- a/paddle/fluid/operators/optimizers/sgd_op.cc +++ b/paddle/fluid/operators/optimizers/sgd_op.cc @@ -19,6 +19,10 @@ limitations under the License. */ #include "paddle/fluid/platform/mkldnn_helper.h" #endif +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/multiary.h" + namespace paddle { namespace operators { @@ -26,46 +30,6 @@ class SGDOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - void InferShape(framework::InferShapeContext *ctx) const override { - PADDLE_ENFORCE_EQ(ctx->HasInput("Param"), true, - platform::errors::NotFound( - "Input(Param) of SGDOp should not be null.")); - PADDLE_ENFORCE_EQ( - ctx->HasInput("Grad"), true, - platform::errors::NotFound("Input(Grad) of SGDOp should not be null.")); - PADDLE_ENFORCE_EQ(ctx->HasInput("LearningRate"), true, - platform::errors::NotFound( - "Input(LearningRate) of SGDOp should not be null.")); - PADDLE_ENFORCE_EQ(ctx->HasOutput("ParamOut"), true, - platform::errors::NotFound( - "Output(ParamOut) of SGDOp should not be null.")); - - auto lr_dims = ctx->GetInputDim("LearningRate"); - PADDLE_ENFORCE_NE(phi::product(lr_dims), 0, - platform::errors::NotFound( - "Maybe the Input variable LearningRate has not " - "been initialized. You may need to confirm " - "if you put exe.run(startup_program) " - "after optimizer.minimize function.")); - PADDLE_ENFORCE_EQ(phi::product(lr_dims), 1, - platform::errors::InvalidArgument( - "Learning rate should have 1 element. But received " - "LearningRate dims [%s]", - phi::product(lr_dims))); - auto param_dim = ctx->GetInputDim("Param"); - if (ctx->GetInputsVarType("Grad")[0] == - framework::proto::VarType::LOD_TENSOR) { - PADDLE_ENFORCE_EQ( - param_dim, ctx->GetInputDim("Grad"), - platform::errors::InvalidArgument( - "SGD Operator's input Param and Grad dimensions do not match. " - "The Param %s shape is [%s], but the Grad %s shape is [%s].", - ctx->Inputs("Param")[0], param_dim, ctx->Inputs("Grad")[0], - ctx->GetInputDim("Grad"))); - } - ctx->SetOutputDim("ParamOut", param_dim); - } - protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext &ctx) const override { @@ -161,8 +125,10 @@ This operator implements one step of the stochastic gradient descent algorithm. 
} // namespace paddle namespace ops = paddle::operators; +DECLARE_INFER_SHAPE_FUNCTOR(sgd, SGDInferShapeFunctor, + PD_INFER_META(phi::SGDInferMeta)); REGISTER_OPERATOR( sgd, ops::SGDOp, ops::SGDOpMaker, paddle::framework::EmptyGradOpMaker, paddle::framework::EmptyGradOpMaker, - ops::SGDOpInferVarType); + ops::SGDOpInferVarType, SGDInferShapeFunctor); diff --git a/paddle/fluid/operators/pixel_shuffle_op.cc b/paddle/fluid/operators/pixel_shuffle_op.cc index 21ca26f49f653..1724aedbe9b24 100644 --- a/paddle/fluid/operators/pixel_shuffle_op.cc +++ b/paddle/fluid/operators/pixel_shuffle_op.cc @@ -82,42 +82,6 @@ class PixelShuffleGradMaker : public framework::SingleGradOpMaker { class PixelShuffleGradOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - - void InferShape(framework::InferShapeContext* ctx) const override { - PADDLE_ENFORCE_EQ( - ctx->HasInput(framework::GradVarName("Out")), true, - platform::errors::NotFound("Input(Out@Grad) should not be null")); - PADDLE_ENFORCE_EQ( - ctx->HasOutput(framework::GradVarName("X")), true, - platform::errors::NotFound("Output(X@Grad) should not be null")); - - auto do_dims = ctx->GetInputDim(framework::GradVarName("Out")); - PADDLE_ENFORCE_EQ(do_dims.size(), 4, - platform::errors::InvalidArgument( - "Input should be a 4-D tensor of format [N, C, H, W] " - "or [N, H, W, C], but got %u.", - do_dims.size())); - - auto upscale_factor = ctx->Attrs().Get("upscale_factor"); - - const std::string data_format = - ctx->Attrs().Get("data_format"); - const bool channel_last = (data_format == "NHWC"); - - auto dx_dims = do_dims; - dx_dims[0] = do_dims[0]; - - if (!channel_last) { - dx_dims[1] = do_dims[1] * (upscale_factor * upscale_factor); - dx_dims[2] = do_dims[2] / upscale_factor; - dx_dims[3] = do_dims[3] / upscale_factor; - } else { - dx_dims[1] = do_dims[1] / upscale_factor; - dx_dims[2] = do_dims[2] / upscale_factor; - dx_dims[3] = do_dims[3] * (upscale_factor * upscale_factor); - } - ctx->SetOutputDim(framework::GradVarName("X"), dx_dims); - } }; } // namespace operators @@ -132,7 +96,11 @@ REGISTER_OPERATOR(pixel_shuffle, ops::PixelShuffleOp, ops::PixelShuffleOpMaker, ops::PixelShuffleGradMaker, PixelShuffleInferShapeFunctor); -REGISTER_OPERATOR(pixel_shuffle_grad, ops::PixelShuffleGradOp); +DECLARE_INFER_SHAPE_FUNCTOR(pixel_shuffle_grad, + PixelShuffleGradInferShapeFunctor, + PD_INFER_META(phi::PixelShuffleGradInferMeta)); +REGISTER_OPERATOR(pixel_shuffle_grad, ops::PixelShuffleGradOp, + PixelShuffleGradInferShapeFunctor); REGISTER_OP_VERSION(pixel_shuffle) .AddCheckpoint( diff --git a/paddle/fluid/operators/pscore/CMakeLists.txt b/paddle/fluid/operators/pscore/CMakeLists.txt old mode 100644 new mode 100755 index baf82a9df31cb..863370540da82 --- a/paddle/fluid/operators/pscore/CMakeLists.txt +++ b/paddle/fluid/operators/pscore/CMakeLists.txt @@ -6,9 +6,9 @@ include(operators) set(DISTRIBUTE_DEPS "") -list(APPEND DISTRIBUTE_DEPS fleet ps_service brpc_utils heter_server heter_client ps_framework_proto framework_proto sendrecv_rpc brpc leveldb ssl crypto protobuf gflags glog zlib snappy device_context) +list(APPEND DISTRIBUTE_DEPS executor fleet ps_service brpc_utils heter_server heter_client ps_framework_proto framework_proto sendrecv_rpc brpc leveldb ssl crypto protobuf gflags glog zlib snappy device_context) -set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor -Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor") +set(DISTRIBUTE_COMPILE_FLAGS "-Wno-non-virtual-dtor 
-Wno-error=non-virtual-dtor -Wno-error=delete-non-virtual-dtor -Wno-error=parentheses") if (CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 7.0) set(DISTRIBUTE_COMPILE_FLAGS @@ -37,3 +37,6 @@ cc_test(send_and_recv_gpu_test SRCS send_and_recv_op_gpu_test.cc DEPS executor s set_source_files_properties(heter_listen_and_server_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) cc_test(heter_listen_and_server_test SRCS heter_listen_and_server_test.cc DEPS executor scope proto_desc scale_op heter_listen_and_serv_op ${RPC_DEPS} ${DISTRIBUTE_DEPS} eigen_function) + +#set_source_files_properties(heter_cloud_comm_cpu_test.cc PROPERTIES COMPILE_FLAGS ${DISTRIBUTE_COMPILE_FLAGS}) +#cc_test(heter_cloud_comm_cpu_test SRCS heter_cloud_comm_cpu_test.cc DEPS executor scope proto_desc scale_op heter_listen_and_serv_op ${RPC_DEPS} ${DISTRIBUTE_DEPS} eigen_function) diff --git a/paddle/fluid/operators/pscore/distributed_lookup_table_op.h b/paddle/fluid/operators/pscore/distributed_lookup_table_op.h index da439407a422b..c2717c19b2d8e 100644 --- a/paddle/fluid/operators/pscore/distributed_lookup_table_op.h +++ b/paddle/fluid/operators/pscore/distributed_lookup_table_op.h @@ -26,17 +26,13 @@ template class DistributedLookupTableKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext &context) const override { - auto &scope = context.scope(); - auto padding_idx = context.Attr("padding_idx"); auto table_id = context.Attr("table_id"); bool is_test = context.Attr("is_test"); - auto embedding_name = context.InputNames("W").front(); + auto *var = context.InputVar("W"); int64_t emb_dim = 0; - auto *var = scope.FindVar(embedding_name); - if (var->IsType()) { emb_dim = var->Get().dims()[1]; } else if (var->IsType()) { @@ -61,35 +57,31 @@ class DistributedLookupTableKernel : public framework::OpKernel { } else { auto inputs_variable = context.MultiInputVar("Ids"); auto outputs_variable = context.MultiOutputVar("Outputs"); - auto inputs_name = context.InputNames("Ids"); - auto outputs_name = context.OutputNames("Outputs"); auto cpu_place = platform::CPUPlace(); - framework::Scope *tmp_scope = scope.NewTmpScope().release(); std::vector tmp_input_vec; auto input_var_size = inputs_variable.size(); std::vector tmp_output_vec; auto output_var_size = outputs_variable.size(); + std::vector> tmp_tensors; + // create temp input for (size_t idx = 0; idx < input_var_size; ++idx) { - framework::Variable *tmp_input_var = tmp_scope->Var(inputs_name[idx]); - framework::LoDTensor *tmp_input_tensor = - tmp_input_var->GetMutable(); + tmp_tensors.emplace_back(std::make_shared()); + auto *p = tmp_tensors.back().get(); framework::TensorCopy(inputs_variable[idx]->Get(), - cpu_place, context.device_context(), - tmp_input_tensor); - tmp_input_vec.push_back(tmp_input_tensor); + cpu_place, context.device_context(), p); + tmp_input_vec.push_back(p); } // create temp output for (size_t idx = 0; idx < output_var_size; ++idx) { - framework::Variable *tmp_output_var = tmp_scope->Var(outputs_name[idx]); - framework::LoDTensor *tmp_output_tensor = - tmp_output_var->GetMutable(); - tmp_output_tensor->Resize(outputs[idx]->dims()); - tmp_output_vec.push_back(tmp_output_tensor); + tmp_tensors.emplace_back(std::make_shared()); + auto *p = tmp_tensors.back().get(); + p->Resize(outputs[idx]->dims()); + tmp_output_vec.push_back(p); } // use fleet->PullSparse @@ -100,27 +92,21 @@ class DistributedLookupTableKernel : public framework::OpKernel { // cp temp to origin for (size_t idx = 0; idx < output_var_size; ++idx) { - 
framework::Variable *tmp_output_var = tmp_scope->Var(outputs_name[idx]); - framework::LoDTensor *tmp_output_tensor = - tmp_output_var->GetMutable(); framework::TensorCopy( - *tmp_output_tensor, context.GetPlace(), context.device_context(), + *tmp_output_vec[idx], context.GetPlace(), context.device_context(), outputs_variable[idx]->GetMutable()); } - delete tmp_scope; } - auto id_names = context.InputNames("Ids"); - auto out_names = context.OutputNames("Outputs"); auto lookup_table_version = context.Attr("lookup_table_version"); + auto id_vars = context.MultiInputVar("Ids"); + auto out_vars = context.MultiOutputVar("Outputs"); if (lookup_table_version == "lookup_table_v2") { - for (size_t i = 0; i < id_names.size(); ++i) { - auto *id_var = scope.FindVar(id_names[i]); - auto *out_var = scope.FindVar(out_names[i]); - auto *id_tensor = id_var->GetMutable(); - auto *out_tensor = out_var->GetMutable(); + for (size_t i = 0; i < id_vars.size(); ++i) { + auto *id_tensor = id_vars[i]->GetMutable(); + auto *out_tensor = out_vars[i]->GetMutable(); auto id_dims = id_tensor->dims(); out_tensor->Resize(phi::make_ddim({static_cast(id_dims[0]), diff --git a/paddle/fluid/operators/pscore/heter_cloud_comm_cpu_test.cc b/paddle/fluid/operators/pscore/heter_cloud_comm_cpu_test.cc new file mode 100644 index 0000000000000..2340f443c49fb --- /dev/null +++ b/paddle/fluid/operators/pscore/heter_cloud_comm_cpu_test.cc @@ -0,0 +1,247 @@ +/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#if defined PADDLE_WITH_PSCORE +#include + +#include +#include +#include +#include +#include // NOLINT + +#include "gtest/gtest.h" +#include "paddle/fluid/distributed/ps/service/heter_client.h" +#include "paddle/fluid/distributed/ps/service/heter_server.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/op_version_registry.h" + +namespace framework = paddle::framework; +namespace platform = paddle::platform; +namespace distributed = paddle::distributed; + +using MultiVarMsg = ::paddle::distributed::MultiVariableMessage; + +void CreateVarsOnScope(framework::Scope* scope) { + auto var1 = scope->Var("w"); + var1->GetMutable(); + auto var2 = scope->Var("x"); + var2->GetMutable(); +} + +void InitTensorsOnClient(framework::Scope* scope, platform::CPUPlace* place, + int64_t rows_numel) { + CreateVarsOnScope(scope); + + auto w = scope->Var("w")->GetMutable(); + auto w_value = w->mutable_value(); + w_value->Resize({rows_numel, 10}); + for (int64_t i = 0; i < rows_numel; ++i) w->AutoGrownIndex(i, true); + + auto ptr = w_value->mutable_data(*place); + + for (int64_t i = 0; i < w_value->numel(); ++i) { + ptr[i] = static_cast(i / 10); + } + + auto x_var = scope->Var("x")->GetMutable(); + float* x_ptr = + x_var->mutable_data(framework::DDim({1, rows_numel}), *place); + for (int64_t i = 0; i < rows_numel; ++i) { + x_ptr[i] = 1.0; + } +} + +void StartSwitchServer( + std::shared_ptr& switch_server_ptr, // NOLINT + std::vector endpoints, + std::vector peer_endpoints) { + switch_server_ptr->SetPeerEndPoints(peer_endpoints); + switch_server_ptr->SetEndPoint(endpoints[0]); + /* + std::shared_ptr b_req_handler; + b_req_handler.reset(new distributed::SendAndRecvVariableHandler()); + switch_server_ptr->SetServiceHandler(b_req_handler); + + switch_server_ptr->SetLocalScope(); + + switch_server_ptr->RegisterServiceHandler( + std::to_string(distributed::PS_SAVE_WITH_SCOPE), + [&](const MultiVarMsg* request, MultiVarMsg* response, + brpc::Controller* cntl) -> int { + return b_req_handler->SaveInSwitchWithScope(request, response, cntl); + }); + + switch_server_ptr->RegisterServiceHandler(std::to_string(distributed::PS_SAVE_WITH_SHARD), + [&](const MultiVarMsg* request, MultiVarMsg* + response, + brpc::Controller* cntl) -> int { + return b_req_handler->SaveInSwitchWithShard( + request, response, cntl); + }); + + switch_server_ptr->RegisterServiceHandler(std::to_string(distributed::PS_QUERY_WITH_SCOPE), + [&](const MultiVarMsg* request, MultiVarMsg* + response, + brpc::Controller* cntl) -> int { + return b_req_handler->QueryInSwitchWithScope( + request, response, cntl); + }); + + switch_server_ptr->RegisterServiceHandler(std::to_string(distributed::PS_QUERY_WITH_SHARD), + [&](const MultiVarMsg* request, MultiVarMsg* + response, + brpc::Controller* cntl) -> int { + return b_req_handler->QueryInSwitchWithShard( + request, response, cntl); + }); + */ + switch_server_ptr->StartHeterService(false); +} + +void StartSwitchInterServer( + std::shared_ptr& switch_server_ptr, // NOLINT + std::vector endpoints, + std::vector peer_endpoints) { + switch_server_ptr->SetPeerEndPoints(peer_endpoints); + switch_server_ptr->SetInterEndpoint(endpoints[1]); + switch_server_ptr->StartHeterInterService(false); +} + +TEST(HETERSENDANDRECV, CPU) { + setenv("http_proxy", "", 1); + setenv("https_proxy", "", 1); + + // 启动 switch server A & B + std::string switch_a_endpoint("127.0.0.1:6000"); + std::string switch_a_endpoint_inter("127.0.0.1:6100"); + std::string switch_b_endpoint_inter("127.0.0.1:7100"); + 
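The test body that follows drives a send/recv round trip through the two switch servers. Condensed, and assuming std::string variable names, a float payload and int64_t per-variable lengths, the exchange looks roughly like this sketch:

// Condensed sketch of the round trip below (assumed element types; error
// handling and logging omitted).
auto heter_client_ptr_ = distributed::HeterClient::GetInstance(
    {switch_a_endpoint, switch_b_endpoint}, {}, 0);

int group_id = 0;
std::vector<std::string> var_names{"w", "x"};
std::vector<int64_t> vars_len{2, 4};                      // length of each section
std::vector<float> values{1.0, 2.0, 3.0, 4.0, 5.0, 6.0};  // flattened payload
int64_t data_size = 6;

// push the flat buffer to the switch service ...
int ret = heter_client_ptr_->Send(group_id, var_names, vars_len,
                                  values.data(), data_size);

// ... then query it back into a buffer of the same size
std::vector<float> recv_values(data_size);
ret = heter_client_ptr_->Recv(group_id, var_names, recv_values.data(),
                              data_size);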
std::string switch_b_endpoint("127.0.0.1:7000"); + + std::shared_ptr switch_server_ptr_a = + std::make_shared(); + std::vector end_points{switch_a_endpoint}; + std::vector peer_endpoints{switch_b_endpoint_inter}; + std::thread switch_server_a_thread(StartSwitchServer, + std::ref(switch_server_ptr_a), end_points, + peer_endpoints); + switch_server_ptr_a->WaitServerReady(); + + std::shared_ptr switch_server_ptr_b = + std::make_shared(); + end_points = {switch_b_endpoint, switch_b_endpoint_inter}; + peer_endpoints = {}; + std::thread switch_server_b_thread(StartSwitchServer, + std::ref(switch_server_ptr_b), end_points, + peer_endpoints); + switch_server_ptr_b->WaitServerReady(); + + end_points = {switch_b_endpoint, switch_b_endpoint_inter}; + peer_endpoints = {}; + std::thread switch_server_b_thread_inter(StartSwitchInterServer, + std::ref(switch_server_ptr_b), + end_points, peer_endpoints); + switch_server_ptr_b->WaitServerReady(); + + // 获取 client 实例 + std::shared_ptr heter_client_ptr_ = + distributed::HeterClient::GetInstance( + {switch_a_endpoint, switch_b_endpoint}, {}, 0); + + platform::CPUPlace place; + platform::CPUDeviceContext ctx(place); + framework::Executor exe(place); + + framework::ProgramDesc program; + exe.Prepare(program, 0); // solve undefined symbol: tensor_table.cc + std::shared_ptr send_scope_ptr = + std::make_shared(); + int64_t rows_numel = 10; + InitTensorsOnClient(send_scope_ptr.get(), &place, rows_numel); + LOG(INFO) << "InitTensorsOnClient done"; + + auto send_async = [&]() -> void { + /* + //std::string message_name = + std::to_string(distributed::PS_SAVE_WITH_SCOPE); + std::string message_name = "send and save"; + std::vector send_var_names{"w", "x"}; + int ret = heter_client_ptr_->Send(ctx, *send_scope_ptr, message_name, + send_var_names); + if (!ret) { + LOG(ERROR) << ">>>> worker send success"; + } + */ + ///* + std::vector vars_len{2, 4}; + std::vector values{1.0, 2.0, 3.0, 4.0, 5.0, 6.0}; + int64_t data_size = 6; + std::vector send_var_names{"w", "x"}; + int group_id = 0; + int ret = heter_client_ptr_->Send(group_id, send_var_names, vars_len, + values.data(), data_size); + if (!ret) { + LOG(INFO) << ">>>> worker send success"; + } + //*/ + }; + std::thread send_thread(send_async); + /* + std::string message_name = std::to_string(distributed::PS_QUERY_WITH_SCOPE); + std::vector recv_var_names{"w", "x"}; + std::shared_ptr recv_scope_ptr = + std::make_shared(); + int ret = heter_client_ptr_->Recv(ctx, *recv_scope_ptr, message_name, + recv_var_names); + if (!ret && recv_scope_ptr->FindVar("w") && recv_scope_ptr->FindVar("x")) { + LOG(INFO) << ">>>> worker recv success"; + } else { + LOG(INFO) << "worker recv failed"; + } + */ + ///* + int group_id = 0; + std::vector recv_var_names{"w", "x"}; + std::vector values; + int data_size = 6; + values.resize(data_size); + int ret = heter_client_ptr_->Recv(group_id, recv_var_names, values.data(), + data_size); + if (!ret) { + VLOG(4) << "queried data is: "; + for (auto f : values) { + VLOG(4) << f << " "; + } + LOG(INFO) << ">>>> worker recv success"; + } + //*/ + + send_thread.join(); + + switch_server_ptr_a->Stop(); + LOG(INFO) << "switch server A stopped"; + + switch_server_ptr_b->Stop(); + LOG(INFO) << "switch server B stopped"; + + switch_server_a_thread.join(); + LOG(INFO) << "switch_server_a_thread joined"; + + switch_server_b_thread.join(); + LOG(INFO) << "switch_server_b_thread joined"; + + switch_server_b_thread_inter.join(); + LOG(INFO) << "switch_server_b_thread_inter joined"; +} +#endif diff --git 
a/paddle/fluid/operators/pscore/heter_listen_and_serv_op.cc b/paddle/fluid/operators/pscore/heter_listen_and_serv_op.cc index 2c443e8c63cbe..2df0d7526a3d3 100644 --- a/paddle/fluid/operators/pscore/heter_listen_and_serv_op.cc +++ b/paddle/fluid/operators/pscore/heter_listen_and_serv_op.cc @@ -88,21 +88,20 @@ void HeterListenAndServOp::RunAsyncLoop(framework::ProgramDesc *program) const { for (size_t blkid = 1; blkid < num_blocks; ++blkid) { block_list.push_back(blkid); } - for (size_t i = 0; i < block_list.size(); ++i) { auto blkid = block_list[i]; auto it = message_to_block_id.find_value(blkid); - rpc_service_->RegisterServiceHandler( + heter_server_->RegisterServiceHandler( it->first, [&](const MultiVarMsg *request, MultiVarMsg *response, brpc::Controller *cntl) -> int { - return request_send_and_recv_handler_->Handle(request, response, - cntl); + return send_and_recv_variable_handler_->Handle(request, response, + cntl); }); } while (true) { - if (rpc_service_->IsExit() || rpc_service_->IsStop()) { - rpc_service_->Stop(); + if (heter_server_->IsExit() || heter_server_->IsStop()) { + heter_server_->Stop(); VLOG(0) << "get exit. rpc_processor stop!"; break; } @@ -110,8 +109,9 @@ void HeterListenAndServOp::RunAsyncLoop(framework::ProgramDesc *program) const { } // while(true) } -void RunServer(std::shared_ptr service) { - service->StartHeterService(); +void RunServer( + std::shared_ptr heter_server_ptr) { + heter_server_ptr->StartHeterService(); } void HeterListenAndServOp::RunImpl(const framework::Scope &scope, @@ -126,16 +126,16 @@ void HeterListenAndServOp::RunImpl(const framework::Scope &scope, auto fan_in = Attr("fanin"); auto inputs = Inputs("X"); - PADDLE_ENFORCE_EQ(rpc_service_, nullptr, + PADDLE_ENFORCE_EQ(heter_server_, nullptr, platform::errors::PreconditionNotMet( "RPC service has been created unexpectedly.")); std::string endpoint = Attr("endpoint"); VLOG(4) << "pserver_id: " << pserver_id << ", end_point:" << endpoint; - rpc_service_ = distributed::HeterServer::GetInstance(); - rpc_service_->SetEndPoint(endpoint); - rpc_service_->SetFanin(fan_in); + heter_server_ = distributed::HeterServer::GetInstance(); + heter_server_->SetEndPoint(endpoint); + heter_server_->SetFanin(fan_in); auto optimize_blocks = Attr>("optimize_blocks"); @@ -146,20 +146,18 @@ void HeterListenAndServOp::RunImpl(const framework::Scope &scope, auto *program = optimize_blocks[0]->Program(); - request_send_and_recv_handler_.reset( - new distributed::RequestSendAndRecvHandler()); - request_send_and_recv_handler_->SetScope(&scope); - request_send_and_recv_handler_->SetDevCtx(&dev_ctx); - rpc_service_->SetRequestHandler(request_send_and_recv_handler_); + send_and_recv_variable_handler_.reset( + new distributed::SendAndRecvVariableHandler()); + send_and_recv_variable_handler_->SetScope(&scope); + send_and_recv_variable_handler_->SetDevCtx(&dev_ctx); + heter_server_->SetServiceHandler(send_and_recv_variable_handler_); VLOG(2) << "RunAsyncLoop"; - auto message_to_block_id_str = - Attr>("message_to_block_id"); // start the server listening after all member initialized. 
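The hunks in this file rename rpc_service_ to heter_server_ and RequestSendAndRecvHandler to SendAndRecvVariableHandler. Put together, the bring-up path now reads roughly as the sketch below; every identifier is taken from these hunks, with message_name standing in for the it->first key used in RunAsyncLoop:

// Condensed sketch of the renamed bring-up path (error handling omitted).
heter_server_ = distributed::HeterServer::GetInstance();
heter_server_->SetEndPoint(endpoint);
heter_server_->SetFanin(fan_in);

send_and_recv_variable_handler_.reset(
    new distributed::SendAndRecvVariableHandler());
send_and_recv_variable_handler_->SetScope(&scope);
send_and_recv_variable_handler_->SetDevCtx(&dev_ctx);
heter_server_->SetServiceHandler(send_and_recv_variable_handler_);

// every registered message id is served by the same variable handler
heter_server_->RegisterServiceHandler(
    message_name, [&](const MultiVarMsg* request, MultiVarMsg* response,
                      brpc::Controller* cntl) -> int {
      return send_and_recv_variable_handler_->Handle(request, response, cntl);
    });

// the listening thread starts only after the members above are initialized
server_thread_.reset(new std::thread(RunServer, heter_server_));
heter_server_->WaitServerReady();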
- server_thread_.reset(new std::thread(RunServer, rpc_service_)); + server_thread_.reset(new std::thread(RunServer, heter_server_)); VLOG(3) << "wait server thread to become ready..."; - rpc_service_->WaitServerReady(); + heter_server_->WaitServerReady(); RunAsyncLoop(program); VLOG(3) << "Wait for Server_thread_ stop"; (server_thread_.get())->join(); diff --git a/paddle/fluid/operators/pscore/heter_listen_and_serv_op.h b/paddle/fluid/operators/pscore/heter_listen_and_serv_op.h old mode 100644 new mode 100755 index 2d2d8abe70627..3ecff083b00c7 --- a/paddle/fluid/operators/pscore/heter_listen_and_serv_op.h +++ b/paddle/fluid/operators/pscore/heter_listen_and_serv_op.h @@ -34,7 +34,7 @@ limitations under the License. */ namespace paddle { namespace distributed { -class HeterRequestHandler; +class ServiceHandlerBase; class HeterServer; } // namespace distributed } // namespace paddle @@ -82,10 +82,10 @@ class HeterListenAndServOp : public framework::OperatorBase { const platform::Place& dev_place) const override; protected: - mutable std::shared_ptr rpc_service_; + mutable std::shared_ptr heter_server_; mutable std::shared_ptr server_thread_; - mutable std::shared_ptr - request_send_and_recv_handler_; + mutable std::shared_ptr + send_and_recv_variable_handler_; }; } // namespace operators diff --git a/paddle/fluid/operators/pscore/heter_listen_and_server_test.cc b/paddle/fluid/operators/pscore/heter_listen_and_server_test.cc index b024fe76b0972..ab2fcba51062f 100644 --- a/paddle/fluid/operators/pscore/heter_listen_and_server_test.cc +++ b/paddle/fluid/operators/pscore/heter_listen_and_server_test.cc @@ -142,7 +142,7 @@ void InitTensorsOnServer(framework::Scope* scope, platform::CPUPlace* place, CreateVarsOnScope(scope, place); } -void StartHeterServer(std::string endpoint) { +void RunHeterServerOp(std::string endpoint) { framework::ProgramDesc program; framework::Scope scope; platform::CPUPlace place; @@ -167,10 +167,10 @@ TEST(HETER_LISTEN_AND_SERV, CPU) { std::string previous_endpoint = endpoint; LOG(INFO) << "before StartSendAndRecvServer"; FLAGS_eager_delete_tensor_gb = -1; - std::thread server_thread(StartHeterServer, endpoint); + std::thread server_thread(RunHeterServerOp, endpoint); sleep(1); - auto b_rpc_service = distributed::HeterServer::GetInstance(); - b_rpc_service->WaitServerReady(); + auto heter_server_ptr_ = distributed::HeterServer::GetInstance(); + heter_server_ptr_->WaitServerReady(); using MicroScope = std::unordered_map>>; using MiniScope = std::unordered_map; @@ -185,8 +185,8 @@ TEST(HETER_LISTEN_AND_SERV, CPU) { (*micro_scope).push_back(micro_scope_0); (*micro_scope).push_back(micro_scope_1); (*micro_scopes)[0] = micro_scope; - b_rpc_service->SetMicroBatchScopes(micro_scopes); - b_rpc_service->SetMiniBatchScopes(mini_scopes); + heter_server_ptr_->SetMicroBatchScopes(micro_scopes); + heter_server_ptr_->SetMiniBatchScopes(mini_scopes); using TaskQueue = std::unordered_map>>(); - b_rpc_service->SetTaskQueue(task_queue_); + heter_server_ptr_->SetTaskQueue(task_queue_); LOG(INFO) << "before HeterClient::GetInstance"; - distributed::HeterClient* rpc_client = + distributed::HeterClient* heter_client_ptr_ = distributed::HeterClient::GetInstance({endpoint}, {previous_endpoint}, 0) .get(); - PADDLE_ENFORCE_NE(rpc_client, nullptr, - platform::errors::InvalidArgument( - "Client Start Fail, Check Your Code & Env")); - framework::Scope* scope = (*micro_scope)[0]; platform::CPUPlace place; platform::CPUDeviceContext ctx(place); @@ -224,8 +220,8 @@ TEST(HETER_LISTEN_AND_SERV, CPU) { 
std::vector recv_var = {}; LOG(INFO) << "before SendAndRecvAsync"; - rpc_client->SendAndRecvAsync(ctx, *scope, in_var_name, send_var, recv_var, - "forward"); + heter_client_ptr_->SendAndRecvAsync(ctx, *scope, in_var_name, send_var, + recv_var, "forward"); auto task = (*task_queue_)[0]->Pop(); PADDLE_ENFORCE_EQ( task.first, "x", @@ -234,15 +230,15 @@ TEST(HETER_LISTEN_AND_SERV, CPU) { InitTensorsOnClient2((*micro_scope)[1], &place, rows_numel); LOG(INFO) << "before SendAndRecvAsync 2"; - rpc_client->SendAndRecvAsync(ctx, *((*micro_scope)[1]), in_var_name, send_var, - recv_var, "backward"); + heter_client_ptr_->SendAndRecvAsync(ctx, *((*micro_scope)[1]), in_var_name, + send_var, recv_var, "backward"); auto task2 = (*task_queue_)[0]->Pop(); PADDLE_ENFORCE_EQ( task2.first, "x", platform::errors::InvalidArgument( "Recv message and Send message name not match, Check your Code")); - rpc_client->Stop(); + heter_client_ptr_->Stop(); LOG(INFO) << "end server Stop"; server_thread.join(); LOG(INFO) << "end server thread join"; diff --git a/paddle/fluid/operators/pscore/heter_server_test.cc b/paddle/fluid/operators/pscore/heter_server_test.cc index 6ab4204b2f9df..d4ee00d10a50b 100644 --- a/paddle/fluid/operators/pscore/heter_server_test.cc +++ b/paddle/fluid/operators/pscore/heter_server_test.cc @@ -34,8 +34,6 @@ using VarMsg = ::paddle::distributed::VariableMessage; USE_OP_ITSELF(scale); -std::shared_ptr b_rpc_service; - std::string get_ip_port() { std::mt19937 rng; rng.seed(std::random_device()()); @@ -171,31 +169,32 @@ void StartSendAndRecvServer(std::string endpoint) { InitTensorsOnServer(&scope, &place, 10); LOG(INFO) << "end InitTensorsOnServer"; - std::shared_ptr b_req_handler; - b_req_handler.reset(new distributed::RequestSendAndRecvHandler()); + std::shared_ptr b_req_handler; + b_req_handler.reset(new distributed::SendAndRecvVariableHandler()); LOG(INFO) << "before SetDevCtx"; b_req_handler->SetDevCtx(&ctx); LOG(INFO) << "before SetScope"; b_req_handler->SetScope(&scope); LOG(INFO) << "before HeterServer::GetInstance"; - b_rpc_service = distributed::HeterServer::GetInstance(); - b_rpc_service->SetEndPoint(endpoint); + std::shared_ptr heter_server_ptr_ = + distributed::HeterServer::GetInstance(); + heter_server_ptr_->SetEndPoint(endpoint); LOG(INFO) << "before HeterServer::RegisterServiceHandler"; - b_rpc_service->RegisterServiceHandler( + heter_server_ptr_->RegisterServiceHandler( in_var_name, [&](const MultiVarMsg* request, MultiVarMsg* response, brpc::Controller* cntl) -> int { return b_req_handler->Handle(request, response, cntl); }); - b_rpc_service->RegisterServiceHandler( + heter_server_ptr_->RegisterServiceHandler( in_var_name2, [&](const MultiVarMsg* request, MultiVarMsg* response, brpc::Controller* cntl) -> int { return b_req_handler->Handle(request, response, cntl); }); - b_rpc_service->SetRequestHandler(b_req_handler); + heter_server_ptr_->SetServiceHandler(b_req_handler); LOG(INFO) << "before HeterServer::RunServer"; - RunServer(b_rpc_service); - // std::thread server_thread(std::bind(RunServer, b_rpc_service)); + RunServer(heter_server_ptr_); + // std::thread server_thread(std::bind(RunServer, heter_server_ptr_)); // server_thread.join(); } @@ -206,9 +205,10 @@ TEST(SENDANDRECV, CPU) { std::string endpoint = get_ip_port(); std::string previous_endpoint = endpoint; LOG(INFO) << "before StartSendAndRecvServer"; - b_rpc_service = distributed::HeterServer::GetInstance(); + std::shared_ptr heter_server_ptr_ = + distributed::HeterServer::GetInstance(); std::thread 
server_thread(StartSendAndRecvServer, endpoint); - b_rpc_service->WaitServerReady(); + heter_server_ptr_->WaitServerReady(); using MicroScope = std::unordered_map>>; using MiniScope = std::unordered_map; @@ -223,8 +223,8 @@ TEST(SENDANDRECV, CPU) { (*micro_scope).push_back(micro_scope_0); (*micro_scope).push_back(micro_scope_1); (*micro_scopes)[0] = micro_scope; - b_rpc_service->SetMicroBatchScopes(micro_scopes); - b_rpc_service->SetMiniBatchScopes(mini_scopes); + heter_server_ptr_->SetMicroBatchScopes(micro_scopes); + heter_server_ptr_->SetMiniBatchScopes(mini_scopes); using TaskQueue = std::unordered_map>>(); - b_rpc_service->SetTaskQueue(task_queue_); + heter_server_ptr_->SetTaskQueue(task_queue_); LOG(INFO) << "before HeterClient::GetInstance"; - distributed::HeterClient* rpc_client = + distributed::HeterClient* heter_client_ptr_ = distributed::HeterClient::GetInstance({endpoint}, {previous_endpoint}, 0) .get(); - PADDLE_ENFORCE_NE(rpc_client, nullptr, - platform::errors::InvalidArgument( - "Client Start Fail, Check Your Code & Env")); - framework::Scope* scope = (*micro_scope)[0]; platform::CPUPlace place; platform::CPUDeviceContext ctx(place); @@ -262,8 +258,8 @@ TEST(SENDANDRECV, CPU) { std::vector recv_var = {}; LOG(INFO) << "before SendAndRecvAsync"; - rpc_client->SendAndRecvAsync(ctx, *scope, in_var_name, send_var, recv_var, - "forward"); + heter_client_ptr_->SendAndRecvAsync(ctx, *scope, in_var_name, send_var, + recv_var, "forward"); LOG(INFO) << "client wait for Pop"; auto task = (*task_queue_)[0]->Pop(); @@ -276,8 +272,8 @@ TEST(SENDANDRECV, CPU) { InitTensorsOnClient2((*micro_scope)[1], &place, rows_numel); LOG(INFO) << "before SendAndRecvAsync 2"; std::string in_var_name2("y"); - rpc_client->SendAndRecvAsync(ctx, *((*micro_scope)[1]), in_var_name2, - send_var, recv_var, "backward"); + heter_client_ptr_->SendAndRecvAsync(ctx, *((*micro_scope)[1]), in_var_name2, + send_var, recv_var, "backward"); LOG(INFO) << "after SendAndRecvAsync 2"; auto task2 = (*task_queue_)[0]->Pop(); @@ -286,8 +282,7 @@ TEST(SENDANDRECV, CPU) { platform::errors::InvalidArgument( "Recv message and Send message name not match, Check your Code")); - rpc_client->FinalizeWorker(); - b_rpc_service->Stop(); + heter_server_ptr_->Stop(); LOG(INFO) << "end server Stop"; server_thread.join(); LOG(INFO) << "end server thread join"; diff --git a/paddle/fluid/operators/pscore/send_and_recv_op_cpu_test.cc b/paddle/fluid/operators/pscore/send_and_recv_op_cpu_test.cc old mode 100644 new mode 100755 index 26da0d3696fdf..7c25d38d1ebad --- a/paddle/fluid/operators/pscore/send_and_recv_op_cpu_test.cc +++ b/paddle/fluid/operators/pscore/send_and_recv_op_cpu_test.cc @@ -36,8 +36,6 @@ using VarMsg = ::paddle::distributed::VariableMessage; USE_OP_ITSELF(scale); USE_OP(send_and_recv); -std::shared_ptr b_rpc_service; - std::string get_ip_port() { std::mt19937 rng; rng.seed(std::random_device()()); @@ -148,14 +146,15 @@ void StartSendAndRecvServer(std::string endpoint) { InitTensorsOnServer(&scope, &place, 10); LOG(INFO) << "end InitTensorsOnServer"; - std::shared_ptr b_req_handler; - b_req_handler.reset(new distributed::RequestSendAndRecvHandler()); + std::shared_ptr b_req_handler; + b_req_handler.reset(new distributed::SendAndRecvVariableHandler()); LOG(INFO) << "before SetDevCtx"; b_req_handler->SetDevCtx(&ctx); LOG(INFO) << "before SetScope"; b_req_handler->SetScope(&scope); LOG(INFO) << "before HeterServer::GetInstance"; - b_rpc_service = distributed::HeterServer::GetInstance(); + std::shared_ptr b_rpc_service = + 
distributed::HeterServer::GetInstance(); b_rpc_service->SetEndPoint(endpoint); LOG(INFO) << "before HeterServer::RegisterServiceHandler"; b_rpc_service->RegisterServiceHandler( @@ -164,7 +163,7 @@ void StartSendAndRecvServer(std::string endpoint) { return b_req_handler->Handle(request, response, cntl); }); - b_rpc_service->SetRequestHandler(b_req_handler); + b_rpc_service->SetServiceHandler(b_req_handler); LOG(INFO) << "before HeterServer::RunServer"; RunServer(b_rpc_service); @@ -179,7 +178,8 @@ TEST(SENDANDRECV, CPU) { std::string endpoint = get_ip_port(); std::string previous_endpoint = endpoint; LOG(INFO) << "before StartSendAndRecvServer"; - b_rpc_service = distributed::HeterServer::GetInstance(); + std::shared_ptr b_rpc_service = + distributed::HeterServer::GetInstance(); std::thread server_thread(StartSendAndRecvServer, endpoint); b_rpc_service->WaitServerReady(); using MicroScope = @@ -292,7 +292,6 @@ TEST(SENDANDRECV, CPU) { platform::errors::InvalidArgument( "Recv message and Send message name not match, Check your Code")); - rpc_client->FinalizeWorker(); b_rpc_service->Stop(); LOG(INFO) << "end server Stop"; server_thread.join(); diff --git a/paddle/fluid/operators/pscore/send_and_recv_op_gpu_test.cc b/paddle/fluid/operators/pscore/send_and_recv_op_gpu_test.cc index a5e292a05e1ff..4054846460b07 100644 --- a/paddle/fluid/operators/pscore/send_and_recv_op_gpu_test.cc +++ b/paddle/fluid/operators/pscore/send_and_recv_op_gpu_test.cc @@ -167,8 +167,8 @@ void StartSendAndRecvServer(std::string endpoint) { InitTensorsOnServer(&scope, &place, 10); LOG(INFO) << "end InitTensorsOnServer"; - std::shared_ptr b_req_handler; - b_req_handler.reset(new distributed::RequestSendAndRecvHandler()); + std::shared_ptr b_req_handler; + b_req_handler.reset(new distributed::SendAndRecvVariableHandler()); LOG(INFO) << "before SetDevCtx"; b_req_handler->SetDevCtx(&ctx); LOG(INFO) << "before SetScope"; @@ -183,7 +183,7 @@ void StartSendAndRecvServer(std::string endpoint) { return b_req_handler->Handle(request, response, cntl); }); - b_rpc_service2->SetRequestHandler(b_req_handler); + b_rpc_service2->SetServiceHandler(b_req_handler); LOG(INFO) << "before HeterServer::RunServer"; RunServer(b_rpc_service2); @@ -228,13 +228,11 @@ TEST(SENDANDRECV, GPU) { b_rpc_service2->SetTaskQueue(task_queue_); LOG(INFO) << "before HeterClient::GetInstance"; - distributed::HeterClient* rpc_client = - distributed::HeterClient::GetInstance({endpoint}, {previous_endpoint}, 0) - .get(); - - PADDLE_ENFORCE_NE(rpc_client, nullptr, - platform::errors::InvalidArgument( - "Client Start Fail, Check Your Code & Env")); + std::shared_ptr heter_client_ptr_ = + distributed::HeterClient::GetInstance({endpoint}, {previous_endpoint}, 0); + if (heter_client_ptr_ == nullptr) { + LOG(ERROR) << "heter_client_ptr_ is null"; + } framework::Scope* scope = (*micro_scope)[0]; platform::CUDAPlace place; @@ -316,7 +314,6 @@ TEST(SENDANDRECV, GPU) { platform::errors::InvalidArgument( "Recv message and Send message name not match, Check your Code")); - rpc_client->FinalizeWorker(); b_rpc_service2->Stop(); LOG(INFO) << "end server Stop"; server_thread.join(); diff --git a/paddle/fluid/operators/pscore/send_op.cc b/paddle/fluid/operators/pscore/send_op.cc index 5b4a641f290d1..4ca99115be1ab 100644 --- a/paddle/fluid/operators/pscore/send_op.cc +++ b/paddle/fluid/operators/pscore/send_op.cc @@ -47,7 +47,7 @@ class SendOp : public framework::OperatorBase { auto send_varnames = Attr>("send_varnames"); - // for common_dense_table, distributed_push_sparse op for 
push sparse in + // for memory_dense_table, distributed_push_sparse op for push sparse in // async if (is_sparse == 0 && send_varnames.size() >= 1 && send_varnames[0] != "@PS_STEP_COUNTER@") { diff --git a/paddle/fluid/operators/quantize_linear_op.cc b/paddle/fluid/operators/quantize_linear_op.cc new file mode 100644 index 0000000000000..4039f0e9d07e1 --- /dev/null +++ b/paddle/fluid/operators/quantize_linear_op.cc @@ -0,0 +1,173 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/operators/quantize_linear_op.h" +#include +#include +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_version_registry.h" +#include "paddle/fluid/platform/transform.h" +#include "paddle/phi/kernels/funcs/blas/blas.h" +#include "paddle/phi/kernels/impl/clip_kernel_impl.h" + +namespace paddle { +namespace operators { + +template +struct ChannelDequantizeFunctorV2 { + void operator()(const platform::CPUDeviceContext& dev_ctx, + const framework::Tensor* in, const framework::Tensor* scale, + T max_range, const int quant_axis, framework::Tensor* out) { + // Dequant op is before quantized op + // Dequantize the weight of quantized op + auto in_dims = in->dims(); + const int64_t channel = in_dims[quant_axis]; + const T* scale_factor = scale->data(); + if (quant_axis == 0) { + for (int64_t i = 0; i < channel; i++) { + T s = scale_factor[i]; + framework::Tensor one_channel_in = in->Slice(i, i + 1); + framework::Tensor one_channel_out = out->Slice(i, i + 1); + auto in_e = framework::EigenVector::Flatten(one_channel_in); + auto out_e = framework::EigenVector::Flatten(one_channel_out); + auto& dev = *dev_ctx.eigen_device(); + out_e.device(dev) = in_e * s / max_range; + } + } else if (quant_axis == 1) { + int64_t out_iter = 1; + for (int i = 0; i < quant_axis; i++) { + out_iter *= in_dims[i]; + } + int64_t step_i = in->numel() / out_iter; + int64_t step_j = in->numel() / (out_iter * channel); + auto* in_data = in->data(); + auto* out_data = out->mutable_data(dev_ctx.GetPlace()); + for (int64_t i = 0; i < out_iter; i++) { + for (int64_t j = 0; j < channel; j++) { + auto* cur_in = in_data + i * step_i + j * step_j; + auto* cur_out = out_data + i * step_i + j * step_j; + T s = scale_factor[j]; + for (int64_t k = 0; k < step_j; k++) { + *cur_out = (*cur_in) * s / max_range; + ++cur_in; + ++cur_out; + } + } + } + } + } +}; + +template struct DequantizeFunctor; +template struct DequantizeFunctor; +template struct ChannelDequantizeFunctorV2; +template struct ChannelDequantizeFunctorV2; + +class QuantizeLinearOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + void InferShape(framework::InferShapeContext* ctx) const override { + OP_INOUT_CHECK(ctx->HasInput("X"), "Input", "X", "QuantizeLinear"); + OP_INOUT_CHECK(ctx->HasInput("Scale"), "Input", "Scale", "QuantizeLinear"); + OP_INOUT_CHECK(ctx->HasInput("ZeroPoint"), "Input", "ZeroPoint", + "QuantizeLinear"); + 
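// Reviewer annotation (not part of the patch): shapes established by the
// checks and SetOutputDim calls that follow. Y mirrors X; OutScale is {1}
// for per-tensor quantization (quant_axis < 0) and {X.dims()[quant_axis]}
// for per-channel quantization. For example, a conv2d weight of shape
// [32, 64, 3, 3] quantized with quant_axis = 0 yields OutScale of shape [32].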
OP_INOUT_CHECK(ctx->HasOutput("Y"), "Output", "Y", "QuantizeLinear"); + ctx->SetOutputDim("Y", ctx->GetInputDim("X")); + int quant_axis = ctx->Attrs().Get("quant_axis"); + if (ctx->HasOutput("OutScale")) { + if (quant_axis < 0) { + ctx->SetOutputDim("OutScale", {1}); + } else { + ctx->SetOutputDim("OutScale", {ctx->GetInputDim("X")[quant_axis]}); + } + } + ctx->ShareLoD("X", /*->*/ "Y"); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + OperatorWithKernel::IndicateVarDataType(ctx, "X"), ctx.GetPlace()); + } +}; + +class QuantizeLinearOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", "(Tensor) Input is float data type."); + AddInput("Scale", "(Tensor) Input is float data type."); + AddInput("ZeroPoint", "(Tensor) Input is float data type."); + AddOutput("Y", + "(Tensor) Output of quantized low level tensor, " + "but also saved as float data type."); + AddOutput("OutScale", "(Tensor) Current scale").AsDispensable().AsExtra(); + AddAttr("quant_axis", + "(int, default 0) The axis for quantization. " + "For conv2d, depthwise_conv2d, conv2d_transpose " + "and mul, the quant_axis is equal to the cout axis.") + .SetDefault(0) + .AddCustomChecker([](const int& quant_axis) { + PADDLE_ENFORCE_EQ( + quant_axis == 0 || quant_axis == 1 || quant_axis == -1, true, + platform::errors::InvalidArgument( + "'quant_axis' should be 0 or 1, but " + "the received is %d", + quant_axis)); + }); + AddAttr("bit_length", "(int, default 8)") + .SetDefault(8) + .AddCustomChecker([](const int& bit_length) { + PADDLE_ENFORCE_EQ(bit_length >= 1 && bit_length <= 16, true, + platform::errors::InvalidArgument( + "'bit_length' should be between 1 and 16, but " + "the received is %d", + bit_length)); + }); + AddAttr("is_test", + "(bool, default false) Set to true for inference only, false " + "for training. Some layers may run faster when this is true.") + .SetDefault(true); + AddComment(R"DOC( +The scale of QuantizeLinear operator is a vector. +In detail, each channel of the input X has a scale value. +$$scale_c = max(abs(X_c))$$ +$$range = 2^{bit\_length - 1} - 1$$ +$$Out_c = round(\frac{X_c * range} {scale_c})$$ +In above three formulas, the range value of c is as follow: +$$0 \leq c \lt \ the\ channel\ number\ of\ X$$ +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +using CPU = paddle::platform::CPUDeviceContext; + +REGISTER_OPERATOR( + quantize_linear, ops::QuantizeLinearOp, ops::QuantizeLinearOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); + +REGISTER_OP_CPU_KERNEL(quantize_linear, ops::QuantizeLinearKernel); + +REGISTER_OPERATOR( + dequantize_linear, ops::QuantizeLinearOp, ops::QuantizeLinearOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); + +REGISTER_OP_CPU_KERNEL(dequantize_linear, + ops::DeQuantizeLinearKernel, + ops::DeQuantizeLinearKernel, + ops::DeQuantizeLinearKernel); diff --git a/paddle/fluid/operators/quantize_linear_op.cu b/paddle/fluid/operators/quantize_linear_op.cu new file mode 100644 index 0000000000000..6c7e430f51126 --- /dev/null +++ b/paddle/fluid/operators/quantize_linear_op.cu @@ -0,0 +1,70 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/operators/fake_dequantize_op.cu.h" +#include "paddle/fluid/operators/fake_quantize_op.cu.h" +#include "paddle/fluid/operators/quantize_linear_op.h" +#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" + +namespace paddle { +namespace operators { + +template +struct ChannelDequantizeFunctorV2 { + void operator()(const platform::CUDADeviceContext& dev_ctx, + const framework::Tensor* in, const framework::Tensor* scale, + T max_range, const int quant_axis, framework::Tensor* out) { + auto in_dims = in->dims(); + const T* in_data = in->data(); + T* out_data = out->mutable_data(dev_ctx.GetPlace()); + int64_t num = in->numel(); + const T* scale_factor = scale->data(); + int64_t block_size = std::min( + num, static_cast(dev_ctx.GetMaxThreadsPerBlock() / 4)); + int64_t max_threads = + dev_ctx.GetMaxPhysicalThreadCount(); // SM * block_per_SM + const int64_t max_blocks = + std::max(((max_threads - 1) / block_size + 1), static_cast(1)); + const int64_t grid_size = + std::min(max_blocks, (num + block_size - 1) / block_size); + + int quant_stride = 1; + for (int i = quant_axis + 1; i < in_dims.size(); i++) { + quant_stride *= in_dims[i]; + } + + DequantizeOneScaleQuantAxisN< + T><<>>( + in_data, scale_factor, max_range, num, in_dims[quant_axis], + quant_stride, out_data); + } +}; + +template struct ChannelDequantizeFunctorV2; +template struct ChannelDequantizeFunctorV2; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +using CUDA = paddle::platform::CUDADeviceContext; +REGISTER_OP_CUDA_KERNEL(dequantize_linear, + ops::DeQuantizeLinearKernel, + ops::DeQuantizeLinearKernel, + ops::DeQuantizeLinearKernel); + +REGISTER_OP_CUDA_KERNEL(quantize_linear, + ops::QuantizeLinearKernel); diff --git a/paddle/fluid/operators/quantize_linear_op.h b/paddle/fluid/operators/quantize_linear_op.h new file mode 100644 index 0000000000000..e20b99e85f0b3 --- /dev/null +++ b/paddle/fluid/operators/quantize_linear_op.h @@ -0,0 +1,119 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#pragma once + +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/memory/malloc.h" +#include "paddle/fluid/operators/fake_dequantize_op.h" +#include "paddle/fluid/operators/fake_quantize_op.h" +#include "paddle/fluid/platform/transform.h" +#include "paddle/phi/common/data_type.h" +#include "paddle/phi/core/ddim.h" +#include "paddle/phi/core/hostdevice.h" +#include "paddle/phi/kernels/cast_kernel.h" + +namespace paddle { +namespace operators { + +template +struct ChannelDequantizeFunctorV2 { + void operator()(const DeviceContext& dev_ctx, const framework::Tensor* in, + const framework::Tensor** scales, const int scale_num, + T max_range, const int quant_axis, framework::Tensor* out); +}; + +template +class QuantizeLinearKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto* in = context.Input("X"); + auto* in_scale = context.Input("Scale"); + + auto* out = context.Output("Y"); + out->mutable_data(context.GetPlace()); + int bit_length = context.Attr("bit_length"); + int bin_cnt = std::pow(2, bit_length - 1) - 1; + int quant_axis = context.Attr("quant_axis"); + bool is_test = context.Attr("is_test"); + auto& dev_ctx = context.template device_context(); + + if (quant_axis < 0) { + if (!is_test) { + auto* out_scale = context.Output("OutScale"); + T* out_s = out_scale->mutable_data(context.GetPlace()); + FindAbsMaxFunctor()(dev_ctx, in->data(), + in->numel(), out_s); + ClipAndFakeQuantFunctor()(dev_ctx, *in, *out_scale, + bin_cnt, out); + } else { + ClipAndFakeQuantFunctor()(dev_ctx, *in, *in_scale, + bin_cnt, out); + } + } else { + if (!is_test) { + auto* out_scale = context.Output("OutScale"); + T* out_scale_data = out_scale->mutable_data(context.GetPlace()); + FindChannelAbsMaxFunctor()(dev_ctx, *in, quant_axis, + out_scale_data); + ChannelClipAndFakeQuantFunctor()( + dev_ctx, *in, *out_scale, bin_cnt, quant_axis, out); + } else { + ChannelClipAndFakeQuantFunctor()( + dev_ctx, *in, *in_scale, bin_cnt, quant_axis, out); + } + } + } +}; + +template +class DeQuantizeLinearKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& context) const override { + auto& dev_ctx = context.template device_context(); + auto* in = context.Input("X"); + + auto in_tmp = phi::Cast( + static_cast::TYPE&>(dev_ctx), + *in, experimental::CppTypeToDataType::Type()); + + auto* scale = context.Input("Scale"); + auto* out = context.Output("Y"); + int bit_length = context.Attr("bit_length"); + auto quant_axis = context.Attr("quant_axis"); + out->mutable_data(dev_ctx.GetPlace()); + + if (quant_axis < 0) { + float max_range = (std::pow(2, bit_length - 1) - 1); + DequantizeFunctor()(dev_ctx, &in_tmp, scale, + static_cast(max_range), out); + } else { + PADDLE_ENFORCE_EQ( + scale->numel(), in_tmp.dims()[quant_axis], + platform::errors::PreconditionNotMet( + "The number of first scale values must be the same with " + "quant_axis dimension value of Input(X) when the `scale` has " + "only one element, but %ld != %ld here.", + scale->numel(), in_tmp.dims()[quant_axis])); + int max_range = (std::pow(2, bit_length - 1) - 1); + + ChannelDequantizeFunctorV2()( + dev_ctx, &in_tmp, scale, static_cast(max_range), quant_axis, out); + } + } +}; + +} // namespace operators +} // namespace paddle diff --git a/paddle/fluid/operators/range_op.cc b/paddle/fluid/operators/range_op.cc index ddfbdbace054d..80fdb2ce6c345 100644 --- 
a/paddle/fluid/operators/range_op.cc +++ b/paddle/fluid/operators/range_op.cc @@ -61,6 +61,6 @@ class RangeOpMaker : public framework::OpProtoAndCheckerMaker { namespace ops = paddle::operators; DECLARE_INFER_SHAPE_FUNCTOR(range, RangeInferMetaFunctor, - PD_INFER_META(phi::RangeInferMeta)); + PD_INFER_META(phi::ArangeInferMeta)); REGISTER_OP_WITHOUT_GRADIENT(range, ops::RangeOp, ops::RangeOpMaker, RangeInferMetaFunctor); diff --git a/paddle/fluid/operators/reshape_op.cc b/paddle/fluid/operators/reshape_op.cc index 0befc873ed696..8ccd1b26a3817 100644 --- a/paddle/fluid/operators/reshape_op.cc +++ b/paddle/fluid/operators/reshape_op.cc @@ -441,26 +441,27 @@ class ReshapeDoubleGradKernel { public: void operator()(const framework::ExecutionContext &ctx) const { auto *dd_x = ctx.Input("DDX"); + auto *d_out = ctx.Input("DOut"); auto *dd_out = ctx.Output("DDOut"); dd_out->mutable_data(ctx.GetPlace(), dd_x->type()); if (platform::is_cpu_place(ctx.GetPlace())) { auto &dev_ctx = ctx.device_context(); phi::ReshapeDoubleGradKernel( - static_cast(dev_ctx), *dd_x, dd_out); + static_cast(dev_ctx), *d_out, *dd_x, dd_out); } #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) if (platform::is_gpu_place(ctx.GetPlace())) { auto &dev_ctx = ctx.device_context(); phi::ReshapeDoubleGradKernel( - static_cast(dev_ctx), *dd_x, dd_out); + static_cast(dev_ctx), *d_out, *dd_x, dd_out); } #endif #ifdef PADDLE_WITH_XPU if (platform::is_xpu_place(ctx.GetPlace())) { auto &dev_ctx = ctx.device_context(); phi::ReshapeDoubleGradKernel( - static_cast(dev_ctx), *dd_x, dd_out); + static_cast(dev_ctx), *d_out, *dd_x, dd_out); } #endif } @@ -658,7 +659,7 @@ REGISTER_OPERATOR(reshape2_grad, ops::Reshape2GradOp, DECLARE_INFER_SHAPE_FUNCTOR(reshape2_grad_grad, Reshape2DoubleGradInferShapeFunctor, - PD_INFER_META(phi::GeneralUnaryGradInferMeta)); + PD_INFER_META(phi::ReshapeDoubleGradInferMeta)); REGISTER_OPERATOR(reshape2_grad_grad, ops::Reshape2DoubleGradOp, ops::ReshapeDoubleGradInplaceInferer, diff --git a/paddle/fluid/operators/rnn_op_xpu.cc b/paddle/fluid/operators/rnn_op_xpu.cc index 183f83dbae7c3..2dee4e889f739 100644 --- a/paddle/fluid/operators/rnn_op_xpu.cc +++ b/paddle/fluid/operators/rnn_op_xpu.cc @@ -1,4 +1,4 @@ -/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at @@ -13,6 +13,7 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/utils.h" +#include "paddle/fluid/platform/device/device_wrapper.h" #include "paddle/fluid/platform/device/xpu/xpu_header.h" #include "paddle/fluid/platform/device_context.h" @@ -21,9 +22,7 @@ namespace operators { using Tensor = framework::Tensor; using DDim = framework::DDim; - using TensorList = std::vector; - template void reset_parameter_vector(const std::vector& raw_params_vec, const int& num_layers, const bool& is_bidirec, @@ -51,54 +50,89 @@ void reset_parameter_vector(const std::vector& raw_params_vec, } } +template +void RunLSTMLayer(const framework::ExecutionContext& ctx, int seq_len, + int batch_size, int xdim, int hidden_size, const T* x, T* y, + const T* init_h, const T* init_c, T* last_h, T* last_c, + int state_offset, const std::vector& seq_len_tensor, + const std::vector& param_list, T* i_f_g_o, T* c, + bool is_bidirect, int layer_idx, int offset) { + bool is_reverse = false; + if (is_bidirect) { + layer_idx = 2 * layer_idx + offset; + if (offset > 0) { + is_reverse = true; + } + } + auto w_x = param_list[0 + offset * 4]; + auto w_h = param_list[1 + offset * 4]; + auto b_x = param_list[2 + offset * 4]; + auto b_h = param_list[3 + offset * 4]; + + auto h_0 = init_h + layer_idx * state_offset; + auto c_0 = init_c + layer_idx * state_offset; + auto last_h_ptr = last_h + layer_idx * state_offset; + auto last_c_ptr = last_c + layer_idx * state_offset; + auto& dev_ctx = ctx.template device_context(); + int r = xpu::lstm_train( + dev_ctx.x_context(), (const T*)x, (const T*)h_0, (const T*)c_0, + (const T*)w_x, (const T*)w_h, (const T*)b_x, (const T*)b_h, + reinterpret_cast(y), reinterpret_cast(last_h_ptr), + reinterpret_cast(last_c_ptr), batch_size, xdim, hidden_size, seq_len, + seq_len_tensor, is_reverse, nullptr, nullptr, nullptr, nullptr, + reinterpret_cast(i_f_g_o), reinterpret_cast(c), + xpu::Activation_t::TANH, xpu::Activation_t::SIGMOID); + PADDLE_ENFORCE_XDNN_SUCCESS(r, "lstm_train"); +} + template class RnnXPUKernel : public framework::OpKernel { public: void Compute(const framework::ExecutionContext& ctx) const override { + // Input auto* input = ctx.Input("Input"); auto pre_state = ctx.MultiInput("PreState"); auto weight_list = ctx.MultiInput("WeightList"); + bool has_seq_length = ctx.HasInput("SequenceLength"); + // Output auto state = ctx.MultiOutput("State"); auto* output = ctx.Output("Out"); + auto* dropout_mask = ctx.Output("DropoutState"); auto* reserve_data = ctx.Output("Reserve"); + // Attrbutes const int& num_layers = ctx.Attr("num_layers"); const bool& is_bidirec = ctx.Attr("is_bidirec"); const int& hidden_size = ctx.Attr("hidden_size"); const std::string& mode = ctx.Attr("mode"); - bool has_seq_length = ctx.HasInput("SequenceLength"); const Tensor* sequence_length = nullptr; if (has_seq_length) { sequence_length = ctx.Input("SequenceLength"); } + if (dropout_mask->IsInitialized()) { + if (dropout_mask->numel() != output->numel()) dropout_mask->clear(); + } + dropout_mask->mutable_data(output->dims(), ctx.GetPlace()); + PADDLE_ENFORCE_EQ( mode, "LSTM", platform::errors::InvalidArgument( "XPU only support LSTM mode now, current mode is %s", mode)); - PADDLE_ENFORCE_EQ(is_bidirec, false, - platform::errors::InvalidArgument( - "XPU only support unidirectional LSTM now")); - - PADDLE_ENFORCE_EQ( - num_layers, 1, - platform::errors::InvalidArgument( - "XPU only support 1 layer LSTM now, current layer num is %s", - num_layers)); - auto init_h = pre_state[0]; auto init_c = 
pre_state[1]; auto last_h = state[0]; auto last_c = state[1]; // check shape - int seq_len = input->dims()[0]; - int batch_size = input->dims()[1]; - int input_dim = input->dims()[2]; + const int& seq_len = input->dims()[0]; // time_step + const int& batch_size = input->dims()[1]; + const int& input_dim = input->dims()[2]; + const int& direction_num = is_bidirec ? 2 : 1; PADDLE_ENFORCE_EQ( - init_h->dims()[0], num_layers, + init_h->dims()[0], num_layers * direction_num, platform::errors::InvalidArgument("The num_layers of in RNN layer must" " be the same as first dim of init " "hidden, but received num_layers:%d," @@ -106,13 +140,13 @@ class RnnXPUKernel : public framework::OpKernel { num_layers, init_h->dims()[0])); PADDLE_ENFORCE_EQ( - init_c->dims()[0], num_layers, + init_c->dims()[0], num_layers * direction_num, platform::errors::InvalidArgument( "The num_layers of in RNN layer must" " be the same as first dim of cell state hidden, but received" " num_layers:%d, dim:%d", num_layers, init_c->dims()[0])); - + // weightlist std::vector> parameter_lists; parameter_lists.resize(num_layers); reset_parameter_vector(weight_list, num_layers, is_bidirec, @@ -122,41 +156,106 @@ class RnnXPUKernel : public framework::OpKernel { output->mutable_data(ctx.GetPlace()); last_h->mutable_data(ctx.GetPlace()); last_c->mutable_data(ctx.GetPlace()); - reserve_data->Resize({seq_len * batch_size * hidden_size * 5}); - reserve_data->mutable_data(ctx.GetPlace()); + reserve_data->Resize( + {num_layers * direction_num * seq_len * batch_size * hidden_size * 5}); + reserve_data->mutable_data(ctx.GetPlace()); + Tensor internal_output_1_tensor, internal_output_2_tensor; + T* internal_output_1_ptr = nullptr; + T* internal_output_2_ptr = nullptr; + if (num_layers >= 2) { + internal_output_1_tensor.Resize(output->dims()); + internal_output_1_ptr = + internal_output_1_tensor.mutable_data(ctx.GetPlace()); + } + if (num_layers >= 3) { + internal_output_2_tensor.Resize(output->dims()); + internal_output_2_ptr = + internal_output_2_tensor.mutable_data(ctx.GetPlace()); + } // get ptr from tensor auto x = input->data(); - auto h_0 = init_h->data(); - auto c_0 = init_c->data(); - auto w_x = parameter_lists[0][0]; - auto w_h = parameter_lists[0][1]; - auto b_x = parameter_lists[0][2]; - auto b_h = parameter_lists[0][3]; + auto init_h_ptr = init_h->data(); + auto init_c_ptr = init_c->data(); auto y = output->data(); auto last_h_ptr = last_h->data(); auto last_c_ptr = last_c->data(); auto i_f_g_o = reserve_data->data(); - auto c = i_f_g_o + seq_len * batch_size * hidden_size * 4; + auto c = + i_f_g_o + + num_layers * direction_num * seq_len * batch_size * hidden_size * 4; std::vector seq_len_tensor(batch_size, seq_len); if (has_seq_length) { seq_len_tensor = operators::GetDataFromTensor(sequence_length); } - // run kernel auto& dev_ctx = ctx.template device_context(); - int r = xpu::lstm_train( - dev_ctx.x_context(), (const T*)x, (const T*)h_0, (const T*)c_0, - (const T*)w_x, (const T*)w_h, (const T*)b_x, (const T*)b_h, - reinterpret_cast(y), reinterpret_cast(last_h_ptr), - reinterpret_cast(last_c_ptr), batch_size, input_dim, hidden_size, - seq_len, seq_len_tensor, nullptr, nullptr, nullptr, nullptr, - reinterpret_cast(i_f_g_o), reinterpret_cast(c)); - PADDLE_ENFORCE_EQ(r, xpu::Error_t::SUCCESS, - platform::errors::External("RnnXPU(lstm) return wrong " - "value[%d %s]", - r, XPUAPIErrorMsg[r])); + int state_offset = pre_state[0]->dims()[1] * pre_state[0]->dims()[2]; + + for (int i = 0; i < num_layers; i++) { + const T* 
cur_input_ptr = nullptr; + int cur_xdim = -1; + i_f_g_o += i * direction_num * seq_len * batch_size * hidden_size * 4; + c += i * direction_num * seq_len * batch_size * hidden_size; + + if (i == 0) { + cur_input_ptr = x; + cur_xdim = input_dim; + } else if (i % 2 != 0) { + cur_input_ptr = internal_output_1_ptr; + cur_xdim = is_bidirec ? 2 * hidden_size : hidden_size; + } else { + cur_input_ptr = internal_output_2_ptr; + cur_xdim = is_bidirec ? 2 * hidden_size : hidden_size; + } + + T* cur_output_ptr = nullptr; + if (i == num_layers - 1) { + cur_output_ptr = y; + } else if (i % 2 != 0) { + cur_output_ptr = internal_output_2_ptr; + } else { + cur_output_ptr = internal_output_1_ptr; + } + + if (is_bidirec) { + std::vector output_vec(2); + std::vector output_ptr_vec(2); + for (int k = 0; k < 2; ++k) { + output_vec[k].Resize({seq_len, batch_size, output->dims()[2] / 2}); + output_ptr_vec[k] = output_vec[k].mutable_data(ctx.GetPlace()); + } + RunLSTMLayer( + ctx, seq_len, batch_size, cur_xdim, hidden_size, cur_input_ptr, + output_ptr_vec[0], init_h_ptr, init_c_ptr, last_h_ptr, last_c_ptr, + state_offset, seq_len_tensor, parameter_lists[i], i_f_g_o, c, + is_bidirec, i, 0); + + T* bw_i_f_g_o = i_f_g_o + seq_len * batch_size * hidden_size * 4; + T* bw_c = c + seq_len * batch_size * hidden_size; + RunLSTMLayer( + ctx, seq_len, batch_size, cur_xdim, hidden_size, cur_input_ptr, + output_ptr_vec[1], init_h_ptr, init_c_ptr, last_h_ptr, last_c_ptr, + state_offset, seq_len_tensor, parameter_lists[i], bw_i_f_g_o, bw_c, + is_bidirec, i, 1); + + // concat + int r = xpu::concat( + dev_ctx.x_context(), {output_ptr_vec[0], output_ptr_vec[1]}, + cur_output_ptr, {{seq_len, batch_size, hidden_size}, + {seq_len, batch_size, hidden_size}}, + 2); + PADDLE_ENFORCE_XDNN_SUCCESS(r, "concat"); + xpu_wait(dev_ctx.x_context()->xpu_stream); + } else { + RunLSTMLayer( + ctx, seq_len, batch_size, cur_xdim, hidden_size, cur_input_ptr, + cur_output_ptr, init_h_ptr, init_c_ptr, last_h_ptr, last_c_ptr, + state_offset, seq_len_tensor, parameter_lists[i], i_f_g_o, c, + is_bidirec, i, 0); + } + } } }; @@ -221,7 +320,6 @@ class RnnXPUGradKernel : public framework::OpKernel { int seq_len = input->dims()[0]; int batch_size = input->dims()[1]; int input_dim = input->dims()[2]; - PADDLE_ENFORCE_EQ( init_h->dims()[0], num_layers, platform::errors::InvalidArgument("The num_layers of in RNN layer must" diff --git a/paddle/fluid/operators/sequence_ops/sequence_softmax_cudnn_op.cu.cc b/paddle/fluid/operators/sequence_ops/sequence_softmax_cudnn_op.cu.cc index 57064301d7afb..976c10d0f433f 100644 --- a/paddle/fluid/operators/sequence_ops/sequence_softmax_cudnn_op.cu.cc +++ b/paddle/fluid/operators/sequence_ops/sequence_softmax_cudnn_op.cu.cc @@ -58,7 +58,7 @@ class SequenceSoftmaxCUDNNKernel : public framework::OpKernel { phi::make_ddim({1UL, end_pos - start_pos}); x_i.Resize(dims_i); out_i.Resize(dims_i); - math::SoftmaxCUDNNFunctor()( + math::SoftmaxCUDNNFunctor()( ctx.template device_context(), &x_i, &out_i); } @@ -93,7 +93,7 @@ class SequenceSoftmaxGradCUDNNKernel : public framework::OpKernel { out_i.Resize(dims_i); out_grad_i.Resize(dims_i); x_grad_i.Resize(dims_i); - math::SoftmaxGradCUDNNFunctor()( + math::SoftmaxGradCUDNNFunctor()( ctx.template device_context(), &out_i, &out_grad_i, &x_grad_i); } diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.cc b/paddle/fluid/operators/softmax_with_cross_entropy_op.cc index 6f0881e9fc98f..22b592c1eb62a 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op.cc +++ 
b/paddle/fluid/operators/softmax_with_cross_entropy_op.cc @@ -12,8 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/softmax_with_cross_entropy_op.h" +#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/op_version_registry.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" namespace paddle { namespace operators { @@ -335,12 +336,6 @@ REGISTER_OPERATOR(softmax_with_cross_entropy, ops::SoftmaxWithCrossEntropyOp, REGISTER_OPERATOR(softmax_with_cross_entropy_grad, ops::SoftmaxWithCrossEntropyOpGrad, ops::SoftmaxWithCrossEntropyGradInplaceInferer); -REGISTER_OP_CPU_KERNEL(softmax_with_cross_entropy, - ops::SoftmaxWithCrossEntropyKernel, - ops::SoftmaxWithCrossEntropyKernel); -REGISTER_OP_CPU_KERNEL(softmax_with_cross_entropy_grad, - ops::SoftmaxWithCrossEntropyGradKernel, - ops::SoftmaxWithCrossEntropyGradKernel); REGISTER_OP_VERSION(softmax_with_cross_entropy) #if defined(PADDLE_WITH_ASCEND_CL) || defined(PADDLE_WITH_MLU) diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.h b/paddle/fluid/operators/softmax_with_cross_entropy_op.h deleted file mode 100644 index 4b875cbf5841f..0000000000000 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op.h +++ /dev/null @@ -1,318 +0,0 @@ -/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#pragma once -#include "paddle/fluid/framework/convert_utils.h" -#include "paddle/fluid/framework/eigen.h" -#include "paddle/fluid/framework/op_registry.h" -#include "paddle/fluid/operators/math/cross_entropy.h" -#include "paddle/fluid/operators/math/softmax.h" -#include "paddle/phi/kernels/funcs/axis_utils.h" - -namespace paddle { -namespace operators { - -using Tensor = framework::Tensor; - -template -struct SoftmaxWithCrossEntropyFunctor { - public: - SoftmaxWithCrossEntropyFunctor(const framework::ExecutionContext& context, - const framework::Tensor& labels, - const bool soft_label, const Visitor& visitor) - : context_(context), - labels_(labels), - soft_label_(soft_label), - visitor_(visitor) {} - - template - void apply() const { - visitor_.template Apply(context_, labels_, soft_label_); - } - - private: - const framework::ExecutionContext& context_; - const framework::Tensor& labels_; - const bool soft_label_; - const Visitor& visitor_; -}; - -template -static void RunSoftmaxWithCrossEntropyFunctor( - const framework::ExecutionContext& context, const Visitor& visitor) { - const auto* labels = context.Input("Label"); - const bool soft_label = context.Attr("soft_label"); - SoftmaxWithCrossEntropyFunctor functor(context, *labels, - soft_label, visitor); - auto dtype = framework::TransToProtoVarType(labels->dtype()); - if (soft_label) { - PADDLE_ENFORCE_EQ( - dtype, framework::DataTypeTrait::DataType(), - platform::errors::InvalidArgument("The Input(Label) should be with the " - "same data type as Input(Logits).")); - functor.template apply(); - } else { - framework::VisitIntDataType(dtype, functor); - } -} - -template -class SoftmaxWithCrossEntropyKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - PADDLE_ENFORCE_EQ( - platform::is_cpu_place(context.GetPlace()), true, - platform::errors::Unimplemented("This kernel only runs on CPU.")); - const bool use_softmax = context.Attr("use_softmax"); - const Tensor* labels = context.Input("Label"); - const bool soft_label = context.Attr("soft_label"); - - // do not with softmax op, and input is softmax - if (!use_softmax) { - const Tensor* softmax = context.Input("Logits"); - Tensor* softmax_out = context.Output("Softmax"); - Tensor* loss = context.Output("Loss"); - const int rank = softmax->dims().size(); - const int axis = - phi::funcs::CanonicalAxis(context.Attr("axis"), rank); - int axis_dim = softmax->dims()[axis]; - - PADDLE_ENFORCE_GT( - axis_dim, 0, - platform::errors::InvalidArgument( - "The axis dimention should be larger than 0, but received " - "axis dimention is %d.", - axis_dim)); - - softmax_out->mutable_data(context.GetPlace()); - loss->mutable_data(context.GetPlace()); - - const int n = phi::funcs::SizeToAxis(axis, softmax->dims()); - - PADDLE_ENFORCE_GT( - n, 0, platform::errors::InvalidArgument( - "The size of axis should be larger than 0, but received " - "SizeToAxis of softmax is %d.", - n)); - - const int d = phi::funcs::SizeFromAxis(axis, softmax->dims()); - - Tensor softmax_2d, labels_2d, loss_2d, softmax_out_2d; - softmax_2d.ShareDataWith(*softmax).Resize({n, d}); - labels_2d.ShareDataWith(*labels).Resize({n, labels->numel() / n}); - loss_2d.ShareDataWith(*loss).Resize({n, d / axis_dim}); - softmax_out_2d.ShareDataWith(*softmax_out).Resize({n, d}); - - auto& dev_ctx = - context.template device_context(); - - math::CrossEntropyFunctor()( - dev_ctx, &loss_2d, &softmax_2d, &labels_2d, soft_label, - context.Attr("ignore_index"), axis_dim); - - // 
cause of input is softmax - // copy to output softmax, directly - framework::TensorCopy(*softmax, context.GetPlace(), - context.device_context(), softmax_out); - - return; - } - - const Tensor* logits = context.Input("Logits"); - Tensor* softmax = context.Output("Softmax"); - Tensor* loss = context.Output("Loss"); - - const int rank = logits->dims().size(); - const int axis = phi::funcs::CanonicalAxis(context.Attr("axis"), rank); - int axis_dim = logits->dims()[axis]; - PADDLE_ENFORCE_GT( - axis_dim, 0, - platform::errors::InvalidArgument( - "The axis dimention should be larger than 0, but received " - "axis dimention is %d.", - axis_dim)); - - softmax->mutable_data(context.GetPlace()); - loss->mutable_data(context.GetPlace()); - - const int n = phi::funcs::SizeToAxis(axis, logits->dims()); - PADDLE_ENFORCE_GT( - n, 0, platform::errors::InvalidArgument( - "The size of axis should be larger than 0, but received " - "SizeToAxis of logits is %d.", - n)); - - const int d = phi::funcs::SizeFromAxis(axis, logits->dims()); - Tensor logits_2d, softmax_2d, labels_2d, loss_2d; - logits_2d.ShareDataWith(*logits).Resize({n, d}); - softmax_2d.ShareDataWith(*softmax).Resize({n, d}); - labels_2d.ShareDataWith(*labels).Resize({n, labels->numel() / n}); - loss_2d.ShareDataWith(*loss).Resize({n, d / axis_dim}); - - auto& dev_ctx = - context.template device_context(); - math::SoftmaxFunctor()( - dev_ctx, axis_dim, &logits_2d, &softmax_2d); - math::CrossEntropyFunctor()( - dev_ctx, &loss_2d, &softmax_2d, &labels_2d, soft_label, - context.Attr("ignore_index"), axis_dim); - } -}; - -template -class SoftmaxWithCrossEntropyGradKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - RunSoftmaxWithCrossEntropyFunctor(context, *this); - } - - template - static void Apply(const framework::ExecutionContext& context, - const framework::Tensor& labels, const bool soft_label) { - const Tensor* out_grad = - context.Input(framework::GradVarName("Loss")); - Tensor* logit_grad = - context.Output(framework::GradVarName("Logits")); - const Tensor* softmax = context.Input("Softmax"); - const bool use_softmax = context.Attr("use_softmax"); - if (logit_grad != softmax || !use_softmax) { - framework::TensorCopy(*softmax, context.GetPlace(), - context.device_context(), logit_grad); - } - auto ignore_index = context.Attr("ignore_index"); - - const int rank = logit_grad->dims().size(); - const int axis = phi::funcs::CanonicalAxis(context.Attr("axis"), rank); - int axis_dim = logit_grad->dims()[axis]; - PADDLE_ENFORCE_GT( - axis_dim, 0, - platform::errors::InvalidArgument( - "The axis dimention should be larger than 0, but received " - "axis dimention is %d.", - axis_dim)); - - const int n = phi::funcs::SizeToAxis(axis, logit_grad->dims()); - PADDLE_ENFORCE_GT( - n, 0, platform::errors::InvalidArgument( - "The size of axis should be larger than 0, but received " - "SizeToAxis of logit_grad is %d.", - n)); - - const int d = phi::funcs::SizeFromAxis(axis, logit_grad->dims()); - Tensor logit_grad_2d, labels_2d, out_grad_2d; - logit_grad_2d.ShareDataWith(*logit_grad).Resize({n, d}); - labels_2d.ShareDataWith(labels).Resize({n, labels.numel() / n}); - out_grad_2d.ShareDataWith(*out_grad).Resize({n, d / axis_dim}); - auto out_grad_mat = framework::EigenMatrix::From(out_grad_2d); - auto logit_grad_mat = framework::EigenMatrix::From(logit_grad_2d); - auto& place = *context.template device_context() - .eigen_device(); - if (!use_softmax) { - // use_softmax step1 - if 
(soft_label) { - auto lbl_mat = framework::EigenMatrix::From(labels_2d); - logit_grad_mat.device(place) = - (-lbl_mat / logit_grad_mat); // for each sample ,i is sample id - logit_grad_mat.device(place) = - out_grad_mat.broadcast(Eigen::DSizes(1, axis_dim)) * - logit_grad_mat; - } else { - // use_softmax step2 - const auto* label_data = labels.template data(); - T* logit_grad_data = logit_grad->template data(); - const T* out_grad_data = out_grad->template data(); - const int remain = d / axis_dim; - for (int i = 0; i < n; ++i) { // for each sample_1_dim - for (int j = 0; j < remain; j++) { // for each sample_other_dims - int idx = i * remain + j; // this sample's label_idx. for 1d case, - // remain=1 and j=0, so, idx = i - auto lbl = static_cast(label_data[idx]); - if (lbl == ignore_index) { - for (int k = 0; k < axis_dim; ++k) { // for each class id's label - logit_grad_data[i * d + k * remain + j] = 0; - } - } else { - // only for this sample's label_idx, the label is 1, others is 0, - // so, only compute this label_idx's class - logit_grad_data[i * d + lbl * remain + j] = - (-1 / logit_grad_data[i * d + lbl * remain + j]) * - out_grad_data[idx]; - for (int k = 0; k < axis_dim; ++k) { // for each class id's label - if (k != - label_data[idx]) { // label_data[idx]: this sample's label - logit_grad_data[i * d + k * remain + j] = 0; - } - } - } - } - } - } - return; - } - // for use_softmax=False, continue - - if (soft_label) { - // when soft_label = True, ignore_index is not supported - auto lbl_mat = framework::EigenMatrix::From(labels_2d); - logit_grad_mat.device(place) = - out_grad_mat.broadcast(Eigen::DSizes(1, axis_dim)) * - (logit_grad_mat - lbl_mat); // for each sample ,i is sample id - // 1) compute dy/dx by p_j - y_j or P-Y, where j is class id, - // P=logit_grad_mat[i] is all class's probs, Y=lbl_mat[i] is - // all class's labels - // 2) compute dy * dy/dx by Chain rule, dy=out_grad_mat[i] - // for high dims, e.g. (n,c) or (n,d1,...,dm, c), compute grad by matrix - // operation - - } else { - logit_grad_mat.device(place) = - logit_grad_mat * // element_wise multiply - out_grad_mat.broadcast(Eigen::DSizes(1, axis_dim)); - - const auto* label_data = labels.template data(); - T* logit_grad_data = logit_grad->template data(); - const T* out_grad_data = out_grad->template data(); - const int remain = d / axis_dim; - for (int i = 0; i < n; ++i) { // for each sample_1_dim - for (int j = 0; j < remain; j++) { // for each sample_other_dims - int idx = i * remain + j; // this sample's label_idx. 
for 1d case, - // remain=1 and j=0, so, idx = i - auto lbl = static_cast(label_data[idx]); - if (lbl == ignore_index) { - for (int k = 0; k < axis_dim; ++k) { // for each class id's label - logit_grad_data[i * d + k * remain + j] = 0; - } - } else { - // only for this sample's label_idx, the label is 1, others is 0, - // so, only compute this label_idx's class - // for 1d case, remain=1 and j=0, so, [i * d + label_data[idx] * - // remain + j] = [i * d + label_data[idx]] - // let idx_x = i * d + label_data[idx] * remain + j, - // logit_grad_data[idx_x] = logit_grad_data[idx_x] - - // out_grad_data[idx] - // note: logit_grad_mat = logit_grad_mat * out_grad_mat - // so: logit_grad_data[idx_x] = (logit_grad_data[idx_x] - 1) * - // out_grad_data[idx] - // means: dy/dp * dy= ( p - y ) * dy - - logit_grad_data[i * d + lbl * remain + j] -= out_grad_data[idx]; - } - } - } - } - } -}; - -} // namespace operators -} // namespace paddle diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op_mlu.cc b/paddle/fluid/operators/softmax_with_cross_entropy_op_mlu.cc index 34650c2e06245..7056bcd4f76bc 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op_mlu.cc +++ b/paddle/fluid/operators/softmax_with_cross_entropy_op_mlu.cc @@ -12,8 +12,9 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/softmax_with_cross_entropy_op.h" +#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/mlu/mlu_baseop.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op_npu.cc b/paddle/fluid/operators/softmax_with_cross_entropy_op_npu.cc index 1f1fbea090c13..f64d9e022a1ad 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op_npu.cc +++ b/paddle/fluid/operators/softmax_with_cross_entropy_op_npu.cc @@ -12,13 +12,14 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/fluid/operators/softmax_with_cross_entropy_op.h" - #include #include + +#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/operators/math/cross_entropy.h" #include "paddle/fluid/operators/math/softmax.h" #include "paddle/fluid/platform/device/npu/npu_op_runner.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" namespace paddle { namespace operators { diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op_xpu.cc b/paddle/fluid/operators/softmax_with_cross_entropy_op_xpu.cc index b5514525f5981..c07467a9b0ba3 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op_xpu.cc +++ b/paddle/fluid/operators/softmax_with_cross_entropy_op_xpu.cc @@ -12,20 +12,23 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
*/ -#include "paddle/fluid/operators/softmax_with_cross_entropy_op.h" #ifdef PADDLE_WITH_XPU #include #include #include #include +#include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/platform/device/device_wrapper.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" #include "xpu/refactor/math.h" #include "xpu/refactor/nn.h" namespace paddle { namespace operators { +using Tensor = framework::Tensor; + template class SoftmaxWithCrossEntropyXPUKernel : public framework::OpKernel { using XPUType = typename XPUTypeTrait::Type; diff --git a/paddle/fluid/operators/strided_slice_op.cc b/paddle/fluid/operators/strided_slice_op.cc index 0ff7d654fc29d..6f092bbef067e 100644 --- a/paddle/fluid/operators/strided_slice_op.cc +++ b/paddle/fluid/operators/strided_slice_op.cc @@ -228,7 +228,7 @@ DECLARE_NO_NEED_BUFFER_VARS_INFERER(StridedSliceOpGradNoNeedBufferVarsInferer, namespace ops = paddle::operators; DECLARE_INFER_SHAPE_FUNCTOR(strided_slice, StridedSliceInferShape, - PD_INFER_META(phi::StridedSliceInferMeta)); + PD_INFER_META(phi::StridedSliceRawInferMeta)); REGISTER_OPERATOR(strided_slice, ops::StridedSliceOp, ops::StridedSliceOpMaker, ops::StridedSliceOpGradMaker, diff --git a/paddle/fluid/operators/temporal_shift_op.cc b/paddle/fluid/operators/temporal_shift_op.cc index acf99d09ffb90..3bdb9cb972fc6 100644 --- a/paddle/fluid/operators/temporal_shift_op.cc +++ b/paddle/fluid/operators/temporal_shift_op.cc @@ -15,6 +15,10 @@ #include #include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/infershape_utils.h" +#include "paddle/phi/core/infermeta_utils.h" +#include "paddle/phi/infermeta/unary.h" + namespace paddle { namespace operators { @@ -24,49 +28,6 @@ class TemporalShiftOp : public framework::OperatorWithKernel { public: using framework::OperatorWithKernel::OperatorWithKernel; - protected: - void InferShape(framework::InferShapeContext* ctx) const override { - OP_INOUT_CHECK(ctx->HasInput("X"), "Input", "X", "SpectralNorm"); - OP_INOUT_CHECK(ctx->HasOutput("Out"), "Output", "Out", "SpectralNorm"); - - auto dim_x = ctx->GetInputDim("X"); - PADDLE_ENFORCE_EQ(dim_x.size(), 4, - platform::errors::InvalidArgument( - "Input(X) rank should be 4 in shape of [N*T, C, H, " - "W], but received X rank(%d)", - dim_x.size())); - - int seg_num = ctx->Attrs().Get("seg_num"); - float shift_ratio = ctx->Attrs().Get("shift_ratio"); - PADDLE_ENFORCE_GT( - seg_num, 0, - platform::errors::InvalidArgument( - "Attr(seg_num) should be greater than 0, but received %d", - seg_num)); - PADDLE_ENFORCE_GT( - shift_ratio, 0., - platform::errors::InvalidArgument( - "Attr(shift_ratio) should be greater than 0, but received %d", - shift_ratio)); - PADDLE_ENFORCE_LT( - shift_ratio, 0.5, - platform::errors::InvalidArgument( - "Attr(shift_ratio) should be less than 0.5, but received %d", - shift_ratio)); - - if (ctx->IsRuntime()) { - PADDLE_ENFORCE_EQ(dim_x[0] % seg_num, 0, - platform::errors::InvalidArgument( - "Input(X) dimension[0] should be divided exactly " - "by Attr(seg_num), but received X dimension[0](%d) " - "mod seg_num(%d) != 0", - dim_x[0], seg_num)); - } - - ctx->SetOutputDim("Out", dim_x); - ctx->ShareLoD("X", "Out"); - } - protected: framework::OpKernelType GetExpectedKernelType( const framework::ExecutionContext& ctx) const override { @@ -186,10 +147,13 @@ class TemporalShiftGradOpMaker : public framework::SingleGradOpMaker { } // namespace paddle namespace ops = paddle::operators; +DECLARE_INFER_SHAPE_FUNCTOR(temporal_shift, TemporalShiftInferShapeFunctor, + 
PD_INFER_META(phi::TemporalShiftInferMeta)); REGISTER_OPERATOR(temporal_shift, ops::TemporalShiftOp, ops::TemporalShiftOpMaker, ops::TemporalShiftGradOpMaker, - ops::TemporalShiftGradOpMaker); + ops::TemporalShiftGradOpMaker, + TemporalShiftInferShapeFunctor); REGISTER_OPERATOR(temporal_shift_grad, ops::TemporalShiftOpGrad); REGISTER_OP_CPU_KERNEL(temporal_shift, ops::TemporalShiftKernel, ops::TemporalShiftKernel); diff --git a/paddle/fluid/operators/uniform_random_op.h b/paddle/fluid/operators/uniform_random_op.h index b941dc21c3ab2..ae846f4cae6fb 100644 --- a/paddle/fluid/operators/uniform_random_op.h +++ b/paddle/fluid/operators/uniform_random_op.h @@ -19,11 +19,7 @@ #include "paddle/fluid/framework/op_registry.h" #include "paddle/fluid/framework/operator.h" #if defined(__NVCC__) || defined(__HIPCC__) -DECLARE_bool(use_curand); -#include -#include #include -#include #include "paddle/fluid/framework/generator.h" #include "paddle/phi/kernels/full_kernel.h" #include "paddle/phi/kernels/funcs/distribution_helper.h" @@ -146,39 +142,6 @@ struct UniformGenerator { } }; -template -struct UniformGeneratorOffset { - T min_, max_; - unsigned int seed_; - T diag_val_; - unsigned int diag_num_; - unsigned int diag_step_; - int offset_; - __host__ __device__ UniformGeneratorOffset(T min, T max, int seed, - int diag_num, int diag_step, - T diag_val, int offset) - : min_(min), - max_(max), - seed_(seed), - diag_num_(diag_num), - diag_step_(diag_step), - diag_val_(diag_val), - offset_(offset) {} - - __host__ __device__ T operator()(const unsigned int n) const { - thrust::minstd_rand rng; - rng.seed(seed_); - thrust::uniform_real_distribution dist(min_, max_); - rng.discard(n + offset_); - T out = dist(rng); - unsigned int remainder = n % (diag_step_ + 1); - if (remainder == 0 && diag_num_ > n / (diag_step_ + 1)) { - out = diag_val_; - } - return out; - } -}; - template void UniformRandom(const framework::ExecutionContext& context, framework::Tensor* tensor) { @@ -205,19 +168,10 @@ void UniformRandom(const framework::ExecutionContext& context, int device_id = context.GetPlace().GetDeviceId(); auto gen_cuda = framework::GetDefaultCUDAGenerator(device_id); if (gen_cuda->GetIsInitPy() && seed_flag) { - if (FLAGS_use_curand) { - using MT = typename details::MPTypeTrait::Type; - phi::funcs::uniform_distribution dist; - phi::funcs::uniform_real_transform trans(min, max); - phi::funcs::distribution_and_transform(dev_cxt, tensor, dist, trans); - } else { - auto seed_offset = gen_cuda->IncrementOffset(1); - int64_t gen_offset = size * seed_offset.second; - auto func = - UniformGeneratorOffset(min, max, seed_offset.first, diag_num, - diag_step, diag_val, gen_offset); - phi::IndexKernel>(dev_cxt, tensor, func); - } + using MT = typename details::MPTypeTrait::Type; + phi::funcs::uniform_distribution dist; + phi::funcs::uniform_real_transform trans(min, max); + phi::funcs::distribution_and_transform(dev_cxt, tensor, dist, trans); } else { auto func = UniformGenerator(min, max, seed, diag_num, diag_step, diag_val); diff --git a/paddle/fluid/platform/collective_helper.cc b/paddle/fluid/platform/collective_helper.cc index ae1df10c45f77..d05de900e5e77 100644 --- a/paddle/fluid/platform/collective_helper.cc +++ b/paddle/fluid/platform/collective_helper.cc @@ -199,6 +199,10 @@ NCCLComm* NCCLCommContext::AssignNCCLComm(ncclComm_t comm, int nranks, int rank, paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(CUDAPlace(dev_id)) .get()); + dev_ctx->SetPinnedAllocator( + 
paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx->PartialInitWithAllocator(); std::shared_ptr compute_event( diff --git a/paddle/fluid/platform/device/gpu/gpu_launch_config.h b/paddle/fluid/platform/device/gpu/gpu_launch_config.h index 4e8b790fa63d1..4a550e61d42da 100644 --- a/paddle/fluid/platform/device/gpu/gpu_launch_config.h +++ b/paddle/fluid/platform/device/gpu/gpu_launch_config.h @@ -170,6 +170,14 @@ inline GpuLaunchConfig GetGpuLaunchConfig2D( return config; } +template +void LimitGridDim(const Context& ctx, dim3* grid_dim) { + auto max_grid_dim = reinterpret_cast(ctx) + .GetCUDAMaxGridDimSize(); + grid_dim->x = grid_dim->x < max_grid_dim[0] ? grid_dim->x : max_grid_dim[0]; + grid_dim->y = grid_dim->y < max_grid_dim[1] ? grid_dim->y : max_grid_dim[1]; + grid_dim->z = grid_dim->z < max_grid_dim[2] ? grid_dim->z : max_grid_dim[2]; +} } // namespace platform } // namespace paddle diff --git a/paddle/fluid/platform/device/gpu/nccl_helper.h b/paddle/fluid/platform/device/gpu/nccl_helper.h index 1919f59f8c07f..4301ef4bcf126 100644 --- a/paddle/fluid/platform/device/gpu/nccl_helper.h +++ b/paddle/fluid/platform/device/gpu/nccl_helper.h @@ -113,6 +113,10 @@ struct NCCLContext { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(CUDAPlace(dev_id)) .get()); + ctx_->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); ctx_->PartialInitWithAllocator(); } diff --git a/paddle/fluid/platform/device/ipu/CMakeLists.txt b/paddle/fluid/platform/device/ipu/CMakeLists.txt index 42c949f7fe0f6..7712ede8fd210 100644 --- a/paddle/fluid/platform/device/ipu/CMakeLists.txt +++ b/paddle/fluid/platform/device/ipu/CMakeLists.txt @@ -1,6 +1,22 @@ -IF(WITH_IPU) - FILE(GLOB POPART_CANONICALIZATION_SRC ${PADDLE_SOURCE_DIR}/paddle/fluid/platform/device/ipu/popart_canonicalization/*.cc) - list(APPEND PADDLE_IPU_SRC ${POPART_CANONICALIZATION_SRC}) +if(WITH_IPU) + set(paddle_ipu_handler ${CMAKE_CURRENT_BINARY_DIR}/paddle_ipu_handler.h.tmp) + set(paddle_ipu_handler_final ${CMAKE_CURRENT_BINARY_DIR}/paddle_ipu_handler.h) + file(WRITE ${paddle_ipu_handler} "// Auto generated from CMake. 
DO NOT EDIT!\n\n") + file(APPEND ${paddle_ipu_handler} "\#pragma once\n") + file(APPEND ${paddle_ipu_handler} "\#include \"paddle/fluid/platform/device/ipu/popart_canonicalization/canonicalization_utils.h\"\n\n") + file(GLOB POPART_CANONICALIZATION_SRC ${CMAKE_CURRENT_SOURCE_DIR}/popart_canonicalization/*.cc) + copy_if_different(${paddle_ipu_handler} ${paddle_ipu_handler_final}) + + foreach(file_path ${POPART_CANONICALIZATION_SRC}) + file(READ ${file_path} file_content) + string(REGEX MATCHALL "(REGISTER_HANDLER)(\\()([A-Za-z0-9_]+)(,)" op_handlers ${file_content}) + string(REPLACE "REGISTER_HANDLER(" "" op_handlers "${op_handlers}") + string(REPLACE "," "" op_handlers "${op_handlers}") + foreach(op_handler ${op_handlers}) + file(APPEND ${paddle_ipu_handler} "USE_HANDLER(${op_handler});\n") + endforeach() + endforeach() + set(IPU_BACKEND_SRC "ipu_strategy.cc" "ipu_executor.cc" @@ -13,10 +29,7 @@ IF(WITH_IPU) "ipu_device.cc" ) - cc_library(ipu_backend SRCS ${IPU_BACKEND_SRC} DEPS popart-only graph graph_helper popdist) + cc_library(popart_canonicalization SRCS ${POPART_CANONICALIZATION_SRC} DEPS graph) + cc_library(ipu_backend SRCS ${IPU_BACKEND_SRC} DEPS popart-only graph graph_helper popdist popart_canonicalization) cc_library(ipu_info SRCS ${IPU_INFO_SRC} DEPS popart-only enforce) - add_library(paddle_ipu SHARED ${PADDLE_IPU_SRC}) - add_dependencies(paddle_ipu ipu_backend) - set(PADDLE_IPU_LIB "${CMAKE_CURRENT_BINARY_DIR}/libpaddle_ipu.so" CACHE STRING "") - set(PADDLE_IPU_LIB_DIR "${CMAKE_CURRENT_BINARY_DIR}" CACHE STRING "") -ENDIF() +endif() diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/activation_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/activation_ops.cc index fc2f1e476b92e..ab9ddfde21873 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/activation_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/activation_ops.cc @@ -88,6 +88,11 @@ Node *log_softmax_handler(Graph *graph, Node *node) { node->outputs); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(relu, relu_handler); REGISTER_HANDLER(tanh, tanh_handler); REGISTER_HANDLER(log, log_handler); @@ -95,8 +100,3 @@ REGISTER_HANDLER(sigmoid, sigmoid_handler); REGISTER_HANDLER(sqrt, sqrt_handler); REGISTER_HANDLER(gelu, gelu_handler); REGISTER_HANDLER(log_softmax, log_softmax_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/canonicalization_utils.h b/paddle/fluid/platform/device/ipu/popart_canonicalization/canonicalization_utils.h index 5725ec767a425..32133e128c588 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/canonicalization_utils.h +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/canonicalization_utils.h @@ -23,9 +23,36 @@ namespace paddle { namespace platform { namespace ipu { -#define REGISTER_HANDLER(name, func) \ - static bool __UNUSED_##name = \ - paddle::platform::ipu::RegisterHandler(#name, func) +#define STATIC_ASSERT_GLOBAL_NAMESPACE(uniq_name, msg) \ + struct __test_global_namespace_##uniq_name##__ {}; \ + static_assert(std::is_same<::__test_global_namespace_##uniq_name##__, \ + __test_global_namespace_##uniq_name##__>::value, \ + msg) + +#define REGISTER_HANDLER(op_type, handler) \ + STATIC_ASSERT_GLOBAL_NAMESPACE( \ + __reg_ipu_op_handler__##op_type, \ + "REGISTER_HANDLER must be called in global namespace"); \ + struct 
__PaddleRegisterIpuOpHandler_##op_type { \ + __PaddleRegisterIpuOpHandler_##op_type() { \ + ::paddle::platform::ipu::RegisterHandler( \ + #op_type, paddle::platform::ipu::handler); \ + } \ + int Touch() const { return 0; } \ + }; \ + static __PaddleRegisterIpuOpHandler_##op_type \ + __PaddleRegisterIpuOpHandler_instance##op_type; \ + int TouchPaddleIpuOpHandlerRegister_##op_type() { \ + return __PaddleRegisterIpuOpHandler_instance##op_type.Touch(); \ + } + +#define USE_HANDLER(op_type) \ + STATIC_ASSERT_GLOBAL_NAMESPACE( \ + __use_ipu_op_handler__##op_type, \ + "USE_HANDLER must be called in global namespace"); \ + extern int TouchPaddleIpuOpHandlerRegister_##op_type(); \ + UNUSED static int use_handler__itself_##op_type##_ = \ + TouchPaddleIpuOpHandlerRegister_##op_type() using SymbolHandler = std::function; diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/elementwise_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/elementwise_ops.cc index f0c19cac3a6c3..619d59a9f99a3 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/elementwise_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/elementwise_ops.cc @@ -93,6 +93,11 @@ Node *elementwise_mod_handler(Graph *graph, Node *node) { return elementwise_op_handler(graph, node, "popart_mod"); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(elementwise_add, elementwise_add_handler); REGISTER_HANDLER(elementwise_sub, elementwise_sub_handler); REGISTER_HANDLER(elementwise_div, elementwise_div_handler); @@ -101,8 +106,3 @@ REGISTER_HANDLER(elementwise_min, elementwise_min_handler); REGISTER_HANDLER(elementwise_max, elementwise_max_handler); REGISTER_HANDLER(elementwise_pow, elementwise_pow_handler); REGISTER_HANDLER(elementwise_mod, elementwise_mod_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/logic_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/logic_ops.cc index 7d92835534513..6f82acb5b7db3 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/logic_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/logic_ops.cc @@ -58,14 +58,14 @@ Node *less_than_handler(Graph *graph, Node *node) { {GetOutputVarNode("Out", node)}, {}); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(equal, equal_handler); REGISTER_HANDLER(logical_not, logical_not_handler); REGISTER_HANDLER(logical_or, logical_or_handler); REGISTER_HANDLER(logical_and, logical_and_handler); REGISTER_HANDLER(greater_than, greater_than_handler); REGISTER_HANDLER(less_than, less_than_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/math_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/math_ops.cc index ba6675f40f400..9a907cf5e880f 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/math_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/math_ops.cc @@ -366,6 +366,11 @@ Node *arg_max_handler(Graph *graph, Node *node) { {{"axis", axis}, {"keepdims", int64_t{0}}}); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(mean, mean_handler); REGISTER_HANDLER(pow, pow_handler); REGISTER_HANDLER(mul, mul_handler); @@ -377,8 +382,3 @@ 
REGISTER_HANDLER(cross_entropy2, cross_entropy2_handler); REGISTER_HANDLER(cumsum, cumsum_handler); REGISTER_HANDLER(matmul_v2, matmul_v2_handler); REGISTER_HANDLER(arg_max, arg_max_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/nn_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/nn_ops.cc index b731ba532d60c..a529a34e6d71a 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/nn_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/nn_ops.cc @@ -298,6 +298,11 @@ Node *dropout_handler(Graph *graph, Node *node) { } } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(pool2d, pool2d_handler); REGISTER_HANDLER(batch_norm, batch_norm_handler); REGISTER_HANDLER(group_norm, group_norm_handler); @@ -305,8 +310,3 @@ REGISTER_HANDLER(instance_norm, instance_norm_handler); REGISTER_HANDLER(layer_norm, layer_norm_handler); REGISTER_HANDLER(conv2d, conv2d_handler); REGISTER_HANDLER(dropout, dropout_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/other_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/other_ops.cc index 8bd0794368838..c9ac081f920da 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/other_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/other_ops.cc @@ -77,6 +77,11 @@ Node *detach_handler(Graph *graph, Node *node) { node->outputs); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(custom_op, custom_op_handler); REGISTER_HANDLER(print, print_handler); REGISTER_HANDLER(popart_optimizer, popart_optimizer_handler); @@ -84,8 +89,3 @@ REGISTER_HANDLER(checkpointoutput, checkpointoutput_handler); REGISTER_HANDLER(custom_nll_loss, custom_nll_loss_handler); REGISTER_HANDLER(identity, identity_handler); REGISTER_HANDLER(detach, detach_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/reduce_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/reduce_ops.cc index f34484bc08c7c..852cb180aa787 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/reduce_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/reduce_ops.cc @@ -56,13 +56,13 @@ Node *reduce_prod_handler(Graph *graph, Node *node) { return reduce_op_handler(graph, node, "popart_reduceprod"); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(reduce_mean, reduce_mean_handler); REGISTER_HANDLER(reduce_min, reduce_min_handler); REGISTER_HANDLER(reduce_sum, reduce_sum_handler); REGISTER_HANDLER(reduce_max, reduce_max_handler); REGISTER_HANDLER(reduce_prod, reduce_prod_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/search_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/search_ops.cc index 539053f2fb67b..aec89a1cf0d82 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/search_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/search_ops.cc @@ -86,10 +86,10 @@ Node *topk_handler(Graph *graph, Node *node) { 
static_cast(framework::proto::VarType::INT32)); } -REGISTER_HANDLER(top_k, topk_handler); -REGISTER_HANDLER(top_k_v2, topk_handler); - } // namespace } // namespace ipu } // namespace platform } // namespace paddle + +REGISTER_HANDLER(top_k, topk_handler); +REGISTER_HANDLER(top_k_v2, topk_handler); diff --git a/paddle/fluid/platform/device/ipu/popart_canonicalization/tensor_ops.cc b/paddle/fluid/platform/device/ipu/popart_canonicalization/tensor_ops.cc index 6ccb5441f8375..4c086bffb240e 100644 --- a/paddle/fluid/platform/device/ipu/popart_canonicalization/tensor_ops.cc +++ b/paddle/fluid/platform/device/ipu/popart_canonicalization/tensor_ops.cc @@ -570,6 +570,11 @@ Node *split_handler(Graph *graph, Node *node) { {"split", std::vector{sections.begin(), sections.end()}}}); } +} // namespace +} // namespace ipu +} // namespace platform +} // namespace paddle + REGISTER_HANDLER(fill_constant, fill_constant_handler); REGISTER_HANDLER(gaussian_random, gaussian_random_handler); REGISTER_HANDLER(uniform_random, uniform_random_handler); @@ -593,8 +598,3 @@ REGISTER_HANDLER(lookup_table_v2, lookup_table_v2_handler); REGISTER_HANDLER(split, split_handler); REGISTER_HANDLER(one_hot, one_hot_handler); REGISTER_HANDLER(one_hot_v2, one_hot_v2_handler); - -} // namespace -} // namespace ipu -} // namespace platform -} // namespace paddle diff --git a/paddle/fluid/platform/device/xpu/xpu2_op_list.h b/paddle/fluid/platform/device/xpu/xpu2_op_list.h index 5edab707e7e3a..08a7f08006957 100644 --- a/paddle/fluid/platform/device/xpu/xpu2_op_list.h +++ b/paddle/fluid/platform/device/xpu/xpu2_op_list.h @@ -70,8 +70,10 @@ XPUOpMap& get_kl2_ops() { XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace()), pOpKernelType(vartype::FP16, XPUPlace())})}, {"dropout_grad", - XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, - {"dropout", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, + XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace()), + pOpKernelType(vartype::FP16, XPUPlace())})}, + {"dropout", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace()), + pOpKernelType(vartype::FP16, XPUPlace())})}, {"elementwise_add_grad", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace()), pOpKernelType(vartype::FP16, XPUPlace())})}, @@ -147,7 +149,6 @@ XPUOpMap& get_kl2_ops() { pOpKernelType(vartype::FP64, XPUPlace()), pOpKernelType(vartype::FP32, XPUPlace()), pOpKernelType(vartype::FP16, XPUPlace()), - pOpKernelType(vartype::BF16, XPUPlace()), pOpKernelType(vartype::COMPLEX64, XPUPlace()), pOpKernelType(vartype::COMPLEX128, XPUPlace())})}, {"flatten2_grad", @@ -250,6 +251,8 @@ XPUOpMap& get_kl2_ops() { {"momentum", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, {"mul", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace()), pOpKernelType(vartype::FP16, XPUPlace())})}, + {"mul_grad", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace()), + pOpKernelType(vartype::FP16, XPUPlace())})}, {"nearest_interp_v2", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, {"nearest_interp_v2_grad", @@ -295,6 +298,7 @@ XPUOpMap& get_kl2_ops() { pOpKernelType(vartype::INT32, XPUPlace()), pOpKernelType(vartype::BOOL, XPUPlace()), pOpKernelType(vartype::FP32, XPUPlace())})}, + {"rnn", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, {"roi_align", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, {"roi_align_grad", XPUKernelSet({pOpKernelType(vartype::FP32, XPUPlace())})}, diff --git a/paddle/fluid/platform/device_context.cc b/paddle/fluid/platform/device_context.cc index 
5ee54b1c86528..f3934c7d8713b 100644 --- a/paddle/fluid/platform/device_context.cc +++ b/paddle/fluid/platform/device_context.cc @@ -162,6 +162,11 @@ inline void EmplaceDeviceContext( dev_ctx->SetAllocator(memory::allocation::AllocatorFacade::Instance() .GetAllocator(p) .get()); + dev_ctx->SetPinnedAllocator( + memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); + cuda_ctx->PartialInitWithAllocator(); dev_ctx->SetGenerator( framework::GetDefaultCUDAGenerator(p.GetDeviceId()).get()); diff --git a/paddle/fluid/platform/device_context_test.cu b/paddle/fluid/platform/device_context_test.cu index 851c756b665b8..08a04a9565af7 100644 --- a/paddle/fluid/platform/device_context_test.cu +++ b/paddle/fluid/platform/device_context_test.cu @@ -39,6 +39,10 @@ TEST(Device, Init) { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(CUDAPlace(i)) .get()); + device_context->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); device_context->PartialInitWithAllocator(); Eigen::GpuDevice* gpu_device = device_context->eigen_device(); @@ -66,6 +70,10 @@ TEST(Device, CUDADeviceContext) { paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(CUDAPlace(i)) .get()); + device_context->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); device_context->PartialInitWithAllocator(); Eigen::GpuDevice* gpu_device = device_context->eigen_device(); ASSERT_NE(nullptr, gpu_device); diff --git a/paddle/fluid/platform/device_event.h b/paddle/fluid/platform/device_event.h index 57f45a40165d7..463329d32c936 100644 --- a/paddle/fluid/platform/device_event.h +++ b/paddle/fluid/platform/device_event.h @@ -29,7 +29,7 @@ using ::paddle::platform::kCPU; USE_EVENT(kCPU) USE_EVENT_WAIT(kCPU, kCPU) -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) USE_EVENT(kCUDA); USE_EVENT_WAIT(kCUDA, kCUDA) USE_EVENT_WAIT(kCPU, kCUDA) diff --git a/paddle/fluid/platform/device_event_gpu.cc b/paddle/fluid/platform/device_event_gpu.cc index a811a5b9c130d..f42ccc5a1db54 100644 --- a/paddle/fluid/platform/device_event_gpu.cc +++ b/paddle/fluid/platform/device_event_gpu.cc @@ -15,7 +15,7 @@ #include "paddle/fluid/platform/device_event_base.h" #include "paddle/fluid/platform/event.h" -#ifdef PADDLE_WITH_CUDA +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) namespace paddle { namespace platform { struct CUDADeviceEventWrapper { diff --git a/paddle/fluid/platform/device_event_test.cc b/paddle/fluid/platform/device_event_test.cc index 96e89f9257dd2..d9f744b26256b 100644 --- a/paddle/fluid/platform/device_event_test.cc +++ b/paddle/fluid/platform/device_event_test.cc @@ -75,6 +75,58 @@ TEST(DeviceEvent, CUDA) { } #endif +#ifdef PADDLE_WITH_HIP +#include + +TEST(DeviceEvent, CUDA) { + VLOG(1) << "In Test"; + using paddle::platform::CUDAPlace; + + auto& pool = DeviceContextPool::Instance(); + auto place = CUDAPlace(0); + auto* context = + static_cast(pool.Get(place)); + + ASSERT_NE(context, nullptr); + // case 1. test for event_creator + DeviceEvent event(place); + ASSERT_NE(event.GetEvent().get(), nullptr); + bool status = event.Query(); + ASSERT_EQ(status, true); + // case 2. test for event_recorder + event.Record(context); + status = event.Query(); + ASSERT_EQ(status, false); + // case 3. 
test for event_finisher + event.Finish(); + status = event.Query(); + ASSERT_EQ(status, true); + + // case 4. test for event_waiter + float *src_fp32, *dst_fp32; + int size = 1000000 * sizeof(float); + hipMallocHost(reinterpret_cast(&src_fp32), size); + hipMalloc(reinterpret_cast(&dst_fp32), size); + hipMemcpyAsync(dst_fp32, src_fp32, size, hipMemcpyHostToDevice, + context->stream()); + event.Record(context); // step 1. record it + status = event.Query(); + ASSERT_EQ(status, false); + + event.Wait(kCUDA, context); // step 2. add streamWaitEvent + status = event.Query(); + ASSERT_EQ(status, false); // async + + event.Wait(kCPU, context); // step 3. EventSynchornize + status = event.Query(); + ASSERT_EQ(status, true); // sync + + // release resource + hipFree(dst_fp32); + hipFreeHost(src_fp32); +} +#endif + TEST(DeviceEvent, CPU) { using paddle::platform::CPUPlace; auto place = CPUPlace(); diff --git a/paddle/fluid/platform/flags.cc b/paddle/fluid/platform/flags.cc index e417b4fd8694b..c3d3f6a4f6893 100644 --- a/paddle/fluid/platform/flags.cc +++ b/paddle/fluid/platform/flags.cc @@ -364,11 +364,7 @@ PADDLE_DEFINE_EXPORTED_double( * Example: * Note: For selecting allocator policy of PaddlePaddle. */ -#ifdef PADDLE_ON_INFERENCE -static constexpr char kDefaultAllocatorStrategy[] = "naive_best_fit"; -#else static constexpr char kDefaultAllocatorStrategy[] = "auto_growth"; -#endif PADDLE_DEFINE_EXPORTED_string( allocator_strategy, kDefaultAllocatorStrategy, "The allocation strategy, enum in [naive_best_fit, auto_growth]. " @@ -549,8 +545,6 @@ PADDLE_DEFINE_EXPORTED_double( */ PADDLE_DEFINE_EXPORTED_bool(use_mkldnn, false, "Use MKLDNN to run"); -PADDLE_DEFINE_EXPORTED_bool(use_curand, false, "Random OP use CURAND"); - /** * Debug related FLAG * Name: FLAGS_call_stack_level diff --git a/paddle/fluid/platform/profiler/chrometracing_logger.cc b/paddle/fluid/platform/profiler/chrometracing_logger.cc index 0c5d90a9277a6..d7879e7be517e 100644 --- a/paddle/fluid/platform/profiler/chrometracing_logger.cc +++ b/paddle/fluid/platform/profiler/chrometracing_logger.cc @@ -325,7 +325,7 @@ void ChromeTracingLogger::HandleTypeKernel( "name": "%s[%s]", "pid": %lld, "tid": %lld, "ts": %lld, "dur": %.3f, "ph": "X", "cat": "%s", - "cname": "rail_animation", + "cname": "cq_build_failed", "args": { "start_time": "%.3f us", "end_time": "%.3f us", @@ -376,7 +376,7 @@ void ChromeTracingLogger::HandleTypeMemcpy( "name": "%s[%s]", "pid": %lld, "tid": %lld, "ts": %lld, "dur": %.3f, "ph": "X", "cat": "%s", - "cname": "rail_animation", + "cname": "cq_build_failed", "args": { "start_time": "%.3f us", "end_time": "%.3f us", @@ -411,7 +411,7 @@ void ChromeTracingLogger::HandleTypeMemset( "name": "%s[%s]", "pid": %lld, "tid": %lld, "ts": %lld, "dur": %.3f, "ph": "X", "cat": "%s", - "cname": "rail_animation", + "cname": "cq_build_failed", "args": { "start_time": "%.3f us", "end_time": "%.3f us", diff --git a/paddle/fluid/platform/profiler_helper.h b/paddle/fluid/platform/profiler_helper.h index c0b7fd417f272..c9e6f13f50524 100644 --- a/paddle/fluid/platform/profiler_helper.h +++ b/paddle/fluid/platform/profiler_helper.h @@ -35,6 +35,9 @@ limitations under the License. 
*/ #include #endif +#include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/platform/device/gpu/gpu_info.h" + namespace paddle { namespace platform { @@ -134,6 +137,10 @@ void SynchronizeAllDevice() { #endif } +static double ToMegaBytes(size_t bytes) { + return static_cast(bytes) / (1 << 20); +} + // Print results void PrintMemProfiler( const std::map> @@ -144,6 +151,37 @@ void PrintMemProfiler( << " Memory Profiling Report " << "<-------------------------\n\n"; +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) + int num_gpus = GetGPUDeviceCount(); + std::cout.setf(std::ios::left); + if (num_gpus > 0) { + std::cout << "GPU Memory Usage (MB):\n"; + for (int dev_id = 0; dev_id < num_gpus; ++dev_id) { + int64_t allocated = memory::StatGetCurrentValue("Allocated", dev_id); + int64_t reserved = memory::StatGetCurrentValue("Reserved", dev_id); + size_t available = 0, total = 0, actual_available = 0, actual_total = 0; + RecordedGpuMemGetInfo(&available, &total, &actual_available, + &actual_total, dev_id); + + std::ostringstream system_gpu_memory; + system_gpu_memory << "System GPU Memory (gpu:" << dev_id << ")"; + std::cout << " " << std::setw(30) << system_gpu_memory.str() + << "Total: " << std::setw(12) << ToMegaBytes(total) + << "Allocated: " << std::setw(12) + << ToMegaBytes(total - available) << "Free: " << std::setw(12) + << ToMegaBytes(available) << "\n"; + std::ostringstream software_memory_pool; + software_memory_pool << "Software Memory Pool (gpu:" << dev_id << ")"; + std::cout << " " << std::setw(30) << software_memory_pool.str() + << "Total: " << std::setw(12) << ToMegaBytes(reserved) + << "Allocated: " << std::setw(12) + << ToMegaBytes(reserved - allocated) + << "Free: " << std::setw(12) << ToMegaBytes(allocated) << "\n"; + } + std::cout << "\n"; + } +#endif + // Output events table std::cout.setf(std::ios::left); std::cout << std::setw(name_width) << "Event" << std::setw(data_width) diff --git a/paddle/fluid/pybind/CMakeLists.txt b/paddle/fluid/pybind/CMakeLists.txt index 52af9bb23680b..f8e7081de01bd 100644 --- a/paddle/fluid/pybind/CMakeLists.txt +++ b/paddle/fluid/pybind/CMakeLists.txt @@ -91,12 +91,18 @@ if(NOT ON_INFER) set (PYBIND_DEPS ${PYBIND_DEPS} processgroup eager_reducer) if (WITH_NCCL) set (PYBIND_DEPS ${PYBIND_DEPS} processgroup_nccl) + if (WITH_PSCORE) + set (PYBIND_DEPS ${PYBIND_DEPS} processgroup_heter) + endif() endif() if (WITH_GLOO) set (PYBIND_DEPS ${PYBIND_DEPS} processgroup_gloo) endif() if(WITH_ASCEND_CL) set (PYBIND_DEPS ${PYBIND_DEPS} processgroup_hccl) + if (WITH_PSCORE) + set (PYBIND_DEPS ${PYBIND_DEPS} processgroup_heter) + endif() endif() set(PYBIND_SRCS ${PYBIND_SRCS} distributed_py.cc) endif() @@ -369,10 +375,6 @@ if(WITH_PYTHON) target_link_libraries(paddle_pybind ${ROCM_HIPRTC_LIB}) endif() - if(WITH_IPU) - target_link_libraries(paddle_pybind paddle_ipu) - endif() - get_property (os_dependency_modules GLOBAL PROPERTY OS_DEPENDENCY_MODULES) target_link_libraries(paddle_pybind ${os_dependency_modules}) add_dependencies(paddle_pybind op_function_generator_cmd) diff --git a/paddle/fluid/pybind/distributed_py.cc b/paddle/fluid/pybind/distributed_py.cc index 6c74ea2eef4d0..38ed1d4f2bb5d 100644 --- a/paddle/fluid/pybind/distributed_py.cc +++ b/paddle/fluid/pybind/distributed_py.cc @@ -39,6 +39,11 @@ limitations under the License. 
*/ #include "paddle/fluid/distributed/collective/ProcessGroupHCCL.h" #endif +#if defined(PADDLE_WITH_GLOO) && defined(PADDLE_WITH_PSCORE) && \ + (defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_ASCEND_CL)) +#include "paddle/fluid/distributed/collective/ProcessGroupHeter.h" +#endif + #if defined(PADDLE_WITH_GLOO) #include "paddle/fluid/distributed/collective/ProcessGroupGloo.h" #include "paddle/fluid/distributed/store/tcp_store.h" @@ -217,6 +222,21 @@ void BindDistributed(py::module *m) { int>(), py::arg("store"), py::arg("rank"), py::arg("world_size"), py::arg("group_id") = 0, py::call_guard()); + +#if defined(PADDLE_WITH_GLOO) && defined(PADDLE_WITH_PSCORE) && \ + (defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_ASCEND_CL)) + py::class_>( + *m, "ProcessGroupHeter", ProcessGroup) + .def(py::init &, int, int, int, + int, int, int, int, bool, std::string>(), + py::arg("store"), py::arg("rank"), py::arg("world_size"), + py::arg("gid") = 0, py::arg("local_rank") = 0, + py::arg("local_size") = 1, py::arg("gloo_rank") = 0, + py::arg("gloo_size") = 1, py::arg("with_switch") = false, + py::arg("switch_endpoint") = "", + py::call_guard()); +#endif #endif #if defined(PADDLE_WITH_ASCEND_CL) @@ -227,6 +247,21 @@ void BindDistributed(py::module *m) { int>(), py::arg("store"), py::arg("rank"), py::arg("world_size"), py::arg("group_id") = 0, py::call_guard()); + +#if defined(PADDLE_WITH_GLOO) && defined(PADDLE_WITH_PSCORE) && \ + (defined(PADDLE_WITH_NCCL) || defined(PADDLE_WITH_ASCEND_CL)) + py::class_>( + *m, "ProcessGroupHeter", ProcessGroup) + .def(py::init &, int, int, int, + int, int, int, int, bool, std::string>(), + py::arg("store"), py::arg("rank"), py::arg("world_size"), + py::arg("gid") = 0, py::arg("local_rank") = 0, + py::arg("local_size") = 1, py::arg("gloo_rank") = 0, + py::arg("gloo_rank") = 1, py::arg("with_switch") = false, + py::arg("switch_endpoint") = "", + py::call_guard()); +#endif #endif py::class_ dense_tensor = - std::make_shared( - phi::make_intrusive(place), - phi::DenseTensorMeta(paddle::framework::TransToPhiDataType(dtype), - ddims)); - if (phi::product(ddims) > 0) { - dense_tensor->mutable_data(place); + std::shared_ptr dense_tensor = nullptr; + if (dims.empty()) { + std::shared_ptr allocation_ptr = nullptr; + dense_tensor = std::make_shared( + allocation_ptr, + phi::DenseTensorMeta(paddle::framework::TransToPhiDataType(dtype), + ddims)); + } else { + // TODO(dev): we need enhance check for ddims. 
+ dense_tensor = std::make_shared( + phi::make_intrusive(place), + phi::DenseTensorMeta(paddle::framework::TransToPhiDataType(dtype), + ddims)); } self->tensor.set_impl(dense_tensor); + } else if (var_type == paddle::framework::proto::VarType::SELECTED_ROWS) { + std::shared_ptr tensor = + std::make_shared(); + self->tensor.set_impl(tensor); } if (!autograd_meta->GetMutableGradNode()) { @@ -92,6 +102,7 @@ void EmptyTensorInitializer(TensorObject* self, const std::string& name, } void InitTensorWithNumpyValue(TensorObject* self, const py::object& array, + const paddle::platform::Place& place, bool zero_copy = false) { PADDLE_ENFORCE_EQ( self->tensor.defined(), true, @@ -102,7 +113,6 @@ void InitTensorWithNumpyValue(TensorObject* self, const py::object& array, "eager tensor before init it with NumPy.")); phi::DenseTensor* impl_ptr = static_cast(self->tensor.impl().get()); - paddle::platform::Place place = impl_ptr->place(); if (platform::is_cpu_place(place)) { SetTensorFromPyArray(impl_ptr, array, place, zero_copy); } else if (platform::is_xpu_place(place)) { @@ -289,7 +299,7 @@ void AutoInitTensorByPyArray(TensorObject* py_tensor_ptr, EmptyTensorInitializer(py_tensor_ptr, act_name, place, persistable, stop_gradient); - InitTensorWithNumpyValue(py_tensor_ptr, numpy_value, zero_copy); + InitTensorWithNumpyValue(py_tensor_ptr, numpy_value, place, zero_copy); } // initialize Tensor by Tensor or framework::Tensor (mix args and @@ -399,6 +409,7 @@ void AutoInitTensorByTensor(TensorObject* py_tensor_ptr, * ** name: std::string) * **/ int TensorInit(PyObject* self, PyObject* args, PyObject* kwargs) { + EAGER_TRY // set a flag to record use kwargs or not bool flag_kwargs = false; if (kwargs) flag_kwargs = true; @@ -693,7 +704,8 @@ int TensorInit(PyObject* self, PyObject* args, PyObject* kwargs) { "make sure u call the existed constructor.")); } - return 1; + return -1; + EAGER_CATCH_AND_THROW_RETURN_NEG } static void TensorDealloc(TensorObject* self) { diff --git a/paddle/fluid/pybind/eager_functions.cc b/paddle/fluid/pybind/eager_functions.cc index 985d0ef0beb76..fb115455357dd 100644 --- a/paddle/fluid/pybind/eager_functions.cc +++ b/paddle/fluid/pybind/eager_functions.cc @@ -26,13 +26,17 @@ limitations under the License. 
*/ #include "paddle/fluid/framework/convert_utils.h" #include "paddle/fluid/framework/custom_operator.h" #include "paddle/fluid/framework/op_meta_info_helper.h" +#include "paddle/fluid/framework/python_headers.h" #include "paddle/fluid/memory/allocation/allocator.h" #include "paddle/fluid/memory/memcpy.h" +#include "paddle/fluid/platform/device/gpu/gpu_info.h" #include "paddle/fluid/platform/dynload/dynamic_loader.h" #include "paddle/fluid/platform/enforce.h" +#include "paddle/fluid/platform/stream/cuda_stream.h" #include "paddle/fluid/pybind/eager.h" #include "paddle/fluid/pybind/eager_utils.h" #include "paddle/fluid/pybind/exception.h" +#include "paddle/fluid/pybind/tensor_py.h" #include "paddle/phi/api/ext/op_meta_info.h" #include "paddle/phi/api/lib/utils/allocator.h" #include "paddle/phi/api/lib/utils/storage.h" @@ -536,6 +540,286 @@ static PyObject* eager_api_sparse_csr_tensor(PyObject* self, PyObject* args, return ToPyObject(tensor); EAGER_CATCH_AND_THROW_RETURN_NULL } +#if defined(PADDLE_WITH_CUDA) +static PyObject* eager_api_async_read(PyObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + auto& src = GetTensorFromArgs("async_read", "src", args, 0, false); + auto& dst = GetTensorFromArgs("async_read", "dst", args, 1, false); + auto& index = GetTensorFromArgs("async_read", "index", args, 2, false); + auto& buffer = GetTensorFromArgs("async_read", "buffer", args, 3, false); + auto& offset = GetTensorFromArgs("async_read", "offset", args, 4, false); + auto& count = GetTensorFromArgs("async_read", "count", args, 5, false); + PADDLE_ENFORCE_EQ( + src.is_gpu_pinned(), true, + platform::errors::InvalidArgument("Required `src` device should be " + "CUDAPinnedPlace, but received %d.", + src.inner_place())); + PADDLE_ENFORCE_EQ( + dst.is_gpu(), true, + platform::errors::InvalidArgument( + "Required `dst` device should be CUDAPlace, but received %d.", + dst.inner_place())); + PADDLE_ENFORCE_EQ( + index.is_cpu(), true, + platform::errors::InvalidArgument( + "Required `index` device should be CPUPlace, but received %d.", + index.inner_place())); + PADDLE_ENFORCE_EQ(buffer.is_gpu_pinned(), true, + platform::errors::InvalidArgument( + "Required `buffer` device should be CUDAPinnedPlace, " + "but received %d.", + buffer.inner_place())); + PADDLE_ENFORCE_EQ( + offset.is_cpu(), true, + platform::errors::InvalidArgument( + "Required `offset` device should be CPUPlace, but received %d.", + offset.inner_place())); + PADDLE_ENFORCE_EQ( + count.is_cpu(), true, + platform::errors::InvalidArgument( + "Required `count` device should be CPUPlace, but received %d.", + count.inner_place())); + + auto& src_tensor = src; + auto* dst_tensor = &dst; + auto& index_tensor = index; + auto* buffer_tensor = &buffer; + auto& offset_tensor = offset; + auto& count_tensor = count; + auto* dst_data = dst_tensor->mutable_data(dst.place()); + const auto& deviceId = paddle::platform::GetCurrentDeviceId(); + + PADDLE_ENFORCE_EQ(src_tensor.dims().size(), dst_tensor->dims().size(), + platform::errors::InvalidArgument( + "`src` and `dst` should have same tensor shape, " + "except for the first dimension.")); + PADDLE_ENFORCE_EQ(src_tensor.dims().size(), buffer_tensor->dims().size(), + platform::errors::InvalidArgument( + "`src` and `buffer` should have same tensor shape, " + "except for the first dimension.")); + for (int i = 1; i < src_tensor.dims().size(); i++) { + PADDLE_ENFORCE_EQ(src_tensor.dims()[i], dst_tensor->dims()[i], + platform::errors::InvalidArgument( + "`src` and `dst` should have the same tensor 
shape, " + "except for the first dimension.")); + PADDLE_ENFORCE_EQ( + src_tensor.dims()[i], buffer_tensor->dims()[i], + platform::errors::InvalidArgument( + "`src` and `buffer` should have the same tensor shape, " + "except for the first dimension.")); + } + PADDLE_ENFORCE_EQ(index_tensor.dims().size(), 1, + platform::errors::InvalidArgument( + "`index` tensor should be one-dimensional.")); + + auto stream = + paddle::platform::stream::get_current_stream(deviceId)->raw_stream(); + + int64_t numel = 0; // total copy length + int64_t copy_flag = offset_tensor.dims()[0]; + int64_t size = src_tensor.numel() / src_tensor.dims()[0]; + + if (copy_flag != 0) { + PADDLE_ENFORCE_EQ(offset_tensor.dims().size(), 1, + platform::errors::InvalidArgument( + "`offset` tensor should be one-dimensional.")); + PADDLE_ENFORCE_EQ(count_tensor.dims().size(), 1, + platform::errors::InvalidArgument( + "`count` tensor should be one-dimensional.")); + PADDLE_ENFORCE_EQ(offset_tensor.numel(), count_tensor.numel(), + platform::errors::InvalidArgument( + "`offset` and `count` tensor size dismatch.")); + auto* offset_data = offset_tensor.data(); + auto* count_data = count_tensor.data(); + for (int64_t i = 0; i < count_tensor.numel(); i++) { + numel += count_data[i]; + } + PADDLE_ENFORCE_LE( + numel + index_tensor.numel(), buffer_tensor->dims()[0], + platform::errors::InvalidArgument("Buffer tensor size is too small.")); + PADDLE_ENFORCE_LE( + numel + index_tensor.numel(), dst_tensor->dims()[0], + platform::errors::InvalidArgument("Target tensor size is too small.")); + + int64_t src_offset, dst_offset = 0, c; + auto* src_data = src_tensor.data(); + for (int64_t i = 0; i < offset_tensor.numel(); i++) { + src_offset = offset_data[i], c = count_data[i]; + PADDLE_ENFORCE_LE( + src_offset + c, src_tensor.dims()[0], + platform::errors::InvalidArgument("Invalid offset or count index.")); + PADDLE_ENFORCE_LE( + dst_offset + c, dst_tensor->dims()[0], + platform::errors::InvalidArgument("Invalid offset or count index.")); + cudaMemcpyAsync(dst_data + (dst_offset * size), + src_data + (src_offset * size), c * size * sizeof(float), + cudaMemcpyHostToDevice, stream); + dst_offset += c; + } + } else { + PADDLE_ENFORCE_LE( + index_tensor.numel(), buffer_tensor->dims()[0], + platform::errors::InvalidArgument("Buffer tensor size is too small.")); + } + + // Select the index data to the buffer + auto index_select = [](const paddle::experimental::Tensor& src_tensor, + const paddle::experimental::Tensor& index_tensor, + paddle::experimental::Tensor* buffer_tensor) { + auto* src_data = src_tensor.data(); + auto* index_data = index_tensor.data(); + auto* buffer_data = buffer_tensor->data(); + const int& slice_size = src_tensor.numel() / src_tensor.dims()[0]; + const int& copy_bytes = slice_size * sizeof(float); + int64_t c = 0; + for (int64_t i = 0; i < index_tensor.numel(); i++) { + std::memcpy(buffer_data + c * slice_size, + src_data + index_data[i] * slice_size, copy_bytes); + c += 1; + } + }; + index_select(src_tensor, index_tensor, buffer_tensor); + + // Copy the data to device memory + cudaMemcpyAsync(dst_data + (numel * size), buffer_tensor->data(), + index_tensor.numel() * size * sizeof(float), + cudaMemcpyHostToDevice, stream); + Py_INCREF(Py_None); + return Py_None; + EAGER_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* eager_api_async_write(PyObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + auto& src = GetTensorFromArgs("async_write", "src", args, 0, false); + auto& dst = GetTensorFromArgs("async_write", 
"dst", args, 1, false); + auto& offset = GetTensorFromArgs("async_write", "offset", args, 2, false); + auto& count = GetTensorFromArgs("async_write", "count", args, 3, false); + PADDLE_ENFORCE_EQ( + src.is_gpu(), true, + platform::errors::InvalidArgument( + "Required `src` device should be CUDAPlace, but received %d. ", + src.inner_place())); + PADDLE_ENFORCE_EQ(dst.is_gpu_pinned(), true, + platform::errors::InvalidArgument( + "Required `dst` device should be CUDAPinnedPlace, " + "but received %d. ", + dst.inner_place())); + PADDLE_ENFORCE_EQ( + offset.is_cpu(), true, + platform::errors::InvalidArgument("Required `offset` device should " + "be CPUPlace, but received %d. ", + offset.inner_place())); + PADDLE_ENFORCE_EQ( + count.is_cpu(), true, + platform::errors::InvalidArgument( + "Required `count` device should be CPUPlace, but received %d. ", + count.inner_place())); + + // TODO(daisiming): In future, add index as arguments following + // async_read. + auto& src_tensor = src; + auto* dst_tensor = &dst; + auto& offset_tensor = offset; + auto& count_tensor = count; + const auto& deviceId = paddle::platform::GetCurrentDeviceId(); + + PADDLE_ENFORCE_EQ(offset_tensor.dims().size(), 1, + platform::errors::InvalidArgument( + "`offset` tensor should be one-dimensional.")); + PADDLE_ENFORCE_EQ(count_tensor.dims().size(), 1, + platform::errors::InvalidArgument( + "`count` tensor should be one-dimensional.")); + PADDLE_ENFORCE_EQ(offset_tensor.numel(), count_tensor.numel(), + platform::errors::InvalidArgument( + "`offset` and `count` tensor size dismatch.")); + PADDLE_ENFORCE_EQ(src_tensor.dims().size(), dst_tensor->dims().size(), + platform::errors::InvalidArgument( + "`src` and `dst` should have the same tensor shape, " + "except for the first dimension.")); + for (int i = 1; i < src_tensor.dims().size(); i++) { + PADDLE_ENFORCE_EQ(src_tensor.dims()[i], dst_tensor->dims()[i], + platform::errors::InvalidArgument( + "`src` and `dst` should have the same tensor shape, " + "except for the first dimension.")); + } + + auto stream = + paddle::platform::stream::get_current_stream(deviceId)->raw_stream(); + + int64_t size = src_tensor.numel() / src_tensor.dims()[0]; + auto* src_data = src_tensor.data(); + auto* dst_data = dst_tensor->data(); + const int64_t* offset_data = offset_tensor.data(); + const int64_t* count_data = count_tensor.data(); + int64_t src_offset = 0, dst_offset, c; + for (int64_t i = 0; i < offset_tensor.numel(); i++) { + dst_offset = offset_data[i], c = count_data[i]; + PADDLE_ENFORCE_LE( + src_offset + c, src_tensor.dims()[0], + platform::errors::InvalidArgument("Invalid offset or count index")); + PADDLE_ENFORCE_LE( + dst_offset + c, dst_tensor->dims()[0], + platform::errors::InvalidArgument("Invalid offset or count index")); + cudaMemcpyAsync(dst_data + (dst_offset * size), + src_data + (src_offset * size), c * size * sizeof(float), + cudaMemcpyDeviceToHost, stream); + src_offset += c; + } + Py_INCREF(Py_None); + return Py_None; + EAGER_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* eager_api_to_uva_tensor(PyObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + VLOG(4) << "Running in eager_api_to_uva_tensor."; + auto new_tensor = std::shared_ptr( + new paddle::experimental::Tensor( + egr::Controller::Instance().GenerateUniqueName())); + PyObject* obj = PyTuple_GET_ITEM(args, 0); + auto array = py::cast(py::handle(obj)); + + int device_id = 0; + PyObject* Py_device_id = PyTuple_GET_ITEM(args, 1); + if (Py_device_id) { + device_id = CastPyArg2AttrLong(Py_device_id, 
1); + } + + if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, + device_id); + } else if (py::isinstance>(array)) { + SetUVATensorFromPyArray(new_tensor, array, device_id); + } else { + // obj may be any type, obj.cast() may be failed, + // then the array.dtype will be string of unknown meaning. + PADDLE_THROW(platform::errors::InvalidArgument( + "Input object type error or incompatible array data type. " + "tensor.set() supports array with bool, float16, float32, " + "float64, int8, int16, int32, int64," + "please check your input or input array data type.")); + } + + return ToPyObject(*(new_tensor.get())); + EAGER_CATCH_AND_THROW_RETURN_NULL +} +#endif PyMethodDef variable_functions[] = { // TODO(jiabin): Remove scale when we have final state tests @@ -560,7 +844,15 @@ PyMethodDef variable_functions[] = { {"sparse_csr_tensor", (PyCFunction)(void (*)(void))eager_api_sparse_csr_tensor, METH_VARARGS | METH_KEYWORDS, NULL}, - /**sparse functions**/ +/**sparse functions**/ +#if defined(PADDLE_WITH_CUDA) + {"async_read", (PyCFunction)(void (*)(void))eager_api_async_read, + METH_VARARGS | METH_KEYWORDS, NULL}, + {"async_write", (PyCFunction)(void (*)(void))eager_api_async_write, + METH_VARARGS | METH_KEYWORDS, NULL}, + {"to_uva_tensor", (PyCFunction)(void (*)(void))eager_api_to_uva_tensor, + METH_VARARGS | METH_KEYWORDS, NULL}, +#endif {NULL, NULL, 0, NULL}}; void BindFunctions(PyObject* module) { diff --git a/paddle/fluid/pybind/eager_method.cc b/paddle/fluid/pybind/eager_method.cc index 37ace14d145c6..021899c5f3782 100644 --- a/paddle/fluid/pybind/eager_method.cc +++ b/paddle/fluid/pybind/eager_method.cc @@ -32,6 +32,7 @@ limitations under the License. */ #include "paddle/fluid/pybind/eager_utils.h" #include "paddle/fluid/pybind/exception.h" #include "paddle/fluid/pybind/slice_utils.h" +#include "paddle/fluid/pybind/uva_utils.h" #include "paddle/phi/api/include/api.h" #include "paddle/phi/common/data_type.h" #include "paddle/phi/core/compat/convert_utils.h" @@ -41,7 +42,9 @@ limitations under the License. 
*/ #include "pybind11/detail/internals.h" #pragma GCC diagnostic ignored "-Wmissing-field-initializers" #include "paddle/fluid/framework/python_headers.h" +#include "paddle/fluid/memory/allocation/mmap_allocator.h" #include "paddle/fluid/pybind/tensor_py.h" +#include "paddle/phi/core/ddim.h" namespace paddle { namespace pybind { @@ -124,6 +127,7 @@ class PyTensorVoidHook : public egr::TensorVoidHook { extern void InitTensorWithNumpyValue(TensorObject* self, const pybind11::object& array, + const paddle::platform::Place& place, bool zero_copy); extern PyTypeObject* p_tensor_type; @@ -330,17 +334,22 @@ static PyObject* tensor_method_copy_(TensorObject* self, PyObject* args, bool blocking = CastPyArg2AttrBoolean(PyTuple_GET_ITEM(args, 1), 1); VLOG(6) << "Start Copy Tensor " << src_tensor.name() << " to " << self->tensor.name(); - if (!self->tensor.defined()) { + if (!self->tensor.initialized()) { egr::EagerUtils::autograd_meta(&(self->tensor)) ->SetStopGradient( egr::EagerUtils::autograd_meta(&(src_tensor))->StopGradient()); egr::EagerUtils::autograd_meta(&(self->tensor)) ->SetPersistable( egr::EagerUtils::autograd_meta(&(src_tensor))->Persistable()); + if (src_tensor.initialized()) { + self->tensor.copy_(src_tensor, src_tensor.inner_place(), blocking); + } + } else { + if (src_tensor.initialized()) { + self->tensor.copy_(src_tensor, self->tensor.inner_place(), blocking); + } } - self->tensor.copy_(src_tensor, self->tensor.inner_place(), blocking); - VLOG(6) << "Finish Copy Tensor " << src_tensor.name() << " to " << self->tensor.name(); Py_INCREF(Py_None); @@ -459,6 +468,9 @@ static PyObject* tensor__share_buffer_to(TensorObject* self, PyObject* args, self->tensor.name())); auto* src_tensor = static_cast(self->tensor.impl().get()); + if (!dst_ptr->defined()) { + dst_ptr->set_impl(std::make_shared()); + } auto dst_tensor = static_cast(dst_ptr->impl().get()); dst_tensor->ShareDataWith(*src_tensor); @@ -559,6 +571,10 @@ static PyObject* tensor_method_get_underline_tensor(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY + if (!self->tensor.defined()) { + Py_IncRef(Py_None); + return Py_None; + } if (self->tensor.is_dense_tensor()) { auto* tensor = static_cast(self->tensor.impl().get()); @@ -571,6 +587,25 @@ static PyObject* tensor_method_get_underline_tensor(TensorObject* self, EAGER_CATCH_AND_THROW_RETURN_NULL } +static PyObject* tensor_method_get_underline_selected_rows(TensorObject* self, + PyObject* args, + PyObject* kwargs) { + EAGER_TRY + if (!self->tensor.defined()) { + Py_IncRef(Py_None); + return Py_None; + } + if (self->tensor.is_selected_rows()) { + auto* selected_rows = + static_cast(self->tensor.impl().get()); + return ToPyObject(selected_rows); + } else { + Py_IncRef(Py_None); + return Py_None; + } + EAGER_CATCH_AND_THROW_RETURN_NULL +} + static PyObject* tensor__getitem_index_not_tensor(TensorObject* self, PyObject* args, PyObject* kwargs) { @@ -1208,6 +1243,9 @@ static PyObject* tensor_method_get_non_zero_cols(TensorObject* self, static PyObject* tensor_method_is_sparse(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY + if (!self->tensor.defined()) { + return ToPyObject(false); + } return ToPyObject(self->tensor.is_sparse_coo_tensor() || self->tensor.is_sparse_csr_tensor()); EAGER_CATCH_AND_THROW_RETURN_NULL @@ -1216,6 +1254,9 @@ static PyObject* tensor_method_is_sparse(TensorObject* self, PyObject* args, static PyObject* tensor_method_is_sparse_coo(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY + if (!self->tensor.defined()) { 
+ return ToPyObject(false); + } return ToPyObject(self->tensor.is_sparse_coo_tensor()); EAGER_CATCH_AND_THROW_RETURN_NULL } @@ -1223,25 +1264,13 @@ static PyObject* tensor_method_is_sparse_coo(TensorObject* self, PyObject* args, static PyObject* tensor_method_is_sparse_csr(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY + if (!self->tensor.defined()) { + return ToPyObject(false); + } return ToPyObject(self->tensor.is_sparse_csr_tensor()); EAGER_CATCH_AND_THROW_RETURN_NULL } -static PyObject* tensor_method_to_sparse_coo(TensorObject* self, PyObject* args, - PyObject* kwargs) { - EAGER_TRY - int64_t sparse_dim = CastPyArg2AttrLong(PyTuple_GET_ITEM(args, 0), 0); - auto coo_tensor = self->tensor.to_sparse_coo(sparse_dim); - egr::EagerUtils::autograd_meta(&coo_tensor) - ->SetStopGradient( - egr::EagerUtils::autograd_meta(&self->tensor)->StopGradient()); - egr::EagerUtils::autograd_meta(&coo_tensor) - ->SetPersistable( - egr::EagerUtils::autograd_meta(&(self->tensor))->Persistable()); - return ToPyObject(coo_tensor); - EAGER_CATCH_AND_THROW_RETURN_NULL -} - static PyObject* tensor_method_to_sparse_csr(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY @@ -1256,20 +1285,6 @@ static PyObject* tensor_method_to_sparse_csr(TensorObject* self, PyObject* args, EAGER_CATCH_AND_THROW_RETURN_NULL } -static PyObject* tensor_method_to_dense(TensorObject* self, PyObject* args, - PyObject* kwargs) { - EAGER_TRY - auto dense_tensor = self->tensor.to_dense(); - egr::EagerUtils::autograd_meta(&dense_tensor) - ->SetStopGradient( - egr::EagerUtils::autograd_meta(&self->tensor)->StopGradient()); - egr::EagerUtils::autograd_meta(&dense_tensor) - ->SetPersistable( - egr::EagerUtils::autograd_meta(&(self->tensor))->Persistable()); - return ToPyObject(dense_tensor); - EAGER_CATCH_AND_THROW_RETURN_NULL -} - static PyObject* tensor__inplace_version(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY @@ -1279,6 +1294,15 @@ static PyObject* tensor__inplace_version(TensorObject* self, PyObject* args, EAGER_CATCH_AND_THROW_RETURN_NULL } +static PyObject* tensor_method_element_size(TensorObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + uint32_t element_size = framework::DataTypeSize(self->tensor.dtype()); + + return ToPyObject(element_size); + EAGER_CATCH_AND_THROW_RETURN_NULL +} + static PyObject* tensor__bump_inplace_version(TensorObject* self, PyObject* args, PyObject* kwargs) { @@ -1292,6 +1316,9 @@ static PyObject* tensor_method_is_selected_rows(TensorObject* self, PyObject* args, PyObject* kwargs) { EAGER_TRY + if (!self->tensor.defined()) { + return ToPyObject(false); + } return ToPyObject(self->tensor.is_selected_rows()); EAGER_CATCH_AND_THROW_RETURN_NULL } @@ -1308,6 +1335,13 @@ static PyObject* tensor_method_get_rows(TensorObject* self, PyObject* args, EAGER_CATCH_AND_THROW_RETURN_NULL } +static PyObject* tensor_methon_element_size(TensorObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + return ToPyObject(paddle::experimental::SizeOf(self->tensor.dtype())); + EAGER_CATCH_AND_THROW_RETURN_NULL +} + static PyObject* tensor__reset_grad_inplace_version(TensorObject* self, PyObject* args, PyObject* kwargs) { @@ -1329,6 +1363,77 @@ static PyObject* tensor__reset_grad_inplace_version(TensorObject* self, EAGER_CATCH_AND_THROW_RETURN_NULL } +static PyObject* tensor_method__share_memory(TensorObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY +#ifndef _WIN32 + PADDLE_ENFORCE_EQ(platform::is_cpu_place(self->tensor.inner_place()), true, 
+ platform::errors::InvalidArgument( + "Sharing memory only support CPU Tensor currently")); + // 1. get LoDTensor + auto* t = + std::dynamic_pointer_cast(self->tensor.impl()).get(); + // 2. allocate shared memory + void* data_ptr = t->data(); + size_t data_size = + t->numel() * + framework::SizeOfType(framework::TransToProtoVarType(t->dtype())); + auto shared_writer_holder = + memory::allocation::AllocateMemoryMapWriterAllocation(data_size); + // 3. maintain mmap fd set & backup ipc_name + const std::string& ipc_name = shared_writer_holder->ipc_name(); + memory::allocation::MemoryMapFdSet::Instance().Insert(ipc_name); + // 4. copy data & reset holder + memory::Copy(platform::CPUPlace(), shared_writer_holder->ptr(), + platform::CPUPlace(), data_ptr, data_size); + t->ResetHolder(shared_writer_holder); + return ToPyObject(t); +#else + PADDLE_THROW(platform::errors::PermissionDenied( + "Sharing memory in Windows OS is not supported currently")); + Py_INCREF(Py_None); + return Py_None; +#endif + EAGER_CATCH_AND_THROW_RETURN_NULL +} + +static PyObject* tensor__offset(TensorObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + auto t = std::dynamic_pointer_cast(self->tensor.impl()); + PADDLE_ENFORCE_EQ( + t->IsInitialized(), true, + platform::errors::InvalidArgument("Tensor %s has not been initialized!", + self->tensor.name())); + + return ToPyObject(t->offset()); + EAGER_CATCH_AND_THROW_RETURN_NULL +} + +#if defined(PADDLE_WITH_CUDA) +static PyObject* tensor_method__uva(TensorObject* self, PyObject* args, + PyObject* kwargs) { + EAGER_TRY + VLOG(4) << "Running in tensor_method__uva."; + PADDLE_ENFORCE_EQ(self->tensor.is_dense_tensor(), true, + platform::errors::InvalidArgument( + "Unified virtual addressing only support " + "DenseTensor currently.")); + PADDLE_ENFORCE_EQ(platform::is_cpu_place(self->tensor.inner_place()), true, + platform::errors::InvalidArgument( + "Unified virtual addressing only support " + "CPU Tensor currently.")); + int device_id = pybind::CastPyArg2AttrLong(PyTuple_GET_ITEM(args, 0), 0); + auto* self_tensor = + static_cast(self->tensor.impl().get()); + tensor_uva(self_tensor, device_id); + + Py_INCREF(Py_None); + return Py_None; + EAGER_CATCH_AND_THROW_RETURN_NULL +} +#endif + PyMethodDef variable_methods[] = { {"numpy", (PyCFunction)(void (*)(void))tensor_method_numpy, METH_VARARGS | METH_KEYWORDS, NULL}, @@ -1368,6 +1473,9 @@ PyMethodDef variable_methods[] = { {"get_tensor", (PyCFunction)(void (*)(void))tensor_method_get_underline_tensor, METH_VARARGS | METH_KEYWORDS, NULL}, + {"get_selected_rows", + (PyCFunction)(void (*)(void))tensor_method_get_underline_selected_rows, + METH_VARARGS | METH_KEYWORDS, NULL}, {"_getitem_index_not_tensor", (PyCFunction)(void (*)(void))tensor__getitem_index_not_tensor, METH_VARARGS | METH_KEYWORDS, NULL}, @@ -1393,17 +1501,13 @@ PyMethodDef variable_methods[] = { (PyCFunction)(void (*)(void))tensor__copy_gradient_from, METH_VARARGS | METH_KEYWORDS, NULL}, /***the method of sparse tensor****/ - {"non_zero_indices", - (PyCFunction)(void (*)(void))tensor_method_get_non_zero_indices, + {"indices", (PyCFunction)(void (*)(void))tensor_method_get_non_zero_indices, METH_VARARGS | METH_KEYWORDS, NULL}, - {"non_zero_elements", - (PyCFunction)(void (*)(void))tensor_method_get_non_zero_elements, + {"values", (PyCFunction)(void (*)(void))tensor_method_get_non_zero_elements, METH_VARARGS | METH_KEYWORDS, NULL}, - {"non_zero_crows", - (PyCFunction)(void (*)(void))tensor_method_get_non_zero_crows, + {"crows", (PyCFunction)(void 
(*)(void))tensor_method_get_non_zero_crows, METH_VARARGS | METH_KEYWORDS, NULL}, - {"non_zero_cols", - (PyCFunction)(void (*)(void))tensor_method_get_non_zero_cols, + {"cols", (PyCFunction)(void (*)(void))tensor_method_get_non_zero_cols, METH_VARARGS | METH_KEYWORDS, NULL}, {"is_sparse", (PyCFunction)(void (*)(void))tensor_method_is_sparse, METH_VARARGS | METH_KEYWORDS, NULL}, @@ -1411,11 +1515,9 @@ PyMethodDef variable_methods[] = { METH_VARARGS | METH_KEYWORDS, NULL}, {"is_sparse_csr", (PyCFunction)(void (*)(void))tensor_method_is_sparse_csr, METH_VARARGS | METH_KEYWORDS, NULL}, - {"to_sparse_coo", (PyCFunction)(void (*)(void))tensor_method_to_sparse_coo, - METH_VARARGS | METH_KEYWORDS, NULL}, {"to_sparse_csr", (PyCFunction)(void (*)(void))tensor_method_to_sparse_csr, METH_VARARGS | METH_KEYWORDS, NULL}, - {"to_dense", (PyCFunction)(void (*)(void))tensor_method_to_dense, + {"element_size", (PyCFunction)(void (*)(void))tensor_method_element_size, METH_VARARGS | METH_KEYWORDS, NULL}, /***the method of sparse tensor****/ {"_inplace_version", (PyCFunction)(void (*)(void))tensor__inplace_version, @@ -1428,9 +1530,19 @@ PyMethodDef variable_methods[] = { METH_VARARGS | METH_KEYWORDS, NULL}, {"rows", (PyCFunction)(void (*)(void))tensor_method_get_rows, METH_VARARGS | METH_KEYWORDS, NULL}, + {"element_size", (PyCFunction)(void (*)(void))tensor_methon_element_size, + METH_VARARGS | METH_KEYWORDS, NULL}, {"_reset_grad_inplace_version", (PyCFunction)(void (*)(void))tensor__reset_grad_inplace_version, METH_VARARGS | METH_KEYWORDS, NULL}, + {"_share_memory", (PyCFunction)(void (*)(void))tensor_method__share_memory, + METH_VARARGS | METH_KEYWORDS, NULL}, + {"_offset", (PyCFunction)(void (*)(void))tensor__offset, + METH_VARARGS | METH_KEYWORDS, NULL}, +#if defined(PADDLE_WITH_CUDA) + {"_tensor_uva", (PyCFunction)(void (*)(void))tensor_method__uva, + METH_VARARGS | METH_KEYWORDS, NULL}, +#endif {NULL, NULL, 0, NULL}}; } // namespace pybind diff --git a/paddle/fluid/pybind/eager_op_function_generator.cc b/paddle/fluid/pybind/eager_op_function_generator.cc index 685e20aef2591..06d88be9bc8cc 100644 --- a/paddle/fluid/pybind/eager_op_function_generator.cc +++ b/paddle/fluid/pybind/eager_op_function_generator.cc @@ -433,7 +433,7 @@ GenerateOpFunctions() { std::map inplace_map; // `sum` op has duplicate input. Don't consider adding inplace strategy // for `sum` in temporary. - if (op_type != "sum" && infer_inplace) { + if (infer_inplace && !special_inplace_op_set.count(op_type)) { // Inplace OP: op_type_. // The inplace OP needs a new implementation method. 
auto in_to_outs = infer_inplace(true); diff --git a/paddle/fluid/pybind/eager_properties.cc b/paddle/fluid/pybind/eager_properties.cc index d8c297b1a94c7..a72ea6c4b02e1 100644 --- a/paddle/fluid/pybind/eager_properties.cc +++ b/paddle/fluid/pybind/eager_properties.cc @@ -43,8 +43,14 @@ PyObject* tensor_properties_get_name(TensorObject* self, void* closure) { PyObject* tensor_properties_get_type(TensorObject* self, void* closure) { EAGER_TRY + if (!self->tensor.defined()) { + // be same to old dygraph + return ToPyObject(paddle::framework::proto::VarType::LOD_TENSOR); + } if (self->tensor.is_dense_tensor()) { return ToPyObject(paddle::framework::proto::VarType::LOD_TENSOR); + } else if (self->tensor.is_selected_rows()) { + return ToPyObject(paddle::framework::proto::VarType::SELECTED_ROWS); } else { Py_INCREF(Py_None); return Py_None; @@ -63,7 +69,7 @@ int tensor_properties_set_name(TensorObject* self, PyObject* value, EAGER_TRY self->tensor.set_name(CastPyArg2AttrString(value, 0)); return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } PyObject* tensor_properties_get_stop_gradient(TensorObject* self, @@ -104,7 +110,7 @@ int tensor_properties_set_grad(TensorObject* self, PyObject* value, "the grad inside autograd_meta")); grad->copy_(src, self->tensor.inner_place(), true); return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } int tensor_properties_set_stop_gradient(TensorObject* self, PyObject* value, @@ -116,7 +122,7 @@ int tensor_properties_set_stop_gradient(TensorObject* self, PyObject* value, meta->SetGradNode(std::make_shared(meta)); } return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } PyObject* tensor_properties_get_persistable(TensorObject* self, void* closure) { @@ -132,13 +138,16 @@ int tensor_properties_set_persistable(TensorObject* self, PyObject* value, auto meta = egr::EagerUtils::autograd_meta(&self->tensor); meta->SetPersistable(CastPyArg2AttrBoolean(value, 0)); return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } PyObject* tensor_properties_get_shape(TensorObject* self, void* closure) { EAGER_TRY - auto ddim = self->tensor.shape(); std::vector value; + if (!self->tensor.defined()) { + return ToPyObject(value); + } + auto ddim = self->tensor.shape(); size_t rank = static_cast(ddim.size()); value.resize(rank); for (size_t i = 0; i < rank; i++) { @@ -165,6 +174,10 @@ PyObject* tensor_properties_get_place_str(TensorObject* self, void* closure) { PyObject* tensor_properties_get_dtype(TensorObject* self, void* closure) { EAGER_TRY + if (!self->tensor.defined()) { + // be same to old dygraph + return ToPyObject(framework::proto::VarType::FP32); + } return ToPyObject( paddle::framework::TransToProtoVarType(self->tensor.type())); EAGER_CATCH_AND_THROW_RETURN_NULL diff --git a/paddle/fluid/pybind/eager_py_layer.cc b/paddle/fluid/pybind/eager_py_layer.cc index 59f21a1e1face..cade856b3607a 100644 --- a/paddle/fluid/pybind/eager_py_layer.cc +++ b/paddle/fluid/pybind/eager_py_layer.cc @@ -34,6 +34,8 @@ limitations under the License. 
*/ #include "paddle/phi/core/compat/convert_utils.h" #include "paddle/phi/core/dense_tensor.h" #include "pybind11/detail/internals.h" +#pragma GCC diagnostic ignored "-Wwrite-strings" +#pragma GCC diagnostic ignored "-Wmissing-field-initializers" namespace paddle { namespace pybind { @@ -393,7 +395,7 @@ int tensor_properties_set_container(PyLayerObject* self, PyObject* value, Py_XDECREF(self->container); self->container = value; return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } PyObject* tensor_properties_get_non_differentiable(PyLayerObject* self, @@ -415,7 +417,7 @@ int tensor_properties_set_non_differentiable(PyLayerObject* self, Py_XDECREF(self->non_differentiable); self->non_differentiable = value; return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } PyObject* tensor_properties_get_dirty_tensors(PyLayerObject* self, @@ -437,7 +439,7 @@ int tensor_properties_set_dirty_tensors(PyLayerObject* self, PyObject* value, Py_XDECREF(self->dirty_tensors); self->dirty_tensors = value; return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } int tensor_properties_set_materialize_grads(PyLayerObject* self, @@ -445,7 +447,7 @@ int tensor_properties_set_materialize_grads(PyLayerObject* self, EAGER_TRY self->materialize_grads = CastPyArg2AttrBoolean(value, 0); return 0; - EAGER_CATCH_AND_THROW_RETURN_ZERO + EAGER_CATCH_AND_THROW_RETURN_NEG } PyMethodDef pylayer_methods[] = { @@ -479,7 +481,7 @@ void BindEagerPyLayer(PyObject* module) { type->tp_dealloc = (destructor)PyLayerDealloc; type->tp_methods = pylayer_methods; type->tp_getset = pylayer_properties; - type->tp_new = PyLayerNew; + type->tp_new = (newfunc)PyLayerNew; Py_INCREF(&PyBaseObject_Type); type->tp_base = reinterpret_cast(&PyBaseObject_Type); type->tp_flags |= diff --git a/paddle/fluid/pybind/eager_utils.cc b/paddle/fluid/pybind/eager_utils.cc index bee3e27a55167..8baea3d0dbfe1 100644 --- a/paddle/fluid/pybind/eager_utils.cc +++ b/paddle/fluid/pybind/eager_utils.cc @@ -156,6 +156,17 @@ int64_t CastPyArg2AttrLong(PyObject* obj, ssize_t arg_pos) { } } +size_t CastPyArg2AttrSize_t(PyObject* obj, ssize_t arg_pos) { + if (PyObject_CheckLongOrConvertToLong(&obj)) { + return PyLong_AsSize_t(obj); + } else { + PADDLE_THROW(platform::errors::InvalidArgument( + "argument (position %d) must be " + "long, but got %s", + arg_pos + 1, (reinterpret_cast(obj->ob_type))->tp_name)); + } +} + float CastPyArg2AttrFloat(PyObject* obj, ssize_t arg_pos) { if (PyObject_CheckFloatOrConvertToFloat(&obj)) { return static_cast(PyFloat_AsDouble(obj)); @@ -213,6 +224,9 @@ std::vector CastPyArg2VectorOfTensor( if (PyObject_IsInstance(item, reinterpret_cast(p_tensor_type))) { result.emplace_back(reinterpret_cast(item)->tensor); + } else if (item == Py_None) { + // emplace empty Tensor for None + result.emplace_back(); } else { PADDLE_THROW(platform::errors::InvalidArgument( "argument (position %d) must be " @@ -229,6 +243,9 @@ std::vector CastPyArg2VectorOfTensor( if (PyObject_IsInstance(item, reinterpret_cast(p_tensor_type))) { result.emplace_back(reinterpret_cast(item)->tensor); + } else if (item == Py_None) { + // emplace empty Tensor for None + result.emplace_back(); } else { PADDLE_THROW(platform::errors::InvalidArgument( "argument (position %d) must be " @@ -291,6 +308,51 @@ std::vector CastPyArg2VectorOfInt(PyObject* obj, size_t arg_pos) { return result; } +std::vector CastPyArg2VectorOfSize_t(PyObject* obj, size_t arg_pos) { + std::vector result; + if (PyList_Check(obj)) { + 
Py_ssize_t len = PyList_Size(obj); + PyObject* item = nullptr; + for (Py_ssize_t i = 0; i < len; i++) { + item = PyList_GetItem(obj, i); + if (PyObject_CheckLongOrConvertToLong(&item)) { + result.emplace_back(PyLong_AsSize_t(item)); + } else { + PADDLE_THROW(platform::errors::InvalidArgument( + "argument (position %d) must be " + "list of int, but got %s at pos %d", + arg_pos + 1, + reinterpret_cast(item->ob_type)->tp_name, i)); + } + } + } else { + PADDLE_THROW(platform::errors::InvalidArgument( + "argument (position %d) must be " + "list, but got %s", + arg_pos + 1, reinterpret_cast(obj->ob_type)->tp_name)); + } + return result; +} + +std::vector> CastPyArg2VectorOfVectorOfSize_t( + PyObject* obj, size_t arg_pos) { + std::vector> result; + if (PyList_Check(obj)) { + Py_ssize_t len = PyList_Size(obj); + PyObject* item = nullptr; + for (Py_ssize_t i = 0; i < len; i++) { + item = PyList_GetItem(obj, i); + result.emplace_back(CastPyArg2VectorOfSize_t(item, arg_pos)); + } + } else { + PADDLE_THROW(platform::errors::InvalidArgument( + "argument (position %d) must be " + "list but got %s", + arg_pos + 1, reinterpret_cast(obj->ob_type)->tp_name)); + } + return result; +} + platform::Place CastPyArg2Place(PyObject* obj, ssize_t arg_pos) { platform::Place place; if (PyObject_IsInstance(obj, reinterpret_cast(g_place_pytype))) { @@ -428,6 +490,8 @@ PyObject* ToPyObject(uint32_t value) { return PyLong_FromUnsignedLong(value); } PyObject* ToPyObject(int64_t value) { return PyLong_FromLongLong(value); } +PyObject* ToPyObject(size_t value) { return PyLong_FromSize_t(value); } + PyObject* ToPyObject(float value) { return PyLong_FromDouble(value); } PyObject* ToPyObject(double value) { return PyLong_FromDouble(value); } @@ -500,6 +564,16 @@ PyObject* ToPyObject(const std::vector& value) { return result; } +PyObject* ToPyObject(const std::vector& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, (Py_ssize_t)i, ToPyObject(value[i])); + } + + return result; +} + PyObject* ToPyObject(const std::vector& value) { PyObject* result = PyList_New((Py_ssize_t)value.size()); @@ -520,6 +594,16 @@ PyObject* ToPyObject(const std::vector& value) { return result; } +PyObject* ToPyObject(const std::vector>& value) { + PyObject* result = PyList_New((Py_ssize_t)value.size()); + + for (size_t i = 0; i < value.size(); i++) { + PyList_SET_ITEM(result, static_cast(i), ToPyObject(value[i])); + } + + return result; +} + PyObject* ToPyObject(const std::vector& value, bool return_py_none_if_not_initialize) { PyObject* result = PyList_New((Py_ssize_t)value.size()); @@ -569,6 +653,12 @@ PyObject* ToPyObject(const paddle::framework::LoDTensor* value) { return obj.ptr(); } +PyObject* ToPyObject(const phi::SelectedRows* value) { + auto obj = ::pybind11::cast(value, py::return_value_policy::reference); + obj.inc_ref(); + return obj.ptr(); +} + PyObject* ToPyObject(const void* value) { if (value == nullptr) { Py_INCREF(Py_None); @@ -933,7 +1023,7 @@ paddle::experimental::Scalar CastPyArg2Scalar(PyObject* obj, bool value = CastPyArg2Boolean(obj, op_type, arg_pos); return paddle::experimental::Scalar(value); - } else if (type_name == "paddle.Tensor") { + } else if (type_name == "Tensor") { paddle::experimental::Tensor& value = GetTensorFromPyObject( op_type, "" /*arg_name*/, obj, arg_pos, false /*dispensable*/); return paddle::experimental::Scalar(value); @@ -971,7 +1061,7 @@ paddle::experimental::IntArray CastPyArg2IntArray(PyObject* obj, std::vector 
value = CastPyArg2Ints(obj, op_type, arg_pos); return paddle::experimental::IntArray(value); - } else if (type_name == "paddle.Tensor") { + } else if (type_name == "paddle.Tensor" || type_name == "Tensor") { paddle::experimental::Tensor& value = GetTensorFromPyObject( op_type, "" /*arg_name*/, obj, arg_pos, false /*dispensable*/); return paddle::experimental::IntArray(value); diff --git a/paddle/fluid/pybind/eager_utils.h b/paddle/fluid/pybind/eager_utils.h index bd78342e21f4b..90c4d727923d0 100644 --- a/paddle/fluid/pybind/eager_utils.h +++ b/paddle/fluid/pybind/eager_utils.h @@ -36,6 +36,7 @@ bool PyObject_CheckStr(PyObject* obj); bool CastPyArg2AttrBoolean(PyObject* obj, ssize_t arg_pos); int CastPyArg2AttrInt(PyObject* obj, ssize_t arg_pos); int64_t CastPyArg2AttrLong(PyObject* obj, ssize_t arg_pos); +size_t CastPyArg2AttrSize_t(PyObject* obj, ssize_t arg_pos); float CastPyArg2AttrFloat(PyObject* obj, ssize_t arg_pos); std::string CastPyArg2AttrString(PyObject* obj, ssize_t arg_pos); paddle::CustomOpKernelContext CastPyArg2CustomOpKernelContext(PyObject* obj, @@ -50,6 +51,9 @@ framework::Tensor CastPyArg2FrameworkTensor(PyObject* obj, ssize_t arg_pos); std::vector CastPyArg2VectorOfTensorBase(PyObject* obj, ssize_t arg_pos); std::vector CastPyArg2VectorOfInt(PyObject* obj, size_t arg_pos); +std::vector CastPyArg2VectorOfSize_t(PyObject* obj, size_t arg_pos); +std::vector> CastPyArg2VectorOfVectorOfSize_t( + PyObject* obj, size_t arg_pos); framework::proto::VarType::Type CastPyArg2ProtoType(PyObject* obj, ssize_t arg_pos); @@ -57,6 +61,7 @@ PyObject* ToPyObject(int value); PyObject* ToPyObject(uint32_t value); PyObject* ToPyObject(bool value); PyObject* ToPyObject(int64_t value); +PyObject* ToPyObject(size_t value); PyObject* ToPyObject(float value); PyObject* ToPyObject(double value); PyObject* ToPyObject(const char* value); @@ -68,12 +73,15 @@ PyObject* ToPyObject(const paddle::experimental::Tensor& value, PyObject* ToPyObject(const std::vector& value); PyObject* ToPyObject(const std::vector& value); PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector& value); PyObject* ToPyObject(const std::vector& value); PyObject* ToPyObject(const std::vector& value); +PyObject* ToPyObject(const std::vector>& value); PyObject* ToPyObject(const std::vector& value, bool return_py_none_if_not_initialize = false); PyObject* ToPyObject(const platform::Place& value); PyObject* ToPyObject(const framework::LoDTensor* value); +PyObject* ToPyObject(const phi::SelectedRows* value); PyObject* ToPyObject(const paddle::framework::proto::VarType::Type& dtype); PyObject* ToPyObject(const paddle::framework::proto::VarType& type); PyObject* ToPyObject(const void* value); diff --git a/paddle/fluid/pybind/exception.h b/paddle/fluid/pybind/exception.h index cf82f464a11f2..b0e0ef8210389 100644 --- a/paddle/fluid/pybind/exception.h +++ b/paddle/fluid/pybind/exception.h @@ -26,11 +26,11 @@ limitations under the License. */ return nullptr; \ } -#define EAGER_CATCH_AND_THROW_RETURN_ZERO \ +#define EAGER_CATCH_AND_THROW_RETURN_NEG \ } \ catch (...) 
{ \ ThrowExceptionToPython(std::current_exception()); \ - return 0; \ + return -1; \ } namespace paddle { diff --git a/paddle/fluid/pybind/fleet_py.cc b/paddle/fluid/pybind/fleet_py.cc index befcf36b41c24..330719762ae08 100644 --- a/paddle/fluid/pybind/fleet_py.cc +++ b/paddle/fluid/pybind/fleet_py.cc @@ -86,11 +86,11 @@ void BindDistFleetWrapper(py::module* m) { void BindPSHost(py::module* m) { py::class_(*m, "PSHost") .def(py::init()) - .def("serialize_to_string", &distributed::PSHost::serialize_to_string) - .def("parse_from_string", &distributed::PSHost::parse_from_string) - .def("to_uint64", &distributed::PSHost::serialize_to_uint64) - .def("from_uint64", &distributed::PSHost::parse_from_uint64) - .def("to_string", &distributed::PSHost::to_string); + .def("serialize_to_string", &distributed::PSHost::SerializeToString) + .def("parse_from_string", &distributed::PSHost::ParseFromString) + .def("to_uint64", &distributed::PSHost::SerializeToUint64) + .def("from_uint64", &distributed::PSHost::ParseFromUint64) + .def("to_string", &distributed::PSHost::ToString); } void BindSparseShardingTools(py::module* m) { @@ -224,7 +224,7 @@ void BindGraphPyClient(py::module* m) { &GraphPyClient::use_neighbors_sample_cache) .def("remove_graph_node", &GraphPyClient::remove_graph_node) .def("random_sample_nodes", &GraphPyClient::random_sample_nodes) - .def("stop_server", &GraphPyClient::stop_server) + .def("stop_server", &GraphPyClient::StopServer) .def("get_node_feat", [](GraphPyClient& self, std::string node_type, std::vector node_ids, diff --git a/paddle/fluid/pybind/imperative.cc b/paddle/fluid/pybind/imperative.cc index 0286560ec9982..e09c205db14e7 100644 --- a/paddle/fluid/pybind/imperative.cc +++ b/paddle/fluid/pybind/imperative.cc @@ -57,6 +57,7 @@ limitations under the License. 
*/ #include "paddle/fluid/pybind/pybind_boost_headers.h" #include "paddle/fluid/pybind/slice_utils.h" #include "paddle/fluid/pybind/tensor_py.h" +#include "paddle/fluid/pybind/uva_utils.h" #include "paddle/phi/core/compat/arg_map_context.h" #include "paddle/phi/core/compat/type_defs.h" @@ -1629,39 +1630,9 @@ void BindImperative(py::module *m_ptr) { platform::errors::InvalidArgument( "Unified virtual addressing only support " "CPU Tensor currently.")); - platform::DeviceContextPool &pool = - platform::DeviceContextPool::Instance(); - auto *dev_ctx = pool.Get(platform::CUDAPlace(device_id)); - VLOG(4) << "Init the DeviceContext, and the place is " - << dev_ctx->GetPlace(); auto *self_tensor = self->MutableVar()->GetMutable(); - // Register the cpu memory as the cuda host memory - const auto &data_numel = self_tensor->numel(); - const size_t &need_allocate_size = - data_numel * - framework::SizeOfType( - framework::TransToProtoVarType(self_tensor->dtype())); - void *data_ptr = self_tensor->data(); - auto result = cudaHostRegister(data_ptr, need_allocate_size, - cudaHostRegisterDefault); - if (cudaSuccess != result) { - VLOG(4) << "UVA(unified virtual addressing) failed allocate:" - << need_allocate_size << ", the error code:" << result; - } - - // Get device pointer from the function of cudaHostGetDevicePointer - void *cuda_device_pointer = nullptr; - cudaHostGetDevicePointer( - reinterpret_cast(&cuda_device_pointer), - reinterpret_cast(data_ptr), 0); - - // Reset the memory with device pointer - std::shared_ptr holder = - std::make_shared( - cuda_device_pointer, need_allocate_size, - platform::CUDAPlace(device_id)); - self_tensor->ResetHolderWithType(holder, self_tensor->dtype()); + tensor_uva(self_tensor, device_id); }, py::arg("device_id") = 0, py::return_value_policy::reference, R"DOC( Returns self tensor with the UVA(unified virtual addressing). 
@@ -2211,6 +2182,7 @@ void BindImperative(py::module *m_ptr) { m.def("varbase_copy", &VarBaseCopy); m.def("varbase_copy", &VarBaseCopy); m.def("varbase_copy", &VarBaseCopy); + m.def("varbase_copy", &VarBaseCopy); m.def("varbase_copy", &VarBaseCopy); m.def( @@ -2370,6 +2342,11 @@ void BindImperative(py::module *m_ptr) { const py::args args, const py::kwargs kwargs) { return imperative::PyLayerApply(place, cls, args, kwargs); }); + m.def("pylayer_apply", + [](const platform::CustomPlace &place, const py::object &cls, + const py::args args, const py::kwargs kwargs) { + return imperative::PyLayerApply(place, cls, args, kwargs); + }); #if defined(PADDLE_WITH_CUDA) m.def("to_uva_tensor", diff --git a/paddle/fluid/pybind/op_function_generator.h b/paddle/fluid/pybind/op_function_generator.h index 2bfc16c7d5b0f..f1e9c7e8f491b 100644 --- a/paddle/fluid/pybind/op_function_generator.h +++ b/paddle/fluid/pybind/op_function_generator.h @@ -36,6 +36,8 @@ std::map> op_ins_map = { {"gru_unit", {"Input", "HiddenPrev", "Weight", "Bias"}}, {"label_smooth", {"X", "PriorDist"}}, {"assign", {"X"}}, + {"crop", {"X", "Y", "Offsets"}}, + {"crop_tensor", {"X", "Shape", "Offsets"}}, {"reshape2", {"X", "Shape"}}, {"expand", {"X", "ExpandTimes"}}, {"slice", @@ -50,11 +52,13 @@ std::map> op_ins_map = { {"fake_quantize_dequantize_moving_average_abs_max", {"X", "InScale", "InAccum", "InState"}}, {"nll_loss", {"X", "Label", "Weight"}}, + {"smooth_l1_loss", {"X", "Y", "InsideWeight", "OutsideWeight"}}, {"bilinear_tensor_product", {"X", "Y", "Weight", "Bias"}}, {"gather", {"X", "Index", "Axis"}}, {"repeat_interleave", {"X", "RepeatsTensor"}}, {"roi_pool", {"X", "ROIs", "RoisNum"}}, {"roi_align", {"X", "ROIs", "RoisNum"}}, + {"prroi_pool", {"X", "ROIs", "BatchRoINums"}}, {"psroi_pool", {"X", "ROIs", "RoisNum"}}, {"collect_fpn_proposals", {"MultiLevelRois", "MultiLevelScores", "MultiLevelRoIsNum"}}, @@ -102,6 +106,15 @@ std::map> op_ins_map = { {"linear_chain_crf", {"Emission", "Transition", "Label", "Length"}}, {"crf_decoding", {"Emission", "Transition", "Label", "Length"}}, {"chunk_eval", {"Inference", "Label", "SeqLength"}}, + {"sequence_mask", {"X", "MaxLenTensor"}}, + {"graph_reindex", + {"X", "Neighbors", "Count", "HashTable_Value", "HashTable_Index"}}, + {"graph_sample_neighbors", {"Row", "Col_Ptr", "X", "Eids", "Perm_Buffer"}}, + {"crop", {"X", "Y", "Offsets"}}, + {"batch_norm", + {"X", "Scale", "Bias", "Mean", "Variance", "MomentumTensor"}}, + {"inplace_abn", + {"X", "Scale", "Bias", "Mean", "Variance", "MomentumTensor"}}, }; // NOTE(zhiqiu): Like op_ins_map. 
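// A minimal sketch of what one entry added above encodes, assuming op_ins_map
// is a std::map<std::string, std::vector<std::string>> keyed by op type and
// listing the op's named input slots in order:
//   op_ins_map["prroi_pool"] = {"X", "ROIs", "BatchRoINums"};
// op_outs_map and op_passing_outs_map below follow the same convention for
// output slots.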
@@ -118,6 +131,9 @@ std::map> op_outs_map = { {"batch_norm", {"Y", "MeanOut", "VarianceOut", "SavedMean", "SavedVariance", "ReserveSpace"}}, + {"inplace_abn", + {"Y", "MeanOut", "VarianceOut", "SavedMean", "SavedVariance", + "ReserveSpace"}}, {"fused_attention", {"LnMean", "LnVariance", "LnOut", "QKVOut", "QKVBiasOut", "TransposeOut2", @@ -203,6 +219,7 @@ std::map> op_passing_outs_map = { {"merged_momentum", {"ParamOut", "VelocityOut", "MasterParamOut"}}, {"sparse_momentum", {"ParamOut", "VelocityOut", "MasterParamOut"}}, {"batch_norm", {"MeanOut", "VarianceOut"}}, + {"inplace_abn", {"MeanOut", "VarianceOut"}}, {"sync_batch_norm", {"MeanOut", "VarianceOut"}}, {"accuracy", {"Correct", "Total"}}, {"fill_constant", {"Out"}}, @@ -232,6 +249,10 @@ std::map> op_passing_outs_map = { {"run_program", {"Out", "DOut", "OutScope"}}, {"clear_float_status", {"FloatStatusOut"}}, {"get_float_status", {"FloatStatusOut"}}, + {"assign", {"Out"}}, + {"assign_value", {"Out"}}, + {"split", {"Out"}}, + {"concat", {"Out"}}, }; // NOTE(pangyoki): Tensor View Strategy. @@ -246,3 +267,12 @@ std::map> view_op_map = { {"reshape2", {"X", "Out"}}, {"flatten_contiguous_range", {"X", "Out"}}, }; + +// NOTE(pangyoki): Special inplace ops that are not supported in temporary. +// The input and output of some inplace ops are special, such as +// duplicate input. These inplace ops have no usage scenarios and +// are not supported in temporary. +std::set special_inplace_op_set = { + "sum", // `sum` op has duplicate input + "assign", // output of `assign` op is in `op_passing_outs_map` +}; diff --git a/paddle/fluid/pybind/ps_gpu_wrapper_py.cc b/paddle/fluid/pybind/ps_gpu_wrapper_py.cc index fe1f27226bad4..79529fca7d1be 100644 --- a/paddle/fluid/pybind/ps_gpu_wrapper_py.cc +++ b/paddle/fluid/pybind/ps_gpu_wrapper_py.cc @@ -63,6 +63,27 @@ void BindPSGPUWrapper(py::module* m) { .def("finalize", &framework::PSGPUWrapper::Finalize, py::call_guard()); } // end PSGPUWrapper +#ifdef PADDLE_WITH_PSLIB +void BindAfsWrapper(py::module* m) { + py::class_>( + *m, "AfsWrapper") + .def(py::init([]() { return std::make_shared(); })) + .def("init", &framework::AfsWrapper::init, + py::call_guard()) + .def("list", &framework::AfsWrapper::list, + py::call_guard()) + .def("mkdir", &framework::AfsWrapper::mkdir, + py::call_guard()) + .def("exist", &framework::AfsWrapper::exist, + py::call_guard()) + .def("download", &framework::AfsWrapper::download, + py::call_guard()) + .def("upload", &framework::AfsWrapper::upload, + py::call_guard()) + .def("remove", &framework::AfsWrapper::remove, + py::call_guard()); +} +#endif #endif } // end namespace pybind } // end namespace paddle diff --git a/paddle/fluid/pybind/ps_gpu_wrapper_py.h b/paddle/fluid/pybind/ps_gpu_wrapper_py.h index ba4f146389ed3..22cd5ef0fd149 100644 --- a/paddle/fluid/pybind/ps_gpu_wrapper_py.h +++ b/paddle/fluid/pybind/ps_gpu_wrapper_py.h @@ -24,6 +24,9 @@ namespace pybind { #ifdef PADDLE_WITH_HETERPS void BindPSGPUWrapper(py::module* m); +#ifdef PADDLE_WITH_PSLIB +void BindAfsWrapper(py::module* m); +#endif #endif } // namespace pybind } // namespace paddle diff --git a/paddle/fluid/pybind/pybind.cc b/paddle/fluid/pybind/pybind.cc index 96d569d47c45a..396c6c5e42d37 100644 --- a/paddle/fluid/pybind/pybind.cc +++ b/paddle/fluid/pybind/pybind.cc @@ -168,6 +168,8 @@ limitations under the License. 
*/ #include "paddle/fluid/eager/api/utils/global_utils.h" #include "paddle/fluid/pybind/eager_utils.h" #include "paddle/phi/api/ext/op_meta_info.h" +#include "paddle/phi/kernels/autotune/cache.h" +#include "paddle/phi/kernels/autotune/switch_autotune.h" #include "pybind11/stl.h" DECLARE_bool(use_mkldnn); @@ -843,6 +845,10 @@ PYBIND11_MODULE(core_noavx, m) { [](framework::Tensor &self, const std::string &layout) { self.set_layout(StringToDataLayout(layout)); }) + .def("_alloc_float", + [](framework::Tensor &self, paddle::platform::CustomPlace &place) { + self.mutable_data(place); + }) .def("_alloc_float", [](framework::Tensor &self, paddle::platform::CUDAPlace &place) { self.mutable_data(place); @@ -871,6 +877,10 @@ PYBIND11_MODULE(core_noavx, m) { [](framework::Tensor &self, paddle::platform::CPUPlace &place) { self.mutable_data(place); }) + .def("_alloc_int", + [](framework::Tensor &self, paddle::platform::CustomPlace &place) { + self.mutable_data(place); + }) .def("_alloc_int", [](framework::Tensor &self, paddle::platform::XPUPlace &place) { self.mutable_data(place); @@ -899,6 +909,12 @@ PYBIND11_MODULE(core_noavx, m) { return reinterpret_cast( self.mutable_data(place, framework::TransToPhiDataType(type))); }) + .def("_mutable_data", + [](framework::Tensor &self, paddle::platform::CustomPlace &place, + paddle::framework::proto::VarType::Type type) { + return reinterpret_cast( + self.mutable_data(place, framework::TransToPhiDataType(type))); + }) .def("_mutable_data", [](framework::Tensor &self, paddle::platform::XPUPlace &place, paddle::framework::proto::VarType::Type type) { @@ -932,6 +948,8 @@ PYBIND11_MODULE(core_noavx, m) { }) .def("_copy_from", &TensorCopyFrom, py::arg("tensor"), py::arg("place"), py::arg("batch_size") = -1) + .def("_copy_from", &TensorCopyFrom, + py::arg("tensor"), py::arg("place"), py::arg("batch_size") = -1) .def("_copy_from", &TensorCopyFrom, py::arg("tensor"), py::arg("place"), py::arg("batch_size") = -1) .def("_copy_from", &TensorCopyFrom, @@ -946,6 +964,8 @@ PYBIND11_MODULE(core_noavx, m) { py::arg("tensor"), py::arg("place"), py::arg("batch_size") = -1) .def("set", SetTensorFromPyArray, py::arg("array"), py::arg("place"), py::arg("zero_copy") = false) + .def("set", SetTensorFromPyArray, + py::arg("array"), py::arg("place"), py::arg("zero_copy") = false) .def("set", SetTensorFromPyArray, py::arg("array"), py::arg("place"), py::arg("zero_copy") = false) .def("set", SetTensorFromPyArray, @@ -1983,6 +2003,19 @@ All parameter, weight, gradient are variables in Paddle. "Please recompile or reinstall Paddle with NPU support.")); #else return new paddle::platform::NPUDeviceContext(place); +#endif + }) + .def_static("create", + [](paddle::platform::CustomPlace& place) + -> paddle::platform::DeviceContext* { +#ifndef PADDLE_WITH_CUSTOM_DEVICE + PADDLE_THROW( + platform::errors::PermissionDenied( + "Cannot use CustomPlace in CPU/GPU/XPU version, " + "Please recompile or reinstall Paddle with " + "CustomDevice support.")); +#else + return new paddle::platform::CustomDeviceContext(place); #endif }) .def_static("create", @@ -2007,6 +2040,10 @@ All parameter, weight, gradient are variables in Paddle. 
paddle::memory::allocation::AllocatorFacade::Instance() .GetZeroAllocator(place) .get()); + context->SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); context->PartialInitWithAllocator(); return context; #endif @@ -2716,6 +2753,12 @@ All parameter, weight, gradient are variables in Paddle. pybind11::gil_scoped_release release; self.Run(scope, place); }) + .def("run", + [](OperatorBase &self, const Scope &scope, + const platform::CustomPlace &place) { + pybind11::gil_scoped_release release; + self.Run(scope, place); + }) .def("type", [](const OperatorBase &op) -> std::string { return op.Type(); }) .def("outputs", @@ -2861,7 +2904,7 @@ All parameter, weight, gradient are variables in Paddle. [](StandaloneExecutor &self, std::vector feed_names, std::vector fetch_names) { platform::RecordEvent record_event( - "StandaloneExecutor:run", + "StandaloneExecutor::run", platform::TracerEventType::UserDefined, 1); paddle::framework::FetchList ret; { @@ -4415,6 +4458,34 @@ All parameter, weight, gradient are variables in Paddle. .def("is_pattern_enabled", &platform::ipu::IpuStrategy::IsPatternEnabled); #endif + m.def("enable_autotune", [] { + return phi::autotune::AutoTuneStatus::Instance().EnableAutoTune(); + }); + + m.def("disable_autotune", [] { + return phi::autotune::AutoTuneStatus::Instance().DisableAutoTune(); + }); + + m.def("autotune_range", [](int64_t start, int64_t stop) { + return phi::autotune::AutoTuneStatus::Instance().SetAutoTuneRange(start, + stop); + }); + + m.def("update_autotune_status", + [] { return phi::autotune::AutoTuneStatus::Instance().Update(); }); + + m.def("autotune_status", [] { + phi::autotune::AutoTuneCache::Instance().UpdateStatus(); + py::dict res; + res["use_autotune"] = + phi::autotune::AutoTuneStatus::Instance().UseAutoTune(); + res["step_id"] = phi::autotune::AutoTuneStatus::Instance().StepID(); + res["cache_size"] = phi::autotune::AutoTuneCache::Instance().Size(); + res["cache_hit_rate"] = + phi::autotune::AutoTuneCache::Instance().CacheHitRate(); + return res; + }); + BindFleetWrapper(&m); BindIO(&m); @@ -4424,6 +4495,9 @@ All parameter, weight, gradient are variables in Paddle. 
#endif #ifdef PADDLE_WITH_HETERPS BindPSGPUWrapper(&m); +#ifdef PADDLE_WITH_PSLIB + BindAfsWrapper(&m); +#endif #endif BindGlooWrapper(&m); BindBoxHelper(&m); diff --git a/paddle/fluid/pybind/tensor_py.h b/paddle/fluid/pybind/tensor_py.h index bf459bd468421..3f7ce8b63f968 100644 --- a/paddle/fluid/pybind/tensor_py.h +++ b/paddle/fluid/pybind/tensor_py.h @@ -529,11 +529,10 @@ void SetTensorFromPyArray(framework::Tensor *self, const py::object &obj, } template -void SetUVATensorFromPyArray( - const std::shared_ptr &self, - const py::array_t &array, int device_id) { +void SetUVATensorFromPyArrayImpl(framework::LoDTensor *self_tensor, + const py::array_t &array, int device_id) { #if defined(PADDLE_WITH_CUDA) - auto *self_tensor = self->MutableVar()->GetMutable(); + VLOG(4) << "Running in SetUVATensorFromPyArrayImpl."; std::vector dims; dims.reserve(array.ndim()); int64_t numel = 1; @@ -562,6 +561,38 @@ void SetUVATensorFromPyArray( #endif } +template +void SetUVATensorFromPyArray( + const std::shared_ptr &self, + const py::array_t &array, int device_id) { +#if defined(PADDLE_WITH_CUDA) + VLOG(4) << "Running in SetUVATensorFromPyArray for VarBase."; + auto *self_tensor = self->MutableVar()->GetMutable(); + SetUVATensorFromPyArrayImpl(self_tensor, array, device_id); +#endif +} + +template +void SetUVATensorFromPyArray( + const std::shared_ptr &self, + const py::array_t &array, int device_id) { +#if defined(PADDLE_WITH_CUDA) + VLOG(4) << "Running in SetUVATensorFromPyArray for Phi::Tensor."; + phi::DenseTensorMeta meta = + phi::DenseTensorMeta(phi::DataType::FLOAT32, phi::make_ddim({1, 1})); + std::shared_ptr tmp_t = std::make_shared( + std::make_unique( + paddle::platform::CPUPlace()) + .get(), + meta); + self.get()->set_impl(tmp_t); + auto *self_tensor = + static_cast(self.get()->impl().get()); + + SetUVATensorFromPyArrayImpl(self_tensor, array, device_id); +#endif +} + template void _sliceCompute(const framework::Tensor *in, framework::Tensor *out, const platform::CPUDeviceContext &ctx, diff --git a/paddle/fluid/pybind/uva_utils.h b/paddle/fluid/pybind/uva_utils.h new file mode 100644 index 0000000000000..94f55769b7356 --- /dev/null +++ b/paddle/fluid/pybind/uva_utils.h @@ -0,0 +1,60 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#pragma once + +#include +#include "paddle/fluid/operators/utils.h" +#include "paddle/phi/common/data_type.h" +#include "paddle/phi/core/compat/convert_utils.h" +#include "paddle/phi/core/dense_tensor.h" + +namespace paddle { +namespace pybind { + +static void tensor_uva(paddle::framework::LoDTensor *self_tensor, + int device_id) { + VLOG(4) << "Running in _uva interface."; +#if defined(PADDLE_WITH_CUDA) + platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance(); + auto *dev_ctx = pool.Get(platform::CUDAPlace(device_id)); + VLOG(4) << "Init the DeviceContext, and the place is " << dev_ctx->GetPlace(); + // Register the cpu memory as the cuda host memory + const auto &data_numel = self_tensor->numel(); + const size_t &need_allocate_size = + data_numel * framework::SizeOfType( + framework::TransToProtoVarType(self_tensor->dtype())); + void *data_ptr = self_tensor->data(); + auto result = + cudaHostRegister(data_ptr, need_allocate_size, cudaHostRegisterDefault); + if (cudaSuccess != result) { + VLOG(4) << "UVA(unified virtual addressing) failed allocate:" + << need_allocate_size << ", the error code:" << result; + } + // Get device pointer from the function of cudaHostGetDevicePointer + void *cuda_device_pointer = nullptr; + cudaHostGetDevicePointer(reinterpret_cast(&cuda_device_pointer), + reinterpret_cast(data_ptr), 0); + + // Reset the memory with device pointer + std::shared_ptr holder = + std::make_shared( + cuda_device_pointer, need_allocate_size, + platform::CUDAPlace(device_id)); + self_tensor->ResetHolderWithType(holder, self_tensor->dtype()); +#endif +} + +} // namespace pybind +} // namespace paddle diff --git a/paddle/infrt/CMakeLists.txt b/paddle/infrt/CMakeLists.txt index e777a8e3ab4e6..e5f224bf6ad99 100644 --- a/paddle/infrt/CMakeLists.txt +++ b/paddle/infrt/CMakeLists.txt @@ -9,6 +9,10 @@ option(INFRT_WITH_TRT "Compile INFRT with TensorRT" OFF) #TODO(xiaowei) remove fluid include_directories(${PADDLE_SOURCE_DIR}/paddle/fluid/platform) +if(WITH_GPU) + set(INFRT_WITH_GPU ON) +endif() + if (INFRT_WITH_PHI) add_definitions("-DINFRT_WITH_PHI") @@ -115,9 +119,6 @@ if (INFRT_WITH_PHI) endif() cc_library(infrt SHARED SRCS ${infrt_src} DEPS glog boost ${mlir_libs} ${phi_libs} paddle_framework_proto infrt_naive) -if (INFRT_WITH_TRT) - target_link_libraries(infrt infrt_trt) -endif() cc_library(infrt_static SRCS ${infrt_src} DEPS glog boost ${mlir_libs} ${phi_libs} paddle_framework_proto) add_dependencies(infrt ${infrt_mlir_incs} mlir-headers) diff --git a/paddle/infrt/backends/tensorrt/CMakeLists.txt b/paddle/infrt/backends/tensorrt/CMakeLists.txt index cc20c9a2e14b6..672515ea4b7f8 100644 --- a/paddle/infrt/backends/tensorrt/CMakeLists.txt +++ b/paddle/infrt/backends/tensorrt/CMakeLists.txt @@ -1,3 +1,7 @@ -cc_library(infrt_trt SRCS trt_engine.cc DEPS glog phi_dynload_cuda phi) +add_subdirectory(plugin) -cc_test_tiny(test_infrt_trt SRCS test_trt_engine.cc DEPS infrt_trt phi_dynload_cuda tensorrt_converter) +core_gather_headers() + +gather_srcs(infrt_src SRCS trt_engine.cc) + +cc_test_tiny(test_infrt_trt SRCS test_trt_engine.cc DEPS infrt phi_dynload_cuda tensorrt_converter) diff --git a/paddle/infrt/backends/tensorrt/plugin/CMakeLists.txt b/paddle/infrt/backends/tensorrt/plugin/CMakeLists.txt new file mode 100644 index 0000000000000..8848148f2c612 --- /dev/null +++ b/paddle/infrt/backends/tensorrt/plugin/CMakeLists.txt @@ -0,0 +1 @@ +gather_srcs(infrt_src SRCS pool_op_plugin.cu) diff --git a/paddle/infrt/backends/tensorrt/plugin/plugin_utils.h 
b/paddle/infrt/backends/tensorrt/plugin/plugin_utils.h new file mode 100644 index 0000000000000..49e96e6eab0e4 --- /dev/null +++ b/paddle/infrt/backends/tensorrt/plugin/plugin_utils.h @@ -0,0 +1,153 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include + +#include +#include +#include +#include +#include + +#include "paddle/phi/backends/dynload/tensorrt.h" + +namespace infrt { +namespace backends { +namespace tensorrt { +namespace plugin { + +template +inline void SerializeValue(void** buffer, T const& value); + +template +inline void DeserializeValue(void const** buffer, + size_t* buffer_size, + T* value); + +namespace details { + +template +struct Serializer {}; + +template +struct Serializer::value || + std::is_enum::value || + std::is_pod::value>::type> { + static size_t SerializedSize(T const& value) { return sizeof(T); } + + static void Serialize(void** buffer, T const& value) { + std::memcpy(*buffer, &value, sizeof(T)); + reinterpret_cast(*buffer) += sizeof(T); + } + + static void Deserialize(void const** buffer, size_t* buffer_size, T* value) { + assert(*buffer_size >= sizeof(T)); + std::memcpy(value, *buffer, sizeof(T)); + reinterpret_cast(*buffer) += sizeof(T); + *buffer_size -= sizeof(T); + } +}; + +template <> +struct Serializer { + static size_t SerializedSize(const char* value) { return strlen(value) + 1; } + + static void Serialize(void** buffer, const char* value) { + std::strcpy(static_cast(*buffer), value); // NOLINT + reinterpret_cast(*buffer) += strlen(value) + 1; + } + + static void Deserialize(void const** buffer, + size_t* buffer_size, + const char** value) { + *value = static_cast(*buffer); + size_t data_size = strnlen(*value, *buffer_size) + 1; + assert(*buffer_size >= data_size); + reinterpret_cast(*buffer) += data_size; + *buffer_size -= data_size; + } +}; + +template +struct Serializer, + typename std::enable_if::value || + std::is_enum::value || + std::is_pod::value>::type> { + static size_t SerializedSize(std::vector const& value) { + return sizeof(value.size()) + value.size() * sizeof(T); + } + + static void Serialize(void** buffer, std::vector const& value) { + SerializeValue(buffer, value.size()); + size_t nbyte = value.size() * sizeof(T); + std::memcpy(*buffer, value.data(), nbyte); + reinterpret_cast(*buffer) += nbyte; + } + + static void Deserialize(void const** buffer, + size_t* buffer_size, + std::vector* value) { + size_t size; + DeserializeValue(buffer, buffer_size, &size); + value->resize(size); + size_t nbyte = value->size() * sizeof(T); + CHECK_GE(*buffer_size, nbyte); + std::memcpy(value->data(), *buffer, nbyte); + reinterpret_cast(*buffer) += nbyte; + *buffer_size -= nbyte; + } +}; + +} // namespace details + +template +inline size_t SerializedSize(T const& value) { + return details::Serializer::SerializedSize(value); +} + +template +inline void SerializeValue(void** buffer, T const& value) { + return details::Serializer::Serialize(buffer, value); +} 
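// A minimal usage sketch for SerializeValue/DeserializeValue, mirroring how
// PoolPlugin::serialize and its deserializing constructor use them later in
// this patch (the buffer is assumed to be pre-sized via SerializedSize):
//   void* p = buffer;
//   SerializeValue(&p, ceil_mode_);            // write and advance the cursor
//   void const* q = serialData; size_t left = serialLength;
//   DeserializeValue(&q, &left, &ceil_mode_);  // read back in the same order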
+ +template +inline void DeserializeValue(void const** buffer, + size_t* buffer_size, + T* value) { + return details::Serializer::Deserialize(buffer, buffer_size, value); +} + +template +class TrtPluginRegistrar { + public: + TrtPluginRegistrar() { + static auto func_ptr = static_cast( + ::phi::dynload::getPluginRegistry()); + func_ptr->registerCreator(instance, ""); + } + + private: + //! Plugin instance. + T instance{}; +}; + +#define REGISTER_TRT_PLUGIN(name) \ + static TrtPluginRegistrar pluginRegistrar##name {} + +} // namespace plugin +} // namespace tensorrt +} // namespace backends +} // namespace infrt diff --git a/paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.cu b/paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.cu new file mode 100644 index 0000000000000..5a53777c8e30f --- /dev/null +++ b/paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.cu @@ -0,0 +1,288 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "glog/logging.h" +#include "paddle/infrt/backends/tensorrt/plugin/plugin_utils.h" +#include "paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.h" +#include "paddle/phi/kernels/funcs/pooling.h" + +namespace infrt { +namespace backends { +namespace tensorrt { +namespace plugin { + +PoolPlugin::PoolPlugin(bool ceil_mode, + PoolType pool_type, + bool adaptive, + bool exclusive, + std::vector ksize, + std::vector strides, + std::vector paddings, + std::vector input_shape, + std::vector real_paddings) + : ceil_mode_(ceil_mode), + pool_type_(pool_type), + adaptive_(adaptive), + exclusive_(exclusive), + ksize_(ksize), + strides_(strides), + paddings_(paddings), + real_paddings_(real_paddings), + input_shape_(input_shape) { + output_shape_ = input_shape_; + std::vector output_shape = + CalcOutputSize({input_shape_[1], input_shape_[2]}, + ceil_mode_, + adaptive_, + ksize_, + strides_, + real_paddings_); + output_shape_[1] = output_shape[0]; + output_shape_[2] = output_shape[1]; +} + +PoolPlugin::PoolPlugin(void const* serialData, size_t serialLength) { + // deserializeBase(serialData, serialLength); + DeserializeValue(&serialData, &serialLength, &ceil_mode_); + DeserializeValue(&serialData, &serialLength, &pool_type_); + DeserializeValue(&serialData, &serialLength, &adaptive_); + DeserializeValue(&serialData, &serialLength, &exclusive_); + DeserializeValue(&serialData, &serialLength, &ksize_); + DeserializeValue(&serialData, &serialLength, &strides_); + DeserializeValue(&serialData, &serialLength, &paddings_); + DeserializeValue(&serialData, &serialLength, &real_paddings_); + DeserializeValue(&serialData, &serialLength, &input_shape_); + DeserializeValue(&serialData, &serialLength, &output_shape_); +} + +const char* PoolPlugin::getPluginType() const noexcept { return "pool_plugin"; } + +const char* PoolPlugin::getPluginVersion() const noexcept { return "1"; } + +int PoolPlugin::getNbOutputs() const noexcept { return 1; } + +nvinfer1::Dims PoolPlugin::getOutputDimensions(int outputIndex, 
+ const nvinfer1::Dims* inputs, + int nbInputs) noexcept { + assert(nbInputs == 1); + assert(index == 0); + assert(inputs[0].nbDims == 3); + nvinfer1::Dims const& input_dims = inputs[0]; + + nvinfer1::Dims output_dims = input_dims; + + output_dims.d[1] = output_shape_[1]; + output_dims.d[2] = output_shape_[2]; + return output_dims; +} + +int32_t PoolPlugin::initialize() noexcept { return 0; } + +void PoolPlugin::terminate() noexcept {} + +size_t PoolPlugin::getWorkspaceSize(int32_t maxBatchSize) const noexcept { + return 0; +} + +#if IS_TRT_VERSION_LT(8000) +int PoolPlugin::enqueue(int batch_size, + const void* const* inputs, + void** outputs, +#else +int PoolPlugin::enqueue(int batch_size, + const void* const* inputs, + void* const* outputs, +#endif + void* workspace, + cudaStream_t stream) noexcept { + // TODO(wilber) + int input_size = 0; + float const* idata = reinterpret_cast(inputs[0]); + float* const* odatas = reinterpret_cast(outputs); + + std::vector input_shape = input_shape_; + std::vector output_shape = output_shape_; + input_shape.insert(input_shape.begin(), batch_size); + output_shape.insert(output_shape.begin(), batch_size); + + if (pool_type_ == PoolType::max) { + ::phi::funcs::MaxPool pool_process; + ::phi::funcs::Pool2dDirectCUDAFunctor, float> + pool2d_forward; + pool2d_forward(idata, + input_shape, + output_shape, + ksize_, + strides_, + paddings_, + true, + false, + odatas[0], + stream, + pool_process); + } else if (pool_type_ == PoolType::avg) { + ::phi::funcs::AvgPool pool_process; + ::phi::funcs::Pool2dDirectCUDAFunctor, float> + pool2d_forward; + pool2d_forward(idata, + input_shape, + output_shape, + ksize_, + strides_, + paddings_, + exclusive_, + adaptive_, + odatas[0], + stream, + pool_process); + } + + return cudaGetLastError() != cudaSuccess; +} + +// TODO(wilber): serialize base info? +size_t PoolPlugin::getSerializationSize() const noexcept { + return SerializedSize(ceil_mode_) + SerializedSize(pool_type_) + + SerializedSize(adaptive_) + SerializedSize(exclusive_) + + SerializedSize(ksize_) + SerializedSize(strides_) + + SerializedSize(paddings_) + SerializedSize(real_paddings_) + + SerializedSize(input_shape_) + SerializedSize(output_shape_); +} +// TODO(wilber): serialize base info? 
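// Note: serialize() below must write the fields in exactly the order that the
// deserializing constructor PoolPlugin(void const*, size_t) reads them back,
// and getSerializationSize() above sums the sizes of the same fields.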
+void PoolPlugin::serialize(void* buffer) const noexcept { + // serializeBase(buffer); + SerializeValue(&buffer, ceil_mode_); + SerializeValue(&buffer, pool_type_); + SerializeValue(&buffer, adaptive_); + SerializeValue(&buffer, exclusive_); + SerializeValue(&buffer, ksize_); + SerializeValue(&buffer, strides_); + SerializeValue(&buffer, paddings_); + SerializeValue(&buffer, real_paddings_); + SerializeValue(&buffer, input_shape_); + SerializeValue(&buffer, output_shape_); +} + +void PoolPlugin::destroy() noexcept { delete this; } + +void PoolPlugin::setPluginNamespace(char const* plugin_namespace) noexcept { + namespace_ = plugin_namespace; +} + +char const* PoolPlugin::getPluginNamespace() const noexcept { + return namespace_.c_str(); +} + +nvinfer1::DataType PoolPlugin::getOutputDataType( + int32_t index, + nvinfer1::DataType const* input_types, + int32_t nbInputs) const noexcept { + CHECK_EQ(index, 0); + CHECK_EQ((input_types[0] == nvinfer1::DataType::kFLOAT), true); + return input_types[0]; +} + +bool PoolPlugin::isOutputBroadcastAcrossBatch(int32_t outputIndex, + bool const* inputIsBroadcasted, + int32_t nbInputs) const noexcept { + return false; +} + +bool PoolPlugin::canBroadcastInputAcrossBatch(int32_t inputIndex) const + noexcept { + return false; +} + +nvinfer1::IPluginV2Ext* PoolPlugin::clone() const noexcept { + auto* plugin = new PoolPlugin(ceil_mode_, + pool_type_, + adaptive_, + exclusive_, + ksize_, + strides_, + paddings_, + input_shape_, + real_paddings_); + plugin->setPluginNamespace(namespace_.c_str()); + return plugin; +} + +void PoolPlugin::configurePlugin(nvinfer1::PluginTensorDesc const* in, + int32_t nb_input, + nvinfer1::PluginTensorDesc const* out, + int32_t nb_output) noexcept { + CHECK_EQ(nb_input, 1); + CHECK_EQ(nb_output, 1); + + input_dims_ = in[0].dims; + data_format_ = in[0].format; + data_type_ = in[0].type; +} + +bool PoolPlugin::supportsFormatCombination( + int32_t pos, + nvinfer1::PluginTensorDesc const* in_out, + int32_t nb_inputs, + int32_t nb_outputs) const noexcept { + CHECK_LT(pos, nb_inputs + nb_outputs); + CHECK_NOTNULL(in_out); + + return ((in_out[pos].type == nvinfer1::DataType::kFLOAT) && + in_out[pos].format == nvinfer1::PluginFormat::kLINEAR); +} + +nvinfer1::IPluginV2* PoolPluginCreator::createPlugin( + const char* name, const nvinfer1::PluginFieldCollection* fc) noexcept { + // auto* plugin = new UffPoolPluginV2(*fc); + field_collection_ = *fc; + plugin_name_ = name; + const nvinfer1::PluginField* fields = fc->fields; + + bool ceil_mode; + PoolPlugin::PoolType pool_type; + bool adaptive; + bool exclusive; + std::vector ksize; + std::vector strides; + std::vector paddings; + std::vector real_paddings; + std::vector input_shape; + std::vector output_shape; + + // TODO(wilber): add implement. 
+ CHECK(false) << "not implement"; + // for (int i = 0; i < fc->nbFields; ++i) { + // const char* attr_name = fields[i].name; + // if (!strcmp(attr_name, "ceil_mode")) { + // CHECK_EQ(fields[i].type == nvinfer1::PluginFieldType::kINT8, true); + // ceil_mode = *static_cast(fields[i].data); + // // mParam.numOutputBoxesPerClass = + // // *(static_cast(fields[i].data)); + // } + // } + + return nullptr; +} + +nvinfer1::IPluginV2* PoolPluginCreator::deserializePlugin( + const char* name, const void* serialData, size_t serialLength) noexcept { + auto* plugin = new PoolPlugin(serialData, serialLength); + plugin_name_ = name; + return plugin; +} + +} // namespace plugin +} // namespace tensorrt +} // namespace backends +} // namespace infrt diff --git a/paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.h b/paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.h new file mode 100644 index 0000000000000..0da1d15845330 --- /dev/null +++ b/paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.h @@ -0,0 +1,196 @@ +// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once +#include +#include +#include + +#include +#include +#include + +#include "paddle/infrt/backends/tensorrt/plugin/plugin_utils.h" +#include "paddle/infrt/backends/tensorrt/trt_utils.h" + +namespace infrt { +namespace backends { +namespace tensorrt { +namespace plugin { + +static std::vector CalcOutputSize(const std::vector& input_shape, + const bool& ceil_mode, + const bool& adaptive, + const std::vector& ksize, + const std::vector& strides, + const std::vector& real_paddings) { + std::vector output_shape = input_shape; + if (adaptive) { + output_shape[0] = ksize[0]; + output_shape[1] = ksize[1]; + } else { + int output_h = 0, output_w = 0; + if (ceil_mode) { + output_h = (input_shape[0] - ksize[0] + real_paddings[0] + + real_paddings[1] + strides[0] - 1) / + strides[0] + + 1; + output_w = (input_shape[1] - ksize[1] + real_paddings[2] + + real_paddings[3] + strides[1] - 1) / + strides[1] + + 1; + } + // TRT will use native layer when ceil_model=false + /* + else{ + output_h = (input_shape[0] - ksize[0] + real_paddings[0] + + real_paddings[1]) / strides[0] + 1; + output_w = (input_shape[1] - ksize[1] + real_paddings[2] + + real_paddings[3]) / strides[1] + 1; + } + */ + output_shape[0] = output_h; + output_shape[1] = output_w; + } + return output_shape; +} + +class PoolPlugin : public nvinfer1::IPluginV2IOExt { + public: + enum class PoolType { + max = 0, + avg, + }; + + PoolPlugin() {} + PoolPlugin(bool ceil_mode, + PoolType pool_type, + bool adaptive, + bool exclusive, + std::vector ksize, + std::vector strides, + std::vector paddings, + std::vector input_shape, + std::vector real_paddings); + + PoolPlugin(void const* serialData, size_t serialLength); + + // IPluginV2 methods + const char* getPluginType() const noexcept override; + const char* getPluginVersion() const noexcept override; + int getNbOutputs() const noexcept override; + nvinfer1::Dims 
getOutputDimensions(int outputIndex, + const nvinfer1::Dims* inputs, + int nbInputs) noexcept override; + int32_t initialize() noexcept override; + void terminate() noexcept override; + size_t getWorkspaceSize(int32_t maxBatchSize) const noexcept override; +#if IS_TRT_VERSION_LT(8000) + int enqueue(int batchSize, + const void* const* inputs, + void** outputs, +#else + int enqueue(int batchSize, + const void* const* inputs, + void* const* outputs, +#endif + void* workspace, + cudaStream_t stream) noexcept override; + size_t getSerializationSize() const noexcept override; + void serialize(void* buffer) const noexcept override; + void destroy() noexcept override; + void setPluginNamespace(char const* pluginNamespace) noexcept override; + char const* getPluginNamespace() const noexcept override; + + // IPluginV2Ext methods + nvinfer1::DataType getOutputDataType(int32_t index, + nvinfer1::DataType const* inputTypes, + int32_t nbInputs) const + noexcept override; + bool isOutputBroadcastAcrossBatch(int32_t outputIndex, + bool const* inputIsBroadcasted, + int32_t nbInputs) const noexcept override; + bool canBroadcastInputAcrossBatch(int32_t inputIndex) const noexcept override; + // void attachToContext(cudnnContext*, + // cublasContext*, + // IGpuAllocator*) noexcept override; + // void detachFromContext() noexcept override; + IPluginV2Ext* clone() const noexcept override; + + // IPluginV2IOExt methods + void configurePlugin(nvinfer1::PluginTensorDesc const* in, + int32_t nb_input, + nvinfer1::PluginTensorDesc const* out, + int32_t nb_output) noexcept override; + bool supportsFormatCombination(int32_t pos, + nvinfer1::PluginTensorDesc const* inOut, + int32_t nb_inputs, + int32_t nb_outputs) const noexcept override; + + private: + bool ceil_mode_; + PoolType pool_type_; + bool adaptive_; + bool exclusive_; + std::vector ksize_; + std::vector strides_; + std::vector paddings_; + std::vector real_paddings_; + std::vector input_shape_; + std::vector output_shape_; + + private: + nvinfer1::Dims input_dims_; + nvinfer1::DataType data_type_; + nvinfer1::PluginFormat data_format_; + std::string namespace_; +}; + +class PoolPluginCreator : public nvinfer1::IPluginCreator { + public: + const char* getPluginName() const noexcept override { return "pool_plugin"; } + + const char* getPluginVersion() const noexcept override { return "1"; } + + const nvinfer1::PluginFieldCollection* getFieldNames() noexcept override { + return &field_collection_; + } + + nvinfer1::IPluginV2* createPlugin( + const char* name, + const nvinfer1::PluginFieldCollection* fc) noexcept override; + + nvinfer1::IPluginV2* deserializePlugin(const char* name, + const void* serialData, + size_t serialLength) noexcept override; + + void setPluginNamespace(const char* plugin_namespace) noexcept override { + plugin_namespace_ = plugin_namespace; + } + + const char* getPluginNamespace() const noexcept override { + return plugin_namespace_.c_str(); + } + + private: + std::string plugin_namespace_; + std::string plugin_name_; + nvinfer1::PluginFieldCollection field_collection_{0, nullptr}; +}; +REGISTER_TRT_PLUGIN(PoolPluginCreator); + +} // namespace plugin +} // namespace tensorrt +} // namespace backends +} // namespace infrt diff --git a/paddle/infrt/dialect/tensorrt/convert.h b/paddle/infrt/dialect/tensorrt/convert.h index fc607aa112714..be363e77848a5 100644 --- a/paddle/infrt/dialect/tensorrt/convert.h +++ b/paddle/infrt/dialect/tensorrt/convert.h @@ -14,70 +14,154 @@ #pragma once #include +#include +#include #include +#include +#include 
#include - #include "paddle/infrt/dialect/infrt/common/types.h" #include "paddle/infrt/dialect/infrt/ir/infrt_dialect.h" #include "paddle/infrt/dialect/pd/ir/pd_ops.h" #include "paddle/infrt/dialect/phi/ir/infrt_phi_tensor.h" #include "paddle/infrt/dialect/tensorrt/trt_ops.h" +#include "paddle/infrt/kernel/tensorrt/trt_helper.h" namespace infrt { namespace trt { + +#ifdef INFRT_WITH_TRT + +#define STRING_TO_ENUM_TYPE(enum_type) enum_type +#define STRING_TO_ENUM_VALUE(enum_value) enum_value +#include + +#else // INFRT_WITH_TRT + +#define STRING_TO_ENUM_TYPE(enum_type) std::string +#define STRING_TO_ENUM_VALUE(enum_value) #enum_value + +#endif // INFRT_WITH_TRT + +template +::mlir::IntegerAttr createNvinferEnumAttr( + ::mlir::PatternRewriter &rewriter, // NOLINT + T enum_value) { + return rewriter.getSI32IntegerAttr((int32_t)enum_value); +} + +template <> +::mlir::IntegerAttr createNvinferEnumAttr( + ::mlir::PatternRewriter &rewriter, std::string enum_value) { // NOLINT + (void)enum_value; + return rewriter.getSI32IntegerAttr(-1); +} + static mlir::Value createTRTConv2dOp(mlir::PatternRewriter &rewriter, // NOLINT - mlir::Operation *op) { + mlir::Operation *op, + mlir::Value input, + mlir::Value filter) { auto conv_op = ::llvm::dyn_cast(op); ::mlir::SmallVector<::mlir::Value, 4> operands; - ::mlir::Operation::operand_range Input = conv_op.getODSOperands(0); - ::mlir::Operation::operand_range Filter = conv_op.getODSOperands(1); - operands.push_back((*Input.begin())); - operands.push_back((*Filter.begin())); + operands.push_back(input); + operands.push_back(filter); ::mlir::SmallVector<::mlir::Type, 4> resultTypes; for (auto v : conv_op.getODSResults(0)) { resultTypes.push_back(v.getType()); } + ::mlir::SmallVector<::mlir::NamedAttribute, 8> attributes; - { - // TODO(weishengying) : get out_channel_num for filter shape - auto tblgen_attr = rewriter.getSI32IntegerAttr(3); - attributes.emplace_back(rewriter.getStringAttr("out_channel_num"), - tblgen_attr); + + auto *filter_producer = filter.getDefiningOp(); + auto create_inited_tensor_op = + llvm::dyn_cast<::infrt::phi::CreateHostInitedDenseTensorOp>( + filter_producer); + + CHECK_NOTNULL(create_inited_tensor_op); + mlir::ArrayAttr dims = create_inited_tensor_op.dims(); + CHECK_EQ(dims.size(), 4U); + CHECK(dims[0].getType().isIntOrIndex()); + + const int32_t n_output = dims[0].cast().getInt(); + const int32_t filter_h = dims[2].cast().getInt(); + const int32_t filter_w = dims[3].cast().getInt(); + + auto padding_attr = conv_op->getAttrOfType<::mlir::ArrayAttr>("paddings"); + llvm::SmallVector paddings(padding_attr.size()); + for (size_t i = 0; i < padding_attr.size(); i++) { + paddings[i] = padding_attr[i].cast().getInt(); } - { - // TODO(weishengying) : get kernel_size for filter shape - auto tblgen_attr = rewriter.getI32ArrayAttr({3, 3}); - attributes.emplace_back(rewriter.getStringAttr("kernel_size"), tblgen_attr); + + auto dilations_attr = conv_op->getAttrOfType<::mlir::ArrayAttr>("dilations"); + llvm::SmallVector dilations(dilations_attr.size()); + for (size_t i = 0; i < dilations_attr.size(); i++) { + dilations[i] = dilations_attr[i].cast().getInt(); } - { - auto tblgen_attr = op->getAttrOfType<::mlir::ArrayAttr>("strides"); - attributes.emplace_back(rewriter.getStringAttr("strides"), tblgen_attr); + + llvm::SmallVector nv_paddings(2); + llvm::SmallVector nv_pre_paddings(2); + llvm::SmallVector nv_post_paddings(2); + llvm::SmallVector nv_dilations({dilations[0], dilations[1]}); + int32_t nv_padding_mode = 0; // 
nvinfer1::PaddingMode::kEXPLICIT_ROUND_DOWN + auto padding_algorithm_attr = + conv_op->getAttrOfType<::mlir::StringAttr>("padding_algorithm"); + if (padding_algorithm_attr.strref() == "VALID") { + for (size_t i = 0; i < paddings.size(); i++) { + paddings[i] = 0; + } } - { - auto tblgen_attr = op->getAttrOfType<::mlir::ArrayAttr>("paddings"); - attributes.emplace_back(rewriter.getStringAttr("paddings"), tblgen_attr); + if (padding_algorithm_attr.strref() == "SAME") { + nv_padding_mode = 2; // nvinfer1::PaddingMode::kSAME_UPPER + nv_dilations[0] = 1; + nv_dilations[1] = 1; } - { - auto tblgen_attr = - op->getAttrOfType<::mlir::StringAttr>("padding_algorithm"); - attributes.emplace_back(rewriter.getStringAttr("padding_mode"), - tblgen_attr); + + if (paddings.size() == 2) { + nv_paddings[0] = paddings[0]; + nv_paddings[1] = paddings[1]; + } else { + CHECK_EQ(paddings.size(), 4U); + nv_pre_paddings[0] = paddings[0]; + nv_pre_paddings[1] = paddings[2]; + nv_post_paddings[0] = paddings[1]; + nv_post_paddings[1] = paddings[3]; } + + attributes.emplace_back(rewriter.getStringAttr("out_channel_num"), + rewriter.getSI32IntegerAttr(n_output)); + + attributes.emplace_back(rewriter.getStringAttr("kernel_size"), + rewriter.getI32ArrayAttr({filter_h, filter_w})); + + attributes.emplace_back( + rewriter.getStringAttr("dilations"), + rewriter.getI32ArrayAttr({nv_dilations[0], nv_dilations[1]})); + + attributes.emplace_back(rewriter.getStringAttr("padding_mode"), + rewriter.getSI32IntegerAttr(nv_padding_mode)); + + attributes.emplace_back(rewriter.getStringAttr("paddings"), + rewriter.getI32ArrayAttr({paddings[0], paddings[1]})); + + attributes.emplace_back( + rewriter.getStringAttr("pre_paddings"), + rewriter.getI32ArrayAttr({nv_pre_paddings[0], nv_pre_paddings[1]})); + + attributes.emplace_back( + rewriter.getStringAttr("post_paddings"), + rewriter.getI32ArrayAttr({nv_post_paddings[0], nv_post_paddings[1]})); + { - auto tblgen_attr = op->getAttrOfType<::mlir::IntegerAttr>("groups"); + auto tblgen_attr = conv_op->getAttrOfType<::mlir::IntegerAttr>("groups"); attributes.emplace_back(rewriter.getStringAttr("groups"), tblgen_attr); } { - auto tblgen_attr = op->getAttrOfType<::mlir::ArrayAttr>("dilations"); - attributes.emplace_back(rewriter.getStringAttr("dilations"), tblgen_attr); - } - { - auto tblgen_attr = op->getAttrOfType<::mlir::StringAttr>("data_format"); - attributes.emplace_back(rewriter.getStringAttr("data_format"), tblgen_attr); + auto tblgen_attr = conv_op->getAttrOfType<::mlir::ArrayAttr>("strides"); + attributes.emplace_back(rewriter.getStringAttr("strides"), tblgen_attr); } return rewriter.create( - op->getLoc(), resultTypes, operands, attributes); + conv_op->getLoc(), resultTypes, operands, attributes); } static inline mlir::ArrayAttr TransposeWeight( @@ -161,49 +245,126 @@ inline ::llvm::SmallVector<::mlir::Value, 4> createTrtFcOp( return tblgen_repl_values; } -static mlir::Value createTRTShuffledOp( - mlir::PatternRewriter &rewriter, // NOLINT - mlir::Operation *op, - const mlir::Value &input, - const mlir::Attribute &start, - const mlir::Attribute &stop) { - auto flatten_op = ::llvm::dyn_cast(op); - ::mlir::SmallVector<::mlir::Value, 4> operands; - operands.push_back(input); +inline mlir::IntegerAttr CreatePoolingType( + mlir::PatternRewriter &builder, // NOLINT + mlir::StringAttr pool_type) { + // pool_type. 
+ auto ptype = pool_type.str(); + if (ptype == "max") { + return createNvinferEnumAttr(builder, nvinfer1::PoolingType::kMAX); + } else if (ptype == "avg") { + return createNvinferEnumAttr(builder, nvinfer1::PoolingType::kAVERAGE); + } else { + llvm_unreachable("unknown pool_type."); + return {}; + } +} - ::mlir::SmallVector<::mlir::Type, 4> resultTypes; - for (auto v : flatten_op.getODSResults(0)) { - resultTypes.push_back(v.getType()); +inline mlir::IntegerAttr CreatePaddingMode( + mlir::PatternRewriter &builder, // NOLINT + mlir::StringAttr padding_algorithm, + mlir::BoolAttr ceil_mode) { + // TODO(Inference): Phi pool kernel seems not process ceil_mode. + auto padding_algo = padding_algorithm.str(); + if (padding_algo == "SAME") { + return createNvinferEnumAttr(builder, nvinfer1::PaddingMode::kSAME_UPPER); + } + if (ceil_mode.getValue() && padding_algo != "SAME") { + return createNvinferEnumAttr(builder, + nvinfer1::PaddingMode::kEXPLICIT_ROUND_UP); + } else { + return createNvinferEnumAttr(builder, + nvinfer1::PaddingMode::kEXPLICIT_ROUND_DOWN); } +} - ::mlir::SmallVector<::mlir::NamedAttribute, 8> attributes; - mlir::IntegerAttr start_attr = start.dyn_cast(); - mlir::IntegerAttr stop_attr = stop.dyn_cast(); - - int start_axis = start_attr.getSInt(); - int stop_axis = stop_attr.getSInt(); - // TODO(weishengying) : get dim form DenseTonsor - int dims = 4; - // TODO(weishengying) : get input_dims form DenseTonsor - int input_dims[4] = {1, 2048, 1, 1}; - int dim_prod = 1; - - std::vector flatten_dim(dims - (stop_axis - start_axis)); - for (int i = 0, j = 0; i < dims; ++i) { - if (start_axis <= i + 1 && i + 1 <= stop_axis) { - int dim_i = input_dims[i]; - dim_prod *= dim_i; - if (i + 1 == stop_axis) { - flatten_dim[j++] = dim_prod; - } - } else { - flatten_dim[j++] = input_dims[i]; - } +inline ::llvm::SmallVector<::mlir::Value, 4> CreatePaddleTrtPoolingOp( + mlir::PatternRewriter &builder, // NOLINT + mlir::Value input, + mlir::StringAttr pool_type, + mlir::ArrayAttr ksize, + mlir::BoolAttr global_pooling, + mlir::ArrayAttr strides, + mlir::ArrayAttr paddings, + mlir::BoolAttr exclusive, + mlir::BoolAttr adaptive, + mlir::BoolAttr ceil_mode, + mlir::StringAttr data_format, + mlir::StringAttr padding_algorithm) { + ::llvm::SmallVector<::mlir::Value, 4> tblgen_repl_values; + + // TODO(inference): Support NHWC. + if (data_format.str() != "NCHW") { + CHECK(false) << "The pool2d converter now only support NCHW."; } - auto reshape_arrt = rewriter.getI32ArrayAttr(flatten_dim); - attributes.emplace_back(rewriter.getStringAttr("reshape"), reshape_arrt); - return rewriter.create( - op->getLoc(), resultTypes, operands, attributes); + + // TODO(Wilber): How to support dynamic shape? + + auto *input_producer = input.getDefiningOp(); + + // Process pool_type. + auto pool_type_attr = CreatePoolingType(builder, pool_type); + + // Update padding. + auto padding_algorithm_str = padding_algorithm.str(); + auto paddings_attr = paddings; + if (padding_algorithm_str == "EXPLICIT") { + // Do nothing on paddings. + } else if (padding_algorithm_str == "SAME") { + // We should process this case in trt network build phase. + } else if (padding_algorithm_str == "VALID") { + // Set padding to zero. 
+ paddings_attr = builder.getI32ArrayAttr({0, 0}); + } else { + CHECK(false) << "Unknown padding_algotithm."; + } + + // if global_pooling == true or adaptive == true, padding will be ignored + // if (global_pooling.getValue() || adaptive.getValue()) { + // paddings_attr = builder.getI32ArrayAttr({0, 0}); + // } + + // if global_pooling == true, then we should update kernel size to input dims. + if (global_pooling.getValue() == true) { + // Update ksize to input dims. + } + + // The adaptive logic should be processed when we get the context of + // INetworkDefinition, so we place the logic in infrt runtime(trt compile + // time). + + // The `exclusive` may be a naive attr, which can be forward to trt. + + auto padding_mode_attr = + CreatePaddingMode(builder, padding_algorithm, ceil_mode); + + if (global_pooling.getValue() == true) { + CHECK(false) << "Temporarily not support global_pool"; + return tblgen_repl_values; + } + + PoolingOp pool_op; + { + auto ods_loc = builder.getFusedLoc({input_producer->getLoc()}); + pool_op = builder.create(ods_loc, + input.getType(), + input, + pool_type_attr, + ksize, + strides, + paddings_attr, + padding_mode_attr, + exclusive, + adaptive, + padding_algorithm); + } + + for (auto v : + ::llvm::SmallVector<::mlir::Value, 4>{pool_op.getODSResults(0)}) { + tblgen_repl_values.push_back(v); + } + return tblgen_repl_values; } + } // namespace trt } // namespace infrt diff --git a/paddle/infrt/dialect/tensorrt/pd_lower_to_trt.td b/paddle/infrt/dialect/tensorrt/pd_lower_to_trt.td index ad60906ececbf..0cd100aa5b95a 100644 --- a/paddle/infrt/dialect/tensorrt/pd_lower_to_trt.td +++ b/paddle/infrt/dialect/tensorrt/pd_lower_to_trt.td @@ -25,15 +25,16 @@ def PD2TRT_Relu6_Lower : Pat< (PD_Relu6Op $X, $threshold), (TRT_ActivationOp $X, (TRT_createNvinferEnumAttr<"nvinfer1::ActivationType", "kCLIP">), (INFRT_createF32Attr<"0.0">), $threshold)>; -def createTRTConv2dOp : NativeCodeCall<"createTRTConv2dOp($_builder, $0.getDefiningOp())">; +def createTRTConv2dOp : NativeCodeCall<"createTRTConv2dOp($_builder, $0.getDefiningOp(), $1, $2)">; def PD2TRT_Conv2d_Lower : Pat< (PD_Conv2dOp:$old_value $Input, $Filter, $strides, $paddings, $padding_algorithm, $groups, $dilations, $data_format), - (createTRTConv2dOp $old_value)>; + (createTRTConv2dOp $old_value, $Input, $Filter)>; +def createTrtPoolingOp : NativeCodeCall<"::infrt::trt::CreatePaddleTrtPoolingOp($_builder, $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10)">; def PD2TRT_Pooling_Lower : Pat< (PD_Pool2dOp $Input, $pooling_type, $ksize, $global_pooling, $strides, $paddings, $exclusive, $adaptive, $ceil_mode, $data_format, $padding_algorithm), - (TRT_PoolingOp $Input, (INFRT_createI32Attr<"0">)/*kmax*/, $ksize, $strides, $paddings, $padding_algorithm)>; + (createTrtPoolingOp $Input, $pooling_type, $ksize, $global_pooling, $strides, $paddings, $exclusive, $adaptive, $ceil_mode, $data_format, $padding_algorithm)>; def PD2TRT_MatrixMultipl_Lower : Pat< (PD_MulOp $Input1, $Input2, $x_num_col_dims, $y_num_col_dims), @@ -49,9 +50,7 @@ def PD2TRT_Fc_Lower : Pat< (PD_Elementwise_addOp:$elt_out (PD_Matmul_v2Op $X, $Y, $trans_x, $trans_y), $elt_y, $axis), (createTrtFcOp $X, $Y, $elt_y, $elt_out)>; -def createTRTShuffledOp : NativeCodeCall<"createTRTShuffledOp($_builder, $0.getDefiningOp(), $1, $2, $3)">; - def PD2TRT_Flatten_contiguous_range_Lower : Pat< - (PD_Flatten_contiguous_rangeOp:$out $input, $start_axis, $end_axis), - (createTRTShuffledOp $out, $input, $start_axis, $end_axis)>; + (PD_Flatten_contiguous_rangeOp $input, $start_axis, 
$end_axis), + (TRT_ShuffleOp $input, $start_axis, $end_axis)>; #endif // PD_LOWER_TO_TRT diff --git a/paddle/infrt/dialect/tensorrt/trt_exec.cc b/paddle/infrt/dialect/tensorrt/trt_exec.cc index b37186ada6d74..837ca2093747c 100644 --- a/paddle/infrt/dialect/tensorrt/trt_exec.cc +++ b/paddle/infrt/dialect/tensorrt/trt_exec.cc @@ -72,7 +72,7 @@ int main(int argc, char** argv) { #endif context->loadAllAvailableDialects(); - module->dump(); + // module->dump(); mlir::PassManager pm(context); mlir::OpPassManager& trt_pass_manager = pm.nest(); @@ -87,7 +87,7 @@ int main(int argc, char** argv) { std::cout << "\npass failed!\n" << std::endl; return 4; } - module->dump(); + // module->dump(); ::infrt::host_context::TestMlir(module.get(), ®istry); return 0; } diff --git a/paddle/infrt/dialect/tensorrt/trt_op_converter_pass.cc b/paddle/infrt/dialect/tensorrt/trt_op_converter_pass.cc index 6bcef3d913d79..e40bbd67c0b5e 100644 --- a/paddle/infrt/dialect/tensorrt/trt_op_converter_pass.cc +++ b/paddle/infrt/dialect/tensorrt/trt_op_converter_pass.cc @@ -28,33 +28,6 @@ namespace infrt { namespace trt { -#ifdef INFRT_WITH_TRT - -#define STRING_TO_ENUM_TYPE(enum_type) enum_type -#define STRING_TO_ENUM_VALUE(enum_value) enum_value -#include - -#else // INFRT_WITH_TRT - -#define STRING_TO_ENUM_TYPE(enum_type) std::string -#define STRING_TO_ENUM_VALUE(enum_value) #enum_value - -#endif // INFRT_WITH_TRT - -template -::mlir::IntegerAttr createNvinferEnumAttr( - ::mlir::PatternRewriter &rewriter, // NOLINT - T enum_value) { - return rewriter.getSI32IntegerAttr((int32_t)enum_value); -} - -template <> -::mlir::IntegerAttr createNvinferEnumAttr( - ::mlir::PatternRewriter &rewriter, std::string enum_value) { // NOLINT - (void)enum_value; - return rewriter.getSI32IntegerAttr(-1); -} - #include "paddle/infrt/dialect/tensorrt/pd_lower_to_trt.cpp.inc" // NOLINT struct PD2TRT_GraphLower : public ::mlir::RewritePattern { @@ -119,20 +92,121 @@ struct PD2TRT_Batch_Norm_Lower : public ::mlir::RewritePattern { ::mlir::Operation::operand_range Input = casted_op.getODSOperands(0); ::mlir::Operation::operand_range Scale = casted_op.getODSOperands(1); ::mlir::Operation::operand_range Bias = casted_op.getODSOperands(2); + ::mlir::Operation::operand_range Mean = casted_op.getODSOperands(3); + ::mlir::Operation::operand_range Variance = casted_op.getODSOperands(4); + operands.push_back(Input[0]); + operands.push_back(Bias[0]); + operands.push_back(Scale[0]); // TODO(weishengying) : recompute this via params - operands.push_back((*Input.begin())); - operands.push_back((*Scale.begin())); - operands.push_back((*Bias.begin())); - operands.push_back((*Bias.begin())); + auto *scale_producer = Scale[0].getDefiningOp(); + auto create_scale_tensor_op = + llvm::dyn_cast<::infrt::phi::CreateHostInitedDenseTensorOp>( + scale_producer); + CHECK_NOTNULL(create_scale_tensor_op); - trt::ScaleNdOp scaleNd_op; - // inputs - ::mlir::SmallVector<::mlir::Value, 4> trt_inputs; - for (auto v : operands) { - trt_inputs.push_back(v); + auto *bias_producer = Bias[0].getDefiningOp(); + auto create_bias_tensor_op = + llvm::dyn_cast<::infrt::phi::CreateHostInitedDenseTensorOp>( + bias_producer); + CHECK_NOTNULL(create_bias_tensor_op); + + auto *mean_producer = Mean[0].getDefiningOp(); + auto create_mean_tensor_op = + llvm::dyn_cast<::infrt::phi::CreateHostInitedDenseTensorOp>( + mean_producer); + CHECK_NOTNULL(create_mean_tensor_op); + + auto *variance_producer = Variance[0].getDefiningOp(); + auto create_variance_tensor_op = + 
llvm::dyn_cast<::infrt::phi::CreateHostInitedDenseTensorOp>( + variance_producer); + CHECK_NOTNULL(create_variance_tensor_op); + + llvm::SmallVector scale_data; + mlir::ArrayAttr scale_array_attr = create_scale_tensor_op.values(); + CHECK_GT(scale_array_attr.size(), 0U); + CHECK(scale_array_attr[0].getType().isF32()); + scale_data.resize(scale_array_attr.size()); + for (size_t i = 0; i < scale_array_attr.size(); i++) { + scale_data[i] = + scale_array_attr[i].cast().getValueAsDouble(); } + llvm::SmallVector bias_data; + mlir::ArrayAttr bias_array_attr = create_bias_tensor_op.values(); + CHECK_GT(bias_array_attr.size(), 0U); + CHECK(bias_array_attr[0].getType().isF32()); + bias_data.resize(bias_array_attr.size()); + for (size_t i = 0; i < bias_array_attr.size(); i++) { + bias_data[i] = + bias_array_attr[i].cast().getValueAsDouble(); + } + + llvm::SmallVector mean_data; + mlir::ArrayAttr mean_array_attr = create_mean_tensor_op.values(); + CHECK_GT(mean_array_attr.size(), 0U); + CHECK(mean_array_attr[0].getType().isF32()); + mean_data.resize(mean_array_attr.size()); + for (size_t i = 0; i < mean_array_attr.size(); i++) { + mean_data[i] = + mean_array_attr[i].cast().getValueAsDouble(); + } + + llvm::SmallVector variance_data; + mlir::ArrayAttr variance_array_attr = create_variance_tensor_op.values(); + CHECK_GT(variance_array_attr.size(), 0U); + CHECK(variance_array_attr[0].getType().isF32()); + variance_data.resize(variance_array_attr.size()); + for (size_t i = 0; i < variance_array_attr.size(); i++) { + variance_data[i] = + variance_array_attr[i].cast().getValueAsDouble(); + } + + double eps = casted_op.epsilonAttr().getValueAsDouble(); + + llvm::SmallVector combile_scale_data; + combile_scale_data.resize(scale_data.size()); + llvm::SmallVector combile_bias_data; + combile_bias_data.resize(bias_data.size()); + + size_t ele_num = combile_scale_data.size(); + for (size_t i = 0; i < ele_num; i++) { + float scale = scale_data[i]; + float bias = bias_data[i]; + float mean = mean_data[i]; + float variance = variance_data[i]; + combile_scale_data[i] = scale / sqrtf(variance + eps); + combile_bias_data[i] = bias - mean * combile_scale_data[i]; + } + + rewriter.setInsertionPoint(create_scale_tensor_op); + auto new_scale_op = + rewriter.create<::infrt::phi::CreateHostInitedDenseTensorOp>( + create_scale_tensor_op->getLoc(), + create_scale_tensor_op.output().getType(), + create_scale_tensor_op.context(), + create_scale_tensor_op.dims(), + ::infrt::LayoutAttr::get(rewriter.getContext(), + ::infrt::LayoutType::NCHW), + create_scale_tensor_op.lod(), + rewriter.getF32ArrayAttr(combile_scale_data)); + rewriter.replaceOp(create_scale_tensor_op, new_scale_op->getResults()); + + rewriter.setInsertionPoint(create_bias_tensor_op); + auto new_bias_op = + rewriter.create<::infrt::phi::CreateHostInitedDenseTensorOp>( + create_bias_tensor_op->getLoc(), + create_bias_tensor_op.output().getType(), + create_bias_tensor_op.context(), + create_bias_tensor_op.dims(), + ::infrt::LayoutAttr::get(rewriter.getContext(), + ::infrt::LayoutType::NCHW), + create_bias_tensor_op.lod(), + rewriter.getF32ArrayAttr(combile_bias_data)); + rewriter.replaceOp(create_bias_tensor_op, new_bias_op->getResults()); + + trt::ScaleNdOp scaleNd_op; // resultTypes ::mlir::SmallVector<::mlir::Type, 4> resultTypes; for (auto v : casted_op.getODSResults(0)) { @@ -140,16 +214,8 @@ struct PD2TRT_Batch_Norm_Lower : public ::mlir::RewritePattern { } // attributes + rewriter.setInsertionPoint(op); ::mlir::SmallVector<::mlir::NamedAttribute, 8> attributes; 
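The rewrite above folds the batch-norm statistics into a single per-channel affine transform, so the op can be lowered to trt.ScaleNd with precomputed shift/scale constants. A minimal standalone sketch of that folding arithmetic, with illustrative channel values (the Channel struct and the numbers are not part of this patch):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Mirror of the combile_scale_data / combile_bias_data computation above:
// batch_norm(x) == combined_scale * x + combined_bias for each channel.
struct Channel { float scale, bias, mean, variance; };

int main() {
  const float eps = 1e-5f;
  std::vector<Channel> channels = {{1.0f, 0.5f, 0.2f, 4.0f}, {2.0f, -1.0f, 0.0f, 1.0f}};
  for (const auto& c : channels) {
    float combined_scale = c.scale / std::sqrt(c.variance + eps);
    float combined_bias = c.bias - c.mean * combined_scale;
    std::printf("scale'=%f bias'=%f\n", combined_scale, combined_bias);
  }
  return 0;
}
```

Because the scale and shift constant tensors are rewritten with these folded values, the explicit mode/axis attributes on ScaleNd are no longer needed, which is why the attribute blocks below are removed.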
- { - auto mode_attr = rewriter.getI32IntegerAttr(1); - attributes.emplace_back(rewriter.getStringAttr("mode"), mode_attr); - } - - { - auto axis_attr = rewriter.getI32IntegerAttr(-1); - attributes.emplace_back(rewriter.getStringAttr("axis"), axis_attr); - } auto result = rewriter .create( op->getLoc(), resultTypes, operands, attributes) diff --git a/paddle/infrt/dialect/tensorrt/trt_ops.td b/paddle/infrt/dialect/tensorrt/trt_ops.td index 3fd3f377f4ec7..b112cc748ecef 100755 --- a/paddle/infrt/dialect/tensorrt/trt_ops.td +++ b/paddle/infrt/dialect/tensorrt/trt_ops.td @@ -81,7 +81,9 @@ def TRT_ConvolutionOp : TRT_Op<"Convolution", [NoSideEffect]> { I32ArrayAttr:$kernel_size, I32ArrayAttr:$strides, I32ArrayAttr:$paddings, - StrAttr:$padding_mode, + I32ArrayAttr:$pre_paddings, + I32ArrayAttr:$post_paddings, + DefaultValuedAttr:$padding_mode, //kEXPLICIT_ROUND_DOWN SI32Attr:$groups, I32ArrayAttr:$dilations ); @@ -97,11 +99,14 @@ def TRT_PoolingOp : TRT_Op<"Pooling", [NoSideEffect]> { }]; let arguments = (ins DenseTensor:$input_tensor, - I32Attr:$pool_type, + SI32Attr:$pool_type, I32ArrayAttr:$window_size, I32ArrayAttr:$strides, I32ArrayAttr:$paddings, - StrAttr:$padding_mode + SI32Attr:$padding_mode, + BoolAttr:$exclusive, + BoolAttr:$adaptive, + StrAttr:$padding_algorithm ); let results = (outs DenseTensor:$output_tensor @@ -192,11 +197,9 @@ def TRT_ScaleNdOp : TRT_Op<"ScaleNd", [NoSideEffect]> { }]; let arguments = (ins DenseTensor:$input_tensor, - I32Attr:$mode, DenseTensor:$shift, DenseTensor:$scale, - DenseTensor:$power, - I32Attr:$axis + Optional:$power ); let results = (outs DenseTensor:$Out); @@ -211,7 +214,8 @@ def TRT_ShuffleOp : TRT_Op<"Shuffle", [NoSideEffect]> { }]; let arguments = (ins DenseTensor:$input_tensor, - I32ArrayAttr:$reshape + DefaultValuedAttr:$start_axis, + DefaultValuedAttr:$stop_axis ); let results = (outs DenseTensor:$Out); diff --git a/paddle/infrt/kernel/tensorrt/trt_helper.h b/paddle/infrt/kernel/tensorrt/trt_helper.h index 96122bffacdb2..4f1f1dde38cbe 100644 --- a/paddle/infrt/kernel/tensorrt/trt_helper.h +++ b/paddle/infrt/kernel/tensorrt/trt_helper.h @@ -28,13 +28,13 @@ namespace infrt { namespace kernel { namespace tensorrt { -static nvinfer1::DataType TensorTypeToWeightType(phi::DataType tensor_type) { +static nvinfer1::DataType TensorTypeToWeightType(::phi::DataType tensor_type) { switch (tensor_type) { - case phi::DataType::FLOAT32: + case ::phi::DataType::FLOAT32: return nvinfer1::DataType::kFLOAT; - case phi::DataType::INT32: + case ::phi::DataType::INT32: return nvinfer1::DataType::kINT32; - case phi::DataType::FLOAT16: + case ::phi::DataType::FLOAT16: return nvinfer1::DataType::kHALF; default: llvm_unreachable("should not reach here"); @@ -52,7 +52,19 @@ static nvinfer1::Dims ArrayAttrToNvDims(const mlir::ArrayAttr& int_array_attr) { return dims; } -static nvinfer1::Weights TensorToWeights(phi::DenseTensor* tensor) { +template +static std::vector ArrayAttrToVec(const mlir::ArrayAttr& int_array_attr) { + std::vector ret; + ret.resize(int_array_attr.size()); + CHECK(!int_array_attr.empty()); + CHECK(int_array_attr[0].getType().isIntOrIndex()); + for (size_t i = 0; i < int_array_attr.size(); ++i) { + ret[i] = int_array_attr[i].cast().getInt(); + } + return ret; +} + +static nvinfer1::Weights TensorToWeights(::phi::DenseTensor* tensor) { CHECK_NOTNULL(tensor); nvinfer1::Weights ret; ret.type = TensorTypeToWeightType(tensor->dtype()); diff --git a/paddle/infrt/kernel/tensorrt/trt_kernels.cc b/paddle/infrt/kernel/tensorrt/trt_kernels.cc index 
a6d740f01846d..c182dda2705fd 100644 --- a/paddle/infrt/kernel/tensorrt/trt_kernels.cc +++ b/paddle/infrt/kernel/tensorrt/trt_kernels.cc @@ -129,6 +129,7 @@ ::infrt::backends::tensorrt::TrtEngine CreateTrtEngine( // TODO(wilber): Find a way to add layer. for (auto& operation : block.without_terminator()) { + VLOG(1) << "process " << operation.getName().getStringRef().str() << " ..."; if (trt::ActivationOp op = llvm::dyn_cast(operation)) { ActivationFunc( op, network.get(), value_to_trt_tensor_map, value_to_tensor_map); @@ -138,6 +139,18 @@ ::infrt::backends::tensorrt::TrtEngine CreateTrtEngine( } else if (trt::ConvolutionOp op = llvm::dyn_cast(operation)) { ConvFunc(op, network.get(), value_to_trt_tensor_map, value_to_tensor_map); + } else if (trt::PoolingOp op = llvm::dyn_cast(operation)) { + PoolFunc(op, network.get(), value_to_trt_tensor_map, value_to_tensor_map); + } else if (trt::ShuffleOp op = llvm::dyn_cast(operation)) { + ShuffleFunc( + op, network.get(), value_to_trt_tensor_map, value_to_tensor_map); + } else if (trt::ScaleNdOp op = llvm::dyn_cast(operation)) { + ScaleNdFunc( + op, network.get(), value_to_trt_tensor_map, value_to_tensor_map); + } else if (trt::ElementWiseOp op = + llvm::dyn_cast(operation)) { + EltwiseFunc( + op, network.get(), value_to_trt_tensor_map, value_to_tensor_map); } else { CHECK(false) << "not supported operation."; } diff --git a/paddle/infrt/kernel/tensorrt/trt_layers.h b/paddle/infrt/kernel/tensorrt/trt_layers.h index 19e20c170ec83..9d8eba0bb31f5 100644 --- a/paddle/infrt/kernel/tensorrt/trt_layers.h +++ b/paddle/infrt/kernel/tensorrt/trt_layers.h @@ -15,13 +15,16 @@ #pragma once #include +#include +#include #include +#include #include +#include "paddle/infrt/backends/tensorrt/plugin/pool_op_plugin.h" #include "paddle/infrt/dialect/tensorrt/trt_ops.h" #include "paddle/infrt/kernel/tensorrt/trt_helper.h" - #include "paddle/phi/core/dense_tensor.h" namespace infrt { @@ -63,7 +66,12 @@ inline void ConvFunc(trt::ConvolutionOp& op, // NOLINT nvinfer1::Dims dims = ArrayAttrToNvDims(size_attrs); auto kernel_weights = TensorToWeights(value_to_tensor_map[op.kernel_weights()]); - auto bias_weights = TensorToWeights(value_to_tensor_map[op.bias_weights()]); + nvinfer1::Weights bias_weights; + if (op.bias_weights() == mlir::Value()) { + bias_weights = nvinfer1::Weights{}; + } else { + bias_weights = TensorToWeights(value_to_tensor_map[op.bias_weights()]); + } auto* layer = network->addConvolutionNd(*value_to_trt_tensor_map[input_tensor_repr], @@ -71,12 +79,88 @@ inline void ConvFunc(trt::ConvolutionOp& op, // NOLINT dims, kernel_weights, bias_weights); + + layer->setPaddingNd(ArrayAttrToNvDims(op.paddings())); + layer->setStrideNd(ArrayAttrToNvDims(op.strides())); CHECK_NOTNULL(layer); mlir::Value out_repr = op.output_tensor(); nvinfer1::ITensor* out_tensor = layer->getOutput(0); value_to_trt_tensor_map[out_repr] = out_tensor; } +inline void PoolFunc(trt::PoolingOp& op, // NOLINT + nvinfer1::INetworkDefinition* network, + ValueToITensorMap& value_to_trt_tensor_map, // NOLINT + ValueToTensorMap& value_to_tensor_map) { // NOLINT + mlir::Value input_tensor_repr = op.input_tensor(); + nvinfer1::ITensor* input_itensor = value_to_trt_tensor_map[input_tensor_repr]; + nvinfer1::Dims input_shape = input_itensor->getDimensions(); + int input_dims = input_shape.nbDims; + + auto padding_mode = op.padding_mode(); + auto pool_type = op.pool_type(); + mlir::ArrayAttr paddings = op.paddings(); + mlir::ArrayAttr strides = op.strides(); + mlir::ArrayAttr ksize = op.window_size(); + 
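These attributes are later converted to plain integer vectors through the ArrayAttrToVec helper introduced in trt_helper.h above. A minimal, MLIR-free sketch of that helper's shape, where IntAttrLike is a hypothetical stand-in for mlir::IntegerAttr used only so the snippet compiles on its own:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for mlir::IntegerAttr: exposes the two calls the helper relies on.
struct IntAttrLike {
  int64_t value;
  bool isIntOrIndex() const { return true; }
  int64_t getInt() const { return value; }
};

// Convert a list of integer-like attributes into std::vector<T>, checking the
// element kind first, as the trt_helper.h version does.
template <typename T>
std::vector<T> ArrayAttrToVec(const std::vector<IntAttrLike>& attr) {
  assert(!attr.empty() && attr.front().isIntOrIndex());
  std::vector<T> ret(attr.size());
  for (size_t i = 0; i < attr.size(); ++i) {
    ret[i] = static_cast<T>(attr[i].getInt());
  }
  return ret;
}

int main() {
  std::vector<IntAttrLike> paddings = {{1}, {1}};
  auto v = ArrayAttrToVec<int>(paddings);  // -> {1, 1}
  return v.size() == 2 ? 0 : 1;
}
```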
bool exclusive = op.exclusive(); + bool adaptive = op.adaptive(); + auto padding_algorithm = op.padding_algorithm().str(); + + if (padding_algorithm == "SAME") { + // TODO(wilber) + CHECK(false) << "Not supported `same` padding algorithm"; + } + + if (adaptive) { + // TODO(Inference) + // CHECK(false) << "Not supported adaptive pool"; + + std::vector input_shape_v; + for (int i = 0; i < input_dims; i++) { + input_shape_v.push_back(input_shape.d[i]); + } + auto paddings_val = ArrayAttrToVec(paddings); + std::vector real_paddings = paddings_val; + for (int i = 0; i < 2; ++i) { + int copy_pad = *(paddings_val.begin() + i); + real_paddings.insert(real_paddings.begin() + 2 * i + 1, copy_pad); + } + + auto* plugin = new backends::tensorrt::plugin::PoolPlugin( + false, + backends::tensorrt::plugin::PoolPlugin::PoolType::avg, + adaptive, + exclusive, + ArrayAttrToVec(ksize), + ArrayAttrToVec(strides), + paddings_val, + input_shape_v, + real_paddings); + auto* layer = network->addPluginV2(&input_itensor, 1, *plugin); + + mlir::Value out_repr = op.output_tensor(); + nvinfer1::ITensor* out_tensor = layer->getOutput(0); + value_to_trt_tensor_map[out_repr] = out_tensor; + return; + } + + nvinfer1::Dims window_size = ArrayAttrToNvDims(ksize); + + auto* layer = + network->addPoolingNd(*input_itensor, + static_cast(pool_type), + window_size); + CHECK_NOTNULL(layer); + layer->setPaddingMode(static_cast(padding_mode)); + layer->setPaddingNd(ArrayAttrToNvDims(paddings)); + layer->setStrideNd(ArrayAttrToNvDims(strides)); + layer->setAverageCountExcludesPadding(exclusive); + + mlir::Value out_repr = op.output_tensor(); + nvinfer1::ITensor* out_tensor = layer->getOutput(0); + value_to_trt_tensor_map[out_repr] = out_tensor; +} + inline void FcFunc(trt::FullyConnectedOp& op, // NOLINT nvinfer1::INetworkDefinition* network, ValueToITensorMap& value_to_trt_tensor_map, // NOLINT @@ -84,21 +168,135 @@ inline void FcFunc(trt::FullyConnectedOp& op, // NOLINT mlir::Value input_tensor_repr = op.input_tensor(); CHECK(value_to_trt_tensor_map.count(input_tensor_repr)); + nvinfer1::ITensor* input_itensor = value_to_trt_tensor_map[input_tensor_repr]; + nvinfer1::Dims input_shape = input_itensor->getDimensions(); + int input_dims = input_shape.nbDims; + CHECK_EQ(input_dims, 1) << "Now we only support 2-d input."; + // TODO(wilber): We should place the logic to ir. Now only support 2-d input + // and we reshape to 4-d. + nvinfer1::Dims reshape_before_fc_dim; + reshape_before_fc_dim.nbDims = input_dims + 2; + // padding shape "* x q x 1 x 1" + for (int i = 0; i < reshape_before_fc_dim.nbDims; i++) { + reshape_before_fc_dim.d[i] = 1; + } + reshape_before_fc_dim.d[0] = input_shape.d[0]; + auto* reshape_before_fc_layer = network->addShuffle(*input_itensor); + reshape_before_fc_layer->setReshapeDimensions(reshape_before_fc_dim); + + auto* reshape_itensor = reshape_before_fc_layer->getOutput(0); + auto kernel_weights = TensorToWeights(value_to_tensor_map[op.kernel_weights()]); auto bias_weights = TensorToWeights(value_to_tensor_map[op.bias_weights()]); int out_channel_num = op.out_channel_num(); - auto* layer = - network->addFullyConnected(*value_to_trt_tensor_map[input_tensor_repr], - out_channel_num, - kernel_weights, - bias_weights); + auto* layer = network->addFullyConnected( + *reshape_itensor, out_channel_num, kernel_weights, bias_weights); + + // TODO(wilber): fix. 
+ nvinfer1::Dims reshape_after_fc_dim; + reshape_after_fc_dim.nbDims = 1; + reshape_after_fc_dim.d[0] = layer->getOutput(0)->getDimensions().d[0]; + auto* reshape_after_fc_layer = network->addShuffle(*layer->getOutput(0)); + reshape_after_fc_layer->setReshapeDimensions(reshape_after_fc_dim); mlir::Value out_repr = op.output_tensor(); - nvinfer1::ITensor* out_tensor = layer->getOutput(0); + nvinfer1::ITensor* out_tensor = reshape_after_fc_layer->getOutput(0); value_to_trt_tensor_map[out_repr] = out_tensor; } + +inline void ShuffleFunc(trt::ShuffleOp& op, // NOLINT + nvinfer1::INetworkDefinition* network, + ValueToITensorMap& value_to_trt_tensor_map, // NOLINT + ValueToTensorMap& value_to_tensor_map) { // NOLINT + mlir::Value input_tensor_repr = op.input_tensor(); + nvinfer1::ITensor* input = value_to_trt_tensor_map[input_tensor_repr]; + int dims = input->getDimensions().nbDims; + int start_axis = op.start_axis(); + int stop_axis = op.stop_axis(); + + nvinfer1::IShuffleLayer* layer = nullptr; + if (start_axis < 0) start_axis += dims + 1; + if (stop_axis < 0) stop_axis += dims + 1; + int dim_prod = 1; + nvinfer1::Dims flatten_dim; + flatten_dim.nbDims = dims - (stop_axis - start_axis); + for (int i = 0, j = 0; i < dims; ++i) { + if (start_axis <= i + 1 && i + 1 <= stop_axis) { + int dim_i = input->getDimensions().d[i]; + CHECK_GT(dim_i, 0); + dim_prod *= dim_i; + if (i + 1 == stop_axis) { + flatten_dim.d[j++] = dim_prod; + } + } else { + flatten_dim.d[j++] = input->getDimensions().d[i]; + } + } + layer = network->addShuffle(*value_to_trt_tensor_map[input_tensor_repr]); + CHECK_NOTNULL(layer); + layer->setReshapeDimensions(flatten_dim); + for (size_t i = 0; i < op->getNumResults(); ++i) { + nvinfer1::ITensor* out_tensor = layer->getOutput(i); + mlir::Value out_value = op->getResult(i); + value_to_trt_tensor_map[out_value] = out_tensor; + } +} + +inline void ScaleNdFunc(trt::ScaleNdOp& op, // NOLINT + nvinfer1::INetworkDefinition* network, + ValueToITensorMap& value_to_trt_tensor_map, // NOLINT + ValueToTensorMap& value_to_tensor_map) { // NOLINT + mlir::Value input_tensor_repr = op.input_tensor(); + nvinfer1::ITensor* input = value_to_trt_tensor_map[input_tensor_repr]; + + mlir::Value shift_tensor_repr = op.shift(); + nvinfer1::Weights shift = + TensorToWeights(value_to_tensor_map[shift_tensor_repr]); + + mlir::Value scale_tensor_repr = op.scale(); + + nvinfer1::Weights scale = + TensorToWeights(value_to_tensor_map[scale_tensor_repr]); + + nvinfer1::Weights power_weights{nvinfer1::DataType::kFLOAT, nullptr, 0}; + + nvinfer1::IScaleLayer* layer = nullptr; + layer = network->addScaleNd( + *input, nvinfer1::ScaleMode::kCHANNEL, shift, scale, power_weights, 0); + CHECK_NOTNULL(layer); + + for (size_t i = 0; i < op->getNumResults(); ++i) { + nvinfer1::ITensor* out_tensor = layer->getOutput(i); + mlir::Value out_value = op->getResult(i); + value_to_trt_tensor_map[out_value] = out_tensor; + } +} + +inline void EltwiseFunc(trt::ElementWiseOp& op, // NOLINT + nvinfer1::INetworkDefinition* network, + ValueToITensorMap& value_to_trt_tensor_map, // NOLINT + ValueToTensorMap& value_to_tensor_map) { // NOLINT + mlir::Value input1_tensor_repr = op.input1(); + mlir::Value input2_tensor_repr = op.input2(); + nvinfer1::ITensor* input1 = value_to_trt_tensor_map[input1_tensor_repr]; + nvinfer1::ITensor* input2 = value_to_trt_tensor_map[input2_tensor_repr]; + + auto eltwise_operation = op.elementwise_operation(); + + auto* layer = network->addElementWise( + *input1, + *input2, + static_cast(eltwise_operation)); + 
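ShuffleFunc above implements flatten_contiguous_range by multiplying every dimension whose index falls in [start_axis, stop_axis] into one dimension and copying the rest through; the i + 1 offset matches the converter's loop and appears to account for TensorRT's implicit batch dimension, which getDimensions() does not include. A self-contained sketch of that index arithmetic (the FlattenDims name and the sample shape are illustrative only):

```cpp
#include <cstdio>
#include <vector>

// Collapse dims with (batch-inclusive) index in [start_axis, stop_axis].
std::vector<int> FlattenDims(const std::vector<int>& in, int start_axis, int stop_axis) {
  std::vector<int> out;
  int dim_prod = 1;
  for (int i = 0; i < static_cast<int>(in.size()); ++i) {
    if (start_axis <= i + 1 && i + 1 <= stop_axis) {
      dim_prod *= in[i];
      if (i + 1 == stop_axis) out.push_back(dim_prod);
    } else {
      out.push_back(in[i]);
    }
  }
  return out;
}

int main() {
  // e.g. a [2048, 1, 1] TRT tensor flattened over axes 1..3 collapses to [2048].
  for (int d : FlattenDims({2048, 1, 1}, 1, 3)) std::printf("%d ", d);
  std::printf("\n");
  return 0;
}
```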
CHECK_NOTNULL(layer); + for (size_t i = 0; i < op->getNumResults(); ++i) { + nvinfer1::ITensor* out_tensor = layer->getOutput(i); + mlir::Value out_value = op->getResult(i); + value_to_trt_tensor_map[out_value] = out_tensor; + } +} + } // namespace tensorrt } // namespace kernel } // namespace infrt diff --git a/paddle/infrt/tests/dialect/phi/phi_test.mlir b/paddle/infrt/tests/dialect/phi/disabled_phi_test.mlir similarity index 100% rename from paddle/infrt/tests/dialect/phi/phi_test.mlir rename to paddle/infrt/tests/dialect/phi/disabled_phi_test.mlir diff --git a/paddle/infrt/tests/dialect/phi/kernels/resnet50_ops.mlir b/paddle/infrt/tests/dialect/phi/kernels/disabled_resnet50_ops.mlir similarity index 100% rename from paddle/infrt/tests/dialect/phi/kernels/resnet50_ops.mlir rename to paddle/infrt/tests/dialect/phi/kernels/disabled_resnet50_ops.mlir diff --git a/paddle/infrt/tests/dialect/tensorrt/disabled_trt.mlir b/paddle/infrt/tests/dialect/tensorrt/disabled_trt.mlir deleted file mode 100644 index ef86dcf1e72a0..0000000000000 --- a/paddle/infrt/tests/dialect/tensorrt/disabled_trt.mlir +++ /dev/null @@ -1,37 +0,0 @@ -// RUN: infrtexec -i %s | FileCheck %s - -// CHECK-LABEL: @run_trt -func @run_trt(%0 : !infrt.dense_tensor, %ctx : !phi.context) { - %a = "trt.create_engine"(%0) ({ - %1 = "trt.Activation"(%0) {activation_type = 1 : si32, alpha = 1.0 : f32, beta = 6.0 : f32} : (!infrt.dense_tensor) -> !infrt.dense_tensor - "infrt.return"(%1) : (!infrt.dense_tensor) -> () - }) : (!infrt.dense_tensor) -> !trt.engine - "trt.inspect_engine"(%a) {} : (!trt.engine) -> () - - %res = "trt.compute"(%a, %ctx) {} : (!trt.engine, !phi.context) -> (!infrt.tensor_list) - %size = "dt.tensor_list_get_size"(%res) {} : (!infrt.tensor_list) -> (i32) - "infrt.print.i32"(%size) {} : (i32) -> () - - %ts0 = "dt.tensor_list_get_tensor"(%res) {id = 0 : i32} : (!infrt.tensor_list) -> (!infrt.dense_tensor) - "phi_dt.print_tensor" (%ts0) : (!infrt.dense_tensor) -> () - - infrt.return -} - -// CHECK-LABEL: @main -func @main() { - %ctx = "phi_dt.create_context.gpu" (): () -> !phi.context - %t = "phi_dt.create_dense_tensor.gpu" (%ctx) { - precision=#infrt.precision, - layout=#infrt.layout, - dims=[1:i64, 3:i64, 1:i64, 1:i64], lod=[1:i64]}: (!phi.context) -> (!infrt.dense_tensor) - - "phi_dt.fill_dense_tensor.f32"(%t) {value=[3.8:f32, 2.4:f32, 1.3:f32]} : (!infrt.dense_tensor) -> () - "phi_dt.print_tensor" (%t) : (!infrt.dense_tensor) -> () - - //%res = - infrt.call @run_trt(%t, %ctx) : (!infrt.dense_tensor, !phi.context) -> () - //-> (!infrt.dense_tensor) - - infrt.return -} diff --git a/paddle/infrt/tests/dialect/tensorrt/disabled_trt_activation.mlir b/paddle/infrt/tests/dialect/tensorrt/disabled_trt_activation.mlir new file mode 100644 index 0000000000000..557990677696e --- /dev/null +++ b/paddle/infrt/tests/dialect/tensorrt/disabled_trt_activation.mlir @@ -0,0 +1,21 @@ +module { + func @main_graph(%arg0: !infrt.dense_tensor) -> !infrt.dense_tensor { + %0 = "phi_dt.create_context.gpu"() : () -> !phi.context + %1 = "phi_dt.memcpy.gpu"(%arg0, %0) {d2h = false} : (!infrt.dense_tensor, !phi.context) -> !infrt.dense_tensor + %2 = "trt.create_engine"(%1) ( { + %6 = "trt.Activation"(%1) {activation_type = 1 : si32, alpha = 0.000000e+00 : f32, beta = 0.000000e+00 : f32} : (!infrt.dense_tensor) -> !infrt.dense_tensor + infrt.return %6 : !infrt.dense_tensor + }) {run_once = true} : (!infrt.dense_tensor) -> !trt.engine + %3 = "trt.compute"(%2, %0) : (!trt.engine, !phi.context) -> !infrt.tensor_list + %4 = 
"dt.tensor_list_get_tensor"(%3) {id = 0 : i32} : (!infrt.tensor_list) -> !infrt.dense_tensor + %5 = "phi_dt.memcpy.gpu"(%4, %0) {d2h = true} : (!infrt.dense_tensor, !phi.context) -> !infrt.dense_tensor + infrt.return %5 : !infrt.dense_tensor + } + func @main() { + %0 = "phi_dt.create_context.cpu"() : () -> !phi.context + %1 = "phi_dt.create_inited_dense_tensor.cpu.f32"(%0) {dims = [3, 6, 1, 1], layout = #infrt.layout, lod = [0], value = 1.500000e+00 : f32} : (!phi.context) -> !infrt.dense_tensor + %2 = infrt.call @main_graph(%1) : (!infrt.dense_tensor) -> !infrt.dense_tensor + phi_dt.print_tensor(%2 : !infrt.dense_tensor) + infrt.return + } +} diff --git a/paddle/infrt/tests/dialect/tensorrt/disabled_trt_fc.mlir b/paddle/infrt/tests/dialect/tensorrt/disabled_trt_fc.mlir index 78dc4ac1c1093..aba706df71843 100644 --- a/paddle/infrt/tests/dialect/tensorrt/disabled_trt_fc.mlir +++ b/paddle/infrt/tests/dialect/tensorrt/disabled_trt_fc.mlir @@ -1,46 +1,25 @@ -// RUN: infrtexec -i %s | FileCheck %s - -// CHECK-LABEL: @main -func @main() { - %ctx = "phi_dt.create_context.gpu" (): () -> !phi.context - %cpu_ctx = "phi_dt.create_context.cpu" (): () -> !phi.context - - %input_tensor = "phi_dt.create_dense_tensor.gpu" (%ctx) { - precision=#infrt.precision, - layout=#infrt.layout, - dims=[1:i64, 3:i64, 1:i64, 1:i64], lod=[1:i64]}: (!phi.context) -> (!infrt.dense_tensor) - "phi_dt.fill_dense_tensor.f32"(%input_tensor) {value=[3.8:f32, 2.4:f32, 1.3:f32]} : (!infrt.dense_tensor) -> () - //"phi_dt.print_tensor" (%input_tensor) : (!infrt.dense_tensor) -> () - - %kernel_weight = "phi_dt.create_dense_tensor.cpu"(%cpu_ctx) { - precision=#infrt.precision, - layout=#infrt.layout, - dims=[2:i64, 3:i64], lod=[1:i64]} : (!phi.context) -> (!infrt.dense_tensor) - "phi_dt.fill_dense_tensor.f32"(%kernel_weight) {value=[1.:f32, 2.:f32, 3.:f32, 4.:f32, 5.:f32, 6.:f32]} : (!infrt.dense_tensor) -> () - //"phi_dt.print_tensor" (%kernel_weight) : (!infrt.dense_tensor) -> () - - %kernel_bias = "phi_dt.create_dense_tensor.cpu"(%cpu_ctx) { - precision=#infrt.precision, - layout=#infrt.layout, - dims=[2:i64], lod=[1:i64]} : (!phi.context) -> (!infrt.dense_tensor) - "phi_dt.fill_dense_tensor.f32"(%kernel_bias) {value=[1.:f32, 2.:f32]} : (!infrt.dense_tensor) -> () - //"phi_dt.print_tensor" (%kernel_bias) : (!infrt.dense_tensor) -> () - - %engine = "trt.create_engine"(%input_tensor, %kernel_weight, %kernel_bias) ({ - %1 = "trt.Activation"(%input_tensor) {activation_type = 1 : si32, alpha = 1.0 : f32, beta = 6.0 : f32} : (!infrt.dense_tensor) -> !infrt.dense_tensor - %2 = "trt.FullyConnected"(%input_tensor, %kernel_weight, %kernel_bias) {out_channel_num = 2 : si32} : (!infrt.dense_tensor, !infrt.dense_tensor, !infrt.dense_tensor) -> !infrt.dense_tensor - "infrt.return"(%1, %2) : (!infrt.dense_tensor, !infrt.dense_tensor) -> () - }) : (!infrt.dense_tensor, !infrt.dense_tensor, !infrt.dense_tensor) -> !trt.engine - - %res = "trt.compute"(%engine, %ctx) {} : (!trt.engine, !phi.context) -> (!infrt.tensor_list) - %size = "dt.tensor_list_get_size"(%res) {} : (!infrt.tensor_list) -> (i32) - "infrt.print.i32"(%size) {} : (i32) -> () - - %ts0 = "dt.tensor_list_get_tensor"(%res) {id = 0 : i32} : (!infrt.tensor_list) -> (!infrt.dense_tensor) - "phi_dt.print_tensor" (%ts0) : (!infrt.dense_tensor) -> () - - %ts1 = "dt.tensor_list_get_tensor"(%res) {id = 1 : i32} : (!infrt.tensor_list) -> (!infrt.dense_tensor) - "phi_dt.print_tensor" (%ts1) : (!infrt.dense_tensor) -> () - - infrt.return +module { + func @main_graph(%arg0: !infrt.dense_tensor) -> 
!infrt.dense_tensor { + %ctx = "phi_dt.create_context.cpu" (): () -> !phi.context + %0 = "phi_dt.create_context.gpu"() : () -> !phi.context + %1 = "phi_dt.memcpy.gpu"(%arg0, %0) {d2h = false} : (!infrt.dense_tensor, !phi.context) -> !infrt.dense_tensor + %4 = "phi_dt.create_inited_dense_tensor.cpu.f32" (%ctx) {value=1.5:f32, layout=#infrt.layout, lod=[0], dims=[2, 6]}: (!phi.context) -> (!infrt.dense_tensor) + %3 = "phi_dt.create_inited_dense_tensor.cpu.f32" (%ctx) {value=1.5:f32, layout=#infrt.layout, lod=[0], dims=[2]}: (!phi.context) -> (!infrt.dense_tensor) + %5 = "trt.create_engine"(%1, %4, %3) ( { + %10 = "trt.FullyConnected"(%1, %4, %3) {out_channel_num = 2 : si32} : (!infrt.dense_tensor, !infrt.dense_tensor, !infrt.dense_tensor) -> !infrt.dense_tensor + infrt.return %10 : !infrt.dense_tensor + }) {run_once = true} : (!infrt.dense_tensor, !infrt.dense_tensor, !infrt.dense_tensor) -> !trt.engine + %6 = "trt.compute"(%5, %0) : (!trt.engine, !phi.context) -> !infrt.tensor_list + %7 = "dt.tensor_list_get_tensor"(%6) {id = 0 : i32} : (!infrt.tensor_list) -> !infrt.dense_tensor + %8 = "phi_dt.memcpy.gpu"(%7, %0) {d2h = true} : (!infrt.dense_tensor, !phi.context) -> !infrt.dense_tensor + infrt.return %8 : !infrt.dense_tensor + } + + func @main() { + %ctx = "phi_dt.create_context.cpu" (): () -> !phi.context + %input_tensor = "phi_dt.create_inited_dense_tensor.cpu.f32" (%ctx) {value=1.5:f32, layout=#infrt.layout, lod=[0], dims=[3, 6, 1, 1]}: (!phi.context) -> (!infrt.dense_tensor) + %res = infrt.call @main_graph(%input_tensor) {} : (!infrt.dense_tensor) -> !infrt.dense_tensor + "phi_dt.print_tensor" (%res) : (!infrt.dense_tensor) -> () + infrt.return + } } diff --git a/paddle/infrt/tests/dialect/tensorrt/disabled_trt_pool.mlir b/paddle/infrt/tests/dialect/tensorrt/disabled_trt_pool.mlir new file mode 100644 index 0000000000000..af24ac63d23fe --- /dev/null +++ b/paddle/infrt/tests/dialect/tensorrt/disabled_trt_pool.mlir @@ -0,0 +1,21 @@ +module { + func @main_graph(%arg0: !infrt.dense_tensor) -> !infrt.dense_tensor { + %0 = "phi_dt.create_context.gpu"() : () -> !phi.context + %1 = "phi_dt.memcpy.gpu"(%arg0, %0) {d2h = false} : (!infrt.dense_tensor, !phi.context) -> !infrt.dense_tensor + %2 = "trt.create_engine"(%1) ( { + %6 = "trt.Pooling"(%1) {padding_mode = 0 : i32, paddings = [1 : i32, 1 : i32], pool_type = 0 : i32, strides = [2 : i32, 2 : i32], window_size = [3 : i32, 3 : i32], exclusive = false, adaptive = false, padding_algorithm = "EXPLICIT"} : (!infrt.dense_tensor) -> !infrt.dense_tensor + infrt.return %6 : !infrt.dense_tensor + }) {run_once = true} : (!infrt.dense_tensor) -> !trt.engine + %3 = "trt.compute"(%2, %0) : (!trt.engine, !phi.context) -> !infrt.tensor_list + %4 = "dt.tensor_list_get_tensor"(%3) {id = 0 : i32} : (!infrt.tensor_list) -> !infrt.dense_tensor + %5 = "phi_dt.memcpy.gpu"(%4, %0) {d2h = true} : (!infrt.dense_tensor, !phi.context) -> !infrt.dense_tensor + infrt.return %5 : !infrt.dense_tensor + } + func @main() { + %0 = "phi_dt.create_context.cpu"() : () -> !phi.context + %1 = "phi_dt.create_inited_dense_tensor.cpu.f32"(%0) {dims = [1, 3, 10, 10], layout = #infrt.layout, lod = [0], value = 1.500000e+00 : f32} : (!phi.context) -> !infrt.dense_tensor + %2 = infrt.call @main_graph(%1) : (!infrt.dense_tensor) -> !infrt.dense_tensor + phi_dt.print_tensor(%2 : !infrt.dense_tensor) + infrt.return + } +} diff --git a/paddle/phi/api/include/tensor.h b/paddle/phi/api/include/tensor.h index 9b0371fc380b9..3c5c1531c4a2d 100644 --- a/paddle/phi/api/include/tensor.h +++ 
b/paddle/phi/api/include/tensor.h @@ -494,6 +494,7 @@ class PADDLE_API Tensor final { * @return AbstractAutogradMeta* */ AbstractAutogradMeta* get_autograd_meta() const; + const std::shared_ptr& mutable_autograd_meta() const; /** * @brief Set the autograd meta object @@ -566,7 +567,7 @@ class PADDLE_API Tensor final { * heterogeneous Tensor implementation, so that the API level can be unified * to one `Tensor`. */ - std::shared_ptr impl_; + std::shared_ptr impl_{nullptr}; /** * [ Why need abstract AbstractAutogradMeta here? ] diff --git a/paddle/phi/api/lib/CMakeLists.txt b/paddle/phi/api/lib/CMakeLists.txt index af2533019156c..7dfe7d8cf4d20 100644 --- a/paddle/phi/api/lib/CMakeLists.txt +++ b/paddle/phi/api/lib/CMakeLists.txt @@ -165,7 +165,7 @@ cc_library(context_pool SRCS context_pool.cc DEPS phi_context phi_enforce place) cc_library(kernel_dispatch SRCS kernel_dispatch.cc DEPS phi_tensor_raw phi_context kernel_factory context_pool) cc_library(api_gen_utils SRCS api_gen_utils.cc DEPS phi_tensor_raw selected_rows sparse_csr_tensor sparse_coo_tensor) cc_library(phi_data_transform SRCS data_transform.cc DEPS phi_tensor_raw transfer_layout_kernel cast_kernel data_device_transform) -cc_library(api_custom_impl SRCS api_custom_impl.cc DEPS phi_tensor_raw phi kernel_dispatch api_gen_utils phi_data_transform) +cc_library(api_custom_impl SRCS api_custom_impl.cc DEPS phi_tensor_raw phi kernel_dispatch api_gen_utils backward_infermeta phi_data_transform) cc_library(sparse_api_custom_impl SRCS sparse_api_custom_impl.cc DEPS phi_tensor_raw phi kernel_dispatch api_gen_utils phi_data_transform) cc_library(phi_function_api SRCS ${api_source_file} DEPS phi_tensor_raw phi kernel_dispatch api_gen_utils phi_data_transform api_custom_impl) diff --git a/paddle/phi/api/lib/api_custom_impl.cc b/paddle/phi/api/lib/api_custom_impl.cc index 152873fe41072..5d1851fb85aa2 100644 --- a/paddle/phi/api/lib/api_custom_impl.cc +++ b/paddle/phi/api/lib/api_custom_impl.cc @@ -18,9 +18,11 @@ limitations under the License. */ #include "paddle/phi/api/lib/data_transform.h" #include "paddle/phi/api/lib/kernel_dispatch.h" #include "paddle/phi/api/lib/utils/storage.h" +#include "paddle/phi/common/type_traits.h" #include "paddle/phi/core/compat/convert_utils.h" #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/core/meta_tensor.h" +#include "paddle/phi/infermeta/backward.h" #include "paddle/phi/infermeta/binary.h" #include "paddle/phi/infermeta/multiary.h" #include "paddle/phi/infermeta/nullary.h" @@ -31,6 +33,215 @@ limitations under the License. 
*/ namespace paddle { namespace experimental { +////////////////// Forward api impls ////////////////////// + +Tensor conv2d_impl(const Tensor& input, + const Tensor& filter, + const std::vector& strides, + const std::vector& paddings, + const std::string& paddding_algorithm, + int groups, + const std::vector& dilations, + const std::string& data_format, + bool use_addto, + int workspace_size_MB, + bool exhaustive_search) { + Backend kernel_backend = Backend::UNDEFINED; + DataLayout kernel_layout = DataLayout::UNDEFINED; + DataType kernel_data_type = DataType::UNDEFINED; + + kernel_data_type = ParseDataType(input); + + if (kernel_backend == Backend::UNDEFINED || + kernel_layout == DataLayout::UNDEFINED || + kernel_data_type == DataType::UNDEFINED) { + auto kernel_key_set = ParseKernelKeyByInputArgs(input, filter); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + if (kernel_backend == Backend::UNDEFINED) { + kernel_backend = kernel_key.backend(); + } + if (kernel_layout == DataLayout::UNDEFINED) { + kernel_layout = kernel_key.layout(); + } + if (kernel_data_type == DataType::UNDEFINED) { + kernel_data_type = kernel_key.dtype(); + } + } + + VLOG(6) << "conv2d API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "conv2d", {kernel_backend, kernel_layout, kernel_data_type}, true); + VLOG(6) << "conv2d API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + phi::TensorArgDef args0 = kernel.InputAt(0); + phi::TensorArgDef args1 = kernel.InputAt(1); + if (kernel_backend == Backend::GPU) { + args0.backend = Backend::GPU; + args1.backend = Backend::GPU; + } + + auto input_input = PrepareData(input, args0, {}); + auto input_filter = PrepareData(filter, args1, {}); + + Tensor api_output; + auto kernel_out = SetKernelOutput(kernel_backend, &api_output); + phi::MetaTensor meta_out(kernel_out); + + phi::ConvInferMeta(MakeMetaTensor(*input_input), + MakeMetaTensor(*input_filter), + strides, + paddings, + paddding_algorithm, + groups, + dilations, + data_format, + use_addto, + workspace_size_MB, + exhaustive_search, + &meta_out); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const std::vector&, + const std::vector&, + const std::string&, + int, + const std::vector&, + const std::string&, + bool, + int, + bool, + phi::DenseTensor*); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + + { + (*kernel_fn)(*dev_ctx, + *input_input, + *input_filter, + strides, + paddings, + paddding_algorithm, + groups, + dilations, + data_format, + use_addto, + workspace_size_MB, + exhaustive_search, + kernel_out); + } + + return api_output; +} + +std::vector> conv2d_grad_impl( + const Tensor& input, + const Tensor& filter, + const Tensor& out_grad, + const std::vector& strides, + const std::vector& paddings, + const std::string& paddding_algorithm, + int groups, + const std::vector& dilations, + const std::string& data_format, + bool use_addto, + int workspace_size_MB, + bool exhaustive_search) { + Backend kernel_backend = Backend::UNDEFINED; + DataLayout kernel_layout = DataLayout::UNDEFINED; + DataType kernel_data_type = DataType::UNDEFINED; + + if (kernel_backend == Backend::UNDEFINED || + kernel_layout == DataLayout::UNDEFINED || + kernel_data_type == DataType::UNDEFINED) { + auto kernel_key_set = ParseKernelKeyByInputArgs(input, filter, out_grad); + auto 
kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + if (kernel_backend == Backend::UNDEFINED) { + kernel_backend = kernel_key.backend(); + } + if (kernel_layout == DataLayout::UNDEFINED) { + kernel_layout = kernel_key.layout(); + } + if (kernel_data_type == DataType::UNDEFINED) { + kernel_data_type = kernel_key.dtype(); + } + } + + VLOG(6) << "conv2d_grad API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "conv2d_grad", {kernel_backend, kernel_layout, kernel_data_type}, true); + VLOG(6) << "conv2d_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + phi::TensorArgDef args0 = kernel.InputAt(0); + phi::TensorArgDef args1 = kernel.InputAt(1); + phi::TensorArgDef args2 = kernel.InputAt(2); + if (kernel_backend == Backend::GPU) { + args0.backend = Backend::GPU; + args1.backend = Backend::GPU; + args2.backend = Backend::GPU; + } + + auto input_input = PrepareData(input, args0, {}); + auto input_filter = PrepareData(filter, args1, {}); + auto input_out_grad = PrepareData(out_grad, args2, {}); + + std::vector> api_output(2); + api_output[0].emplace_back(); + auto kernel_out_0 = SetKernelOutput(kernel_backend, &api_output[0][0]); + api_output[1].emplace_back(); + auto kernel_out_1 = SetKernelOutput(kernel_backend, &api_output[1][0]); + phi::MetaTensor meta_out_0(kernel_out_0); + phi::MetaTensor meta_out_1(kernel_out_1); + + phi::GeneralBinaryGradInferMeta(MakeMetaTensor(*input_input), + MakeMetaTensor(*input_filter), + &meta_out_0, + &meta_out_1); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const std::vector&, + const std::vector&, + const std::string&, + int, + const std::vector&, + const std::string&, + bool, + int, + bool, + phi::DenseTensor*, + phi::DenseTensor*); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + + { + (*kernel_fn)(*dev_ctx, + *input_input, + *input_filter, + *input_out_grad, + strides, + paddings, + paddding_algorithm, + groups, + dilations, + data_format, + use_addto, + workspace_size_MB, + exhaustive_search, + kernel_out_0, + kernel_out_1); + } + + return api_output; +} + Tensor copy_to_impl(const Tensor& x, Place place, bool blocking) { auto kernel_key_set = ParseKernelKeyByInputArgs(x); kernel_key_set.backend_set = @@ -58,6 +269,7 @@ Tensor copy_to_impl(const Tensor& x, Place place, bool blocking) { phi::DenseTensor*); auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, *dense_x, place, blocking, kernel_out); return out; @@ -120,5 +332,687 @@ std::vector split_impl(const Tensor& x, return out; } +std::tuple momentum_impl( + const Tensor& param, + const Tensor& grad, + const Tensor& velocity, + const Tensor& learning_rate, + paddle::optional master_param, + float mu, + bool use_nesterov, + const std::string& regularization_method, + float regularization_coeff, + bool multi_precision, + float rescale_grad) { + Backend kernel_backend = Backend::UNDEFINED; + DataLayout kernel_layout = DataLayout::UNDEFINED; + DataType kernel_data_type = DataType::UNDEFINED; + if (kernel_backend == Backend::UNDEFINED || + kernel_layout == DataLayout::UNDEFINED || + kernel_data_type == DataType::UNDEFINED) { + auto kernel_key_set = ParseKernelKeyByInputArgs(param); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + if (kernel_backend == Backend::UNDEFINED) { + 
kernel_backend = kernel_key.backend(); + } + if (kernel_layout == DataLayout::UNDEFINED) { + kernel_layout = kernel_key.layout(); + } + if (kernel_data_type == DataType::UNDEFINED) { + kernel_data_type = kernel_key.dtype(); + } + } + std::string kernel_name = "momentum"; + if (grad.is_selected_rows()) { + kernel_name = "momentum_dense_param_sparse_grad"; + } + const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + kernel_name, {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << kernel_name << " API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << kernel_name << " API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto input_param = PrepareData(param, kernel.InputAt(0), {}); + auto input_grad = PrepareData(grad, kernel.InputAt(1), {}); + auto input_velocity = PrepareData(velocity, kernel.InputAt(2), {}); + auto input_learning_rate = PrepareData(learning_rate, kernel.InputAt(3), {}); + paddle::optional input_master_param(paddle::none); + auto input_master_param_ptr = + PrepareData(master_param, kernel.InputAt(4), {}); + + std::tuple api_output; + auto kernel_out_0 = input_param.get(); + auto kernel_out_1 = input_velocity.get(); + phi::DenseTensor* kernel_out_2 = nullptr; + if (input_master_param_ptr) { + input_master_param = + paddle::make_optional(*input_master_param_ptr); + kernel_out_2 = + paddle::make_optional(*input_master_param_ptr) + .get_ptr(); + } + + paddle::optional input_meta_ref_master_param( + paddle::none); + phi::DenseTensor dt; + phi::MetaTensor input_meta_tmp_master_param(dt); + if (input_master_param_ptr) { + input_meta_tmp_master_param.set_dtype(input_master_param_ptr->dtype()); + input_meta_tmp_master_param.set_dims(input_master_param_ptr->dims()); + input_meta_tmp_master_param.set_layout(input_master_param_ptr->layout()); + input_meta_ref_master_param = input_meta_tmp_master_param; + } + phi::MetaTensor meta_out_0(kernel_out_0); + phi::MetaTensor meta_out_1(kernel_out_1); + if (kernel_out_2) { + phi::MetaTensor meta_out_2(kernel_out_2); + phi::MomentumInferMeta(MakeMetaTensor(*input_param), + MakeMetaTensor(*input_grad), + MakeMetaTensor(*input_velocity), + MakeMetaTensor(*input_learning_rate), + input_meta_ref_master_param, + mu, + use_nesterov, + regularization_method, + regularization_coeff, + multi_precision, + rescale_grad, + &meta_out_0, + &meta_out_1, + &meta_out_2); + } else { + phi::MomentumInferMeta(MakeMetaTensor(*input_param), + MakeMetaTensor(*input_grad), + MakeMetaTensor(*input_velocity), + MakeMetaTensor(*input_learning_rate), + input_meta_ref_master_param, + mu, + use_nesterov, + regularization_method, + regularization_coeff, + multi_precision, + rescale_grad, + &meta_out_0, + &meta_out_1, + nullptr); + } + + using kernel_signature = void (*)(const platform::DeviceContext&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const phi::DenseTensor&, + paddle::optional, + float, + bool, + const std::string&, + float, + bool, + float, + phi::DenseTensor*, + phi::DenseTensor*, + phi::DenseTensor*); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + + (*kernel_fn)(*dev_ctx, + *input_param, + *input_grad, + *input_velocity, + *input_learning_rate, + input_master_param, + mu, + use_nesterov, + regularization_method, + regularization_coeff, + multi_precision, + rescale_grad, + kernel_out_0, + kernel_out_1, + kernel_out_2); + + return api_output; +} + +std::vector unbind_impl(const Tensor& 
input, int axis) { + auto kernel_key_set = ParseKernelKeyByInputArgs(input); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + + Backend kernel_backend = kernel_key.backend(); + DataLayout kernel_layout = kernel_key.layout(); + DataType kernel_data_type = kernel_key.dtype(); + + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "unbind", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "unbind API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "unbind API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto dense_input = PrepareData(input, kernel.InputAt(0), {}); + + // Calculate the number of out tensors + auto input_shape = input.dims(); + if (axis < 0) { + axis = input_shape.size() + axis; + } + auto out_num = input_shape[axis]; + + std::vector out; + auto dense_outs = SetKernelOutput(out_num, kernel_backend, &out); + std::vector meta_outs; + meta_outs.reserve(out_num); + std::vector meta_out_ptrs; + meta_out_ptrs.reserve(out_num); + for (int64_t i = 0; i < out_num; ++i) { + meta_outs.push_back(dense_outs[i]); + meta_out_ptrs.push_back(&meta_outs.back()); + } + + phi::UnbindInferMeta(MakeMetaTensor(*dense_input), axis, meta_out_ptrs); + + using kernel_signature = void (*)(const phi::DeviceContext&, + const phi::DenseTensor&, + int, + std::vector&); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, *dense_input, axis, dense_outs); + + return out; +} + +////////////////// Backward(grad) api impls ////////////////////// + +// TODO(chenweihang): the original sum grad op can support higher-level +// differentiation, +// but if we use this impl, it will not support. We need to be able to reuse +// the autograd API here, which is not yet implemented +// TODO(chenweihang): we should support call generated api in custom api impl +std::vector add_n_grad_impl(const std::vector& x, + const Tensor& out_grad) { + auto kernel_key_set = ParseKernelKeyByInputArgs(out_grad); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + + Backend kernel_backend = kernel_key.backend(); + DataLayout kernel_layout = kernel_key.layout(); + DataType kernel_data_type = kernel_key.dtype(); + + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "scale", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "add_n_grad API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "add_n_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto dense_out_grad = PrepareData(out_grad, kernel.InputAt(0), {}); + + size_t out_number = x.size(); + std::vector x_grad; + auto dense_x_grad = SetKernelOutput(out_number, kernel_backend, &x_grad); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const phi::DenseTensor&, + const phi::Scalar&, + float, + bool, + phi::DenseTensor*); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + + for (auto* dense_x_grad_t : dense_x_grad) { + phi::MetaTensor meta_out(dense_x_grad_t); + phi::UnchangedInferMeta(MakeMetaTensor(*dense_out_grad), &meta_out); + (*kernel_fn)( + *dev_ctx, *dense_out_grad, phi::Scalar(1.0), 0.0, true, dense_x_grad_t); + } + + return x_grad; +} + +std::tuple batch_norm_impl( + const Tensor& x, + const Tensor& scale, + const Tensor& bias, + const Tensor& mean, + const Tensor& variance, + float momentum, + float epsilon, + const 
std::string& data_layout, + bool is_test, + bool use_global_stats, + bool trainable_statistics, + bool fuse_with_relu) { + Backend kernel_backend = Backend::UNDEFINED; + DataLayout kernel_layout = DataLayout::UNDEFINED; + DataType kernel_data_type = DataType::UNDEFINED; + + kernel_data_type = ParseDataType(x); + + if (kernel_backend == Backend::UNDEFINED || + kernel_layout == DataLayout::UNDEFINED || + kernel_data_type == DataType::UNDEFINED) { + auto kernel_key_set = ParseKernelKeyByInputArgs(x); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + if (kernel_backend == Backend::UNDEFINED) { + kernel_backend = kernel_key.backend(); + } + if (kernel_layout == DataLayout::UNDEFINED) { + kernel_layout = kernel_key.layout(); + } + if (kernel_data_type == DataType::UNDEFINED) { + kernel_data_type = kernel_key.dtype(); + } + } + + const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "batch_norm", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "batch_norm API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "batch_norm API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto input_x = PrepareData(x, kernel.InputAt(0), {}); + auto input_scale = PrepareData(scale, kernel.InputAt(1), {}); + auto input_bias = PrepareData(bias, kernel.InputAt(2), {}); + auto input_mean = PrepareData(mean, kernel.InputAt(3), {}); + auto input_variance = PrepareData(variance, kernel.InputAt(4), {}); + + std::tuple api_output; + auto kernel_out_0 = SetKernelOutput(kernel_backend, &std::get<0>(api_output)); + std::get<1>(api_output).set_impl(mean.impl()); + std::get<2>(api_output).set_impl(variance.impl()); + auto kernel_out_1 = SetKernelOutput(kernel_backend, &std::get<1>(api_output)); + auto kernel_out_2 = SetKernelOutput(kernel_backend, &std::get<2>(api_output)); + auto kernel_out_3 = SetKernelOutput(kernel_backend, &std::get<3>(api_output)); + auto kernel_out_4 = SetKernelOutput(kernel_backend, &std::get<4>(api_output)); + auto kernel_out_5 = SetKernelOutput(kernel_backend, &std::get<5>(api_output)); + phi::MetaTensor meta_out_0(kernel_out_0); + phi::MetaTensor meta_out_1(kernel_out_1); + phi::MetaTensor meta_out_2(kernel_out_2); + phi::MetaTensor meta_out_3(kernel_out_3); + phi::MetaTensor meta_out_4(kernel_out_4); + phi::MetaTensor meta_out_5(kernel_out_5); + + phi::BatchNormInferMeta(MakeMetaTensor(*input_x), + MakeMetaTensor(*input_scale), + MakeMetaTensor(*input_bias), + MakeMetaTensor(*input_mean), + MakeMetaTensor(*input_variance), + momentum, + epsilon, + data_layout, + is_test, + use_global_stats, + trainable_statistics, + fuse_with_relu, + &meta_out_0, + &meta_out_1, + &meta_out_2, + &meta_out_3, + &meta_out_4, + &meta_out_5); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const phi::DenseTensor&, + const phi::DenseTensor&, + float, + float, + const std::string&, + bool, + bool, + bool, + bool, + phi::DenseTensor*, + phi::DenseTensor*, + phi::DenseTensor*, + phi::DenseTensor*, + phi::DenseTensor*, + phi::DenseTensor*); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + { + (*kernel_fn)(*dev_ctx, + *input_x, + *input_scale, + *input_bias, + *input_mean, + *input_variance, + momentum, + epsilon, + data_layout, + is_test, + use_global_stats, + trainable_statistics, + fuse_with_relu, + kernel_out_0, + kernel_out_1, + kernel_out_2, + kernel_out_3, + 
kernel_out_4, + kernel_out_5); + } + + return api_output; +} + +std::vector concat_grad_impl(const std::vector& x, + const Tensor& out_grad, + const Scalar& axis) { + auto kernel_key_set = ParseKernelKeyByInputArgs(out_grad); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + + Backend kernel_backend = kernel_key.backend(); + DataLayout kernel_layout = kernel_key.layout(); + DataType kernel_data_type = kernel_key.dtype(); + + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "concat_grad", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "concat_grad API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "concat_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + // std::unique_ptr> + auto dense_x = PrepareData(x, kernel.InputAt(0), {}); + auto dense_out_grad = PrepareData(out_grad, kernel.InputAt(1), {}); + + // Calculate the number of out tensors + size_t out_number = x.size(); + std::vector x_grad; + auto dense_x_grad = SetKernelOutput(out_number, kernel_backend, &x_grad); + + std::vector meta_x; + meta_x.reserve(x.size()); + std::vector meta_x_ptrs; + meta_x_ptrs.reserve(x.size()); + for (const auto& t : *dense_x) { + meta_x.push_back(t); + meta_x_ptrs.push_back(&meta_x.back()); + } + + std::vector meta_x_grad; + meta_x_grad.reserve(x.size()); + std::vector meta_x_grad_ptrs; + meta_x_grad_ptrs.reserve(x.size()); + for (size_t i = 0; i < out_number; ++i) { + meta_x_grad.push_back(*dense_x_grad[i]); + meta_x_grad_ptrs.push_back(&meta_x_grad.back()); + } + + phi::UnchangedMultiInferMeta(meta_x_ptrs, meta_x_grad_ptrs); + + std::vector dense_x_ptr; + dense_x_ptr.reserve(x.size()); + for (const auto& t : *dense_x) { + dense_x_ptr.push_back(&t); + } + + using kernel_signature = void (*)(const platform::DeviceContext&, + const std::vector&, + const phi::DenseTensor&, + const phi::Scalar&, + std::vector); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)( + *dev_ctx, dense_x_ptr, *dense_out_grad, phi::Scalar(axis), dense_x_grad); + + return x_grad; +} + +Tensor imag_grad_impl(const Tensor& out_grad) { + phi::KernelKey kernel_key{ParseBackend(out_grad), + out_grad.layout(), + phi::dtype::ToComplex(out_grad.dtype())}; + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "imag_grad", kernel_key); + + VLOG(6) << "imag_grad API kernel key: " << kernel_key; + VLOG(6) << "imag_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_key.backend()); + + auto dense_out_grad = TensorToDenseTensor(out_grad); + + Tensor out; + auto kernel_out = SetKernelOutput(kernel_key.backend(), &out); + phi::MetaTensor meta_out(kernel_out); + phi::RealAndImagGradInferMeta(*dense_out_grad, &meta_out); + + using kernel_signature = void (*)( + const phi::DeviceContext&, const phi::DenseTensor&, phi::DenseTensor*); + + auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, *dense_out_grad, kernel_out); + + return out; +} + +Tensor real_grad_impl(const Tensor& out_grad) { + phi::KernelKey kernel_key{ParseBackend(out_grad), + out_grad.layout(), + phi::dtype::ToComplex(out_grad.dtype())}; + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "real_grad", kernel_key); + + VLOG(6) << "real_grad API kernel key: " << kernel_key; + VLOG(6) << "real_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_key.backend()); + + auto dense_out_grad = 
TensorToDenseTensor(out_grad); + + Tensor out; + auto kernel_out = SetKernelOutput(kernel_key.backend(), &out); + phi::MetaTensor meta_out(kernel_out); + phi::RealAndImagGradInferMeta(*dense_out_grad, &meta_out); + + using kernel_signature = void (*)( + const phi::DeviceContext&, const phi::DenseTensor&, phi::DenseTensor*); + + auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, *dense_out_grad, kernel_out); + + return out; +} + +std::vector stack_grad_impl(const std::vector& x, + const Tensor& out_grad, + int axis) { + auto kernel_key_set = ParseKernelKeyByInputArgs(out_grad); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + + Backend kernel_backend = kernel_key.backend(); + DataLayout kernel_layout = kernel_key.layout(); + DataType kernel_data_type = kernel_key.dtype(); + + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "stack_grad", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "stack_grad API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "stack_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto dense_out_grad = PrepareData(out_grad, kernel.InputAt(0), {}); + + size_t out_number = x.size(); + std::vector x_grad; + auto dense_x_grad = SetKernelOutput(out_number, kernel_backend, &x_grad); + std::vector meta_x_grad; + meta_x_grad.reserve(out_number); + std::vector meta_x_grad_ptrs; + meta_x_grad_ptrs.reserve(out_number); + for (size_t i = 0; i < out_number; ++i) { + meta_x_grad.push_back(dense_x_grad[i]); + meta_x_grad_ptrs.push_back(&meta_x_grad.back()); + } + + phi::StackGradInferMeta( + MakeMetaTensor(*dense_out_grad), axis, meta_x_grad_ptrs); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const phi::DenseTensor&, + int axis, + std::vector); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, *dense_out_grad, axis, dense_x_grad); + + return x_grad; +} + +std::vector meshgrid_impl(const std::vector& inputs) { + Backend kernel_backend = Backend::UNDEFINED; + DataLayout kernel_layout = DataLayout::UNDEFINED; + DataType kernel_data_type = DataType::UNDEFINED; + + if (kernel_backend == Backend::UNDEFINED || + kernel_layout == DataLayout::UNDEFINED || + kernel_data_type == DataType::UNDEFINED) { + auto kernel_key_set = ParseKernelKeyByInputArgs(inputs); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + if (kernel_backend == Backend::UNDEFINED) { + kernel_backend = kernel_key.backend(); + } + if (kernel_layout == DataLayout::UNDEFINED) { + kernel_layout = kernel_key.layout(); + } + if (kernel_data_type == DataType::UNDEFINED) { + kernel_data_type = kernel_key.dtype(); + } + } + + const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "meshgrid", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "meshgrid API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "meshgrid API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto input_inputs_vec = PrepareData(inputs, kernel.InputAt(0), {}); + std::vector input_inputs(input_inputs_vec->size()); + for (size_t i = 0; i < input_inputs.size(); ++i) { + input_inputs[i] = &input_inputs_vec->at(i); + } + + auto x_meta_vec = MakeMetaTensor(input_inputs); + std::vector inputs_metas(x_meta_vec.size()); + for (size_t i = 0; i < x_meta_vec.size(); ++i) { + 
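+    // Collect raw MetaTensor pointers so they can be handed to MeshgridInferMeta below.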
inputs_metas[i] = &x_meta_vec[i]; + } + + // Calculate the number of out tensors + size_t out_number = inputs.size(); + + std::vector out; + auto dense_outs = SetKernelOutput(out_number, kernel_backend, &out); + + std::vector meta_outs; + meta_outs.reserve(out_number); + std::vector meta_out_ptrs; + meta_out_ptrs.reserve(out_number); + for (size_t i = 0; i < out_number; ++i) { + meta_outs.push_back(dense_outs[i]); + meta_out_ptrs.push_back(&meta_outs.back()); + } + phi::MeshgridInferMeta(inputs_metas, meta_out_ptrs); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const std::vector&, + std::vector&); + auto* kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, input_inputs, dense_outs); + + return out; +} + +std::vector meshgrid_grad_impl( + const std::vector& inputs, + const std::vector& outputs_grad) { + Backend kernel_backend = Backend::UNDEFINED; + DataLayout kernel_layout = DataLayout::UNDEFINED; + DataType kernel_data_type = DataType::UNDEFINED; + + if (kernel_backend == Backend::UNDEFINED || + kernel_layout == DataLayout::UNDEFINED || + kernel_data_type == DataType::UNDEFINED) { + auto kernel_key_set = ParseKernelKeyByInputArgs(inputs, outputs_grad); + auto kernel_key = kernel_key_set.GetHighestPriorityKernelKey(); + if (kernel_backend == Backend::UNDEFINED) { + kernel_backend = kernel_key.backend(); + } + if (kernel_layout == DataLayout::UNDEFINED) { + kernel_layout = kernel_key.layout(); + } + if (kernel_data_type == DataType::UNDEFINED) { + kernel_data_type = kernel_key.dtype(); + } + } + + const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "meshgrid_grad", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "meshgrid_grad API kernel key: [" << kernel_backend << ", " + << kernel_layout << ", " << kernel_data_type << "]"; + VLOG(6) << "meshgrid_grad API kernel: " << kernel; + + auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); + + auto input_inputs_vec = PrepareData(inputs, kernel.InputAt(0), {}); + std::vector input_inputs(input_inputs_vec->size()); + for (size_t i = 0; i < input_inputs.size(); ++i) { + input_inputs[i] = &input_inputs_vec->at(i); + } + auto input_outputs_grad_vec = + PrepareData(outputs_grad, kernel.InputAt(1), {}); + std::vector input_outputs_grad( + input_outputs_grad_vec->size()); + for (size_t i = 0; i < input_outputs_grad.size(); ++i) { + input_outputs_grad[i] = &input_outputs_grad_vec->at(i); + } + + size_t out_number = inputs.size(); + std::vector api_output; + auto kernel_out = SetKernelOutput(out_number, kernel_backend, &api_output); + + auto inputs_meta_vec = MakeMetaTensor(input_inputs); + std::vector inputs_metas(inputs_meta_vec.size()); + for (size_t i = 0; i < inputs_meta_vec.size(); ++i) { + inputs_metas[i] = &inputs_meta_vec[i]; + } + + auto outputs_grad_meta_vec = MakeMetaTensor(input_outputs_grad); + std::vector outputs_grad_metas( + outputs_grad_meta_vec.size()); + for (size_t i = 0; i < outputs_grad_meta_vec.size(); ++i) { + outputs_grad_metas[i] = &outputs_grad_meta_vec[i]; + } + + std::vector meta_outs; + meta_outs.reserve(out_number); + std::vector meta_out_ptrs; + meta_out_ptrs.reserve(out_number); + for (size_t i = 0; i < out_number; ++i) { + meta_outs.push_back(kernel_out[i]); + meta_out_ptrs.push_back(&meta_outs.back()); + } + + phi::MeshgridGradInferMeta(inputs_metas, outputs_grad_metas, meta_out_ptrs); + + using kernel_signature = void (*)(const platform::DeviceContext&, + const std::vector&, + const std::vector&, + std::vector&); + auto* 
kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, input_inputs, input_outputs_grad, kernel_out); + + return api_output; +} + } // namespace experimental } // namespace paddle diff --git a/paddle/phi/api/lib/api_custom_impl.h b/paddle/phi/api/lib/api_custom_impl.h index b2f5a074d9288..80ace229316a9 100644 --- a/paddle/phi/api/lib/api_custom_impl.h +++ b/paddle/phi/api/lib/api_custom_impl.h @@ -14,19 +14,104 @@ limitations under the License. */ #pragma once +#include + #include "paddle/phi/api/include/tensor.h" #include "paddle/phi/common/int_array.h" #include "paddle/phi/common/place.h" #include "paddle/phi/common/scalar.h" +#include "paddle/utils/optional.h" namespace paddle { namespace experimental { +// NOTE: Separate forward and backward(grad) api impl +// NOTE: The api_impl in this file are arranged in alphabetic order. + +////////////////// Forward api impls ////////////////////// + +Tensor conv2d_impl(const Tensor& input, + const Tensor& filter, + const std::vector& strides, + const std::vector& paddings, + const std::string& paddding_algorithm, + int groups, + const std::vector& dilations, + const std::string& data_format, + bool use_addto, + int workspace_size_MB, + bool exhaustive_search); + +std::vector> conv2d_grad_impl( + const Tensor& input, + const Tensor& filter, + const Tensor& out_grad, + const std::vector& strides, + const std::vector& paddings, + const std::string& paddding_algorithm, + int groups, + const std::vector& dilations, + const std::string& data_format, + bool use_addto, + int workspace_size_MB, + bool exhaustive_search); + Tensor copy_to_impl(const Tensor& x, Place place, bool blocking); std::vector split_impl(const Tensor& x, const IntArray& num_or_sections, const Scalar& axis); +std::tuple momentum_impl( + const Tensor& param, + const Tensor& grad, + const Tensor& velocity, + const Tensor& learning_rate, + paddle::optional master_param, + float mu, + bool use_nesterov, + const std::string& regularization_method, + float regularization_coeff, + bool multi_precision, + float rescale_grad); + +std::vector unbind_impl(const Tensor& input, int axis); + +////////////////// Backward(grad) api impls ////////////////////// + +std::vector add_n_grad_impl(const std::vector& x, + const Tensor& out_grad); + +std::tuple batch_norm_impl( + const Tensor& x, + const Tensor& scale, + const Tensor& bias, + const Tensor& mean, + const Tensor& variance, + float momentum, + float epsilon, + const std::string& data_layout, + bool is_test, + bool use_global_stats, + bool trainable_statistics, + bool fuse_with_relu); + +/************************ backward api impl ***************************/ + +std::vector concat_grad_impl(const std::vector& x, + const Tensor& out_grad, + const Scalar& axis); + +Tensor imag_grad_impl(const Tensor& x); + +Tensor real_grad_impl(const Tensor& x); + +std::vector stack_grad_impl(const std::vector& x, + const Tensor& out_grad, + int axis); +std::vector meshgrid_impl(const std::vector& inputs); +std::vector meshgrid_grad_impl(const std::vector& inputs, + const std::vector& outputs_grad); + } // namespace experimental } // namespace paddle diff --git a/paddle/phi/api/lib/api_gen_utils.cc b/paddle/phi/api/lib/api_gen_utils.cc index 7cbb4344e81d7..732ecacde94d7 100644 --- a/paddle/phi/api/lib/api_gen_utils.cc +++ b/paddle/phi/api/lib/api_gen_utils.cc @@ -66,14 +66,6 @@ phi::MetaTensor MakeMetaTensor(const phi::DenseTensor& tensor) { return phi::MetaTensor(tensor); } -paddle::optional MakeMetaTensor( - const paddle::optional& tensor) { - if 
(tensor) { - return {phi::MetaTensor(*tensor)}; - } - return {paddle::none}; -} - std::vector MakeMetaTensor( const std::vector& tensors) { std::vector meta_tensors; @@ -88,14 +80,6 @@ phi::MetaTensor MakeMetaTensor(const phi::SelectedRows& tensor) { return phi::MetaTensor(tensor); } -paddle::optional MakeMetaTensor( - const paddle::optional& tensor) { - if (tensor) { - return {phi::MetaTensor(*tensor)}; - } - return {paddle::none}; -} - phi::MetaTensor MakeMetaTensor(const phi::StringTensor& tensor) { return phi::MetaTensor(tensor); } diff --git a/paddle/phi/api/lib/api_gen_utils.h b/paddle/phi/api/lib/api_gen_utils.h index 2a4c8417b5e6d..d7ecef61c5be3 100644 --- a/paddle/phi/api/lib/api_gen_utils.h +++ b/paddle/phi/api/lib/api_gen_utils.h @@ -50,17 +50,11 @@ std::shared_ptr TensorToStringTensor(const Tensor& tensor); phi::MetaTensor MakeMetaTensor(const phi::DenseTensor& tensor); -paddle::optional MakeMetaTensor( - const paddle::optional& tensor); - std::vector MakeMetaTensor( const std::vector& tensors); phi::MetaTensor MakeMetaTensor(const phi::SelectedRows& tensor); -paddle::optional MakeMetaTensor( - const paddle::optional& tensor); - phi::MetaTensor MakeMetaTensor(const phi::StringTensor& tensor); /* ------------------ for output ----------------------- */ diff --git a/paddle/phi/api/lib/data_transform.cc b/paddle/phi/api/lib/data_transform.cc index 4e6ebe33aec8f..90d47977cdf60 100644 --- a/paddle/phi/api/lib/data_transform.cc +++ b/paddle/phi/api/lib/data_transform.cc @@ -37,9 +37,10 @@ inline bool NeedTransformDataType(const DataType& input, inline bool NeedTransformPlace(const paddle::platform::Place& input, const Backend& target, const TransformFlag& transform_flag) { - bool ret = transform_flag.need_trans_backend() && - target != Backend::ALL_BACKEND && - phi::TransToPhiBackend(input) != target; + bool ret = + input.GetType() == AllocationType::GPUPINNED || + (transform_flag.need_trans_backend() && target != Backend::ALL_BACKEND && + phi::TransToPhiBackend(input) != target); return ret; } @@ -180,20 +181,23 @@ std::shared_ptr PrepareData( const phi::TensorArgDef& target_args_def, const TransformFlag& transform_flag) { const auto& tensor_in = input.impl(); - phi::DenseTensor& dense_tensor = - *static_cast(tensor_in.get()); - if (!transform_flag.NeedTransform() || !dense_tensor.initialized() || - (!NeedTransformPlace( - dense_tensor.place(), target_args_def.backend, transform_flag) && - !NeedTransformDataType( - dense_tensor.dtype(), target_args_def.dtype, transform_flag) && - !NeedTransformLayout( - dense_tensor.layout(), target_args_def.layout, transform_flag))) { - return std::static_pointer_cast(tensor_in); + if (tensor_in) { + phi::DenseTensor& dense_tensor = + *static_cast(tensor_in.get()); + if (!transform_flag.NeedTransform() || !dense_tensor.initialized() || + (!NeedTransformPlace( + dense_tensor.place(), target_args_def.backend, transform_flag) && + !NeedTransformDataType( + dense_tensor.dtype(), target_args_def.dtype, transform_flag) && + !NeedTransformLayout( + dense_tensor.layout(), target_args_def.layout, transform_flag))) { + return std::static_pointer_cast(tensor_in); + } + phi::DenseTensor out = + TransformData(dense_tensor, target_args_def, transform_flag); + return std::make_shared(std::move(out)); } - phi::DenseTensor out = - TransformData(dense_tensor, target_args_def, transform_flag); - return std::make_shared(std::move(out)); + return nullptr; } std::shared_ptr PrepareData( diff --git a/paddle/phi/api/lib/kernel_dispatch.cc 
b/paddle/phi/api/lib/kernel_dispatch.cc index 1ca6e2ce0bb9a..6d97dc7657f00 100644 --- a/paddle/phi/api/lib/kernel_dispatch.cc +++ b/paddle/phi/api/lib/kernel_dispatch.cc @@ -14,18 +14,46 @@ limitations under the License. */ #include "paddle/phi/api/lib/kernel_dispatch.h" -#include "paddle/phi/api/include/context_pool.h" -#include "paddle/phi/core/compat/convert_utils.h" #ifdef _MSC_VER #include #endif +#include "paddle/phi/api/include/context_pool.h" +#include "paddle/phi/core/compat/convert_utils.h" +#include "paddle/phi/core/string_tensor_utils.h" +#include "paddle/phi/core/tensor_utils.h" + namespace paddle { namespace experimental { namespace detail { +// We need to judge whether the allocation is nullptr and whether it is +// initialized, so we need the GetHolder method +bool HasAllocation(const phi::TensorBase& t) { + if (phi::DenseTensor::classof(&t)) { + return phi::DenseTensorUtils::GetHolder( + static_cast(t)) != nullptr; + } else if (phi::SelectedRows::classof(&t)) { + return phi::DenseTensorUtils::GetHolder( + static_cast(t).value()) != nullptr; + } else if (phi::SparseCsrTensor::classof(&t)) { + return phi::DenseTensorUtils::GetHolder( + static_cast(t) + .non_zero_elements()) != nullptr; + } else if (phi::SparseCooTensor::classof(&t)) { + return phi::DenseTensorUtils::GetHolder( + static_cast(t) + .non_zero_elements()) != nullptr; + } else if (phi::StringTensor::classof(&t)) { + return phi::StringTensorUtils::GetHolder( + static_cast(t)) != nullptr; + } else { + return false; + } +} + BackendSet GetTensorBackendSet(const phi::TensorBase& t) { - if (t.initialized()) { + if (HasAllocation(t)) { BackendSet backend_set(phi::TransToPhiBackend(t.place())); switch (t.layout()) { case DataLayout::MKLDNN: diff --git a/paddle/phi/api/lib/kernel_dispatch.h b/paddle/phi/api/lib/kernel_dispatch.h index 25b74e7fe31b9..be545ac9ce2f7 100644 --- a/paddle/phi/api/lib/kernel_dispatch.h +++ b/paddle/phi/api/lib/kernel_dispatch.h @@ -25,6 +25,8 @@ limitations under the License. 
*/ #include "paddle/phi/common/data_type.h" #include "paddle/phi/common/layout.h" #include "paddle/phi/core/selected_rows.h" +#include "paddle/phi/core/sparse_coo_tensor.h" +#include "paddle/phi/core/sparse_csr_tensor.h" // TODO(chenweihang): split Key, Kernel, Factory into diff files #include "paddle/phi/core/kernel_factory.h" @@ -40,8 +42,10 @@ std::size_t CountLeadingZeros(uint64_t val); phi::DeviceContext* GetDeviceContextByBackend(phi::Backend backend); enum class KernelType { - DENSE_TENSOR_KENREL, // kernel for DenseTensor - SELECTED_ROWS_KENREL // kernel for SelectedRows + DENSE_TENSOR_KENREL, // kernel for DenseTensor + SELECTED_ROWS_KENREL, // kernel for SelectedRows + SPARSE_COO_KERNEL, // kernel for SparseCooTensor + SPARSE_CSR_KERNEL // kernel for SparseCsrTensor }; // TODO(chenweihang): support DataLayout and DataType selected @@ -130,6 +134,10 @@ struct KernelTypeParser : ArgsIterator { void operator()(const Tensor& x) { if (phi::SelectedRows::classof(x.impl().get())) { kernel_type = KernelType::SELECTED_ROWS_KENREL; + } else if (phi::SparseCooTensor::classof(x.impl().get())) { + kernel_type = KernelType::SPARSE_COO_KERNEL; + } else if (phi::SparseCsrTensor::classof(x.impl().get())) { + kernel_type = KernelType::SPARSE_CSR_KERNEL; } } diff --git a/paddle/phi/api/lib/tensor.cc b/paddle/phi/api/lib/tensor.cc index 5cd1fcb919638..ffc754feaed98 100644 --- a/paddle/phi/api/lib/tensor.cc +++ b/paddle/phi/api/lib/tensor.cc @@ -101,7 +101,11 @@ int64_t Tensor::size() const { return impl_->numel(); } phi::DDim Tensor::dims() const { return impl_->dims(); } std::vector Tensor::shape() const { - return phi::vectorize(impl_->dims()); + auto dims = impl_->dims(); + if (dims.size() == 1 && dims.at(0) == 0) { + return {}; + } + return phi::vectorize(dims); } void Tensor::reshape(const std::vector &shape) { @@ -354,6 +358,11 @@ AbstractAutogradMeta *Tensor::get_autograd_meta() const { return autograd_meta_.get(); } +const std::shared_ptr &Tensor::mutable_autograd_meta() + const { + return autograd_meta_; +} + void Tensor::set_autograd_meta( std::shared_ptr autograd_meta) { autograd_meta_ = std::move(autograd_meta); diff --git a/paddle/phi/api/lib/tensor_method.cc b/paddle/phi/api/lib/tensor_method.cc index 7d9f7a7ae17c8..c4c77ab93790d 100644 --- a/paddle/phi/api/lib/tensor_method.cc +++ b/paddle/phi/api/lib/tensor_method.cc @@ -177,6 +177,40 @@ void Tensor::copy_(const Tensor &src, target_place, blocking, static_cast(impl_.get())); + } else if (kernel_type == KernelType::SPARSE_COO_KERNEL) { + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "copy_sparse_coo", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "copy API kernel key: " << kernel_key; + VLOG(6) << "copy API kernel: " << kernel; + using kernel_signature = void (*)(const platform::DeviceContext &, + const phi::SparseCooTensor &, + phi::Place, + bool, + phi::SparseCooTensor *); + this->set_impl(std::make_shared()); + auto *kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, + (*(std::static_pointer_cast(src.impl_))), + target_place, + blocking, + static_cast(impl_.get())); + } else if (kernel_type == KernelType::SPARSE_CSR_KERNEL) { + auto kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( + "copy_sparse_csr", {kernel_backend, kernel_layout, kernel_data_type}); + VLOG(6) << "copy API kernel key: " << kernel_key; + VLOG(6) << "copy API kernel: " << kernel; + using kernel_signature = void (*)(const platform::DeviceContext &, + const phi::SparseCsrTensor &, + 
phi::Place, + bool, + phi::SparseCsrTensor *); + this->set_impl(std::make_shared()); + auto *kernel_fn = kernel.GetVariadicKernelFn(); + (*kernel_fn)(*dev_ctx, + (*(std::static_pointer_cast(src.impl_))), + target_place, + blocking, + static_cast(impl_.get())); } else { PADDLE_THROW(phi::errors::InvalidArgument( "We currently only support dense tensor copy for now and if u need to " diff --git a/paddle/phi/backends/gpu/gpu_launch_config.h b/paddle/phi/backends/gpu/gpu_launch_config.h index 41bc6bb47c160..ea54083e8179b 100644 --- a/paddle/phi/backends/gpu/gpu_launch_config.h +++ b/paddle/phi/backends/gpu/gpu_launch_config.h @@ -179,6 +179,43 @@ inline GpuLaunchConfig GetGpuLaunchConfig2D(const phi::GPUContext& context, return config; } +static inline int GetLastPow2(int n) { + n |= (n >> 1); + n |= (n >> 2); + n |= (n >> 4); + n |= (n >> 8); + n |= (n >> 16); + return std::max(1, n - (n >> 1)); +} + +inline GpuLaunchConfig GetGpuLaunchConfig3D(const phi::GPUContext& context, + int num_img, + int height, + int width) { + const int kThreadsPerBlock = 256; + int max_threads_per_block = context.GetMaxThreadsPerBlock(); // 1024 + int max_threads = std::min(kThreadsPerBlock, max_threads_per_block); + + int block_x = std::min(GetLastPow2(width), max_threads); + int block_y = std::min(GetLastPow2(height), max_threads / block_x); + int block_z = std::min(num_img, max_threads / block_x / block_y); + + auto max_grid_dim = context.GetCUDAMaxGridDimSize(); + int grid_x = + std::min(max_grid_dim[0], backends::gpu::DivUp(width, block_x)); + int grid_y = + std::min(max_grid_dim[1], backends::gpu::DivUp(height, block_y)); + int grid_z = std::min(max_grid_dim[2], + backends::gpu::DivUp(num_img, block_z * 4)); + + const int capability = context.GetComputeCapability(); + GpuLaunchConfig config; + config.compute_capability = capability; + config.thread_per_block = dim3(block_x, block_y, block_z); + config.block_per_grid = dim3(grid_x, grid_y, grid_z); + return config; +} + } // namespace gpu } // namespace backends } // namespace phi diff --git a/paddle/phi/core/compat/convert_utils.cc b/paddle/phi/core/compat/convert_utils.cc index 667cee10675d8..c08dfa64c7f1b 100644 --- a/paddle/phi/core/compat/convert_utils.cc +++ b/paddle/phi/core/compat/convert_utils.cc @@ -18,6 +18,7 @@ limitations under the License. 
*/ #include "paddle/phi/backends/xpu/xpu_info.h" #include "paddle/phi/common/place.h" #include "paddle/phi/core/compat/op_utils.h" +#include "paddle/phi/core/enforce.h" #ifdef PADDLE_WITH_CUSTOM_DEVICE #include "paddle/phi/backends/device_manager.h" @@ -31,14 +32,19 @@ Backend TransToPhiBackend(const phi::Place& place) { return Backend::CPU; } else if (allocation_type == phi::AllocationType::GPU) { return Backend::GPU; + } else if (allocation_type == phi::AllocationType::GPUPINNED) { + return Backend::GPU; } else if (allocation_type == phi::AllocationType::XPU) { return Backend::XPU; + } else if (allocation_type == phi::AllocationType::NPU) { + return Backend::NPU; } else if (allocation_type == phi::AllocationType::CUSTOM) { return static_cast( static_cast(Backend::NUM_BACKENDS) + GetOrRegisterGlobalDeviceTypeId(place.GetDeviceType())); } else { - return Backend::UNDEFINED; + PADDLE_THROW(phi::errors::InvalidArgument( + "Unsupported transform %s to phi Backend.", place)); } } diff --git a/paddle/phi/core/device_context.cc b/paddle/phi/core/device_context.cc index 6b486196a4b8a..0f5f22b5bd1f4 100644 --- a/paddle/phi/core/device_context.cc +++ b/paddle/phi/core/device_context.cc @@ -49,6 +49,14 @@ struct DeviceContext::Impl { zero_allocator_ = allocator; } + void SetPinnedAllocator(const Allocator* allocator) { + PADDLE_ENFORCE_NOT_NULL( + allocator, + phi::errors::InvalidArgument( + "Required allocator shall not be nullptr, but received nullptr.")); + pinned_allocator_ = allocator; + } + const Allocator& GetAllocator() const { PADDLE_ENFORCE_NOT_NULL( device_allocator_, @@ -68,15 +76,24 @@ struct DeviceContext::Impl { const Allocator& GetZeroAllocator() const { PADDLE_ENFORCE_NOT_NULL( zero_allocator_, - phi::errors::InvalidArgument("Required host_allocator_ shall not be " + phi::errors::InvalidArgument("Required zero_allocator_ shall not be " "nullptr, but received nullptr.")); return *zero_allocator_; } + const Allocator& GetPinnedAllocator() const { + PADDLE_ENFORCE_NOT_NULL( + pinned_allocator_, + phi::errors::InvalidArgument("Required pinned_allocator_ shall not be " + "nullptr, but received nullptr.")); + return *pinned_allocator_; + } + void* Alloc(TensorBase* tensor, const Place& place, DataType dtype = DataType::UNDEFINED, - size_t requested_size = 0) const { + size_t requested_size = 0, + bool pinned = false) const { PADDLE_ENFORCE_NOT_NULL( tensor, phi::errors::InvalidArgument( @@ -90,8 +107,9 @@ struct DeviceContext::Impl { if (tensor->initialized() && tensor->place() != place) { ClearHolder(tensor); } - auto* allocator = - tensor->numel() == 0 ? zero_allocator_ : device_allocator_; + auto* allocator = tensor->numel() == 0 + ? zero_allocator_ + : (pinned ? 
pinned_allocator_ : device_allocator_); return tensor->AllocateFrom( const_cast(allocator), dtype, requested_size); } @@ -99,9 +117,10 @@ struct DeviceContext::Impl { template T* Alloc(TensorBase* tensor, const Place& place, - size_t requested_size = 0) const { + size_t requested_size = 0, + bool pinned = false) const { DataType dtype = paddle::experimental::CppTypeToDataType::Type(); - return static_cast(Alloc(tensor, place, dtype, requested_size)); + return static_cast(Alloc(tensor, place, dtype, requested_size, pinned)); } void* HostAlloc(TensorBase* tensor, @@ -179,6 +198,7 @@ struct DeviceContext::Impl { const Allocator* device_allocator_{nullptr}; const Allocator* host_allocator_{nullptr}; const Allocator* zero_allocator_{nullptr}; + const Allocator* pinned_allocator_{nullptr}; Generator* device_generator_{nullptr}; Generator* host_generator_{nullptr}; }; @@ -189,6 +209,7 @@ DeviceContext::DeviceContext(const DeviceContext& other) { impl_->SetHostAllocator(&other.GetHostAllocator()); impl_->SetAllocator(&other.GetAllocator()); impl_->SetZeroAllocator(&other.GetZeroAllocator()); + impl_->SetPinnedAllocator(&other.GetPinnedAllocator()); impl_->SetHostGenerator(other.GetHostGenerator()); impl_->SetGenerator(other.GetGenerator()); } @@ -225,15 +246,25 @@ const Allocator& DeviceContext::GetZeroAllocator() const { return impl_->GetZeroAllocator(); } +void DeviceContext::SetPinnedAllocator(const Allocator* allocator) { + impl_->SetPinnedAllocator(allocator); +} +const Allocator& DeviceContext::GetPinnedAllocator() const { + return impl_->GetPinnedAllocator(); +} + void* DeviceContext::Alloc(TensorBase* tensor, DataType dtype, - size_t requested_size) const { - return impl_->Alloc(tensor, GetPlace(), dtype, requested_size); + size_t requested_size, + bool pinned) const { + return impl_->Alloc(tensor, GetPlace(), dtype, requested_size, pinned); } template -T* DeviceContext::Alloc(TensorBase* tensor, size_t requested_size) const { - return impl_->Alloc(tensor, GetPlace(), requested_size); +T* DeviceContext::Alloc(TensorBase* tensor, + size_t requested_size, + bool pinned) const { + return impl_->Alloc(tensor, GetPlace(), requested_size, pinned); } void* DeviceContext::HostAlloc(TensorBase* tensor, @@ -248,8 +279,8 @@ T* DeviceContext::HostAlloc(TensorBase* tensor, size_t requested_size) const { } #define DEVICE_CONTEXT_MEMBER_FUNC_INSTANTIATION(dtype) \ - template dtype* DeviceContext::Alloc(TensorBase* tensor, \ - size_t requested_size) const; \ + template dtype* DeviceContext::Alloc( \ + TensorBase* tensor, size_t requested_size, bool pinned) const; \ template dtype* DeviceContext::HostAlloc(TensorBase* tensor, \ size_t requested_size) const; diff --git a/paddle/phi/core/device_context.h b/paddle/phi/core/device_context.h index 689f4e4e66d15..106d5ff7ddf98 100644 --- a/paddle/phi/core/device_context.h +++ b/paddle/phi/core/device_context.h @@ -80,6 +80,13 @@ class DeviceContext { */ void SetZeroAllocator(const Allocator*); + /** + * @brief Set the zero-size Allocator object. + * + * @param allocator + */ + void SetPinnedAllocator(const Allocator*); + /** * @brief Get the const Allocator object. * @@ -96,13 +103,20 @@ class DeviceContext { const Allocator& GetZeroAllocator() const; + const Allocator& GetPinnedAllocator() const; + /** * @brief Allocate device memory for tensor. 
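+   * @param pinned If true, allocate from the allocator registered via SetPinnedAllocator() instead of the device allocator.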
*/ - void* Alloc(TensorBase*, DataType dtype, size_t requested_size = 0) const; + void* Alloc(TensorBase*, + DataType dtype, + size_t requested_size = 0, + bool pinned = false) const; template - T* Alloc(TensorBase* tensor, size_t requested_size = 0) const; + T* Alloc(TensorBase* tensor, + size_t requested_size = 0, + bool pinned = false) const; /** * @brief Allocate host memory for tensor. diff --git a/paddle/phi/core/infermeta_utils.cc b/paddle/phi/core/infermeta_utils.cc index 671ba2ec7dc25..0496d727e8d3b 100644 --- a/paddle/phi/core/infermeta_utils.cc +++ b/paddle/phi/core/infermeta_utils.cc @@ -87,6 +87,23 @@ std::vector InferMetaContext::InputsBetween(size_t start, return result; } +paddle::optional> +InferMetaContext::OptionalInputsBetween(size_t start, size_t end) const { + const auto& first = inputs_.at(start); + + if (first) { + std::vector result; + result.reserve(end - start); + + for (size_t i = start; i < end; ++i) { + result.push_back(inputs_.at(i).get()); + } + + return paddle::optional>(result); + } + return paddle::optional>(paddle::none); +} + MetaTensor* InferMetaContext::MutableOutputAt(size_t idx) { return outputs_.at(idx).get(); } diff --git a/paddle/phi/core/infermeta_utils.h b/paddle/phi/core/infermeta_utils.h index 0278e444e2de2..fad437f82c331 100644 --- a/paddle/phi/core/infermeta_utils.h +++ b/paddle/phi/core/infermeta_utils.h @@ -54,6 +54,8 @@ class InferMetaContext { const MetaTensor& InputAt(size_t idx) const; paddle::optional OptionalInputAt(size_t idx) const; std::vector InputsBetween(size_t start, size_t end) const; + paddle::optional> + OptionalInputsBetween(size_t start, size_t end) const; MetaTensor* MutableOutputAt(size_t idx); std::vector MutableOutputBetween(size_t start, size_t end); @@ -174,6 +176,26 @@ struct InferMetaFnImpl { } }; + template + struct InferMetaFnCallHelper< + paddle::optional>, + Tail...> { + template + static void Call(InferMetaContext* ctx, PreviousArgs&... 
pargs) { + static_assert(attr_idx == 0, + "InferMeta's Input should appear before Attributes."); + static_assert(out_idx == 0, + "InferMeta's Input should appear before Outputs."); + const std::pair range = ctx->InputRangeAt(in_idx); + paddle::optional> arg = + ctx->OptionalInputsBetween(range.first, range.second); + InferMetaFnCallHelper< + Tail...>::template Call(ctx, + pargs..., + arg); + } + }; + // TODO(chenweihang): support other attr type later PD_SPECIALIZE_InferMetaFnCallHelper_FOR_ATTRIBUTE(bool); PD_SPECIALIZE_InferMetaFnCallHelper_FOR_ATTRIBUTE(int); diff --git a/paddle/phi/core/kernel_context.h b/paddle/phi/core/kernel_context.h index d3ca1ffc61c42..ab4e044e62537 100644 --- a/paddle/phi/core/kernel_context.h +++ b/paddle/phi/core/kernel_context.h @@ -97,6 +97,22 @@ class KernelContext { return v; } + template + paddle::optional> OptionalInputsBetween( + size_t start, size_t end) { + const auto& first = inputs_.at(start); + + if (first) { + std::vector v; + for (size_t i = start; i < end; ++i) { + auto* t = static_cast(inputs_.at(i)); + v.emplace_back(t); + } + return paddle::optional>(v); + } + return paddle::optional>(paddle::none); + } + template TensorType* MutableOutputAt(size_t idx) { return static_cast(outputs_.at(idx)); diff --git a/paddle/phi/core/kernel_factory.cc b/paddle/phi/core/kernel_factory.cc index ba41e082ab912..a1ce90c2c78ae 100644 --- a/paddle/phi/core/kernel_factory.cc +++ b/paddle/phi/core/kernel_factory.cc @@ -59,14 +59,47 @@ KernelKeyMap KernelFactory::SelectKernelMap( return iter->second; } +bool KernelFactory::IsSelectKernelValid(const std::string& kernel_name, + const KernelKey& kernel_key) const { + auto iter = kernels_.find(kernel_name); + PADDLE_ENFORCE_NE( + iter, + kernels_.end(), + phi::errors::NotFound("The kernel `%s` is not registered.", kernel_name)); + + auto kernel_iter = iter->second.find(kernel_key); + if (kernel_iter == iter->second.end()) { + return false; + } + return true; +} + const Kernel& KernelFactory::SelectKernelOrThrowError( - const std::string& kernel_name, const KernelKey& kernel_key) const { + const std::string& kernel_name, + const KernelKey& kernel_key, + bool use_cudnn) const { auto iter = kernels_.find(kernel_name); PADDLE_ENFORCE_NE( iter, kernels_.end(), phi::errors::NotFound("The kernel `%s` is not registered.", kernel_name)); +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) + if (use_cudnn && kernel_key.backend() == Backend::GPU) { + auto kernel_iter = iter->second.find( + {Backend::GPUDNN, kernel_key.layout(), kernel_key.dtype()}); + if (kernel_iter == iter->second.end() && + kernel_key.layout() != phi::DataLayout::ALL_LAYOUT) { + kernel_iter = iter->second.find( + {Backend::GPUDNN, DataLayout::ALL_LAYOUT, kernel_key.dtype()}); + } + if (kernel_iter != iter->second.end()) { + return kernel_iter->second; + } + LOG(WARNING) << "The cudnn kernel for [" << kernel_name + << "] is not registered."; + } +#endif auto kernel_iter = iter->second.find(kernel_key); // TODO(chenweihang): polish refind impl here if (kernel_iter == iter->second.end() && diff --git a/paddle/phi/core/kernel_factory.h b/paddle/phi/core/kernel_factory.h index e502b9cb3e025..8fd25b691bdeb 100644 --- a/paddle/phi/core/kernel_factory.h +++ b/paddle/phi/core/kernel_factory.h @@ -238,13 +238,17 @@ class KernelFactory { } const Kernel& SelectKernelOrThrowError(const std::string& kernel_name, - const KernelKey& kernel_key) const; + const KernelKey& kernel_key, + bool use_cudnn = false) const; const Kernel& SelectKernelOrThrowError(const 
std::string& kernel_name, Backend backend, DataLayout layout, DataType dtype) const; + bool IsSelectKernelValid(const std::string& kernel_name, + const KernelKey& kernel_key) const; + Kernel SelectKernel(const std::string& kernel_name, const KernelKey& kernel_key) const; diff --git a/paddle/phi/core/kernel_registry.h b/paddle/phi/core/kernel_registry.h index fac4b1e82792f..b18fd9e05f92f 100644 --- a/paddle/phi/core/kernel_registry.h +++ b/paddle/phi/core/kernel_registry.h @@ -81,6 +81,13 @@ struct KernelArgsParseFunctor { default_tensor_layout, default_key.dtype(), arg_type); + } else if (arg_type == std::type_index(typeid( + paddle::optional< + const std::vector>))) { + args_def->AppendInput(default_key.backend(), + default_tensor_layout, + default_key.dtype(), + arg_type); } else if (arg_type == std::type_index(typeid( paddle::optional))) { args_def->AppendInput(default_key.backend(), diff --git a/paddle/phi/core/kernel_utils.h b/paddle/phi/core/kernel_utils.h index 5317288a2aa83..55574ea03ab4a 100644 --- a/paddle/phi/core/kernel_utils.h +++ b/paddle/phi/core/kernel_utils.h @@ -126,6 +126,30 @@ namespace phi { } \ } +#define PD_SPECIALIZE_KernelCallHelper_FOR_OPTIONAL_MULTI_INPUT(tensor_type) \ + template \ + struct KernelCallHelper< \ + paddle::optional>, \ + Tail...> { \ + template \ + static void Compute(KernelContext* ctx, PreviousArgs&... pargs) { \ + static_assert(attr_idx == 0, \ + "Kernel's Input should appear before Attributes."); \ + static_assert(out_idx == 0, \ + "Kernel's Input should appear before Outputs."); \ + const std::pair range = ctx->InputRangeAt(in_idx); \ + paddle::optional> arg = \ + ctx->OptionalInputsBetween(range.first, range.second); \ + KernelCallHelper:: \ + template Compute( \ + ctx, pargs..., arg); \ + } \ + } + #define PD_SPECIALIZE_KernelCallHelper_FOR_ATTRIBUTE(attr_type) \ template \ struct KernelCallHelper { \ @@ -224,6 +248,7 @@ struct KernelImpl { PD_SPECIALIZE_KernelCallHelper_FOR_OPTIONAL_INPUT(SelectedRows); PD_SPECIALIZE_KernelCallHelper_FOR_MULTI_INPUT(DenseTensor); PD_SPECIALIZE_KernelCallHelper_FOR_INPUT(SelectedRows); + PD_SPECIALIZE_KernelCallHelper_FOR_OPTIONAL_MULTI_INPUT(DenseTensor); PD_SPECIALIZE_KernelCallHelper_FOR_INPUT(SparseCooTensor); PD_SPECIALIZE_KernelCallHelper_FOR_OPTIONAL_INPUT(SparseCooTensor); diff --git a/paddle/phi/core/sparse_coo_tensor.cc b/paddle/phi/core/sparse_coo_tensor.cc index ceaebe4e35b71..7d4261ef82972 100644 --- a/paddle/phi/core/sparse_coo_tensor.cc +++ b/paddle/phi/core/sparse_coo_tensor.cc @@ -16,6 +16,11 @@ limitations under the License. */ namespace phi { +SparseCooTensor::SparseCooTensor() { + DenseTensor non_zero_indices, non_zero_elements; + this->SetMember(non_zero_indices, non_zero_elements, {1}, true); +} + SparseCooTensor::SparseCooTensor(const DenseTensor& non_zero_indices, const DenseTensor& non_zero_elements, const DDim& dims) diff --git a/paddle/phi/core/sparse_coo_tensor.h b/paddle/phi/core/sparse_coo_tensor.h index ca3290f33e61e..ec43c5d62179b 100644 --- a/paddle/phi/core/sparse_coo_tensor.h +++ b/paddle/phi/core/sparse_coo_tensor.h @@ -30,6 +30,7 @@ namespace phi { class SparseCooTensor : public TensorBase, public TypeInfoTraits { public: + SparseCooTensor(); /// \brief Create the sparse coo tensor /// \param non_zero_indices The indices of non zero elements in original dense /// tensor. 
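Note on the SparseCooTensor() default constructor added in the hunk above: it builds an empty tensor (empty indices/elements, dims {1}) so that callers can create a placeholder impl for a kernel to fill in later; the new copy_sparse_coo branch of Tensor::copy_ earlier in this patch relies on it. A minimal illustrative sketch, not part of the patch (dst is assumed to be a paddle::experimental::Tensor):

    // Default-construct an empty COO impl, then let the copy kernel populate it.
    dst.set_impl(std::make_shared<phi::SparseCooTensor>());
    // The "copy_sparse_coo" kernel then writes indices, elements and dims into
    // static_cast<phi::SparseCooTensor*>(dst.impl().get()).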
@@ -145,6 +146,8 @@ class SparseCooTensor : public TensorBase, void* AllocateFrom(Allocator* allocator, DataType dtype, size_t requested_size = 0) override; + + /// \brief set the dims of original dense tensor void set_dims(const DDim& dims) { this->dims_ = dims; } private: diff --git a/paddle/phi/core/sparse_csr_tensor.cc b/paddle/phi/core/sparse_csr_tensor.cc index cbf5f941b665d..ab9717a564eb5 100644 --- a/paddle/phi/core/sparse_csr_tensor.cc +++ b/paddle/phi/core/sparse_csr_tensor.cc @@ -16,6 +16,14 @@ limitations under the License. */ namespace phi { +SparseCsrTensor::SparseCsrTensor() { + DenseTensor crows, cols, values; + this->non_zero_crows_ = crows; + this->non_zero_cols_ = cols; + this->non_zero_elements_ = values; + this->dims_ = phi::make_ddim({1, 1}); +} + inline void check_shape(const DDim& dims) { bool valid = dims.size() == 2 || dims.size() == 3; diff --git a/paddle/phi/core/sparse_csr_tensor.h b/paddle/phi/core/sparse_csr_tensor.h index 8a9de7a841422..7e14cad242d12 100644 --- a/paddle/phi/core/sparse_csr_tensor.h +++ b/paddle/phi/core/sparse_csr_tensor.h @@ -33,6 +33,7 @@ class CompatibleDenseTensorUtils; class SparseCsrTensor : public TensorBase, public TypeInfoTraits { public: + SparseCsrTensor(); /// \brief Because sparse csr tensor is a resource handle, we provide a /// default /// move constructor to support move semantics. @@ -143,6 +144,9 @@ class SparseCsrTensor : public TensorBase, /// return a mutable pointer of non_zero_elements. DenseTensor* mutable_non_zero_elements() { return &non_zero_elements_; } + /// \brief set the dims of original dense tensor + void set_dims(const DDim& dims) { this->dims_ = dims; } + private: // save the compressed rows information of non zero elements DenseTensor non_zero_crows_; diff --git a/paddle/phi/core/string_tensor_utils.h b/paddle/phi/core/string_tensor_utils.h index c1b0d09647d91..777a24c9adfe1 100644 --- a/paddle/phi/core/string_tensor_utils.h +++ b/paddle/phi/core/string_tensor_utils.h @@ -23,6 +23,11 @@ class StringTensorUtils { static StringTensorMeta* GetMutableMeta(StringTensor* tensor) { return &(tensor->meta_); } + + static const std::shared_ptr& GetHolder( + const StringTensor& tensor) { + return tensor.holder_; + } }; } // namespace phi diff --git a/paddle/phi/core/tensor_utils.h b/paddle/phi/core/tensor_utils.h index 676a590ecbce2..abf8aeff4d3ab 100644 --- a/paddle/phi/core/tensor_utils.h +++ b/paddle/phi/core/tensor_utils.h @@ -25,6 +25,11 @@ class DenseTensorUtils { return &(tensor->meta_); } + static const std::shared_ptr& GetHolder( + const DenseTensor& tensor) { + return tensor.holder_; + } + static DenseTensor Slice(const DenseTensor& tensor, int64_t begin_idx, int64_t end_idx) { diff --git a/paddle/phi/infermeta/backward.cc b/paddle/phi/infermeta/backward.cc index 5d9ed8e9e8c87..49e416fd0152d 100644 --- a/paddle/phi/infermeta/backward.cc +++ b/paddle/phi/infermeta/backward.cc @@ -14,6 +14,9 @@ limitations under the License. 
*/ #include "paddle/phi/infermeta/backward.h" +#include "paddle/phi/common/type_traits.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" + namespace phi { void BilinearTensorProductGradInferMeta(const MetaTensor& x, @@ -103,6 +106,69 @@ void Conv2dTransposeDoubleGradInferMeta(const MetaTensor& x, } } +void CrossEntropyWithSoftmaxGradInferMeta(const MetaTensor& label, + const MetaTensor& softmax, + const MetaTensor& loss_grad, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + MetaTensor* logits_grad, + MetaConfig config) { + auto softmax_dims = softmax.dims(); + auto labels_dims = label.dims(); + auto softmax_rank = softmax_dims.size(); + PADDLE_ENFORCE_GE(axis, + -softmax_rank, + phi::errors::InvalidArgument( + "Attr(axis) value should be in range [-R, R-1], " + "R is the rank of Input(Logits).")); + PADDLE_ENFORCE_LT(axis, + softmax_rank, + phi::errors::InvalidArgument( + "Attr(axis) value should be in range [-R, R-1], " + "R is the rank of Input(Logits).")); + + axis = phi::funcs::CanonicalAxis(axis, softmax_rank); + for (int i = 0; i < softmax_rank; i++) { + if (i != axis) { + if (config.is_runtime || (softmax_dims[i] > 0 && labels_dims[i] > 0)) { + PADDLE_ENFORCE_EQ( + softmax_dims[i], + labels_dims[i], + phi::errors::InvalidArgument( + "Input(Logits) and Input(Label) should in same shape in " + "dimensions except axis.")); + } + } + } + + if (soft_label) { + if (config.is_runtime || + (softmax_dims[axis] > 0 && labels_dims[axis] > 0)) { + PADDLE_ENFORCE_EQ(softmax_dims[axis], + labels_dims[axis], + phi::errors::InvalidArgument( + "If Attr(soft_label) == true, " + "the axis dimension of " + "Input(X) and Input(Label) should be equal.")); + } + } else { + if (config.is_runtime || labels_dims[axis] > 0) { + PADDLE_ENFORCE_EQ( + labels_dims[axis], + 1UL, + phi::errors::InvalidArgument("If Attr(soft_label) == false, " + "the axis dimension of " + "Input(Label) should be 1.")); + } + } + + logits_grad->set_dims(softmax.dims()); + logits_grad->set_dtype(softmax.dtype()); +} + void GatherNdGradInferMeta(const MetaTensor& x, const MetaTensor& index, const MetaTensor& out_grad, @@ -141,6 +207,54 @@ void GeneralTernaryGradInferMeta(const MetaTensor& x, dz->share_meta(z); } } +void GeneralQuaternaryGradInferMeta(const MetaTensor& x, + const MetaTensor& y, + const MetaTensor& z, + const MetaTensor& k, + MetaTensor* dx, + MetaTensor* dy, + MetaTensor* dz, + MetaTensor* dk) { + if (dx) { + dx->share_meta(x); + } + if (dy) { + dy->share_meta(y); + } + if (dz) { + dz->share_meta(z); + } + if (dk) { + dk->share_meta(k); + } +} + +void GeneralQuinaryGradInferMeta(const MetaTensor& x, + const MetaTensor& y, + const MetaTensor& z, + const MetaTensor& k, + const MetaTensor& l, + MetaTensor* dx, + MetaTensor* dy, + MetaTensor* dz, + MetaTensor* dk, + MetaTensor* dl) { + if (dx) { + dx->share_meta(x); + } + if (dy) { + dy->share_meta(y); + } + if (dz) { + dz->share_meta(z); + } + if (dk) { + dk->share_meta(k); + } + if (dl) { + dl->share_meta(l); + } +} void GeneralUnaryGradInferMeta(const MetaTensor& x, MetaTensor* dx) { if (dx) { @@ -180,6 +294,86 @@ void MaxPoolWithIndexGradInferMeta(const MetaTensor& x, dx->share_meta(x); } +void MeshgridGradInferMeta(const std::vector& inputs, + const std::vector& outputs_grad, + std::vector inputs_grad) { + PADDLE_ENFORCE_GT(outputs_grad.size(), + 1, + errors::InvalidArgument( + "Number of Inputs(Out@Grad) should be larger than 1." 
+ "But received Inputs(Out@Grad)' size = %d .", + outputs_grad.size())); + for (size_t i = 0; i < inputs.size(); i++) { + inputs_grad[i]->share_meta(*inputs[i]); + } +} + +void NllLossGradInferMeta(const MetaTensor& x, + const MetaTensor& label, + paddle::optional weight, + const MetaTensor& total_weight, + const MetaTensor& out_grad, + int64_t ignore_index, + const std::string& reduction, + MetaTensor* dx, + MetaConfig config) { + const auto& x_dims = x.dims(); + const auto& label_dims = label.dims(); + const auto& dout_dims = out_grad.dims(); + bool contain_unknown_dim = + phi::contain_unknown_dim(x_dims) || phi::contain_unknown_dim(dout_dims); + bool check = config.is_runtime || !contain_unknown_dim; + + if (check) { + auto batch_size = x_dims[0]; + if (x_dims.size() == 2) { + PADDLE_ENFORCE_EQ(dout_dims.size(), + 1, + phi::errors::InvalidArgument( + "The dimensions of Input(Out@Grad) must be 1")); + if (reduction == "none") { + PADDLE_ENFORCE_EQ( + dout_dims[0], + batch_size, + phi::errors::InvalidArgument( + "The unreduced size ofInput(Out@Grad) must be the " + "same as batch_size.")); + } else { + PADDLE_ENFORCE_EQ(dout_dims[0], + 1, + phi::errors::InvalidArgument( + "The reduced size of Input(Out@Grad) must be 1")); + } + } else if (x_dims.size() == 4) { + if (reduction == "none") { + PADDLE_ENFORCE_EQ( + dout_dims.size(), + 3, + phi::errors::InvalidArgument( + "The dimensions of Input(Out@Grad) must be 3,But got [%s].", + dout_dims.size())); + PADDLE_ENFORCE_EQ(dout_dims[0] == label_dims[0] && + dout_dims[1] == label_dims[1] && + dout_dims[2] == label_dims[2], + true, + phi::errors::InvalidArgument( + "The dimensions of Input(Out@Grad) must be match " + "to Input(Label) dimensions.")); + } else { + PADDLE_ENFORCE_EQ(dout_dims[0], + 1, + phi::errors::InvalidArgument( + "The reduced size of Input(Out@Grad) must be 1")); + } + } + } + + if (dx) { + dx->set_dims(x_dims); + dx->set_dtype(x.dtype()); + } +} + void PoolGradInferMeta(const MetaTensor& x, const MetaTensor& out, const MetaTensor& dout, @@ -209,6 +403,20 @@ void PsroiPoolGradInferMeta(const MetaTensor& x, dx->share_meta(x); } +void RealAndImagGradInferMeta(const MetaTensor& out_grad, MetaTensor* dx) { + dx->set_dims(out_grad.dims()); + dx->set_dtype(dtype::ToComplex(out_grad.dtype())); + dx->set_layout(out_grad.layout()); +} + +void ReshapeDoubleGradInferMeta(const MetaTensor& out_grad, + const MetaTensor& x_grad_grad, + MetaTensor* out_grad_grad) { + if (out_grad_grad != nullptr) { + out_grad_grad->share_dims(out_grad); + } +} + void ScatterGradInferMeta(const MetaTensor& index, const MetaTensor& updates, const MetaTensor& out_grad, @@ -244,4 +452,45 @@ void ScatterNdAddGradInferMeta(const MetaTensor& index, } } +void StackGradInferMeta(const MetaTensor& out_grad, + int axis, + std::vector x_grad) { + auto dy_dim = out_grad.dims(); + int rank = dy_dim.size(); + PADDLE_ENFORCE_GE( + axis, + -rank, + phi::errors::InvalidArgument( + "Attr(axis) must be inside [-rank, rank), where rank = %d, " + "but received axis is:%d.", + rank, + axis)); + PADDLE_ENFORCE_LT( + axis, + rank, + phi::errors::InvalidArgument( + "Attr(axis) must be inside [-rank, rank), where rank = %d, " + "but received axis is:%d.", + rank, + axis)); + + if (axis < 0) axis += rank; + PADDLE_ENFORCE_LE( + x_grad.size(), + static_cast(dy_dim[axis]), + phi::errors::InvalidArgument( + "Number of Outputs(X@Grad) should be less than or equal to dy dim " + "at axis, but received outputs size is:%d, dy dims is:%d.", + x_grad.size(), + static_cast(dy_dim[axis]))); + + 
auto vec = phi::vectorize(dy_dim); + vec.erase(vec.begin() + axis); + + for (size_t i = 0; i < x_grad.size(); ++i) { + x_grad[i]->set_dims(phi::make_ddim(vec)); + x_grad[i]->set_dtype(out_grad.dtype()); + } +} + } // namespace phi diff --git a/paddle/phi/infermeta/backward.h b/paddle/phi/infermeta/backward.h index 10b3e7cec7d2e..eff3731bf2253 100644 --- a/paddle/phi/infermeta/backward.h +++ b/paddle/phi/infermeta/backward.h @@ -68,6 +68,17 @@ void Conv2dTransposeDoubleGradInferMeta(const MetaTensor& x, MetaTensor* dfilter, MetaTensor* ddout); +void CrossEntropyWithSoftmaxGradInferMeta(const MetaTensor& label, + const MetaTensor& softmax, + const MetaTensor& loss_grad, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + MetaTensor* logits_grad, + MetaConfig config = MetaConfig()); + void GatherNdGradInferMeta(const MetaTensor& x, const MetaTensor& index, const MetaTensor& out_grad, @@ -85,6 +96,26 @@ void GeneralTernaryGradInferMeta(const MetaTensor& x, MetaTensor* dy, MetaTensor* dz); +void GeneralQuaternaryGradInferMeta(const MetaTensor& x, + const MetaTensor& y, + const MetaTensor& z, + const MetaTensor& k, + MetaTensor* dx, + MetaTensor* dy, + MetaTensor* dz, + MetaTensor* dk); + +void GeneralQuinaryGradInferMeta(const MetaTensor& x, + const MetaTensor& y, + const MetaTensor& z, + const MetaTensor& k, + const MetaTensor& l, + MetaTensor* dx, + MetaTensor* dy, + MetaTensor* dz, + MetaTensor* dk, + MetaTensor* dl); + void GeneralUnaryGradInferMeta(const MetaTensor& x, MetaTensor* dx); void GumbelSoftmaxGradInferMeta(const MetaTensor& out, @@ -104,6 +135,20 @@ void MaxPoolWithIndexGradInferMeta(const MetaTensor& x, bool adaptive, MetaTensor* dx); +void MeshgridGradInferMeta(const std::vector& inputs, + const std::vector& outputs_grad, + std::vector inputs_grad); + +void NllLossGradInferMeta(const MetaTensor& input, + const MetaTensor& label, + paddle::optional weight, + const MetaTensor& total_weight, + const MetaTensor& out_grad, + int64_t ignore_index, + const std::string& reduction, + MetaTensor* intput_grad, + MetaConfig config = MetaConfig()); + void PsroiPoolGradInferMeta(const MetaTensor& x, const MetaTensor& rois, paddle::optional rois_num, @@ -129,6 +174,12 @@ void PoolGradInferMeta(const MetaTensor& x, const std::string& padding_algorithm, MetaTensor* dx); +void RealAndImagGradInferMeta(const MetaTensor& out_grad, MetaTensor* dx); + +void ReshapeDoubleGradInferMeta(const MetaTensor& out_grad, + const MetaTensor& x_grad_grad, + MetaTensor* out_grad_grad); + void ScatterGradInferMeta(const MetaTensor& index, const MetaTensor& updates, const MetaTensor& out_grad, @@ -142,4 +193,8 @@ void ScatterNdAddGradInferMeta(const MetaTensor& index, MetaTensor* x_grad, MetaTensor* updates_grad); +void StackGradInferMeta(const MetaTensor& out_grad, + int axis, + std::vector x_grad); + } // namespace phi diff --git a/paddle/phi/infermeta/binary.cc b/paddle/phi/infermeta/binary.cc index 8e285aba55145..2139605fb2048 100644 --- a/paddle/phi/infermeta/binary.cc +++ b/paddle/phi/infermeta/binary.cc @@ -21,6 +21,7 @@ limitations under the License. 
*/ #include "paddle/phi/core/ddim.h" #include "paddle/phi/core/infermeta_utils.h" #include "paddle/phi/kernels/cpu/conv_util.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" #include "paddle/phi/kernels/funcs/common_shape.h" namespace phi { @@ -63,6 +64,16 @@ static void BinarySameInputDimsCheck(const MetaTensor& x, } } +// Used in MatrixRankTolInferMeta +static DDim CheckAndGetOutputDim(const DDim& dim_x) { + auto x_vec = phi::vectorize(dim_x); + if (x_vec.size() == 2) { + return phi::make_ddim({1}); + } + x_vec.erase(x_vec.end() - 2, x_vec.end()); + return phi::make_ddim(x_vec); +} + } // namespace detail void AllValueCompareInferMeta(const MetaTensor& x, @@ -75,6 +86,32 @@ void AllValueCompareInferMeta(const MetaTensor& x, out->set_dtype(DataType::BOOL); } +void EmbeddingInferMeta(const MetaTensor& input, + const MetaTensor& weight, + int64_t padding_idx, + MetaTensor* out) { + auto table_dims = weight.dims(); + auto ids_dims = input.dims(); + int ids_rank = ids_dims.size(); + VLOG(5) << "ids rank is " << ids_rank << std::endl; + PADDLE_ENFORCE_EQ( + table_dims.size(), + 2, + phi::errors::InvalidArgument( + "ShapeError: The dimensions of the 'lookup table' must be 2. " + "But received lookup table's dimensions = %d, " + "lookup table's shape = [%s].", + table_dims.size(), + table_dims)); + + auto output_dims = phi::vectorize(ids_dims); + output_dims.push_back(table_dims[1]); + + out->set_dims(phi::make_ddim(output_dims)); + out->set_dtype(weight.dtype()); + out->share_lod(input); +} + void KLDivInferMeta(const MetaTensor& x, const MetaTensor& label, const std::string& reduction, @@ -727,6 +764,82 @@ void CrossInferMeta(const MetaTensor& x, out->share_lod(x); } +void CrossEntropyWithSoftmaxInferMeta(const MetaTensor& logits, + const MetaTensor& label, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + MetaTensor* softmax, + MetaTensor* loss, + MetaConfig config) { + auto logits_dims = logits.dims(); + auto labels_dims = label.dims(); + auto logits_rank = logits_dims.size(); + PADDLE_ENFORCE_GE(axis, + -logits_rank, + phi::errors::InvalidArgument( + "Attr(axis) value should be in range [-R, R-1], " + "R is the rank of Input(Logits).")); + PADDLE_ENFORCE_LT(axis, + logits_rank, + phi::errors::InvalidArgument( + "Attr(axis) value should be in range [-R, R-1], " + "R is the rank of Input(Logits).")); + + axis = phi::funcs::CanonicalAxis(axis, logits_rank); + for (int i = 0; i < logits_rank; i++) { + if (i != axis) { + if (config.is_runtime || (logits_dims[i] > 0 && labels_dims[i] > 0)) { + PADDLE_ENFORCE_EQ(logits_dims[i], + labels_dims[i], + phi::errors::InvalidArgument( + "Input(Logits) and Input(Label) should in " + "same shape in dimensions except axis.")); + } + } + } + + if (axis != logits_rank - 1) { + PADDLE_ENFORCE_EQ( + numeric_stable_mode, + true, + phi::errors::InvalidArgument("Attr(axis) can only be -1 " + "when not in numeric_stable_mode.")); + } + + if (soft_label) { + if (config.is_runtime || (logits_dims[axis] > 0 && labels_dims[axis] > 0)) { + PADDLE_ENFORCE_EQ(logits_dims[axis], + labels_dims[axis], + phi::errors::InvalidArgument( + "If Attr(soft_label) == true, " + "the axis dimension of " + "Input(X) and Input(Label) should be equal.")); + } + } else { + if (config.is_runtime || labels_dims[axis] > 0) { + PADDLE_ENFORCE_EQ( + labels_dims[axis], + 1UL, + phi::errors::InvalidArgument("If Attr(soft_label) == false, " + "the axis dimension of " + "Input(Label) should be 1.")); + } + } + + softmax->set_dims(logits_dims); + 
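Editor's aside on the new EmbeddingInferMeta above: the output shape is the ids shape extended by the table width. The sketch below (hypothetical `EmbeddingOutDims`, not the phi implementation) shows that arithmetic in isolation.

```cpp
// Sketch of the embedding output shape: out = ids_dims + {table_dims[1]}.
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> EmbeddingOutDims(const std::vector<int64_t>& ids_dims,
                                      const std::vector<int64_t>& table_dims) {
  assert(table_dims.size() == 2);  // the lookup table must be 2-D
  std::vector<int64_t> out = ids_dims;
  out.push_back(table_dims[1]);    // append the embedding width
  return out;
}

int main() {
  // ids of shape {batch, seq_len} looked up in a {vocab, hidden} table
  // produce an output of shape {batch, seq_len, hidden}.
  assert((EmbeddingOutDims({16, 128}, {30000, 768}) ==
          std::vector<int64_t>{16, 128, 768}));
}
```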
softmax->set_dtype(logits.dtype()); + + logits_dims[axis] = 1; + loss->set_dims(logits_dims); + loss->set_dtype(logits.dtype()); + + softmax->share_lod(logits); + loss->share_lod(logits); +} + void DistInferMeta(const MetaTensor& x, const MetaTensor& y, float p, @@ -750,6 +863,26 @@ void DistInferMeta(const MetaTensor& x, out->set_dtype(x.dtype()); } +void DropoutInferMeta(const MetaTensor& x, + paddle::optional seed_tensor, + float p, + bool is_test, + const std::string& mode, + int seed, + bool fix_seed, + MetaTensor* out, + MetaTensor* mask) { + auto x_dims = x.dims(); + out->set_dims(x_dims); + out->share_lod(x); + out->set_dtype(x.dtype()); + + if (mask != nullptr) { + mask->set_dims(x_dims); + mask->set_dtype(DataType::UINT8); + } +} + void DotInferMeta(const MetaTensor& x, const MetaTensor& y, MetaTensor* out) { auto x_dims = x.dims(); auto x_rank = static_cast(x_dims.size()); @@ -1342,6 +1475,47 @@ void MatmulWithFlattenInferMeta(const MetaTensor& x, out->share_lod(x); } +void MatrixRankTolInferMeta(const MetaTensor& x, + const MetaTensor& atol_tensor, + bool use_default_tol, + bool hermitian, + MetaTensor* out) { + auto dim_x = x.dims(); + PADDLE_ENFORCE_GE( + dim_x.size(), + 2, + phi::errors::InvalidArgument("The dims of input must be greater than 2")); + + if (hermitian) { + int rows = dim_x[dim_x.size() - 2]; + int cols = dim_x[dim_x.size() - 1]; + PADDLE_ENFORCE_EQ(rows, + cols, + phi::errors::InvalidArgument( + "if hermitian == true, matrix should be n*n")); + } + DDim dim_x_batch = detail::CheckAndGetOutputDim(dim_x); + auto dim_tol = atol_tensor.dims(); + if (dim_x_batch == dim_tol) { + out->set_dims(dim_x_batch); + } else { + int max_dim = std::max(dim_x_batch.size(), dim_tol.size()); + int axis = std::abs(dim_x_batch.size() - dim_tol.size()); + std::vector x_batch_dims_array(max_dim); + std::vector tol_dims_array(max_dim); + std::vector out_dims_array(max_dim); + phi::funcs::GetBroadcastDimsArrays(dim_x_batch, + dim_tol, + x_batch_dims_array.data(), + tol_dims_array.data(), + out_dims_array.data(), + max_dim, + axis); + out->set_dims(phi::make_ddim(out_dims_array)); + } + out->share_lod(x); +} + void MvInferMeta(const MetaTensor& x, const MetaTensor& vec, MetaTensor* out) { auto dim_x = x.dims(); auto dim_vec = vec.dims(); @@ -1374,8 +1548,8 @@ void MvInferMeta(const MetaTensor& x, const MetaTensor& vec, MetaTensor* out) { void PReluInferMeta(const MetaTensor& x, const MetaTensor& alpha, - const std::string& mode, const std::string& data_format, + const std::string& mode, MetaTensor* out, MetaConfig config) { auto x_dim = x.dims(); diff --git a/paddle/phi/infermeta/binary.h b/paddle/phi/infermeta/binary.h index fc9d2642d9cc4..192fa214c905f 100644 --- a/paddle/phi/infermeta/binary.h +++ b/paddle/phi/infermeta/binary.h @@ -37,6 +37,11 @@ void AllValueCompareInferMeta(const MetaTensor& x, MetaTensor* out, MetaConfig config = MetaConfig()); +void EmbeddingInferMeta(const MetaTensor& input, + const MetaTensor& weight, + int64_t padding_idx, + MetaTensor* out); + void KLDivInferMeta(const MetaTensor& x, const MetaTensor& label, const std::string& reduction, @@ -112,6 +117,17 @@ void CrossInferMeta(const MetaTensor& x, int axis, MetaTensor* out); +void CrossEntropyWithSoftmaxInferMeta(const MetaTensor& logits, + const MetaTensor& label, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + MetaTensor* softmax, + MetaTensor* loss, + MetaConfig config = MetaConfig()); + void DistInferMeta(const MetaTensor& x, const MetaTensor& y, 
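Editor's aside on MatrixRankTolInferMeta above: the output is the batch shape of X (last two dims dropped, or {1} for a plain matrix) broadcast against the tolerance shape. The sketch below is a simplified stand-in (hypothetical `BatchDims`/`Broadcast` helpers that ignore -1 dims and error handling), not the GetBroadcastDimsArrays call used in the patch.

```cpp
// Simplified sketch of the matrix_rank output shape with a tolerance tensor.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<int64_t> BatchDims(std::vector<int64_t> x_dims) {
  if (x_dims.size() == 2) return {1};
  x_dims.erase(x_dims.end() - 2, x_dims.end());
  return x_dims;
}

// Right-aligned broadcast of two static shapes (dims of 1 stretch).
std::vector<int64_t> Broadcast(std::vector<int64_t> a, std::vector<int64_t> b) {
  if (a.size() < b.size()) std::swap(a, b);
  std::vector<int64_t> out = a;
  for (size_t i = 0; i < b.size(); ++i) {
    int64_t& d = out[a.size() - b.size() + i];
    d = std::max(d, b[i]);
  }
  return out;
}

int main() {
  // X: {5, 1, 4, 4} -> batch {5, 1}; tol: {3} -> rank output {5, 3}.
  assert((Broadcast(BatchDims({5, 1, 4, 4}), {3}) ==
          std::vector<int64_t>{5, 3}));
}
```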
float p, @@ -119,6 +135,16 @@ void DistInferMeta(const MetaTensor& x, void DotInferMeta(const MetaTensor& x, const MetaTensor& y, MetaTensor* out); +void DropoutInferMeta(const MetaTensor& x, + paddle::optional seed_tensor, + float p, + bool is_test, + const std::string& mode, + int seed, + bool fix_seed, + MetaTensor* out, + MetaTensor* mask); + void ElementwiseInferMeta(const MetaTensor& x, const MetaTensor& y, MetaTensor* out); @@ -192,14 +218,20 @@ void MatmulWithFlattenInferMeta(const MetaTensor& x, int y_num_col_dims, MetaTensor* out); +void MatrixRankTolInferMeta(const MetaTensor& x, + const MetaTensor& atol_tensor, + bool use_default_tol, + bool hermitian, + MetaTensor* out); + void MvInferMeta(const MetaTensor& x, const MetaTensor& vec, MetaTensor* out); void PReluInferMeta(const MetaTensor& x, const MetaTensor& alpha, - const std::string& mode, const std::string& data_format, + const std::string& mode, MetaTensor* out, - MetaConfig config); + MetaConfig config = MetaConfig()); void SearchsortedInferMeta(const MetaTensor& sorted_sequence, const MetaTensor& value, diff --git a/paddle/phi/infermeta/multiary.cc b/paddle/phi/infermeta/multiary.cc index c6940492ce696..f2acfe5a9962b 100644 --- a/paddle/phi/infermeta/multiary.cc +++ b/paddle/phi/infermeta/multiary.cc @@ -66,6 +66,32 @@ void AdadeltaInferMeta(const MetaTensor& param, avg_squared_update_out->set_dtype(avg_squared_update.dtype()); } +void AdagradInferMeta(const MetaTensor& param, + const MetaTensor& grad, + const MetaTensor& moment, + const MetaTensor& learning_rate, + float epsilon, + MetaTensor* param_out, + MetaTensor* moment_out) { + auto lr_dims = learning_rate.dims(); + PADDLE_ENFORCE_EQ( + phi::product(lr_dims), + 1, + phi::errors::InvalidArgument("LearningRate should have one element")); + auto param_dims = param.dims(); + + PADDLE_ENFORCE_EQ( + param_dims, + moment.dims(), + phi::errors::InvalidArgument("Param and Moment input of AdagradOp " + "should have the same dimension.")); + + param_out->set_dims(param_dims); + param_out->set_dtype(param.dtype()); + moment_out->set_dims(param_dims); + moment_out->set_dtype(moment.dtype()); +} + void AdamInferMeta(const MetaTensor& param, const MetaTensor& grad, const MetaTensor& learning_rate, @@ -253,6 +279,78 @@ void AdamwInferMeta(const MetaTensor& param, master_param_outs); } +void AddNInferMeta(const std::vector& x, + MetaTensor* out, + MetaConfig config) { + auto N = x.size(); + PADDLE_ENFORCE_GT( + N, + 0, + phi::errors::InvalidArgument( + "The input tensor X's dimensions of SumOp " + "should be larger than 0. But received X's dimensions %d.", + N)); + if (N == 1) { + VLOG(3) << "Warning: SumOp have only one input, may waste memory"; + } + + phi::DDim in_dim({0}); + for (size_t i = 0; i < x.size(); ++i) { + auto x_dim = x[i]->dims(); + if (phi::product(x_dim) == 0) { + continue; + } + if (phi::product(in_dim) == 0) { + in_dim = x_dim; + } else { + if (config.is_runtime) { + PADDLE_ENFORCE_EQ(in_dim, + x_dim, + phi::errors::InvalidArgument( + "The input tensor X of SumOp must" + " have same shape. But received X[0]'s shape = " + "[%s], X[%d]'s shape = [%s].", + in_dim, + i, + x_dim)); + } else { + PADDLE_ENFORCE_EQ( + in_dim.size(), + x_dim.size(), + phi::errors::InvalidArgument( + "The input tensor X of SumOp must have same " + "dimensions. 
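Editor's aside on the new AdagradInferMeta above: it enforces a scalar learning rate and matching Param/Moment shapes, then lets both outputs inherit the parameter shape. A minimal sketch with plain structs standing in for phi::MetaTensor (all names here are illustrative):

```cpp
// Sketch of the Adagrad meta contract: scalar LR, Param == Moment shape,
// outputs mirror Param.
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

struct Meta { std::vector<int64_t> dims; };

void AdagradMetaSketch(const Meta& param, const Meta& moment, const Meta& lr,
                       Meta* param_out, Meta* moment_out) {
  auto numel = [](const Meta& m) {
    return std::accumulate(m.dims.begin(), m.dims.end(), int64_t{1},
                           std::multiplies<int64_t>());
  };
  assert(numel(lr) == 1);             // LearningRate must hold one element
  assert(param.dims == moment.dims);  // Param and Moment must match
  param_out->dims = param.dims;
  moment_out->dims = param.dims;
}

int main() {
  Meta param{{256, 128}}, moment{{256, 128}}, lr{{1}};
  Meta param_out, moment_out;
  AdagradMetaSketch(param, moment, lr, &param_out, &moment_out);
  assert(param_out.dims == param.dims && moment_out.dims == param.dims);
}
```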
But received X[0]'s dimensions = %d, X[0]'s " + "shape = " + "[%s], X[%d]'s dimensions = %d, X[%d]'s shape = [%s].", + in_dim.size(), + in_dim, + i, + x_dim.size(), + i, + x_dim)); + // if in_dim or x_dim has -1, not check equal + for (int j = 0; j < x_dim.size(); ++j) { + if (x_dim[j] == -1 || in_dim[j] == -1) { + continue; + } + PADDLE_ENFORCE_EQ( + in_dim[j], + x_dim[j], + phi::errors::InvalidArgument( + "The input tensor X of SumOp must have same shape " + "if not -1." + "But received X[0]'s shape = [%s], X[%d]'s shape = [%s].", + in_dim, + i, + x_dim)); + } + } + } + } + out->set_dims(in_dim); + out->share_lod(*x[0]); +} + void AucInferMeta(const MetaTensor& input, const MetaTensor& label, const MetaTensor& stat_pos, @@ -890,6 +988,569 @@ void HierarchicalSigmoidInferMeta(const MetaTensor& x, out->set_dtype(x.dtype()); } +static void Interpolate1DInferShapeCheck( + const MetaTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + MetaTensor* output, + MetaConfig config) { + auto dim_x = x.dims(); + + PADDLE_ENFORCE_EQ("linear", + interp_method, + phi::errors::InvalidArgument( + "Interpolation method can only be \"linear\" when" + "Input(X) dimension is 3, but got method = %s .", + interp_method)); + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + for (int i = 0; i < dim_x.size(); ++i) { + PADDLE_ENFORCE_NE( + dim_x[i], + 0, + phi::errors::InvalidArgument("The shape of input(x) should be larged " + "than 0, bug received shape[%d] is %d ", + i, + dim_x[i])); + } + if (size_tensor && size_tensor->size() > 0) { + // top prority size + auto inputs_name = size_tensor.get(); + PADDLE_ENFORCE_EQ( + inputs_name.size(), + 1, + phi::errors::InvalidArgument( + "Input(SizeTensor)'size of Op(interpolate) must be 1. " + "Attr(out_shape)'s length must be 1 for 3-D input tensor, but got " + "size = %d .", + inputs_name.size())); + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {dim_x[0], dim_x[1], out_w}; + } else { + dim_out = {dim_x[0], out_w, dim_x[2]}; + } + output->set_dims(dim_out); + output->set_dtype(x.dtype()); + + return; + } + + int out_w_tmp; + if (scale_tensor) { + auto scale_tensor_dim = scale_tensor->dims(); + PADDLE_ENFORCE_EQ( + scale_tensor_dim.size(), + 1, + phi::errors::InvalidArgument( + "Scale's dimension size must be 1, but got dimension = %d .", + scale_tensor_dim.size())); + PADDLE_ENFORCE_EQ(scale_tensor_dim[0], + 1, + phi::errors::InvalidArgument( + "Scale's shape must be 1, but got shape = %d .", + scale_tensor_dim[0])); + out_w_tmp = -1; + } else { + if (scale.size() > 0) { + float scale_w = -1; + scale_w = scale[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + phi::errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + if (scale_w > 0.) { + // round down + out_w_tmp = (data_layout == DataLayout::kNCHW + ? static_cast(dim_x[2] * scale_w) + : static_cast(dim_x[1] * scale_w)); + // protect when input shape is -1 + out_w_tmp = out_w_tmp > 0 ? 
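Editor's aside on AddNInferMeta above: at compile time only ranks are compared and any dimension equal to -1 (unknown) is exempt from the equality check; exact equality is deferred to runtime. The sketch below (hypothetical `ShapesCompatible`) restates that relaxed check.

```cpp
// Sketch of the compile-time shape compatibility rule used by AddNInferMeta.
#include <cassert>
#include <cstdint>
#include <vector>

bool ShapesCompatible(const std::vector<int64_t>& a,
                      const std::vector<int64_t>& b) {
  if (a.size() != b.size()) return false;
  for (size_t i = 0; i < a.size(); ++i) {
    if (a[i] == -1 || b[i] == -1) continue;  // unknown dims are skipped
    if (a[i] != b[i]) return false;
  }
  return true;
}

int main() {
  // {8, -1} and {8, 128} are accepted before the shapes are fully known;
  // the runtime branch later requires exact equality.
  assert(ShapesCompatible({8, -1}, {8, 128}));
  assert(!ShapesCompatible({8, 64}, {8, 128}));
}
```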
out_w_tmp : -1; + } + } else { + out_w_tmp = out_w; + } + } + + if (out_size && config.is_runtime) { + auto out_size_dim = out_size->dims(); + PADDLE_ENFORCE_EQ( + out_size_dim.size(), + 1, + phi::errors::InvalidArgument( + "OutSize's dimension size must be 1, but got dimention = %d .", + out_size_dim.size())); + PADDLE_ENFORCE_EQ( + out_size_dim[0], + 1, + phi::errors::InvalidArgument( + "OutSize's 0-th dimension's value must be 1, but got value = %d .", + out_size_dim[0])); + + // dims will be seted in kernel + output->set_dtype(x.dtype()); + output->share_lod(x); + return; + } + + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {dim_x[0], dim_x[1], out_w_tmp}; + } else { + dim_out = {dim_x[0], out_w_tmp, dim_x[2]}; + } + output->set_dims(dim_out); + output->set_dtype(x.dtype()); +} + +static void Interpolate2DInferShapeCheck( + const MetaTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + MetaTensor* output, + MetaConfig config) { + auto dim_x = x.dims(); + + PADDLE_ENFORCE( + "bilinear" == interp_method || "nearest" == interp_method || + "bicubic" == interp_method, + phi::errors::InvalidArgument( + "Interpolation method can only be \"bilinear\" or \"nearest\" when " + "Input(X) dimension is 4, but got method = %s.", + interp_method)); + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + + for (int i = 0; i < dim_x.size(); ++i) { + PADDLE_ENFORCE_NE( + dim_x[i], + 0, + phi::errors::InvalidArgument("The shape of input(x) should be larged " + "than 0, bug received shape[%d] is %d ", + i, + dim_x[i])); + } + + if (size_tensor && size_tensor->size()) { + // top prority size + auto inputs_name = size_tensor.get(); + PADDLE_ENFORCE_EQ( + inputs_name.size(), + 2, + phi::errors::InvalidArgument( + "Input(SizeTensor)'size of Op(interpolate) must be 2. " + "Attr(out_shape)'s length must be 2 for 4-D input " + "tensor, but got size = %d .", + inputs_name.size())); + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {dim_x[0], dim_x[1], out_h, out_w}; + } else { + dim_out = {dim_x[0], out_h, out_w, dim_x[3]}; + } + output->set_dims(dim_out); + output->set_dtype(x.dtype()); + + return; + } + + int out_h_tmp, out_w_tmp; + if (scale_tensor) { + auto scale_tensor_dim = scale_tensor->dims(); + PADDLE_ENFORCE_EQ( + scale_tensor_dim.size(), + 1, + phi::errors::InvalidArgument( + "Scale's dimension size must be 1, but got dimension = %d .", + scale_tensor_dim.size())); + PADDLE_ENFORCE_EQ(scale_tensor_dim[0] == 2 || scale_tensor_dim[0] == 1, + true, + phi::errors::InvalidArgument( + "Scale's shape must be 2 or 1, but got shape = %d .", + scale_tensor_dim[0])); + out_h_tmp = -1; + out_w_tmp = -1; + } else { + if (scale.size() > 0) { + float scale_h = -1; + float scale_w = -1; + scale_h = scale[0]; + scale_w = scale[1]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + phi::errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + phi::errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + if (scale_h > 0. && scale_w > 0.) 
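Editor's aside on the interpolate shape checks above: when a scale attribute is given, the output extent is the input extent times the scale, rounded down, and an unknown (-1) input extent stays unknown. A small sketch (hypothetical `InterpOutW`) of that one step:

```cpp
// Sketch of the scale-to-output-width arithmetic in the interpolate checks.
#include <cassert>
#include <cstdint>

int64_t InterpOutW(int64_t in_w, float scale_w) {
  if (in_w < 0) return -1;                                // unknown stays unknown
  int64_t out_w = static_cast<int64_t>(in_w * scale_w);   // round down
  return out_w > 0 ? out_w : -1;
}

int main() {
  assert(InterpOutW(25, 0.5f) == 12);  // floor(12.5)
  assert(InterpOutW(-1, 2.0f) == -1);  // dynamic shape is preserved
}
```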
{ + // round down + out_h_tmp = (data_layout == DataLayout::kNCHW + ? static_cast(dim_x[2] * scale_h) + : static_cast(dim_x[1] * scale_h)); + out_w_tmp = (data_layout == DataLayout::kNCHW + ? static_cast(dim_x[3] * scale_w) + : static_cast(dim_x[2] * scale_w)); + // protect when input shape is -1 + out_h_tmp = out_h_tmp > 0 ? out_h_tmp : -1; + out_w_tmp = out_w_tmp > 0 ? out_w_tmp : -1; + } + } else { + out_h_tmp = out_h; + out_w_tmp = out_w; + } + } + + if (out_size && config.is_runtime) { + auto out_size_dim = out_size->dims(); + PADDLE_ENFORCE_EQ( + out_size_dim.size(), + 1, + phi::errors::InvalidArgument( + "OutSize's dimension size must be 1, but got dimension = %d .", + out_size_dim.size())); + PADDLE_ENFORCE_EQ( + out_size_dim[0], + 2, + phi::errors::InvalidArgument( + "OutSize's dim[0] must be 2, but got dimention = %d .", + out_size_dim[0])); + // dims will be seted in kernel + output->set_dtype(x.dtype()); + output->share_lod(x); + return; + } + + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {dim_x[0], dim_x[1], out_h_tmp, out_w_tmp}; + } else { + dim_out = {dim_x[0], out_h_tmp, out_w_tmp, dim_x[3]}; + } + + output->set_dims(dim_out); + output->set_dtype(x.dtype()); +} + +static void Interpolate3DInferShapeCheck( + const MetaTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + MetaTensor* output, + MetaConfig config) { + auto dim_x = x.dims(); + + PADDLE_ENFORCE("nearest" == interp_method || "trilinear" == interp_method, + phi::errors::InvalidArgument( + "Interpolation method can only be \"trilinear\" or " + "\"nearest\" when Input(X) " + "dimension is 5, but got method = %s .", + interp_method)); + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + + for (int i = 0; i < dim_x.size(); ++i) { + PADDLE_ENFORCE_NE( + dim_x[i], + 0, + phi::errors::InvalidArgument("The shape of input(x) should be larged " + "than 0, bug received shape[%d] is %d ", + i, + dim_x[i])); + } + + if (size_tensor && size_tensor->size() > 0) { + // top prority size + auto inputs_name = size_tensor.get(); + PADDLE_ENFORCE_EQ( + inputs_name.size(), + 3, + phi::errors::InvalidArgument( + "Input(SizeTensor)'s size of Op(interpolate) must be 3. 
" + "Attr(out_shape)'s length must be 3 for 5-D input " + "tensor, but got size = %d .", + inputs_name.size())); + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {dim_x[0], dim_x[1], out_d, out_h, out_w}; + } else { + dim_out = {dim_x[0], out_d, out_h, out_w, dim_x[4]}; + } + output->set_dims(dim_out); + output->set_dtype(x.dtype()); + return; + } + + int out_d_tmp, out_h_tmp, out_w_tmp; + if (scale_tensor) { + auto scale_tensor_dim = scale_tensor->dims(); + PADDLE_ENFORCE_EQ( + scale_tensor_dim.size(), + 1, + phi::errors::InvalidArgument( + "Scale's dimension size must be 1, but got size = %d .", + scale_tensor_dim.size())); + PADDLE_ENFORCE_EQ(scale_tensor_dim[0] == 3 || scale_tensor_dim[0] == 1, + true, + phi::errors::InvalidArgument( + "Scale's shape must be 3 or 1, but got shape = %d .", + scale_tensor_dim[0])); + out_d_tmp = -1; + out_h_tmp = -1; + out_w_tmp = -1; + } else { + if (scale.size() > 0) { + float scale_d = -1; + float scale_h = -1; + float scale_w = -1; + scale_d = scale[0]; + scale_h = scale[1]; + scale_w = scale[2]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + phi::errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + phi::errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + phi::errors::InvalidArgument( + "The scale_d in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + if (scale_d > 0. && scale_h > 0. && scale_w > 0.) { + // round down + out_d_tmp = (data_layout == DataLayout::kNCHW + ? static_cast(dim_x[2] * scale_d) + : static_cast(dim_x[1] * scale_d)); + out_h_tmp = (data_layout == DataLayout::kNCHW + ? static_cast(dim_x[3] * scale_h) + : static_cast(dim_x[2] * scale_h)); + out_w_tmp = (data_layout == DataLayout::kNCHW + ? static_cast(dim_x[4] * scale_w) + : static_cast(dim_x[3] * scale_w)); + // protect when input shape is -1 + out_d_tmp = out_d_tmp > 0 ? out_d_tmp : -1; + out_h_tmp = out_h_tmp > 0 ? out_h_tmp : -1; + out_w_tmp = out_w_tmp > 0 ? 
out_w_tmp : -1; + } + } else { + out_d_tmp = out_d; + out_h_tmp = out_h; + out_w_tmp = out_w; + } + } + + if (out_size && config.is_runtime) { + auto out_size_dim = out_size->dims(); + PADDLE_ENFORCE_EQ( + out_size_dim.size(), + 1, + phi::errors::InvalidArgument( + "OutSize's dimension size must be 1, but got size is %d.", + out_size_dim.size())); + PADDLE_ENFORCE_EQ(out_size_dim[0], + 3, + phi::errors::InvalidArgument( + "OutSize's dim[0] must be 3, but got size is %d.", + out_size_dim[0])); + // dims will be seted in kernel + output->set_dtype(x.dtype()); + output->share_lod(x); + return; + } + + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {dim_x[0], dim_x[1], out_d_tmp, out_h_tmp, out_w_tmp}; + } else { + dim_out = {dim_x[0], out_d_tmp, out_h_tmp, out_w_tmp, dim_x[4]}; + } + output->set_dims(dim_out); + output->set_dtype(x.dtype()); +} + +void InterpolateInferMeta( + const MetaTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + MetaTensor* output, + MetaConfig config) { + auto dim_x = x.dims(); // NCHW format + PADDLE_ENFORCE( + dim_x.size() == 3 || dim_x.size() == 4 || dim_x.size() == 5, + phi::errors::Unimplemented( + "Input(X) dimension must be 3, 4 or 5, but got dimension = %d .", + dim_x.size())); + if (dim_x.size() == 3) { + // shape check for 1D interpolate for input tensor shape NCHW + Interpolate1DInferShapeCheck(x, + out_size, + size_tensor, + scale_tensor, + data_layout_str, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output, + config); + } else if (dim_x.size() == 4) { + // shape check for 2D interpolate for input tensor shape NCHW + Interpolate2DInferShapeCheck(x, + out_size, + size_tensor, + scale_tensor, + data_layout_str, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output, + config); + } else { // dim_x.size() == 5 + // shape check for 3D interpolate for input tensor shape NCDHW + Interpolate3DInferShapeCheck(x, + out_size, + size_tensor, + scale_tensor, + data_layout_str, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output, + config); + } +} + +void MeshgridInferMeta(const std::vector& inputs, + std::vector outputs) { + const size_t inputs_num = inputs.size(); + + auto out_shape = std::vector(inputs_num); + + for (size_t i = 0; i < inputs.size(); i++) { + out_shape[i] = inputs[i]->dims()[0]; + } + auto out_dims = phi::make_ddim(std::vector(out_shape)); + for (size_t i = 0; i < outputs.size(); ++i) { + outputs[i]->set_dims(out_dims); + outputs[i]->set_dtype(inputs[0]->dtype()); + } +} + +void MomentumInferMeta(const MetaTensor& param, + const MetaTensor& grad, + const MetaTensor& velocity, + const MetaTensor& learning_rate, + paddle::optional master_param, + float mu, + bool use_nesterov, + const std::string& regularization_method, + float regularization_coeff, + bool multi_precision, + float rescale_grad, + MetaTensor* param_out, + MetaTensor* velocity_out, + MetaTensor* master_param_out) { + PADDLE_ENFORCE_NE( + param_out, + nullptr, + errors::NotFound("Output(ParamOut) of Momentum should not be null.")); + PADDLE_ENFORCE_NE( + velocity_out, + nullptr, + errors::NotFound("Output(VelocityOut) of Momentum should not be null.")); + + auto lr_dims = learning_rate.dims(); + 
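Editor's aside on MeshgridInferMeta above: every output shares a single shape built from the leading extent of each 1-D input. A standalone sketch (hypothetical `MeshgridOutDims`) of that rule:

```cpp
// Sketch of the meshgrid output shape rule.
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> MeshgridOutDims(
    const std::vector<std::vector<int64_t>>& input_dims) {
  std::vector<int64_t> out;
  out.reserve(input_dims.size());
  for (const auto& d : input_dims) out.push_back(d[0]);
  return out;
}

int main() {
  // meshgrid of 1-D tensors of length 3, 4 and 5 yields three outputs,
  // each of shape {3, 4, 5}.
  assert((MeshgridOutDims({{3}, {4}, {5}}) == std::vector<int64_t>{3, 4, 5}));
}
```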
PADDLE_ENFORCE_NE( + phi::product(lr_dims), + 0, + errors::InvalidArgument("Maybe the Input variable LearningRate has not " + "been initialized. You may need to confirm " + "if you put exe.run(startup_program) " + "after optimizer.minimize function.")); + PADDLE_ENFORCE_EQ( + phi::product(lr_dims), + 1, + errors::InvalidArgument("Learning_rate should be a scalar. But Received " + "LearningRate's dim [%s]", + phi::product(lr_dims))); + + auto param_dim = param.dims(); + param_out->set_dims(param_dim); + velocity_out->set_dims(param_dim); + + if (master_param_out) { + master_param_out->set_dims(param_dim); + } +} + void MultiDotInferMeta(const std::vector& x, MetaTensor* out) { auto inputs_dims = GetMetaTensorsDim(x); @@ -1082,6 +1743,65 @@ void PsroiPoolInferMeta(const MetaTensor& x, out->set_dtype(x.dtype()); } +void RmspropInferMeta(const MetaTensor& param, + const MetaTensor& mean_square, + const MetaTensor& grad, + const MetaTensor& moment, + const MetaTensor& learning_rate, + paddle::optional mean_grad, + float epsilon, + float decay, + float momentum, + bool centered, + MetaTensor* param_out, + MetaTensor* moment_out, + MetaTensor* mean_square_out, + MetaTensor* mean_grad_out) { + if (centered) { + PADDLE_ENFORCE_NOT_NULL( + mean_grad_out, + phi::errors::InvalidArgument( + "Output(MeanGradOut) of RmspropOp should not be null.")); + } + + auto param_dim = param.dims(); + PADDLE_ENFORCE_EQ(param_dim, + moment.dims(), + phi::errors::InvalidArgument( + "Param and Momentum input of RmspropOp " + "should have the same dimension. But received " + "Param's dim [%s] and Moment [%s]", + param_dim, + moment.dims())); + PADDLE_ENFORCE_EQ(param_dim, + mean_square.dims(), + phi::errors::InvalidArgument( + "Param and Momentum input of RmspropOp " + "should have the same dimension. But received " + "Param's dim [%s] and MeanSquare [%s]", + param_dim, + mean_square.dims())); + + auto lr_dim = learning_rate.dims(); + PADDLE_ENFORCE_EQ(phi::product(lr_dim), + 1, + phi::errors::InvalidArgument( + "Learning Rate of RmspropOp should be a scalar. But " + "received LearningRate's dim [%s]", + phi::product(lr_dim))); + + param_out->set_dims(param_dim); + param_out->set_dtype(param.dtype()); + moment_out->set_dims(param_dim); + moment_out->set_dtype(moment.dtype()); + mean_square_out->set_dims(param_dim); + mean_square_out->set_dtype(mean_square.dtype()); + if (centered) { + mean_grad_out->set_dims(param_dim); + mean_grad_out->set_dtype(mean_grad.get_ptr()->dtype()); + } +} + void RnnInferMeta(const MetaTensor& x, const std::vector& pre_state, const std::vector& weight_list, @@ -1167,6 +1887,29 @@ void RnnInferMeta(const MetaTensor& x, } } +void SGDInferMeta(const MetaTensor& param, + const MetaTensor& learning_rate, + const MetaTensor& grad, + paddle::optional master_param, + bool multi_precision, + MetaTensor* param_out, + MetaTensor* master_param_out) { + PADDLE_ENFORCE_NOT_NULL(param_out, + phi::errors::InvalidArgument( + "Output(ParamOut) of SGDOp should not be null.")); + + auto lr_dims = learning_rate.dims(); + PADDLE_ENFORCE_EQ(phi::product(lr_dims), + 1, + phi::errors::InvalidArgument( + "Learning rate should have 1 element. 
But received " + "LearningRate dims [%s]", + phi::product(lr_dims))); + + param_out->set_dims(param.dims()); + param_out->set_dtype(param.dtype()); +} + void StackInferMeta(const std::vector& x, int axis, MetaTensor* out) { @@ -1213,6 +1956,13 @@ void StackInferMeta(const std::vector& x, out->share_lod(*x.at(0)); } +void UnchangedMultiInferMeta(const std::vector& x, + std::vector out) { + for (size_t i = 0; i < x.size(); ++i) { + out[i]->share_meta(*x[i]); + } +} + void WarpctcInferMeta(const MetaTensor& logits, const MetaTensor& label, const paddle::optional logits_length, @@ -1275,6 +2025,103 @@ void WhereInferMeta(const MetaTensor& condition, out->share_meta(x); } +void GraphReindexInferMeta(const MetaTensor& x, + const MetaTensor& neighbors, + const MetaTensor& count, + paddle::optional hashtable_value, + paddle::optional hashtable_index, + bool flag_buffer_hashtable, + MetaTensor* reindex_src, + MetaTensor* reindex_dst, + MetaTensor* out_nodes) { + auto GraphReindexShapeCheck = [](const phi::DDim& dims, + std::string tensor_name) { + if (dims.size() == 2) { + PADDLE_ENFORCE_EQ( + dims[1], + 1, + phi::errors::InvalidArgument("The last dim of %s should be 1 when it " + "is 2D, but we get %d", + tensor_name, + dims[1])); + } else { + PADDLE_ENFORCE_EQ( + dims.size(), + 1, + phi::errors::InvalidArgument( + "The %s should be 1D, when it is not 2D, but we get %d", + tensor_name, + dims.size())); + } + }; + + GraphReindexShapeCheck(x.dims(), "X"); + GraphReindexShapeCheck(neighbors.dims(), "Neighbors"); + GraphReindexShapeCheck(count.dims(), "Count"); + if (flag_buffer_hashtable) { + GraphReindexShapeCheck(hashtable_value->dims(), "HashTable_Value"); + GraphReindexShapeCheck(hashtable_index->dims(), "HashTable_Index"); + } + + reindex_src->set_dims({-1}); + reindex_src->set_dtype(neighbors.dtype()); + reindex_dst->set_dims({-1}); + reindex_dst->set_dtype(neighbors.dtype()); + out_nodes->set_dims({-1}); + out_nodes->set_dtype(x.dtype()); +} + +void GraphSampleNeighborsInferMeta( + const MetaTensor& row, + const MetaTensor& col_ptr, + const MetaTensor& x, + paddle::optional eids, + paddle::optional perm_buffer, + int sample_size, + bool return_eids, + bool flag_perm_buffer, + MetaTensor* out, + MetaTensor* out_count, + MetaTensor* out_eids) { + // GSN: GraphSampleNeighbors + auto GSNShapeCheck = [](const phi::DDim& dims, std::string tensor_name) { + if (dims.size() == 2) { + PADDLE_ENFORCE_EQ( + dims[1], + 1, + phi::errors::InvalidArgument("The last dim of %s should be 1 when it " + "is 2D, but we get %d", + tensor_name, + dims[1])); + } else { + PADDLE_ENFORCE_EQ( + dims.size(), + 1, + phi::errors::InvalidArgument( + "The %s should be 1D, when it is not 2D, but we get %d", + tensor_name, + dims.size())); + } + }; + + GSNShapeCheck(row.dims(), "Row"); + GSNShapeCheck(col_ptr.dims(), "Col_Ptr"); + GSNShapeCheck(x.dims(), "X"); + if (return_eids) { + GSNShapeCheck(eids->dims(), "Eids"); + out_eids->set_dims({-1}); + out_eids->set_dtype(row.dtype()); + } + if (flag_perm_buffer) { + GSNShapeCheck(perm_buffer->dims(), "Perm_Buffer"); + } + + out->set_dims({-1}); + out->set_dtype(row.dtype()); + out_count->set_dims({-1}); + out_count->set_dtype(DataType::INT32); +} + void Yolov3LossInferMeta(const MetaTensor& x, const MetaTensor& gt_box, const MetaTensor& gt_label, diff --git a/paddle/phi/infermeta/multiary.h b/paddle/phi/infermeta/multiary.h index 4a8020aefca50..c037641d082b7 100644 --- a/paddle/phi/infermeta/multiary.h +++ b/paddle/phi/infermeta/multiary.h @@ -47,6 +47,14 @@ void 
AdadeltaInferMeta(const MetaTensor& param, MetaTensor* avg_squared_grad_out, MetaTensor* avg_squared_update_out); +void AdagradInferMeta(const MetaTensor& param, + const MetaTensor& grad, + const MetaTensor& moment, + const MetaTensor& learning_rate, + float epsilon, + MetaTensor* param_out, + MetaTensor* moment_out); + void AdamaxInferMeta(const MetaTensor& param, const MetaTensor& grad, const MetaTensor& learning_rate, @@ -109,6 +117,10 @@ void AdamwInferMeta(const MetaTensor& param, MetaTensor* beta2_pow_out, MetaTensor* master_param_outs); +void AddNInferMeta(const std::vector& x, + MetaTensor* out, + MetaConfig config = MetaConfig()); + void AucInferMeta(const MetaTensor& input, const MetaTensor& label, const MetaTensor& stat_pos, @@ -199,6 +211,40 @@ void HierarchicalSigmoidInferMeta(const MetaTensor& x, MetaTensor* pre_out, MetaTensor* w_out); +void InterpolateInferMeta( + const MetaTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + MetaTensor* output, + MetaConfig config = MetaConfig()); + +void MeshgridInferMeta(const std::vector& inputs, + std::vector outputs); + +void MomentumInferMeta(const MetaTensor& param, + const MetaTensor& grad, + const MetaTensor& velocity, + const MetaTensor& learning_rate, + paddle::optional master_param, + float mu, + bool use_nesterov, + const std::string& regularization_method, + float regularization_coeff, + bool multi_precision, + float rescale_grad, + MetaTensor* param_out, + MetaTensor* velocity_out, + MetaTensor* master_param_out); + void MultiDotInferMeta(const std::vector& x, MetaTensor* out); void MultiplexInferMeta(const std::vector& ins, @@ -214,6 +260,21 @@ void PsroiPoolInferMeta(const MetaTensor& x, float spatial_scale, MetaTensor* out); +void RmspropInferMeta(const MetaTensor& param, + const MetaTensor& mean_square, + const MetaTensor& grad, + const MetaTensor& moment, + const MetaTensor& learning_rate, + paddle::optional mean_grad, + float epsilon, + float decay, + float momentum, + bool centered, + MetaTensor* param_out, + MetaTensor* moment_out, + MetaTensor* mean_square_out, + MetaTensor* mean_grad_out); + void RnnInferMeta(const MetaTensor& x, const std::vector& pre_state, const std::vector& weight_list, @@ -231,10 +292,21 @@ void RnnInferMeta(const MetaTensor& x, std::vector state, MetaTensor* reserve); +void SGDInferMeta(const MetaTensor& param, + const MetaTensor& learning_rate, + const MetaTensor& grad, + paddle::optional master_param, + bool multi_precision, + MetaTensor* param_out, + MetaTensor* master_param_out); + void StackInferMeta(const std::vector& x, int axis, MetaTensor* out); +void UnchangedMultiInferMeta(const std::vector& x, + std::vector out); + void WarpctcInferMeta(const MetaTensor& logits, const MetaTensor& label, const paddle::optional logits_length, @@ -249,6 +321,29 @@ void WhereInferMeta(const MetaTensor& condition, const MetaTensor& y, MetaTensor* out); +void GraphReindexInferMeta(const MetaTensor& x, + const MetaTensor& neighbors, + const MetaTensor& count, + paddle::optional hashtable_value, + paddle::optional hashtable_index, + bool flag_buffer_hashtable, + MetaTensor* reindex_src, + MetaTensor* reindex_dst, + MetaTensor* out_nodes); + +void GraphSampleNeighborsInferMeta( + const MetaTensor& row, + const MetaTensor& col_ptr, + const MetaTensor& x, + paddle::optional eids, 
+ paddle::optional perm_buffer, + int sample_size, + bool return_eids, + bool flag_perm_buffer, + MetaTensor* out, + MetaTensor* out_count, + MetaTensor* out_eids); + void Yolov3LossInferMeta(const MetaTensor& x, const MetaTensor& gt_box, const MetaTensor& gt_label, diff --git a/paddle/phi/infermeta/nullary.cc b/paddle/phi/infermeta/nullary.cc index 4a11d24a9868b..f76e7910d77b5 100644 --- a/paddle/phi/infermeta/nullary.cc +++ b/paddle/phi/infermeta/nullary.cc @@ -58,6 +58,39 @@ void GaussianRandomInferMeta(const IntArray& shape, out->set_layout(DataLayout::NCHW); } +void RandpermInferMeta(int n, DataType dtype, MetaTensor* out) { + out->set_dims(phi::make_ddim({n})); + out->set_dtype(dtype); +} + +void RandintInferMeta( + int low, int high, const IntArray& shape, DataType dtype, MetaTensor* out) { + PADDLE_ENFORCE_NOT_NULL( + out, errors::InvalidArgument("Output(Out) of RandintOp is null.")); + PADDLE_ENFORCE_LT( + low, + high, + errors::InvalidArgument("randint's low must less then high, " + "but received: low = %d, high = %d.", + low, + high)); + + auto& shape_vector = shape.GetData(); + PADDLE_ENFORCE_EQ( + shape_vector.empty(), + false, + errors::InvalidArgument("The shape information should not be empty, it " + "must be set by Attr(shape).")); + + std::vector tensor_shape; + tensor_shape.reserve(shape_vector.size()); + for (auto dim : shape_vector) { + tensor_shape.push_back(static_cast(dim)); + } + out->set_dims(make_ddim(tensor_shape)); + out->set_dtype(dtype); +} + void TruncatedGaussianRandomInferMeta(const std::vector& shape, float mean, float std, diff --git a/paddle/phi/infermeta/nullary.h b/paddle/phi/infermeta/nullary.h index 4c9eb0b62a74e..f84ac01d002d3 100644 --- a/paddle/phi/infermeta/nullary.h +++ b/paddle/phi/infermeta/nullary.h @@ -53,6 +53,11 @@ void GaussianRandomInferMeta(const IntArray& shape, DataType dtype, MetaTensor* out); +void RandpermInferMeta(int n, DataType dtype, MetaTensor* out); + +void RandintInferMeta( + int low, int high, const IntArray& shape, DataType dtype, MetaTensor* out); + void TruncatedGaussianRandomInferMeta(const std::vector& shape, float mean, float std, diff --git a/paddle/phi/infermeta/ternary.cc b/paddle/phi/infermeta/ternary.cc index 582dcb0137894..c692b6c8fcd13 100644 --- a/paddle/phi/infermeta/ternary.cc +++ b/paddle/phi/infermeta/ternary.cc @@ -141,6 +141,56 @@ void AddmmInferMeta(const MetaTensor& input, out->set_dtype(input.dtype()); } +void ArangeInferMeta(const MetaTensor& start, + const MetaTensor& end, + const MetaTensor& step, + MetaTensor* out) { + auto start_dims = start.dims(); + auto end_dims = end.dims(); + auto step_dims = step.dims(); + PADDLE_ENFORCE_EQ( + start_dims.size(), + 1, + phi::errors::InvalidArgument( + "The dim of the shape of Input(Start) should be 1, but got %d", + start_dims.size())); + + PADDLE_ENFORCE_EQ(start_dims[0], + 1, + phi::errors::InvalidArgument( + "The first dim of the shape of Input(Start) should " + "be 1, but got %d", + start_dims[0])); + PADDLE_ENFORCE_EQ( + end_dims.size(), + 1, + phi::errors::InvalidArgument( + "The dim of the shape of Input(End) should be 1, but got %d", + end_dims.size())); + + PADDLE_ENFORCE_EQ( + end_dims[0], + 1, + phi::errors::InvalidArgument("The first dim of the shape of " + "Input(End) should be 1, but got %d", + end_dims[0])); + PADDLE_ENFORCE_EQ( + step_dims.size(), + 1, + phi::errors::InvalidArgument( + "The dim of the shape of Input(Step) should be 1, but got %d", + step_dims.size())); + + PADDLE_ENFORCE_EQ(step_dims[0], + 1, + phi::errors::InvalidArgument( 
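Editor's aside on the new RandintInferMeta in nullary.cc above: it requires low < high and a non-empty shape attribute, and the output shape is exactly that attribute. The sketch below (hypothetical `RandintOutDims`) is only a restatement of those checks.

```cpp
// Sketch of what RandintInferMeta validates and produces.
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> RandintOutDims(int low, int high,
                                    const std::vector<int64_t>& shape) {
  assert(low < high);      // low must be strictly less than high
  assert(!shape.empty());  // the shape must be supplied via Attr(shape)
  return shape;            // the output shape is Attr(shape) verbatim
}

int main() {
  assert((RandintOutDims(0, 10, {2, 3}) == std::vector<int64_t>{2, 3}));
}
```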
+ "The first dim of the shape of Input(Step) should " + "be 1, but got %d", + step_dims[0])); + out->set_dims({-1}); + out->set_dtype(start.dtype()); +} + void GraphSendRecvInferMeta(const MetaTensor& x, const MetaTensor& src_index, const MetaTensor& dst_index, @@ -226,10 +276,10 @@ void LerpInferMeta(const MetaTensor& x, out->share_lod(x); } -void LinspaceInferMeta(const MetaTensor& start, - const MetaTensor& stop, - const MetaTensor& number, - MetaTensor* out) { +void LinspaceRawInferMeta(const MetaTensor& start, + const MetaTensor& stop, + const MetaTensor& number, + MetaTensor* out) { auto s_dims = start.dims(); PADDLE_ENFORCE_EQ( (s_dims.size() == 1) && (s_dims[0] == 1), @@ -255,6 +305,14 @@ void LinspaceInferMeta(const MetaTensor& start, out->set_dtype(start.dtype()); } +void LinspaceInferMeta(const MetaTensor& start, + const MetaTensor& stop, + const MetaTensor& number, + DataType dtype, + MetaTensor* out) { + LinspaceRawInferMeta(start, stop, number, out); +} + void NllLossRawInferMeta(const MetaTensor& input, const MetaTensor& label, paddle::optional weight, @@ -345,56 +403,6 @@ void PutAlongAxisInferMeta(const MetaTensor& x, out->set_dtype(x.dtype()); } -void RangeInferMeta(const MetaTensor& start, - const MetaTensor& end, - const MetaTensor& step, - MetaTensor* out) { - auto start_dims = start.dims(); - auto end_dims = end.dims(); - auto step_dims = step.dims(); - PADDLE_ENFORCE_EQ( - start_dims.size(), - 1, - phi::errors::InvalidArgument( - "The dim of the shape of Input(Start) should be 1, but got %d", - start_dims.size())); - - PADDLE_ENFORCE_EQ(start_dims[0], - 1, - phi::errors::InvalidArgument( - "The first dim of the shape of Input(Start) should " - "be 1, but got %d", - start_dims[0])); - PADDLE_ENFORCE_EQ( - end_dims.size(), - 1, - phi::errors::InvalidArgument( - "The dim of the shape of Input(End) should be 1, but got %d", - end_dims.size())); - - PADDLE_ENFORCE_EQ( - end_dims[0], - 1, - phi::errors::InvalidArgument("The first dim of the shape of " - "Input(End) should be 1, but got %d", - end_dims[0])); - PADDLE_ENFORCE_EQ( - step_dims.size(), - 1, - phi::errors::InvalidArgument( - "The dim of the shape of Input(Step) should be 1, but got %d", - step_dims.size())); - - PADDLE_ENFORCE_EQ(step_dims[0], - 1, - phi::errors::InvalidArgument( - "The first dim of the shape of Input(Step) should " - "be 1, but got %d", - step_dims[0])); - out->set_dims({-1}); - out->set_dtype(start.dtype()); -} - void RoiAlignInferMeta(const MetaTensor& x, const MetaTensor& boxes, paddle::optional boxes_num, diff --git a/paddle/phi/infermeta/ternary.h b/paddle/phi/infermeta/ternary.h index c18dde42f1ed2..83505f2c2fada 100644 --- a/paddle/phi/infermeta/ternary.h +++ b/paddle/phi/infermeta/ternary.h @@ -47,6 +47,11 @@ void AddmmInferMeta(const MetaTensor& input, float beta, MetaTensor* out); +void ArangeInferMeta(const MetaTensor& start, + const MetaTensor& end, + const MetaTensor& step, + MetaTensor* out); + void GraphSendRecvInferMeta(const MetaTensor& x, const MetaTensor& src_index, const MetaTensor& dst_index, @@ -60,9 +65,15 @@ void LerpInferMeta(const MetaTensor& x, const MetaTensor& weight, MetaTensor* out); +void LinspaceRawInferMeta(const MetaTensor& start, + const MetaTensor& stop, + const MetaTensor& number, + MetaTensor* out); + void LinspaceInferMeta(const MetaTensor& start, const MetaTensor& stop, const MetaTensor& number, + DataType dtype, MetaTensor* out); void NllLossRawInferMeta(const MetaTensor& input, @@ -81,11 +92,6 @@ void PutAlongAxisInferMeta(const MetaTensor& x, const 
std::string& reduce, MetaTensor* out); -void RangeInferMeta(const MetaTensor& start, - const MetaTensor& end, - const MetaTensor& step, - MetaTensor* out); - void RoiAlignInferMeta(const MetaTensor& x, const MetaTensor& boxes, paddle::optional boxes_num, diff --git a/paddle/phi/infermeta/unary.cc b/paddle/phi/infermeta/unary.cc index d763b23ef5c35..c6e2cb761911e 100644 --- a/paddle/phi/infermeta/unary.cc +++ b/paddle/phi/infermeta/unary.cc @@ -24,12 +24,25 @@ limitations under the License. */ #include "paddle/phi/core/infermeta_utils.h" #include "paddle/phi/kernels/funcs/parse_qr_mode.h" #include "paddle/phi/kernels/funcs/pooling.h" +#include "paddle/phi/kernels/funcs/slice_utils.h" #include "paddle/phi/kernels/funcs/strided_slice.h" #include "paddle/phi/kernels/funcs/unfold_functor.h" #include "paddle/phi/kernels/funcs/unsqueeze.h" namespace phi { +namespace detail { +// Used in MatrixRankInferMeta +static DDim CheckAndGetOutputDim(const DDim& dim_x) { + auto x_vec = phi::vectorize(dim_x); + if (x_vec.size() == 2) { + return phi::make_ddim({1}); + } + x_vec.erase(x_vec.end() - 2, x_vec.end()); + return phi::make_ddim(x_vec); +} +} // namespace detail + void ArgMinMaxInferMeta(const MetaTensor& x, int64_t axis, bool keepdims, @@ -360,17 +373,6 @@ void DiagonalInferMeta(const MetaTensor& input, out->set_dims(phi::make_ddim(out_dims)); } -void DropoutInferMeta(const MetaTensor& x, MetaTensor* out, MetaTensor* mask) { - auto x_dims = x.dims(); - out->set_dims(x_dims); - out->share_lod(x); - out->set_dtype(x.dtype()); - - if (mask != nullptr) { - mask->set_dims(x_dims); - } -} - void EighInferMeta(const MetaTensor& x, const std::string& uplo, MetaTensor* out_w, @@ -911,6 +913,29 @@ void MatrixPowerInferMeta(const MetaTensor& x, int n, MetaTensor* out) { out->set_dtype(x.dtype()); } +void MatrixRankInferMeta(const MetaTensor& x, + bool use_default_tol, + bool hermitian, + MetaTensor* out) { + auto dim_x = x.dims(); + PADDLE_ENFORCE_GE( + dim_x.size(), + 2, + phi::errors::InvalidArgument("The dims of input must be greater than 2")); + + if (hermitian) { + int rows = dim_x[dim_x.size() - 2]; + int cols = dim_x[dim_x.size() - 1]; + PADDLE_ENFORCE_EQ(rows, + cols, + phi::errors::InvalidArgument( + "if hermitian == true, matrix should be n*n")); + } + DDim dim_x_batch = detail::CheckAndGetOutputDim(dim_x); + out->set_dims(dim_x_batch); + out->share_lod(x); +} + void MaxOutInferMeta(const MetaTensor& x, int groups, int axis, @@ -1290,6 +1315,36 @@ void PixelShuffleInferMeta(const MetaTensor& x, out->set_dims(output_dims); } +void PixelShuffleGradInferMeta(const MetaTensor& out_grad, + int upscale_factor, + const std::string& data_format, + MetaTensor* x_grad) { + auto do_dims = out_grad.dims(); + PADDLE_ENFORCE_EQ(do_dims.size(), + 4, + phi::errors::InvalidArgument( + "Input should be a 4-D tensor of format [N, C, H, W] " + "or [N, H, W, C], but got %u.", + do_dims.size())); + + const bool channel_last = (data_format == "NHWC"); + + auto dx_dims = do_dims; + dx_dims[0] = do_dims[0]; + + if (!channel_last) { + dx_dims[1] = do_dims[1] * (upscale_factor * upscale_factor); + dx_dims[2] = do_dims[2] / upscale_factor; + dx_dims[3] = do_dims[3] / upscale_factor; + } else { + dx_dims[1] = do_dims[1] / upscale_factor; + dx_dims[2] = do_dims[2] / upscale_factor; + dx_dims[3] = do_dims[3] * (upscale_factor * upscale_factor); + } + x_grad->set_dims(dx_dims); + x_grad->set_dtype(out_grad.dtype()); +} + void PNormInferMeta(const MetaTensor& x, float porder, int axis, @@ -1738,6 +1793,51 @@ void SizeInferMeta(const 
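Editor's aside on PixelShuffleGradInferMeta above: the gradient w.r.t. X inverts the forward reshuffle, so in NCHW the channel count regains the factor r*r while each spatial extent shrinks by r. A standalone sketch (hypothetical `PixelShuffleGradDimsNCHW`):

```cpp
// Sketch of the inverse shape computed by PixelShuffleGradInferMeta (NCHW).
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> PixelShuffleGradDimsNCHW(const std::vector<int64_t>& dout,
                                              int64_t r) {
  return {dout[0], dout[1] * r * r, dout[2] / r, dout[3] / r};
}

int main() {
  // Forward pixel_shuffle with r = 2 maps {8, 16, 14, 14} -> {8, 4, 28, 28};
  // the gradient shape goes back the other way.
  assert((PixelShuffleGradDimsNCHW({8, 4, 28, 28}, 2) ==
          std::vector<int64_t>{8, 16, 14, 14}));
}
```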
MetaTensor& input, MetaTensor* out) { out->set_dims({1}); } +void SliceRawInferMeta(const MetaTensor& input, + const std::vector& axes, + const IntArray& starts_arr, + const IntArray& ends_arr, + const std::vector& infer_flags_t, + const std::vector& decrease_axis, + MetaTensor* out, + MetaConfig config) { + auto in_dims = input.dims(); + PADDLE_ENFORCE_LT( + in_dims.size(), + 7, + phi::errors::InvalidArgument("The rank of input should be less than 7.")); + DDim out_dims(in_dims); + + std::vector infer_flags = infer_flags_t; + if (infer_flags.empty()) { + // Initialize infer_flags with 1. + // To be compatible with other op tests in which infer_flags is not set. + infer_flags = std::vector(axes.size(), 1); + } + + // 2.1 Check attrs. + std::vector starts = starts_arr.GetData(); + std::vector ends = ends_arr.GetData(); + + phi::funcs::CheckAndUpdateSliceAttrs( + in_dims, axes, &starts, &ends, nullptr, &infer_flags); + + auto slice_dims = phi::funcs::GetSliceDims( + in_dims, axes, starts, ends, nullptr, &infer_flags); + if (config.is_runtime) { + out_dims = phi::funcs::GetDecreasedDims( + slice_dims, decrease_axis, &infer_flags); + } else { + out_dims = phi::funcs::GetDecreasedDims( + slice_dims, decrease_axis, nullptr); + } + + out->set_dims(out_dims); + if (axes.size() > 0 && axes[0] != 0) { + out->share_lod(input); + } +} + void SoftmaxInferMeta(const MetaTensor& x, int axis, MetaTensor* out) { auto dim_x = x.dims(); auto rank_x = dim_x.size(); @@ -1922,15 +2022,15 @@ void SqueezeInferMeta(const MetaTensor& x, out->set_dtype(x.dtype()); } -void StridedSliceInferMeta(const MetaTensor& x, - const std::vector& axes, - const IntArray& starts, - const IntArray& ends, - const IntArray& strides, - const std::vector& infer_flags, - const std::vector& decrease_axis, - MetaTensor* out, - MetaConfig config) { +void StridedSliceRawInferMeta(const MetaTensor& x, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + const std::vector& infer_flags, + const std::vector& decrease_axis, + MetaTensor* out, + MetaConfig config) { auto in_dims = x.dims(); PADDLE_ENFORCE_LT( in_dims.size(), @@ -2052,6 +2152,19 @@ void StridedSliceInferMeta(const MetaTensor& x, out->set_dtype(x.dtype()); } +void StridedSliceInferMeta(const MetaTensor& x, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + MetaTensor* out, + MetaConfig config) { + std::vector infer_flags(axes.size(), 1); + std::vector decrease_axis; + StridedSliceRawInferMeta( + x, axes, starts, ends, strides, infer_flags, decrease_axis, out, config); +} + /* Why not use SumRawInferMeta directly? 
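Editor's aside on the new SliceRawInferMeta above: the heavy lifting is done by CheckAndUpdateSliceAttrs/GetSliceDims/GetDecreasedDims, but the essence is clamping start/end per sliced axis and then dropping decreased axes. The sketch below is a deliberately simplified stand-in (hypothetical `SliceDims`, ignoring strides and infer_flags), not the phi helpers.

```cpp
// Simplified sketch of the slice shape arithmetic.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> SliceDims(std::vector<int64_t> dims,
                               const std::vector<int64_t>& axes,
                               const std::vector<int64_t>& starts,
                               const std::vector<int64_t>& ends,
                               const std::vector<int64_t>& decrease_axis) {
  for (size_t i = 0; i < axes.size(); ++i) {
    const int64_t len = dims[axes[i]];
    int64_t s = starts[i] < 0 ? starts[i] + len : starts[i];
    int64_t e = ends[i] < 0 ? ends[i] + len : ends[i];
    s = std::clamp(s, int64_t{0}, len);
    e = std::clamp(e, int64_t{0}, len);
    dims[axes[i]] = std::max<int64_t>(e - s, 0);
  }
  // Axes listed in decrease_axis (length 1 after slicing) are removed.
  std::vector<int64_t> out;
  for (int64_t i = 0; i < static_cast<int64_t>(dims.size()); ++i) {
    if (std::find(decrease_axis.begin(), decrease_axis.end(), i) ==
        decrease_axis.end()) {
      out.push_back(dims[i]);
    }
  }
  return out;
}

int main() {
  // x[2:5, :, 7] on a {10, 4, 8} tensor -> {3, 4} once axis 2 is decreased.
  assert((SliceDims({10, 4, 8}, {0, 2}, {2, 7}, {5, 8}, {2}) ==
          std::vector<int64_t>{3, 4}));
}
```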
Because we need make InferMetaFunction's args follow the design of api.yaml */ @@ -2089,6 +2202,52 @@ void SumRawInferMeta(const MetaTensor& x, out->set_layout(x.layout()); } +void TemporalShiftInferMeta(const MetaTensor& x, + int seg_num, + float shift_ratio, + const std::string& data_format, + MetaTensor* out, + MetaConfig config) { + auto dim_x = x.dims(); + PADDLE_ENFORCE_EQ(dim_x.size(), + 4, + phi::errors::InvalidArgument( + "Input(X) rank should be 4 in shape of [N*T, C, H, " + "W], but received X rank(%d)", + dim_x.size())); + + PADDLE_ENFORCE_GT( + seg_num, + 0, + phi::errors::InvalidArgument( + "Attr(seg_num) should be greater than 0, but received %d", seg_num)); + PADDLE_ENFORCE_GT( + shift_ratio, + 0., + phi::errors::InvalidArgument( + "Attr(shift_ratio) should be greater than 0, but received %d", + shift_ratio)); + PADDLE_ENFORCE_LT( + shift_ratio, + 0.5, + phi::errors::InvalidArgument( + "Attr(shift_ratio) should be less than 0.5, but received %d", + shift_ratio)); + + if (config.is_runtime) { + PADDLE_ENFORCE_EQ(dim_x[0] % seg_num, + 0, + phi::errors::InvalidArgument( + "Input(X) dimension[0] should be divided exactly " + "by Attr(seg_num), but received X dimension[0](%d) " + "mod seg_num(%d) != 0", + dim_x[0], + seg_num)); + } + + out->share_meta(x); +} + void TileInferMeta(const MetaTensor& x, const IntArray& repeat_times, MetaTensor* out, @@ -2335,7 +2494,7 @@ void TransposeGradInferMeta(const MetaTensor& x, void UnbindInferMeta(const MetaTensor& x, int axis, - std::vector* outs) { + std::vector outs) { auto in_dims = x.dims(); std::vector out_dim; axis = axis < 0 ? in_dims.size() + axis : axis; @@ -2344,11 +2503,11 @@ void UnbindInferMeta(const MetaTensor& x, } auto out_dims = phi::make_ddim(out_dim); - for (size_t i = 0; i < outs->size(); ++i) { - (*outs)[i].set_dtype(x.dtype()); - (*outs)[i].set_dims(out_dims); - (*outs)[i].set_layout(x.layout()); - (*outs)[i].share_lod(x); + for (size_t i = 0; i < outs.size(); ++i) { + outs[i]->set_dtype(x.dtype()); + outs[i]->set_dims(out_dims); + outs[i]->set_layout(x.layout()); + outs[i]->share_lod(x); } } diff --git a/paddle/phi/infermeta/unary.h b/paddle/phi/infermeta/unary.h index 7ab0f3df2af32..c49e4c88dd899 100644 --- a/paddle/phi/infermeta/unary.h +++ b/paddle/phi/infermeta/unary.h @@ -80,8 +80,6 @@ void DiagInferMeta(const MetaTensor& x, void DiagonalInferMeta( const MetaTensor& input, int offset, int axis1, int axis2, MetaTensor* out); -void DropoutInferMeta(const MetaTensor& x, MetaTensor* out, MetaTensor* mask); - void EighInferMeta(const MetaTensor& x, const std::string& uplo, MetaTensor* out_w, @@ -144,6 +142,11 @@ void LogsumexpInferMeta(const MetaTensor& input, void MatrixPowerInferMeta(const MetaTensor& x, int n, MetaTensor* out); +void MatrixRankInferMeta(const MetaTensor& x, + bool use_default_tol, + bool hermitian, + MetaTensor* out); + void MaxOutInferMeta(const MetaTensor& x, int groups, int axis, @@ -197,6 +200,11 @@ void PixelShuffleInferMeta(const MetaTensor& x, const std::string& data_format, MetaTensor* out); +void PixelShuffleGradInferMeta(const MetaTensor& out_grad, + int upscale_factor, + const std::string& data_format, + MetaTensor* x_grad); + void PNormInferMeta(const MetaTensor& x, float porder, int axis, @@ -271,6 +279,15 @@ void ShardIndexInferMeta(const MetaTensor& in, void SizeInferMeta(const MetaTensor& input, MetaTensor* out); +void SliceRawInferMeta(const MetaTensor& input, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const std::vector& infer_flags, + const 
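Editor's aside on TemporalShiftInferMeta above: the output keeps X's meta unchanged, so the function is almost entirely argument validation. The sketch below (hypothetical `TemporalShiftArgsOk`) collects those constraints in one predicate, including the divisibility check that the patch only performs at runtime.

```cpp
// Sketch of the constraints checked by TemporalShiftInferMeta.
#include <cassert>
#include <cstdint>
#include <vector>

bool TemporalShiftArgsOk(const std::vector<int64_t>& x_dims, int seg_num,
                         float shift_ratio) {
  return x_dims.size() == 4 && seg_num > 0 && shift_ratio > 0.f &&
         shift_ratio < 0.5f && x_dims[0] % seg_num == 0;
}

int main() {
  // 8 clips of 4 frames each flattened into N*T = 32 along the batch axis.
  assert(TemporalShiftArgsOk({32, 3, 224, 224}, 4, 0.25f));
  assert(!TemporalShiftArgsOk({30, 3, 224, 224}, 4, 0.25f));  // 30 % 4 != 0
}
```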
std::vector& decrease_axis, + MetaTensor* out, + MetaConfig config = MetaConfig()); + void SoftmaxInferMeta(const MetaTensor& x, int axis, MetaTensor* out); void SplitInferMeta(const MetaTensor& x_meta, @@ -284,13 +301,21 @@ void SqueezeInferMeta(const MetaTensor& x, MetaTensor* xshape, MetaTensor* out); +void StridedSliceRawInferMeta(const MetaTensor& x, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + const std::vector& infer_flags, + const std::vector& decrease_axis, + MetaTensor* out, + MetaConfig config = MetaConfig()); + void StridedSliceInferMeta(const MetaTensor& x, const std::vector& axes, const IntArray& starts, const IntArray& ends, const IntArray& strides, - const std::vector& infer_flags, - const std::vector& decrease_axis, MetaTensor* out, MetaConfig config = MetaConfig()); @@ -307,6 +332,13 @@ void SumRawInferMeta(const MetaTensor& x, DataType dtype, MetaTensor* out); +void TemporalShiftInferMeta(const MetaTensor& x, + int seg_num, + float shift_ratio, + const std::string& data_format, + MetaTensor* out, + MetaConfig config = MetaConfig()); + void TileInferMeta(const MetaTensor& x, const IntArray& repeat_times, MetaTensor* out, @@ -343,7 +375,7 @@ void TrilTriuInferMeta(const MetaTensor& x, void UnbindInferMeta(const MetaTensor& x, int axis, - std::vector* outs); + std::vector outs); void UnchangedInferMeta(const MetaTensor& x, MetaTensor* out); diff --git a/paddle/phi/kernels/CMakeLists.txt b/paddle/phi/kernels/CMakeLists.txt index b0d762d00ecf9..d4b832cef0bd2 100644 --- a/paddle/phi/kernels/CMakeLists.txt +++ b/paddle/phi/kernels/CMakeLists.txt @@ -27,7 +27,7 @@ kernel_library(full_kernel DEPS ${COMMON_KERNEL_DEPS} empty_kernel) # Some kernels depend on some targets that are not commonly used. # These targets are not suitable for common dependencies. # In this case, you need to manually generate them here. 
-set(MANUAL_BUILD_KERNELS adam_kernel adamw_kernel deformable_conv_kernel deformable_conv_grad_kernel eigh_kernel +set(MANUAL_BUILD_KERNELS cross_entropy_kernel adam_kernel adamw_kernel deformable_conv_kernel deformable_conv_grad_kernel eigh_kernel gumbel_softmax_kernel gumbel_softmax_grad_kernel hierarchical_sigmoid_kernel hierarchical_sigmoid_grad_kernel matrix_power_kernel matrix_power_grad_kernel maxout_kernel maxout_grad_kernel pool_kernel put_along_axis_kernel put_along_axis_grad_kernel segment_pool_kernel segment_pool_grad_kernel @@ -35,8 +35,10 @@ set(MANUAL_BUILD_KERNELS adam_kernel adamw_kernel deformable_conv_kernel deforma triangular_solve_grad_kernel determinant_grad_kernel reduce_kernel rnn_kernel rnn_grad_kernel warpctc_kernel warpctc_grad_kernel) kernel_library(adam_kernel DEPS gflags glog flags ${COMMON_KERNEL_DEPS} selected_rows_functor threadpool jit_kernel_helper) kernel_library(adamw_kernel DEPS ${COMMON_KERNEL_DEPS} adam_kernel) +kernel_library(cross_entropy_kernel DEPS ${COMMON_KERNEL_DEPS} softmax cross_entropy) kernel_library(deformable_conv_kernel DEPS ${COMMON_KERNEL_DEPS} deformable_conv_functor) kernel_library(deformable_conv_grad_kernel DEPS ${COMMON_KERNEL_DEPS} deformable_conv_functor) +kernel_library(determinant_grad_kernel DEPS ${COMMON_KERNEL_DEPS} matrix_inverse) kernel_library(eigh_kernel DEPS ${COMMON_KERNEL_DEPS} lapack_function) kernel_library(hierarchical_sigmoid_kernel DEPS ${COMMON_KERNEL_DEPS} matrix_bit_code) kernel_library(hierarchical_sigmoid_grad_kernel DEPS ${COMMON_KERNEL_DEPS} matrix_bit_code) @@ -57,7 +59,6 @@ kernel_library(softmax_grad_kernel DEPS ${COMMON_KERNEL_DEPS} softmax) kernel_library(take_along_axis_kernel DEPS ${COMMON_KERNEL_DEPS} gather_scatter_kernel) kernel_library(take_along_axis_grad_kernel DEPS ${COMMON_KERNEL_DEPS} gather_scatter_kernel) kernel_library(triangular_solve_grad_kernel DEPS ${COMMON_KERNEL_DEPS} matrix_reduce) -kernel_library(determinant_grad_kernel DEPS ${COMMON_KERNEL_DEPS} matrix_inverse) kernel_library(rnn_kernel DEPS ${COMMON_KERNEL_DEPS} concat_and_split_functor lstm_compute gru_compute) kernel_library(rnn_grad_kernel DEPS ${COMMON_KERNEL_DEPS} concat_and_split_functor lstm_compute gru_compute) kernel_library(warpctc_kernel DEPS ${COMMON_KERNEL_DEPS} phi_dynload_warpctc sequence_padding sequence_scale) diff --git a/paddle/phi/kernels/activation_grad_kernel.h b/paddle/phi/kernels/activation_grad_kernel.h index be6f97ad7c96e..065d018852267 100644 --- a/paddle/phi/kernels/activation_grad_kernel.h +++ b/paddle/phi/kernels/activation_grad_kernel.h @@ -125,18 +125,18 @@ void EluDoubleGradKernel(const Context& dev_ctx, template void SigmoidDoubleGradKernel(const Context& dev_ctx, const DenseTensor& out, - const DenseTensor& ddx, const DenseTensor& dout, + const DenseTensor& ddx, DenseTensor* dout_new, DenseTensor* ddout); template void SigmoidTripleGradKernel(const Context& dev_ctx, const DenseTensor& out, - const DenseTensor& ddx, const DenseTensor& dout, - const DenseTensor& d_ddout, + const DenseTensor& ddx, const DenseTensor& d_dout_new, + const DenseTensor& d_ddout, DenseTensor* d_out_new, DenseTensor* d_dout, DenseTensor* d_ddx); @@ -197,6 +197,7 @@ DECLARE_ACT_GRAD_KERNEL_WITH_ONE_ATTRS_DEPX(ThresholdedRelu, threshold); DECLARE_ACT_GRAD_KERNEL_WITH_ONE_ATTRS_DEPX(SoftShrink, lambda); DECLARE_ACT_GRAD_KERNEL_WITH_ONE_ATTRS_DEPX(HardShrink, threshold); DECLARE_ACT_GRAD_KERNEL_WITH_ONE_ATTRS_DEPX(Swish, beta); +DECLARE_ACT_GRAD_KERNEL_WITH_ONE_ATTRS_DEPX(Logit, eps); 
DECLARE_ACT_GRAD_KERNEL_WITH_TWO_ATTRS_DEPX(BRelu, t_min, t_max); diff --git a/paddle/phi/kernels/range_kernel.h b/paddle/phi/kernels/arange_kernel.h similarity index 78% rename from paddle/phi/kernels/range_kernel.h rename to paddle/phi/kernels/arange_kernel.h index c76308193ae5e..be60824ac2be2 100644 --- a/paddle/phi/kernels/range_kernel.h +++ b/paddle/phi/kernels/arange_kernel.h @@ -19,10 +19,10 @@ namespace phi { template -void RangeKernel(const Context& dev_ctx, - const DenseTensor& start, - const DenseTensor& end, - const DenseTensor& step, - DenseTensor* out); +void ArangeKernel(const Context& dev_ctx, + const DenseTensor& start, + const DenseTensor& end, + const DenseTensor& step, + DenseTensor* out); } // namespace phi diff --git a/paddle/phi/kernels/autotune/CMakeLists.txt b/paddle/phi/kernels/autotune/CMakeLists.txt index a3fb9a06fe671..b933e0993deef 100644 --- a/paddle/phi/kernels/autotune/CMakeLists.txt +++ b/paddle/phi/kernels/autotune/CMakeLists.txt @@ -6,4 +6,6 @@ elseif (WITH_ROCM) hip_test(auto_tune_test SRCS auto_tune_test.cu DEPS gtest) endif() -cc_test(cache_test SRCS cache_test.cc DEPS gtest) +cc_library(cache SRCS cache.cc DEPS boost) + +cc_test(cache_test SRCS cache_test.cc DEPS gtest cache) diff --git a/paddle/phi/kernels/autotune/cache.cc b/paddle/phi/kernels/autotune/cache.cc new file mode 100644 index 0000000000000..bf68e2010151b --- /dev/null +++ b/paddle/phi/kernels/autotune/cache.cc @@ -0,0 +1,36 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
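The header above is renamed from range_kernel.h to arange_kernel.h and the declaration from RangeKernel to ArangeKernel; the CPU implementation is renamed the same way later in this diff. For orientation, a standalone reference of the arithmetic an arange kernel performs, assuming a NumPy-style half-open range with ceil-based sizing (the exact rounding used by the real kernel is not shown here):

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<float> ArangeReference(float start, float end, float step) {
  // Element count: ceil(|(end - start) / step|); values: start + i * step.
  const int64_t size =
      static_cast<int64_t>(std::ceil(std::abs((end - start) / step)));
  std::vector<float> out;
  out.reserve(static_cast<std::size_t>(size));
  for (int64_t i = 0; i < size; ++i) {
    out.push_back(start + step * static_cast<float>(i));
  }
  return out;  // ArangeReference(0.f, 10.f, 3.f) -> {0, 3, 6, 9}
}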
+ +#include "paddle/phi/kernels/autotune/cache.h" + +namespace phi { +namespace autotune { + +// Define the cache key of operator +size_t ConvKey(const std::vector& x_dims, + const std::vector& w_dims, + const std::vector& strides, + const std::vector& paddings, + const std::vector& dilations, + phi::DataType dtype) { + return GetKey(x_dims, + w_dims, + strides, + paddings, + dilations, + static_cast(dtype)); +} + +} // namespace autotune +} // namespace phi diff --git a/paddle/phi/kernels/autotune/cache.h b/paddle/phi/kernels/autotune/cache.h index 990843e58f7f2..d492e7c151f91 100644 --- a/paddle/phi/kernels/autotune/cache.h +++ b/paddle/phi/kernels/autotune/cache.h @@ -64,14 +64,7 @@ size_t ConvKey(const std::vector& x_dims, const std::vector& strides, const std::vector& paddings, const std::vector& dilations, - phi::DataType dtype) { - return GetKey(x_dims, - w_dims, - strides, - paddings, - dilations, - static_cast(dtype)); -} + phi::DataType dtype); template class AlgorithmsCache { @@ -104,14 +97,21 @@ class AlgorithmsCache { hash_[key] = algo; } + int64_t CacheMisses() const { return cache_misses_; } + + int64_t CacheHits() const { return cache_hits_; } + float CacheHitRate() const { int64_t num_accesses = cache_hits_ + cache_misses_; - float cache_hit_rate = - static_cast(cache_hits_) / static_cast(num_accesses); + float cache_hit_rate = 0.; + if (num_accesses != 0) { + cache_hit_rate = + static_cast(cache_hits_) / static_cast(num_accesses); + } return cache_hit_rate; } - int64_t Size() { return hash_.size(); } + int64_t Size() const { return hash_.size(); } private: std::unordered_map hash_; @@ -142,20 +142,58 @@ class AutoTuneCache { return auto_tune_map_[algo_type]; } - // The number of total config cached - int64_t Size() { - int64_t total = 0; + void Clean(float miss_rate) { + std::lock_guard lock(*autotune_cache_mutex_); + // Set a small tolerance to avoid performance degradation + // due to large cache size under dynamic shape. 
+ if (miss_rate > 0.01) { + auto_tune_map_.clear(); + } + } + + void UpdateStatus() { + int64_t size = 0; + int64_t cache_hits = 0; + int64_t cache_misses = 0; for (auto& v : auto_tune_map_) { - VLOG(3) << v.first << " " << v.second.Size(); - total += v.second.Size(); + VLOG(4) << "AlgoType: " << v.first << " Cache Size: " << v.second.Size() + << " Hits: " << v.second.CacheHits() + << " Misses: " << v.second.CacheMisses() + << " Hit Rate: " << v.second.CacheHitRate(); + size += v.second.Size(); + cache_hits += v.second.CacheHits(); + cache_misses += v.second.CacheMisses(); } - return total; + total_size_ = size; + total_cache_hits_ = cache_hits; + total_cache_misses_ = cache_misses; + } + + // The number of total config cached + int64_t Size() const { return total_size_; } + + int64_t CacheHits() const { return total_cache_hits_; } + + int64_t CacheMisses() const { return total_cache_misses_; } + + float CacheHitRate() const { + float total_cache_hit_rate = 0.; + int64_t total_num_accesses = total_cache_hits_ + total_cache_misses_; + if (total_num_accesses != 0) { + total_cache_hit_rate = static_cast(total_cache_hits_) / + static_cast(total_num_accesses); + } + + return total_cache_hit_rate; } private: AutoTuneCache() : autotune_cache_mutex_(new std::mutex()) {} AlgorithmsTypeMap auto_tune_map_; std::shared_ptr autotune_cache_mutex_; + int64_t total_cache_hits_ = 0; + int64_t total_cache_misses_ = 0; + int64_t total_size_ = 0; }; } // namespace autotune diff --git a/paddle/phi/kernels/autotune/cache_test.cc b/paddle/phi/kernels/autotune/cache_test.cc index 9fcd9b796d0ae..92ba411624fc0 100644 --- a/paddle/phi/kernels/autotune/cache_test.cc +++ b/paddle/phi/kernels/autotune/cache_test.cc @@ -46,8 +46,15 @@ TEST(AlgosCache, AlgosCache) { EXPECT_EQ(cache.Find(key), false); cache.Set(key, ConvAlgos::CuDNNKernel_1); EXPECT_EQ(cache.Size(), 2); - EXPECT_EQ(autotune_cache.Size(), 2); + EXPECT_EQ(cache.CacheHits(), 1); + EXPECT_EQ(cache.CacheMisses(), 2); float cache_hit_rate = static_cast(1) / static_cast(3); EXPECT_LT(std::abs(cache_hit_rate - cache.CacheHitRate()), 1e-5); + + autotune_cache.UpdateStatus(); + EXPECT_EQ(autotune_cache.Size(), 2); + EXPECT_EQ(autotune_cache.CacheHits(), 1); + EXPECT_EQ(autotune_cache.CacheMisses(), 2); + EXPECT_LT(std::abs(cache_hit_rate - autotune_cache.CacheHitRate()), 1e-5); } diff --git a/paddle/phi/kernels/autotune/switch_autotune.h b/paddle/phi/kernels/autotune/switch_autotune.h new file mode 100644 index 0000000000000..2f9621ed2079e --- /dev/null +++ b/paddle/phi/kernels/autotune/switch_autotune.h @@ -0,0 +1,130 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
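The CacheHitRate change in cache.h above guards the division for the case where the cache has never been queried, and cache_test.cc now checks the aggregated counters (one hit, two misses, so a hit rate of 1/3). The same arithmetic as a standalone snippet:

#include <cstdint>

float HitRate(int64_t hits, int64_t misses) {
  const int64_t accesses = hits + misses;
  float rate = 0.f;  // report 0 when the cache was never queried
  if (accesses != 0) {
    rate = static_cast<float>(hits) / static_cast<float>(accesses);
  }
  return rate;  // HitRate(1, 2) == 1.0f / 3.0f, matching the test above
}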
+ +#pragma once +#include +#include +#include +#include "glog/logging.h" +#include "paddle/phi/kernels/autotune/cache.h" + +namespace phi { +namespace autotune { + +class AutoTuneStatus { + public: + static AutoTuneStatus& Instance() { + static AutoTuneStatus switch_autotune; + return switch_autotune; + } + + bool UseAutoTune() { return use_autotune_; } + + // EnableAutoTune and DisableAutoTune Should be used for debug only. + void EnableAutoTune() { + use_autotune_ = true; + Init(); + } + + void DisableAutoTune() { + use_autotune_ = false; + Init(); + } + + void Update() { + current_steps_id_ += 1; + + if (!use_autotune_ && !update_use_autotune_) { + return; + } + + if (current_steps_id_ < start_step_id_) { + use_autotune_ = false; + } else if (current_steps_id_ >= start_step_id_ && + current_steps_id_ < stop_step_id_) { + use_autotune_ = true; + AutoTuneCache::Instance().UpdateStatus(); + step_hit_rates_.push_back(StepHitRate()); + VLOG(3) << "Step ID " << current_steps_id_ + << ", Accumulative Cache Hit Rate: " + << AutoTuneCache::Instance().CacheHitRate() + << ", Cache Size: " << AutoTuneCache::Instance().Size() + << ", Current Step Hit Rate: " << StepHitRate(); + } else if (current_steps_id_ == stop_step_id_) { + use_autotune_ = false; + update_use_autotune_ = false; + // clean cache according miss rate + float miss_rate = static_cast(1) - RecentHitRate(); + AutoTuneCache::Instance().Clean(miss_rate); + VLOG(3) << "Recent Miss Rate: " << miss_rate; + } + } + + int64_t StepID() { return current_steps_id_; } + + float RecentHitRate() { + int recent_step_nums = std::ceil(step_hit_rates_.size() * 0.3); + float sum = std::accumulate(step_hit_rates_.rbegin(), + step_hit_rates_.rbegin() + recent_step_nums, + 0.0); + float mean = sum / recent_step_nums; + return mean; + } + + // Hit Rate of Current Step + float StepHitRate() { + int64_t current_hits = AutoTuneCache::Instance().CacheHits(); + int64_t current_misses = AutoTuneCache::Instance().CacheMisses(); + int64_t step_hits_ = current_hits - previous_hits_; + int64_t step_misses_ = current_misses - previous_misses_; + float step_hit_rate = 0.; + int64_t step_num_accesses = step_hits_ + step_misses_; + if (step_num_accesses != 0) { + step_hit_rate = static_cast(step_hits_) / + static_cast(step_num_accesses); + } + previous_hits_ = current_hits; + previous_misses_ = current_misses; + return step_hit_rate; + } + + void SetAutoTuneRange(int64_t start, int64_t stop) { + start_step_id_ = start; + stop_step_id_ = stop; + } + + private: + AutoTuneStatus() = default; + + void Init() { + update_use_autotune_ = use_autotune_; + current_steps_id_ = -1; + previous_hits_ = 0; + previous_misses_ = 0; + step_hit_rates_.clear(); + AutoTuneCache::Instance().Clean(1.0); + } + + int64_t start_step_id_ = 0; + int64_t stop_step_id_ = 10; + int64_t current_steps_id_ = -1; + bool use_autotune_ = false; + bool update_use_autotune_ = false; + int64_t previous_hits_ = 0; + int64_t previous_misses_ = 0; + std::vector step_hit_rates_; +}; + +} // namespace autotune +} // namespace phi diff --git a/paddle/phi/kernels/batch_norm_grad_kernel.h b/paddle/phi/kernels/batch_norm_grad_kernel.h index c15dbd2f63f58..73752f015ca3a 100644 --- a/paddle/phi/kernels/batch_norm_grad_kernel.h +++ b/paddle/phi/kernels/batch_norm_grad_kernel.h @@ -21,15 +21,15 @@ namespace phi { template void BatchNormGradRawKernel(const Context& dev_ctx, - const DenseTensor& y_grad, const DenseTensor& x, const DenseTensor& scale, const DenseTensor& bias, + paddle::optional mean, + paddle::optional 
variance, const DenseTensor& saved_mean, const DenseTensor& saved_variance, paddle::optional reserve_space, - paddle::optional mean, - paddle::optional variance, + const DenseTensor& y_grad, float momentum, float epsilon, const std::string& data_layout, @@ -44,15 +44,15 @@ void BatchNormGradRawKernel(const Context& dev_ctx, template void BatchNormGradKernel(const Context& dev_ctx, - const DenseTensor& y_grad, const DenseTensor& x, const DenseTensor& scale, const DenseTensor& bias, + paddle::optional mean, + paddle::optional variance, const DenseTensor& saved_mean, const DenseTensor& saved_variance, paddle::optional reserve_space, - paddle::optional mean, - paddle::optional variance, + const DenseTensor& y_grad, float momentum, float epsilon, const std::string& data_layout, diff --git a/paddle/pten/kernels/slice_kernel.h b/paddle/phi/kernels/clip_grad_kernel.h similarity index 63% rename from paddle/pten/kernels/slice_kernel.h rename to paddle/phi/kernels/clip_grad_kernel.h index c2a96312cdd5e..8a7e5b99fd924 100644 --- a/paddle/pten/kernels/slice_kernel.h +++ b/paddle/phi/kernels/clip_grad_kernel.h @@ -14,19 +14,18 @@ #pragma once -#include "paddle/phi/common/int_array.h" +#include "paddle/phi/common/scalar.h" #include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/device_context.h" namespace phi { template -void SliceRawKernel(const Context& ctx, - const DenseTensor& input, - const std::vector& axes, - const IntArray& starts, - const IntArray& ends, - const std::vector& infer_flags, - const std::vector& decrease_axis, - DenseTensor* out); +void ClipGradKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& out_grad, + const Scalar& min, + const Scalar& max, + DenseTensor* x_grad); } // namespace phi diff --git a/paddle/phi/kernels/clip_kernel.h b/paddle/phi/kernels/clip_kernel.h new file mode 100644 index 0000000000000..14ac8342e03bc --- /dev/null +++ b/paddle/phi/kernels/clip_kernel.h @@ -0,0 +1,31 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/phi/common/scalar.h" +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/device_context.h" +#include "paddle/phi/core/selected_rows.h" + +namespace phi { + +template +void ClipKernel(const Context& dev_ctx, + const DenseTensor& x, + const Scalar& min, + const Scalar& max, + DenseTensor* out); + +} // namespace phi diff --git a/paddle/phi/kernels/concat_grad_kernel.h b/paddle/phi/kernels/concat_grad_kernel.h new file mode 100644 index 0000000000000..e407d73bb49ee --- /dev/null +++ b/paddle/phi/kernels/concat_grad_kernel.h @@ -0,0 +1,30 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/phi/common/scalar.h" +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/infermeta/multiary.h" +#include "paddle/phi/kernels/empty_kernel.h" +namespace phi { + +template +void ConcatGradKernel(const Context& dev_ctx, + const std::vector& x, + const DenseTensor& out_grad, + const Scalar& axis_scalar, + std::vector x_grad); + +} // namespace phi diff --git a/paddle/phi/kernels/conv_grad_kernel.h b/paddle/phi/kernels/conv_grad_kernel.h index bad30989ac90d..a6b970e0996be 100644 --- a/paddle/phi/kernels/conv_grad_kernel.h +++ b/paddle/phi/kernels/conv_grad_kernel.h @@ -20,9 +20,9 @@ namespace phi { template void ConvGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, @@ -37,9 +37,9 @@ void ConvGradKernel(const Context& dev_ctx, template void Conv3DGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, @@ -54,9 +54,9 @@ void Conv3DGradKernel(const Context& dev_ctx, template void DepthwiseConvGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, diff --git a/paddle/phi/kernels/copy_kernel.h b/paddle/phi/kernels/copy_kernel.h index 95df29f7e653a..21b59d8d11b8d 100644 --- a/paddle/phi/kernels/copy_kernel.h +++ b/paddle/phi/kernels/copy_kernel.h @@ -15,7 +15,6 @@ limitations under the License. */ #pragma once #include "paddle/phi/core/dense_tensor.h" -#include "paddle/phi/core/sparse_csr_tensor.h" namespace phi { diff --git a/paddle/phi/kernels/cpu/accuracy_kernel.cc b/paddle/phi/kernels/cpu/accuracy_kernel.cc index c57ec69b73a23..6ff8a1f755897 100644 --- a/paddle/phi/kernels/cpu/accuracy_kernel.cc +++ b/paddle/phi/kernels/cpu/accuracy_kernel.cc @@ -69,4 +69,7 @@ void AccuracyRawKernel(const Context& dev_ctx, // TODO(add supported dtype.) 
PD_REGISTER_KERNEL( - accuracy, CPU, ALL_LAYOUT, phi::AccuracyRawKernel, float, double) {} + accuracy, CPU, ALL_LAYOUT, phi::AccuracyRawKernel, float, double) { + kernel->InputAt(1).SetDataType(phi::DataType::INT64); + kernel->InputAt(2).SetDataType(phi::DataType::INT64); +} diff --git a/paddle/phi/kernels/cpu/allclose_kernel.cc b/paddle/phi/kernels/cpu/allclose_kernel.cc index 7ffeadfeed8aa..80dea561956cf 100644 --- a/paddle/phi/kernels/cpu/allclose_kernel.cc +++ b/paddle/phi/kernels/cpu/allclose_kernel.cc @@ -29,21 +29,28 @@ void AllCloseKernel(const Context& dev_ctx, const Scalar& atol, bool equal_nan, DenseTensor* out) { - PADDLE_ENFORCE_EQ( - rtol.dtype(), - DataType::FLOAT64, - phi::errors::InvalidArgument( - "Input (Rtol) type must be double, but get %s.", rtol.dtype())); - PADDLE_ENFORCE_EQ( - atol.dtype(), - DataType::FLOAT64, - phi::errors::InvalidArgument( - "Input (Atol) type must be double, but get %s.", atol.dtype())); - + double rtol_v, atol_v; + if (rtol.dtype() == DataType::FLOAT64) { + rtol_v = rtol.to(); + } else if (rtol.dtype() == DataType::FLOAT32) { + rtol_v = rtol.to(); + } else { + PADDLE_THROW(phi::errors::InvalidArgument( + "Input (Rtol) type must be double or float, but get %s.", + rtol.dtype())); + } + if (atol.dtype() == DataType::FLOAT64) { + atol_v = atol.to(); + } else if (atol.dtype() == DataType::FLOAT32) { + atol_v = atol.to(); + } else { + PADDLE_THROW(phi::errors::InvalidArgument( + "Input (Atol) type must be double or float, but get %s.", + atol.dtype())); + } + VLOG(3) << "rtol and atol is : " << rtol_v << " " << atol_v; auto* in_a = x.data(); auto* in_b = y.data(); - auto rtol_v = rtol.to(); - auto atol_v = atol.to(); auto* out_data = dev_ctx.template Alloc(out); *out_data = true; diff --git a/paddle/phi/kernels/cpu/range_kernel.cc b/paddle/phi/kernels/cpu/arange_kernel.cc similarity index 78% rename from paddle/phi/kernels/cpu/range_kernel.cc rename to paddle/phi/kernels/cpu/arange_kernel.cc index 8731696f61760..478251b0d3b6a 100644 --- a/paddle/phi/kernels/cpu/range_kernel.cc +++ b/paddle/phi/kernels/cpu/arange_kernel.cc @@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ -#include "paddle/phi/kernels/range_kernel.h" +#include "paddle/phi/kernels/arange_kernel.h" #include "paddle/phi/backends/cpu/cpu_context.h" #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/funcs/range_function.h" @@ -20,11 +20,11 @@ limitations under the License. 
*/ namespace phi { template -void RangeKernel(const Context& dev_ctx, - const DenseTensor& start, - const DenseTensor& end, - const DenseTensor& step, - DenseTensor* out) { +void ArangeKernel(const Context& dev_ctx, + const DenseTensor& start, + const DenseTensor& end, + const DenseTensor& step, + DenseTensor* out) { T start_value = start.data()[0]; T end_value = end.data()[0]; T step_value = step.data()[0]; @@ -42,4 +42,4 @@ void RangeKernel(const Context& dev_ctx, } // namespace phi PD_REGISTER_KERNEL( - range, CPU, ALL_LAYOUT, phi::RangeKernel, float, double, int, int64_t) {} + arange, CPU, ALL_LAYOUT, phi::ArangeKernel, float, double, int, int64_t) {} diff --git a/paddle/phi/kernels/cpu/batch_norm_grad_kernel.cc b/paddle/phi/kernels/cpu/batch_norm_grad_kernel.cc index de2343a384a5b..ae87886b89bff 100644 --- a/paddle/phi/kernels/cpu/batch_norm_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/batch_norm_grad_kernel.cc @@ -37,15 +37,16 @@ using ConstEigenVectorArrayMap = template void BatchNormGradRawKernel(const Context& ctx, - const DenseTensor& y_grad, + const DenseTensor& x, const DenseTensor& scale, const DenseTensor& bias, + paddle::optional mean, + paddle::optional variance, const DenseTensor& saved_mean, const DenseTensor& saved_variance, paddle::optional reserve_space, - paddle::optional mean, - paddle::optional variance, + const DenseTensor& y_grad, float momentum, float epsilon, const std::string& data_layout_str, @@ -122,8 +123,8 @@ void BatchNormGradRawKernel(const Context& ctx, ctx.template Alloc(d_x); } - const T* mean_data = saved_mean.data(); - const T* inv_var_data = saved_variance.data(); + const T* mean_data = nullptr; + const T* inv_var_data = nullptr; DenseTensor inv_var_tensor; if (use_global_stats) { const auto* running_mean = mean.get_ptr(); @@ -136,6 +137,9 @@ void BatchNormGradRawKernel(const Context& ctx, inv_var_tmp = (var_arr + epsilon).sqrt().inverse(); inv_var_data = running_inv_var_data; + } else { + mean_data = saved_mean.data(); + inv_var_data = saved_variance.data(); } ConstEigenVectorArrayMap scale_arr(scale.data(), C); @@ -293,15 +297,15 @@ void BatchNormGradRawKernel(const Context& ctx, template void BatchNormGradKernel(const Context& dev_ctx, - const DenseTensor& y_grad, const DenseTensor& x, const DenseTensor& scale, const DenseTensor& bias, + paddle::optional mean, + paddle::optional variance, const DenseTensor& saved_mean, const DenseTensor& saved_variance, paddle::optional reserve_space, - paddle::optional mean, - paddle::optional variance, + const DenseTensor& y_grad, float momentum, float epsilon, const std::string& data_layout, @@ -313,15 +317,15 @@ void BatchNormGradKernel(const Context& dev_ctx, DenseTensor* scale_grad, DenseTensor* bias_grad) { BatchNormGradRawKernel(dev_ctx, - y_grad, x, scale, bias, + mean, + variance, saved_mean, saved_variance, reserve_space, - mean, - variance, + y_grad, momentum, epsilon, data_layout, diff --git a/paddle/phi/kernels/cpu/cast_kernel.cc b/paddle/phi/kernels/cpu/cast_kernel.cc index 800962544c73e..b53c94eb4cae2 100644 --- a/paddle/phi/kernels/cpu/cast_kernel.cc +++ b/paddle/phi/kernels/cpu/cast_kernel.cc @@ -41,6 +41,7 @@ PD_REGISTER_KERNEL(cast, int64_t, int16_t, bool, + int8_t, uint8_t, phi::dtype::float16, phi::dtype::bfloat16, diff --git a/paddle/phi/kernels/cpu/clip_grad_kernel.cc b/paddle/phi/kernels/cpu/clip_grad_kernel.cc new file mode 100644 index 0000000000000..bccdc0746d51c --- /dev/null +++ b/paddle/phi/kernels/cpu/clip_grad_kernel.cc @@ -0,0 +1,27 @@ +// Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/clip_grad_kernel.h" +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/impl/clip_grad_kernel_impl.h" + +PD_REGISTER_KERNEL(clip_grad, + CPU, + ALL_LAYOUT, + phi::ClipGradKernel, + float, + double, + int, + int64_t) {} diff --git a/paddle/phi/kernels/cpu/clip_kernel.cc b/paddle/phi/kernels/cpu/clip_kernel.cc new file mode 100644 index 0000000000000..5fd9aea966f8d --- /dev/null +++ b/paddle/phi/kernels/cpu/clip_kernel.cc @@ -0,0 +1,21 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/clip_kernel.h" +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/impl/clip_kernel_impl.h" + +PD_REGISTER_KERNEL( + clip, CPU, ALL_LAYOUT, phi::ClipKernel, float, double, int, int64_t) {} diff --git a/paddle/phi/kernels/cpu/concat_grad_kernel.cc b/paddle/phi/kernels/cpu/concat_grad_kernel.cc new file mode 100644 index 0000000000000..56ed95769fef4 --- /dev/null +++ b/paddle/phi/kernels/cpu/concat_grad_kernel.cc @@ -0,0 +1,35 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
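The clip and clip_grad CPU kernels registered above follow the usual clip backward rule: the upstream gradient passes through where the input lay inside [min, max] and is zeroed where the forward output saturated. A standalone reference (whether the interval endpoints themselves propagate gradient is a convention of the real functor; the open interval below is an assumption):

#include <cstddef>
#include <vector>

std::vector<float> ClipGradReference(const std::vector<float>& x,
                                     const std::vector<float>& dout,
                                     float min, float max) {
  std::vector<float> dx(x.size(), 0.f);
  for (std::size_t i = 0; i < x.size(); ++i) {
    // Inside the range clip is the identity, so dL/dx = dL/dout there;
    // outside it the output was saturated to a constant, so dL/dx = 0.
    if (x[i] > min && x[i] < max) {
      dx[i] = dout[i];
    }
  }
  return dx;
}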
+ +#include "paddle/phi/kernels/concat_grad_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/common/bfloat16.h" +#include "paddle/phi/common/complex.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/impl/concat_grad_kernel_impl.h" + +PD_REGISTER_KERNEL(concat_grad, + CPU, + ALL_LAYOUT, + phi::ConcatGradKernel, + double, + float, + bool, + int64_t, + int, + uint8_t, + phi::dtype::float16, + phi::dtype::complex, + phi::dtype::complex) {} diff --git a/paddle/phi/kernels/cpu/conv_grad_kernel.cc b/paddle/phi/kernels/cpu/conv_grad_kernel.cc index 994ad861bd15b..2d8a9bf1de733 100644 --- a/paddle/phi/kernels/cpu/conv_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/conv_grad_kernel.cc @@ -22,9 +22,9 @@ namespace phi { template void DepthwiseConvGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, @@ -38,9 +38,9 @@ void DepthwiseConvGradKernel(const Context& dev_ctx, DenseTensor* input_grad, DenseTensor* filter_grad) { ConvGradKernel(dev_ctx, - out_grad, input, filter, + out_grad, strides, paddings, paddding_algorithm, @@ -56,9 +56,9 @@ void DepthwiseConvGradKernel(const Context& dev_ctx, template void Conv3DGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, @@ -71,9 +71,9 @@ void Conv3DGradKernel(const Context& dev_ctx, DenseTensor* input_grad, DenseTensor* filter_grad) { ConvGradKernel(dev_ctx, - out_grad, input, filter, + out_grad, strides, paddings, paddding_algorithm, diff --git a/paddle/phi/kernels/cpu/cross_entropy_grad_kernel.cc b/paddle/phi/kernels/cpu/cross_entropy_grad_kernel.cc new file mode 100644 index 0000000000000..d4a632b5e6ece --- /dev/null +++ b/paddle/phi/kernels/cpu/cross_entropy_grad_kernel.cc @@ -0,0 +1,226 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/phi/kernels/cross_entropy_grad_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" +#include "paddle/phi/kernels/funcs/eigen/common.h" + +// TODO(chenweihang): move dispatch.h into phi/core +#include "paddle/phi/api/ext/dispatch.h" + +namespace phi { + +template +void CrossEntropyWithSoftmaxGradCPUKernel(const CPUContext& dev_ctx, + const DenseTensor& label, + const DenseTensor& softmax, + const DenseTensor& loss_grad, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* logits_grad) { + const DenseTensor* out_grad = &loss_grad; + DenseTensor* logit_grad = logits_grad; + + if (logit_grad != &softmax || !use_softmax) { + phi::Copy(dev_ctx, softmax, dev_ctx.GetPlace(), false, logit_grad); + } + + const int rank = logit_grad->dims().size(); + const int axis_v = phi::funcs::CanonicalAxis(axis, rank); + int axis_dim = logit_grad->dims()[axis_v]; + PADDLE_ENFORCE_GT( + axis_dim, + 0, + phi::errors::InvalidArgument( + "The axis dimention should be larger than 0, but received " + "axis dimention is %d.", + axis_dim)); + + const int n = phi::funcs::SizeToAxis(axis_v, logit_grad->dims()); + PADDLE_ENFORCE_GT( + n, + 0, + phi::errors::InvalidArgument( + "The size of axis should be larger than 0, but received " + "SizeToAxis of logit_grad is %d.", + n)); + + const int d = phi::funcs::SizeFromAxis(axis_v, logit_grad->dims()); + DenseTensor logit_grad_2d(*logit_grad); + logit_grad_2d.Resize({n, d}); + DenseTensor labels_2d(label); + labels_2d.Resize({n, label.numel() / n}); + DenseTensor out_grad_2d(*out_grad); + out_grad_2d.Resize({n, d / axis_dim}); + + auto out_grad_mat = EigenMatrix::From(out_grad_2d); + auto logit_grad_mat = EigenMatrix::From(logit_grad_2d); + auto& place = *dev_ctx.eigen_device(); + + if (!use_softmax) { + // use_softmax step1 + if (soft_label) { + auto lbl_mat = EigenMatrix::From(labels_2d); + logit_grad_mat.device(place) = + (-lbl_mat / logit_grad_mat); // for each sample ,i is sample id + logit_grad_mat.device(place) = + out_grad_mat.broadcast(Eigen::DSizes(1, axis_dim)) * + logit_grad_mat; + } else { + // use_softmax step2 + const auto* label_data = label.data(); + T* logit_grad_data = logit_grad->data(); + const T* out_grad_data = out_grad->data(); + const int remain = d / axis_dim; + for (int i = 0; i < n; ++i) { // for each sample_1_dim + for (int j = 0; j < remain; j++) { // for each sample_other_dims + int idx = i * remain + j; // this sample's label_idx. 
for 1d case, + // remain=1 and j=0, so, idx = i + auto lbl = static_cast(label_data[idx]); + if (lbl == ignore_index) { + for (int k = 0; k < axis_dim; ++k) { // for each class id's label + logit_grad_data[i * d + k * remain + j] = 0; + } + } else { + // only for this sample's label_idx, the label is 1, others is 0, + // so, only compute this label_idx's class + logit_grad_data[i * d + lbl * remain + j] = + (-1 / logit_grad_data[i * d + lbl * remain + j]) * + out_grad_data[idx]; + for (int k = 0; k < axis_dim; ++k) { // for each class id's label + if (k != + label_data[idx]) { // label_data[idx]: this sample's label + logit_grad_data[i * d + k * remain + j] = 0; + } + } + } + } + } + } + return; + } + // for use_softmax=False, continue + + if (soft_label) { + // when soft_label = True, ignore_index is not supported + auto lbl_mat = EigenMatrix::From(labels_2d); + logit_grad_mat.device(place) = + out_grad_mat.broadcast(Eigen::DSizes(1, axis_dim)) * + (logit_grad_mat - lbl_mat); + // for each sample, i is sample id + // 1) compute dy/dx by p_j - y_j or P-Y, where j is class id, + // P=logit_grad_mat[i] is all class's probs, Y=lbl_mat[i] is + // all class's label + // 2) compute dy * dy/dx by Chain rule, dy=out_grad_mat[i] + // for high dims, e.g. (n,c) or (n,d1,...,dm, c), compute grad by matrix + // operation + + } else { + logit_grad_mat.device(place) = + logit_grad_mat * // element_wise multiply + out_grad_mat.broadcast(Eigen::DSizes(1, axis_dim)); + + const auto* label_data = label.data(); + T* logit_grad_data = logit_grad->data(); + const T* out_grad_data = out_grad->data(); + const int remain = d / axis_dim; + for (int i = 0; i < n; ++i) { // for each sample_1_dim + for (int j = 0; j < remain; j++) { // for each sample_other_dims + int idx = i * remain + j; // this sample's label_idx. 
for 1d case, + // remain=1 and j=0, so, idx = i + auto lbl = static_cast(label_data[idx]); + if (lbl == ignore_index) { + for (int k = 0; k < axis_dim; ++k) { // for each class id's label + logit_grad_data[i * d + k * remain + j] = 0; + } + } else { + // only for this sample's label_idx, the label is 1, others is 0, + // so, only compute this label_idx's class + // for 1d case, remain=1 and j=0, so, [i * d + label_data[idx] * + // remain + j] = [i * d + label_data[idx]] + // let idx_x = i * d + label_data[idx] * remain + j, + // logit_grad_data[idx_x] = logit_grad_data[idx_x] - + // out_grad_data[idx] + // note: logit_grad_mat = logit_grad_mat * out_grad_mat + // so: logit_grad_data[idx_x] = (logit_grad_data[idx_x] - 1) * + // out_grad_data[idx] + // means: dy/dp * dy= ( p - y ) * dy + + logit_grad_data[i * d + lbl * remain + j] -= out_grad_data[idx]; + } + } + } + } +} + +template +void CrossEntropyWithSoftmaxGradKernel(const Context& dev_ctx, + const DenseTensor& label, + const DenseTensor& softmax, + const DenseTensor& loss_grad, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* logits_grad) { + auto dtype = label.dtype(); + if (soft_label) { + PADDLE_ENFORCE_EQ( + dtype, + paddle::experimental::CppTypeToDataType::Type(), + phi::errors::InvalidArgument("The Input(Label) should be with the " + "same data type as kernel data type.")); + CrossEntropyWithSoftmaxGradCPUKernel(dev_ctx, + label, + softmax, + loss_grad, + soft_label, + use_softmax, + numeric_stable_mode, + ignore_index, + axis, + logits_grad); + } else { + PD_DISPATCH_INTEGRAL_TYPES( + dtype, "CrossEntropyWithSoftmaxGradCPUKernel", ([&] { + CrossEntropyWithSoftmaxGradCPUKernel(dev_ctx, + label, + softmax, + loss_grad, + soft_label, + use_softmax, + numeric_stable_mode, + ignore_index, + axis, + logits_grad); + })); + } +} + +} // namespace phi + +PD_REGISTER_KERNEL(cross_entropy_with_softmax_grad, + CPU, + ALL_LAYOUT, + phi::CrossEntropyWithSoftmaxGradKernel, + float, + double) {} diff --git a/paddle/phi/kernels/cpu/cross_entropy_kernel.cc b/paddle/phi/kernels/cpu/cross_entropy_kernel.cc new file mode 100644 index 0000000000000..c684fb416eaab --- /dev/null +++ b/paddle/phi/kernels/cpu/cross_entropy_kernel.cc @@ -0,0 +1,104 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/phi/kernels/cross_entropy_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" +#include "paddle/phi/kernels/funcs/math_function.h" +#include "paddle/phi/kernels/softmax_kernel.h" + +#include "paddle/fluid/operators/math/cross_entropy.h" + +namespace phi { + +template +void CrossEntropy(const CPUContext& dev_ctx, + const DenseTensor& x, + const DenseTensor& label, + bool soft_label, + int ignore_index, + int axis, + DenseTensor* out) { + const int rank = x.dims().size(); + const int axis_v = phi::funcs::CanonicalAxis(axis, rank); + int axis_dim = x.dims()[axis_v]; + + PADDLE_ENFORCE_GT( + axis_dim, + 0, + phi::errors::InvalidArgument( + "The axis dimention should be larger than 0, but received " + "axis dimention is %d.", + axis_dim)); + + dev_ctx.template Alloc(out); + + const int n = phi::funcs::SizeToAxis(axis_v, x.dims()); + PADDLE_ENFORCE_GT( + n, + 0, + phi::errors::InvalidArgument( + "The size of axis should be larger than 0, but received " + "SizeToAxis of softmax is %d.", + n)); + + const int d = phi::funcs::SizeFromAxis(axis_v, x.dims()); + + DenseTensor x_2d(x); + x_2d.Resize({n, d}); + DenseTensor label_2d(label); + label_2d.Resize({n, label.numel() / n}); + DenseTensor out_2d(*out); + out_2d.Resize({n, d / axis_dim}); + + paddle::operators::math::CrossEntropyFunctor()( + dev_ctx, &out_2d, &x_2d, &label_2d, soft_label, ignore_index, axis_dim); +} + +template +void CrossEntropyWithSoftmaxKernel(const Context& dev_ctx, + const DenseTensor& logits, + const DenseTensor& label, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* softmax, + DenseTensor* loss) { + // do not with softmax op, and input is softmax + if (!use_softmax) { + CrossEntropy( + dev_ctx, logits, label, soft_label, ignore_index, axis, loss); + // cause of input is softmax, copy to output softmax, directly + phi::Copy(dev_ctx, logits, dev_ctx.GetPlace(), false, softmax); + return; + } + + phi::SoftmaxKernel(dev_ctx, logits, axis, softmax); + CrossEntropy( + dev_ctx, *softmax, label, soft_label, ignore_index, axis, loss); +} + +} // namespace phi + +PD_REGISTER_KERNEL(cross_entropy_with_softmax, + CPU, + ALL_LAYOUT, + phi::CrossEntropyWithSoftmaxKernel, + float, + double) {} diff --git a/paddle/phi/kernels/cpu/graph_reindex_kernel.cc b/paddle/phi/kernels/cpu/graph_reindex_kernel.cc new file mode 100644 index 0000000000000..d6454b4796430 --- /dev/null +++ b/paddle/phi/kernels/cpu/graph_reindex_kernel.cc @@ -0,0 +1,84 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
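The cross_entropy_with_softmax CPU kernel registered above first runs softmax (unless use_softmax is false, in which case the input is already a probability distribution) and then takes the negative log-likelihood of the label; the grad kernel earlier in this diff relies on that composition reducing to "softmax minus one-hot label, scaled by the upstream gradient". A standalone per-row reference for the hard-label case, with the usual max-shift for numerical stability:

#include <algorithm>
#include <cmath>
#include <vector>

float SoftmaxCrossEntropyRow(const std::vector<float>& logits, int label) {
  float max_logit = logits[0];
  for (float v : logits) max_logit = std::max(max_logit, v);
  float sum_exp = 0.f;
  for (float v : logits) sum_exp += std::exp(v - max_logit);
  // log(softmax(logits)[label]) = (logits[label] - max_logit) - log(sum_exp)
  return -((logits[label] - max_logit) - std::log(sum_exp));
}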
+ +#include +#include + +#include "paddle/phi/kernels/graph_reindex_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" + +namespace phi { + +template +void GraphReindexKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& neighbors, + const DenseTensor& count, + paddle::optional hashtable_value, + paddle::optional hashtable_index, + bool flag_buffer_hashtable, + DenseTensor* reindex_src, + DenseTensor* reindex_dst, + DenseTensor* out_nodes) { + const T* x_data = x.data(); + const T* neighbors_data = neighbors.data(); + const int* count_data = count.data(); + const int bs = x.dims()[0]; + const int num_edges = neighbors.dims()[0]; + + std::unordered_map node_map; + std::vector unique_nodes; + int reindex_id = 0; + for (int i = 0; i < bs; i++) { + T node = x_data[i]; + unique_nodes.emplace_back(node); + node_map[node] = reindex_id++; + } + // Reindex Src + std::vector src(num_edges); + std::vector dst(num_edges); + for (int i = 0; i < num_edges; i++) { + T node = neighbors_data[i]; + if (node_map.find(node) == node_map.end()) { + unique_nodes.emplace_back(node); + node_map[node] = reindex_id++; + } + src[i] = node_map[node]; + } + // Reindex Dst + int cnt = 0; + for (int i = 0; i < bs; i++) { + for (int j = 0; j < count_data[i]; j++) { + T node = x_data[i]; + dst[cnt++] = node_map[node]; + } + } + + reindex_src->Resize({num_edges}); + T* reindex_src_data = dev_ctx.template Alloc(reindex_src); + std::copy(src.begin(), src.end(), reindex_src_data); + reindex_dst->Resize({num_edges}); + T* reindex_dst_data = dev_ctx.template Alloc(reindex_dst); + std::copy(dst.begin(), dst.end(), reindex_dst_data); + out_nodes->Resize({static_cast(unique_nodes.size())}); + T* out_nodes_data = dev_ctx.template Alloc(out_nodes); + std::copy(unique_nodes.begin(), unique_nodes.end(), out_nodes_data); +} + +} // namespace phi + +PD_REGISTER_KERNEL( + graph_reindex, CPU, ALL_LAYOUT, phi::GraphReindexKernel, int, int64_t) {} diff --git a/paddle/phi/kernels/cpu/graph_sample_neighbors_kernel.cc b/paddle/phi/kernels/cpu/graph_sample_neighbors_kernel.cc new file mode 100644 index 0000000000000..e18848af0dc08 --- /dev/null +++ b/paddle/phi/kernels/cpu/graph_sample_neighbors_kernel.cc @@ -0,0 +1,151 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include + +#include "paddle/phi/kernels/graph_sample_neighbors_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" + +namespace phi { + +template +void SampleUniqueNeighbors( + bidiiter begin, + bidiiter end, + int num_samples, + std::mt19937& rng, + std::uniform_int_distribution& dice_distribution) { + int left_num = std::distance(begin, end); + for (int i = 0; i < num_samples; i++) { + bidiiter r = begin; + int random_step = dice_distribution(rng) % left_num; + std::advance(r, random_step); + std::swap(*begin, *r); + ++begin; + --left_num; + } +} + +template +void SampleNeighbors(const T* row, + const T* col_ptr, + const T* input, + std::vector* output, + std::vector* output_count, + int sample_size, + int bs) { + // Allocate the memory of output + // Collect the neighbors size + std::vector> out_src_vec; + // `sample_cumsum_sizes` record the start position and end position + // after sampling. + std::vector sample_cumsum_sizes(bs + 1); + // `total_neighbors` the size of output after sample. + int total_neighbors = 0; + sample_cumsum_sizes[0] = total_neighbors; + for (int i = 0; i < bs; i++) { + T node = input[i]; + int cap = col_ptr[node + 1] - col_ptr[node]; + int k = cap > sample_size ? sample_size : cap; + total_neighbors += k; + sample_cumsum_sizes[i + 1] = total_neighbors; + std::vector out_src; + out_src.resize(cap); + out_src_vec.emplace_back(out_src); + } + + output_count->resize(bs); + output->resize(total_neighbors); + + std::random_device rd; + std::mt19937 rng{rd()}; + std::uniform_int_distribution dice_distribution( + 0, std::numeric_limits::max()); + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + // Sample the neighbors in parallelism. + for (int i = 0; i < bs; i++) { + T node = input[i]; + T begin = col_ptr[node], end = col_ptr[node + 1]; + int cap = end - begin; + if (sample_size < cap) { + std::copy(row + begin, row + end, out_src_vec[i].begin()); + // TODO(daisiming): Check whether is correct. 
+ SampleUniqueNeighbors(out_src_vec[i].begin(), + out_src_vec[i].end(), + sample_size, + rng, + dice_distribution); + *(output_count->data() + i) = sample_size; + } else { + std::copy(row + begin, row + end, out_src_vec[i].begin()); + *(output_count->data() + i) = cap; + } + } + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + // Copy the results parallelism + for (int i = 0; i < bs; i++) { + int k = sample_cumsum_sizes[i + 1] - sample_cumsum_sizes[i]; + std::copy(out_src_vec[i].begin(), + out_src_vec[i].begin() + k, + output->data() + sample_cumsum_sizes[i]); + } +} + +template +void GraphSampleNeighborsKernel( + const Context& dev_ctx, + const DenseTensor& row, + const DenseTensor& col_ptr, + const DenseTensor& x, + paddle::optional eids, + paddle::optional perm_buffer, + int sample_size, + bool return_eids, + bool flag_perm_buffer, + DenseTensor* out, + DenseTensor* out_count, + DenseTensor* out_eids) { + const T* row_data = row.data(); + const T* col_ptr_data = col_ptr.data(); + const T* x_data = x.data(); + int bs = x.dims()[0]; + + std::vector output; + std::vector output_count; + SampleNeighbors( + row_data, col_ptr_data, x_data, &output, &output_count, sample_size, bs); + out->Resize({static_cast(output.size())}); + T* out_data = dev_ctx.template Alloc(out); + std::copy(output.begin(), output.end(), out_data); + out_count->Resize({bs}); + int* out_count_data = dev_ctx.template Alloc(out_count); + std::copy(output_count.begin(), output_count.end(), out_count_data); +} + +} // namespace phi + +PD_REGISTER_KERNEL(graph_sample_neighbors, + CPU, + ALL_LAYOUT, + phi::GraphSampleNeighborsKernel, + int, + int64_t) {} diff --git a/paddle/phi/kernels/cpu/graph_send_recv_grad_kernel.cc b/paddle/phi/kernels/cpu/graph_send_recv_grad_kernel.cc index 6a83cee1ae40d..95eeb64afea20 100644 --- a/paddle/phi/kernels/cpu/graph_send_recv_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/graph_send_recv_grad_kernel.cc @@ -118,12 +118,12 @@ void GraphSendRecvGradOpKernelLaunchHelper( template void GraphSendRecvGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, - paddle::optional out, const DenseTensor& src_index, const DenseTensor& dst_index, + paddle::optional out, paddle::optional dst_count, + const DenseTensor& out_grad, const std::string& pool_type, DenseTensor* x_grad) { auto index_type = src_index.dtype(); diff --git a/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad.h b/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad.h index b79aab96c0fc2..cc67f8e7f210c 100644 --- a/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad.h +++ b/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad.h @@ -31,11 +31,11 @@ void HierarchicalSigmoidGradKernelImpl( const DenseTensor& x, const DenseTensor& w, const DenseTensor& label, - const DenseTensor& pre_out, - const DenseTensor& out_grad, paddle::optional path, paddle::optional code, paddle::optional bias, + const DenseTensor& pre_out, + const DenseTensor& out_grad, int num_classes, bool remote_prefetch, int trainer_id, diff --git a/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad_kernel.cc b/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad_kernel.cc index f64a1a8162a37..9edc9f87d4b1f 100644 --- a/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/hierarchical_sigmoid_grad_kernel.cc @@ -25,11 +25,11 @@ void HierarchicalSigmoidGradKernel(const Context& ctx, const DenseTensor& x, const DenseTensor& w, const DenseTensor& label, - const DenseTensor& pre_out, - const DenseTensor& out_grad, paddle::optional 
path, paddle::optional code, paddle::optional bias, + const DenseTensor& pre_out, + const DenseTensor& out_grad, int num_classes, bool remote_prefetch, int trainer_id, @@ -44,11 +44,11 @@ void HierarchicalSigmoidGradKernel(const Context& ctx, x, w, label, - pre_out, - out_grad, path, code, bias, + pre_out, + out_grad, num_classes, remote_prefetch, trainer_id, diff --git a/paddle/phi/kernels/cpu/hierarchical_sigmoid_kernel.cc b/paddle/phi/kernels/cpu/hierarchical_sigmoid_kernel.cc index 096a54f9fb263..4c4f1aa125a33 100644 --- a/paddle/phi/kernels/cpu/hierarchical_sigmoid_kernel.cc +++ b/paddle/phi/kernels/cpu/hierarchical_sigmoid_kernel.cc @@ -14,7 +14,6 @@ #include "paddle/phi/kernels/hierarchical_sigmoid_kernel.h" -#include "paddle/fluid/operators/clip_op.h" #include "paddle/fluid/operators/math/matrix_bit_code.h" #include "paddle/fluid/platform/transform.h" #include "paddle/phi/backends/cpu/cpu_context.h" @@ -22,6 +21,7 @@ #include "paddle/phi/kernels/funcs/eigen/common.h" #include "paddle/phi/kernels/funcs/eigen/eigen_function.h" #include "paddle/phi/kernels/funcs/math_function_impl.h" +#include "paddle/phi/kernels/impl/clip_kernel_impl.h" namespace phi { @@ -92,8 +92,7 @@ void HierarchicalSigmoidKernel(const Context& ctx, pre_out_data, pre_out_data + pre_out->numel(), pre_out_data, - paddle::operators::ClipFunctor(static_cast(-40.0), - static_cast(40.0))); + ClipFunctor(static_cast(-40.0), static_cast(40.0))); bit_code->Sum(*pre_out, out, static_cast(-1)); // use softrelu to calculate cross entropy pre_out_mat.device(place) = (static_cast(1.0) + pre_out_mat.exp()).log(); diff --git a/paddle/phi/kernels/cpu/index_sample_grad_kernel.cc b/paddle/phi/kernels/cpu/index_sample_grad_kernel.cc index 006711ceef75e..d060e8c9b2837 100644 --- a/paddle/phi/kernels/cpu/index_sample_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/index_sample_grad_kernel.cc @@ -68,9 +68,9 @@ void IndexSampleGradInner(const Context& context, template void IndexSampleGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& index, + const DenseTensor& out_grad, DenseTensor* x_grad) { auto index_type = index.dtype(); bool index_type_match = diff --git a/paddle/phi/kernels/cpu/interpolate_grad_kernel.cc b/paddle/phi/kernels/cpu/interpolate_grad_kernel.cc new file mode 100644 index 0000000000000..550439a5251db --- /dev/null +++ b/paddle/phi/kernels/cpu/interpolate_grad_kernel.cc @@ -0,0 +1,1067 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
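The hierarchical_sigmoid change above replaces the fluid ClipFunctor with the phi one when saturating pre_out into [-40, 40] before the softrelu step. A minimal sketch of such an element-wise clamp functor applied over pre_out (a stand-in for illustration, not the actual clip_kernel_impl.h code):

template <typename T>
struct ClipFunctorSketch {
  T min_;
  T max_;
  ClipFunctorSketch(T min, T max) : min_(min), max_(max) {}
  T operator()(T x) const {
    // Saturate: values below min_ or above max_ are pinned to the boundary.
    if (x < min_) return min_;
    if (x > max_) return max_;
    return x;
  }
};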
+ +#include "paddle/phi/kernels/interpolate_grad_kernel.h" +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/common/layout.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/funcs/interpolate_function.h" +#include "paddle/phi/kernels/funcs/math_function.h" + +namespace phi { + +template +static void LinearInterpolationGrad(const DenseTensor& output_grad, + DenseTensor* input_grad, + const float ratio_w, + const int in_w, + const int n, + const int c, + const int out_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + bool align_flag = (align_mode == 0 && !align_corners); + for (int l = 0; l < out_w; l++) { + int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) + : static_cast(ratio_w * l); + x_w = (x_w > 0) ? x_w : 0; // w + int x_e = (x_w < (in_w - 1)) ? (x_w + 1) : x_w; // w_id + + float idx_src_x = ratio_w * (l + 0.5) - 0.5; + idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; + float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; // w1lambda + float d_e = 1.f - d_w; // w2lambda + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + // linear interpolation grad + if (data_layout == DataLayout::kNCHW) { + const T grad = output_grad_t(i, j, l); + input_grad_t(i, j, x_w) += static_cast(grad * d_e); + input_grad_t(i, j, x_e) += static_cast(grad * d_w); + } else { + const T grad = output_grad_t(i, l, j); + input_grad_t(i, x_w, j) += static_cast(grad * d_e); + input_grad_t(i, x_e, j) += static_cast(grad * d_w); + } + } + } + } +} + +template +static void BilinearInterpolationGrad(const DenseTensor& output_grad, + DenseTensor* input_grad, + const float ratio_h, + const float ratio_w, + const int in_h, + const int in_w, + const int n, + const int c, + const int out_h, + const int out_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + bool align_flag = (align_mode == 0 && !align_corners); + for (int k = 0; k < out_h; k++) { // loop for images + int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) + : static_cast(ratio_h * k); + y_n = (y_n > 0) ? y_n : 0; + int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); + float idx_src_y = ratio_h * (k + 0.5) - 0.5; + idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; + float d_n = align_flag ? idx_src_y - y_n : ratio_h * k - y_n; + float d_s = 1.f - d_n; + + for (int l = 0; l < out_w; l++) { + int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) + : static_cast(ratio_w * l); + x_w = (x_w > 0) ? x_w : 0; + int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); + float idx_src_x = ratio_w * (l + 0.5) - 0.5; + idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; + float d_w = align_flag ? 
idx_src_x - x_w : ratio_w * l - x_w; + float d_e = 1.f - d_w; + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + // bilinear interpolation grad + if (data_layout == DataLayout::kNCHW) { + const T grad = output_grad_t(i, j, k, l); + input_grad_t(i, j, y_n, x_w) += static_cast(grad * d_s * d_e); + input_grad_t(i, j, y_s, x_w) += static_cast(grad * d_n * d_e); + input_grad_t(i, j, y_n, x_e) += static_cast(grad * d_s * d_w); + input_grad_t(i, j, y_s, x_e) += static_cast(grad * d_n * d_w); + } else { + const T grad = output_grad_t(i, k, l, j); + input_grad_t(i, y_n, x_w, j) += static_cast(grad * d_s * d_e); + input_grad_t(i, y_s, x_w, j) += static_cast(grad * d_n * d_e); + input_grad_t(i, y_n, x_e, j) += static_cast(grad * d_s * d_w); + input_grad_t(i, y_s, x_e, j) += static_cast(grad * d_n * d_w); + } + } + } + } + } +} + +template +static void NearestNeighborInterpolateGrad(const DenseTensor& output_grad, + DenseTensor* input_grad, + const float ratio_h, + const float ratio_w, + const int n, + const int c, + const int out_h, + const int out_w, + const bool align_corners, + const DataLayout data_layout) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + + for (int k = 0; k < out_h; k++) { // loop for images + int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) + : static_cast(ratio_h * k); + + for (int l = 0; l < out_w; l++) { + int in_l = (align_corners) ? static_cast(ratio_w * l + 0.5) + : static_cast(ratio_w * l); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + if (data_layout == DataLayout::kNCHW) { + input_grad_t(i, j, in_k, in_l) += output_grad_t(i, j, k, l); + } else { + input_grad_t(i, in_k, in_l, j) += output_grad_t(i, k, l, j); + } + } + } + } + } +} + +template +static void BicubicInterpolationGrad(const DenseTensor& output_grad, + DenseTensor* input_grad, + const float ratio_h, + const float ratio_w, + const int in_h, + const int in_w, + const int n, + const int c, + const int out_h, + const int out_w, + const bool align_corners, + const DataLayout data_layout) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + + for (int k = 0; k < out_h; k++) { // loop for images + T y_n = align_corners ? static_cast(ratio_h * k) + : static_cast(ratio_h * (k + 0.5) - 0.5); + int input_y = floorf(y_n); + T y_t = y_n - input_y; + + for (int l = 0; l < out_w; l++) { + T x_n = align_corners ? 
static_cast(ratio_w * l) + : static_cast(ratio_w * (l + 0.5) - 0.5); + int input_x = floorf(x_n); + T x_t = x_n - input_x; + + T x_coeffs[4]; + T y_coeffs[4]; + + funcs::get_cubic_upsample_coefficients(x_coeffs, x_t); + funcs::get_cubic_upsample_coefficients(y_coeffs, y_t); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + // bicubic interpolation grad + for (int ii = 0; ii < 4; ii++) { + for (int jj = 0; jj < 4; jj++) { + int access_x = std::max(std::min(input_x - 1 + ii, in_w - 1), + static_cast(0)); + int access_y = std::max(std::min(input_y - 1 + jj, in_h - 1), + static_cast(0)); + if (data_layout == DataLayout::kNCHW) { + T grad = output_grad_t(i, j, k, l); + input_grad_t(i, j, access_y, access_x) += + grad * y_coeffs[jj] * x_coeffs[ii]; + } else { + T grad = output_grad_t(i, k, l, j); + input_grad_t(i, access_y, access_x, j) += + grad * y_coeffs[jj] * x_coeffs[ii]; + } + } + } + } + } + } + } +} + +template +static void TrilinearInterpolationGrad(const DenseTensor& output_grad, + DenseTensor* input_grad, + const float ratio_d, + const float ratio_h, + const float ratio_w, + const int in_d, + const int in_h, + const int in_w, + const int n, + const int c, + const int out_d, + const int out_h, + const int out_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + bool align_flag = (align_mode == 0 && !align_corners); + for (int j = 0; j < out_d; j++) { // loop for D + int t_f = align_flag ? static_cast(ratio_d * (j + 0.5) - 0.5) + : static_cast(ratio_d * j); + t_f = (t_f > 0) ? t_f : 0; + int t_b = (t_f + 1) < (in_d - 1) ? (t_f + 1) : (in_d - 1); + float idx_src_t = ratio_d * (j + 0.5) - 0.5; + idx_src_t = (idx_src_t > 0) ? idx_src_t : 0; + float d_f = align_flag ? idx_src_t - t_f : ratio_d * j - t_f; + float d_b = 1.f - d_f; + + for (int k = 0; k < out_h; k++) { // loop for H + int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) + : static_cast(ratio_h * k); + y_n = (y_n > 0) ? y_n : 0; + int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); + float idx_src_y = ratio_h * (k + 0.5) - 0.5; + idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; + float d_n = align_flag ? idx_src_y - y_n : ratio_h * k - y_n; + float d_s = 1.f - d_n; + + for (int l = 0; l < out_w; l++) { // loop for W + int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) + : static_cast(ratio_w * l); + x_w = (x_w > 0) ? x_w : 0; + int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); + float idx_src_x = ratio_w * (l + 0.5) - 0.5; + idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; + float d_w = align_flag ? 
idx_src_x - x_w : ratio_w * l - x_w; + float d_e = 1.f - d_w; + + for (int b = 0; b < n; b++) { // loop for batches + for (int i = 0; i < c; i++) { // loop for channels + // trilinear interpolation grad + if (data_layout == DataLayout::kNCHW) { + const T grad = output_grad_t(b, i, j, k, l); + input_grad_t(b, i, t_f, y_n, x_w) += + static_cast(grad * d_b * d_s * d_e); + input_grad_t(b, i, t_f, y_n, x_e) += + static_cast(grad * d_b * d_s * d_w); + input_grad_t(b, i, t_f, y_s, x_w) += + static_cast(grad * d_b * d_n * d_e); + input_grad_t(b, i, t_f, y_s, x_e) += + static_cast(grad * d_b * d_n * d_w); + input_grad_t(b, i, t_b, y_n, x_w) += + static_cast(grad * d_f * d_s * d_e); + input_grad_t(b, i, t_b, y_n, x_e) += + static_cast(grad * d_f * d_s * d_w); + input_grad_t(b, i, t_b, y_s, x_w) += + static_cast(grad * d_f * d_n * d_e); + input_grad_t(b, i, t_b, y_s, x_e) += + static_cast(grad * d_f * d_n * d_w); + } else { + const T grad = output_grad_t(b, j, k, l, i); + input_grad_t(b, t_f, y_n, x_w, i) += + static_cast(grad * d_b * d_s * d_e); + input_grad_t(b, t_f, y_n, x_e, i) += + static_cast(grad * d_b * d_s * d_w); + input_grad_t(b, t_f, y_s, x_w, i) += + static_cast(grad * d_b * d_n * d_e); + input_grad_t(b, t_f, y_s, x_e, i) += + static_cast(grad * d_b * d_n * d_w); + input_grad_t(b, t_b, y_n, x_w, i) += + static_cast(grad * d_f * d_s * d_e); + input_grad_t(b, t_b, y_n, x_e, i) += + static_cast(grad * d_f * d_s * d_w); + input_grad_t(b, t_b, y_s, x_w, i) += + static_cast(grad * d_f * d_n * d_e); + input_grad_t(b, t_b, y_s, x_e, i) += + static_cast(grad * d_f * d_n * d_w); + } + } + } + } + } + } +} + +template +static void NearestNeighbor3DInterpolateGrad(const DenseTensor& output_grad, + DenseTensor* input_grad, + const float ratio_d, + const float ratio_h, + const float ratio_w, + const int n, + const int c, + const int out_d, + const int out_h, + const int out_w, + const bool align_corners, + const DataLayout data_layout) { + auto input_grad_t = EigenTensor::From(*input_grad); + auto output_grad_t = EigenTensor::From(output_grad); + + for (int d = 0; d < out_d; d++) { + int in_d = (align_corners) ? static_cast(ratio_d * d + 0.5) + : static_cast(ratio_d * d); + for (int k = 0; k < out_h; k++) { // loop for images + int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) + : static_cast(ratio_h * k); + + for (int l = 0; l < out_w; l++) { + int in_l = (align_corners) ? 
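+        // With align_corners the source column is rounded to the nearest index
+        // (+0.5 before truncation); otherwise it is simply truncated.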
static_cast(ratio_w * l + 0.5) + : static_cast(ratio_w * l); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + if (data_layout == DataLayout::kNCHW) { + input_grad_t(i, j, in_d, in_k, in_l) += + output_grad_t(i, j, d, k, l); + } else { + input_grad_t(i, in_d, in_k, in_l, j) += + output_grad_t(i, d, k, l, j); + } + } + } + } + } + } +} + +template +static void Interpolate1DCPUBwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout_str, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* input_grad) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_w = -1.0; + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + scale_w = scale_data[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } else { + if (scale.size() > 0) { + scale_w = scale[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } + } + if (scale_w > 0.) { + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + auto out_size_data = + funcs::get_new_data_from_tensor(out_size.get_ptr()); + out_w = out_size_data[0]; + } + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_w = new_size[0]; + } + + phi::DDim dim_grad; + if (data_layout == DataLayout::kNCHW) { + dim_grad = {n, c, in_w}; + } else { + dim_grad = {n, in_w, c}; + } + + input_grad->Resize(dim_grad); + dev_ctx.template Alloc(input_grad); + + phi::funcs::SetConstant zero; + zero(dev_ctx, input_grad, static_cast(0.0)); + + if (in_w == out_w) { + paddle::framework::TensorCopy(output_grad, dev_ctx.GetPlace(), input_grad); + return; + } + + float ratio_w = 0.f; + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
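+    // ratio_w is the output-to-input step: (in_w - 1) / (out_w - 1) with aligned
+    // corners, otherwise 1 / scale (or in_w / out_w when no scale was given).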
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + if ("linear" == interp_method) { + LinearInterpolationGrad(output_grad, + input_grad, + ratio_w, + in_w, + n, + c, + out_w, + align_corners, + align_mode, + data_layout); + } +} + +template +static void Interpolate2DCPUBwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout_str, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* input_grad) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_h = -1; + float scale_w = -1; + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_h = scale_data[0]; + scale_w = scale_data[1]; + } else { + scale_w = scale_data[0]; + scale_h = scale_data[0]; + } + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } else { + if (scale.size() > 1) { + scale_h = scale[0]; + scale_w = scale[1]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } + } + if (scale_h > 0. && scale_w > 0.) { + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + auto out_size_data = + funcs::get_new_data_from_tensor(out_size.get_ptr()); + out_h = out_size_data[0]; + out_w = out_size_data[1]; + } + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_h = new_size[0]; + out_w = new_size[1]; + } + + phi::DDim dim_grad; + if (data_layout == DataLayout::kNCHW) { + dim_grad = {n, c, in_h, in_w}; + } else { + dim_grad = {n, in_h, in_w, c}; + } + + input_grad->Resize(dim_grad); + dev_ctx.template Alloc(input_grad); + + phi::funcs::SetConstant zero; + zero(dev_ctx, input_grad, static_cast(0.0)); + + if (in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(output_grad, dev_ctx.GetPlace(), input_grad); + return; + } + + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + if ("bilinear" == interp_method) { + BilinearInterpolationGrad(output_grad, + input_grad, + ratio_h, + ratio_w, + in_h, + in_w, + n, + c, + out_h, + out_w, + align_corners, + align_mode, + data_layout); + } else if ("nearest" == interp_method) { + NearestNeighborInterpolateGrad(output_grad, + input_grad, + ratio_h, + ratio_w, + n, + c, + out_h, + out_w, + align_corners, + data_layout); + } else if ("bicubic" == interp_method) { + BicubicInterpolationGrad(output_grad, + input_grad, + ratio_h, + ratio_w, + in_h, + in_w, + n, + c, + out_h, + out_w, + align_corners, + data_layout); + } +} + +template +static void Interpolate3DCPUBwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* input_grad) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_d = -1; + float scale_h = -1; + float scale_w = -1; + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_d = scale_data[0]; + scale_h = scale_data[1]; + scale_w = scale_data[2]; + } else { + scale_d = scale_data[0]; + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } else { + if (scale.size() > 1) { + scale_d = scale[0]; + scale_h = scale[1]; + scale_w = scale[2]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } + } + if (scale_d > 0. && scale_h > 0. && scale_w > 0.) 
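+  // Sizes derived from scale are provisional: OutSize and then SizeTensor,
+  // when provided, overwrite out_d / out_h / out_w below.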
{ + out_d = static_cast(in_d * scale_d); + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + auto out_size_data = + funcs::get_new_data_from_tensor(out_size.get_ptr()); + out_d = out_size_data[0]; + out_h = out_size_data[1]; + out_w = out_size_data[2]; + } + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_d = new_size[0]; + out_h = new_size[1]; + out_w = new_size[2]; + } + + phi::DDim dim_grad; + if (data_layout == DataLayout::kNCHW) { + dim_grad = {n, c, in_d, in_h, in_w}; + } else { + dim_grad = {n, in_d, in_h, in_w, c}; + } + input_grad->Resize(dim_grad); + dev_ctx.template Alloc(input_grad); + + phi::funcs::SetConstant zero; + zero(dev_ctx, input_grad, static_cast(0.0)); + + if (in_d == out_d && in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(output_grad, dev_ctx.GetPlace(), input_grad); + return; + } + + float ratio_d = 0.f; + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_d > 1) { + float new_scale_d = 0.f; + new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) + : static_cast(in_d) / out_d; + ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) + : static_cast(new_scale_d); + } + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + if ("trilinear" == interp_method) { + TrilinearInterpolationGrad(output_grad, + input_grad, + ratio_d, + ratio_h, + ratio_w, + in_d, + in_h, + in_w, + n, + c, + out_d, + out_h, + out_w, + align_corners, + align_mode, + data_layout); + } else if ("nearest" == interp_method) { + NearestNeighbor3DInterpolateGrad(output_grad, + input_grad, + ratio_d, + ratio_h, + ratio_w, + n, + c, + out_d, + out_h, + out_w, + align_corners, + data_layout); + } +} + +template +void InterpolateGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + auto output_grad_dims = output_grad.dims(); + if (output_grad_dims.size() == 3) { // 1D interpolation grad + Interpolate1DCPUBwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + output_grad, + data_layout, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); + } else if (output_grad_dims.size() == 4) { // 2D interpolation grad + Interpolate2DCPUBwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + output_grad, + data_layout, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); + + } else if (output_grad_dims.size() == 5) { // 3D interpolation grad + Interpolate3DCPUBwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + output_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); + } +} + +template +void BilinearInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& 
x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void NearestInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void TrilinearInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void LinearInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void BicubicInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +} // namespace phi + +PD_REGISTER_KERNEL(bilinear_interp_v2_grad, + CPU, + ALL_LAYOUT, + phi::BilinearInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(nearest_interp_v2_grad, + CPU, + ALL_LAYOUT, + phi::NearestInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(trilinear_interp_v2_grad, + CPU, + ALL_LAYOUT, + phi::TrilinearInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(linear_interp_v2_grad, + CPU, + ALL_LAYOUT, + phi::LinearInterpGradKernel, + float, + double) {} 
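+// All *_interp_v2_grad kernels in this file forward to InterpolateGradKernel,
+// which selects the 1-D, 2-D or 3-D backward path from the rank of out_grad.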
+PD_REGISTER_KERNEL(bicubic_interp_v2_grad, + CPU, + ALL_LAYOUT, + phi::BicubicInterpGradKernel, + float, + double) {} diff --git a/paddle/phi/kernels/cpu/interpolate_kernel.cc b/paddle/phi/kernels/cpu/interpolate_kernel.cc new file mode 100644 index 0000000000000..da9a54748f06f --- /dev/null +++ b/paddle/phi/kernels/cpu/interpolate_kernel.cc @@ -0,0 +1,1225 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/interpolate_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/common/layout.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/funcs/interpolate_function.h" + +namespace phi { + +template +static inline T cubic_interp(T x0, T x1, T x2, T x3, T t) { + T coeffs[4]; + funcs::get_cubic_upsample_coefficients(coeffs, t); + + return x0 * coeffs[0] + x1 * coeffs[1] + x2 * coeffs[2] + x3 * coeffs[3]; +} + +template +static void LinearInterpolation(const DenseTensor& input, + DenseTensor* output, + const float ratio_w, + const int in_w, + const int n, + const int c, + const int out_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + bool align_flag = (align_mode == 0 && !align_corners); + + std::vector vx_w, vx_e; + std::vector vd_w, vd_e; + vx_w.reserve(out_w); + vx_e.reserve(out_w); + vd_w.reserve(out_w); + vd_e.reserve(out_w); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (int l = 0; l < out_w; l++) { + int x_w = align_flag ? static_cast(ratio_w * (l + 0.5) - 0.5) + : static_cast(ratio_w * l); + x_w = (x_w > 0) ? x_w : 0; // w + int x_e = (x_w < (in_w - 1)) ? (x_w + 1) : x_w; // w_id + + float idx_src_x = ratio_w * (l + 0.5) - 0.5; + idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; + float d_w = align_flag ? 
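+    // The index and lambda of every output column are cached in vx_* / vd_* so the
+    // (batch, channel, width) interpolation loop below reuses them without
+    // recomputing the mapping.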
idx_src_x - x_w : ratio_w * l - x_w; // w1lambda + float d_e = 1.f - d_w; // w2lambda + { + vx_w[l] = x_w; + vx_e[l] = x_e; + vd_w[l] = d_w; + vd_e[l] = d_e; + } + } + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(3) +#endif + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + for (int l = 0; l < out_w; l++) { + // linear interpolation + T out_t; + if (data_layout == DataLayout::kNCHW) { + out_t = input_t(i, j, vx_w[l]) * vd_e[l] + + input_t(i, j, vx_e[l]) * vd_w[l]; + output_t(i, j, l) = out_t; + } else { + out_t = input_t(i, vx_w[l], j) * vd_e[l] + + input_t(i, vx_e[l], j) * vd_w[l]; + output_t(i, l, j) = out_t; + } + } + } + } +} + +template +static void BilinearInterpolation(const DenseTensor& input, + DenseTensor* output, + const float ratio_h, + const float ratio_w, + const int in_h, + const int in_w, + const int n, + const int c, + const int out_h, + const int out_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + bool align_flag = (align_mode == 0 && !align_corners); + + std::vector vy_n, vy_s; + std::vector vd_n, vd_s; + vy_n.reserve(out_h); + vy_s.reserve(out_h); + vd_n.reserve(out_h); + vd_s.reserve(out_h); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (int k = 0; k < out_h; k++) { + int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) + : static_cast(ratio_h * k); + y_n = (y_n > 0) ? y_n : 0; + int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); + float idx_src_y = ratio_h * (k + 0.5) - 0.5; + idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; + float d_n = align_flag ? idx_src_y - y_n : ratio_h * k - y_n; + float d_s = 1.f - d_n; + { + vy_n[k] = y_n; + vy_s[k] = y_s; + vd_n[k] = d_n; + vd_s[k] = d_s; + } + } + + std::vector vx_w, vx_e; + std::vector vd_w, vd_e; + vx_w.reserve(out_w); + vx_e.reserve(out_w); + vd_w.reserve(out_w); + vd_e.reserve(out_w); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (int l = 0; l < out_w; l++) { + int x_w = (align_mode == 0 && !align_corners) + ? static_cast(ratio_w * (l + 0.5) - 0.5) + : static_cast(ratio_w * l); + x_w = (x_w > 0) ? x_w : 0; + int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); + float idx_src_x = ratio_w * (l + 0.5) - 0.5; + idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; + float d_w = align_flag ? 
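+    // Each output pixel below is the weighted sum of its four input neighbours,
+    // with weights vd_s*vd_e, vd_n*vd_e, vd_s*vd_w and vd_n*vd_w.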
idx_src_x - x_w : ratio_w * l - x_w; + float d_e = 1.f - d_w; + { + vx_w[l] = x_w; + vx_e[l] = x_e; + vd_w[l] = d_w; + vd_e[l] = d_e; + } + } + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(4) +#endif + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + for (int k = 0; k < out_h; k++) { // loop for images + for (int l = 0; l < out_w; l++) { + // bilinear interpolation + T out_t; + if (data_layout == DataLayout::kNCHW) { + out_t = input_t(i, j, vy_n[k], vx_w[l]) * vd_s[k] * vd_e[l] + + input_t(i, j, vy_s[k], vx_w[l]) * vd_n[k] * vd_e[l] + + input_t(i, j, vy_n[k], vx_e[l]) * vd_s[k] * vd_w[l] + + input_t(i, j, vy_s[k], vx_e[l]) * vd_n[k] * vd_w[l]; + output_t(i, j, k, l) = out_t; + + } else { + out_t = input_t(i, vy_n[k], vx_w[l], j) * vd_s[k] * vd_e[l] + + input_t(i, vy_s[k], vx_w[l], j) * vd_n[k] * vd_e[l] + + input_t(i, vy_n[k], vx_e[l], j) * vd_s[k] * vd_w[l] + + input_t(i, vy_s[k], vx_e[l], j) * vd_n[k] * vd_w[l]; + output_t(i, k, l, j) = out_t; + } + } + } + } + } +} + +template +static void NearestNeighborInterpolate(const DenseTensor& input, + DenseTensor* output, + const float ratio_h, + const float ratio_w, + const int n, + const int c, + const int out_h, + const int out_w, + const bool align_corners, + const DataLayout& data_layout) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + for (int k = 0; k < out_h; k++) { // loop for images + int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) + : static_cast(ratio_h * k); + + for (int l = 0; l < out_w; l++) { + int in_l = (align_corners) ? static_cast(ratio_w * l + 0.5) + : static_cast(ratio_w * l); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + if (data_layout == DataLayout::kNCHW) { + output_t(i, j, k, l) = input_t(i, j, in_k, in_l); + } else { + output_t(i, k, l, j) = input_t(i, in_k, in_l, j); + } + } + } + } + } +} + +template +static void BicubicInterpolation(const DenseTensor& input, + DenseTensor* output, + const float ratio_h, + const float ratio_w, + const int in_h, + const int in_w, + const int n, + const int c, + const int out_h, + const int out_w, + const bool align_corners, + const DataLayout data_layout) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + + for (int k = 0; k < out_h; k++) { // loop for images + T y_n = align_corners ? static_cast(ratio_h * k) + : static_cast(ratio_h * (k + 0.5) - 0.5); + int input_y = floorf(y_n); + const T y_t = y_n - input_y; + + for (int l = 0; l < out_w; l++) { + T x_n = align_corners ? 
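+      // Bicubic sampling is separable: four cubic interpolations along x, one per
+      // row of the 4x4 neighbourhood (taps clamped to the border), then a single
+      // cubic interpolation along y over the four results.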
static_cast(ratio_w * l) + : static_cast(ratio_w * (l + 0.5) - 0.5); + int input_x = floorf(x_n); + const T x_t = x_n - input_x; + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + T coefficients[4]; + // interp 4 times in x direction + for (int ii = 0; ii < 4; ii++) { + int access_y = std::max(std::min(input_y - 1 + ii, in_h - 1), + static_cast(0)); + int access_x_0 = + std::max(std::min(input_x - 1, in_w - 1), static_cast(0)); + int access_x_1 = + std::max(std::min(input_x + 0, in_w - 1), static_cast(0)); + int access_x_2 = + std::max(std::min(input_x + 1, in_w - 1), static_cast(0)); + int access_x_3 = + std::max(std::min(input_x + 2, in_w - 1), static_cast(0)); + if (data_layout == DataLayout::kNCHW) { + coefficients[ii] = + cubic_interp(input_t(i, j, access_y, access_x_0), + input_t(i, j, access_y, access_x_1), + input_t(i, j, access_y, access_x_2), + input_t(i, j, access_y, access_x_3), + x_t); + } else { + coefficients[ii] = + cubic_interp(input_t(i, access_y, access_x_0, j), + input_t(i, access_y, access_x_1, j), + input_t(i, access_y, access_x_2, j), + input_t(i, access_y, access_x_3, j), + x_t); + } + } + + // interp y direction + if (data_layout == DataLayout::kNCHW) { + output_t(i, j, k, l) = cubic_interp(coefficients[0], + coefficients[1], + coefficients[2], + coefficients[3], + y_t); + } else { + output_t(i, k, l, j) = cubic_interp(coefficients[0], + coefficients[1], + coefficients[2], + coefficients[3], + y_t); + } + } + } + } + } +} + +template +static void TrilinearInterpolation(const DenseTensor& input, + DenseTensor* output, + const float ratio_d, + const float ratio_h, + const float ratio_w, + const int in_d, + const int in_h, + const int in_w, + const int n, + const int c, + const int out_d, + const int out_h, + const int out_w, + const bool align_corners, + const int align_mode, + const DataLayout& data_layout) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + bool align_flag = (align_mode == 0 && !align_corners); + + std::vector vt_f, vt_b; + std::vector vd_f, vd_b; + vt_f.reserve(out_d); + vt_b.reserve(out_d); + vd_f.reserve(out_d); + vd_b.reserve(out_d); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (int j = 0; j < out_d; j++) { + int t_f = align_flag ? static_cast(ratio_d * (j + 0.5) - 0.5) + : static_cast(ratio_d * j); + t_f = (t_f > 0) ? t_f : 0; + int t_b = (t_f + 1) < (in_d - 1) ? (t_f + 1) : (in_d - 1); + float idx_src_t = ratio_d * (j + 0.5) - 0.5; + idx_src_t = (idx_src_t > 0) ? idx_src_t : 0; + float d_f = align_flag ? idx_src_t - t_f : ratio_d * j - t_f; + float d_b = 1.f - d_f; + { + vt_f[j] = t_f; + vt_b[j] = t_b; + vd_f[j] = d_f; + vd_b[j] = d_b; + } + } + + std::vector vy_n, vy_s; + std::vector vd_n, vd_s; + vy_n.reserve(out_h); + vy_s.reserve(out_h); + vd_n.reserve(out_h); + vd_s.reserve(out_h); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (int k = 0; k < out_h; k++) { + int y_n = align_flag ? static_cast(ratio_h * (k + 0.5) - 0.5) + : static_cast(ratio_h * k); + y_n = (y_n > 0) ? y_n : 0; + int y_s = (y_n + 1) < (in_h - 1) ? (y_n + 1) : (in_h - 1); + float idx_src_y = ratio_h * (k + 0.5) - 0.5; + idx_src_y = (idx_src_y > 0) ? idx_src_y : 0; + float d_n = align_flag ? 
idx_src_y - y_n : ratio_h * k - y_n; + float d_s = 1.f - d_n; + { + vy_n[k] = y_n; + vy_s[k] = y_s; + vd_n[k] = d_n; + vd_s[k] = d_s; + } + } + + std::vector vx_w, vx_e; + std::vector vd_w, vd_e; + vx_w.reserve(out_w); + vx_e.reserve(out_w); + vd_w.reserve(out_w); + vd_e.reserve(out_w); +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for +#endif + for (int l = 0; l < out_w; l++) { + int x_w = (align_mode == 0 && !align_corners) + ? static_cast(ratio_w * (l + 0.5) - 0.5) + : static_cast(ratio_w * l); + x_w = (x_w > 0) ? x_w : 0; + int x_e = (x_w + 1) < (in_w - 1) ? (x_w + 1) : (in_w - 1); + float idx_src_x = ratio_w * (l + 0.5) - 0.5; + idx_src_x = (idx_src_x > 0) ? idx_src_x : 0; + float d_w = align_flag ? idx_src_x - x_w : ratio_w * l - x_w; + float d_e = 1.f - d_w; + { + vx_w[l] = x_w; + vx_e[l] = x_e; + vd_w[l] = d_w; + vd_e[l] = d_e; + } + } + +#ifdef PADDLE_WITH_MKLML +#pragma omp parallel for collapse(5) +#endif + for (int b = 0; b < n; b++) { // loop for batches + for (int i = 0; i < c; i++) { // loop for channels + for (int j = 0; j < out_d; j++) { // loop for D, H, W + for (int k = 0; k < out_h; k++) { + for (int l = 0; l < out_w; l++) { + // trilinear interpolation + if (data_layout == DataLayout::kNCHW) { + T out_t = input_t(b, i, vt_f[j], vy_n[k], vx_w[l]) * vd_b[j] * + vd_s[k] * vd_e[l] + + input_t(b, i, vt_f[j], vy_n[k], vx_e[l]) * vd_b[j] * + vd_s[k] * vd_w[l] + + input_t(b, i, vt_f[j], vy_s[k], vx_w[l]) * vd_b[j] * + vd_n[k] * vd_e[l] + + input_t(b, i, vt_f[j], vy_s[k], vx_e[l]) * vd_b[j] * + vd_n[k] * vd_w[l] + + input_t(b, i, vt_b[j], vy_n[k], vx_w[l]) * vd_f[j] * + vd_s[k] * vd_e[l] + + input_t(b, i, vt_b[j], vy_n[k], vx_e[l]) * vd_f[j] * + vd_s[k] * vd_w[l] + + input_t(b, i, vt_b[j], vy_s[k], vx_w[l]) * vd_f[j] * + vd_n[k] * vd_e[l] + + input_t(b, i, vt_b[j], vy_s[k], vx_e[l]) * vd_f[j] * + vd_n[k] * vd_w[l]; + output_t(b, i, j, k, l) = out_t; + } else { + T out_t = input_t(b, vt_f[j], vy_n[k], vx_w[l], i) * vd_b[j] * + vd_s[k] * vd_e[l] + + input_t(b, vt_f[j], vy_n[k], vx_e[l], i) * vd_b[j] * + vd_s[k] * vd_w[l] + + input_t(b, vt_f[j], vy_s[k], vx_w[l], i) * vd_b[j] * + vd_n[k] * vd_e[l] + + input_t(b, vt_f[j], vy_s[k], vx_e[l], i) * vd_b[j] * + vd_n[k] * vd_w[l] + + input_t(b, vt_b[j], vy_n[k], vx_w[l], i) * vd_f[j] * + vd_s[k] * vd_e[l] + + input_t(b, vt_b[j], vy_n[k], vx_e[l], i) * vd_f[j] * + vd_s[k] * vd_w[l] + + input_t(b, vt_b[j], vy_s[k], vx_w[l], i) * vd_f[j] * + vd_n[k] * vd_e[l] + + input_t(b, vt_b[j], vy_s[k], vx_e[l], i) * vd_f[j] * + vd_n[k] * vd_w[l]; + output_t(b, j, k, l, i) = out_t; + } + } + } + } + } + } +} + +template +static void NearestNeighbor3DInterpolate(const DenseTensor& input, + DenseTensor* output, + const float ratio_d, + const float ratio_h, + const float ratio_w, + const int n, + const int c, + const int out_d, + const int out_h, + const int out_w, + const bool align_corners, + const DataLayout& data_layout) { + auto input_t = EigenTensor::From(input); + auto output_t = EigenTensor::From(*output); + for (int d = 0; d < out_d; d++) { // loop for images + int in_d = (align_corners) ? static_cast(ratio_d * d + 0.5) + : static_cast(ratio_d * d); + for (int k = 0; k < out_h; k++) { + int in_k = (align_corners) ? static_cast(ratio_h * k + 0.5) + : static_cast(ratio_h * k); + + for (int l = 0; l < out_w; l++) { + int in_l = (align_corners) ? 
static_cast(ratio_w * l + 0.5) + : static_cast(ratio_w * l); + + for (int i = 0; i < n; i++) { // loop for batches + for (int j = 0; j < c; j++) { // loop for channels + if (data_layout == DataLayout::kNCHW) { + output_t(i, j, d, k, l) = input_t(i, j, in_d, in_k, in_l); + } else { // NDHWC + output_t(i, d, k, l, j) = input_t(i, in_d, in_k, in_l, j); + } + } + } + } + } + } +} + +template +static void Interpolate1DCPUFwd( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(x.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_w = -1.; + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_w = new_size[0]; + } else { + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + scale_w = scale_data[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } else { + if (scale.size() > 0) { + scale_w = scale[0]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } + } + if (scale_w > 0.) { + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + auto out_size_data = + funcs::get_new_data_from_tensor(out_size.get_ptr()); + out_w = out_size_data[0]; + } + } + PADDLE_ENFORCE_GT( + out_w, + 0, + errors::InvalidArgument("out_w in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {n, c, out_w}; + } else { + dim_out = {n, out_w, c}; + } + output->Resize(dim_out); + dev_ctx.template Alloc(output); + + if (in_w == out_w) { + paddle::framework::TensorCopy(x, dev_ctx.GetPlace(), output); + return; + } + + float ratio_w = 0.f; + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + if ("linear" == interp_method) { + LinearInterpolation(x, + output, + ratio_w, + in_w, + n, + c, + out_w, + align_corners, + align_mode, + data_layout); + } +} + +template +static void Interpolate2DCPUFwd( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(x.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_h = -1; + float scale_w = -1; + + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_h = new_size[0]; + out_w = new_size[1]; + } else { + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_h = scale_data[0]; + scale_w = scale_data[1]; + } else { + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } else { + if (scale.size() > 1) { + scale_h = scale[0]; + scale_w = scale[1]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } + } + if (scale_h > 0. && scale_w > 0.) { + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + auto out_size_data = + funcs::get_new_data_from_tensor(out_size.get_ptr()); + out_h = out_size_data[0]; + out_w = out_size_data[1]; + } + } + PADDLE_ENFORCE_GT( + out_h, + 0, + errors::InvalidArgument("out_h in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + PADDLE_ENFORCE_GT( + out_w, + 0, + errors::InvalidArgument("out_w in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {n, c, out_h, out_w}; + } else { + dim_out = {n, out_h, out_w, c}; + } + output->Resize(dim_out); + dev_ctx.template Alloc(output); + + if (in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(x, dev_ctx.GetPlace(), output); + return; + } + + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + if ("bilinear" == interp_method) { + BilinearInterpolation(x, + output, + ratio_h, + ratio_w, + in_h, + in_w, + n, + c, + out_h, + out_w, + align_corners, + align_mode, + data_layout); + } else if ("nearest" == interp_method) { + NearestNeighborInterpolate(x, + output, + ratio_h, + ratio_w, + n, + c, + out_h, + out_w, + align_corners, + data_layout); + } else if ("bicubic" == interp_method) { + BicubicInterpolation(x, + output, + ratio_h, + ratio_w, + in_h, + in_w, + n, + c, + out_h, + out_w, + align_corners, + data_layout); + } +} + +template +static void Interpolate3DCPUFwd( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(x.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_d = -1; + float scale_h = -1; + float scale_w = -1; + + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_d = new_size[0]; + out_h = new_size[1]; + out_w = new_size[2]; + } else { + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_d = scale_data[0]; + scale_h = scale_data[1]; + scale_w = scale_data[2]; + } else { + scale_d = scale_data[0]; + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } else { + if (scale.size() > 1) { + scale_d = scale[0]; + scale_h = scale[1]; + scale_w = scale[2]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } + } + if (scale_w > 0. && scale_h > 0. && scale_d > 0.) 
{ + out_d = static_cast(in_d * scale_d); + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + auto out_size_data = + funcs::get_new_data_from_tensor(out_size.get_ptr()); + out_d = out_size_data[0]; + out_h = out_size_data[1]; + out_w = out_size_data[2]; + } + } + PADDLE_ENFORCE_GT( + out_d, + 0, + errors::InvalidArgument("out_d in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + PADDLE_ENFORCE_GT( + out_h, + 0, + errors::InvalidArgument("out_h in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + PADDLE_ENFORCE_GT( + out_w, + 0, + errors::InvalidArgument("out_w in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {n, c, out_d, out_h, out_w}; + } else { + dim_out = {n, out_d, out_h, out_w, c}; + } + + output->Resize(dim_out); + dev_ctx.template Alloc(output); + + if (in_d == out_d && in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(x, dev_ctx.GetPlace(), output); + return; + } + + float ratio_d = 0.f; + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_d > 1) { + float new_scale_d = 0.f; + new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) + : static_cast(in_d) / out_d; + ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) + : static_cast(new_scale_d); + } + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + if ("trilinear" == interp_method) { + TrilinearInterpolation(x, + output, + ratio_d, + ratio_h, + ratio_w, + in_d, + in_h, + in_w, + n, + c, + out_d, + out_h, + out_w, + align_corners, + align_mode, + data_layout); + } else if ("nearest" == interp_method) { + NearestNeighbor3DInterpolate(x, + output, + ratio_d, + ratio_h, + ratio_w, + n, + c, + out_d, + out_h, + out_w, + align_corners, + data_layout); + } +} + +template +void InterpolateKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + auto input_dims = x.dims(); + if (input_dims.size() == 3) { // 1D interpolation + Interpolate1DCPUFwd(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); + + } else if (input_dims.size() == 4) { // 2D interpolation + Interpolate2DCPUFwd(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); + } else if (input_dims.size() == 5) { // 3D interpolation + Interpolate3DCPUFwd(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); + } +} + +template +void BilinearInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void NearestInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void TrilinearInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void LinearInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int 
out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void BicubicInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +} // namespace phi + +PD_REGISTER_KERNEL(bilinear_interp_v2, + CPU, + ALL_LAYOUT, + phi::BilinearInterpKernel, + float, + double, + uint8_t) {} +PD_REGISTER_KERNEL(nearest_interp_v2, + CPU, + ALL_LAYOUT, + phi::NearestInterpKernel, + float, + double, + int, + int64_t, + uint8_t) {} +PD_REGISTER_KERNEL(trilinear_interp_v2, + CPU, + ALL_LAYOUT, + phi::TrilinearInterpKernel, + float, + double, + uint8_t) {} +PD_REGISTER_KERNEL(linear_interp_v2, + CPU, + ALL_LAYOUT, + phi::LinearInterpKernel, + float, + double, + uint8_t) {} +PD_REGISTER_KERNEL(bicubic_interp_v2, + CPU, + ALL_LAYOUT, + phi::BicubicInterpKernel, + float, + double) {} diff --git a/paddle/phi/kernels/cpu/kthvalue_grad_kernel.cc b/paddle/phi/kernels/cpu/kthvalue_grad_kernel.cc index 185d6cbedc85d..de7dfd167b76d 100644 --- a/paddle/phi/kernels/cpu/kthvalue_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/kthvalue_grad_kernel.cc @@ -46,9 +46,9 @@ static void kthvalueAssign(const Type& input_height, template void KthvalueGradKernel(const Context& dev_ctx, - const DenseTensor& d_out, const DenseTensor& x, const DenseTensor& indices, + const DenseTensor& d_out, int k, int axis, bool keepdim, diff --git a/paddle/phi/kernels/cpu/masked_select_grad_kernel.cc b/paddle/phi/kernels/cpu/masked_select_grad_kernel.cc index 7fe41e686af8c..bbb08f0616776 100644 --- a/paddle/phi/kernels/cpu/masked_select_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/masked_select_grad_kernel.cc @@ -21,9 +21,9 @@ namespace phi { template void MaskedSelectGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& mask, + const DenseTensor& out_grad, DenseTensor* x_grad) { auto* mask_data = mask.data(); auto* input_data = out_grad.data(); diff --git a/paddle/phi/kernels/cpu/nll_loss_grad_kernel.cc b/paddle/phi/kernels/cpu/nll_loss_grad_kernel.cc index e7d74759f516a..5b859b6ec270e 100644 --- a/paddle/phi/kernels/cpu/nll_loss_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/nll_loss_grad_kernel.cc @@ -121,8 +121,8 @@ template void NllLossGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& labels, - const DenseTensor& total_weight, paddle::optional weight, + const DenseTensor& total_weight, const DenseTensor& d_out, int64_t ignore_index, const std::string& reduction, diff --git a/paddle/phi/kernels/cpu/one_hot_kernel.cc b/paddle/phi/kernels/cpu/one_hot_kernel.cc index dc58489ebf70e..04f7c6a1f606d 100644 --- a/paddle/phi/kernels/cpu/one_hot_kernel.cc +++ b/paddle/phi/kernels/cpu/one_hot_kernel.cc @@ -25,18 +25,12 @@ struct OneHotV2OpFunctor { DenseTensor* out_; int depth_; const DeviceContext& ctx_; - bool allow_out_of_range_; 
OneHotV2OpFunctor(const DenseTensor* in, DenseTensor* out, int depth, - const DeviceContext& ctx, - bool allow_out_of_range = false) - : in_(in), - out_(out), - depth_(depth), - ctx_(ctx), - allow_out_of_range_(allow_out_of_range) {} + const DeviceContext& ctx) + : in_(in), out_(out), depth_(depth), ctx_(ctx) {} template void apply() const { @@ -45,32 +39,24 @@ struct OneHotV2OpFunctor { auto* p_out_data = ctx_.template Alloc(out_); funcs::set_constant(ctx_, out_, 0.0); - if (allow_out_of_range_) { - for (int i = 0; i < numel; ++i) { - if (p_in_data[i] >= 0 && p_in_data[i] < depth_) { - *(p_out_data + i * depth_ + p_in_data[i]) = 1.0; - } - } - } else { - for (int i = 0; i < numel; ++i) { - PADDLE_ENFORCE_GE( - p_in_data[i], - 0, - phi::errors::InvalidArgument( - "Illegal index value, Input(input) value should be at least 0, " - "but received input (%d) less than 0", - p_in_data[i])); - PADDLE_ENFORCE_LT( - p_in_data[i], - depth_, - phi::errors::InvalidArgument( - "Illegal index value, Input(input) value should be less than " - "Input(depth), " - "but received input (%d) not less than depth (%d)", - p_in_data[i], - depth_)); - *(p_out_data + i * depth_ + p_in_data[i]) = 1.0; - } + for (int i = 0; i < numel; ++i) { + PADDLE_ENFORCE_GE( + p_in_data[i], + 0, + phi::errors::InvalidArgument( + "Illegal index value, Input(input) value should be at least 0, " + "but received input (%d) less than 0", + p_in_data[i])); + PADDLE_ENFORCE_LT( + p_in_data[i], + depth_, + phi::errors::InvalidArgument( + "Illegal index value, Input(input) value should be less than " + "Input(depth), " + "but received input (%d) not less than depth (%d)", + p_in_data[i], + depth_)); + *(p_out_data + i * depth_ + p_in_data[i]) = 1.0; } } }; @@ -89,8 +75,7 @@ void OneHotRawKernel(const Context& dev_ctx, } phi::VisitDataType(dtype, - OneHotV2OpFunctor( - &x, out, depth, dev_ctx, allow_out_of_range)); + OneHotV2OpFunctor(&x, out, depth, dev_ctx)); } } // namespace phi diff --git a/paddle/phi/kernels/cpu/prelu_grad_kernel.cc b/paddle/phi/kernels/cpu/prelu_grad_kernel.cc index 97558cdb31f66..17be3fc897917 100644 --- a/paddle/phi/kernels/cpu/prelu_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/prelu_grad_kernel.cc @@ -24,8 +24,8 @@ void PReluGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& alpha, const DenseTensor& out_grad, - const std::string& mode, const std::string& data_format, + const std::string& mode, DenseTensor* x_grad, DenseTensor* alpha_grad) { const T* alpha_ptr = alpha.data(); diff --git a/paddle/phi/kernels/cpu/prelu_kernel.cc b/paddle/phi/kernels/cpu/prelu_kernel.cc index 8f389ab9ff459..636a3a4d750d1 100644 --- a/paddle/phi/kernels/cpu/prelu_kernel.cc +++ b/paddle/phi/kernels/cpu/prelu_kernel.cc @@ -23,8 +23,8 @@ template void PReluKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& alpha, - const std::string& mode, const std::string& data_format, + const std::string& mode, DenseTensor* out) { const T* x_ptr = x.data(); const T* alpha_ptr = alpha.data(); diff --git a/paddle/phi/kernels/cpu/strided_slice_grad_kernel.cc b/paddle/phi/kernels/cpu/strided_slice_grad_kernel.cc index cdc5534d63c08..e6c812cf6bd5a 100644 --- a/paddle/phi/kernels/cpu/strided_slice_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/strided_slice_grad_kernel.cc @@ -19,10 +19,10 @@ #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/impl/strided_slice_grad_kernel_impl.h" -PD_REGISTER_KERNEL(strided_slice_grad, +PD_REGISTER_KERNEL(strided_slice_raw_grad, CPU, ALL_LAYOUT, - 
phi::StridedSliceGradKernel, + phi::StridedSliceRawGradKernel, bool, int, int64_t, diff --git a/paddle/phi/kernels/cpu/strided_slice_kernel.cc b/paddle/phi/kernels/cpu/strided_slice_kernel.cc index f34a3301fcb42..d0aa7b2f4cee6 100644 --- a/paddle/phi/kernels/cpu/strided_slice_kernel.cc +++ b/paddle/phi/kernels/cpu/strided_slice_kernel.cc @@ -19,10 +19,10 @@ #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/impl/strided_slice_kernel_impl.h" -PD_REGISTER_KERNEL(strided_slice, +PD_REGISTER_KERNEL(strided_slice_raw, CPU, ALL_LAYOUT, - phi::StridedSliceKernel, + phi::StridedSliceRawKernel, bool, int, int64_t, diff --git a/paddle/phi/kernels/cpu/top_k_grad_kernel.cc b/paddle/phi/kernels/cpu/top_k_grad_kernel.cc index 582ee1157cce8..e44f85fb6c0fb 100644 --- a/paddle/phi/kernels/cpu/top_k_grad_kernel.cc +++ b/paddle/phi/kernels/cpu/top_k_grad_kernel.cc @@ -51,17 +51,17 @@ static void FullTopKAssign(const Type& input_height, template void TopkGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& indices, - int k, + const DenseTensor& out_grad, + const Scalar& k_scalar, int axis, bool largest, bool sorted, DenseTensor* x_grad) { const auto& in_dims = x.dims(); const auto& out_dims = indices.dims(); - + int k = k_scalar.to(); // axis < 0, get the real axis axis = (axis < 0) ? (in_dims.size() + axis) : axis; diff --git a/paddle/phi/kernels/cpu/transpose_kernel.cc b/paddle/phi/kernels/cpu/transpose_kernel.cc index a80196e7f80e1..5dc4866e1efc3 100644 --- a/paddle/phi/kernels/cpu/transpose_kernel.cc +++ b/paddle/phi/kernels/cpu/transpose_kernel.cc @@ -75,6 +75,7 @@ PD_REGISTER_KERNEL(transpose, double, int32_t, int64_t, + phi::dtype::float16, phi::dtype::bfloat16, phi::dtype::complex, phi::dtype::complex) {} diff --git a/paddle/phi/kernels/cpu/truncated_gaussian_random_kernel.cc b/paddle/phi/kernels/cpu/truncated_gaussian_random_kernel.cc index 4247e597acef4..1028008261919 100644 --- a/paddle/phi/kernels/cpu/truncated_gaussian_random_kernel.cc +++ b/paddle/phi/kernels/cpu/truncated_gaussian_random_kernel.cc @@ -21,10 +21,141 @@ #include "paddle/phi/backends/cpu/cpu_context.h" #include "paddle/phi/core/kernel_registry.h" -#include "paddle/fluid/framework/generator.h" - namespace phi { +// reference: https://gist.github.com/lakshayg/d80172fe5ae3c5d2c2aedb53c250320e +template +T Erfinv(T x) { + if (x < -1 || x > 1) { + return std::numeric_limits::quiet_NaN(); + } else if (x == 1.0) { + return std::numeric_limits::infinity(); + } else if (x == -1.0) { + return -std::numeric_limits::infinity(); + } + + const T LN2 = 6.931471805599453094172321214581e-1; + + const T A0 = 1.1975323115670912564578e0; + const T A1 = 4.7072688112383978012285e1; + const T A2 = 6.9706266534389598238465e2; + const T A3 = 4.8548868893843886794648e3; + const T A4 = 1.6235862515167575384252e4; + const T A5 = 2.3782041382114385731252e4; + const T A6 = 1.1819493347062294404278e4; + const T A7 = 8.8709406962545514830200e2; + + const T B0 = 1.0000000000000000000e0; + const T B1 = 4.2313330701600911252e1; + const T B2 = 6.8718700749205790830e2; + const T B3 = 5.3941960214247511077e3; + const T B4 = 2.1213794301586595867e4; + const T B5 = 3.9307895800092710610e4; + const T B6 = 2.8729085735721942674e4; + const T B7 = 5.2264952788528545610e3; + + const T C0 = 1.42343711074968357734e0; + const T C1 = 4.63033784615654529590e0; + const T C2 = 5.76949722146069140550e0; + const T C3 = 3.64784832476320460504e0; + const T C4 = 1.27045825245236838258e0; + const T C5 = 
2.41780725177450611770e-1; + const T C6 = 2.27238449892691845833e-2; + const T C7 = 7.74545014278341407640e-4; + + const T D0 = 1.4142135623730950488016887e0; + const T D1 = 2.9036514445419946173133295e0; + const T D2 = 2.3707661626024532365971225e0; + const T D3 = 9.7547832001787427186894837e-1; + const T D4 = 2.0945065210512749128288442e-1; + const T D5 = 2.1494160384252876777097297e-2; + const T D6 = 7.7441459065157709165577218e-4; + const T D7 = 1.4859850019840355905497876e-9; + + const T E0 = 6.65790464350110377720e0; + const T E1 = 5.46378491116411436990e0; + const T E2 = 1.78482653991729133580e0; + const T E3 = 2.96560571828504891230e-1; + const T E4 = 2.65321895265761230930e-2; + const T E5 = 1.24266094738807843860e-3; + const T E6 = 2.71155556874348757815e-5; + const T E7 = 2.01033439929228813265e-7; + + const T F0 = 1.414213562373095048801689e0; + const T F1 = 8.482908416595164588112026e-1; + const T F2 = 1.936480946950659106176712e-1; + const T F3 = 2.103693768272068968719679e-2; + const T F4 = 1.112800997078859844711555e-3; + const T F5 = 2.611088405080593625138020e-5; + const T F6 = 2.010321207683943062279931e-7; + const T F7 = 2.891024605872965461538222e-15; + + T abs_x = abs(x); + + if (abs_x <= 0.85) { + T r = 0.180625 - 0.25 * x * x; + T num = + (((((((A7 * r + A6) * r + A5) * r + A4) * r + A3) * r + A2) * r + A1) * + r + + A0); + T den = + (((((((B7 * r + B6) * r + B5) * r + B4) * r + B3) * r + B2) * r + B1) * + r + + B0); + return x * num / den; + } + + T r = sqrt(LN2 - log(1.0 - abs_x)); + + T num, den; + if (r <= 5.0) { + r = r - 1.6; + num = + (((((((C7 * r + C6) * r + C5) * r + C4) * r + C3) * r + C2) * r + C1) * + r + + C0); + den = + (((((((D7 * r + D6) * r + D5) * r + D4) * r + D3) * r + D2) * r + D1) * + r + + D0); + } else { + r = r - 5.0; + num = + (((((((E7 * r + E6) * r + E5) * r + E4) * r + E3) * r + E2) * r + E1) * + r + + E0); + den = + (((((((F7 * r + F6) * r + F5) * r + F4) * r + F3) * r + F2) * r + F1) * + r + + F0); + } + + if (x < 0) { + return -num / den; + } else { + return num / den; + } +} + +template +struct TruncatedNormal { + T mean, std; + T a_normal_cdf; + T b_normal_cdf; + TruncatedNormal(T mean, T std) : mean(mean), std(std) { + auto normal_cdf = [](T x) { + return (1.0 + std::erf(x / std::sqrt(2.0))) / 2.0; + }; + a_normal_cdf = normal_cdf(-2.0); + b_normal_cdf = normal_cdf(2.0); + } + + T operator()(T value) const { + auto p = a_normal_cdf + (b_normal_cdf - a_normal_cdf) * value; + return std::sqrt(2.0) * Erfinv(2 * p - 1) * std + mean; + } +}; + template void TruncatedGaussianRandomKernel(const Context& dev_ctx, const std::vector& shape, @@ -42,7 +173,13 @@ void TruncatedGaussianRandomKernel(const Context& dev_ctx, TruncatedNormal truncated_normal(mean, std); int64_t size = tensor->numel(); - auto engine = paddle::framework::GetCPURandomEngine(seed); + std::shared_ptr engine; + if (seed) { + engine = std::make_shared(); + engine->seed(seed); + } else { + engine = dev_ctx.GetGenerator()->GetCPUEngine(); + } for (int64_t i = 0; i < size; ++i) { data[i] = truncated_normal(dist(*engine)); } diff --git a/paddle/phi/kernels/cross_entropy_grad_kernel.h b/paddle/phi/kernels/cross_entropy_grad_kernel.h new file mode 100644 index 0000000000000..ae4b0436c93ca --- /dev/null +++ b/paddle/phi/kernels/cross_entropy_grad_kernel.h @@ -0,0 +1,33 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/phi/core/dense_tensor.h" + +namespace phi { + +template +void CrossEntropyWithSoftmaxGradKernel(const Context& dev_ctx, + const DenseTensor& label, + const DenseTensor& softmax, + const DenseTensor& loss_grad, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* logits_grad); + +} // namespace phi diff --git a/paddle/phi/kernels/cross_entropy_kernel.h b/paddle/phi/kernels/cross_entropy_kernel.h new file mode 100644 index 0000000000000..621c5f3666213 --- /dev/null +++ b/paddle/phi/kernels/cross_entropy_kernel.h @@ -0,0 +1,39 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/phi/core/dense_tensor.h" + +namespace phi { + +// The deformed product of operator iterative upgrade, there is no strict 2.0 +// API corresponding to it! 
In 2.0 API paddle.nn.functional.cross_entropy, +// use_softmax has become an optional argument, which may be called +// CrossEntropyWithSoftmax more accurately, here we keep this kernel arguments +// same as original OpMaker, and if need a CrossEntropyKernel like +// paddle.nn.functional.cross_entropy, we can reuse this kernel +template +void CrossEntropyWithSoftmaxKernel(const Context& dev_ctx, + const DenseTensor& logits, + const DenseTensor& label, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* softmax, + DenseTensor* loss); + +} // namespace phi diff --git a/paddle/phi/kernels/empty_kernel.cc b/paddle/phi/kernels/empty_kernel.cc index e547e0ea1318d..06d258a8a4e80 100644 --- a/paddle/phi/kernels/empty_kernel.cc +++ b/paddle/phi/kernels/empty_kernel.cc @@ -45,6 +45,7 @@ PD_REGISTER_KERNEL(empty, phi::EmptyKernel, float, double, + int8_t, uint8_t, int16_t, int, @@ -61,6 +62,7 @@ PD_REGISTER_KERNEL(empty_like, phi::EmptyLikeKernel, float, double, + int8_t, uint8_t, int16_t, int, @@ -80,6 +82,7 @@ PD_REGISTER_KERNEL(empty, phi::EmptyKernel, float, double, + int8_t, uint8_t, int16_t, int, @@ -95,6 +98,7 @@ PD_REGISTER_KERNEL(empty_like, phi::EmptyLikeKernel, float, double, + int8_t, uint8_t, int16_t, int, diff --git a/paddle/phi/kernels/flatten_grad_kernel.cc b/paddle/phi/kernels/flatten_grad_kernel.cc index b7b45e46cf414..83f96c1f9f521 100644 --- a/paddle/phi/kernels/flatten_grad_kernel.cc +++ b/paddle/phi/kernels/flatten_grad_kernel.cc @@ -21,8 +21,8 @@ namespace phi { template void FlattenGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& xshape, + const DenseTensor& out_grad, DenseTensor* x_grad) { auto xshape_dims = xshape.dims(); dev_ctx.Alloc(x_grad, out_grad.dtype()); diff --git a/paddle/phi/kernels/flatten_grad_kernel.h b/paddle/phi/kernels/flatten_grad_kernel.h index 3ad27b430eb72..abd120e69b2e9 100644 --- a/paddle/phi/kernels/flatten_grad_kernel.h +++ b/paddle/phi/kernels/flatten_grad_kernel.h @@ -20,8 +20,8 @@ namespace phi { template void FlattenGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& xshape, + const DenseTensor& out_grad, DenseTensor* x_grad); } // namespace phi diff --git a/paddle/phi/kernels/funcs/activation_functor.h b/paddle/phi/kernels/funcs/activation_functor.h index eee6cf5640774..84da69ed5da02 100644 --- a/paddle/phi/kernels/funcs/activation_functor.h +++ b/paddle/phi/kernels/funcs/activation_functor.h @@ -1878,12 +1878,17 @@ struct CudaCosGradFunctor : public BaseActivationFunctor { template struct CudaExpFunctor : public BaseActivationFunctor { - using MPType = typename phi::dtype::MPTypeTrait::Type; + // exp(x) = expf(x) + __device__ __forceinline__ T operator()(const T x) const { + return static_cast(expf(static_cast(x))); + } +}; +template <> +struct CudaExpFunctor : public BaseActivationFunctor { // exp(x) = exp(x) - __device__ __forceinline__ T operator()(const T arg_x) const { - MPType x = static_cast(arg_x); - return static_cast(exp(x)); + __device__ __forceinline__ double operator()(const double x) const { + return exp(x); } }; diff --git a/paddle/phi/kernels/funcs/aligned_vector.h b/paddle/phi/kernels/funcs/aligned_vector.h index 9382b03cf9368..d71a61f107a7a 100644 --- a/paddle/phi/kernels/funcs/aligned_vector.h +++ b/paddle/phi/kernels/funcs/aligned_vector.h @@ -13,7 +13,7 @@ See the License for the specific language governing permissions and limitations under the License. 
*/ #pragma once - +#include #include "paddle/phi/core/hostdevice.h" namespace phi { diff --git a/paddle/phi/kernels/funcs/elementwise_grad_base.h b/paddle/phi/kernels/funcs/elementwise_grad_base.h index 23b8388c74589..1021b510b26cd 100644 --- a/paddle/phi/kernels/funcs/elementwise_grad_base.h +++ b/paddle/phi/kernels/funcs/elementwise_grad_base.h @@ -24,6 +24,7 @@ limitations under the License. */ // See Note [ Why still include the fluid headers? ] #include "paddle/fluid/memory/memcpy.h" #include "paddle/fluid/platform/device/gpu/gpu_device_function.h" +#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h" #include "paddle/phi/kernels/primitive/kernel_primitives.h" #endif @@ -49,14 +50,6 @@ namespace phi { namespace funcs { using DDim = phi::DDim; -template -void LimitGridDim(const GPUContext &ctx, T *grid_dim) { - auto max_grid_dim = ctx.GetCUDAMaxGridDimSize()[0]; - if (*grid_dim > max_grid_dim) { - *grid_dim = max_grid_dim; - } -} - template void CommonGradBroadcastCPU(const DenseTensor &x, const DenseTensor &y, @@ -978,17 +971,17 @@ static void ElemwiseGradBroadcast1CUDA(gpuStream_t stream, constexpr int half_walf = 16; if (w < half_walf || h < half_walf) { int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, h); - int gird_size = w; - ElemwiseGradBroadcast1CUDAKernel<<>>( + int grid_size = w; + ElemwiseGradBroadcast1CUDAKernel<<>>( x, y, out, dout, h, w, is_xsize_larger, dx_op, dy_op, dx, dy); } else { // suppose perfoemance improves with h increased. dim3 block_size = dim3(BLOCK_X, BLOCK_Y); - int grid_size = (w + BLOCK_X - 1) / BLOCK_X; + dim3 grid_size = dim3((w + BLOCK_X - 1) / BLOCK_X); auto gplace = phi::GPUPlace(); auto *ctx = static_cast( paddle::platform::DeviceContextPool::Instance().Get(gplace)); - LimitGridDim(*ctx, &grid_size); + paddle::platform::LimitGridDim(*ctx, &grid_size); FastElemwiseGradBroadcast1CUDAKernel<<>>( x, y, out, dout, h, w, is_xsize_larger, dx_op, dy_op, dx, dy); } @@ -1009,13 +1002,12 @@ static void ElemwiseGradBroadcast2CUDA(gpuStream_t stream, T *dx, T *dy) { int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, pre * post); - int gird_size = n; - int grid_size = n; + dim3 grid_size = dim3(n); auto gplace = phi::GPUPlace(); auto *ctx = static_cast( paddle::platform::DeviceContextPool::Instance().Get(gplace)); - LimitGridDim(*ctx, &grid_size); - ElemwiseGradBroadcast2CUDAKernel<<>>( + paddle::platform::LimitGridDim(*ctx, &grid_size); + ElemwiseGradBroadcast2CUDAKernel<<>>( x, y, out, dout, pre, n, post, is_xsize_larger, dx_op, dy_op, dx, dy); } @@ -1216,8 +1208,8 @@ void CommonGradBroadcastCUDA(const DenseTensor &x, is_y); } else { dim3 block_size = dim3(BLOCK_X, BLOCK_Y); - int grid_size = (w + BLOCK_X - 1) / BLOCK_X; - LimitGridDim(ctx, &grid_size); + dim3 grid_size = dim3((w + BLOCK_X - 1) / BLOCK_X); + paddle::platform::LimitGridDim(ctx, &grid_size); FastCommonGradBroadcastCUDAKernelHeight<<>>( x_data, @@ -1392,8 +1384,8 @@ void CommonGradBroadcastCUDA(const DenseTensor &x, 1, std::multiplies()); int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, mid); - int grid_size = pre * post; - LimitGridDim(ctx, &grid_size); + dim3 grid_size = dim3(pre * post); + paddle::platform::LimitGridDim(ctx, &grid_size); // we need to calc y offset with blockid, so do x_pre/y_pre to get left // size. 
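For reference, a minimal sketch of the grid-clamping behaviour that the shared paddle::platform::LimitGridDim helper is relied on for above, mirroring the local helpers this change removes: each component of the launch grid is capped at the device's maximum grid size. The Dim3 and MaxGridDim types below are illustrative stand-ins, not the real CUDA/phi types.

struct Dim3 { unsigned int x, y, z; };        // illustrative stand-in for CUDA's dim3
struct MaxGridDim { unsigned int x, y, z; };  // e.g. queried from the device properties

// Cap each launch-grid component at the device maximum to avoid grid-size
// overflow on launch; kernels that loop with a grid-stride pattern still
// cover the full element range.
inline void LimitGridDimSketch(const MaxGridDim& max_grid, Dim3* grid) {
  if (grid->x > max_grid.x) grid->x = max_grid.x;
  if (grid->y > max_grid.y) grid->y = max_grid.y;
  if (grid->z > max_grid.z) grid->z = max_grid.z;
}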
if (k_pre != pre) k_pre = pre / k_pre; @@ -1423,8 +1415,8 @@ void CommonGradBroadcastCUDA(const DenseTensor &x, 1, std::multiplies()); int block_size = std::min(ELEMWISE_MAX_BLOCK_DIM, mid); - int grid_size = pre * post; - LimitGridDim(ctx, &grid_size); + dim3 grid_size = dim3(pre * post); + paddle::platform::LimitGridDim(ctx, &grid_size); if (k_pre != pre) k_pre = pre / k_pre; FastCommonGradBroadcastOneCUDAKernel<< +HOSTDEVICE inline T CubicConvolution1(T x, T A) { + return ((A + 2) * x - (A + 3)) * x * x + 1; +} + +template +HOSTDEVICE inline T CubicConvolution2(T x, T A) { + return ((A * x - 5 * A) * x + 8 * A) * x - 4 * A; +} + +template +HOSTDEVICE inline void get_cubic_upsample_coefficients(T coeffs[4], T t) { + T A = -0.75; + + T x1 = t; + coeffs[0] = CubicConvolution2(x1 + 1.0, A); + coeffs[1] = CubicConvolution1(x1, A); + + // opposite coefficients + T x2 = 1.0 - t; + coeffs[2] = CubicConvolution1(x2, A); + coeffs[3] = CubicConvolution2(x2 + 1.0, A); +} + +inline void ExtractNCDWH(const DDim& dims, + const DataLayout& data_layout, + int* N, + int* C, + int* D, + int* H, + int* W) { + *N = dims[0]; + + if (dims.size() == 3) { + *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[2]; + *D = 1; + *H = 1; + *W = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; + } else if (dims.size() == 4) { + *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[3]; + *D = 1; + *H = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; + *W = data_layout == DataLayout::kNCHW ? dims[3] : dims[2]; + } else { + *C = data_layout == DataLayout::kNCHW ? dims[1] : dims[4]; + *D = data_layout == DataLayout::kNCHW ? dims[2] : dims[1]; + *H = data_layout == DataLayout::kNCHW ? dims[3] : dims[2]; + *W = data_layout == DataLayout::kNCHW ? dims[4] : dims[3]; + } +} + +inline std::vector get_new_shape( + const std::vector& list_new_shape_tensor) { + // get tensor from + std::vector vec_new_shape; + for (size_t i = 0; i < list_new_shape_tensor.size(); ++i) { + auto tensor = list_new_shape_tensor[i]; + PADDLE_ENFORCE_EQ( + tensor->dims(), + phi::make_ddim({1}), + errors::InvalidArgument("The shape of dimension tensor should be [1]," + "but received d%.", + tensor->dims())); + if (paddle::platform::is_gpu_place(tensor->place())) { + DenseTensor temp; + paddle::framework::TensorCopySync( + *tensor, paddle::platform::CPUPlace(), &temp); + vec_new_shape.push_back(static_cast(*temp.data())); + } else { + vec_new_shape.push_back(static_cast(*tensor->data())); + } + } + + return vec_new_shape; +} + +template +inline std::vector get_new_data_from_tensor( + const DenseTensor* new_data_tensor) { + std::vector vec_new_data; + auto* new_data = new_data_tensor->data(); + DenseTensor cpu_starts_tensor; + if (paddle::platform::is_gpu_place(new_data_tensor->place())) { + paddle::framework::TensorCopySync( + *new_data_tensor, paddle::platform::CPUPlace(), &cpu_starts_tensor); + new_data = cpu_starts_tensor.data(); + } +#ifdef PADDLE_WITH_ASCEND_CL + if (paddle::platform::is_npu_place(new_data_tensor->place())) { + paddle::framework::TensorCopySync( + *new_data_tensor, paddle::platform::CPUPlace(), &cpu_starts_tensor); + new_data = cpu_starts_tensor.data(); + } +#endif +#ifdef PADDLE_WITH_XPU + if (paddle::platform::is_xpu_place(new_data_tensor->place())) { + paddle::framework::TensorCopySync( + *new_data_tensor, paddle::platform::CPUPlace(), &cpu_starts_tensor); + new_data = cpu_starts_tensor.data(); + } +#endif + vec_new_data = std::vector(new_data, new_data + new_data_tensor->numel()); + return vec_new_data; +} + +#if 
defined(__NVCC__) || defined(__HIPCC__) +using paddle::platform::FastDivMod; + +struct FastDivModForInterpolate { + public: + FastDivMod channels_div; + FastDivMod output_w_div; + FastDivMod output_wc_div; + + explicit HOSTDEVICE FastDivModForInterpolate(const int channels, + const int output_w, + const int outout_wc) + : channels_div(FastDivMod(channels)), + output_w_div(FastDivMod(output_w)), + output_wc_div(FastDivMod(outout_wc)) {} +}; + +#endif + +} // namespace funcs +} // namespace phi diff --git a/paddle/phi/kernels/funcs/reduce_function.h b/paddle/phi/kernels/funcs/reduce_function.h index 0ee668c9ac1d9..a2d7fd9544dee 100644 --- a/paddle/phi/kernels/funcs/reduce_function.h +++ b/paddle/phi/kernels/funcs/reduce_function.h @@ -33,6 +33,7 @@ namespace cub = hipcub; #endif #include "paddle/fluid/platform/device/gpu/gpu_device_function.h" +#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h" #include "paddle/fluid/platform/fast_divmod.h" #include "paddle/phi/api/ext/dispatch.h" #include "paddle/phi/backends/gpu/gpu_context.h" @@ -309,7 +310,7 @@ struct ReduceConfig { : reduce_dims_origin(origin_reduce_dims), x_dim(origin_x_dim) {} // get the parameters of reduceKernel - void Run(const paddle::platform::Place& place) { + void Run(const KPDevice& dev_ctx) { // step1: update the reduce_dim left_dim and x_dim SetReduceDim(); @@ -323,7 +324,7 @@ struct ReduceConfig { SetBlockDim(); // step5: limit the grid to prevent thead overflow - LimitGridDim(place); + paddle::platform::LimitGridDim(dev_ctx, &grid); } // when should_reduce_again is true, we need malloc temp space for temp data @@ -607,15 +608,6 @@ struct ReduceConfig { grid = grid_dim; } - void LimitGridDim(const paddle::platform::Place& place) { - auto* ctx = static_cast( - paddle::platform::DeviceContextPool::Instance().Get(place)); - std::array max_grid_dim = ctx->GetCUDAMaxGridDimSize(); - grid.x = grid.x < max_grid_dim[0] ? grid.x : max_grid_dim[0]; - grid.y = grid.y < max_grid_dim[1] ? grid.y : max_grid_dim[1]; - grid.z = grid.z < max_grid_dim[2] ? 
grid.z : max_grid_dim[2]; - } - public: std::vector reduce_dims_origin; std::vector reduce_dim; @@ -808,7 +800,7 @@ __global__ void ReduceHigherDimKernel(const Tx* x, 1, 1, left_num); - kps::ElementwiseUnary( + kps::ElementwiseUnary( &reduce_compute, &reduce_input, transformer); kps::Reduce( + kps::ElementwiseUnary( &reduce_compute, &reduce_input, transformer); kps::Reduce(x.dims()); auto config = ReduceConfig(origin_reduce_dims, x_dim); - config.Run(x.place()); + config.Run(dev_ctx); int numel = x.numel(); // after config.run() // SetOutputData for ReduceHigherDim when should_reduce_again is true, diff --git a/paddle/phi/kernels/funcs/sparse/convolution.h b/paddle/phi/kernels/funcs/sparse/convolution.h index 19f1f3d3cd2fa..f3caa2a62f4a8 100644 --- a/paddle/phi/kernels/funcs/sparse/convolution.h +++ b/paddle/phi/kernels/funcs/sparse/convolution.h @@ -33,28 +33,30 @@ struct Dims4D { }; // Judge whether the current position x is in (lower, upper) -inline HOSTDEVICE bool Check(const int& x, +template +inline HOSTDEVICE bool Check(const IntT& x, const int& kx, const int& pad, const int& stride, const int dilation, const int kdim, const int xdim) { - const int lower = x - dilation * kx + pad; - const int uper = x + (kdim - kx - 1) * dilation - pad; + const IntT lower = x - dilation * kx + pad; + const IntT uper = x + (kdim - kx - 1) * dilation - pad; return (lower >= 0 && lower % stride == 0 && uper < xdim); } // Check whether the current position(x, y, z) is legal: // Judge the minimum and maximum values at each latitude +template inline HOSTDEVICE bool Check(const Dims4D& dims, const Dims4D& kernel_dims, const Dims4D& paddings, const Dims4D& dilations, const Dims4D& strides, - const int x, - const int y, - const int z, + const IntT x, + const IntT y, + const IntT z, const int kx, const int ky, const int kz) { @@ -67,22 +69,22 @@ inline HOSTDEVICE bool Check(const Dims4D& dims, return (x_valid && y_valid && z_valid); } -template -inline HOSTDEVICE int PointToIndex(const int& batch, - const int& x, - const int& y, - const int& z, - const Dim& dims) { +template +inline HOSTDEVICE IntT PointToIndex(const IntT& batch, + const IntT& x, + const IntT& y, + const IntT& z, + const Dim& dims) { return batch * dims[1] * dims[2] * dims[3] + z * dims[2] * dims[3] + y * dims[3] + x; } // TODO(zhangkaihuo): use division and multiply to optimize // modulo operation -template +template inline HOSTDEVICE void IndexToPoint( - const int index, const Dim& dims, int* batch, int* x, int* y, int* z) { - int n = index; + const IntT index, const Dim& dims, IntT* batch, IntT* x, IntT* y, IntT* z) { + IntT n = index; *x = n % dims[3]; n /= dims[3]; *y = n % dims[2]; @@ -176,8 +178,9 @@ inline const std::vector PoolResetKernel( return res; } -inline void PrefixSum(const int* counter, int* offsets, const int n) { - int offset = 0; +template +inline void PrefixSum(const T* counter, T* offsets, const int n) { + T offset = 0; for (int i = 0; i < n; i++) { offsets[i] = offset; offset += counter[i]; diff --git a/paddle/phi/kernels/gpu/accuracy_kernel.cu b/paddle/phi/kernels/gpu/accuracy_kernel.cu index f08fb74e54d8c..5eecfce093248 100644 --- a/paddle/phi/kernels/gpu/accuracy_kernel.cu +++ b/paddle/phi/kernels/gpu/accuracy_kernel.cu @@ -114,4 +114,7 @@ PD_REGISTER_KERNEL(accuracy, phi::AccuracyRawKernel, phi::dtype::float16, float, - double) {} + double) { + kernel->InputAt(1).SetDataType(phi::DataType::INT64); + kernel->InputAt(2).SetDataType(phi::DataType::INT64); +} diff --git a/paddle/phi/kernels/gpu/allclose_kernel.cu 
b/paddle/phi/kernels/gpu/allclose_kernel.cu index af2612bb10c9f..8abc6b272c511 100644 --- a/paddle/phi/kernels/gpu/allclose_kernel.cu +++ b/paddle/phi/kernels/gpu/allclose_kernel.cu @@ -51,21 +51,28 @@ void AllCloseKernel(const Context& dev_ctx, const Scalar& atol, bool equal_nan, DenseTensor* out) { - PADDLE_ENFORCE_EQ( - rtol.dtype(), - DataType::FLOAT64, - phi::errors::InvalidArgument( - "Input (Rtol) type must be double, but get %s.", rtol.dtype())); - PADDLE_ENFORCE_EQ( - atol.dtype(), - DataType::FLOAT64, - phi::errors::InvalidArgument( - "Input (Atol) type must be double, but get %s.", atol.dtype())); - + double rtol_v, atol_v; + if (rtol.dtype() == DataType::FLOAT64) { + rtol_v = rtol.to(); + } else if (rtol.dtype() == DataType::FLOAT32) { + rtol_v = rtol.to(); + } else { + PADDLE_THROW(phi::errors::InvalidArgument( + "Input (Rtol) type must be double or float, but get %s.", + rtol.dtype())); + } + if (atol.dtype() == DataType::FLOAT64) { + atol_v = atol.to(); + } else if (atol.dtype() == DataType::FLOAT32) { + atol_v = atol.to(); + } else { + PADDLE_THROW(phi::errors::InvalidArgument( + "Input (Atol) type must be double or float, but get %s.", + atol.dtype())); + } + VLOG(3) << "rtol and atol is : " << rtol_v << " " << atol_v; const T* in_data = x.data(); const T* other_data = y.data(); - auto rtol_v = rtol.to(); - auto atol_v = atol.to(); bool* out_data = dev_ctx.template Alloc(out); int num = x.numel(); diff --git a/paddle/phi/kernels/gpu/range_kernel.cu b/paddle/phi/kernels/gpu/arange_kernel.cu similarity index 59% rename from paddle/phi/kernels/gpu/range_kernel.cu rename to paddle/phi/kernels/gpu/arange_kernel.cu index 65d9b45efbcdd..9ea0d7c5393c3 100644 --- a/paddle/phi/kernels/gpu/range_kernel.cu +++ b/paddle/phi/kernels/gpu/arange_kernel.cu @@ -12,7 +12,7 @@ // See the License for the specific language governing permissions and // limitations under the License. 
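The allclose change above stops requiring double-typed tolerances and instead accepts rtol and atol as either float32 or float64, widening both to double before the comparison. A hedged, self-contained sketch of that widening pattern, with std::variant standing in for phi::Scalar (illustrative only, not the phi API):

#include <variant>

// Stand-in for phi::Scalar restricted to the two dtypes the kernel accepts.
using ToleranceScalar = std::variant<float, double>;

// Widen whichever floating type was supplied to double, so the elementwise
// check |x - y| <= atol + rtol * |y| is always evaluated in double precision.
inline double ToDouble(const ToleranceScalar& s) {
  return std::visit([](auto v) { return static_cast<double>(v); }, s);
}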
-#include "paddle/phi/kernels/range_kernel.h" +#include "paddle/phi/kernels/arange_kernel.h" #include "paddle/phi/backends/gpu/gpu_context.h" #include "paddle/phi/core/kernel_registry.h" @@ -21,20 +21,33 @@ namespace phi { +template +inline T GetValue(const Context& dev_ctx, const DenseTensor& x) { + T value = static_cast(0); + if (x.place() != CPUPlace()) { + DenseTensor cpu_x; + Copy(dev_ctx, x, CPUPlace(), true, &cpu_x); + value = cpu_x.data()[0]; + } else { + value = x.data()[0]; + } + return value; +} + template __global__ void Range(T start, T step, int64_t size, T* out) { CUDA_KERNEL_LOOP(index, size) { out[index] = start + step * index; } } template -void RangeKernel(const Context& dev_ctx, - const DenseTensor& start, - const DenseTensor& end, - const DenseTensor& step, - DenseTensor* out) { - T start_value = start.data()[0]; - T end_value = end.data()[0]; - T step_value = step.data()[0]; +void ArangeKernel(const Context& dev_ctx, + const DenseTensor& start, + const DenseTensor& end, + const DenseTensor& step, + DenseTensor* out) { + T start_value = GetValue(dev_ctx, start); + T end_value = GetValue(dev_ctx, end); + T step_value = GetValue(dev_ctx, step); int64_t size = 0; phi::funcs::GetSize(start_value, end_value, step_value, &size); @@ -50,8 +63,8 @@ void RangeKernel(const Context& dev_ctx, } // namespace phi PD_REGISTER_KERNEL( - range, GPU, ALL_LAYOUT, phi::RangeKernel, float, double, int64_t, int) { - kernel->InputAt(0).SetBackend(phi::Backend::CPU); - kernel->InputAt(1).SetBackend(phi::Backend::CPU); - kernel->InputAt(2).SetBackend(phi::Backend::CPU); + arange, GPU, ALL_LAYOUT, phi::ArangeKernel, float, double, int64_t, int) { + kernel->InputAt(0).SetBackend(phi::Backend::ALL_BACKEND); + kernel->InputAt(1).SetBackend(phi::Backend::ALL_BACKEND); + kernel->InputAt(2).SetBackend(phi::Backend::ALL_BACKEND); } diff --git a/paddle/phi/kernels/gpu/batch_norm_grad_kernel.cu b/paddle/phi/kernels/gpu/batch_norm_grad_kernel.cu index 339c3536d7a7f..09bce3c9895b3 100644 --- a/paddle/phi/kernels/gpu/batch_norm_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/batch_norm_grad_kernel.cu @@ -306,15 +306,15 @@ static __global__ LAUNCH_BOUNDS(BlockDim) void BNBackwardData( template void BatchNormGradRawKernel(const Context &ctx, - const DenseTensor &y_grad, const DenseTensor &x, const DenseTensor &scale, const DenseTensor &bias, + paddle::optional mean, + paddle::optional variance, const DenseTensor &saved_mean, const DenseTensor &saved_variance, paddle::optional reserve_space, - paddle::optional mean, - paddle::optional variance, + const DenseTensor &y_grad, float momentum, float epsilon_f, const std::string &data_layout_str, @@ -863,15 +863,15 @@ void BatchNormGradRawKernel(const Context &ctx, template void BatchNormGradKernel(const Context &dev_ctx, - const DenseTensor &y_grad, const DenseTensor &x, const DenseTensor &scale, const DenseTensor &bias, + paddle::optional mean, + paddle::optional variance, const DenseTensor &saved_mean, const DenseTensor &saved_variance, paddle::optional reserve_space, - paddle::optional mean, - paddle::optional variance, + const DenseTensor &y_grad, float momentum, float epsilon, const std::string &data_layout, @@ -883,15 +883,15 @@ void BatchNormGradKernel(const Context &dev_ctx, DenseTensor *scale_grad, DenseTensor *bias_grad) { BatchNormGradRawKernel(dev_ctx, - y_grad, x, scale, bias, + mean, + variance, saved_mean, saved_variance, reserve_space, - mean, - variance, + y_grad, momentum, epsilon, data_layout, diff --git a/paddle/phi/kernels/gpu/bernoulli_kernel.cu 
b/paddle/phi/kernels/gpu/bernoulli_kernel.cu index 79d8a7b0f3444..edcf29e2d88d3 100644 --- a/paddle/phi/kernels/gpu/bernoulli_kernel.cu +++ b/paddle/phi/kernels/gpu/bernoulli_kernel.cu @@ -14,8 +14,6 @@ #include "paddle/phi/kernels/bernoulli_kernel.h" -#include -#include #ifdef __NVCC__ #include #endif @@ -32,35 +30,8 @@ #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/funcs/distribution_helper.h" -// See Note [ Why still include the fluid headers? ] -#include "paddle/fluid/platform/transform.h" - -DECLARE_bool(use_curand); - namespace phi { -template -struct BernoulliCudaFunctor { - unsigned int seed_; - unsigned int offset_; - __host__ __device__ BernoulliCudaFunctor(unsigned int seed, - unsigned int offset) - : seed_(seed), offset_(offset) {} - - __host__ __device__ T operator()(const unsigned int n, const T p) const { - // NOTE(zhiqiu): currently, PADDLE_ENFORCE in cuda kernel may print several - // lines of error messages if, and it should be refined. - PADDLE_ENFORCE(p >= 0.0 && p <= 1.0, - "The probability should be >=0 and <= 1, but got %f", - p); - thrust::minstd_rand rng; - rng.seed(seed_); - thrust::uniform_real_distribution dist(0.0, 1.0); - rng.discard(n + offset_); - return static_cast(dist(rng) < p); - } -}; - // 'curand_uniform4/hiprand_uniform4' generate 4 random number each time template __global__ void bernoulli_cuda_kernel( @@ -100,30 +71,16 @@ void BernoulliKernel(const Context& ctx, auto gen_cuda = ctx.GetGenerator(); - if (FLAGS_use_curand) { - auto seed_offset = gen_cuda->IncrementOffset(12); - uint64_t seed = seed_offset.first; - uint64_t offset = seed_offset.second; + auto seed_offset = gen_cuda->IncrementOffset(12); + uint64_t seed = seed_offset.first; + uint64_t offset = seed_offset.second; - auto gpu_config = phi::backends::gpu::GetGpuLaunchConfig1D(ctx, numel, 4); - size_t grid_size = gpu_config.GetGridSize(); - size_t block_size = gpu_config.GetBlockSize(); + auto gpu_config = phi::backends::gpu::GetGpuLaunchConfig1D(ctx, numel, 4); + size_t grid_size = gpu_config.GetGridSize(); + size_t block_size = gpu_config.GetBlockSize(); - bernoulli_cuda_kernel<<>>( - numel, seed, offset, x_data, out_data); - } else { - auto seed_offset = gen_cuda->IncrementOffset(1); - int64_t gen_offset = numel * seed_offset.second; - paddle::platform::Transform trans; - thrust::counting_iterator index_sequence_begin(0); - trans(ctx, - index_sequence_begin, - index_sequence_begin + numel, - x_data, - out_data, - BernoulliCudaFunctor(static_cast(seed_offset.first), - static_cast(gen_offset))); - } + bernoulli_cuda_kernel<<>>( + numel, seed, offset, x_data, out_data); } } // namespace phi diff --git a/paddle/phi/kernels/gpu/cast_kernel.cu b/paddle/phi/kernels/gpu/cast_kernel.cu index 7c4cadbc90ac6..40a84648e4b16 100644 --- a/paddle/phi/kernels/gpu/cast_kernel.cu +++ b/paddle/phi/kernels/gpu/cast_kernel.cu @@ -41,6 +41,7 @@ void CastKernel(const Context& dev_ctx, int64_t, \ int16_t, \ bool, \ + int8_t, \ uint8_t, \ phi::dtype::float16, \ phi::dtype::complex, \ diff --git a/paddle/phi/kernels/gpu/clip_grad_kernel.cu b/paddle/phi/kernels/gpu/clip_grad_kernel.cu new file mode 100644 index 0000000000000..b76086be64887 --- /dev/null +++ b/paddle/phi/kernels/gpu/clip_grad_kernel.cu @@ -0,0 +1,28 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/clip_grad_kernel.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/impl/clip_grad_kernel_impl.h" + +PD_REGISTER_KERNEL(clip_grad, + GPU, + ALL_LAYOUT, + phi::ClipGradKernel, + float, + double, + int, + int64_t) {} diff --git a/paddle/phi/kernels/gpu/clip_kernel.cu b/paddle/phi/kernels/gpu/clip_kernel.cu new file mode 100644 index 0000000000000..9e0050db7fdbf --- /dev/null +++ b/paddle/phi/kernels/gpu/clip_kernel.cu @@ -0,0 +1,30 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/clip_kernel.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/common/float16.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/impl/clip_kernel_impl.h" + +PD_REGISTER_KERNEL(clip, + GPU, + ALL_LAYOUT, + phi::ClipKernel, + float, + double, + int, + int64_t, + phi::dtype::float16) {} diff --git a/paddle/phi/kernels/gpu/concat_grad_kernel.cu b/paddle/phi/kernels/gpu/concat_grad_kernel.cu new file mode 100644 index 0000000000000..2445978daca46 --- /dev/null +++ b/paddle/phi/kernels/gpu/concat_grad_kernel.cu @@ -0,0 +1,37 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
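The clip and clip_grad registrations above only pull in impl headers that are not part of this diff. Assuming the usual clip semantics (clamp into [low, high] in the forward pass, pass-through gradient inside the range), a minimal CPU sketch for orientation; the names below are illustrative, not the phi implementation:

#include <algorithm>
#include <cstddef>

// Forward: clamp every element of x into [low, high].
template <typename T>
void ClipForwardSketch(const T* x, std::size_t n, T low, T high, T* out) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = std::min(std::max(x[i], low), high);
  }
}

// Backward (assumed convention): the gradient passes through only where the
// input lay strictly inside the range; clamped positions receive zero gradient.
template <typename T>
void ClipBackwardSketch(
    const T* x, const T* dout, std::size_t n, T low, T high, T* dx) {
  for (std::size_t i = 0; i < n; ++i) {
    dx[i] = (x[i] > low && x[i] < high) ? dout[i] : static_cast<T>(0);
  }
}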
+ +#include "paddle/phi/kernels/concat_grad_kernel.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/common/bfloat16.h" +#include "paddle/phi/common/complex.h" +#include "paddle/phi/common/float16.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/impl/concat_grad_kernel_impl.h" + +PD_REGISTER_KERNEL(concat_grad, + GPU, + ALL_LAYOUT, + phi::ConcatGradKernel, + float, + double, + bool, + int64_t, + int, + uint8_t, + phi::dtype::float16, + phi::dtype::bfloat16, + phi::dtype::complex, + phi::dtype::complex) {} diff --git a/paddle/phi/kernels/gpu/conv_grad_kernel.cu b/paddle/phi/kernels/gpu/conv_grad_kernel.cu index 4df7bb26adf84..677ec4a0620af 100644 --- a/paddle/phi/kernels/gpu/conv_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/conv_grad_kernel.cu @@ -22,9 +22,9 @@ namespace phi { template void Conv3DGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, @@ -37,9 +37,9 @@ void Conv3DGradKernel(const Context& dev_ctx, DenseTensor* input_grad, DenseTensor* filter_grad) { ConvGradKernel(dev_ctx, - out_grad, input, filter, + out_grad, strides, paddings, paddding_algorithm, diff --git a/paddle/phi/kernels/gpu/copy_kernel.cu b/paddle/phi/kernels/gpu/copy_kernel.cu index 28dc6f196d168..16eff5b26e38a 100644 --- a/paddle/phi/kernels/gpu/copy_kernel.cu +++ b/paddle/phi/kernels/gpu/copy_kernel.cu @@ -48,7 +48,8 @@ void Copy(const Context& dev_ctx, // dev_ctx can not alloc pinned memory now dst_ptr = dst->mutable_data(dst_place, src.dtype()); } else { - dst_ptr = dev_ctx.Alloc(dst, src.dtype()); + dst_ptr = dev_ctx.Alloc( + dst, src.dtype(), 0, paddle::platform::is_cuda_pinned_place(dst_place)); } if (src_ptr == dst_ptr && src_place == dst_place) { @@ -151,6 +152,30 @@ void Copy(const Context& dev_ctx, "Context place dose not match the source and destination place.")); } } + } else if (paddle::platform::is_gpu_place(src_place) && // NOLINT + paddle::platform::is_cuda_pinned_place(dst_place)) { + auto src_gpu_place = src_place; + auto dst_cuda_pinned_place = dst_place; + auto ctx_place = dev_ctx.GetPlace(); + PADDLE_ENFORCE_EQ( + paddle::platform::is_gpu_place(ctx_place), + true, + phi::errors::PreconditionNotMet( + "Context place error, excepted GPUPlace, but actually %s.", + ctx_place)); + auto ctx_gpu_place = ctx_place; + PADDLE_ENFORCE_EQ(src_gpu_place, + ctx_gpu_place, + phi::errors::Unavailable( + "Source place and context place do not match, source " + "place is %s, context place is %s.", + src_gpu_place, + ctx_gpu_place)); + auto stream = + blocking ? nullptr + : reinterpret_cast(dev_ctx).stream(); + paddle::memory::Copy( + dst_cuda_pinned_place, dst_ptr, src_gpu_place, src_ptr, size, stream); } else { PADDLE_THROW(phi::errors::InvalidArgument( "Place type error. Please check the place of src and dst Tensor.")); diff --git a/paddle/phi/kernels/gpu/cross_entropy_grad_kernel.cu b/paddle/phi/kernels/gpu/cross_entropy_grad_kernel.cu new file mode 100644 index 0000000000000..215b94c52b395 --- /dev/null +++ b/paddle/phi/kernels/gpu/cross_entropy_grad_kernel.cu @@ -0,0 +1,294 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/kernels/cross_entropy_grad_kernel.h" + +#ifdef __NVCC__ +#include "cub/cub.cuh" +#endif +#ifdef __HIPCC__ +#include +namespace cub = hipcub; +#endif + +#include "paddle/phi/common/amp_type_traits.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" +#include "paddle/phi/kernels/funcs/for_range.h" +#include "paddle/phi/kernels/funcs/math_function.h" +#include "paddle/phi/kernels/gpudnn/softmax_gpudnn.h" + +// TODO(chenweihang): move dispatch.h into phi/core +#include "paddle/phi/api/ext/dispatch.h" + +#include "paddle/fluid/operators/math/cross_entropy.h" +#include "paddle/fluid/operators/math/softmax.h" +#include "paddle/fluid/platform/device/gpu/gpu_device_function.h" +#include "paddle/fluid/platform/device/gpu/gpu_dnn.h" + +namespace phi { + +template +__global__ void SoftLabelCrossEntropyGradientKernel(T* logit_grad, + const T* loss_grad, + const T* labels, + const int n, + const int d, + const int remain) { + int ids = blockIdx.x * blockDim.x + threadIdx.x; + if (ids < n * d) { + int idx_n = ids / d; + int idx_remain = ids % remain; + int idx_loss = idx_n * remain + idx_remain; + logit_grad[ids] = loss_grad[idx_loss] * (-labels[ids] / logit_grad[ids]); + } +} + +template +__global__ void HardLabelCrossEntropyGradientKernel(T* logit_grad, + const LabelT* labels, + const int n, + const int d, + const int remain, + const int ignore_index) { + CUDA_KERNEL_LOOP(index, n * remain) { + int idx_n = index / remain; + int idx_remain = index % remain; + int tmp = static_cast(labels[index]); + int idx = idx_n * d + tmp * remain + idx_remain; + if (ignore_index != tmp) { + logit_grad[idx] = -static_cast(1.) / logit_grad[idx]; + } + } +} + +template +__global__ void ScaleCrossEntropyGradient(T* logit_grad, + const T* loss_grad, + const int num, + const int d, + const int remain, + const LabelT* labels, + const int ignore_index) { + CUDA_KERNEL_LOOP(index, num) { + int idx_n = index / d; + int idx_remain = index % remain; + int idx_lbl = idx_n * remain + idx_remain; + int k = (index % d) / remain; + auto lbl = static_cast(labels[idx_lbl]); + if (lbl == ignore_index || lbl != k) { + logit_grad[index] = static_cast(0.); + } else { + logit_grad[index] *= loss_grad[idx_lbl]; + } + } +} + +template +__global__ void SoftCrossEntropyGradientKernel(T* logit_grad, + const T* loss_grad, + const T* labels, + const int64_t n, + const int64_t d, + const int64_t remain) { + int64_t ids = blockIdx.x * blockDim.x + threadIdx.x; + if (ids < n * d) { + int64_t idx_n = ids / d; + int64_t idx_remain = ids % remain; + int64_t idx_loss = idx_n * remain + idx_remain; + logit_grad[ids] = loss_grad[idx_loss] * (logit_grad[ids] - labels[ids]); + } +} + +/* + Wrapper of softmax with cross entropy grad hard label. 
+*/ +template +__global__ void SoftmaxWithCrossEntropyGradHardLabel(T* logits_grad, + const T* loss_grad, + const T* softmax, + const LabelT* labels, + const int64_t n, + const int64_t dim, + const int64_t d, + const int ignore_index) { + int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; + int64_t idx_n = idx / (d * dim); + int64_t idx_dim = (idx / d) % dim; + int64_t idx_d = idx % d; + int64_t ids = idx_n * d + idx_d; + + if (idx < n * dim * d) { + auto lbl = static_cast(labels[ids]); + if (lbl == ignore_index) { + logits_grad[idx] = static_cast(0.0); + } else if (lbl == idx_dim) { + logits_grad[idx] = (softmax[idx] - static_cast(1.0)) * loss_grad[ids]; + } else { + logits_grad[idx] = softmax[idx] * loss_grad[ids]; + } + } +} + +template +void CrossEntropyWithSoftmaxGradGPUKernel(const GPUContext& dev_ctx, + const DenseTensor& label, + const DenseTensor& softmax, + const DenseTensor& loss_grad, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* logits_grad) { + PADDLE_ENFORCE_EQ( + dev_ctx.GetPlace().GetType(), + phi::AllocationType::GPU, + phi::errors::Unavailable("softmax_with_cross_entropy operator's " + "CUDA kernel only runs on GPU device.")); + const T* loss_grad_data = loss_grad.data(); + DenseTensor* logit_grad = logits_grad; + + T* logit_grad_data = nullptr; + bool copy_flag = (logit_grad != &softmax && (!use_softmax || soft_label)); + if (copy_flag) { + phi::Copy(dev_ctx, softmax, dev_ctx.GetPlace(), false, logit_grad); + logit_grad_data = logit_grad->data(); + } else { + logit_grad_data = dev_ctx.template Alloc(logit_grad); + } + + const int rank = logit_grad->dims().size(); + const int axis_v = phi::funcs::CanonicalAxis(axis, rank); + int axis_dim = logit_grad->dims()[axis_v]; + + const int64_t n = phi::funcs::SizeToAxis(axis_v, logit_grad->dims()); + const int64_t d = phi::funcs::SizeFromAxis(axis_v, logit_grad->dims()); + const int64_t remain = d / axis_dim; + +#ifdef __HIPCC__ + int block = 256; +#else + int block = 512; +#endif + auto stream = dev_ctx.stream(); + + // do not with softmax op, and input is softmax + if (!use_softmax) { + if (soft_label) { + int grid = (n * d + block - 1) / block; + const T* label_data = label.data(); + SoftLabelCrossEntropyGradientKernel<<>>( + logit_grad_data, loss_grad_data, label_data, n, d, remain); + } else { + DenseTensor logits_grad_2d(*logit_grad); + logits_grad_2d.Resize({n, d}); + int grid = (n * remain + block - 1) / block; + const auto* label_data = label.data(); + HardLabelCrossEntropyGradientKernel<<>>( + logit_grad_data, label_data, n, d, remain, ignore_index); + int num = n * d; + grid = (num + block - 1) / block; + ScaleCrossEntropyGradient<<>>( + logit_grad_data, + loss_grad_data, + num, + d, + remain, + label_data, + ignore_index); + } + + return; + } + + // with softmax, continue + + if (soft_label) { + int64_t grid = (n * d + block - 1) / block; + const T* label_data = label.data(); + SoftCrossEntropyGradientKernel<<>>( + logit_grad_data, loss_grad_data, label_data, n, d, remain); + } else { + const T* softmax_data = softmax.data(); + const auto* label_data = label.data(); + int grid = (n * d + block - 1) / block; + SoftmaxWithCrossEntropyGradHardLabel<<>>( + logit_grad_data, + loss_grad_data, + softmax_data, + label_data, + n, + d / remain, + remain, + ignore_index); + } +} + +template +void CrossEntropyWithSoftmaxGradKernel(const Context& dev_ctx, + const DenseTensor& label, + const DenseTensor& softmax, + const DenseTensor& loss_grad, + bool soft_label, 
+ bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* logits_grad) { + auto dtype = label.dtype(); + if (soft_label) { + PADDLE_ENFORCE_EQ( + dtype, + paddle::experimental::CppTypeToDataType::Type(), + phi::errors::InvalidArgument("The Input(Label) should be with the " + "same data type as kernel data type.")); + CrossEntropyWithSoftmaxGradGPUKernel(dev_ctx, + label, + softmax, + loss_grad, + soft_label, + use_softmax, + numeric_stable_mode, + ignore_index, + axis, + logits_grad); + } else { + PD_DISPATCH_INTEGRAL_TYPES( + dtype, "CrossEntropyWithSoftmaxGradGPUKernel", ([&] { + CrossEntropyWithSoftmaxGradGPUKernel(dev_ctx, + label, + softmax, + loss_grad, + soft_label, + use_softmax, + numeric_stable_mode, + ignore_index, + axis, + logits_grad); + })); + } +} + +} // namespace phi + +PD_REGISTER_KERNEL(cross_entropy_with_softmax_grad, + GPU, + ALL_LAYOUT, + phi::CrossEntropyWithSoftmaxGradKernel, + float, + double, + phi::dtype::float16) {} diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.cu b/paddle/phi/kernels/gpu/cross_entropy_kernel.cu similarity index 58% rename from paddle/fluid/operators/softmax_with_cross_entropy_op.cu rename to paddle/phi/kernels/gpu/cross_entropy_kernel.cu index 41545a1ca20b2..055706cffd41e 100644 --- a/paddle/fluid/operators/softmax_with_cross_entropy_op.cu +++ b/paddle/phi/kernels/gpu/cross_entropy_kernel.cu @@ -1,13 +1,19 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ + +#include "paddle/phi/kernels/cross_entropy_kernel.h" + #ifdef __NVCC__ #include "cub/cub.cuh" #endif @@ -15,39 +21,43 @@ limitations under the License. 
*/ #include namespace cub = hipcub; #endif -#include "paddle/fluid/operators/amp/fp16_type_traits.h" + +#include "paddle/phi/common/amp_type_traits.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/funcs/axis_utils.h" +#include "paddle/phi/kernels/funcs/for_range.h" +#include "paddle/phi/kernels/funcs/math_function.h" +#include "paddle/phi/kernels/gpudnn/softmax_gpudnn.h" + +// TODO(chenweihang): move dispatch.h into phi/core +#include "paddle/phi/api/ext/dispatch.h" + #include "paddle/fluid/operators/math/cross_entropy.h" -#include "paddle/fluid/operators/softmax_with_cross_entropy_op.h" +#include "paddle/fluid/operators/math/softmax.h" #include "paddle/fluid/platform/device/gpu/gpu_device_function.h" #include "paddle/fluid/platform/device/gpu/gpu_dnn.h" -#include "paddle/fluid/platform/for_range.h" -#include "paddle/phi/kernels/funcs/math_function.h" -#include "paddle/phi/kernels/gpudnn/softmax_gpudnn.h" -namespace paddle { -namespace operators { +namespace phi { #define ALIGN_BYTES 16 -using ScopedTensorDescriptor = platform::ScopedTensorDescriptor; -using DataLayout = platform::DataLayout; -using Tensor = framework::Tensor; -namespace kps = phi::kps; +enum class SoftmaxMode { kSoftmax, kLogSoftmax, kCrossEntropy }; // Wrapper of log function. Use log(float32) for float16 template static __device__ __forceinline__ T Log(T x) { - using AccT = typename details::MPTypeTrait::Type; + using AccT = typename dtype::MPTypeTrait::Type; AccT logx = std::log(static_cast(x)); - return math::TolerableValue()(static_cast(logx)); + return paddle::operators::math::TolerableValue()(static_cast(logx)); } // Wrapper of exp function. Use exp(float32) for float16 template static __device__ __forceinline__ T Exp(T x) { - using AccT = typename details::MPTypeTrait::Type; + using AccT = typename dtype::MPTypeTrait::Type; AccT expx = std::exp(static_cast(x)); - return math::TolerableValue()(static_cast(expx)); + return paddle::operators::math::TolerableValue()(static_cast(expx)); } template @@ -62,22 +72,114 @@ struct ExpAddFunctor { Tx max; }; -// log2(value) -static inline int Log2Ceil(int value) { - int log2_value = 0; - while ((1 << log2_value) < value) ++log2_value; - return log2_value; -} +/* + Cross entropy soft label with dynamic size on axis (log2_elements is + varibale). + - if the input is softmax,compute loss with softmax + - if the input is log_softmax, compute loss with log_softmax and update + softmax +*/ +template +__global__ void CrossEntropySoftLabel(T* loss, + T* softmaxwrt, + const T* softmax, + const T* labels, + const int n, + const int dim, + const int d, + int log2_elements) { + const int kDimCeil = 1 << log2_elements; + const int kVSize = sizeof(VecT) / sizeof(T); -enum class SoftmaxMode { kSoftmax, kLogSoftmax, kCrossEntropy }; +#ifdef __HIPCC__ + const int kThreadPerBlock = 256; +#else + const int kThreadPerBlock = 512; +#endif + const int kBatchPerBlock = 1; + const int kWarpSize = 32; // (dim < 32) ? dim : 32; + const int kBatchSize = 1; + const int kThreadPerBatch = kThreadPerBlock / kBatchPerBlock; + const int kWarpPerBatch = kThreadPerBatch / kWarpSize; + + const int kIterations = (dim + kThreadPerBatch - 1) / kThreadPerBatch; + const int kIterationsV = (kIterations >= kVSize) ? 
(kIterations / kVSize) : 1; + + const int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * kBatchSize; + + T sum[kBatchSize]{static_cast(0.0)}; +#pragma unroll + for (int i = 0; i < kBatchSize; ++i) { + int ids = first_batch + i; + if (ids >= n * d) break; + int idx_n = ids / d; + int idx_d = ids % d; +#pragma unroll + for (int it = 0; it < kIterations; ++it) { + int idx_dim = it * kThreadPerBatch + threadIdx.x; + int idx = idx_n * dim * d + idx_dim * d + idx_d; + + if (idx_n < n && idx_dim < dim) { + VecT softmaxdata; + if (InLogMode) { + softmaxdata = reinterpret_cast(&softmaxwrt[idx])[0]; + } else { + softmaxdata = reinterpret_cast(&softmax[idx])[0]; + } + VecT labelsdata = reinterpret_cast(&labels[idx])[0]; + T* softmaxptr = reinterpret_cast(&softmaxdata); + T* labelsptr = reinterpret_cast(&labelsdata); +#pragma unroll + for (int s = 0; s < kVSize; s++) { + if (InLogMode) { + sum[i] -= softmaxptr[s] * labelsptr[s]; + softmaxptr[s] = Exp(softmaxptr[s]); + } else { + sum[i] -= Log(softmaxptr[s]) * labelsptr[s]; + } + } + if (InLogMode) { + reinterpret_cast(&softmaxwrt[idx])[0] = softmaxdata; + } + } + } + } + phi::WarpReduceSum(sum); + __syncthreads(); + + __shared__ T sumshare[kWarpPerBatch][kBatchPerBlock][kBatchSize]; + if (threadIdx.x % kWarpSize == 0) { +#pragma unroll + for (int i = 0; i < kBatchSize; i++) { + sumshare[threadIdx.x / kWarpSize][threadIdx.y][i] = sum[i]; + } + } + __syncthreads(); + + // write + if (threadIdx.x == 0) { + for (int i = 0; i < kBatchSize; i++) { + int ids = first_batch + i; + if (ids < n * d) { + loss[ids] = sumshare[0][threadIdx.y][i]; + for (int s = 1; s < kWarpPerBatch; s++) { + loss[ids] += sumshare[s][threadIdx.y][i]; + } + } + } + } +} /* Hard label cross entropy. */ template -__global__ void CrossEntropyHardLabel(T* loss, const T* softmax, - const LabelT* labels, const int n, - const int dim, const int d, +__global__ void CrossEntropyHardLabel(T* loss, + const T* softmax, + const LabelT* labels, + const int n, + const int dim, + const int d, const int ignore_idx) { int64_t ids = blockIdx.x * blockDim.x + threadIdx.x; int64_t idx_n = ids / d; @@ -111,9 +213,12 @@ __global__ void CrossEntropyHardLabel(T* loss, const T* softmax, Output: loss and exp(input) */ template -__global__ void CrossEntropyExpHardLabel(T* loss, T* softmax, - const LabelT* labels, const int n, - const int dim, const int d, +__global__ void CrossEntropyExpHardLabel(T* loss, + T* softmax, + const LabelT* labels, + const int n, + const int dim, + const int d, const int ignore_idx) { int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; int64_t idx_n = idx / (d * dim); @@ -146,308 +251,64 @@ __global__ void CrossEntropyExpHardLabel(T* loss, T* softmax, } } -/* - Core function of softmax with cross entropy forward - - softmax, SoftmaxMode=kSoftmax - - log softmax, SoftmaxMode=kLogSoftmax - - softmax with cross entropy hard label, SoftmaxMode=kCrossEntropy - The computation includes - - Compute max value: maxvalue_{i} = max_j src_{i,j} - - Compute sum of exp: s_{i} = sum_{j}{e^{src_{i,j} - maxvalue_{i}}} - - Compute: softmax_{i,j} = e^{src_{i,j} - maxvalue_{i}} / s_{i} - - Compute: logsoftmax_{i,j} = src_{i,j} - maxvalue_{i} - log(s_{i}) - - Compute: loss_{i} = -logsoftmax[i,label[i]] (Hard label) - This computation results from following formula: - softmax_{i,j} = e^{src_{i,j}} / sum_{j}{e^{src_{i,j}}} - = e^{src_{i,j} - maxvalue_{i}} - / sum_{j}{e^{src_{i,j} - maxvalue_{i}}} - = e^{src_{i,j} - maxvalue_{i}} / s_{i} - logsoftmax_{i,j} = log(softmax_{i,j}) - = src_{i,j} - 
maxvalue_{i} - log(s_{i}) - One warp (32 threads) is used to compute 1 or 2 batch (kBatchSize). - For reduction max (sum), firstly compute max (sum) to one warp, then use - shuffle api to compute max (sum) in one warp. -*/ -template -__global__ void WarpSoftmaxForward(T* loss, T* softmax, const T* src, - const LabelT* label, const int batch_size, - const int stride, const int element_count, - const int ignore_index) { - constexpr int kDimCeil = 1 << Log2Elements; - constexpr int kWarpSize = (kDimCeil < 32) ? kDimCeil : 32; - constexpr int kVSize = sizeof(VecT) / sizeof(T); - constexpr int kIterations = kDimCeil / kWarpSize; - constexpr int kIterationsV = - (kIterations >= kVSize) ? (kIterations / kVSize) : 1; - constexpr int kBatchSize = (kDimCeil <= 128) ? 2 : 1; - - int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * kBatchSize; +template +__device__ __forceinline__ AccT ThreadReduce(const T* input, + int size, + const int offset, + AccT init, + ReduceFunctor reducer) { + using VecT = kps::details::VectorType; + int tid = threadIdx.x; + AccT val = init; - // max index to read - int idx_max_v[kBatchSize]; -#pragma unroll - for (int i = 0; i < kBatchSize; i++) { - int idx_max = ((i + first_batch) < batch_size) ? element_count : 0; - idx_max_v[i] = idx_max / kVSize; + if (offset > 0) { + input -= offset; + size += offset; + if (tid >= offset) { + val = reducer(val, input[tid]); + } + size -= blockDim.x; + input += blockDim.x; } + int remain = size % (VecSize * blockDim.x); - // read data from global memory - AccT srcdata[kBatchSize][kIterationsV][kVSize]; + T ins[VecSize]; + VecT* ins_vec = reinterpret_cast(&ins); -#pragma unroll - for (int i = 0; i < kBatchSize; ++i) { -// read data to srcdata: - KVSize==1, - KVSize>1 -#pragma unroll - for (int it = 0; it < kIterationsV; ++it) { - int src_idx = threadIdx.x + it * kWarpSize; - if (kVSize == 1) { - if (src_idx < idx_max_v[i]) { - srcdata[i][it][0] = - static_cast(src[(first_batch + i) * stride + src_idx]); - } else { - srcdata[i][it][0] = -std::numeric_limits::infinity(); - } - } else { - const VecT* src_v = - reinterpret_cast(&src[(first_batch + i) * stride]); - if (src_idx < idx_max_v[i]) { - VecT srctmp = src_v[src_idx]; - const T* srcinptr = reinterpret_cast(&srctmp); -#pragma unroll - for (int s = 0; s < kVSize; s++) { - srcdata[i][it][s] = static_cast(srcinptr[s]); - } - } else { -#pragma unroll - for (int s = 0; s < kVSize; s++) { - srcdata[i][it][s] = -std::numeric_limits::infinity(); - } - } - } - } - } + // vector part + for (; VecSize * tid < (size - remain); tid += blockDim.x) { + *ins_vec = reinterpret_cast(input)[tid]; - // compute max value: maxvalue_{i} = max_j src_{i,j} - AccT max_value[kBatchSize]; -#pragma unroll - for (int i = 0; i < kBatchSize; ++i) { - // it = 0 - AccT valmax = srcdata[i][0][0]; #pragma unroll - for (int s = 1; s < kVSize; ++s) { - valmax = (valmax > srcdata[i][0][s]) ? valmax : srcdata[i][0][s]; + for (int i = 0; i < VecSize; ++i) { + val = reducer(val, ins[i]); } - max_value[i] = valmax; + } -// it = 1, 2, ... -#pragma unroll - for (int it = 1; it < kIterationsV; ++it) { - AccT valmax = srcdata[i][it][0]; -#pragma unroll - for (int s = 1; s < kVSize; ++s) { - valmax = (valmax > srcdata[i][it][s]) ? valmax : srcdata[i][it][s]; - } - max_value[i] = (max_value[i] > valmax) ? 
max_value[i] : valmax; - } + // scalar part + tid = size - remain + threadIdx.x; + for (; tid < size; tid += blockDim.x) { + val = reducer(val, input[tid]); } - phi::WarpReduceMax(max_value); + return val; +} - // compute sum: s_{i} = sum_{j}{ exp(src_{i,j} - maxvalue_{i} } - AccT sum[kBatchSize]; -#pragma unroll - for (int i = 0; i < kBatchSize; ++i) { - // it = 0 - if (mode == SoftmaxMode::kLogSoftmax || - mode == SoftmaxMode::kCrossEntropy) { - sum[i] = std::exp(srcdata[i][0][0] - max_value[i]); - } else { - srcdata[i][0][0] = std::exp(srcdata[i][0][0] - max_value[i]); - sum[i] = srcdata[i][0][0]; - } -#pragma unroll - for (int s = 1; s < kVSize; ++s) { - if (mode == SoftmaxMode::kLogSoftmax || - mode == SoftmaxMode::kCrossEntropy) { - sum[i] += std::exp(srcdata[i][0][s] - max_value[i]); +template +__device__ __forceinline__ void ComputeLoss(T* loss, + const T loss_value, + const int label_id, + const int64_t label_value, + const int tid, + const int vec_size, + const int offset, + const int ignore_index) { + int loss_id = vec_size * tid + offset; + if (IgnoreIndex) { + if (label_value == loss_id) { + if (label_value == ignore_index) { + loss[label_id] = static_cast(0.0f); } else { - srcdata[i][0][s] = std::exp(srcdata[i][0][s] - max_value[i]); - sum[i] += srcdata[i][0][s]; - } - } - -// it = 1, 2, ... -#pragma unroll - for (int it = 1; it < kIterationsV; ++it) { -#pragma unroll - for (int s = 0; s < kVSize; ++s) { - if (mode == SoftmaxMode::kLogSoftmax || - mode == SoftmaxMode::kCrossEntropy) { - sum[i] += std::exp(srcdata[i][it][s] - max_value[i]); - } else { - srcdata[i][it][s] = std::exp(srcdata[i][it][s] - max_value[i]); - sum[i] += srcdata[i][it][s]; - } - } - } - } - phi::WarpReduceSum(sum); - -// write data -#pragma unroll - for (int i = 0; i < kBatchSize; ++i) { - if (mode == SoftmaxMode::kLogSoftmax || - mode == SoftmaxMode::kCrossEntropy) { - sum[i] = std::log(sum[i]); - } - -#pragma unroll - for (int it = 0; it < kIterationsV; ++it) { - int idx = threadIdx.x + it * kWarpSize; - if (kVSize == 1) { // kVSize==1 - if (idx < idx_max_v[i]) { - if (mode == SoftmaxMode::kLogSoftmax) { // log softmax - softmax[(first_batch + i) * stride + idx] = - srcdata[i][it][0] - max_value[i] - sum[i]; - // softmax with cross entropy hard label - } else if (mode == SoftmaxMode::kCrossEntropy) { - AccT logsoftmax = srcdata[i][it][0] - max_value[i] - sum[i]; - // softmax - softmax[(first_batch + i) * stride + idx] = std::exp(logsoftmax); - // label - int loss_idx = (threadIdx.x + it * kWarpSize) * kVSize; - auto lbl = static_cast(label[first_batch + i]); - if (IgnoreIndex == true) { - // IgnoreIndex is true - if (lbl == loss_idx) { - if (lbl != ignore_index) { - loss[first_batch + i] = -logsoftmax; - } else { - loss[first_batch + i] = static_cast(0.0); - } - } - } else { - // IgnoreIndex is false - if (lbl >= 0 && lbl < element_count) { - if (lbl == loss_idx) { - loss[first_batch + i] = -logsoftmax; - } - } else { - loss[first_batch + i] = static_cast(0.0); - } - } - } else { // softmax - softmax[(first_batch + i) * stride + idx] = - srcdata[i][it][0] / sum[i]; - } - } else { - break; - } - } else { // KVSize>1 - VecT* softmax_v = - reinterpret_cast(&softmax[(first_batch + i) * stride]); - VecT tmpdata; - T* tmpptr = reinterpret_cast(&tmpdata); -#pragma unroll - for (int s = 0; s < kVSize; ++s) { - if (mode == SoftmaxMode::kLogSoftmax) { // log softmax - tmpptr[s] = srcdata[i][it][s] - max_value[i] - sum[i]; - // softmax with cross entropy hard label - } else if (mode == 
SoftmaxMode::kCrossEntropy) { - AccT logsoftmax = srcdata[i][it][s] - max_value[i] - sum[i]; - // softmax - tmpptr[s] = std::exp(logsoftmax); - // label - int loss_idx = (threadIdx.x + it * kWarpSize) * kVSize + s; - auto lbl = static_cast(label[first_batch + i]); - if (IgnoreIndex == true) { - // IgnoreIndex is true - if (lbl == loss_idx && lbl != ignore_index) { - loss[first_batch + i] = -logsoftmax; - } - } else { - // IgnoreIndex is false - if (lbl >= 0 && lbl < element_count) { - if (lbl == loss_idx) { - loss[first_batch + i] = -logsoftmax; - } - } else { - loss[first_batch + i] = static_cast(0.0); - } - } - } else { // softmax - tmpptr[s] = srcdata[i][it][s] / sum[i]; - } - } - if (idx < idx_max_v[i]) { - softmax_v[idx] = tmpdata; - } else { - break; - } - } - } - } -} - -#define SOFTMAX_WARP_FORWARD_CASE(Log2Elements, LabelT, VecT, AccT) \ - case Log2Elements: \ - WarpSoftmaxForward<<>>( \ - loss, softmax, src, label, batch_size, stride, element_count, \ - ignore_index); \ - break; - -/* - Wrapper of softmax with cross entropy forward hard label. -*/ -template -void SwitchWarpSoftmaxForward(T* loss, T* softmax, const T* src, - const LabelT* label, const int batch_size, - const int stride, const int element_count, - const int ignore_index, gpuStream_t stream) { - using AccT = typename details::MPTypeTrait::Type; - - // use 128 threads per block to maximimize gpu utilization - const int log2_elements = static_cast(Log2Ceil(element_count)); - const int kDimCeil = 1 << log2_elements; - int kWarpSize = (kDimCeil < 32) ? kDimCeil : 32; - int batches_per_warp = (kDimCeil <= 128) ? 2 : 1; - constexpr int threads_per_block = 128; - int warps_per_block = (threads_per_block / kWarpSize); - int batches_per_block = warps_per_block * batches_per_warp; - int blocks = (batch_size + batches_per_block - 1) / batches_per_block; - dim3 threads(kWarpSize, warps_per_block, 1); - - switch (log2_elements) { - SOFTMAX_WARP_FORWARD_CASE(0, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(1, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(2, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(3, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(4, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(5, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(6, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(7, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(8, LabelT, T, AccT); - SOFTMAX_WARP_FORWARD_CASE(9, LabelT, T, AccT); - default: - break; - } -} - -template -__device__ __forceinline__ void ComputeLoss(T* loss, const T loss_value, - const int label_id, - const int64_t label_value, - const int tid, const int vec_size, - const int offset, - const int ignore_index) { - int loss_id = vec_size * tid + offset; - if (IgnoreIndex) { - if (label_value == loss_id) { - if (label_value == ignore_index) { - loss[label_id] = static_cast(0.0f); - } else { - loss[label_id] = loss_value; + loss[label_id] = loss_value; } } } else { @@ -457,51 +318,19 @@ __device__ __forceinline__ void ComputeLoss(T* loss, const T loss_value, } } -template -__device__ __forceinline__ AccT ThreadReduce(const T* input, int size, - const int offset, AccT init, - ReduceFunctor reducer) { - using VecT = kps::details::VectorType; - int tid = threadIdx.x; - AccT val = init; - - if (offset > 0) { - input -= offset; - size += offset; - if (tid >= offset) { - val = reducer(val, input[tid]); - } - size -= blockDim.x; - input += blockDim.x; - } - int remain = size % (VecSize * blockDim.x); - - T ins[VecSize]; - VecT* ins_vec = reinterpret_cast(&ins); - - // vector part - for (; VecSize 
* tid < (size - remain); tid += blockDim.x) { - *ins_vec = reinterpret_cast(input)[tid]; - -#pragma unroll - for (int i = 0; i < VecSize; ++i) { - val = reducer(val, ins[i]); - } - } - - // scalar part - tid = size - remain + threadIdx.x; - for (; tid < size; tid += blockDim.x) { - val = reducer(val, input[tid]); - } - return val; -} - -template __device__ __forceinline__ void VectorizedSoftmaxForwardImpl( - T* loss, T* softmax, const T* logits, const LabelT* label, int size, - const int offset, const phi::LogSoftmaxForwardFunctor& func, + T* loss, + T* softmax, + const T* logits, + const LabelT* label, + int size, + const int offset, + const phi::LogSoftmaxForwardFunctor& func, const int ignore_index) { using VecT = kps::details::VectorType; int tid = threadIdx.x; @@ -520,9 +349,14 @@ __device__ __forceinline__ void VectorizedSoftmaxForwardImpl( softmax[tid] = static_cast(std::exp(log_softmax)); // loss if (label_valid) { - ComputeLoss(loss, static_cast(-log_softmax), - label_id, label_value, tid, 1, - loss_id_offset, ignore_index); + ComputeLoss(loss, + static_cast(-log_softmax), + label_id, + label_value, + tid, + 1, + loss_id_offset, + ignore_index); } } size -= blockDim.x; @@ -550,9 +384,14 @@ __device__ __forceinline__ void VectorizedSoftmaxForwardImpl( // loss if (label_valid) { - ComputeLoss(loss, static_cast(-log_softmax), - label_id, label_value, tid, VecSize, - loss_id_offset + i, ignore_index); + ComputeLoss(loss, + static_cast(-log_softmax), + label_id, + label_value, + tid, + VecSize, + loss_id_offset + i, + ignore_index); } } @@ -568,8 +407,13 @@ __device__ __forceinline__ void VectorizedSoftmaxForwardImpl( // loss if (label_valid) { - ComputeLoss(loss, static_cast(-log_softmax), label_id, - label_value, tid, 1, loss_id_offset, + ComputeLoss(loss, + static_cast(-log_softmax), + label_id, + label_value, + tid, + 1, + loss_id_offset, ignore_index); } } @@ -580,11 +424,19 @@ __device__ __forceinline__ void VectorizedSoftmaxForwardImpl( } } -template __device__ __forceinline__ void ScalarSoftmaxForwardImpl( - T* loss, T* softmax, const T* logits, const LabelT* label, const int size, - const phi::LogSoftmaxForwardFunctor& func, const int ignore_index) { + T* loss, + T* softmax, + const T* logits, + const LabelT* label, + const int size, + const phi::LogSoftmaxForwardFunctor& func, + const int ignore_index) { int tid = threadIdx.x; int remain = size % (VecSize * blockDim.x); int label_id = blockIdx.x; @@ -605,8 +457,13 @@ __device__ __forceinline__ void ScalarSoftmaxForwardImpl( softmax[tid + i * blockDim.x] = static_cast(std::exp(log_softmax)); // loss if (label_valid) { - ComputeLoss(loss, static_cast(-log_softmax), - label_id, label_value, tid, VecSize, i, + ComputeLoss(loss, + static_cast(-log_softmax), + label_id, + label_value, + tid, + VecSize, + i, ignore_index); } } @@ -618,8 +475,14 @@ __device__ __forceinline__ void ScalarSoftmaxForwardImpl( softmax[tid] = static_cast(std::exp(log_softmax)); // loss if (label_valid) { - ComputeLoss(loss, static_cast(-log_softmax), label_id, - label_value, tid, 1, 0, ignore_index); + ComputeLoss(loss, + static_cast(-log_softmax), + label_id, + label_value, + tid, + 1, + 0, + ignore_index); } } @@ -629,11 +492,17 @@ __device__ __forceinline__ void ScalarSoftmaxForwardImpl( } } -template -__global__ void VectorizedSoftmaxForward(T* loss, T* softmax, const T* logits, +__global__ void VectorizedSoftmaxForward(T* loss, + T* softmax, + const T* logits, const LabelT* label, - const int high_dim, const int mid_dim, + const int high_dim, + const 
int mid_dim, const int ignore_index) { using VecT = kps::details::VectorType; @@ -646,14 +515,20 @@ __global__ void VectorizedSoftmaxForward(T* loss, T* softmax, const T* logits, // 1. reduce max AccT max = ThreadReduce>( - logits, mid_dim, input_offset, -std::numeric_limits::infinity(), + logits, + mid_dim, + input_offset, + -std::numeric_limits::infinity(), kps::MaxFunctor()); max = kps::details::BlockXReduce>( max, kps::MaxFunctor()); // 2. reduce sum AccT sum = ThreadReduce>( - logits, mid_dim, input_offset, static_cast(0), + logits, + mid_dim, + input_offset, + static_cast(0), ExpAddFunctor(max)); sum = kps::details::BlockXReduce>( sum, kps::AddFunctor()); @@ -662,7 +537,13 @@ __global__ void VectorizedSoftmaxForward(T* loss, T* softmax, const T* logits, phi::LogSoftmaxForwardFunctor func(max, sum); if (input_offset == output_offset) { VectorizedSoftmaxForwardImpl( - loss, softmax, logits, label, mid_dim, input_offset, func, + loss, + softmax, + logits, + label, + mid_dim, + input_offset, + func, ignore_index); } else { ScalarSoftmaxForwardImpl( @@ -670,229 +551,26 @@ __global__ void VectorizedSoftmaxForward(T* loss, T* softmax, const T* logits, } } -template -void LaunchVectorizedSoftmaxForward(T* loss, T* softmax, const T* logits, - const LabelT* label, const int high_dim, - const int mid_dim, const int ignore_index, - gpuStream_t stream) { - using AccT = typename details::MPTypeTrait::Type; - constexpr int vec_size = sizeof(float4) / sizeof(T); - const int max_num_threads = 1024; - int max_block_size = std::min(mid_dim / vec_size, max_num_threads); - if (vec_size > 1) { - max_block_size /= 2; - } - - int block_size = 1; - while (block_size < max_block_size) { - block_size *= 2; - } - block_size = std::max(block_size, kps::details::kWarpSize); - dim3 grids(high_dim); - dim3 blocks(block_size); - VectorizedSoftmaxForward<<>>( - loss, softmax, logits, label, high_dim, mid_dim, ignore_index); -} - /* - Wrapper of softmax with cross entropy hard label. - - SwitchWarpSoftmaxForward for small size when axis == -1 - - LaunchVectorizedSoftmaxForward for large size when axis == -1 - - cudnn function for axis != -1 +Core function of softmax with cross entropy forward soft label. +The computation includes + - Compute maximum of batch: maxvalue_{i} = max_j src_{i,j} + - Compute sum of exp batch: s_{i} = sum_{j}{ exp(src_{i,j} - maxvalue_{i} } + - Compute: sum of - sum_{j}{ label_{i,j} * (src_{i,j} - maxvalue_{i} - +log(sum[i]))} +One warp (32 threads) is used to compute 1 or 2 batch (kBatchSize). +For reduction max (sum), firstly compute max (sum) to one warp, then use shuffle +api to compute max (sum) in one warp. 
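+For example, with logits src = (1, 2, 3) and soft label (0, 0, 1) in one row:
+maxvalue = 3, s = e^{-2} + e^{-1} + e^{0} ≈ 1.503, and the loss is
+-(1 * (3 - 3 - log(1.503))) = log(1.503) ≈ 0.408, i.e. the ordinary cross
+entropy of that row's softmax against the soft label.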
*/ -template -static void SoftmaxWithCrossEntropyHardLabel( - const platform::CUDADeviceContext& ctx, int rank, int axis, - const T* logits_data, const LabelT* labels_data, T* loss_data, - T* softmax_data, int N, int dim, int D, const int ignore_index) { - auto stream = ctx.stream(); - constexpr int max_dim = 320; - if (D == 1) { - if (dim <= max_dim) { // small size - const SoftmaxMode mode = SoftmaxMode::kCrossEntropy; - SwitchWarpSoftmaxForward( - loss_data, softmax_data, logits_data, labels_data, N, dim, dim, - ignore_index, stream); - } else { // large size - LaunchVectorizedSoftmaxForward( - loss_data, softmax_data, logits_data, labels_data, N, dim, - ignore_index, stream); - } - } else { - ScopedTensorDescriptor desc; - std::vector tensor_dims = {N, dim, D, 1}; - DataLayout layout = DataLayout::kNCHW; -#ifdef PADDLE_WITH_HIP - miopenTensorDescriptor_t descp = desc.descriptor(layout, tensor_dims); -#else - cudnnTensorDescriptor_t descp = desc.descriptor(layout, tensor_dims); -#endif - - auto handle = ctx.cudnn_handle(); - -#ifdef PADDLE_WITH_HIP - auto mode = axis == rank - 1 ? MIOPEN_SOFTMAX_MODE_INSTANCE - : MIOPEN_SOFTMAX_MODE_CHANNEL; - PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::miopenSoftmaxForward_V2( - handle, platform::CudnnDataType::kOne(), descp, logits_data, - platform::CudnnDataType::kZero(), descp, softmax_data, - MIOPEN_SOFTMAX_LOG, mode)); -#else - auto mode = axis == rank - 1 ? CUDNN_SOFTMAX_MODE_INSTANCE - : CUDNN_SOFTMAX_MODE_CHANNEL; - PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::cudnnSoftmaxForward( - handle, CUDNN_SOFTMAX_LOG, mode, platform::CudnnDataType::kOne(), - descp, logits_data, platform::CudnnDataType::kZero(), descp, - softmax_data)); -#endif - int threads = 128; - int blocks = (N * dim * D + threads - 1) / threads; - // compute cross entropy, input is log softmax - CrossEntropyExpHardLabel<<>>( - loss_data, softmax_data, labels_data, N, dim, D, ignore_index); - } -} - -/* - Wrapper of softmax with cross entropy grad hard label. -*/ -template -__global__ void SoftmaxWithCrossEntropyGradHardLabel( - T* logits_grad, const T* loss_grad, const T* softmax, const LabelT* labels, - const int64_t n, const int64_t dim, const int64_t d, - const int ignore_index) { - int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; - int64_t idx_n = idx / (d * dim); - int64_t idx_dim = (idx / d) % dim; - int64_t idx_d = idx % d; - int64_t ids = idx_n * d + idx_d; - - if (idx < n * dim * d) { - auto lbl = static_cast(labels[ids]); - if (lbl == ignore_index) { - logits_grad[idx] = static_cast(0.0); - } else if (lbl == idx_dim) { - logits_grad[idx] = (softmax[idx] - static_cast(1.0)) * loss_grad[ids]; - } else { - logits_grad[idx] = softmax[idx] * loss_grad[ids]; - } - } -} - -/* - Cross entropy soft label with dynamic size on axis (log2_elements is - varibale). - - if the input is softmax,compute loss with softmax - - if the input is log_softmax, compute loss with log_softmax and update - softmax -*/ -template -__global__ void CrossEntropySoftLabel(T* loss, T* softmaxwrt, const T* softmax, - const T* labels, const int n, - const int dim, const int d, - int log2_elements) { - const int kDimCeil = 1 << log2_elements; - const int kVSize = sizeof(VecT) / sizeof(T); - -#ifdef __HIPCC__ - const int kThreadPerBlock = 256; -#else - const int kThreadPerBlock = 512; -#endif - const int kBatchPerBlock = 1; - const int kWarpSize = 32; // (dim < 32) ? 
dim : 32; - const int kBatchSize = 1; - const int kThreadPerBatch = kThreadPerBlock / kBatchPerBlock; - const int kWarpPerBatch = kThreadPerBatch / kWarpSize; - - const int kIterations = (dim + kThreadPerBatch - 1) / kThreadPerBatch; - const int kIterationsV = (kIterations >= kVSize) ? (kIterations / kVSize) : 1; - - const int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * kBatchSize; - - T sum[kBatchSize]{static_cast(0.0)}; -#pragma unroll - for (int i = 0; i < kBatchSize; ++i) { - int ids = first_batch + i; - if (ids >= n * d) break; - int idx_n = ids / d; - int idx_d = ids % d; -#pragma unroll - for (int it = 0; it < kIterations; ++it) { - int idx_dim = it * kThreadPerBatch + threadIdx.x; - int idx = idx_n * dim * d + idx_dim * d + idx_d; - - if (idx_n < n && idx_dim < dim) { - VecT softmaxdata; - if (InLogMode) { - softmaxdata = reinterpret_cast(&softmaxwrt[idx])[0]; - } else { - softmaxdata = reinterpret_cast(&softmax[idx])[0]; - } - VecT labelsdata = reinterpret_cast(&labels[idx])[0]; - T* softmaxptr = reinterpret_cast(&softmaxdata); - T* labelsptr = reinterpret_cast(&labelsdata); -#pragma unroll - for (int s = 0; s < kVSize; s++) { - if (InLogMode) { - sum[i] -= softmaxptr[s] * labelsptr[s]; - softmaxptr[s] = Exp(softmaxptr[s]); - } else { - sum[i] -= Log(softmaxptr[s]) * labelsptr[s]; - } - } - if (InLogMode) { - reinterpret_cast(&softmaxwrt[idx])[0] = softmaxdata; - } - } - } - } - phi::WarpReduceSum(sum); - __syncthreads(); - - __shared__ T sumshare[kWarpPerBatch][kBatchPerBlock][kBatchSize]; - if (threadIdx.x % kWarpSize == 0) { -#pragma unroll - for (int i = 0; i < kBatchSize; i++) { - sumshare[threadIdx.x / kWarpSize][threadIdx.y][i] = sum[i]; - } - } - __syncthreads(); - - // write - if (threadIdx.x == 0) { - for (int i = 0; i < kBatchSize; i++) { - int ids = first_batch + i; - if (ids < n * d) { - loss[ids] = sumshare[0][threadIdx.y][i]; - for (int s = 1; s < kWarpPerBatch; s++) { - loss[ids] += sumshare[s][threadIdx.y][i]; - } - } - } - } -} - -/* -Core function of softmax with cross entropy forward soft label. -The computation includes - - Compute maximum of batch: maxvalue_{i} = max_j src_{i,j} - - Compute sum of exp batch: s_{i} = sum_{j}{ exp(src_{i,j} - maxvalue_{i} } - - Compute: sum of - sum_{j}{ label_{i,j} * (src_{i,j} - maxvalue_{i} - -log(sum[i]))} -One warp (32 threads) is used to compute 1 or 2 batch (kBatchSize). -For reduction max (sum), firstly compute max (sum) to one warp, then use shuffle -api to compute max (sum) in one warp. -*/ -template -__global__ void WarpSoftmaxForwardSoftLabel(T* loss, T* softmax, const T* src, - const T* label, - const int batch_size, - const int stride, - const int element_count) { - const bool LogMode = true; +template +__global__ void WarpSoftmaxForwardSoftLabel(T* loss, + T* softmax, + const T* src, + const T* label, + const int batch_size, + const int stride, + const int element_count) { + const bool LogMode = true; constexpr int kDimCeil = 1 << Log2Elements; constexpr int kWarpSize = (kDimCeil < 32) ? kDimCeil : 32; @@ -1030,7 +708,9 @@ __global__ void WarpSoftmaxForwardSoftLabel(T* loss, T* softmax, const T* src, #define SOFTMAX_WARP_FORWARD_SOFT_CASE(Log2Elements, VecT, AccT) \ case Log2Elements: \ - WarpSoftmaxForwardSoftLabel<<>>( \ loss, softmax, src, label, batch_size, stride, element_count); \ break; @@ -1039,13 +719,18 @@ __global__ void WarpSoftmaxForwardSoftLabel(T* loss, T* softmax, const T* src, Wrapper of softmax with cross entropy forward soft label. 
*/ template -void SwitchWarpSoftmaxForwardSoftLabel(const int blocks, const dim3 threads, - gpuStream_t stream, T* loss, T* softmax, - const T* src, const T* label, - const int batch_size, const int stride, +void SwitchWarpSoftmaxForwardSoftLabel(const int blocks, + const dim3 threads, + gpuStream_t stream, + T* loss, + T* softmax, + const T* src, + const T* label, + const int batch_size, + const int stride, const int element_count, const int log2_elements) { - using AccT = typename details::MPTypeTrait::Type; + using AccT = typename dtype::MPTypeTrait::Type; switch (log2_elements) { SOFTMAX_WARP_FORWARD_SOFT_CASE(0, T, AccT); SOFTMAX_WARP_FORWARD_SOFT_CASE(1, T, AccT); @@ -1063,10 +748,16 @@ void SwitchWarpSoftmaxForwardSoftLabel(const int blocks, const dim3 threads, } template -static void SoftmaxWithCrossEntropySoftLabel( - const platform::CUDADeviceContext& ctx, const int rank, const int axis, - const T* logits_data, const T* labels_data, T* softmax_data, T* loss_data, - int N, int dim, int D) { +static void SoftmaxWithCrossEntropySoftLabel(const GPUContext& dev_ctx, + const int rank, + const int axis, + const T* logits_data, + const T* labels_data, + T* softmax_data, + T* loss_data, + int N, + int dim, + int D) { #ifdef __HIPCC__ constexpr int kMaxBlockDim = 256; #else @@ -1081,7 +772,7 @@ static void SoftmaxWithCrossEntropySoftLabel( const int kDimLog2 = static_cast(Log2Ceil(dim)); const int kDimCeil = 1 << kDimLog2; - auto stream = ctx.stream(); + auto stream = dev_ctx.stream(); if (D == 1 && dim <= max_dim) { int kWarpSize = (kDimCeil < 32) ? kDimCeil : 32; @@ -1094,35 +785,55 @@ static void SoftmaxWithCrossEntropySoftLabel( int blocks = (N + batches_per_block - 1) / batches_per_block; dim3 threads(kWarpSize, warps_per_block, 1); - SwitchWarpSoftmaxForwardSoftLabel(blocks, threads, stream, loss_data, - softmax_data, logits_data, labels_data, - N, dim, dim, kDimLog2); + SwitchWarpSoftmaxForwardSoftLabel(blocks, + threads, + stream, + loss_data, + softmax_data, + logits_data, + labels_data, + N, + dim, + dim, + kDimLog2); } else { ScopedTensorDescriptor desc; std::vector tensor_dims = {N, dim, D, 1}; - DataLayout layout = DataLayout::kNCHW; + GPUDNNDataLayout layout = GPUDNNDataLayout::kNCHW; #ifdef PADDLE_WITH_HIP miopenTensorDescriptor_t descp = desc.descriptor(layout, tensor_dims); #else cudnnTensorDescriptor_t descp = desc.descriptor(layout, tensor_dims); #endif - auto handle = ctx.cudnn_handle(); + auto handle = dev_ctx.cudnn_handle(); #ifdef PADDLE_WITH_HIP auto mode = axis == rank - 1 ? MIOPEN_SOFTMAX_MODE_INSTANCE : MIOPEN_SOFTMAX_MODE_CHANNEL; - PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::miopenSoftmaxForward_V2( - handle, platform::CudnnDataType::kOne(), descp, logits_data, - platform::CudnnDataType::kZero(), descp, softmax_data, - MIOPEN_SOFTMAX_LOG, mode)); + PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::miopenSoftmaxForward_V2( + handle, + paddle::platform::CudnnDataType::kOne(), + descp, + logits_data, + paddle::platform::CudnnDataType::kZero(), + descp, + softmax_data, + MIOPEN_SOFTMAX_LOG, + mode)); #else auto mode = axis == rank - 1 ? 
CUDNN_SOFTMAX_MODE_INSTANCE : CUDNN_SOFTMAX_MODE_CHANNEL; - PADDLE_ENFORCE_GPU_SUCCESS(platform::dynload::cudnnSoftmaxForward( - handle, CUDNN_SOFTMAX_LOG, mode, platform::CudnnDataType::kOne(), - descp, logits_data, platform::CudnnDataType::kZero(), descp, + PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::cudnnSoftmaxForward( + handle, + CUDNN_SOFTMAX_LOG, + mode, + paddle::platform::CudnnDataType::kOne(), + descp, + logits_data, + paddle::platform::CudnnDataType::kZero(), + descp, softmax_data)); #endif @@ -1143,351 +854,712 @@ static void SoftmaxWithCrossEntropySoftLabel( } } -template -__global__ void SoftCrossEntropyGradientKernel(T* logit_grad, - const T* loss_grad, - const T* labels, const int64_t n, - const int64_t d, - const int64_t remain) { - int64_t ids = blockIdx.x * blockDim.x + threadIdx.x; - if (ids < n * d) { - int64_t idx_n = ids / d; - int64_t idx_remain = ids % remain; - int64_t idx_loss = idx_n * remain + idx_remain; - logit_grad[ids] = loss_grad[idx_loss] * (logit_grad[ids] - labels[ids]); - } -} +/* + Core function of softmax with cross entropy forward + - softmax, SoftmaxMode=kSoftmax + - log softmax, SoftmaxMode=kLogSoftmax + - softmax with cross entropy hard label, SoftmaxMode=kCrossEntropy + The computation includes + - Compute max value: maxvalue_{i} = max_j src_{i,j} + - Compute sum of exp: s_{i} = sum_{j}{e^{src_{i,j} - maxvalue_{i}}} + - Compute: softmax_{i,j} = e^{src_{i,j} - maxvalue_{i}} / s_{i} + - Compute: logsoftmax_{i,j} = src_{i,j} - maxvalue_{i} - log(s_{i}) + - Compute: loss_{i} = -logsoftmax[i,label[i]] (Hard label) + This computation results from following formula: + softmax_{i,j} = e^{src_{i,j}} / sum_{j}{e^{src_{i,j}}} + = e^{src_{i,j} - maxvalue_{i}} + / sum_{j}{e^{src_{i,j} - maxvalue_{i}}} + = e^{src_{i,j} - maxvalue_{i}} / s_{i} + logsoftmax_{i,j} = log(softmax_{i,j}) + = src_{i,j} - maxvalue_{i} - log(s_{i}) + One warp (32 threads) is used to compute 1 or 2 batch (kBatchSize). + For reduction max (sum), firstly compute max (sum) to one warp, then use + shuffle api to compute max (sum) in one warp. +*/ +template +__global__ void WarpSoftmaxForward(T* loss, + T* softmax, + const T* src, + const LabelT* label, + const int batch_size, + const int stride, + const int element_count, + const int ignore_index) { + constexpr int kDimCeil = 1 << Log2Elements; + constexpr int kWarpSize = (kDimCeil < 32) ? kDimCeil : 32; + constexpr int kVSize = sizeof(VecT) / sizeof(T); + constexpr int kIterations = kDimCeil / kWarpSize; + constexpr int kIterationsV = + (kIterations >= kVSize) ? (kIterations / kVSize) : 1; + constexpr int kBatchSize = (kDimCeil <= 128) ? 2 : 1; -template -__global__ void SoftLabelCrossEntropyGradientKernel(T* logit_grad, - const T* loss_grad, - const T* labels, - const int n, const int d, - const int remain) { - int ids = blockIdx.x * blockDim.x + threadIdx.x; - if (ids < n * d) { - int idx_n = ids / d; - int idx_remain = ids % remain; - int idx_loss = idx_n * remain + idx_remain; - logit_grad[ids] = loss_grad[idx_loss] * (-labels[ids] / logit_grad[ids]); + int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * kBatchSize; + + // max index to read + int idx_max_v[kBatchSize]; +#pragma unroll + for (int i = 0; i < kBatchSize; i++) { + int idx_max = ((i + first_batch) < batch_size) ? 
element_count : 0; + idx_max_v[i] = idx_max / kVSize; } -} -template -__global__ void HardLabelCrossEntropyGradientKernel(T* logit_grad, - const LabelT* labels, - const int n, const int d, - const int remain, - const int ignore_index) { - CUDA_KERNEL_LOOP(index, n * remain) { - int idx_n = index / remain; - int idx_remain = index % remain; - int tmp = static_cast(labels[index]); - int idx = idx_n * d + tmp * remain + idx_remain; - if (ignore_index != tmp) { - logit_grad[idx] = -static_cast(1.) / logit_grad[idx]; + // read data from global memory + AccT srcdata[kBatchSize][kIterationsV][kVSize]; + +#pragma unroll + for (int i = 0; i < kBatchSize; ++i) { +// read data to srcdata: - KVSize==1, - KVSize>1 +#pragma unroll + for (int it = 0; it < kIterationsV; ++it) { + int src_idx = threadIdx.x + it * kWarpSize; + if (kVSize == 1) { + if (src_idx < idx_max_v[i]) { + srcdata[i][it][0] = + static_cast(src[(first_batch + i) * stride + src_idx]); + } else { + srcdata[i][it][0] = -std::numeric_limits::infinity(); + } + } else { + const VecT* src_v = + reinterpret_cast(&src[(first_batch + i) * stride]); + if (src_idx < idx_max_v[i]) { + VecT srctmp = src_v[src_idx]; + const T* srcinptr = reinterpret_cast(&srctmp); +#pragma unroll + for (int s = 0; s < kVSize; s++) { + srcdata[i][it][s] = static_cast(srcinptr[s]); + } + } else { +#pragma unroll + for (int s = 0; s < kVSize; s++) { + srcdata[i][it][s] = -std::numeric_limits::infinity(); + } + } + } } } -} -template -__global__ void ScaleCrossEntropyGradient(T* logit_grad, const T* loss_grad, - const int num, const int d, - const int remain, - const LabelT* labels, - const int ignore_index) { - CUDA_KERNEL_LOOP(index, num) { - int idx_n = index / d; - int idx_remain = index % remain; - int idx_lbl = idx_n * remain + idx_remain; - int k = (index % d) / remain; - auto lbl = static_cast(labels[idx_lbl]); - if (lbl == ignore_index || lbl != k) { - logit_grad[index] = static_cast(0.); - } else { - logit_grad[index] *= loss_grad[idx_lbl]; + // compute max value: maxvalue_{i} = max_j src_{i,j} + AccT max_value[kBatchSize]; +#pragma unroll + for (int i = 0; i < kBatchSize; ++i) { + // it = 0 + AccT valmax = srcdata[i][0][0]; +#pragma unroll + for (int s = 1; s < kVSize; ++s) { + valmax = (valmax > srcdata[i][0][s]) ? valmax : srcdata[i][0][s]; } - } -} + max_value[i] = valmax; -template -class SoftmaxWithCrossEntropyCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - RunSoftmaxWithCrossEntropyFunctor(context, *this); +// it = 1, 2, ... +#pragma unroll + for (int it = 1; it < kIterationsV; ++it) { + AccT valmax = srcdata[i][it][0]; +#pragma unroll + for (int s = 1; s < kVSize; ++s) { + valmax = (valmax > srcdata[i][it][s]) ? valmax : srcdata[i][it][s]; + } + max_value[i] = (max_value[i] > valmax) ? 
max_value[i] : valmax; + } } + phi::WarpReduceMax(max_value); - template - static void Apply(const framework::ExecutionContext& context, - const framework::Tensor& labels, const bool soft_label) { - PADDLE_ENFORCE_EQ( - platform::is_gpu_place(context.GetPlace()), true, - platform::errors::Unavailable("softmax_with_cross_entropy operator's " - "CUDA kernel only runs on GPU device.")); - const bool use_softmax = context.Attr("use_softmax"); - - // do not with softmax op, and input is softmax - if (!use_softmax) { - const Tensor* softmax = context.Input("Logits"); - Tensor* softmax_out = context.Output("Softmax"); - Tensor* loss = context.Output("Loss"); - - const int rank = softmax->dims().size(); - const int axis = - phi::funcs::CanonicalAxis(context.Attr("axis"), rank); - const int axis_dim = softmax->dims()[axis]; - - const int n = phi::funcs::SizeToAxis(axis, softmax->dims()); - const int d = phi::funcs::SizeFromAxis(axis, softmax->dims()); - - auto* softmax_out_data = - softmax_out->template mutable_data(context.GetPlace()); - auto* loss_data = loss->template mutable_data(context.GetPlace()); - - phi::funcs::SetConstant set_constant; - set_constant(context.cuda_device_context(), loss, static_cast(0)); - if (axis_dim == 1) { - set_constant(context.cuda_device_context(), softmax_out, - static_cast(1)); - return; + // compute sum: s_{i} = sum_{j}{ exp(src_{i,j} - maxvalue_{i} } + AccT sum[kBatchSize]; +#pragma unroll + for (int i = 0; i < kBatchSize; ++i) { + // it = 0 + if (mode == SoftmaxMode::kLogSoftmax || + mode == SoftmaxMode::kCrossEntropy) { + sum[i] = std::exp(srcdata[i][0][0] - max_value[i]); + } else { + srcdata[i][0][0] = std::exp(srcdata[i][0][0] - max_value[i]); + sum[i] = srcdata[i][0][0]; + } +#pragma unroll + for (int s = 1; s < kVSize; ++s) { + if (mode == SoftmaxMode::kLogSoftmax || + mode == SoftmaxMode::kCrossEntropy) { + sum[i] += std::exp(srcdata[i][0][s] - max_value[i]); + } else { + srcdata[i][0][s] = std::exp(srcdata[i][0][s] - max_value[i]); + sum[i] += srcdata[i][0][s]; } + } - auto ignore_index = context.Attr("ignore_index"); - - Tensor softmax_2d, labels_2d, loss_2d, softmax_out_2d; - softmax_2d.ShareDataWith(*softmax).Resize({n, d}); - labels_2d.ShareDataWith(labels).Resize({n, labels.numel() / n}); - loss_2d.ShareDataWith(*loss).Resize({n, 1}); - softmax_out_2d.ShareDataWith(*softmax_out).Resize({n, d}); - - // math::CrossEntropyFunctor support axis is the last - if (axis == -1) { - math::CrossEntropyFunctor()( - context.cuda_device_context(), &loss_2d, &softmax_2d, &labels_2d, - soft_label, ignore_index, axis_dim); - return; +// it = 1, 2, ... 
+#pragma unroll + for (int it = 1; it < kIterationsV; ++it) { +#pragma unroll + for (int s = 0; s < kVSize; ++s) { + if (mode == SoftmaxMode::kLogSoftmax || + mode == SoftmaxMode::kCrossEntropy) { + sum[i] += std::exp(srcdata[i][it][s] - max_value[i]); + } else { + srcdata[i][it][s] = std::exp(srcdata[i][it][s] - max_value[i]); + sum[i] += srcdata[i][it][s]; + } } + } + } + phi::WarpReduceSum(sum); - // if axis is not the last, we need a new impliment - if (soft_label) { - auto* logits_data = softmax->template data(); - auto* labels_data = labels.template data(); +// write data +#pragma unroll + for (int i = 0; i < kBatchSize; ++i) { + if (mode == SoftmaxMode::kLogSoftmax || + mode == SoftmaxMode::kCrossEntropy) { + sum[i] = std::log(sum[i]); + } - const int kDimLog2 = static_cast(Log2Ceil(axis_dim)); - const int kDimCeil = 1 << kDimLog2; -#ifdef __HIPCC__ - int kThreadPerBlock = 256; -#else - int kThreadPerBlock = 512; -#endif - int kBatchPerBlock = 1; - int blocks = (n * d + kBatchPerBlock - 1) / kBatchPerBlock; - dim3 threads(kThreadPerBlock / kBatchPerBlock, kBatchPerBlock, 1); - - CrossEntropySoftLabel<<< - blocks, threads, 0, context.cuda_device_context().stream()>>>( - loss_data, NULL, logits_data, labels_data, n, axis_dim, - d / axis_dim, kDimLog2); - } else { // HardLabel - auto* logits_data = softmax->template data(); - auto* labels_data = labels.template data(); - int threads = 128; - int blocks = (n * d / axis_dim + threads - 1) / threads; - if (ignore_index >= 0 && ignore_index < axis_dim) { - CrossEntropyHardLabel<<< - blocks, threads, 0, context.cuda_device_context().stream()>>>( - loss_data, logits_data, labels_data, n, axis_dim, d / axis_dim, - ignore_index); +#pragma unroll + for (int it = 0; it < kIterationsV; ++it) { + int idx = threadIdx.x + it * kWarpSize; + if (kVSize == 1) { // kVSize==1 + if (idx < idx_max_v[i]) { + if (mode == SoftmaxMode::kLogSoftmax) { // log softmax + softmax[(first_batch + i) * stride + idx] = + srcdata[i][it][0] - max_value[i] - sum[i]; + // softmax with cross entropy hard label + } else if (mode == SoftmaxMode::kCrossEntropy) { + AccT logsoftmax = srcdata[i][it][0] - max_value[i] - sum[i]; + // softmax + softmax[(first_batch + i) * stride + idx] = std::exp(logsoftmax); + // label + int loss_idx = (threadIdx.x + it * kWarpSize) * kVSize; + auto lbl = static_cast(label[first_batch + i]); + if (IgnoreIndex == true) { + // IgnoreIndex is true + if (lbl == loss_idx) { + if (lbl != ignore_index) { + loss[first_batch + i] = -logsoftmax; + } else { + loss[first_batch + i] = static_cast(0.0); + } + } + } else { + // IgnoreIndex is false + if (lbl >= 0 && lbl < element_count) { + if (lbl == loss_idx) { + loss[first_batch + i] = -logsoftmax; + } + } else { + loss[first_batch + i] = static_cast(0.0); + } + } + } else { // softmax + softmax[(first_batch + i) * stride + idx] = + srcdata[i][it][0] / sum[i]; + } + } else { + break; + } + } else { // KVSize>1 + VecT* softmax_v = + reinterpret_cast(&softmax[(first_batch + i) * stride]); + VecT tmpdata; + T* tmpptr = reinterpret_cast(&tmpdata); +#pragma unroll + for (int s = 0; s < kVSize; ++s) { + if (mode == SoftmaxMode::kLogSoftmax) { // log softmax + tmpptr[s] = srcdata[i][it][s] - max_value[i] - sum[i]; + // softmax with cross entropy hard label + } else if (mode == SoftmaxMode::kCrossEntropy) { + AccT logsoftmax = srcdata[i][it][s] - max_value[i] - sum[i]; + // softmax + tmpptr[s] = std::exp(logsoftmax); + // label + int loss_idx = (threadIdx.x + it * kWarpSize) * kVSize + s; + auto lbl = 
static_cast(label[first_batch + i]); + if (IgnoreIndex == true) { + // IgnoreIndex is true + if (lbl == loss_idx && lbl != ignore_index) { + loss[first_batch + i] = -logsoftmax; + } + } else { + // IgnoreIndex is false + if (lbl >= 0 && lbl < element_count) { + if (lbl == loss_idx) { + loss[first_batch + i] = -logsoftmax; + } + } else { + loss[first_batch + i] = static_cast(0.0); + } + } + } else { // softmax + tmpptr[s] = srcdata[i][it][s] / sum[i]; + } + } + if (idx < idx_max_v[i]) { + softmax_v[idx] = tmpdata; } else { - CrossEntropyHardLabel<<< - blocks, threads, 0, context.cuda_device_context().stream()>>>( - loss_data, logits_data, labels_data, n, axis_dim, d / axis_dim, - ignore_index); + break; } } - - // cause of input is softmax - // copy to output softmax, directly - framework::TensorCopy(*softmax, context.GetPlace(), - context.device_context(), softmax_out); - - return; } + } +} - const Tensor* logits = context.Input("Logits"); - Tensor* softmax = context.Output("Softmax"); - Tensor* loss = context.Output("Loss"); +#define SOFTMAX_WARP_FORWARD_CASE(Log2Elements, LabelT, VecT, AccT) \ + case Log2Elements: \ + WarpSoftmaxForward<<>>( \ + loss, \ + softmax, \ + src, \ + label, \ + batch_size, \ + stride, \ + element_count, \ + ignore_index); \ + break; - const int rank = logits->dims().size(); - const int axis = phi::funcs::CanonicalAxis(context.Attr("axis"), rank); - int axis_dim = logits->dims()[axis]; +/* + Wrapper of softmax with cross entropy forward hard label. +*/ +template +void SwitchWarpSoftmaxForward(T* loss, + T* softmax, + const T* src, + const LabelT* label, + const int batch_size, + const int stride, + const int element_count, + const int ignore_index, + gpuStream_t stream) { + using AccT = typename dtype::MPTypeTrait::Type; - const int64_t n = phi::funcs::SizeToAxis(axis, logits->dims()); - const int64_t d = phi::funcs::SizeFromAxis(axis, logits->dims()); + // use 128 threads per block to maximimize gpu utilization + const int log2_elements = static_cast(Log2Ceil(element_count)); + const int kDimCeil = 1 << log2_elements; + int kWarpSize = (kDimCeil < 32) ? kDimCeil : 32; + int batches_per_warp = (kDimCeil <= 128) ? 
2 : 1; + constexpr int threads_per_block = 128; + int warps_per_block = (threads_per_block / kWarpSize); + int batches_per_block = warps_per_block * batches_per_warp; + int blocks = (batch_size + batches_per_block - 1) / batches_per_block; + dim3 threads(kWarpSize, warps_per_block, 1); - auto* softmax_data = softmax->template mutable_data(context.GetPlace()); - auto* loss_data = loss->template mutable_data(context.GetPlace()); + switch (log2_elements) { + SOFTMAX_WARP_FORWARD_CASE(0, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(1, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(2, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(3, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(4, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(5, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(6, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(7, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(8, LabelT, T, AccT); + SOFTMAX_WARP_FORWARD_CASE(9, LabelT, T, AccT); + default: + break; + } +} - if (axis_dim == 1) { - phi::funcs::SetConstant set_constant; - set_constant(context.cuda_device_context(), softmax, static_cast(1)); - set_constant(context.cuda_device_context(), loss, static_cast(0)); - return; - } +template +void LaunchVectorizedSoftmaxForward(T* loss, + T* softmax, + const T* logits, + const LabelT* label, + const int high_dim, + const int mid_dim, + const int ignore_index, + gpuStream_t stream) { + using AccT = typename dtype::MPTypeTrait::Type; + constexpr int vec_size = sizeof(float4) / sizeof(T); + const int max_num_threads = 1024; + int max_block_size = std::min(mid_dim / vec_size, max_num_threads); + if (vec_size > 1) { + max_block_size /= 2; + } - auto ignore_index = context.Attr("ignore_index"); + int block_size = 1; + while (block_size < max_block_size) { + block_size *= 2; + } + block_size = std::max(block_size, kps::details::kWarpSize); + dim3 grids(high_dim); + dim3 blocks(block_size); + VectorizedSoftmaxForward<<>>( + loss, softmax, logits, label, high_dim, mid_dim, ignore_index); +} - if (soft_label) { - auto* logits_data = logits->template data(); - auto* labels_data = labels.template data(); - SoftmaxWithCrossEntropySoftLabel( - context.cuda_device_context(), rank, axis, logits_data, labels_data, - softmax_data, loss_data, n, axis_dim, d / axis_dim); - } else { - if (!context.Attr("numeric_stable_mode")) { - // CUDNN kernel only suppoer 2-D tensor and perfome softmax on last dim - Tensor logits_2d, softmax_2d, labels_2d, loss_2d; - logits_2d.ShareDataWith(*logits).Resize({n, d}); - softmax_2d.ShareDataWith(*softmax).Resize({n, d}); - labels_2d.ShareDataWith(labels).Resize({n, labels.numel() / n}); - loss_2d.ShareDataWith(*loss).Resize({n, 1}); - math::SoftmaxCUDNNFunctor()(context.cuda_device_context(), - &logits_2d, &softmax_2d); - math::CrossEntropyFunctor()( - context.cuda_device_context(), &loss_2d, &softmax_2d, &labels_2d, - false, ignore_index, axis_dim); - } else { - auto* logits_data = logits->template data(); - auto* labels_data = labels.template data(); - if (ignore_index >= 0 && ignore_index < axis_dim) { - SoftmaxWithCrossEntropyHardLabel( - context.cuda_device_context(), rank, axis, logits_data, - labels_data, loss_data, softmax_data, n, axis_dim, d / axis_dim, - ignore_index); - } else { - SoftmaxWithCrossEntropyHardLabel( - context.cuda_device_context(), rank, axis, logits_data, - labels_data, loss_data, softmax_data, n, axis_dim, d / axis_dim, - ignore_index); - } - } +/* + Wrapper of softmax with cross entropy hard label. 
+ - SwitchWarpSoftmaxForward for small size when axis == -1 + - LaunchVectorizedSoftmaxForward for large size when axis == -1 + - cudnn function for axis != -1 +*/ +template +static void SoftmaxWithCrossEntropyHardLabel(const GPUContext& dev_ctx, + int rank, + int axis, + const T* logits_data, + const LabelT* labels_data, + T* loss_data, + T* softmax_data, + int N, + int dim, + int D, + const int ignore_index) { + auto stream = dev_ctx.stream(); + constexpr int max_dim = 320; + if (D == 1) { + if (dim <= max_dim) { // small size + const SoftmaxMode mode = SoftmaxMode::kCrossEntropy; + SwitchWarpSoftmaxForward(loss_data, + softmax_data, + logits_data, + labels_data, + N, + dim, + dim, + ignore_index, + stream); + } else { // large size + LaunchVectorizedSoftmaxForward(loss_data, + softmax_data, + logits_data, + labels_data, + N, + dim, + ignore_index, + stream); } - } -}; + } else { + ScopedTensorDescriptor desc; + std::vector tensor_dims = {N, dim, D, 1}; + GPUDNNDataLayout layout = GPUDNNDataLayout::kNCHW; +#ifdef PADDLE_WITH_HIP + miopenTensorDescriptor_t descp = desc.descriptor(layout, tensor_dims); +#else + cudnnTensorDescriptor_t descp = desc.descriptor(layout, tensor_dims); +#endif -template -class SoftmaxWithCrossEntropyGradCUDAKernel : public framework::OpKernel { - public: - void Compute(const framework::ExecutionContext& context) const override { - RunSoftmaxWithCrossEntropyFunctor(context, *this); + auto handle = dev_ctx.cudnn_handle(); + +#ifdef PADDLE_WITH_HIP + auto mode = axis == rank - 1 ? MIOPEN_SOFTMAX_MODE_INSTANCE + : MIOPEN_SOFTMAX_MODE_CHANNEL; + PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::miopenSoftmaxForward_V2( + handle, + paddle::platform::CudnnDataType::kOne(), + descp, + logits_data, + paddle::platform::CudnnDataType::kZero(), + descp, + softmax_data, + MIOPEN_SOFTMAX_LOG, + mode)); +#else + auto mode = axis == rank - 1 ? 
CUDNN_SOFTMAX_MODE_INSTANCE + : CUDNN_SOFTMAX_MODE_CHANNEL; + PADDLE_ENFORCE_GPU_SUCCESS(phi::dynload::cudnnSoftmaxForward( + handle, + CUDNN_SOFTMAX_LOG, + mode, + paddle::platform::CudnnDataType::kOne(), + descp, + logits_data, + paddle::platform::CudnnDataType::kZero(), + descp, + softmax_data)); +#endif + int threads = 128; + int blocks = (N * dim * D + threads - 1) / threads; + // compute cross entropy, input is log softmax + CrossEntropyExpHardLabel<<>>( + loss_data, softmax_data, labels_data, N, dim, D, ignore_index); } +} - template - static void Apply(const framework::ExecutionContext& context, - const framework::Tensor& labels, const bool soft_label) { - PADDLE_ENFORCE_EQ( - platform::is_gpu_place(context.GetPlace()), true, - platform::errors::Unavailable("softmax_with_cross_entropy operator's " - "CUDA kernel only runs on GPU device.")); - const T* loss_grad_data = - context.Input(framework::GradVarName("Loss")) - ->template data(); - Tensor* logit_grad = - context.Output(framework::GradVarName("Logits")); - const Tensor* softmax = context.Input("Softmax"); - auto stream = context.cuda_device_context().stream(); - auto ignore_index = context.Attr("ignore_index"); - auto use_softmax = context.Attr("use_softmax"); - - T* logit_grad_data = nullptr; - bool copy_flag = (logit_grad != softmax && (!use_softmax || soft_label)); - if (copy_flag) { - framework::TensorCopy(*softmax, context.GetPlace(), - context.device_context(), logit_grad); - logit_grad_data = logit_grad->template data(); - } else { - logit_grad_data = - logit_grad->template mutable_data(context.GetPlace()); +template +void CrossEntropyWithSoftmaxCUDAKernel(const GPUContext& dev_ctx, + const DenseTensor& logits, + const DenseTensor& label, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* softmax, + DenseTensor* loss) { + PADDLE_ENFORCE_EQ( + dev_ctx.GetPlace().GetType(), + AllocationType::GPU, + phi::errors::Unavailable("softmax_with_cross_entropy operator's " + "CUDA kernel only runs on GPU device.")); + + // do not with softmax op, and input is softmax + if (!use_softmax) { + DenseTensor* softmax_out = softmax; + const DenseTensor* softmax = &logits; + const DenseTensor& labels = label; + + const int rank = softmax->dims().size(); + const int axis_v = phi::funcs::CanonicalAxis(axis, rank); + const int axis_dim = softmax->dims()[axis_v]; + + const int n = phi::funcs::SizeToAxis(axis_v, softmax->dims()); + const int d = phi::funcs::SizeFromAxis(axis_v, softmax->dims()); + + auto* softmax_out_data = dev_ctx.template Alloc(softmax_out); + auto* loss_data = dev_ctx.template Alloc(loss); + + phi::funcs::SetConstant set_constant; + set_constant(dev_ctx, loss, static_cast(0)); + if (axis_dim == 1) { + set_constant(dev_ctx, softmax_out, static_cast(1)); + return; } - const int rank = logit_grad->dims().size(); - const int axis = phi::funcs::CanonicalAxis(context.Attr("axis"), rank); - int axis_dim = logit_grad->dims()[axis]; + DenseTensor softmax_2d(*softmax); + softmax_2d.Resize({n, d}); + DenseTensor labels_2d(labels); + labels_2d.Resize({n, labels.numel() / n}); + DenseTensor loss_2d(*loss); + loss_2d.Resize({n, 1}); + DenseTensor softmax_out_2d(*softmax_out); + softmax_out_2d.Resize({n, d}); + + // math::CrossEntropyFunctor support axis is the last + if (axis_v == -1) { + paddle::operators::math::CrossEntropyFunctor()( + dev_ctx, + &loss_2d, + &softmax_2d, + &labels_2d, + soft_label, + ignore_index, + axis_dim); + return; + } - const int64_t n = 
phi::funcs::SizeToAxis(axis, logit_grad->dims()); - const int64_t d = phi::funcs::SizeFromAxis(axis, logit_grad->dims()); - const int64_t remain = d / axis_dim; + // if axis is not the last, we need a new impliment + if (soft_label) { + auto* logits_data = softmax->data(); + auto* labels_data = labels.data(); + const int kDimLog2 = static_cast(Log2Ceil(axis_dim)); + const int kDimCeil = 1 << kDimLog2; #ifdef __HIPCC__ - int block = 256; + int kThreadPerBlock = 256; #else - int block = 512; + int kThreadPerBlock = 512; #endif - - // do not with softmax op, and input is softmax - if (!use_softmax) { - if (soft_label) { - int grid = (n * d + block - 1) / block; - const T* label_data = labels.template data(); - SoftLabelCrossEntropyGradientKernel<<>>( - logit_grad_data, loss_grad_data, label_data, n, d, remain); + int kBatchPerBlock = 1; + int blocks = (n * d + kBatchPerBlock - 1) / kBatchPerBlock; + dim3 threads(kThreadPerBlock / kBatchPerBlock, kBatchPerBlock, 1); + + CrossEntropySoftLabel<<>>( + loss_data, + NULL, + logits_data, + labels_data, + n, + axis_dim, + d / axis_dim, + kDimLog2); + } else { // HardLabel + auto* logits_data = softmax->data(); + auto* labels_data = labels.data(); + int threads = 128; + int blocks = (n * d / axis_dim + threads - 1) / threads; + if (ignore_index >= 0 && ignore_index < axis_dim) { + CrossEntropyHardLabel<<>>( + loss_data, + logits_data, + labels_data, + n, + axis_dim, + d / axis_dim, + ignore_index); } else { - Tensor logits_grad_2d; - logits_grad_2d.ShareDataWith(*logit_grad).Resize({n, d}); - int grid = (n * remain + block - 1) / block; - const auto* label_data = labels.template data(); - HardLabelCrossEntropyGradientKernel<<>>( - logit_grad_data, label_data, n, d, remain, ignore_index); - int num = n * d; - grid = (num + block - 1) / block; - ScaleCrossEntropyGradient<<>>( - logit_grad_data, loss_grad_data, num, d, remain, label_data, + CrossEntropyHardLabel<<>>( + loss_data, + logits_data, + labels_data, + n, + axis_dim, + d / axis_dim, ignore_index); } - - return; } - // with softmax, continue + // cause of input is softmax + // copy to output softmax, directly + phi::Copy( + dev_ctx, *softmax, dev_ctx.GetPlace(), false, softmax_out); - if (soft_label) { - int64_t grid = (n * d + block - 1) / block; - const T* label_data = labels.template data(); - SoftCrossEntropyGradientKernel<<>>( - logit_grad_data, loss_grad_data, label_data, n, d, remain); + return; + } + + const int rank = logits.dims().size(); + const int axis_v = phi::funcs::CanonicalAxis(axis, rank); + int axis_dim = logits.dims()[axis_v]; + + const int64_t n = phi::funcs::SizeToAxis(axis_v, logits.dims()); + const int64_t d = phi::funcs::SizeFromAxis(axis_v, logits.dims()); + + auto* softmax_data = dev_ctx.template Alloc(softmax); + auto* loss_data = dev_ctx.template Alloc(loss); + + if (axis_dim == 1) { + phi::funcs::SetConstant set_constant; + set_constant(dev_ctx, softmax, static_cast(1)); + set_constant(dev_ctx, loss, static_cast(0)); + return; + } + + if (soft_label) { + auto* logits_data = logits.data(); + auto* labels_data = label.data(); + SoftmaxWithCrossEntropySoftLabel(dev_ctx, + rank, + axis_v, + logits_data, + labels_data, + softmax_data, + loss_data, + n, + axis_dim, + d / axis_dim); + } else { + if (!numeric_stable_mode) { + // CUDNN kernel only suppoer 2-D tensor and perfome softmax on last dim + DenseTensor logits_2d(logits); + logits_2d.Resize({n, d}); + DenseTensor softmax_2d(*softmax); + softmax_2d.Resize({n, d}); + DenseTensor labels_2d(label); + labels_2d.Resize({n, 
label.numel() / n}); + DenseTensor loss_2d(*loss); + loss_2d.Resize({n, 1}); + paddle::operators::math::SoftmaxCUDNNFunctor()( + dev_ctx, &logits_2d, &softmax_2d); + paddle::operators::math::CrossEntropyFunctor()( + dev_ctx, + &loss_2d, + &softmax_2d, + &labels_2d, + false, + ignore_index, + axis_dim); } else { - const T* softmax_data = softmax->template data(); - const auto* label_data = labels.template data(); - int grid = (n * d + block - 1) / block; - SoftmaxWithCrossEntropyGradHardLabel<<>>( - logit_grad_data, loss_grad_data, softmax_data, label_data, n, - d / remain, remain, ignore_index); + auto* logits_data = logits.data(); + auto* labels_data = label.data(); + if (ignore_index >= 0 && ignore_index < axis_dim) { + SoftmaxWithCrossEntropyHardLabel(dev_ctx, + rank, + axis_v, + logits_data, + labels_data, + loss_data, + softmax_data, + n, + axis_dim, + d / axis_dim, + ignore_index); + } else { + SoftmaxWithCrossEntropyHardLabel(dev_ctx, + rank, + axis_v, + logits_data, + labels_data, + loss_data, + softmax_data, + n, + axis_dim, + d / axis_dim, + ignore_index); + } } } -}; +} + +template +void CrossEntropyWithSoftmaxKernel(const Context& dev_ctx, + const DenseTensor& logits, + const DenseTensor& label, + bool soft_label, + bool use_softmax, + bool numeric_stable_mode, + int ignore_index, + int axis, + DenseTensor* softmax, + DenseTensor* loss) { + auto dtype = label.dtype(); + if (soft_label) { + PADDLE_ENFORCE_EQ( + dtype, + paddle::experimental::CppTypeToDataType::Type(), + phi::errors::InvalidArgument("The Input(Label) should be with the " + "same data type as Input(Logits).")); + CrossEntropyWithSoftmaxCUDAKernel(dev_ctx, + logits, + label, + soft_label, + use_softmax, + numeric_stable_mode, + ignore_index, + axis, + softmax, + loss); + } else { + PD_DISPATCH_INTEGRAL_TYPES( + dtype, "CrossEntropyWithSoftmaxCUDAKernel", ([&] { + CrossEntropyWithSoftmaxCUDAKernel(dev_ctx, + logits, + label, + soft_label, + use_softmax, + numeric_stable_mode, + ignore_index, + axis, + softmax, + loss); + })); + } +} -} // namespace operators -} // namespace paddle +} // namespace phi -namespace ops = paddle::operators; #ifdef PADDLE_WITH_HIP -// MIOPEN do not support double -REGISTER_OP_CUDA_KERNEL( - softmax_with_cross_entropy, ops::SoftmaxWithCrossEntropyCUDAKernel, - ops::SoftmaxWithCrossEntropyCUDAKernel); -REGISTER_OP_CUDA_KERNEL( - softmax_with_cross_entropy_grad, - ops::SoftmaxWithCrossEntropyGradCUDAKernel, - ops::SoftmaxWithCrossEntropyGradCUDAKernel); +PD_REGISTER_KERNEL(cross_entropy_with_softmax, + GPU, + ALL_LAYOUT, + phi::CrossEntropyWithSoftmaxKernel, + float, + phi::dtype::float16) {} #else -REGISTER_OP_CUDA_KERNEL( - softmax_with_cross_entropy, ops::SoftmaxWithCrossEntropyCUDAKernel, - ops::SoftmaxWithCrossEntropyCUDAKernel, - ops::SoftmaxWithCrossEntropyCUDAKernel); -REGISTER_OP_CUDA_KERNEL( - softmax_with_cross_entropy_grad, - ops::SoftmaxWithCrossEntropyGradCUDAKernel, - ops::SoftmaxWithCrossEntropyGradCUDAKernel, - ops::SoftmaxWithCrossEntropyGradCUDAKernel); +PD_REGISTER_KERNEL(cross_entropy_with_softmax, + GPU, + ALL_LAYOUT, + phi::CrossEntropyWithSoftmaxKernel, + float, + double, + phi::dtype::float16) {} #endif diff --git a/paddle/phi/kernels/gpu/depthwise_conv_grad_kernel.cu b/paddle/phi/kernels/gpu/depthwise_conv_grad_kernel.cu index 4f27b6fde99ff..5fc5482a0808b 100644 --- a/paddle/phi/kernels/gpu/depthwise_conv_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/depthwise_conv_grad_kernel.cu @@ -24,9 +24,9 @@ namespace phi { template void DepthwiseConvGradKernel(const Context& 
dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides_t, const std::vector& paddings_t, const std::string& padding_algorithm, diff --git a/paddle/phi/kernels/gpu/full_kernel.cu b/paddle/phi/kernels/gpu/full_kernel.cu index 79b71b95d9ee8..50e57a46317e3 100644 --- a/paddle/phi/kernels/gpu/full_kernel.cu +++ b/paddle/phi/kernels/gpu/full_kernel.cu @@ -132,6 +132,7 @@ PD_REGISTER_KERNEL(full_like, phi::FullLikeKernel, float, double, + uint8_t, int16_t, int, int64_t, diff --git a/paddle/phi/kernels/gpu/gaussian_random_kernel.cu b/paddle/phi/kernels/gpu/gaussian_random_kernel.cu index e159e5916cff2..96ebc0353ef24 100644 --- a/paddle/phi/kernels/gpu/gaussian_random_kernel.cu +++ b/paddle/phi/kernels/gpu/gaussian_random_kernel.cu @@ -14,10 +14,7 @@ #include "paddle/phi/kernels/gaussian_random_kernel.h" -#include -#include #include -#include #include "paddle/phi/backends/gpu/gpu_context.h" #include "paddle/phi/common/amp_type_traits.h" #include "paddle/phi/core/dense_tensor.h" @@ -27,8 +24,6 @@ #include "paddle/fluid/framework/generator.h" -DECLARE_bool(use_curand); - namespace phi { template @@ -83,21 +78,11 @@ void GaussianRandomKernel(const Context& dev_ctx, auto gen_cuda = paddle::framework::GetDefaultCUDAGenerator(device_id); if (gen_cuda->GetIsInitPy() && seed_flag) { - if (FLAGS_use_curand) { - using MT = typename phi::dtype::MPTypeTrait::Type; - funcs::normal_distribution dist; - funcs::normal_transform trans(static_cast(mean), - static_cast(std)); - funcs::distribution_and_transform(dev_ctx, tensor, dist, trans); - } else { - auto seed_offset = gen_cuda->IncrementOffset(1); - int64_t gen_offset = size * seed_offset.second; - auto func = GaussianGenerator(static_cast(mean), - static_cast(std), - seed_offset.first, - gen_offset); - IndexKernel>(dev_ctx, tensor, func); - } + using MT = typename phi::dtype::MPTypeTrait::Type; + funcs::normal_distribution dist; + funcs::normal_transform trans(static_cast(mean), + static_cast(std)); + funcs::distribution_and_transform(dev_ctx, tensor, dist, trans); } else { auto func = GaussianGenerator(static_cast(mean), static_cast(std), seed); diff --git a/paddle/phi/kernels/gpu/graph_reindex_funcs.h b/paddle/phi/kernels/gpu/graph_reindex_funcs.h new file mode 100644 index 0000000000000..ea4f67e9d47e3 --- /dev/null +++ b/paddle/phi/kernels/gpu/graph_reindex_funcs.h @@ -0,0 +1,203 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
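+
+// Device-side helpers for graph reindexing: a simple open-addressing hash
+// table on the GPU. Hash(id, size) = id % size; Insert/Search probe with an
+// increasing step (pos = Hash(pos + delta, size), delta += 1) until a slot is
+// claimed or the key is found; AttemptInsert claims a slot with atomicCAS and
+// uses atomicMin to keep the smallest index inserted for a given key.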
+ +#pragma once + +#include "paddle/phi/kernels/graph_reindex_kernel.h" + +#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/core/hostdevice.h" + +namespace phi { + +template +inline __device__ size_t Hash(T id, int64_t size) { + return id % size; +} + +template +inline __device__ bool AttemptInsert( + size_t pos, T id, int index, T* keys, int* key_index) { + if (sizeof(T) == 4) { + const T key = atomicCAS(reinterpret_cast(&keys[pos]), + static_cast(-1), + static_cast(id)); + if (key == -1 || key == id) { + atomicMin(reinterpret_cast(&key_index[pos]), // NOLINT + static_cast(index)); // NOLINT + return true; + } else { + return false; + } + } else if (sizeof(T) == 8) { + const T key = atomicCAS( + reinterpret_cast(&keys[pos]), // NOLINT + static_cast(-1), // NOLINT + static_cast(id)); // NOLINT + if (key == -1 || key == id) { + atomicMin(reinterpret_cast(&key_index[pos]), // NOLINT + static_cast(index)); // NOLINT + return true; + } else { + return false; + } + } +} + +template +inline __device__ void Insert( + T id, int index, int64_t size, T* keys, int* key_index) { + size_t pos = Hash(id, size); + size_t delta = 1; + while (!AttemptInsert(pos, id, index, keys, key_index)) { + pos = Hash(pos + delta, size); + delta += 1; + } +} + +template +inline __device__ int64_t Search(T id, const T* keys, int64_t size) { + int64_t pos = Hash(id, size); + + int64_t delta = 1; + while (keys[pos] != id) { + pos = Hash(pos + delta, size); + delta += 1; + } + + return pos; +} + +template +__global__ void BuildHashTable( + const T* items, int num_items, int64_t size, T* keys, int* key_index) { + CUDA_KERNEL_LOOP(index, num_items) { + Insert(items[index], index, size, keys, key_index); + } +} + +template +__global__ void BuildHashTable(const T* items, int num_items, int* key_index) { + CUDA_KERNEL_LOOP(index, num_items) { + atomicMin( + reinterpret_cast(&key_index[items[index]]), // NOLINT + static_cast(index)); // NOLINT + } +} + +template +__global__ void ResetHashTable(const T* items, + int num_items, + int* key_index, + int* values) { + CUDA_KERNEL_LOOP(index, num_items) { + key_index[items[index]] = -1; + values[items[index]] = -1; + } +} + +template +__global__ void GetItemIndexCount(const T* items, + int* item_count, + int num_items, + int64_t size, + const T* keys, + int* key_index) { + CUDA_KERNEL_LOOP(i, num_items) { + int64_t pos = Search(items[i], keys, size); + if (key_index[pos] == i) { + item_count[i] = 1; + } + } +} + +template +__global__ void GetItemIndexCount(const T* items, + int* item_count, + int num_items, + int* key_index) { + CUDA_KERNEL_LOOP(i, num_items) { + if (key_index[items[i]] == i) { + item_count[i] = 1; + } + } +} + +template +__global__ void FillUniqueItems(const T* items, + int num_items, + int64_t size, + T* unique_items, + const int* item_count, + const T* keys, + int* values, + int* key_index) { + CUDA_KERNEL_LOOP(i, num_items) { + int64_t pos = Search(items[i], keys, size); + if (key_index[pos] == i) { + values[pos] = item_count[i]; + unique_items[item_count[i]] = items[i]; + } + } +} + +template +__global__ void FillUniqueItems(const T* items, + int num_items, + T* unique_items, + const int* item_count, + int* values, + int* key_index) { + CUDA_KERNEL_LOOP(i, num_items) { + if (key_index[items[i]] == i) { + values[items[i]] = item_count[i]; + unique_items[item_count[i]] = items[i]; + } + } +} + +template +__global__ void ReindexSrcOutput(T* src_output, + int num_items, + int64_t size, 
+ const T* keys, + const int* values) { + CUDA_KERNEL_LOOP(i, num_items) { + int64_t pos = Search(src_output[i], keys, size); + src_output[i] = values[pos]; + } +} + +template +__global__ void ReindexSrcOutput(T* src_output, + int num_items, + const int* values) { + CUDA_KERNEL_LOOP(i, num_items) { src_output[i] = values[src_output[i]]; } +} + +template +__global__ void ReindexInputNodes(const T* nodes, + int num_items, + T* reindex_nodes, + int64_t size, + const T* keys, + const int* values) { + CUDA_KERNEL_LOOP(i, num_items) { + int64_t pos = Search(nodes[i], keys, size); + reindex_nodes[i] = values[pos]; + } +} + +} // namespace phi diff --git a/paddle/phi/kernels/gpu/graph_reindex_kernel.cu b/paddle/phi/kernels/gpu/graph_reindex_kernel.cu new file mode 100644 index 0000000000000..34bd1d6db77da --- /dev/null +++ b/paddle/phi/kernels/gpu/graph_reindex_kernel.cu @@ -0,0 +1,363 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include +#include +#include +#include +#include + +#include "paddle/phi/kernels/gpu/graph_reindex_funcs.h" +#include "paddle/phi/kernels/graph_reindex_kernel.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/core/kernel_registry.h" + +namespace phi { + +constexpr int WARP_SIZE = 32; + +template +void FillHashTable(const Context& dev_ctx, + const T* input, + int num_input, + int64_t len_hashtable, + thrust::device_vector* unique_items, + T* keys, + int* values, + int* key_index) { +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (num_input + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? grid_tmp : max_grid_dimx; + // Insert data into keys and values. + BuildHashTable<<>>( + input, num_input, len_hashtable, keys, key_index); + + // Get item index count. + thrust::device_vector item_count(num_input + 1, 0); + GetItemIndexCount<<>>( + input, + thrust::raw_pointer_cast(item_count.data()), + num_input, + len_hashtable, + keys, + key_index); + + thrust::exclusive_scan( + item_count.begin(), item_count.end(), item_count.begin()); + size_t total_unique_items = item_count[num_input]; + unique_items->resize(total_unique_items); + + // Get unique items + FillUniqueItems<<>>( + input, + num_input, + len_hashtable, + thrust::raw_pointer_cast(unique_items->data()), + thrust::raw_pointer_cast(item_count.data()), + keys, + values, + key_index); +} + +template +void FillBufferHashTable(const Context& dev_ctx, + const T* input, + int num_input, + thrust::device_vector* unique_items, + int* values, + int* key_index) { +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (num_input + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? grid_tmp : max_grid_dimx; + // Insert data. + BuildHashTable<<>>( + input, num_input, key_index); + + // Get item index count. 
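// Both FillHashTable and FillBufferHashTable compact the unique items with the
// same three-step pattern: mark the first occurrence of each item with a 1,
// exclusive-scan the flags into output offsets, then scatter every first
// occurrence to its offset. Below is a minimal single-threaded host sketch of
// that pattern, illustrative only and not part of this patch; std::unordered_set
// stands in for the device-side hash table.
#include <cstdint>
#include <numeric>
#include <unordered_set>
#include <vector>

std::vector<int64_t> CompactUnique(const std::vector<int64_t>& items) {
  std::vector<int> flag(items.size() + 1, 0);
  std::unordered_set<int64_t> seen;
  for (size_t i = 0; i < items.size(); ++i) {
    if (seen.insert(items[i]).second) flag[i] = 1;  // first occurrence only
  }
  std::vector<int> offset(flag.size(), 0);
  std::exclusive_scan(flag.begin(), flag.end(), offset.begin(), 0);
  std::vector<int64_t> unique(offset.back());       // number of unique items
  for (size_t i = 0; i < items.size(); ++i) {
    if (flag[i]) unique[offset[i]] = items[i];      // scatter to compact slot
  }
  return unique;
}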
+ thrust::device_vector item_count(num_input + 1, 0); + GetItemIndexCount<<>>( + input, thrust::raw_pointer_cast(item_count.data()), num_input, key_index); + + thrust::exclusive_scan( + item_count.begin(), item_count.end(), item_count.begin()); + size_t total_unique_items = item_count[num_input]; + unique_items->resize(total_unique_items); + + // Get unique items + FillUniqueItems<<>>( + input, + num_input, + thrust::raw_pointer_cast(unique_items->data()), + thrust::raw_pointer_cast(item_count.data()), + values, + key_index); +} + +template +void ResetBufferHashTable(const Context& dev_ctx, + const T* input, + int num_input, + thrust::device_vector* unique_items, + int* values, + int* key_index) { +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (unique_items->size() + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? grid_tmp : max_grid_dimx; + ResetHashTable<<>>( + thrust::raw_pointer_cast(unique_items->data()), + unique_items->size(), + key_index, + values); +} + +template +void Reindex(const Context& dev_ctx, + const T* inputs, + thrust::device_ptr src_outputs, + thrust::device_vector* out_nodes, + int num_inputs, + int num_edges) { + out_nodes->resize(num_inputs + num_edges); + thrust::copy(inputs, inputs + num_inputs, out_nodes->begin()); + thrust::copy( + src_outputs, src_outputs + num_edges, out_nodes->begin() + num_inputs); + thrust::device_vector unique_nodes; + unique_nodes.clear(); + + // Fill hash table + int64_t num = out_nodes->size(); + int64_t log_num = 1 << static_cast(1 + std::log2(num >> 1)); + int64_t table_size = log_num << 1; + T* keys; + int *values, *key_index; + +#ifdef PADDLE_WITH_HIP + hipMalloc(&keys, table_size * sizeof(T)); + hipMalloc(&values, table_size * sizeof(int)); + hipMalloc(&key_index, table_size * sizeof(int)); + hipMemset(keys, -1, table_size * sizeof(T)); + hipMemset(values, -1, table_size * sizeof(int)); + hipMemset(key_index, -1, table_size * sizeof(int)); +#else + cudaMalloc(&keys, table_size * sizeof(T)); + cudaMalloc(&values, table_size * sizeof(int)); + cudaMalloc(&key_index, table_size * sizeof(int)); + cudaMemset(keys, -1, table_size * sizeof(T)); + cudaMemset(values, -1, table_size * sizeof(int)); + cudaMemset(key_index, -1, table_size * sizeof(int)); +#endif + + FillHashTable(dev_ctx, + thrust::raw_pointer_cast(out_nodes->data()), + out_nodes->size(), + table_size, + &unique_nodes, + keys, + values, + key_index); + out_nodes->resize(unique_nodes.size()); + thrust::copy(unique_nodes.begin(), unique_nodes.end(), out_nodes->begin()); + +// Fill outputs with reindex result. +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (num_edges + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? 
grid_tmp : max_grid_dimx; + ReindexSrcOutput<<>>( + thrust::raw_pointer_cast(src_outputs), + num_edges, + table_size, + keys, + values); +#ifdef PADDLE_WITH_HIP + hipFree(keys); + hipFree(values); + hipFree(key_index); +#else + cudaFree(keys); + cudaFree(values); + cudaFree(key_index); +#endif +} + +template +void BufferReindex(const Context& dev_ctx, + const T* inputs, + thrust::device_ptr src_outputs, + thrust::device_vector* out_nodes, + int num_inputs, + int* hashtable_value, + int* hashtable_index, + int num_edges) { + out_nodes->resize(num_inputs + num_edges); + thrust::copy(inputs, inputs + num_inputs, out_nodes->begin()); + thrust::copy( + src_outputs, src_outputs + num_edges, out_nodes->begin() + num_inputs); + thrust::device_vector unique_nodes; + unique_nodes.clear(); + + // Fill hash table + FillBufferHashTable(dev_ctx, + thrust::raw_pointer_cast(out_nodes->data()), + out_nodes->size(), + &unique_nodes, + hashtable_value, + hashtable_index); + out_nodes->resize(unique_nodes.size()); + thrust::copy(unique_nodes.begin(), unique_nodes.end(), out_nodes->begin()); + +// Fill outputs with reindex result. +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (num_edges + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? grid_tmp : max_grid_dimx; + ReindexSrcOutput<<>>( + thrust::raw_pointer_cast(src_outputs), num_edges, hashtable_value); + + ResetBufferHashTable(dev_ctx, + thrust::raw_pointer_cast(out_nodes->data()), + out_nodes->size(), + &unique_nodes, + hashtable_value, + hashtable_index); +} + +template +__global__ void GetDstEdgeCUDAKernel(const int64_t num_rows, + const int* in_rows, + const int* dst_counts, + const int* dst_ptr, + T* dst_outputs) { + assert(blockDim.x == WARP_SIZE); + assert(blockDim.y == BLOCK_WARPS); + + int64_t out_row = blockIdx.x * TILE_SIZE + threadIdx.y; + const int64_t last_row = + min(static_cast(blockIdx.x + 1) * TILE_SIZE, num_rows); + + while (out_row < last_row) { + const int row = in_rows[out_row]; + const int dst_sample_size = dst_counts[out_row]; + const int out_row_start = dst_ptr[out_row]; + for (int idx = threadIdx.x; idx < dst_sample_size; idx += WARP_SIZE) { + dst_outputs[out_row_start + idx] = row; + } + out_row += BLOCK_WARPS; + } +} + +template +void GraphReindexKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& neighbors, + const DenseTensor& count, + paddle::optional hashtable_value, + paddle::optional hashtable_index, + bool flag_buffer_hashtable, + DenseTensor* reindex_src, + DenseTensor* reindex_dst, + DenseTensor* out_nodes) { + const T* x_data = x.data(); + const T* neighbors_data = neighbors.data(); + const int* count_data = count.data(); + const int bs = x.dims()[0]; + const int num_edges = neighbors.dims()[0]; + reindex_src->Resize({num_edges}); + + T* reindex_src_data = dev_ctx.template Alloc(reindex_src); + thrust::device_ptr src_outputs(reindex_src_data); + + thrust::device_vector unique_nodes; + thrust::copy(neighbors_data, neighbors_data + num_edges, src_outputs); + + if (flag_buffer_hashtable) { + // Here we directly use buffer tensor to act as a hash table. 
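// Functionally, graph_reindex (with or without the buffer hash table) renumbers
// nodes by order of first appearance in the concatenation [x, neighbors]: the
// input nodes keep the smallest ids, newly seen neighbors get the next ids, and
// reindex_src holds the renumbered neighbor list. A minimal host-side sketch of
// that mapping follows; it is illustrative only and is not the device algorithm
// used in this file.
#include <cstdint>
#include <unordered_map>
#include <vector>

void ReindexHost(const std::vector<int64_t>& x,
                 const std::vector<int64_t>& neighbors,
                 std::vector<int64_t>* reindex_src,
                 std::vector<int64_t>* out_nodes) {
  std::unordered_map<int64_t, int64_t> to_new;  // original id -> compact id
  auto assign = [&](int64_t node) {
    auto it = to_new.find(node);
    if (it != to_new.end()) return it->second;
    int64_t new_id = static_cast<int64_t>(out_nodes->size());
    to_new.emplace(node, new_id);
    out_nodes->push_back(node);                 // order of first appearance
    return new_id;
  };
  for (int64_t node : x) assign(node);
  reindex_src->clear();
  for (int64_t node : neighbors) reindex_src->push_back(assign(node));
}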
+ DenseTensor hashtable_value_out(hashtable_value->type()); + const auto* ph_value = hashtable_value.get_ptr(); + hashtable_value_out.ShareDataWith(*ph_value); + DenseTensor hashtable_index_out(hashtable_index->type()); + const auto* ph_index = hashtable_index.get_ptr(); + hashtable_index_out.ShareDataWith(*ph_index); + int* hashtable_value_data = + hashtable_value_out.mutable_data(dev_ctx.GetPlace()); + int* hashtable_index_data = + hashtable_index_out.mutable_data(dev_ctx.GetPlace()); + BufferReindex(dev_ctx, + x_data, + src_outputs, + &unique_nodes, + bs, + hashtable_value_data, + hashtable_index_data, + num_edges); + } else { + Reindex( + dev_ctx, x_data, src_outputs, &unique_nodes, bs, num_edges); + } + + // Get reindex dst edge. + thrust::device_vector unique_dst_reindex(bs); + thrust::sequence(unique_dst_reindex.begin(), unique_dst_reindex.end()); + thrust::device_vector dst_ptr(bs); + thrust::exclusive_scan(count_data, count_data + bs, dst_ptr.begin()); + constexpr int BLOCK_WARPS = 128 / WARP_SIZE; + constexpr int TILE_SIZE = BLOCK_WARPS * 16; + const dim3 block(WARP_SIZE, BLOCK_WARPS); + const dim3 grid((bs + TILE_SIZE - 1) / TILE_SIZE); + + reindex_dst->Resize({num_edges}); + T* reindex_dst_data = dev_ctx.template Alloc(reindex_dst); + + GetDstEdgeCUDAKernel<<>>( + bs, + thrust::raw_pointer_cast(unique_dst_reindex.data()), + count_data, + thrust::raw_pointer_cast(dst_ptr.data()), + reindex_dst_data); + + out_nodes->Resize({static_cast(unique_nodes.size())}); + T* out_nodes_data = dev_ctx.template Alloc(out_nodes); + thrust::copy(unique_nodes.begin(), unique_nodes.end(), out_nodes_data); +} + +} // namespace phi + +PD_REGISTER_KERNEL( + graph_reindex, GPU, ALL_LAYOUT, phi::GraphReindexKernel, int, int64_t) {} diff --git a/paddle/phi/kernels/gpu/graph_sample_neighbors_kernel.cu b/paddle/phi/kernels/gpu/graph_sample_neighbors_kernel.cu new file mode 100644 index 0000000000000..1757b6b98dbf9 --- /dev/null +++ b/paddle/phi/kernels/gpu/graph_sample_neighbors_kernel.cu @@ -0,0 +1,393 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
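// The SampleKernel in this file picks up to sample_size neighbors per node with
// a warp-parallel variant of reservoir sampling: the first k candidate positions
// are kept, and each later position idx replaces a random slot whenever
// rand % (idx + 1) < k (resolved on the GPU with curand/hiprand and atomicMax on
// candidate indices). A minimal sequential host sketch of classic reservoir
// sampling over one node's neighbor slice is shown below; it is illustrative
// only and not part of this patch.
#include <cstdint>
#include <random>
#include <vector>

std::vector<int64_t> ReservoirSample(const std::vector<int64_t>& neighbors,
                                     int k,
                                     std::mt19937* rng) {
  if (static_cast<int>(neighbors.size()) <= k) return neighbors;
  std::vector<int64_t> out(neighbors.begin(), neighbors.begin() + k);
  for (int idx = k; idx < static_cast<int>(neighbors.size()); ++idx) {
    std::uniform_int_distribution<int> dist(0, idx);  // same range as rand % (idx + 1)
    int num = dist(*rng);
    if (num < k) out[num] = neighbors[idx];  // replace a random reservoir slot
  }
  return out;  // each neighbor is kept with probability k / degree
}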
+ +#include +#include +#include +#include +#include +#include + +#ifdef PADDLE_WITH_HIP +#include +#include +#else +#include +#include +#endif + +#include "paddle/phi/kernels/graph_sample_neighbors_kernel.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/core/hostdevice.h" +#include "paddle/phi/core/kernel_registry.h" + +namespace phi { + +template +struct DegreeFunctor { + const T* col_ptr; + HOSTDEVICE explicit inline DegreeFunctor(const T* x) { this->col_ptr = x; } + HOSTDEVICE inline int operator()(T i) const { + return col_ptr[i + 1] - col_ptr[i]; + } +}; + +struct MaxFunctor { + int cap; + HOSTDEVICE explicit inline MaxFunctor(int cap) { this->cap = cap; } + HOSTDEVICE inline int operator()(int x) const { + if (x > cap) { + return cap; + } + return x; + } +}; + +template +__global__ void SampleKernel(const uint64_t rand_seed, + int k, + const int64_t num_nodes, + const T* nodes, + const T* row, + const T* col_ptr, + T* output, + int* output_ptr, + int* output_idxs) { + assert(blockDim.x == WARP_SIZE); + assert(blockDim.y == BLOCK_WARPS); + + int64_t out_row = blockIdx.x * TILE_SIZE + threadIdx.y; + const int64_t last_row = + min(static_cast(blockIdx.x + 1) * TILE_SIZE, num_nodes); +#ifdef PADDLE_WITH_HIP + hiprandState rng; + hiprand_init(rand_seed * gridDim.x + blockIdx.x, + threadIdx.y * WARP_SIZE + threadIdx.x, + 0, + &rng); +#else + curandState rng; + curand_init(rand_seed * gridDim.x + blockIdx.x, + threadIdx.y * WARP_SIZE + threadIdx.x, + 0, + &rng); +#endif + + while (out_row < last_row) { + T node = nodes[out_row]; + T in_row_start = col_ptr[node]; + int deg = col_ptr[node + 1] - in_row_start; + int out_row_start = output_ptr[out_row]; + + if (deg <= k) { + for (int idx = threadIdx.x; idx < deg; idx += WARP_SIZE) { + output[out_row_start + idx] = row[in_row_start + idx]; + } + } else { + for (int idx = threadIdx.x; idx < k; idx += WARP_SIZE) { + output_idxs[out_row_start + idx] = idx; + } +#ifdef PADDLE_WITH_CUDA + __syncwarp(); +#endif + + for (int idx = k + threadIdx.x; idx < deg; idx += WARP_SIZE) { +#ifdef PADDLE_WITH_HIP + const int num = hiprand(&rng) % (idx + 1); +#else + const int num = curand(&rng) % (idx + 1); +#endif + if (num < k) { + atomicMax(reinterpret_cast( // NOLINT + output_idxs + out_row_start + num), + static_cast(idx)); // NOLINT + } + } +#ifdef PADDLE_WITH_CUDA + __syncwarp(); +#endif + + for (int idx = threadIdx.x; idx < k; idx += WARP_SIZE) { + T perm_idx = output_idxs[out_row_start + idx] + in_row_start; + output[out_row_start + idx] = row[perm_idx]; + } + } + + out_row += BLOCK_WARPS; + } +} + +template +int GetTotalSampleNum(const thrust::device_ptr input, + const T* col_ptr, + thrust::device_ptr output_count, + int sample_size, + int bs) { + thrust::transform(input, input + bs, output_count, DegreeFunctor(col_ptr)); + if (sample_size >= 0) { + thrust::transform( + output_count, output_count + bs, output_count, MaxFunctor(sample_size)); + } + int total_sample_num = thrust::reduce(output_count, output_count + bs); + return total_sample_num; +} + +template +void SampleNeighbors(const Context& dev_ctx, + const T* row, + const T* col_ptr, + const thrust::device_ptr input, + thrust::device_ptr output, + thrust::device_ptr output_count, + int sample_size, + int bs, + int total_sample_num) { + thrust::device_vector output_ptr; + thrust::device_vector output_idxs; + output_ptr.resize(bs); + output_idxs.resize(total_sample_num); + thrust::exclusive_scan( + output_count, output_count + bs, output_ptr.begin(), 0); + + constexpr int 
WARP_SIZE = 32; + constexpr int BLOCK_WARPS = 128 / WARP_SIZE; + constexpr int TILE_SIZE = BLOCK_WARPS * 16; + const dim3 block(WARP_SIZE, BLOCK_WARPS); + const dim3 grid((bs + TILE_SIZE - 1) / TILE_SIZE); + SampleKernel<<>>( + 0, + sample_size, + bs, + thrust::raw_pointer_cast(input), + row, + col_ptr, + thrust::raw_pointer_cast(output), + thrust::raw_pointer_cast(output_ptr.data()), + thrust::raw_pointer_cast(output_idxs.data())); +} + +template +__global__ void FisherYatesSampleKernel(const uint64_t rand_seed, + int k, + const int64_t num_rows, + const T* in_rows, + T* src, + const T* dst_count) { +#ifdef PADDLE_WITH_HIP + hiprandState rng; + hiprand_init( + rand_seed * gridDim.x + blockIdx.x, threadIdx.y + threadIdx.x, 0, &rng); +#else + curandState rng; + curand_init( + rand_seed * gridDim.x + blockIdx.x, threadIdx.y + threadIdx.x, 0, &rng); +#endif + CUDA_KERNEL_LOOP(out_row, num_rows) { + const T row = in_rows[out_row]; + const T in_row_start = dst_count[row]; + const int deg = dst_count[row + 1] - in_row_start; + int split; + T tmp; + + if (k < deg) { + if (deg < 2 * k) { + split = k; + } else { + split = deg - k; + } + for (int idx = deg - 1; idx >= split; idx--) { +#ifdef PADDLE_WITH_HIP + const int num = hiprand(&rng) % (idx + 1); +#else + const int num = curand(&rng) % (idx + 1); +#endif + src[in_row_start + idx] = static_cast( + atomicExch(reinterpret_cast( // NOLINT + src + in_row_start + num), + static_cast( // NOLINT + src[in_row_start + idx]))); + } + } + } +} + +template +__global__ void GatherEdge(int k, + int64_t num_rows, + const T* in_rows, + const T* src, + const T* dst_count, + T* outputs, + int* output_ptr, + T* perm_data) { + assert(blockDim.x == WARP_SIZE); + assert(blockDim.y == BLOCK_WARPS); + + int64_t out_row = blockIdx.x * TILE_SIZE + threadIdx.y; + const int64_t last_row = + min(static_cast(blockIdx.x + 1) * TILE_SIZE, num_rows); + + while (out_row < last_row) { + const T row = in_rows[out_row]; + const T in_row_start = dst_count[row]; + const int deg = dst_count[row + 1] - in_row_start; + const T out_row_start = output_ptr[out_row]; + + if (deg <= k) { + for (int idx = threadIdx.x; idx < deg; idx += WARP_SIZE) { + const T in_idx = in_row_start + idx; + outputs[out_row_start + idx] = src[in_idx]; + } + } else { + int split = k; + int begin, end; + if (deg < 2 * k) { + begin = 0; + end = k; + } else { + begin = deg - k; + end = deg; + } + + for (int idx = begin + threadIdx.x; idx < end; idx += WARP_SIZE) { + outputs[out_row_start + idx - begin] = + src[perm_data[in_row_start + idx]]; + } + } + out_row += BLOCK_WARPS; + } +} + +template +void FisherYatesSampleNeighbors(const Context& dev_ctx, + const T* row, + const T* col_ptr, + T* perm_data, + const thrust::device_ptr input, + thrust::device_ptr output, + thrust::device_ptr output_count, + int sample_size, + int bs, + int total_sample_num) { + thrust::device_vector output_ptr; + output_ptr.resize(bs); + thrust::exclusive_scan( + output_count, output_count + bs, output_ptr.begin(), 0); + +#ifdef PADDLE_WITH_HIP + int block = 256; +#else + int block = 1024; +#endif + int max_grid_dimx = dev_ctx.GetCUDAMaxGridDimSize()[0]; + int grid_tmp = (bs + block - 1) / block; + int grid = grid_tmp < max_grid_dimx ? 
grid_tmp : max_grid_dimx; + + FisherYatesSampleKernel<<>>( + 0, sample_size, bs, thrust::raw_pointer_cast(input), perm_data, col_ptr); + + constexpr int GATHER_WARP_SIZE = 32; + constexpr int GATHER_BLOCK_WARPS = 128 / GATHER_WARP_SIZE; + constexpr int GATHER_TILE_SIZE = GATHER_BLOCK_WARPS * 16; + const dim3 gather_block(GATHER_WARP_SIZE, GATHER_BLOCK_WARPS); + const dim3 gather_grid((bs + GATHER_TILE_SIZE - 1) / GATHER_TILE_SIZE); + + GatherEdge< + T, + GATHER_WARP_SIZE, + GATHER_BLOCK_WARPS, + GATHER_TILE_SIZE><<>>( + sample_size, + bs, + thrust::raw_pointer_cast(input), + row, + col_ptr, + thrust::raw_pointer_cast(output), + thrust::raw_pointer_cast(output_ptr.data()), + perm_data); +} + +template +void GraphSampleNeighborsKernel( + const Context& dev_ctx, + const DenseTensor& row, + const DenseTensor& col_ptr, + const DenseTensor& x, + paddle::optional eids, + paddle::optional perm_buffer, + int sample_size, + bool return_eids, + bool flag_perm_buffer, + DenseTensor* out, + DenseTensor* out_count, + DenseTensor* out_eids) { + auto* row_data = row.data(); + auto* col_ptr_data = col_ptr.data(); + auto* x_data = x.data(); + int bs = x.dims()[0]; + + const thrust::device_ptr input(x_data); + + out_count->Resize({bs}); + int* out_count_data = dev_ctx.template Alloc(out_count); + thrust::device_ptr output_count(out_count_data); + + int total_sample_size = GetTotalSampleNum( + input, col_ptr_data, output_count, sample_size, bs); + + out->Resize({static_cast(total_sample_size)}); + T* out_data = dev_ctx.template Alloc(out); + thrust::device_ptr output(out_data); + + if (!flag_perm_buffer) { + SampleNeighbors(dev_ctx, + row_data, + col_ptr_data, + input, + output, + output_count, + sample_size, + bs, + total_sample_size); + } else { + DenseTensor perm_buffer_out(perm_buffer->type()); + const auto* p_perm_buffer = perm_buffer.get_ptr(); + perm_buffer_out.ShareDataWith(*p_perm_buffer); + T* perm_buffer_out_data = + perm_buffer_out.mutable_data(dev_ctx.GetPlace()); + FisherYatesSampleNeighbors(dev_ctx, + row_data, + col_ptr_data, + perm_buffer_out_data, + input, + output, + output_count, + sample_size, + bs, + total_sample_size); + } +} + +} // namespace phi + +PD_REGISTER_KERNEL(graph_sample_neighbors, + GPU, + ALL_LAYOUT, + phi::GraphSampleNeighborsKernel, + int, + int64_t) {} diff --git a/paddle/phi/kernels/gpu/graph_send_recv_grad_kernel.cu b/paddle/phi/kernels/gpu/graph_send_recv_grad_kernel.cu index 8bd3337280d75..2be0caff79d64 100644 --- a/paddle/phi/kernels/gpu/graph_send_recv_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/graph_send_recv_grad_kernel.cu @@ -102,12 +102,12 @@ void GraphSendRecvGradOpCUDAKernelLaunchHelper( template void GraphSendRecvGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, - paddle::optional out, const DenseTensor& src_index, const DenseTensor& dst_index, + paddle::optional out, paddle::optional dst_count, + const DenseTensor& out_grad, const std::string& pool_type, DenseTensor* x_grad) { auto index_type = src_index.dtype(); diff --git a/paddle/phi/kernels/gpu/index_sample_grad_kernel.cu b/paddle/phi/kernels/gpu/index_sample_grad_kernel.cu index 8b1ef964124d7..c8c025c7fc18f 100644 --- a/paddle/phi/kernels/gpu/index_sample_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/index_sample_grad_kernel.cu @@ -26,17 +26,10 @@ namespace phi { namespace { -template -void LimitGridDim(const Context& ctx, dim3* grid_dim) { - auto max_grid_dim = - reinterpret_cast(ctx).GetCUDAMaxGridDimSize(); - grid_dim->x = grid_dim->x < max_grid_dim[0] ? 
grid_dim->x : max_grid_dim[0]; - grid_dim->y = grid_dim->y < max_grid_dim[1] ? grid_dim->y : max_grid_dim[1]; -} #define PREDEFINED_BLOCK_SIZE_X 512 #define PREDEFINED_BLOCK_SIZE 1024 #define MIN(a, b) ((a) < (b) ? (a) : (b)) -}; +} // namespace template __global__ void IndexSampleGrad(const IndexT* index, @@ -67,9 +60,9 @@ __global__ void IndexSampleGrad(const IndexT* index, template void IndexSampleGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& index, + const DenseTensor& out_grad, DenseTensor* x_grad) { const T* output_grad_data = out_grad.data(); T* input_grad_data = ctx.template Alloc(x_grad); @@ -107,7 +100,7 @@ void IndexSampleGradKernel(const Context& ctx, dim3 block_dim(block_width, block_height); dim3 grid_dim((index_length + block_dim.x - 1) / block_dim.x, (batch_size + block_dim.y - 1) / block_dim.y); - LimitGridDim(ctx, &grid_dim); + paddle::platform::LimitGridDim(ctx, &grid_dim); phi::funcs::SetConstant set_zero; set_zero(ctx, x_grad, static_cast(0)); diff --git a/paddle/phi/kernels/gpu/index_sample_kernel.cu b/paddle/phi/kernels/gpu/index_sample_kernel.cu index 68573d5596646..0eca473a565a8 100644 --- a/paddle/phi/kernels/gpu/index_sample_kernel.cu +++ b/paddle/phi/kernels/gpu/index_sample_kernel.cu @@ -25,13 +25,6 @@ namespace phi { namespace { -template -void LimitGridDim(const Context& ctx, dim3* grid_dim) { - auto max_grid_dim = - reinterpret_cast(ctx).GetCUDAMaxGridDimSize(); - grid_dim->x = grid_dim->x < max_grid_dim[0] ? grid_dim->x : max_grid_dim[0]; - grid_dim->y = grid_dim->y < max_grid_dim[1] ? grid_dim->y : max_grid_dim[1]; -} #define PREDEFINED_BLOCK_SIZE_X 512 #define PREDEFINED_BLOCK_SIZE 1024 #define MIN(a, b) ((a) < (b) ? (a) : (b)) @@ -95,7 +88,7 @@ void IndexSampleKernel(const Context& ctx, dim3 block_dim(block_width, block_height); dim3 grid_dim((index_length + block_dim.x - 1) / block_dim.x, (batch_size + block_dim.y - 1) / block_dim.y); - LimitGridDim(ctx, &grid_dim); + paddle::platform::LimitGridDim(ctx, &grid_dim); if (index_type == DataType::INT64) { const int64_t* index_data = index.data(); diff --git a/paddle/phi/kernels/gpu/index_select_grad_kernel.cu b/paddle/phi/kernels/gpu/index_select_grad_kernel.cu index b3bd307e2aad6..75ae1bbcd0a08 100644 --- a/paddle/phi/kernels/gpu/index_select_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/index_select_grad_kernel.cu @@ -14,6 +14,7 @@ #include "paddle/phi/kernels/index_select_grad_kernel.h" +#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h" #include "paddle/fluid/platform/device/gpu/gpu_primitives.h" #include "paddle/phi/backends/gpu/gpu_info.h" #include "paddle/phi/core/kernel_registry.h" @@ -84,30 +85,31 @@ void IndexSelectGradKernel(const Context& ctx, phi::DataType::INT64)); int64_t numel = x_grad->numel(); + if (numel == 0) { + return; + } int64_t index_nums = index.numel(); int64_t out_nums = out_grad.numel(); auto stream = ctx.stream(); - index_select_grad_init< - T><<<(numel + PADDLE_CUDA_NUM_THREADS - 1) / PADDLE_CUDA_NUM_THREADS, - PADDLE_CUDA_NUM_THREADS, - 0, - stream>>>(in_grad_data, numel); + unsigned int block_dim = PADDLE_CUDA_NUM_THREADS; + dim3 grid_dim = dim3((numel + block_dim - 1) / block_dim); + paddle::platform::LimitGridDim(ctx, &grid_dim); - int blocks = - (out_nums + PADDLE_CUDA_NUM_THREADS - 1) / PADDLE_CUDA_NUM_THREADS; - int threads = PADDLE_CUDA_NUM_THREADS; + index_select_grad_init<<>>(in_grad_data, + numel); if (FLAGS_cudnn_deterministic) { VLOG(2) << "Run grad kernel of index_select with single thread."; - 
blocks = 1; - threads = 1; + block_dim = 1; + grid_dim.x = 1; } if (index_type == phi::DataType::INT64) { const int64_t* index_data = index.data(); - index_select_grad_cuda_kernel<<>>( + index_select_grad_cuda_kernel<<>>( output_grad_data, in_grad_data, index_data, @@ -118,7 +120,7 @@ void IndexSelectGradKernel(const Context& ctx, delta); } else { const int* index_data = index.data(); - index_select_grad_cuda_kernel<<>>( + index_select_grad_cuda_kernel<<>>( output_grad_data, in_grad_data, index_data, diff --git a/paddle/phi/kernels/gpu/index_select_kernel.cu b/paddle/phi/kernels/gpu/index_select_kernel.cu index e82976d46e68b..38a6582d790f8 100644 --- a/paddle/phi/kernels/gpu/index_select_kernel.cu +++ b/paddle/phi/kernels/gpu/index_select_kernel.cu @@ -14,6 +14,7 @@ #include "paddle/phi/kernels/index_select_kernel.h" +#include "paddle/fluid/platform/device/gpu/gpu_launch_config.h" #include "paddle/fluid/platform/device/gpu/gpu_primitives.h" #include "paddle/phi/backends/gpu/gpu_info.h" #include "paddle/phi/core/kernel_registry.h" @@ -31,16 +32,14 @@ __global__ void index_select_cuda_kernel(const T* input, int64_t stride, int64_t size, int64_t delta) { - int64_t idx = blockIdx.x * blockDim.x + threadIdx.x; - if (idx >= N) { - return; + CUDA_KERNEL_LOOP(idx, N) { + int64_t pre_idx = idx / (stride * size); + int64_t dim_idx = idx % (stride * size) / stride; + IndexT src_dim_idx = index[dim_idx]; + int64_t input_idx = + idx + (delta * pre_idx + src_dim_idx - dim_idx) * stride; + output[idx] = input[input_idx]; } - - int64_t pre_idx = idx / (stride * size); - int64_t dim_idx = idx % (stride * size) / stride; - IndexT src_dim_idx = index[dim_idx]; - int64_t input_idx = idx + (delta * pre_idx + src_dim_idx - dim_idx) * stride; - output[idx] = input[input_idx]; } template @@ -73,23 +72,22 @@ void IndexSelectKernel(const Context& ctx, T* out_data = ctx.template Alloc(output); int64_t numel = output->numel(); + if (numel == 0) { + return; + } auto stream = ctx.stream(); + unsigned int block_dim = PADDLE_CUDA_NUM_THREADS; + dim3 grid_dim = dim3((numel + block_dim - 1) / block_dim); + paddle::platform::LimitGridDim(ctx, &grid_dim); + if (index_type == phi::DataType::INT64) { const int64_t* index_data = index.data(); - index_select_cuda_kernel<<< - (numel + PADDLE_CUDA_NUM_THREADS - 1) / PADDLE_CUDA_NUM_THREADS, - PADDLE_CUDA_NUM_THREADS, - 0, - stream>>>(in_data, out_data, index_data, numel, stride, size, delta); + index_select_cuda_kernel<<>>( + in_data, out_data, index_data, numel, stride, size, delta); } else { const int* index_data = index.data(); - index_select_cuda_kernel< - T, - int><<<(numel + PADDLE_CUDA_NUM_THREADS - 1) / PADDLE_CUDA_NUM_THREADS, - PADDLE_CUDA_NUM_THREADS, - 0, - stream>>>( + index_select_cuda_kernel<<>>( in_data, out_data, index_data, numel, stride, size, delta); } } diff --git a/paddle/phi/kernels/gpu/interpolate_grad_kernel.cu b/paddle/phi/kernels/gpu/interpolate_grad_kernel.cu new file mode 100644 index 0000000000000..73334d9c38aa3 --- /dev/null +++ b/paddle/phi/kernels/gpu/interpolate_grad_kernel.cu @@ -0,0 +1,1601 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/interpolate_grad_kernel.h" + +#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" +#include "paddle/fluid/platform/fast_divmod.h" +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/backends/gpu/gpu_launch_config.h" +#include "paddle/phi/common/layout.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/funcs/interpolate_function.h" +#include "paddle/phi/kernels/funcs/math_cuda_utils.h" +#include "paddle/phi/kernels/funcs/math_function.h" + +namespace phi { + +template +__forceinline__ __device__ void PreCalculatorForLinearInterpInputIndex( + int* in_img_idx, + int* x_id, + T* lambda1, + T* lambda2, + T src_x, + const int in_img_x) { + src_x = (src_x > 0) ? src_x : 0.f; + *in_img_idx = static_cast(src_x); + *x_id = (*in_img_idx < in_img_x - 1) ? 1 : 0; + *lambda1 = src_x - *in_img_idx; + *lambda2 = 1.f - *lambda1; +} + +template +__global__ void KeLinearInterpBw(T* in, + const size_t in_img_w, + const size_t input_w, + const T* out, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const T ratio_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + bool align_flag = (align_mode == 0 && !align_corners); + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idx; + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idx = tid % out_img_w; + } else { + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + int in_img_idx = align_flag ? ratio_w * (out_img_idx + 0.5) - 0.5 + : ratio_w * out_img_idx; + in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; // w + int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; // w_id + + T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; + src_w = (src_w > 0) ? src_w : 0; + T w1lambda = + align_flag ? 
src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; + T w2lambda = 1.f - w1lambda; + + T* in_pos; + if (data_layout == DataLayout::kNCHW) { + in_pos = &in[out_id_h * input_w + channel_id * in_img_size + in_img_idx]; + } else { + in_pos = &in[out_id_h * input_w + in_img_idx * num_channels + channel_id]; + } + const T* out_pos = &out[out_id_w]; + + if (data_layout == DataLayout::kNCHW) { + paddle::platform::CudaAtomicAdd(&in_pos[0], w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd(&in_pos[w_id], w1lambda * out_pos[0]); + } else { + paddle::platform::CudaAtomicAdd(&in_pos[0], w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd(&in_pos[w_id * num_channels], + w1lambda * out_pos[0]); + } + } +} + +template +__global__ void KeNearestNeighborInterpNCHWBw(T* in, + const size_t in_img_h, + const size_t in_img_w, + const T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t nc, + const float ratio_h, + const float ratio_w, + const bool align_corners) { + int out_img_idx = threadIdx.x + blockIdx.x * blockDim.x; + int out_img_idy = threadIdx.y + blockIdx.y * blockDim.y; + int nc_id = threadIdx.z + blockIdx.z * blockDim.z; + int nc_stride = blockDim.z * gridDim.z; + + // nearest_sampling by multiple read in_addr and write to out_addr + int in_img_idx = (align_corners) + ? static_cast(ratio_w * out_img_idx + 0.5) + : static_cast(ratio_w * out_img_idx); + int in_img_idy = (align_corners) + ? static_cast(ratio_h * out_img_idy + 0.5) + : static_cast(ratio_h * out_img_idy); + + int in_index = (nc_id * in_img_h + in_img_idy) * in_img_w + in_img_idx; + int in_index_stride = nc_stride * in_img_h * in_img_w; + + int out_index = (nc_id * out_img_h + out_img_idy) * out_img_w + out_img_idx; + int out_index_stride = nc_stride * out_img_h * out_img_w; + + // prevent from multiple threads writing + if (out_img_idx < out_img_w && out_img_idy < out_img_h) { + while (nc_id < nc) { + T* in_pos = &in[in_index]; + const T out_pos = out[out_index]; + paddle::platform::CudaAtomicAdd(in_pos, out_pos); + in_index += in_index_stride; + out_index += out_index_stride; + nc_id += nc_stride; + } + } +} + +template +__global__ void KeNearestNeighborInterpBw( + T* in, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + const T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_h, + const float ratio_w, + const bool align_corners, + funcs::FastDivModForInterpolate divmods) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + int in_img_size = in_img_h * in_img_w; + int out_img_size = out_img_h * out_img_w; + + for (; tid < nthreads; tid += stride) { + auto out_id_divmod = divmods.output_w_div.Divmod(tid); + int out_id_h = out_id_divmod.val[0]; + int out_id_w = out_id_divmod.val[1]; + + int channel_id = divmods.channels_div.Divmod(tid).val[1]; + auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); + int out_img_idy = outimg_id_divmod.val[0]; + int out_img_idx = + divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; + + int in_img_idy = (align_corners) + ? static_cast(ratio_h * out_img_idy + 0.5) + : static_cast(ratio_h * out_img_idy); + int in_img_idx = (align_corners) + ? 
static_cast(ratio_w * out_img_idx + 0.5) + : static_cast(ratio_w * out_img_idx); + + T* in_pos = &in[out_id_h * input_w + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id]; + + const T out_pos = out[tid]; + paddle::platform::CudaAtomicAdd(in_pos, out_pos); + } +} + +/* Calculate the minimum of partial elements in a block */ +template +__inline__ __device__ T PartialBlockMin(T val, + size_t threads_num_in_block, + unsigned mask) { + __shared__ T shared[WARP_SIZE]; + __shared__ T shared_last_val; + __shared__ int shared_last_idx; + int lane = threadIdx.x & 0x1f; + int wid = threadIdx.x >> 5; + int threshold = (threads_num_in_block & (-WARP_SIZE)); + + if (threadIdx.x < threshold) { + shared_last_idx = (threshold >> 5) - 1; + val = phi::funcs::warpReduceMin(val, mask); + if (lane == 0) { + shared[wid] = val; + } + } else { + shared_last_val = std::numeric_limits::max(); + paddle::platform::CudaAtomicMin(&shared_last_val, val); + shared[wid] = shared_last_val; + shared_last_idx = wid; + } + __syncthreads(); + + if (threadIdx.x < threshold) { + val = (lane <= shared_last_idx) ? shared[lane] + : std::numeric_limits::max(); + val = phi::funcs::warpReduceMin(val, mask); + shared_last_val = val; + } + __syncthreads(); + if (threadIdx.x >= threshold) { + val = shared_last_val; + } + return val; +} + +template +__global__ void KeBilinearInterpBwShareMemory(T* in, + const int in_h, + const int in_w, + const T* __restrict__ out, + const int out_h, + const int out_w, + const int n, + const int num_channels, + float ratio_h, + float ratio_w, + const T align_type_value, + bool is_nchw) { + __shared__ T s_data[2][1024]; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + int in_chw = in_h * in_w * num_channels; + int out_chw = num_channels * out_h * out_w; + int nthreads = n * out_chw; + + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / out_chw; + int out_id_w = tid % out_chw; + const int in_img_size = in_h * in_w; + const int out_img_size = out_h * out_w; + T value = out[out_id_h * out_chw + out_id_w]; + + int channel_id = out_id_w / out_img_size; + int out_img_idy = (out_id_w % out_img_size) / out_w; + int out_img_idx = tid % out_w; + + int in_img_idx, in_img_idy, w_id, h_id; + T w1lambda, h1lambda, w2lambda, h2lambda; + T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; + T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; + + PreCalculatorForLinearInterpInputIndex( + &in_img_idx, &w_id, &w1lambda, &w2lambda, src_w, in_w); + PreCalculatorForLinearInterpInputIndex( + &in_img_idy, &h_id, &h1lambda, &h2lambda, src_h, in_h); + + // top_left_index is just input_index. 
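// All bilinear backward kernels in this file scatter each output gradient into
// the four surrounding input pixels, weighted by the interpolation lambdas; the
// shared-memory variant here only changes where the atomic adds land first. A
// minimal host sketch of the per-pixel scatter for one row-major channel plane
// follows; it is illustrative only and not part of this patch.
#include <vector>

void BilinearBwdScatter(std::vector<float>* din,  // one input-gradient plane, width in_w
                        int in_w,
                        int in_y, int in_x,       // top-left input pixel
                        int h_id, int w_id,       // 0 or 1: right/bottom neighbor offsets
                        float h1lambda, float w1lambda,
                        float dout) {
  float h2lambda = 1.f - h1lambda;
  float w2lambda = 1.f - w1lambda;
  std::vector<float>& g = *din;
  g[in_y * in_w + in_x]                   += h2lambda * w2lambda * dout;  // top-left
  g[in_y * in_w + in_x + w_id]            += h2lambda * w1lambda * dout;  // top-right
  g[(in_y + h_id) * in_w + in_x]          += h1lambda * w2lambda * dout;  // bottom-left
  g[(in_y + h_id) * in_w + in_x + w_id]   += h1lambda * w1lambda * dout;  // bottom-right
}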
+ int input_index = out_id_h * in_chw + channel_id * in_img_size + + in_img_idy * in_w + in_img_idx; + int top_right_index = input_index + w_id; + int bot_left_index = input_index + h_id * in_w; + int bot_right_index = input_index + h_id * in_w + w_id; + int in_top_min_index, in_bot_min_index; + + s_data[0][threadIdx.x] = 0.f; + s_data[1][threadIdx.x] = 0.f; + int remain = nthreads - (tid & (-blockDim.x)); + int in_top_max_index = + phi::funcs::blockReduceMax(top_right_index, FINAL_MASK); + int in_bot_max_index = + phi::funcs::blockReduceMax(bot_right_index, FINAL_MASK); + + if (remain > blockDim.x) { + in_top_min_index = phi::funcs::blockReduceMin(input_index, FINAL_MASK); + in_bot_min_index = phi::funcs::blockReduceMin(bot_left_index, FINAL_MASK); + } else { + in_top_min_index = PartialBlockMin(input_index, remain, FINAL_MASK); + in_bot_min_index = PartialBlockMin(bot_left_index, remain, FINAL_MASK); + } + int upper_limit_share_idx = (in_top_max_index - in_top_min_index) > + (in_bot_max_index - in_bot_min_index) + ? (in_top_max_index - in_top_min_index) + : (in_bot_max_index - in_bot_min_index); + if (h_id != 0) { + paddle::platform::CudaAtomicAdd( + &s_data[0][input_index - in_top_min_index], + h2lambda * w2lambda * value); + paddle::platform::CudaAtomicAdd( + &s_data[0][top_right_index - in_top_min_index], + h2lambda * w1lambda * value); + paddle::platform::CudaAtomicAdd( + &s_data[1][bot_left_index - in_bot_min_index], + h1lambda * w2lambda * value); + paddle::platform::CudaAtomicAdd( + &s_data[1][bot_right_index - in_bot_min_index], + h1lambda * w1lambda * value); + } else { + paddle::platform::CudaAtomicAdd( + &s_data[0][top_right_index - in_top_min_index], + (h2lambda + h1lambda) * w1lambda * value); + paddle::platform::CudaAtomicAdd( + &s_data[1][bot_left_index - in_bot_min_index], + (h1lambda + h2lambda) * w2lambda * value); + } + __syncthreads(); + + if (threadIdx.x <= upper_limit_share_idx) { + paddle::platform::CudaAtomicAdd(&in[in_top_min_index + threadIdx.x], + s_data[0][threadIdx.x]); + paddle::platform::CudaAtomicAdd(&in[in_bot_min_index + threadIdx.x], + s_data[1][threadIdx.x]); + } + } +} + +__device__ __forceinline__ int GetInputIndex(const size_t nc, + const int height, + const int width, + const int h, + const int w) { + return (nc * height + h) * width + w; +} + +template +__global__ void KeBilinearInterpNCHWBw(T* in, + const int in_h, + const int in_w, + const int out_h, + const int out_w, + const int n, + const int num_channels, + float ratio_h, + float ratio_w, + const T* __restrict__ out, + const T align_type_value) { + int index = threadIdx.x + blockDim.x * blockIdx.x; + int stride = blockDim.x * gridDim.x; + int num_out = n * num_channels * out_h * out_w; + int num_in = n * num_channels * in_h * in_w; + + for (; index < num_out; index += stride) { + int index_tmp = index; + int w2 = index_tmp % out_w; + index_tmp /= out_w; + int h2 = index_tmp % out_h; + int nc = index_tmp / out_h; + + int h1, y_id; + T h1lambda, h0lambda; + T src_y = ratio_h * (h2 + align_type_value) - align_type_value; + + PreCalculatorForLinearInterpInputIndex( + &h1, &y_id, &h1lambda, &h0lambda, src_y, in_h); + int w1, x_id; + T w1lambda, w0lambda; + T src_x = ratio_w * (w2 + align_type_value) - align_type_value; + PreCalculatorForLinearInterpInputIndex( + &w1, &x_id, &w1lambda, &w0lambda, src_x, in_w); + + T d2val = out[index]; + + paddle::platform::CudaAtomicAdd(in + GetInputIndex(nc, in_h, in_w, h1, w1), + h0lambda * w0lambda * d2val); + paddle::platform::CudaAtomicAdd( + in + 
GetInputIndex(nc, in_h, in_w, h1, w1 + x_id), + h0lambda * w1lambda * d2val); + paddle::platform::CudaAtomicAdd( + in + GetInputIndex(nc, in_h, in_w, h1 + y_id, w1), + h1lambda * w0lambda * d2val); + paddle::platform::CudaAtomicAdd( + in + GetInputIndex(nc, in_h, in_w, h1 + y_id, w1 + x_id), + h1lambda * w1lambda * d2val); + } +} + +template +__global__ void KeBilinearInterpBw(T* in, + const int in_h, + const int in_w, + const T* __restrict__ out, + const int out_h, + const int out_w, + const int n, + const int out_chw, + const int num_channels, + float ratio_h, + float ratio_w, + const T align_type_value, + funcs::FastDivModForInterpolate divmods) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + int in_chw = in_h * in_w * num_channels; + int nthreads = n * out_chw; + + for (; tid < nthreads; tid += stride) { + auto out_id_divmod = divmods.output_w_div.Divmod(tid); + int out_id_h = out_id_divmod.val[0]; + int out_id_w = out_id_divmod.val[1]; + + int channel_id = divmods.channels_div.Divmod(tid).val[1]; + auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); + int out_img_idy = outimg_id_divmod.val[0]; + int out_img_idx = + divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; + + int in_img_idx, in_img_idy, w_id, h_id; + T w1lambda, h1lambda, w2lambda, h2lambda; + T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; + T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; + + PreCalculatorForLinearInterpInputIndex( + &in_img_idx, &w_id, &w1lambda, &w2lambda, src_w, in_w); + PreCalculatorForLinearInterpInputIndex( + &in_img_idy, &h_id, &h1lambda, &h2lambda, src_h, in_h); + + T value = out[tid]; + T* in_pos = &in[out_id_h * in_chw + in_img_idy * in_w * num_channels + + in_img_idx * num_channels + channel_id]; + paddle::platform::CudaAtomicAdd(&in_pos[0], h2lambda * w2lambda * value); + paddle::platform::CudaAtomicAdd(&in_pos[w_id * num_channels], + h2lambda * w1lambda * value); + paddle::platform::CudaAtomicAdd(&in_pos[h_id * in_w * num_channels], + h1lambda * w2lambda * value); + paddle::platform::CudaAtomicAdd( + &in_pos[h_id * in_w * num_channels + w_id * num_channels], + h1lambda * w1lambda * value); + } +} + +template +__global__ void KeBicubicInterpBw(T* in, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + const T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_h, + const float ratio_w, + const bool align_corners, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idy, out_img_idx; + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idy = (out_id_w % out_img_size) / out_img_w; + out_img_idx = tid % out_img_w; + } else { + out_img_idy = out_id_w / (out_img_w * num_channels); + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + T in_img_idy = align_corners + ? 
static_cast(ratio_h * out_img_idy) + : static_cast(ratio_h * (out_img_idy + 0.5) - 0.5); + int input_y = floorf(in_img_idy); + const T y_t = in_img_idy - input_y; + + T in_img_idx = align_corners + ? static_cast(ratio_w * out_img_idx) + : static_cast(ratio_w * (out_img_idx + 0.5) - 0.5); + int input_x = floorf(in_img_idx); + + const T x_t = in_img_idx - input_x; + + T x_coeffs[4]; + T y_coeffs[4]; + + funcs::get_cubic_upsample_coefficients(x_coeffs, x_t); + funcs::get_cubic_upsample_coefficients(y_coeffs, y_t); + + const T* out_pos = &out[out_id_h * output_w + out_id_w]; + T* in_pos; + + for (int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + int access_y = max(min(static_cast(input_y - 1 + j), + static_cast(in_img_h - 1)), + 0); + int access_x = max(min(static_cast(input_x - 1 + i), + static_cast(in_img_w - 1)), + 0); + if (data_layout == DataLayout::kNCHW) { + in_pos = &in[out_id_h * input_w + channel_id * in_img_size + + access_y * in_img_w + access_x]; + } else { + in_pos = &in[out_id_h * input_w + access_y * in_img_w * num_channels + + access_x * num_channels + channel_id]; + } + paddle::platform::CudaAtomicAdd( + &in_pos[0], (out_pos[0] * y_coeffs[j] * x_coeffs[i])); + } + } + } +} + +template +__global__ void KeTrilinearInterpBw(T* in, + const size_t in_img_d, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + const T* out, + const size_t out_img_d, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const T ratio_d, + const T ratio_h, + const T ratio_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + bool align_flag = (align_mode == 0 && !align_corners); + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idt, out_img_idy, out_img_idx; + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; + out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; + out_img_idx = tid % out_img_w; + } else { + out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); + out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / + (out_img_w * num_channels); + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + int in_img_idt = align_flag + ? static_cast(ratio_d * (out_img_idt + 0.5) - 0.5) + : static_cast(ratio_d * out_img_idt); + in_img_idt = (in_img_idt > 0) ? in_img_idt : 0; + int d_id = (in_img_idt < in_img_d - 1) ? 1 : 0; + T src_d = ratio_d * (out_img_idt + 0.5) - 0.5; + src_d = (src_d > 0) ? src_d : 0; + T d1lambda = + align_flag ? src_d - in_img_idt : ratio_d * out_img_idt - in_img_idt; + T d2lambda = 1.f - d1lambda; + + int in_img_idy = align_flag + ? static_cast(ratio_h * (out_img_idy + 0.5) - 0.5) + : static_cast(ratio_h * out_img_idy); + in_img_idy = (in_img_idy > 0) ? in_img_idy : 0; + int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; + T src_h = ratio_h * (out_img_idy + 0.5) - 0.5; + src_h = (src_h > 0) ? src_h : 0; + T h1lambda = + align_flag ? 
src_h - in_img_idy : ratio_h * out_img_idy - in_img_idy; + T h2lambda = 1.f - h1lambda; + + int in_img_idx = align_flag + ? static_cast(ratio_w * (out_img_idx + 0.5) - 0.5) + : static_cast(ratio_w * out_img_idx); + in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; + int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; + T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; + src_w = (src_w > 0) ? src_w : 0; + T w1lambda = + align_flag ? src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; + T w2lambda = 1.f - w1lambda; + + if (data_layout == DataLayout::kNCHW) { + int in_pos1_idx = out_id_h * input_w + channel_id * in_img_size + + (in_img_idt * in_img_h + in_img_idy) * in_img_w + + in_img_idx; + T* in_pos1 = &in[in_pos1_idx]; + int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w; + T* in_pos2 = &in[in_pos2_idx]; + + const T* out_pos = &out[out_id_h * output_w + out_id_w]; + + // trilinear interpolation grad + paddle::platform::CudaAtomicAdd( + &in_pos1[0], d2lambda * h2lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos1[w_id], d2lambda * h2lambda * w1lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos1[h_id * in_img_w], + d2lambda * h1lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos1[h_id * in_img_w + w_id], + d2lambda * h1lambda * w1lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[0], d1lambda * h2lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[w_id], d1lambda * h2lambda * w1lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[h_id * in_img_w], + d1lambda * h1lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[h_id * in_img_w + w_id], + d1lambda * h1lambda * w1lambda * out_pos[0]); + } else { + int in_pos1_idx = out_id_h * input_w + + in_img_idt * in_img_h * in_img_w * num_channels + + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id; + T* in_pos1 = &in[in_pos1_idx]; + int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w * num_channels; + T* in_pos2 = &in[in_pos2_idx]; + + const T* out_pos = &out[out_id_h * output_w + out_id_w]; + + // trilinear interpolation grad + paddle::platform::CudaAtomicAdd( + &in_pos1[0], d2lambda * h2lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos1[w_id * num_channels], + d2lambda * h2lambda * w1lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos1[h_id * in_img_w * num_channels], + d2lambda * h1lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos1[h_id * in_img_w * num_channels + w_id * num_channels], + d2lambda * h1lambda * w1lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[0], d1lambda * h2lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[w_id * num_channels], + d1lambda * h2lambda * w1lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[h_id * in_img_w * num_channels], + d1lambda * h1lambda * w2lambda * out_pos[0]); + paddle::platform::CudaAtomicAdd( + &in_pos2[h_id * in_img_w * num_channels + w_id * num_channels], + d1lambda * h1lambda * w1lambda * out_pos[0]); + } + } +} + +template +__global__ void KeNearestNeighbor3DInterpBw(T* in, + const size_t in_img_d, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + const T* out, + const size_t out_img_d, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const 
float ratio_d, + const float ratio_h, + const float ratio_w, + const bool align_corners, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idt, out_img_idy, out_img_idx; + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; + out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; + out_img_idx = tid % out_img_w; + } else { + out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); + out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / + (out_img_w * num_channels); + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + int in_img_idt = (align_corners) + ? static_cast(ratio_d * out_img_idt + 0.5) + : static_cast(ratio_d * out_img_idt); + int in_img_idy = (align_corners) + ? static_cast(ratio_h * out_img_idy + 0.5) + : static_cast(ratio_h * out_img_idy); + int in_img_idx = (align_corners) + ? static_cast(ratio_w * out_img_idx + 0.5) + : static_cast(ratio_w * out_img_idx); + + T* in_pos; + if (data_layout == DataLayout::kNCHW) { + in_pos = &in[out_id_h * input_w + channel_id * in_img_size + + in_img_idt * in_img_h * in_img_w + in_img_idy * in_img_w + + in_img_idx]; + } else { + in_pos = &in[out_id_h * input_w + + in_img_idt * in_img_h * in_img_w * num_channels + + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id]; + } + const T out_pos = out[out_id_h * output_w + out_id_w]; + paddle::platform::CudaAtomicAdd(in_pos, out_pos); + } +} + +template +static void Interpolate1DCUDABwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout_str, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* input_grad) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_w = -1; + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + scale_w = scale_data[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } else { + if (scale.size() > 0) { + scale_w = scale[0]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } + } + if (scale_w > 0.) 
{ + out_w = static_cast(in_w * scale_w); + } + + if (out_size) { + DenseTensor sizes; + paddle::framework::TensorCopySync( + *out_size, paddle::platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_w = size_data[0]; + } + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_w = new_size[0]; + } + + auto* output_grad_data = output_grad.data(); + phi::DDim dim_grad; + if (data_layout == DataLayout::kNCHW) { + dim_grad = {n, c, in_w}; + } else { + dim_grad = {n, in_w, c}; + } + input_grad->Resize(dim_grad); + auto* input_grad_data = dev_ctx.template Alloc(input_grad); + + phi::funcs::SetConstant zero; + zero(dev_ctx, input_grad, static_cast(0.0)); + + if (in_w == out_w) { + paddle::framework::TensorCopy(output_grad, dev_ctx.GetPlace(), input_grad); + return; + } + + float ratio_w = 0.f; + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + int64_t in_cw = c * in_w; + int64_t out_cw = c * out_w; + auto pixelNum = n * out_cw; + + backends::gpu::GpuLaunchConfig config = + backends::gpu::GetGpuLaunchConfig1D(dev_ctx, pixelNum); + + if ("linear" == interp_method) { + KeLinearInterpBw<<>>(input_grad_data, + in_w, + in_cw, + output_grad_data, + out_w, + n, + out_cw, + c, + ratio_w, + align_corners, + align_mode, + data_layout); + } +} + +template +static void Interpolate2DCUDABwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout_str, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* input_grad) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_h = -1; + float scale_w = -1; + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_h = scale_data[0]; + scale_w = scale_data[1]; + } else { + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } else { + if (scale.size() > 1) { + scale_w = scale[1]; + scale_h = scale[0]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } + } + if (scale_w > 0. && scale_h > 0.) 
{ + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + + if (out_size) { + DenseTensor sizes; + paddle::framework::TensorCopySync( + *out_size, paddle::platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_h = size_data[0]; + out_w = size_data[1]; + } + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_h = new_size[0]; + out_w = new_size[1]; + } + + auto* output_grad_data = output_grad.data(); + phi::DDim dim_grad; + if (data_layout == DataLayout::kNCHW) { + dim_grad = {n, c, in_h, in_w}; + } else { + dim_grad = {n, in_h, in_w, c}; + } + input_grad->Resize(dim_grad); + auto* input_grad_data = dev_ctx.template Alloc(input_grad); + phi::funcs::SetConstant zero; + zero(dev_ctx, input_grad, static_cast(0.0)); + + if (in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(output_grad, dev_ctx.GetPlace(), input_grad); + return; + } + + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + int64_t in_hw = in_h * in_w; + int64_t out_hw = out_h * out_w; + int64_t in_chw = c * in_hw; + int64_t out_chw = c * out_hw; + auto pixelNum = n * out_chw; + + backends::gpu::GpuLaunchConfig config = + backends::gpu::GetGpuLaunchConfig1D(dev_ctx, pixelNum); + + if ("nearest" == interp_method) { + if (data_layout == DataLayout::kNCHW) { + // get launch 3D config + int nc = n * c; + backends::gpu::GpuLaunchConfig config_3d = + backends::gpu::GetGpuLaunchConfig3D(dev_ctx, nc, out_h, out_w); + KeNearestNeighborInterpNCHWBw<<>>(input_grad_data, + in_h, + in_w, + output_grad_data, + out_h, + out_w, + nc, + ratio_h, + ratio_w, + align_corners); + } else { + int64_t cw = c * out_w; + auto interp_divmods = funcs::FastDivModForInterpolate(c, out_chw, cw); + KeNearestNeighborInterpBw<<>>(input_grad_data, + in_h, + in_w, + n, + in_chw, + output_grad_data, + out_h, + out_w, + n, + out_chw, + c, + ratio_h, + ratio_w, + align_corners, + interp_divmods); + } + } else if ("bilinear" == interp_method) { + const T align_type_value = (align_mode == 0 && !align_corners) ? 0.5f : 0; + bool is_nchw = (data_layout == DataLayout::kNCHW) ? true : false; + bool optimize_flag = false; +#ifndef __HIPCC__ + optimize_flag = (in_h < (out_h >> 6) && in_w < (out_w >> 6)) + ? true + : ((in_h == 1 && in_w == 1) ? 
true : false); +#endif + + if (optimize_flag & is_nchw) { + KeBilinearInterpBwShareMemory<<>>(input_grad_data, + in_h, + in_w, + output_grad_data, + out_h, + out_w, + n, + c, + ratio_h, + ratio_w, + align_type_value, + is_nchw); + } else if (!optimize_flag & is_nchw) { + const int num_kernels = n * c * out_h * out_w; + const int num_threads = std::min(dev_ctx.GetMaxThreadsPerBlock(), 1024); + KeBilinearInterpNCHWBw< + T><<>>(input_grad_data, + in_h, + in_w, + out_h, + out_w, + n, + c, + ratio_h, + ratio_w, + output_grad_data, + align_type_value); + } else { + int64_t cw = c * out_w; + auto interp_divmods = funcs::FastDivModForInterpolate(c, out_chw, cw); + KeBilinearInterpBw<<>>(input_grad_data, + in_h, + in_w, + output_grad_data, + out_h, + out_w, + n, + out_chw, + c, + ratio_h, + ratio_w, + align_type_value, + interp_divmods); + } + } else if ("bicubic" == interp_method) { +#ifdef __HIPCC__ + constexpr int thread_per_block = 256; +#else + constexpr int thread_per_block = 512; +#endif + KeBicubicInterpBw< + T><<>>( + input_grad_data, + in_h, + in_w, + n, + in_chw, + output_grad_data, + out_h, + out_w, + n, + out_chw, + c, + ratio_h, + ratio_w, + align_corners, + data_layout); + } +} + +template +static void Interpolate3DCUDABwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* input_grad) { + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_d = -1; + float scale_h = -1; + float scale_w = -1; + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_d = scale_data[0]; + scale_h = scale_data[1]; + scale_w = scale_data[2]; + } else { + scale_d = scale_data[0]; + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } else { + if (scale.size() > 1) { + scale_d = scale[0]; + scale_h = scale[1]; + scale_w = scale[2]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } + } + if 
(scale_d > 0. && scale_h > 0. && scale_w > 0.) { + out_d = static_cast(in_d * scale_d); + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + + if (out_size) { + DenseTensor sizes; + paddle::framework::TensorCopySync( + *out_size, paddle::platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_d = size_data[0]; + out_h = size_data[1]; + out_w = size_data[2]; + } + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_d = new_size[0]; + out_h = new_size[1]; + out_w = new_size[2]; + } + + auto* output_grad_data = output_grad.data(); + phi::DDim dim_grad; + if (data_layout == DataLayout::kNCHW) { + dim_grad = {n, c, in_d, in_h, in_w}; + } else { + dim_grad = {n, in_d, in_h, in_w, c}; + } + input_grad->Resize(dim_grad); + auto* input_grad_data = dev_ctx.template Alloc(input_grad); + phi::funcs::SetConstant zero; + zero(dev_ctx, input_grad, static_cast(0.0)); + + if (in_d == out_d && in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(output_grad, dev_ctx.GetPlace(), input_grad); + return; + } + + float ratio_d = 0.f; + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_d > 1) { + float new_scale_d = 0.f; + new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) + : static_cast(in_d) / out_d; + ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) + : static_cast(new_scale_d); + } + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + int64_t in_dhw = in_d * in_h * in_w; + int64_t out_dhw = out_d * out_h * out_w; + int64_t in_cdhw = c * in_dhw; + int64_t out_cdhw = c * out_dhw; + + auto pixelNum = n * out_cdhw; + + backends::gpu::GpuLaunchConfig config = + backends::gpu::GetGpuLaunchConfig1D(dev_ctx, pixelNum); + + if ("trilinear" == interp_method) { + KeTrilinearInterpBw<<>>(input_grad_data, + in_d, + in_h, + in_w, + n, + in_cdhw, + output_grad_data, + out_d, + out_h, + out_w, + n, + out_cdhw, + c, + ratio_d, + ratio_h, + ratio_w, + align_corners, + align_mode, + data_layout); + } else if ("nearest" == interp_method) { + KeNearestNeighbor3DInterpBw<<>>(input_grad_data, + in_d, + in_h, + in_w, + n, + in_cdhw, + output_grad_data, + out_d, + out_h, + out_w, + n, + out_cdhw, + c, + ratio_d, + ratio_h, + ratio_w, + align_corners, + data_layout); + } +} + +template +void InterpolateGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& output_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + auto output_grad_dims = output_grad.dims(); + if (output_grad_dims.size() == 3) { // 1D interpolation grad + Interpolate1DCUDABwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + output_grad, + data_layout, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); + } else if (output_grad_dims.size() == 4) { // 2D interpolation grad + Interpolate2DCUDABwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + output_grad, + data_layout, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); + + } else if (output_grad_dims.size() == 5) { // 3D interpolation grad + Interpolate3DCUDABwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + output_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); + } +} + +template +void BilinearInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void NearestInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void TrilinearInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + 
paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void LinearInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +template +void BicubicInterpGradKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad) { + InterpolateGradKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + out_grad, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + x_grad); +} + +} // namespace phi + +PD_REGISTER_KERNEL(bilinear_interp_v2_grad, + GPU, + ALL_LAYOUT, + phi::BilinearInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(nearest_interp_v2_grad, + GPU, + ALL_LAYOUT, + phi::NearestInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(trilinear_interp_v2_grad, + GPU, + ALL_LAYOUT, + phi::TrilinearInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(linear_interp_v2_grad, + GPU, + ALL_LAYOUT, + phi::LinearInterpGradKernel, + float, + double) {} +PD_REGISTER_KERNEL(bicubic_interp_v2_grad, + GPU, + ALL_LAYOUT, + phi::BicubicInterpGradKernel, + float, + double) {} diff --git a/paddle/phi/kernels/gpu/interpolate_kernel.cu b/paddle/phi/kernels/gpu/interpolate_kernel.cu new file mode 100644 index 0000000000000..6e609aa11674e --- /dev/null +++ b/paddle/phi/kernels/gpu/interpolate_kernel.cu @@ -0,0 +1,1479 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
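Both the forward kernels added in this file and the backward kernels above share one output-to-input coordinate mapping: the ratio is (in - 1) / (out - 1) when align_corners is true, and 1 / scale (falling back to in / out) otherwise, while the source coordinate is ratio * (dst + 0.5) - 0.5 under the half-pixel convention (align_mode == 0 && !align_corners) or ratio * dst otherwise, clamped at zero. The host-side sketch below only illustrates that mapping; the helper names and the main driver are invented for the example and are not part of the patch.

#include <algorithm>
#include <cstdio>

// Illustrative only: mirrors the ratio / source-index arithmetic used by the
// interpolate kernels in this patch (see KeLinearInterpFw, KeBilinearInterpFw).
static float ComputeRatio(int in_size, int out_size, float scale,
                          bool align_corners) {
  if (out_size <= 1) return 0.f;
  if (align_corners) return static_cast<float>(in_size - 1) / (out_size - 1);
  return scale > 0 ? 1.f / scale : static_cast<float>(in_size) / out_size;
}

static float SrcIndex(int dst, float ratio, bool half_pixel) {
  // half_pixel corresponds to align_mode == 0 && !align_corners.
  float src = half_pixel ? ratio * (dst + 0.5f) - 0.5f : ratio * dst;
  return std::max(src, 0.f);  // the kernels clamp negative coordinates to 0
}

int main() {
  // Example: upscale width 4 -> 8 under the half-pixel convention.
  float ratio = ComputeRatio(4, 8, /*scale=*/-1.f, /*align_corners=*/false);
  for (int x = 0; x < 8; ++x) {
    std::printf("out %d -> src %.3f\n", x, SrcIndex(x, ratio, /*half_pixel=*/true));
  }
  return 0;
}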
+ +#include "paddle/phi/kernels/interpolate_kernel.h" + +#include "paddle/fluid/platform/device/gpu/gpu_device_function.h" +#include "paddle/fluid/platform/device/gpu/gpu_primitives.h" +#include "paddle/fluid/platform/fast_divmod.h" +#include "paddle/phi/backends/gpu/gpu_launch_config.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/common/layout.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/funcs/interpolate_function.h" + +namespace phi { +using paddle::platform::FastDivMod; + +template +__forceinline__ __device__ void PreCalculatorForLinearInterpInputIndex( + int* in_img_idx, + int* x_id, + T* lambda1, + T* lambda2, + T src_x, + const int in_img_x) { + src_x = (src_x > 0) ? src_x : 0.f; + *in_img_idx = static_cast(src_x); + *x_id = (*in_img_idx < in_img_x - 1) ? 1 : 0; + *lambda1 = src_x - *in_img_idx; + *lambda2 = 1.f - *lambda1; +} + +template +__global__ void KeLinearInterpFw(const T* in, + const size_t in_img_w, + const size_t input_w, + T* out, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + bool align_flag = (align_mode == 0 && !align_corners); + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idy, out_img_idx; + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idx = tid % out_img_w; + } else { + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + int in_img_idx = align_flag + ? static_cast(ratio_w * (out_img_idx + 0.5) - 0.5) + : static_cast(ratio_w * out_img_idx); + in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; // w + int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; // w_id + + T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; + src_w = (src_w > 0) ? src_w : 0; + T w1lambda = + align_flag ? src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; + T w2lambda = 1.f - w1lambda; + + if (data_layout == DataLayout::kNCHW) { + const T* in_pos = + &in[out_id_h * out_id_w + channel_id * in_img_size + in_img_idx]; + // linear interpolation + out[out_id_h * output_w + out_id_w] = + w2lambda * in_pos[0] + w1lambda * in_pos[w_id]; + + } else { + const T* in_pos = + &in[out_id_h * input_w + in_img_idx * num_channels + channel_id]; + // linear interpolation + out[out_id_h * output_w + out_id_w] = + w2lambda * in_pos[0] + w1lambda * in_pos[w_id * num_channels]; + } + } +} + +template +__global__ void KeNearestNeighborInterpNCHWFw(const T* in, + const size_t in_img_h, + const size_t in_img_w, + T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t nc, + const float ratio_h, + const float ratio_w, + const bool align_corners) { + int out_img_idx = threadIdx.x + blockIdx.x * blockDim.x; + int out_img_idy = threadIdx.y + blockIdx.y * blockDim.y; + int nc_id = threadIdx.z + blockIdx.z * blockDim.z; + int nc_stride = blockDim.z * gridDim.z; + + // nearest_sampling by multiple read in_addr and write to out_addr + int in_img_idx = (align_corners) + ? 
static_cast(ratio_w * out_img_idx + 0.5) + : static_cast(ratio_w * out_img_idx); + int in_img_idy = (align_corners) + ? static_cast(ratio_h * out_img_idy + 0.5) + : static_cast(ratio_h * out_img_idy); + + int in_index = (nc_id * in_img_h + in_img_idy) * in_img_w + in_img_idx; + int in_index_stride = nc_stride * in_img_h * in_img_w; + + int out_index = (nc_id * out_img_h + out_img_idy) * out_img_w + out_img_idx; + int out_index_stride = nc_stride * out_img_h * out_img_w; + + // prevent from multiple threads writing + if (out_img_idx < out_img_w && out_img_idy < out_img_h) { + while (nc_id < nc) { + out[out_index] = in[in_index]; + in_index += in_index_stride; + out_index += out_index_stride; + nc_id += nc_stride; + } + } +} + +template +__global__ void KeNearestNeighborInterpFw( + const T* in, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_h, + const float ratio_w, + const bool align_corners, + funcs::FastDivModForInterpolate divmods) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + int in_img_size = in_img_h * in_img_w; + int out_img_size = out_img_h * out_img_w; + + for (; tid < nthreads; tid += stride) { + auto out_id_divmod = divmods.output_w_div.Divmod(tid); + int out_id_h = out_id_divmod.val[0]; + int out_id_w = out_id_divmod.val[1]; + + int channel_id = divmods.channels_div.Divmod(tid).val[1]; + auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); + int out_img_idy = outimg_id_divmod.val[0]; + int out_img_idx = + divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; + + int in_img_idy = (align_corners) + ? static_cast(ratio_h * out_img_idy + 0.5) + : static_cast(ratio_h * out_img_idy); + int in_img_idx = (align_corners) + ? 
static_cast(ratio_w * out_img_idx + 0.5) + : static_cast(ratio_w * out_img_idx); + + out[tid] = in[out_id_h * input_w + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id]; + } +} + +template +__global__ void KeBilinearInterpFw(const T* in, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_h, + const float ratio_w, + const T align_type_value, + funcs::FastDivModForInterpolate divmods) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + + for (; tid < nthreads; tid += stride) { + auto out_id_divmod = divmods.output_w_div.Divmod(tid); + int out_id_h = out_id_divmod.val[0]; + int out_id_w = out_id_divmod.val[1]; + + int channel_id = divmods.channels_div.Divmod(tid).val[1]; + auto outimg_id_divmod = divmods.output_wc_div.Divmod(out_id_w); + int out_img_idy = outimg_id_divmod.val[0]; + int out_img_idx = + divmods.channels_div.Divmod(outimg_id_divmod.val[1]).val[0]; + + int in_img_idx, in_img_idy, h_id, w_id; + T h1lambda, w1lambda, h2lambda, w2lambda; + T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; + T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; + + PreCalculatorForLinearInterpInputIndex( + &in_img_idx, &w_id, &w1lambda, &w2lambda, src_w, in_img_w); + PreCalculatorForLinearInterpInputIndex( + &in_img_idy, &h_id, &h1lambda, &h2lambda, src_h, in_img_h); + + // bilinear interpolation + const T* in_pos = + &in[out_id_h * input_w + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id]; + out[tid] = + h2lambda * + (w2lambda * in_pos[0] + w1lambda * in_pos[w_id * num_channels]) + + h1lambda * + (w2lambda * in_pos[h_id * in_img_w * num_channels] + + w1lambda * + in_pos[h_id * in_img_w * num_channels + w_id * num_channels]); + } +} + +template +__global__ void KeBilinearInterpNCHWFw(const T* in, + const size_t in_img_h, + const size_t in_img_w, + T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t nc, + const float ratio_h, + const float ratio_w, + const T align_type_value) { + int out_img_idx = threadIdx.x + blockIdx.x * blockDim.x; + int out_img_idy = threadIdx.y + blockIdx.y * blockDim.y; + int nc_id = threadIdx.z + blockIdx.z * blockDim.z; + int nc_stride = blockDim.z * gridDim.z; + + int in_img_idx, in_img_idy, h_id, w_id; + T h1lambda, w1lambda, h2lambda, w2lambda; + T src_w = ratio_w * (out_img_idx + align_type_value) - align_type_value; + T src_h = ratio_h * (out_img_idy + align_type_value) - align_type_value; + + PreCalculatorForLinearInterpInputIndex( + &in_img_idx, &w_id, &w1lambda, &w2lambda, src_w, in_img_w); + PreCalculatorForLinearInterpInputIndex( + &in_img_idy, &h_id, &h1lambda, &h2lambda, src_h, in_img_h); + + int in_index = (nc_id * in_img_h + in_img_idy) * in_img_w + in_img_idx; + int in_index_stride = nc_stride * in_img_h * in_img_w; + + int out_index = (nc_id * out_img_h + out_img_idy) * out_img_w + out_img_idx; + int out_index_stride = nc_stride * out_img_h * out_img_w; + + // prevent from multiple threads writing + if (out_img_idx < out_img_w && out_img_idy < out_img_h) { + while (nc_id < nc) { + const T* in_pos = &in[in_index]; + out[out_index] = + h2lambda * (w2lambda * in_pos[0] + w1lambda * in_pos[w_id]) + + h1lambda * (w2lambda * in_pos[h_id * in_img_w] + + 
w1lambda * in_pos[h_id * in_img_w + w_id]); + + in_index += in_index_stride; + out_index += out_index_stride; + nc_id += nc_stride; + } + } +} + +template +__device__ __forceinline__ static T Kecubic_interp( + const T x0, const T x1, const T x2, const T x3, T t) { + T coeffs[4]; + T a = -0.75; + T x_1 = t; + T x_2 = 1.0 - t; + coeffs[0] = funcs::CubicConvolution2(x_1 + 1.0, a); + coeffs[1] = funcs::CubicConvolution1(x_1, a); + coeffs[2] = funcs::CubicConvolution1(x_2, a); + coeffs[3] = funcs::CubicConvolution2(x_2 + 1.0, a); + return x0 * coeffs[0] + x1 * coeffs[1] + x2 * coeffs[2] + x3 * coeffs[3]; +} + +template +__global__ void KeBicubicInterpFw(const T* in, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + T* out, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_h, + const float ratio_w, + const bool align_corners, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idy, out_img_idx; + + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idy = (out_id_w % out_img_size) / out_img_w; + out_img_idx = tid % out_img_w; + } else { + out_img_idy = out_id_w / (out_img_w * num_channels); + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + T in_img_idy = align_corners + ? static_cast(ratio_h * out_img_idy) + : static_cast(ratio_h * (out_img_idy + 0.5) - 0.5); + int input_y = floorf(in_img_idy); + const T y_t = in_img_idy - input_y; + + T in_img_idx = align_corners + ? 
static_cast(ratio_w * out_img_idx) + : static_cast(ratio_w * (out_img_idx + 0.5) - 0.5); + int input_x = floorf(in_img_idx); + const T x_t = in_img_idx - input_x; + + T coefficients[4]; + const T* in_pos_0; + const T* in_pos_1; + const T* in_pos_2; + const T* in_pos_3; + int access_x_0; + if (data_layout == DataLayout::kNCHW) { + for (int k = 0; k < 4; k++) { + int access_y = + max(min(input_y - 1 + k, static_cast(in_img_h - 1)), 0); + access_x_0 = max(min(input_x - 1, static_cast(in_img_w - 1)), 0); + int access_x_1 = + max(min(input_x + 0, static_cast(in_img_w - 1)), 0); + int access_x_2 = + max(min(input_x + 1, static_cast(in_img_w - 1)), 0); + int access_x_3 = + max(min(input_x + 2, static_cast(in_img_w - 1)), 0); + + in_pos_0 = &in[out_id_h * input_w + channel_id * in_img_size + + access_y * in_img_w + access_x_0]; + in_pos_1 = &in[out_id_h * input_w + channel_id * in_img_size + + access_y * in_img_w + access_x_1]; + in_pos_2 = &in[out_id_h * input_w + channel_id * in_img_size + + access_y * in_img_w + access_x_2]; + in_pos_3 = &in[out_id_h * input_w + channel_id * in_img_size + + access_y * in_img_w + access_x_3]; + + coefficients[k] = Kecubic_interp( + in_pos_0[0], in_pos_1[0], in_pos_2[0], in_pos_3[0], x_t); + } + + out[out_id_h * output_w + out_id_w] = Kecubic_interp(coefficients[0], + coefficients[1], + coefficients[2], + coefficients[3], + y_t); + + } else { + for (int k = 0; k < 4; k++) { + int access_y = + max(min(input_y - 1 + k, static_cast((in_img_h - 1))), 0); + int access_x_0 = + max(min(input_x - 1, static_cast((in_img_w - 1))), 0); + int access_x_1 = + max(min(input_x + 0, static_cast((in_img_w - 1))), 0); + int access_x_2 = + max(min(input_x + 1, static_cast((in_img_w - 1))), 0); + int access_x_3 = + max(min(input_x + 2, static_cast((in_img_w - 1))), 0); + + const T* in_pos_0 = + &in[out_id_h * input_w + access_y * in_img_w * num_channels + + access_x_0 * num_channels + channel_id]; + const T* in_pos_1 = + &in[out_id_h * input_w + access_y * in_img_w * num_channels + + access_x_1 * num_channels + channel_id]; + const T* in_pos_2 = + &in[out_id_h * input_w + access_y * in_img_w * num_channels + + access_x_2 * num_channels + channel_id]; + const T* in_pos_3 = + &in[out_id_h * input_w + access_y * in_img_w * num_channels + + access_x_3 * num_channels + channel_id]; + + coefficients[k] = Kecubic_interp( + in_pos_0[0], in_pos_1[0], in_pos_2[0], in_pos_3[0], x_t); + } + + out[out_id_h * output_w + out_id_w] = + static_cast(Kecubic_interp(coefficients[0], + coefficients[1], + coefficients[2], + coefficients[3], + y_t)); + } + } +} + +template +__global__ void KeTrilinearInterpFw(const T* in, + const size_t in_img_d, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + T* out, + const size_t out_img_d, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_d, + const float ratio_h, + const float ratio_w, + const bool align_corners, + const int align_mode, + const DataLayout data_layout) { + int nthreads = output_h * output_w; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + bool align_flag = (align_mode == 0 && !align_corners); + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idt, out_img_idy, out_img_idx; + if 
(data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; + out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; + out_img_idx = tid % out_img_w; + } else { + out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); + out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / + (out_img_w * num_channels); + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + int in_img_idt = align_flag + ? static_cast(ratio_d * (out_img_idt + 0.5) - 0.5) + : static_cast(ratio_d * out_img_idt); + in_img_idt = (in_img_idt > 0) ? in_img_idt : 0; + int d_id = (in_img_idt < in_img_d - 1) ? 1 : 0; + T src_d = ratio_d * (out_img_idt + 0.5) - 0.5; + src_d = (src_d > 0) ? src_d : 0; + T d1lambda = + align_flag ? src_d - in_img_idt : ratio_d * out_img_idt - in_img_idt; + T d2lambda = 1.f - d1lambda; + + int in_img_idy = align_flag + ? static_cast(ratio_h * (out_img_idy + 0.5) - 0.5) + : static_cast(ratio_h * out_img_idy); + in_img_idy = (in_img_idy > 0) ? in_img_idy : 0; + int h_id = (in_img_idy < in_img_h - 1) ? 1 : 0; + T src_h = ratio_h * (out_img_idy + 0.5) - 0.5; + src_h = (src_h > 0) ? src_h : 0; + T h1lambda = + align_flag ? src_h - in_img_idy : ratio_h * out_img_idy - in_img_idy; + T h2lambda = 1.f - h1lambda; + + int in_img_idx = align_flag + ? static_cast(ratio_w * (out_img_idx + 0.5) - 0.5) + : static_cast(ratio_w * out_img_idx); + in_img_idx = (in_img_idx > 0) ? in_img_idx : 0; + int w_id = (in_img_idx < in_img_w - 1) ? 1 : 0; + T src_w = ratio_w * (out_img_idx + 0.5) - 0.5; + src_w = (src_w > 0) ? src_w : 0; + T w1lambda = + align_flag ? src_w - in_img_idx : ratio_w * out_img_idx - in_img_idx; + T w2lambda = 1.f - w1lambda; + + if (data_layout == DataLayout::kNCHW) { + int in_pos1_idx = out_id_h * input_w + channel_id * in_img_size + + (in_img_idt * in_img_h + in_img_idy) * in_img_w + + in_img_idx; + const T* in_pos1 = &in[in_pos1_idx]; + int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w; + const T* in_pos2 = &in[in_pos2_idx]; + + // trilinear interpolation + out[out_id_h * output_w + out_id_w] = + d2lambda * + (h2lambda * (w2lambda * in_pos1[0] + w1lambda * in_pos1[w_id]) + + h1lambda * (w2lambda * in_pos1[h_id * in_img_w] + + w1lambda * in_pos1[h_id * in_img_w + w_id])) + + d1lambda * + (h2lambda * (w2lambda * in_pos2[0] + w1lambda * in_pos2[w_id]) + + h1lambda * (w2lambda * in_pos2[h_id * in_img_w] + + w1lambda * in_pos2[h_id * in_img_w + w_id])); + + } else { + int in_pos1_idx = out_id_h * input_w + + in_img_idt * in_img_h * in_img_w * num_channels + + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id; + const T* in_pos1 = &in[in_pos1_idx]; + int in_pos2_idx = in_pos1_idx + d_id * in_img_h * in_img_w * num_channels; + const T* in_pos2 = &in[in_pos2_idx]; + + // trilinear interpolation + out[out_id_h * output_w + out_id_w] = + d2lambda * + (h2lambda * (w2lambda * in_pos1[0] + + w1lambda * in_pos1[w_id * num_channels]) + + h1lambda * (w2lambda * in_pos1[h_id * in_img_w * num_channels] + + w1lambda * in_pos1[h_id * in_img_w * num_channels + + w_id * num_channels])) + + d1lambda * + (h2lambda * (w2lambda * in_pos2[0] + + w1lambda * in_pos2[w_id * num_channels]) + + h1lambda * (w2lambda * in_pos2[h_id * in_img_w * num_channels] + + w1lambda * in_pos2[h_id * in_img_w * num_channels + + w_id * num_channels])); + } + } +} + +template +__global__ void KeNearestNeighbor3DInterpFw(const T* in, + const 
size_t in_img_d, + const size_t in_img_h, + const size_t in_img_w, + const size_t input_h, + const size_t input_w, + T* out, + const size_t out_img_d, + const size_t out_img_h, + const size_t out_img_w, + const size_t output_h, + const size_t output_w, + const size_t num_channels, + const float ratio_d, + const float ratio_h, + const float ratio_w, + const bool align_corners, + const DataLayout data_layout) { + int nthreads = output_h * output_w; // ncdhw + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int out_id_h = tid / output_w; + int out_id_w = tid % output_w; + int in_img_size = input_w / num_channels; + int out_img_size = output_w / num_channels; + + int channel_id, out_img_idt, out_img_idy, out_img_idx; + if (data_layout == DataLayout::kNCHW) { + channel_id = out_id_w / out_img_size; + out_img_idt = (out_id_w % out_img_size) / out_img_h / out_img_w; + out_img_idy = ((out_id_w % out_img_size) / out_img_w) % out_img_h; + out_img_idx = tid % out_img_w; + } else { + out_img_idt = out_id_w / (out_img_h * out_img_w * num_channels); + out_img_idy = out_id_w % (out_img_h * out_img_w * num_channels) / + (out_img_w * num_channels); + out_img_idx = out_id_w % (out_img_w * num_channels) / num_channels; + channel_id = tid % num_channels; + } + + int in_img_idt = (align_corners) + ? static_cast(ratio_d * out_img_idt + 0.5) + : static_cast(ratio_d * out_img_idt); + + int in_img_idy = (align_corners) + ? static_cast(ratio_h * out_img_idy + 0.5) + : static_cast(ratio_h * out_img_idy); + int in_img_idx = (align_corners) + ? static_cast(ratio_w * out_img_idx + 0.5) + : static_cast(ratio_w * out_img_idx); + + if (data_layout == DataLayout::kNCHW) { + out[tid] = in[out_id_h * input_w + channel_id * in_img_size + + in_img_idt * in_img_h * in_img_w + in_img_idy * in_img_w + + in_img_idx]; + } else { + out[tid] = in[out_id_h * input_w + + in_img_idt * in_img_h * in_img_w * num_channels + + in_img_idy * in_img_w * num_channels + + in_img_idx * num_channels + channel_id]; + } + } +} + +template +static void Interpolate1DCUDAFwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + auto* input_data = input.data(); + + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_w = -1; + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_w = new_size[0]; + } else { + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + scale_w = scale_data[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } else { + if (scale.size() > 0) { + scale_w = scale[0]; + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + } + } + if (scale_w > 0.) 
{ + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + DenseTensor sizes; + paddle::framework::TensorCopySync( + *out_size, paddle::platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_w = size_data[0]; + } + } + PADDLE_ENFORCE_GT( + out_w, + 0, + errors::InvalidArgument("out_w in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {n, c, out_w}; + } else { + dim_out = {n, out_w, c}; + } + output->Resize(dim_out); + auto output_data = dev_ctx.template Alloc(output); + + if (in_w == out_w) { + paddle::framework::TensorCopy(input, dev_ctx.GetPlace(), output); + return; + } + + float ratio_w = 0.f; + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? static_cast(in_w - 1.0) / (out_w - 1.0) + : static_cast(new_scale_w); + } + + int64_t in_cw = c * in_w; + int64_t out_cw = c * out_w; + auto pixelNum = n * out_cw; + + backends::gpu::GpuLaunchConfig config = + backends::gpu::GetGpuLaunchConfig1D(dev_ctx, pixelNum); + + if ("linear" == interp_method) { + KeLinearInterpFw<<>>(input_data, + in_w, + in_cw, + output_data, + out_w, + n, + out_cw, + c, + ratio_w, + align_corners, + align_mode, + data_layout); + } +} + +template +static void Interpolate2DCUDAFwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + auto* input_data = input.data(); + + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_w = -1; + float scale_h = -1; + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_h = new_size[0]; + out_w = new_size[1]; + } else { + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_h = scale_data[0]; + scale_w = scale_data[1]; + } else { + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } else { + if (scale.size() > 1) { + scale_w = scale[1]; + scale_h = scale[0]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + } + } + if (scale_w > 0. && scale_h > 0.) 
{ + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + DenseTensor sizes; + paddle::framework::TensorCopySync( + *out_size, paddle::platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_h = size_data[0]; + out_w = size_data[1]; + } + } + PADDLE_ENFORCE_GT( + out_h, + 0, + errors::InvalidArgument("out_h in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + PADDLE_ENFORCE_GT( + out_w, + 0, + errors::InvalidArgument("out_w in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {n, c, out_h, out_w}; + } else { + dim_out = {n, out_h, out_w, c}; + } + output->Resize(dim_out); + auto output_data = dev_ctx.template Alloc(output); + + if (in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(input, dev_ctx.GetPlace(), output); + return; + } + + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + int64_t in_hw = in_h * in_w; + int64_t out_hw = out_h * out_w; + int64_t in_chw = c * in_hw; + int64_t out_chw = c * out_hw; + + auto pixelNum = n * out_chw; + + backends::gpu::GpuLaunchConfig config = + backends::gpu::GetGpuLaunchConfig1D(dev_ctx, pixelNum); + + if ("nearest" == interp_method) { + if (data_layout == DataLayout::kNCHW) { + // get launch 3D config + int nc = n * c; + backends::gpu::GpuLaunchConfig config_3d = + backends::gpu::GetGpuLaunchConfig3D(dev_ctx, nc, out_h, out_w); + KeNearestNeighborInterpNCHWFw<<>>(input_data, + in_h, + in_w, + output_data, + out_h, + out_w, + nc, + ratio_h, + ratio_w, + align_corners); + } else { + int64_t cw = c * out_w; + auto interp_divmods = funcs::FastDivModForInterpolate(c, out_chw, cw); + KeNearestNeighborInterpFw<<>>(input_data, + in_h, + in_w, + n, + in_chw, + output_data, + out_h, + out_w, + n, + out_chw, + c, + ratio_h, + ratio_w, + align_corners, + interp_divmods); + } + } else if ("bilinear" == interp_method) { + dim3 thread_num = config.thread_per_block; +#ifdef WITH_NV_JETSON + if (config.compute_capability == 53 || config.compute_capability == 62) { + thread_num = 512; + } +#endif + const T align_type_value = (align_mode == 0 && !align_corners) ? 
0.5f : 0; + if (data_layout == DataLayout::kNCHW) { + // get launch 3D config + int nc = n * c; + backends::gpu::GpuLaunchConfig config_3d = + backends::gpu::GetGpuLaunchConfig3D(dev_ctx, nc, out_h, out_w); + KeBilinearInterpNCHWFw<<>>(input_data, + in_h, + in_w, + output_data, + out_h, + out_w, + nc, + ratio_h, + ratio_w, + align_type_value); + } else { + int64_t cw = c * out_w; + auto interp_divmods = funcs::FastDivModForInterpolate(c, out_chw, cw); + KeBilinearInterpFw< + T><<>>( + input_data, + in_h, + in_w, + n, + in_chw, + output_data, + out_h, + out_w, + n, + out_chw, + c, + ratio_h, + ratio_w, + align_type_value, + interp_divmods); + } + } else if ("bicubic" == interp_method) { +#ifdef __HIPCC__ + constexpr int thread_per_block = 256; +#else + constexpr int thread_per_block = 512; +#endif + KeBicubicInterpFw< + T><<>>( + input_data, + in_h, + in_w, + n, + in_chw, + output_data, + out_h, + out_w, + n, + out_chw, + c, + ratio_h, + ratio_w, + align_corners, + data_layout); + } +} + +template +static void Interpolate3DCUDAFwd( + const Context& dev_ctx, + const DenseTensor& input, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout_str, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + auto* input_data = input.data(); + + const DataLayout data_layout = + paddle::framework::StringToDataLayout(data_layout_str); + int n, c, in_d, in_h, in_w; + funcs::ExtractNCDWH(input.dims(), data_layout, &n, &c, &in_d, &in_h, &in_w); + + float scale_w = -1; + float scale_d = -1; + float scale_h = -1; + if (size_tensor && size_tensor->size() > 0) { + // have size tensor + auto new_size = funcs::get_new_shape(size_tensor.get()); + out_d = new_size[0]; + out_h = new_size[1]; + out_w = new_size[2]; + } else { + if (scale_tensor) { + auto scale_data = + funcs::get_new_data_from_tensor(scale_tensor.get_ptr()); + if (scale_data.size() > 1) { + scale_d = scale_data[0]; + scale_h = scale_data[1]; + scale_w = scale_data[2]; + } else { + scale_d = scale_data[0]; + scale_h = scale_data[0]; + scale_w = scale_data[0]; + } + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in input 'Scale' Tensor of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } else { + if (scale.size() > 1) { + scale_d = scale[0]; + scale_h = scale[1]; + scale_w = scale[2]; + + PADDLE_ENFORCE_EQ( + scale_w > 0, + true, + errors::InvalidArgument( + "The scale_w in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_w)); + PADDLE_ENFORCE_EQ( + scale_h > 0, + true, + errors::InvalidArgument( + "The scale_h in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_h)); + PADDLE_ENFORCE_EQ( + scale_d > 0, + true, + errors::InvalidArgument( + "The scale_d in Attr(scale) of Operator(interpolate) " + "should be greater than 0, but received value is %d.", + scale_d)); + } + } 
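// Note on how the output size is resolved in this forward path: when
// SizeTensor is given (handled in the branch above), it wins outright and the
// scale logic is skipped. Otherwise the scale comes from the Scale tensor if
// present, else from the scale attribute; the block below turns it into
// out_d/out_h/out_w, and an OutSize tensor, if provided, overrides those
// values afterwards.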
+ if (scale_d > 0. && scale_h > 0. && scale_w > 0.) { + out_d = static_cast(in_d * scale_d); + out_h = static_cast(in_h * scale_h); + out_w = static_cast(in_w * scale_w); + } + if (out_size) { + DenseTensor sizes; + paddle::framework::TensorCopySync( + *out_size, paddle::platform::CPUPlace(), &sizes); + auto size_data = sizes.data(); + out_d = size_data[0]; + out_h = size_data[1]; + out_w = size_data[2]; + } + } + PADDLE_ENFORCE_GT( + out_d, + 0, + errors::InvalidArgument("out_d in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + PADDLE_ENFORCE_GT( + out_h, + 0, + errors::InvalidArgument("out_h in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + PADDLE_ENFORCE_GT( + out_w, + 0, + errors::InvalidArgument("out_w in Attr(out_shape) of Op(interpolate) " + "should be greater than 0.")); + + phi::DDim dim_out; + if (data_layout == DataLayout::kNCHW) { + dim_out = {n, c, out_d, out_h, out_w}; + } else { + dim_out = {n, out_d, out_h, out_w, c}; + } + output->Resize(dim_out); + auto output_data = dev_ctx.template Alloc(output); + + if (in_d == out_d && in_h == out_h && in_w == out_w) { + paddle::framework::TensorCopy(input, dev_ctx.GetPlace(), output); + return; + } + + float ratio_d = 0.f; + float ratio_h = 0.f; + float ratio_w = 0.f; + if (out_d > 1) { + float new_scale_d = 0.f; + new_scale_d = (scale_d > 0) ? static_cast(1. / scale_d) + : static_cast(in_d) / out_d; + ratio_d = (align_corners) ? static_cast(in_d - 1) / (out_d - 1) + : static_cast(new_scale_d); + } + if (out_h > 1) { + float new_scale_h = 0.f; + new_scale_h = (scale_h > 0) ? static_cast(1. / scale_h) + : static_cast(in_h) / out_h; + ratio_h = (align_corners) ? static_cast(in_h - 1) / (out_h - 1) + : static_cast(new_scale_h); + } + if (out_w > 1) { + float new_scale_w = 0.f; + new_scale_w = (scale_w > 0) ? static_cast(1. / scale_w) + : static_cast(in_w) / out_w; + ratio_w = (align_corners) ? 
static_cast(in_w - 1) / (out_w - 1) + : static_cast(new_scale_w); + } + + int64_t in_dhw = in_d * in_h * in_w; + int64_t out_dhw = out_d * out_h * out_w; + int64_t in_cdhw = c * in_dhw; + int64_t out_cdhw = c * out_dhw; + + auto pixelNum = n * out_cdhw; + + backends::gpu::GpuLaunchConfig config = + backends::gpu::GetGpuLaunchConfig1D(dev_ctx, pixelNum); + + if ("trilinear" == interp_method) { + KeTrilinearInterpFw<<>>(input_data, + in_d, + in_h, + in_w, + n, + in_cdhw, + output_data, + out_d, + out_h, + out_w, + n, + out_cdhw, + c, + ratio_d, + ratio_h, + ratio_w, + align_corners, + align_mode, + data_layout); + } else if ("nearest" == interp_method) { + KeNearestNeighbor3DInterpFw<<>>(input_data, + in_d, + in_h, + in_w, + n, + in_cdhw, + output_data, + out_d, + out_h, + out_w, + n, + out_cdhw, + c, + ratio_d, + ratio_h, + ratio_w, + align_corners, + data_layout); + } +} + +template +void InterpolateKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + auto input_dims = x.dims(); + if (input_dims.size() == 3) { // 1D interpolation + Interpolate1DCUDAFwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); + } else if (input_dims.size() == 4) { // 2D interpolation + Interpolate2DCUDAFwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); + } else if (input_dims.size() == 5) { // 3D interpolation + Interpolate3DCUDAFwd(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); + } +} + +template +void BilinearInterpKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void NearestInterpKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void TrilinearInterpKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* 
output) { + InterpolateKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void LinearInterpKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +template +void BicubicInterpKernel( + const Context& dev_ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output) { + InterpolateKernel(dev_ctx, + x, + out_size, + size_tensor, + scale_tensor, + data_layout, + out_d, + out_h, + out_w, + scale, + interp_method, + align_corners, + align_mode, + output); +} + +} // namespace phi + +PD_REGISTER_KERNEL(bilinear_interp_v2, + GPU, + ALL_LAYOUT, + phi::BilinearInterpKernel, + float, + double, + int) {} +PD_REGISTER_KERNEL(nearest_interp_v2, + GPU, + ALL_LAYOUT, + phi::NearestInterpKernel, + float, + double, + int, + int64_t) {} +PD_REGISTER_KERNEL(trilinear_interp_v2, + GPU, + ALL_LAYOUT, + phi::TrilinearInterpKernel, + float, + double, + int) {} +PD_REGISTER_KERNEL(linear_interp_v2, + GPU, + ALL_LAYOUT, + phi::LinearInterpKernel, + float, + double, + int) {} +PD_REGISTER_KERNEL(bicubic_interp_v2, + GPU, + ALL_LAYOUT, + phi::BicubicInterpKernel, + float, + double, + int) {} diff --git a/paddle/phi/kernels/gpu/kthvalue_grad_kernel.cu b/paddle/phi/kernels/gpu/kthvalue_grad_kernel.cu index f6e96046a2bd7..bcd370a72d91d 100644 --- a/paddle/phi/kernels/gpu/kthvalue_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/kthvalue_grad_kernel.cu @@ -34,9 +34,9 @@ static int getBlockSize(int col) { template void KthvalueGradKernel(const Context& dev_ctx, - const DenseTensor& d_out, const DenseTensor& x, const DenseTensor& indices, + const DenseTensor& d_out, int k, int axis, bool keepdim, diff --git a/paddle/phi/kernels/gpu/masked_select_grad_kernel.cu b/paddle/phi/kernels/gpu/masked_select_grad_kernel.cu index 5a4ce3a2679b9..171baab5513e4 100644 --- a/paddle/phi/kernels/gpu/masked_select_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/masked_select_grad_kernel.cu @@ -44,9 +44,9 @@ struct MaskedSelectGradFunctor { template void MaskedSelectGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& mask, + const DenseTensor& out_grad, DenseTensor* x_grad) { auto mask_size = mask.numel(); dev_ctx.template Alloc(x_grad); diff --git a/paddle/phi/kernels/gpu/multinomial_kernel.cu b/paddle/phi/kernels/gpu/multinomial_kernel.cu index ee5f843b18a90..ef6cd1323a9df 100644 --- a/paddle/phi/kernels/gpu/multinomial_kernel.cu +++ b/paddle/phi/kernels/gpu/multinomial_kernel.cu @@ -18,11 +18,6 @@ limitations under the License. 
*/ #include "paddle/phi/kernels/multinomial_kernel.h" -#include -#include -#include -#include - #ifdef __NVCC__ #include "cub/cub.cuh" #endif @@ -44,12 +39,6 @@ namespace cub = hipcub; #include "paddle/phi/kernels/funcs/multinomial_functor.h" #include "paddle/phi/kernels/top_k_kernel.h" -// See Note [ Why still include the fluid headers? ] -#include "paddle/fluid/memory/memcpy.h" -#include "paddle/fluid/platform/transform.h" - -DECLARE_bool(use_curand); - namespace phi { template @@ -74,32 +63,6 @@ __global__ void NormalizeProbability(T* norm_probs, } } -template -__global__ void GetCumulativeProbs(T* norm_probs_data, - int64_t num_distributions, - int64_t num_categories, - T* cumulative_probs_data) { - int id = blockIdx.x; - thrust::inclusive_scan(thrust::device, - norm_probs_data + id * num_categories, - norm_probs_data + (id + 1) * num_categories, - cumulative_probs_data + id * num_categories); -} - -template -struct RandomGeneratorCudaFunctor { - unsigned int seed_; - __host__ __device__ RandomGeneratorCudaFunctor(int seed) : seed_(seed) {} - - __host__ __device__ T operator()(const unsigned int n) const { - thrust::minstd_rand rng; - rng.seed(seed_); - thrust::uniform_real_distribution dist(0.0, 1.0); - rng.discard(n); - return dist(rng); - } -}; - template __device__ int binarySearchFunctor(T* cumulative_probs_data, T* norm_probs_data, @@ -130,7 +93,6 @@ __device__ int binarySearchFunctor(T* cumulative_probs_data, template __global__ void sampleMultinomialWithReplacement( - T* rng_data, const int64_t num_samples, int64_t* out_data, const int64_t num_distributions, @@ -138,10 +100,9 @@ __global__ void sampleMultinomialWithReplacement( T* cumulative_probs_data, T* norm_probs_data, uint64_t seed, - uint64_t offset, - bool use_curand) { + uint64_t offset) { // use binary search to get the selected category sample id. - // let cumulative_probs_data[id-1] < rng_data < cumulative_probs_data[id]. + // let cumulative_probs_data[id-1] < rng_number < cumulative_probs_data[id]. size_t idx = gridDim.x * blockDim.x * blockIdx.y + blockDim.x * blockIdx.x + threadIdx.x; @@ -151,10 +112,7 @@ __global__ void sampleMultinomialWithReplacement( int sample = blockIdx.x * blockDim.x + threadIdx.x; for (int dist = blockIdx.y; dist < num_distributions; dist += gridDim.y) { if (sample < num_samples) { - T rng_number = rng_data[sample + dist * num_samples]; - if (use_curand) { - rng_number = static_cast(curand_uniform4(&state).x); - } + T rng_number = static_cast(curand_uniform4(&state).x); // Find the bucket that a uniform random number lies in int selected_category = binarySearchFunctor(cumulative_probs_data + dist * num_categories, @@ -182,10 +140,7 @@ void MultinomialKernel(const Context& dev_ctx, const int64_t num_distributions = in_rank > 1 ? in_dims[in_rank - 2] : 1; // If replacement is False, it's not a replaceable sample. Every category - // can - // be used only once. So after every sample, probability of the distribution - // will change. The implementation can't be parallelizable. Thus, call CPU - // implementation ``funcs::MultinomialFunctor`` to sample the distribution. + // can be used only once. 
if (!replacement) { int64_t in_data_numel = x.numel(); int64_t out_data_numel = out->numel(); @@ -202,76 +157,50 @@ void MultinomialKernel(const Context& dev_ctx, in_data_numel * sizeof(T), cudaMemcpyDeviceToHost); #endif - if (FLAGS_use_curand) { - for (size_t i = 0; i < num_distributions; ++i) { - int zero_num = 0; - for (size_t j = 0; j < num_categories; ++j) { - T weight = cpu_in_data[i * num_distributions + j]; - PADDLE_ENFORCE_GE( - weight, - 0, - errors::InvalidArgument( - "Each element of multinomial'input must >= 0, but got %f.", - weight)); - if (weight == static_cast(0)) { - zero_num++; - } + for (size_t i = 0; i < num_distributions; ++i) { + int zero_num = 0; + for (size_t j = 0; j < num_categories; ++j) { + T weight = cpu_in_data[i * num_distributions + j]; + PADDLE_ENFORCE_GE( + weight, + 0, + errors::InvalidArgument( + "Each element of multinomial'input must >= 0, but got %f.", + weight)); + if (weight == static_cast(0)) { + zero_num++; } - int valid_samples = num_categories - zero_num; - PADDLE_ENFORCE_LE( - num_samples, - valid_samples, - errors::InvalidArgument("When replacement=False, 'num_samples' " - "must less than or eaqual to the number of " - "positive item of input")); } - - // Refer to [gumbel softmax algorithm] - DenseTensor rand = EmptyLike(dev_ctx, x); - T* rand_data = rand.data(); - funcs::uniform_distribution dist; - funcs::exponential_transform trans(1.0); - funcs::distribution_and_transform(dev_ctx, &rand, dist, trans); - - funcs::ForRange for_range(dev_ctx, x.numel()); - for_range([rand_data, in_data] __device__(size_t idx) { - rand_data[idx] = in_data[idx] / rand_data[idx]; - }); - - if (num_samples == 1) { - ArgMaxKernel( - dev_ctx, rand, -1, true, false, 3 /*proto::VarType::INT64*/, out); - } else { - std::vector out_dim_vec = vectorize(out->dims()); - DenseTensor value = Empty(dev_ctx, IntArray(out_dim_vec)); - TopkKernel( - dev_ctx, rand, Scalar(num_samples), -1, true, true, &value, out); - } - return; + int valid_samples = num_categories - zero_num; + PADDLE_ENFORCE_LE( + num_samples, + valid_samples, + errors::InvalidArgument("When replacement=False, 'num_samples' " + "must less than or eaqual to the number of " + "positive item of input")); } - funcs::MultinomialFunctor(dev_ctx, - cpu_out_data, - cpu_in_data, - num_samples, - replacement, - num_categories, - num_distributions); - -#ifdef PADDLE_WITH_HIP - hipMemcpy(out_data, - cpu_out_data, - out_data_numel * sizeof(int64_t), - hipMemcpyHostToDevice); -#else - cudaMemcpy(out_data, - cpu_out_data, - out_data_numel * sizeof(int64_t), - cudaMemcpyHostToDevice); -#endif - - delete[] cpu_in_data; - delete[] cpu_out_data; + // Refer to [gumbel softmax algorithm] + DenseTensor rand = EmptyLike(dev_ctx, x); + T* rand_data = rand.data(); + funcs::uniform_distribution dist; + funcs::exponential_transform trans(1.0); + funcs::distribution_and_transform(dev_ctx, &rand, dist, trans); + + funcs::ForRange for_range(dev_ctx, x.numel()); + for_range([rand_data, in_data] __device__(size_t idx) { + rand_data[idx] = in_data[idx] / rand_data[idx]; + }); + + if (num_samples == 1) { + ArgMaxKernel( + dev_ctx, rand, -1, true, false, 3 /*proto::VarType::INT64*/, out); + } else { + std::vector out_dim_vec = vectorize(out->dims()); + DenseTensor value = Empty(dev_ctx, IntArray(out_dim_vec)); + TopkKernel( + dev_ctx, rand, Scalar(num_samples), -1, true, true, &value, out); + } return; } @@ -322,44 +251,18 @@ void MultinomialKernel(const Context& dev_ctx, auto* cumulative_probs_data = dev_ctx.template 
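For the replacement=true path of this kernel, the weights are normalized, their inclusive scan is taken, and each draw is then an inverse-CDF lookup by binary search. A hypothetical host-side illustration of a single lookup (not part of the patch):

#include <algorithm>
#include <random>
#include <vector>

// cumulative_probs is the inclusive scan of the normalized weights, so its
// last entry is ~1.0; the sampled category is the first bucket whose
// cumulative value exceeds a uniform draw u.
int64_t DrawCategory(const std::vector<double>& cumulative_probs,
                     std::mt19937_64& rng) {
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  const double u = uniform(rng);
  auto it = std::upper_bound(cumulative_probs.begin(), cumulative_probs.end(), u);
  if (it == cumulative_probs.end()) --it;  // guard the u == max edge case
  return static_cast<int64_t>(it - cumulative_probs.begin());
}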
Alloc(&cumulative_probs_tensor); - if (FLAGS_use_curand) { - // 'phi::funcs::InclusiveScan' has higher accuracy than - // 'thrust::inclusive_scan' - funcs::InclusiveScan>( - /*in*/ norm_probs_data, - /*out*/ cumulative_probs_data, - /*outer_dim*/ static_cast(num_distributions), - /*mid_dim*/ static_cast(num_categories), - /*inner_dim*/ static_cast(1), - /*init*/ static_cast(0), - std::plus(), - /*reverse=*/false, - dev_ctx); - } else { - dim3 block_cumsum(1); - dim3 grid_cumsum(num_distributions); - GetCumulativeProbs<<>>( - norm_probs_data, - num_distributions, - num_categories, - cumulative_probs_data); - } - - // Generate random number for each sample. - std::random_device rd; - auto seed = rd(); - - DenseTensor rng_data_tensor; - rng_data_tensor.Resize({num_distributions, num_samples}); - auto* rng_data = dev_ctx.template Alloc(&rng_data_tensor); - - thrust::counting_iterator index_sequence_begin(0); - paddle::platform::Transform trans; - trans(dev_ctx, - index_sequence_begin, - index_sequence_begin + num_distributions * num_samples, - rng_data, - RandomGeneratorCudaFunctor(seed)); + // 'phi::funcs::InclusiveScan' has higher accuracy than + // 'thrust::inclusive_scan' + funcs::InclusiveScan>( + /*in*/ norm_probs_data, + /*out*/ cumulative_probs_data, + /*outer_dim*/ static_cast(num_distributions), + /*mid_dim*/ static_cast(num_categories), + /*inner_dim*/ static_cast(1), + /*init*/ static_cast(0), + std::plus(), + /*reverse=*/false, + dev_ctx); // Sample the multinomial distributions. dim3 block(128); @@ -376,7 +279,6 @@ void MultinomialKernel(const Context& dev_ctx, auto seed_offset = gen_cuda->IncrementOffset(increment); sampleMultinomialWithReplacement<<>>( - rng_data, num_samples, out_data, num_distributions, @@ -384,8 +286,7 @@ void MultinomialKernel(const Context& dev_ctx, cumulative_probs_data, norm_probs_data, seed_offset.first, - seed_offset.second, - FLAGS_use_curand); + seed_offset.second); } } // namespace phi diff --git a/paddle/phi/kernels/gpu/nll_loss_grad_kernel.cu b/paddle/phi/kernels/gpu/nll_loss_grad_kernel.cu index 9a2d9c6e479aa..43106ec1d863f 100644 --- a/paddle/phi/kernels/gpu/nll_loss_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/nll_loss_grad_kernel.cu @@ -23,8 +23,8 @@ template void NllLossGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& labels, - const DenseTensor& total_weight, paddle::optional weight, + const DenseTensor& total_weight, const DenseTensor& dout, int64_t ignore_index, const std::string& reduction, diff --git a/paddle/phi/kernels/gpu/one_hot_kernel.cu b/paddle/phi/kernels/gpu/one_hot_kernel.cu index 32c7fa1e85d15..c5884884231a8 100644 --- a/paddle/phi/kernels/gpu/one_hot_kernel.cu +++ b/paddle/phi/kernels/gpu/one_hot_kernel.cu @@ -29,7 +29,14 @@ __global__ void FillOutputKernel(const InT* p_in_data, const int64_t numel, const int depth) { int idx = blockIdx.x * blockDim.x + threadIdx.x; - if (idx < numel && p_in_data[idx] >= 0 && p_in_data[idx] < depth) { + if (idx < numel) { + PADDLE_ENFORCE(p_in_data[idx] >= 0 && p_in_data[idx] < depth, + "Illegal index value, Input(input) value should be " + "greater than or equal to 0, and less than depth [%d], " + "but received [%lld].", + depth, + p_in_data[idx]); + *(p_out_data + (idx * depth) + p_in_data[idx]) = 1.0; } } diff --git a/paddle/phi/kernels/gpu/p_norm_grad_kernel.cu b/paddle/phi/kernels/gpu/p_norm_grad_kernel.cu index 9b0e43d25a7ce..fdfed25b3dda8 100644 --- a/paddle/phi/kernels/gpu/p_norm_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/p_norm_grad_kernel.cu @@ -42,8 +42,9 
@@ struct AbsMaxAndMinGradFunctor { template struct PNormGradFunctor { - HOSTDEVICE explicit inline PNormGradFunctor(float porder) { + HOSTDEVICE explicit inline PNormGradFunctor(float porder, float eps) { this->porder = static_cast(porder - 1.); + this->eps = static_cast(eps); } template device(place) = (*x).abs().pow(this->porder) * (*x).sign() * - dy->broadcast(dim) * - (*y).pow(-this->porder).broadcast(dim); + dx->device(place) = + (*x).abs().pow(this->porder) * (*x).sign() * dy->broadcast(dim) * + (*y + y->constant(eps)).pow(-this->porder).broadcast(dim); } T porder; + T eps; }; template @@ -96,7 +98,7 @@ void PNormGradKernel(const Context& dev_ctx, dev_ctx, in_x, in_norm, in_norm_dy, out_dx, functor, dims, reduce_all); } else { - auto functor = PNormGradFunctor(porder); + auto functor = PNormGradFunctor(porder, epsilon); funcs::LaunchReduceGradKernel>( dev_ctx, in_x, in_norm, in_norm_dy, out_dx, functor, dims, reduce_all); } diff --git a/paddle/phi/kernels/gpu/prelu_grad_kernel.cu b/paddle/phi/kernels/gpu/prelu_grad_kernel.cu index d8661268e82c3..013ad1974a8fb 100644 --- a/paddle/phi/kernels/gpu/prelu_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/prelu_grad_kernel.cu @@ -102,8 +102,8 @@ void PReluGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& alpha, const DenseTensor& out_grad, - const std::string& mode, const std::string& data_format, + const std::string& mode, DenseTensor* x_grad, DenseTensor* alpha_grad) { dev_ctx.template Alloc(x_grad); diff --git a/paddle/phi/kernels/gpu/prelu_kernel.cu b/paddle/phi/kernels/gpu/prelu_kernel.cu index 8255a7ba2ed96..c4730768982bb 100644 --- a/paddle/phi/kernels/gpu/prelu_kernel.cu +++ b/paddle/phi/kernels/gpu/prelu_kernel.cu @@ -24,8 +24,8 @@ template void PReluKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& alpha, - const std::string& mode, const std::string& data_format, + const std::string& mode, DenseTensor* out) { const T* x_ptr = x.data(); T* o_ptr = dev_ctx.template Alloc(out); diff --git a/paddle/phi/kernels/gpu/randint_kernel.cu b/paddle/phi/kernels/gpu/randint_kernel.cu index 0188505002268..90eaea6a0868c 100644 --- a/paddle/phi/kernels/gpu/randint_kernel.cu +++ b/paddle/phi/kernels/gpu/randint_kernel.cu @@ -23,8 +23,6 @@ // See Note [ Why still include the fluid headers? 
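Reading the updated PNormGradFunctor above, the new eps enters only through the norm factor, which keeps the gradient finite when the norm is zero. In formula form (my summary, with y = ||x||_p and porder = p - 1):

\[
\frac{\partial L}{\partial x_i}
  = |x_i|^{\,p-1}\,\operatorname{sign}(x_i)\,
    \frac{\partial L}{\partial y}\,
    \bigl(y + \varepsilon\bigr)^{-(p-1)} .
\]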
] #include "paddle/fluid/memory/memcpy.h" -DECLARE_bool(use_curand); - namespace phi { template @@ -37,37 +35,9 @@ void RandintRawKernel(const Context& dev_ctx, DenseTensor* out) { out->Resize(phi::make_ddim(shape.GetData())); T* data = dev_ctx.template Alloc(out); - if (FLAGS_use_curand) { - funcs::uniform_distribution dist; - funcs::uniform_int_transform trans(low, high); - funcs::distribution_and_transform(dev_ctx, out, dist, trans); - } else { - DenseTensor tmp; - tmp.Resize(phi::make_ddim(shape.GetData())); - T* tmp_data = dev_ctx.template HostAlloc(&tmp); - - std::shared_ptr engine; - if (seed) { - engine = std::make_shared(); - engine->seed(seed); - } else { - engine = dev_ctx.GetHostGenerator()->GetCPUEngine(); - } - - std::uniform_int_distribution dist(low, high - 1); - auto numel = out->numel(); - for (int64_t i = 0; i < numel; ++i) { - tmp_data[i] = dist(*engine); - } - - paddle::memory::Copy( - out->place(), - data, - tmp.place(), - tmp_data, - numel * paddle::experimental::SizeOf(out->dtype()), - 0); - } + funcs::uniform_distribution dist; + funcs::uniform_int_transform trans(low, high); + funcs::distribution_and_transform(dev_ctx, out, dist, trans); } template diff --git a/paddle/phi/kernels/gpu/randperm_kernel.cu b/paddle/phi/kernels/gpu/randperm_kernel.cu index 678b580beca2f..4e488ed470df9 100644 --- a/paddle/phi/kernels/gpu/randperm_kernel.cu +++ b/paddle/phi/kernels/gpu/randperm_kernel.cu @@ -84,91 +84,65 @@ __global__ void SwapRepeatKernel( template void RandpermRawKernel( const Context& dev_ctx, int n, DataType dtype, int seed, DenseTensor* out) { - if (FLAGS_use_curand) { - DenseTensor key; - RandintKernel(dev_ctx, - std::numeric_limits::min(), - std::numeric_limits::max(), - IntArray({n}), - phi::DataType::INT32, - &key); - DenseTensor key_out = Empty(dev_ctx, IntArray({n})); - - DenseTensor range = Empty(dev_ctx, IntArray({n})); - T* range_data = range.data(); - funcs::ForRange for_range(dev_ctx, n); - for_range([range_data] __device__(size_t idx) { - range_data[idx] = static_cast(idx); - }); - - out->Resize(phi::make_ddim({n})); - T* out_data = dev_ctx.template Alloc(out); - - // Refer to [Algorithm of randperm] https://osf.io/af2hy/ to - // improve performance of radix sort. - double n_d = static_cast(n); - int begin_bit = 0; - int end_bit = - std::ceil(std::log2(n_d - (6 * n_d * n_d + 1) / (12 * std::log(0.9)))); - - size_t temp_storage_bytes = 0; - cub::DeviceRadixSort::SortPairs(nullptr, - temp_storage_bytes, - key.data(), - key_out.data(), - range.data(), - out_data, - n, - begin_bit, - end_bit < 32 ? end_bit : 32, - dev_ctx.stream()); - - auto d_temp_storage = paddle::memory::Alloc(dev_ctx, temp_storage_bytes); - cub::DeviceRadixSort::SortPairs(d_temp_storage->ptr(), - temp_storage_bytes, - key.data(), - key_out.data(), - range.data(), - out_data, - n, - begin_bit, - end_bit < 32 ? 
end_bit : 32, - dev_ctx.stream()); - - auto gen_cuda = dev_ctx.GetGenerator(); - auto seed_offset = gen_cuda->IncrementOffset(n); - uint64_t seed = seed_offset.first; - uint64_t offset = seed_offset.second; - - auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, n); - SwapRepeatKernel<<>>( - key_out.data(), out_data, n, seed, offset); - } else { - DenseTensor tmp; - tmp.Resize(phi::make_ddim({n})); - T* tmp_data = dev_ctx.template HostAlloc(&tmp); - - std::shared_ptr engine; - if (seed) { - engine = std::make_shared(); - engine->seed(seed); - } else { - engine = dev_ctx.GetHostGenerator()->GetCPUEngine(); - } - - for (int i = 0; i < n; ++i) { - tmp_data[i] = static_cast(i); - } - std::shuffle(tmp_data, tmp_data + n, *engine); - - T* out_data = dev_ctx.template Alloc(out); - auto size = out->numel() * paddle::experimental::SizeOf(out->dtype()); - paddle::memory::Copy( - out->place(), out_data, tmp.place(), tmp_data, size, 0); - } + DenseTensor key; + RandintKernel(dev_ctx, + std::numeric_limits::min(), + std::numeric_limits::max(), + IntArray({n}), + phi::DataType::INT32, + &key); + DenseTensor key_out = Empty(dev_ctx, IntArray({n})); + + DenseTensor range = Empty(dev_ctx, IntArray({n})); + T* range_data = range.data(); + funcs::ForRange for_range(dev_ctx, n); + for_range([range_data] __device__(size_t idx) { + range_data[idx] = static_cast(idx); + }); + + out->Resize(phi::make_ddim({n})); + T* out_data = dev_ctx.template Alloc(out); + + // Refer to [Algorithm of randperm] https://osf.io/af2hy/ to + // improve performance of radix sort. + double n_d = static_cast(n); + int begin_bit = 0; + int end_bit = + std::ceil(std::log2(n_d - (6 * n_d * n_d + 1) / (12 * std::log(0.9)))); + + size_t temp_storage_bytes = 0; + cub::DeviceRadixSort::SortPairs(nullptr, + temp_storage_bytes, + key.data(), + key_out.data(), + range.data(), + out_data, + n, + begin_bit, + end_bit < 32 ? end_bit : 32, + dev_ctx.stream()); + + auto d_temp_storage = paddle::memory::Alloc(dev_ctx, temp_storage_bytes); + cub::DeviceRadixSort::SortPairs(d_temp_storage->ptr(), + temp_storage_bytes, + key.data(), + key_out.data(), + range.data(), + out_data, + n, + begin_bit, + end_bit < 32 ? 
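The RandpermRawKernel rewrite around this point keeps only the GPU path: pair every index with an independent random key, radix-sort the pairs by key, and the permuted indices fall out of the sort (a follow-up SwapRepeatKernel pass then reshuffles runs of duplicate keys). A small host-side sketch of the same key-sort idea, illustrative names only:

#include <algorithm>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// Sorting indices by independent random keys yields a random permutation
// (up to ties, which the real kernel fixes up separately).
std::vector<int> RandpermByKeySort(int n, std::mt19937& rng) {
  std::uniform_int_distribution<int> dist(std::numeric_limits<int>::min(),
                                          std::numeric_limits<int>::max());
  std::vector<int> keys(n);
  for (int& k : keys) k = dist(rng);
  std::vector<int> perm(n);
  std::iota(perm.begin(), perm.end(), 0);
  std::stable_sort(perm.begin(), perm.end(),
                   [&](int a, int b) { return keys[a] < keys[b]; });
  return perm;
}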
end_bit : 32, + dev_ctx.stream()); + + auto gen_cuda = dev_ctx.GetGenerator(); + auto seed_offset = gen_cuda->IncrementOffset(n); + + auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, n); + SwapRepeatKernel<<>>( + key_out.data(), out_data, n, seed_offset.first, seed_offset.second); } template diff --git a/paddle/phi/kernels/gpu/strided_slice_grad_kernel.cu b/paddle/phi/kernels/gpu/strided_slice_grad_kernel.cu index 5f31d488533a6..90d9f1d986577 100644 --- a/paddle/phi/kernels/gpu/strided_slice_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/strided_slice_grad_kernel.cu @@ -19,10 +19,10 @@ #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/impl/strided_slice_grad_kernel_impl.h" -PD_REGISTER_KERNEL(strided_slice_grad, +PD_REGISTER_KERNEL(strided_slice_raw_grad, GPU, ALL_LAYOUT, - phi::StridedSliceGradKernel, + phi::StridedSliceRawGradKernel, bool, int, int64_t, diff --git a/paddle/phi/kernels/gpu/strided_slice_kernel.cu b/paddle/phi/kernels/gpu/strided_slice_kernel.cu index ff10718edb323..716150ff47dea 100644 --- a/paddle/phi/kernels/gpu/strided_slice_kernel.cu +++ b/paddle/phi/kernels/gpu/strided_slice_kernel.cu @@ -19,10 +19,10 @@ #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/impl/strided_slice_kernel_impl.h" -PD_REGISTER_KERNEL(strided_slice, +PD_REGISTER_KERNEL(strided_slice_raw, GPU, ALL_LAYOUT, - phi::StridedSliceKernel, + phi::StridedSliceRawKernel, bool, int, int64_t, diff --git a/paddle/phi/kernels/gpu/top_k_grad_kernel.cu b/paddle/phi/kernels/gpu/top_k_grad_kernel.cu index b0b45223489e9..32c5fc0006f4c 100644 --- a/paddle/phi/kernels/gpu/top_k_grad_kernel.cu +++ b/paddle/phi/kernels/gpu/top_k_grad_kernel.cu @@ -25,10 +25,10 @@ namespace ops = paddle::operators; template void TopkGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& indices, - int k, + const DenseTensor& out_grad, + const Scalar& k_scalar, int axis, bool largest, bool sorted, @@ -36,6 +36,8 @@ void TopkGradKernel(const Context& dev_ctx, const auto& in_dims = x.dims(); const auto& out_dims = indices.dims(); + int k = k_scalar.to(); + // get the real the axis and the k if (axis < 0) { axis += in_dims.size(); diff --git a/paddle/phi/kernels/gpu/top_k_kernel.cu b/paddle/phi/kernels/gpu/top_k_kernel.cu index adaf5cc092b4e..8262023826b32 100644 --- a/paddle/phi/kernels/gpu/top_k_kernel.cu +++ b/paddle/phi/kernels/gpu/top_k_kernel.cu @@ -98,7 +98,7 @@ void TopkKernel(const Context& dev_ctx, } #if defined(PADDLE_WITH_CUDA) && CUDA_VERSION >= 9000 - if (input_width >= 1024 && input_height == 1) { + if (input_width >= 1024 && in_dims.size() == 1) { // 1. 
Gather TopK, but without sorting constexpr int max_num_threads = 1024; if (largest) { diff --git a/paddle/phi/kernels/gpu/truncated_gaussian_random_kernel.cu b/paddle/phi/kernels/gpu/truncated_gaussian_random_kernel.cu index f27b32ca7b831..5b6ae9d09bff2 100644 --- a/paddle/phi/kernels/gpu/truncated_gaussian_random_kernel.cu +++ b/paddle/phi/kernels/gpu/truncated_gaussian_random_kernel.cu @@ -24,8 +24,6 @@ #include "paddle/phi/core/dense_tensor.h" #include "paddle/phi/core/kernel_registry.h" -#include "paddle/fluid/framework/generator.h" - namespace phi { template @@ -106,8 +104,7 @@ void TruncatedGaussianRandomKernel(const Context& dev_ctx, thrust::counting_iterator index_sequence_begin(0); int64_t size = tensor->numel(); - int device_id = dev_ctx.GetPlace().GetDeviceId(); - auto gen_cuda = paddle::framework::GetDefaultCUDAGenerator(device_id); + auto gen_cuda = dev_ctx.GetGenerator(); if (gen_cuda->GetIsInitPy() && seed_flag) { auto seed_offset = gen_cuda->IncrementOffset(1); diff --git a/paddle/phi/kernels/gpu/uniform_random_kernel.cu b/paddle/phi/kernels/gpu/uniform_random_kernel.cu index 2cabde0bbf942..a4aea10cfe762 100644 --- a/paddle/phi/kernels/gpu/uniform_random_kernel.cu +++ b/paddle/phi/kernels/gpu/uniform_random_kernel.cu @@ -14,14 +14,13 @@ #include "paddle/phi/kernels/uniform_random_kernel.h" +#include #include "gflags/gflags.h" #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/kernels/funcs/distribution_helper.h" #include "paddle/phi/kernels/funcs/index_impl.cu.h" -DECLARE_bool(use_curand); - namespace phi { template @@ -54,43 +53,6 @@ struct UniformGenerator { } }; -template -struct UniformGeneratorOffset { - T min_, max_; - unsigned int seed_; - T diag_val_; - unsigned int diag_num_; - unsigned int diag_step_; - int offset_; - __host__ __device__ UniformGeneratorOffset(T min, - T max, - int seed, - int diag_num, - int diag_step, - T diag_val, - int offset) - : min_(min), - max_(max), - seed_(seed), - diag_num_(diag_num), - diag_step_(diag_step), - diag_val_(diag_val), - offset_(offset) {} - - __host__ __device__ T operator()(const unsigned int n) const { - thrust::minstd_rand rng; - rng.seed(seed_); - thrust::uniform_real_distribution dist(min_, max_); - rng.discard(n + offset_); - T out = dist(rng); - unsigned int remainder = n % (diag_step_ + 1); - if (remainder == 0 && diag_num_ > n / (diag_step_ + 1)) { - out = diag_val_; - } - return out; - } -}; - template void UniformRandomRawKernel(const Context& dev_ctx, const IntArray& shape, @@ -114,23 +76,10 @@ void UniformRandomRawKernel(const Context& dev_ctx, auto generator = dev_ctx.GetGenerator(); if (generator->GetIsInitPy() && seed_flag) { - if (FLAGS_use_curand) { - using MT = typename kps::details::MPTypeTrait::Type; - funcs::uniform_distribution dist; - funcs::uniform_real_transform trans(min, max); - funcs::distribution_and_transform(dev_ctx, out, dist, trans); - } else { - auto seed_offset = generator->IncrementOffset(1); - int64_t gen_offset = size * seed_offset.second; - auto func = UniformGeneratorOffset(min, - max, - seed_offset.first, - diag_num, - diag_step, - diag_val, - gen_offset); - IndexKernel>(dev_ctx, out, func); - } + using MT = typename kps::details::MPTypeTrait::Type; + funcs::uniform_distribution dist; + funcs::uniform_real_transform trans(min, max); + funcs::distribution_and_transform(dev_ctx, out, dist, trans); } else { auto func = UniformGenerator(min, max, seed, diag_num, diag_step, diag_val); diff --git a/paddle/phi/kernels/gpudnn/conv_grad_kernel.cu 
b/paddle/phi/kernels/gpudnn/conv_grad_kernel.cu index a99a1e5f9471e..e09c33380b307 100644 --- a/paddle/phi/kernels/gpudnn/conv_grad_kernel.cu +++ b/paddle/phi/kernels/gpudnn/conv_grad_kernel.cu @@ -43,9 +43,9 @@ namespace phi { template void ConvCudnnGradKernel(const Context& ctx, - const DenseTensor& output_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& output_grad, const std::vector& strides_t, const std::vector& paddings_t, const std::string& padding_algorithm, @@ -595,9 +595,9 @@ void ConvCudnnGradKernel(const Context& ctx, template void Conv3DCudnnGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& input, const DenseTensor& filter, + const DenseTensor& out_grad, const std::vector& strides, const std::vector& paddings, const std::string& paddding_algorithm, @@ -610,9 +610,9 @@ void Conv3DCudnnGradKernel(const Context& dev_ctx, DenseTensor* input_grad, DenseTensor* filter_grad) { ConvCudnnGradKernel(dev_ctx, - out_grad, input, filter, + out_grad, strides, paddings, paddding_algorithm, diff --git a/paddle/phi/kernels/graph_reindex_kernel.h b/paddle/phi/kernels/graph_reindex_kernel.h new file mode 100644 index 0000000000000..68f1ebc6f5cc4 --- /dev/null +++ b/paddle/phi/kernels/graph_reindex_kernel.h @@ -0,0 +1,33 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/phi/core/dense_tensor.h" + +namespace phi { + +template +void GraphReindexKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& neighbors, + const DenseTensor& count, + paddle::optional hashtable_value, + paddle::optional hashtable_index, + bool flag_buffer_hashtable, + DenseTensor* reindex_src, + DenseTensor* reindex_dst, + DenseTensor* out_nodes); + +} // namespace phi diff --git a/paddle/phi/kernels/graph_sample_neighbors_kernel.h b/paddle/phi/kernels/graph_sample_neighbors_kernel.h new file mode 100644 index 0000000000000..f7d205bd08ad0 --- /dev/null +++ b/paddle/phi/kernels/graph_sample_neighbors_kernel.h @@ -0,0 +1,36 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#pragma once + +#include "paddle/phi/core/dense_tensor.h" + +namespace phi { + +template +void GraphSampleNeighborsKernel( + const Context& dev_ctx, + const DenseTensor& row, + const DenseTensor& col_ptr, + const DenseTensor& x, + paddle::optional eids, + paddle::optional perm_buffer, + int sample_size, + bool return_eids, + bool flag_perm_buffer, + DenseTensor* out, + DenseTensor* out_count, + DenseTensor* out_eids); + +} // namespace phi diff --git a/paddle/phi/kernels/graph_send_recv_grad_kernel.h b/paddle/phi/kernels/graph_send_recv_grad_kernel.h index 3694c8f1e6c99..c0b1a34d09c00 100644 --- a/paddle/phi/kernels/graph_send_recv_grad_kernel.h +++ b/paddle/phi/kernels/graph_send_recv_grad_kernel.h @@ -22,12 +22,12 @@ namespace phi { template void GraphSendRecvGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, - paddle::optional out, const DenseTensor& src_index, const DenseTensor& dst_index, + paddle::optional out, paddle::optional dst_count, + const DenseTensor& out_grad, const std::string& pool_type, DenseTensor* x_grad); } // namespace phi diff --git a/paddle/phi/kernels/hierarchical_sigmoid_grad_kernel.h b/paddle/phi/kernels/hierarchical_sigmoid_grad_kernel.h index f7a327cd3f566..7922a767db23c 100644 --- a/paddle/phi/kernels/hierarchical_sigmoid_grad_kernel.h +++ b/paddle/phi/kernels/hierarchical_sigmoid_grad_kernel.h @@ -23,11 +23,11 @@ void HierarchicalSigmoidGradKernel(const Context& ctx, const DenseTensor& x, const DenseTensor& w, const DenseTensor& label, - const DenseTensor& pre_out, - const DenseTensor& out_grad, paddle::optional path, paddle::optional code, paddle::optional bias, + const DenseTensor& pre_out, + const DenseTensor& out_grad, int num_classes, bool remote_prefetch, int trainer_id, diff --git a/paddle/phi/kernels/histogram_kernel.h b/paddle/phi/kernels/histogram_kernel.h index b6b4593361dad..0020f7b0435da 100644 --- a/paddle/phi/kernels/histogram_kernel.h +++ b/paddle/phi/kernels/histogram_kernel.h @@ -18,11 +18,11 @@ namespace phi { template -void HistogramSelectKernel(const Context& dev_ctx, - const DenseTensor& input, - int64_t bins, - int min, - int max, - DenseTensor* out); +void HistogramKernel(const Context& dev_ctx, + const DenseTensor& input, + int64_t bins, + int min, + int max, + DenseTensor* output); } // namespace phi diff --git a/paddle/phi/kernels/impl/activation_grad_impl.h b/paddle/phi/kernels/impl/activation_grad_impl.h index 37273b7944ede..bf9b7cdf559d3 100644 --- a/paddle/phi/kernels/impl/activation_grad_impl.h +++ b/paddle/phi/kernels/impl/activation_grad_impl.h @@ -243,8 +243,8 @@ void LogitGradKernel(const Context& dev_ctx, template void SigmoidDoubleGradKernel(const Context& dev_ctx, const DenseTensor& out, - const DenseTensor& ddx, const DenseTensor& dout, + const DenseTensor& ddx, DenseTensor* dout_new, DenseTensor* ddout) { if (dout_new) { @@ -262,10 +262,10 @@ void SigmoidDoubleGradKernel(const Context& dev_ctx, template void SigmoidTripleGradKernel(const Context& dev_ctx, const DenseTensor& out, - const DenseTensor& ddx, const DenseTensor& dout, - const DenseTensor& d_ddout, + const DenseTensor& ddx, const DenseTensor& d_dout_new, + const DenseTensor& d_ddout, DenseTensor* d_out_new, DenseTensor* d_dout, DenseTensor* d_ddx) { diff --git a/paddle/phi/kernels/impl/clip_grad_kernel_impl.h b/paddle/phi/kernels/impl/clip_grad_kernel_impl.h new file mode 100644 index 0000000000000..7ce86492327ba --- /dev/null +++ b/paddle/phi/kernels/impl/clip_grad_kernel_impl.h @@ -0,0 +1,74 @@ +// Copyright (c) 2022 
PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/phi/kernels/clip_kernel.h" + +#include "paddle/phi/backends/all_context.h" +#include "paddle/phi/core/kernel_registry.h" + +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/platform/transform.h" +#if defined(__NVCC__) || defined(__HIPCC__) +#include "paddle/phi/kernels/funcs/broadcast_function.h" +#endif + +namespace phi { + +template +class ClipGradFunctor { + public: + explicit ClipGradFunctor(const T min, const T max) : min_(min), max_(max) {} + HOSTDEVICE T operator()(const T x, const T y) const { + return (y > min_ && y < max_) ? x : static_cast(0); + } + + private: + T min_; + T max_; +}; + +template +void ClipGradKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& out_grad, + const Scalar& min, + const Scalar& max, + DenseTensor* x_grad) { + auto max_ = max.to(); + auto min_ = min.to(); + +#if defined(__NVCC__) || defined(__HIPCC__) + std::vector ins = {&out_grad, &x}; + std::vector outs = {x_grad}; + auto functor = ClipGradFunctor(min_, max_); + dev_ctx.template Alloc(x_grad); + phi::funcs::ElementwiseKernel(dev_ctx, ins, &outs, functor); +#else + int64_t numel = out_grad.numel(); + auto* d_x_data = dev_ctx.template Alloc(x_grad); + const T* d_out_data = out_grad.data(); + const T* x_data = x.data(); + paddle::platform::Transform trans; + trans(dev_ctx, + d_out_data, + d_out_data + numel, + x_data, + d_x_data, + ClipGradFunctor(min_, max_)); +#endif +} + +} // namespace phi diff --git a/paddle/phi/kernels/impl/clip_kernel_impl.h b/paddle/phi/kernels/impl/clip_kernel_impl.h new file mode 100644 index 0000000000000..17c04c31a598a --- /dev/null +++ b/paddle/phi/kernels/impl/clip_kernel_impl.h @@ -0,0 +1,79 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/phi/kernels/clip_kernel.h" + +#include "paddle/phi/backends/all_context.h" +#include "paddle/phi/core/kernel_registry.h" + +#include "paddle/fluid/operators/math/selected_rows_functor.h" +#include "paddle/fluid/platform/transform.h" +#if defined(__NVCC__) || defined(__HIPCC__) +#include "paddle/phi/kernels/funcs/broadcast_function.h" +#endif + +namespace phi { + +template +class ClipFunctor { + public: + explicit ClipFunctor(const T min, const T max) : min_(min), max_(max) {} + HOSTDEVICE T operator()(const T x) const { + return x < min_ ? 
min_ : x > max_ ? max_ : x; + } + + private: + T min_; + T max_; +}; + +template +void ClipKernel(const Context& dev_ctx, + const DenseTensor& x, + const Scalar& min, + const Scalar& max, + DenseTensor* out) { + auto max_ = max.to(); + auto min_ = min.to(); + + PADDLE_ENFORCE_LE( + min_, + max_, + errors::InvalidArgument("max should be greater than or equal to min. " + "But received min = %f, max = %f", + static_cast(min_), + static_cast(max_))); + + T* out_data = dev_ctx.template Alloc(out); + // const T* x_data = x->data(); + // int64_t numel = x->numel(); + const T* x_data = x.data(); + int64_t numel = x.numel(); + if (paddle::platform::is_gpu_place(dev_ctx.GetPlace())) { +#if defined(__NVCC__) || defined(__HIPCC__) + std::vector ins = {&x}; + std::vector outs = {out}; + auto functor = ClipFunctor(min_, max_); + phi::funcs::ElementwiseKernel(dev_ctx, ins, &outs, functor); +#endif + } else { + paddle::platform::Transform trans; + trans( + dev_ctx, x_data, x_data + numel, out_data, ClipFunctor(min_, max_)); + } +} + +} // namespace phi diff --git a/paddle/phi/kernels/impl/concat_grad_kernel_impl.h b/paddle/phi/kernels/impl/concat_grad_kernel_impl.h new file mode 100644 index 0000000000000..e89920340ff18 --- /dev/null +++ b/paddle/phi/kernels/impl/concat_grad_kernel_impl.h @@ -0,0 +1,69 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. +#pragma once + +#include "paddle/phi/kernels/concat_grad_kernel.h" + +#include "paddle/fluid/operators/strided_memcpy.h" +#include "paddle/phi/kernels/funcs/concat_and_split_functor.h" +#include "paddle/phi/kernels/funcs/concat_funcs.h" + +namespace phi { + +template +void ConcatGradKernel(const Context& dev_ctx, + const std::vector& x, + const DenseTensor& out_grad, + const Scalar& axis_scalar, + std::vector x_grad) { + auto outs = x_grad; + { + auto dx = x_grad; + for (size_t i = 0; i < dx.size(); ++i) { + if (dx[i] != nullptr) { + dx[i]->set_lod(x[i]->lod()); + } + } + } + PADDLE_ENFORCE_NOT_NULL( + x[0], phi::errors::NotFound("The first input tensor is not initalized.")); + + auto axis = axis_scalar.to(); + axis = funcs::ComputeAxis(static_cast(axis), + static_cast(x[0]->dims().size())); + // get output tensor that the name is not kEmptyVarName + std::vector outputs; + for (size_t j = 0; j < outs.size(); ++j) { + if (outs[j] && outs[j]->numel() != 0UL) { + dev_ctx.template Alloc(outs[j]); + + outputs.push_back(outs[j]); + } else { + outputs.push_back(nullptr); + } + } + + // Sometimes direct copies will be faster, this maybe need deeply analysis. 
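Whichever copy path ConcatGradKernel ends up using (direct strided copies or the split functor), the math is the same: the gradient of concat is out_grad split back into pieces with the original input extents along the concat axis. A minimal sketch for 1-D inputs concatenated on axis 0, with a hypothetical helper name that is not part of the patch:

#include <vector>

// input_sizes are the lengths of the original inputs; the i-th gradient is
// simply the i-th contiguous slice of out_grad.
std::vector<std::vector<float>> ConcatGradAxis0(
    const std::vector<float>& out_grad,
    const std::vector<size_t>& input_sizes) {
  std::vector<std::vector<float>> grads;
  size_t offset = 0;
  for (size_t sz : input_sizes) {
    grads.emplace_back(out_grad.begin() + offset,
                       out_grad.begin() + offset + sz);
    offset += sz;
  }
  return grads;
}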
+ if (axis == 0 && outs.size() < 10) { + std::vector ref_shape; + ref_shape.insert(ref_shape.begin(), x.begin(), x.end()); + paddle::operators::StridedMemcpyWithAxis0( + dev_ctx, out_grad, ref_shape, &outputs); + } else { + phi::funcs::SplitFunctor split_functor; + split_functor(dev_ctx, out_grad, x, static_cast(axis), &outputs); + } +} + +} // namespace phi diff --git a/paddle/phi/kernels/impl/conv_grad_kernel_impl.h b/paddle/phi/kernels/impl/conv_grad_kernel_impl.h index 2deebb996a057..6674500c3c2e5 100644 --- a/paddle/phi/kernels/impl/conv_grad_kernel_impl.h +++ b/paddle/phi/kernels/impl/conv_grad_kernel_impl.h @@ -26,9 +26,9 @@ namespace phi { template void ConvGradKernel(const Context& dev_ctx, - const DenseTensor& output_grad, const DenseTensor& input, const DenseTensor& filter_t, + const DenseTensor& output_grad, const std::vector& strides, const std::vector& paddings_t, const std::string& padding_algorithm, diff --git a/paddle/phi/kernels/impl/lgamma_grad_kernel_impl.h b/paddle/phi/kernels/impl/lgamma_grad_kernel_impl.h index 8fb1f1c4fa361..9ef6c61fd60fb 100644 --- a/paddle/phi/kernels/impl/lgamma_grad_kernel_impl.h +++ b/paddle/phi/kernels/impl/lgamma_grad_kernel_impl.h @@ -33,8 +33,8 @@ struct LgammaGradFunctor { }; template void LgammaGradKernel(const Context& dev_ctx, - const DenseTensor& d_out, const DenseTensor& x, + const DenseTensor& d_out, DenseTensor* d_x) { auto numel = d_out.numel(); auto* dout_data = d_out.data(); diff --git a/paddle/phi/kernels/impl/logsumexp_grad_kernel_impl.h b/paddle/phi/kernels/impl/logsumexp_grad_kernel_impl.h index c2583ce8d32df..23e4414858a78 100644 --- a/paddle/phi/kernels/impl/logsumexp_grad_kernel_impl.h +++ b/paddle/phi/kernels/impl/logsumexp_grad_kernel_impl.h @@ -46,7 +46,7 @@ void LogsumexpGradKernel(const Context& dev_ctx, const DenseTensor& in, const DenseTensor& out, const DenseTensor& out_grad, - const std::vector& axis, + const std::vector& axis, bool keepdim, bool reduce_all, DenseTensor* in_grad) { @@ -67,22 +67,27 @@ void LogsumexpGradKernel(const Context& dev_ctx, } else { int rank = in.dims().size(); LogsumexpGradFunctor functor; + std::vector axis32; + axis32.reserve(axis.size()); + std::for_each(axis.begin(), axis.end(), [&axis32](const int64_t& t) { + axis32.push_back(t); + }); switch (rank) { case 1: phi::funcs::ReduceGradFunctor( - dev_ctx, in, out, out_grad, in_grad, functor, axis); + dev_ctx, in, out, out_grad, in_grad, functor, axis32); break; case 2: phi::funcs::ReduceGradFunctor( - dev_ctx, in, out, out_grad, in_grad, functor, axis); + dev_ctx, in, out, out_grad, in_grad, functor, axis32); break; case 3: phi::funcs::ReduceGradFunctor( - dev_ctx, in, out, out_grad, in_grad, functor, axis); + dev_ctx, in, out, out_grad, in_grad, functor, axis32); break; case 4: phi::funcs::ReduceGradFunctor( - dev_ctx, in, out, out_grad, in_grad, functor, axis); + dev_ctx, in, out, out_grad, in_grad, functor, axis32); break; } } diff --git a/paddle/phi/kernels/impl/strided_slice_grad_kernel_impl.h b/paddle/phi/kernels/impl/strided_slice_grad_kernel_impl.h index f0fddce6b5547..95780682c98dd 100644 --- a/paddle/phi/kernels/impl/strided_slice_grad_kernel_impl.h +++ b/paddle/phi/kernels/impl/strided_slice_grad_kernel_impl.h @@ -20,16 +20,16 @@ namespace phi { template -void StridedSliceGradKernel(const Context& dev_ctx, - const DenseTensor& x, - const DenseTensor& out_grad, - const std::vector& axes, - const IntArray& starts, - const IntArray& ends, - const IntArray& strides, - const std::vector& infer_flags, - const std::vector& 
decrease_axis, - DenseTensor* x_grad) { +void StridedSliceRawGradKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& out_grad, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + const std::vector& infer_flags, + const std::vector& decrease_axis, + DenseTensor* x_grad) { int rank = x.dims().size(); #define SLICE_CASE(Rank) \ case Rank: \ diff --git a/paddle/phi/kernels/impl/strided_slice_kernel_impl.h b/paddle/phi/kernels/impl/strided_slice_kernel_impl.h index 2df937524ef20..81e6d5056267a 100644 --- a/paddle/phi/kernels/impl/strided_slice_kernel_impl.h +++ b/paddle/phi/kernels/impl/strided_slice_kernel_impl.h @@ -20,15 +20,15 @@ namespace phi { template -void StridedSliceKernel(const Context& dev_ctx, - const DenseTensor& x, - const std::vector& axes, - const IntArray& starts, - const IntArray& ends, - const IntArray& strides, - const std::vector& infer_flags, - const std::vector& decrease_axis, - DenseTensor* out) { +void StridedSliceRawKernel(const Context& dev_ctx, + const DenseTensor& x, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + const std::vector& infer_flags, + const std::vector& decrease_axis, + DenseTensor* out) { int rank = x.dims().size(); #define SLICE_CASE(Rank) \ case Rank: \ diff --git a/paddle/phi/kernels/impl/trace_grad_kernel_impl.h b/paddle/phi/kernels/impl/trace_grad_kernel_impl.h index b0878d779462a..90a2327ef3e20 100644 --- a/paddle/phi/kernels/impl/trace_grad_kernel_impl.h +++ b/paddle/phi/kernels/impl/trace_grad_kernel_impl.h @@ -82,8 +82,8 @@ struct TraceGradFunctor { template void TraceGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, + const DenseTensor& out_grad, int offset, int axis1, int axis2, diff --git a/paddle/phi/kernels/impl/warpctc_kernel_impl.h b/paddle/phi/kernels/impl/warpctc_kernel_impl.h index 8a18f2500a512..ef6be7a9dfa88 100644 --- a/paddle/phi/kernels/impl/warpctc_kernel_impl.h +++ b/paddle/phi/kernels/impl/warpctc_kernel_impl.h @@ -203,7 +203,7 @@ class WarpCTCFunctor { void init(const Context& dev_ctx, const size_t blank) { warpctc_version_ = phi::dynload::get_warpctc_version(); - if (dev_ctx.GetPlace() == phi::GPUPlace()) { + if (paddle::platform::is_gpu_place(dev_ctx.GetPlace())) { #if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) options_.loc = CTC_GPU; options_.stream = diff --git a/paddle/phi/kernels/index_sample_grad_kernel.h b/paddle/phi/kernels/index_sample_grad_kernel.h index 5c6e101f1b43d..2b66076ee0a2b 100644 --- a/paddle/phi/kernels/index_sample_grad_kernel.h +++ b/paddle/phi/kernels/index_sample_grad_kernel.h @@ -20,9 +20,9 @@ namespace phi { template void IndexSampleGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& index, + const DenseTensor& out_grad, DenseTensor* in_grad); } // namespace phi diff --git a/paddle/phi/kernels/interpolate_grad_kernel.h b/paddle/phi/kernels/interpolate_grad_kernel.h new file mode 100644 index 0000000000000..59d2dddd87007 --- /dev/null +++ b/paddle/phi/kernels/interpolate_grad_kernel.h @@ -0,0 +1,39 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/phi/core/dense_tensor.h" + +namespace phi { + +template +void BilinearInterpGradKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const DenseTensor& out_grad, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* x_grad); + +} // namespace phi diff --git a/paddle/phi/kernels/interpolate_kernel.h b/paddle/phi/kernels/interpolate_kernel.h new file mode 100644 index 0000000000000..4623657f5a594 --- /dev/null +++ b/paddle/phi/kernels/interpolate_kernel.h @@ -0,0 +1,110 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
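To make these interpolation kernel declarations easier to follow, here is the per-pixel math they share in the bilinear case; align_corners selects between the two standard source-coordinate mappings. This is a standalone sketch under those assumptions, not code from the patch:

#include <algorithm>
#include <cmath>
#include <vector>

// Sample one output pixel of a bilinear resize from an h x w single-channel
// image stored row-major in img.
float BilinearSample(const std::vector<float>& img, int h, int w,
                     float y_dst, float x_dst, int out_h, int out_w,
                     bool align_corners) {
  auto src_coord = [&](float dst, int in_size, int out_size) {
    return align_corners
               ? dst * (in_size - 1) / std::max(out_size - 1, 1)
               : (dst + 0.5f) * in_size / out_size - 0.5f;
  };
  float y = std::max(src_coord(y_dst, h, out_h), 0.0f);
  float x = std::max(src_coord(x_dst, w, out_w), 0.0f);
  int y0 = static_cast<int>(y), x0 = static_cast<int>(x);
  int y1 = std::min(y0 + 1, h - 1), x1 = std::min(x0 + 1, w - 1);
  float ly = y - y0, lx = x - x0;
  auto at = [&](int yy, int xx) { return img[yy * w + xx]; };
  return (1 - ly) * ((1 - lx) * at(y0, x0) + lx * at(y0, x1)) +
         ly * ((1 - lx) * at(y1, x0) + lx * at(y1, x1));
}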
+ +#pragma once + +#include "paddle/phi/core/dense_tensor.h" + +namespace phi { + +template +void BilinearInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output); + +template +void NearestInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output); + +template +void TrilinearInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output); + +template +void LinearInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output); + +template +void BicubicInterpKernel( + const Context& ctx, + const DenseTensor& x, + paddle::optional out_size, + paddle::optional> size_tensor, + + paddle::optional scale_tensor, + const std::string& data_layout, + int out_d, + int out_h, + int out_w, + const std::vector& scale, + const std::string& interp_method, + bool align_corners, + int align_mode, + DenseTensor* output); + +} // namespace phi diff --git a/paddle/phi/kernels/kldiv_loss_grad_kernel.h b/paddle/phi/kernels/kldiv_loss_grad_kernel.h index 8f53898fa6816..6e05c7992eb61 100644 --- a/paddle/phi/kernels/kldiv_loss_grad_kernel.h +++ b/paddle/phi/kernels/kldiv_loss_grad_kernel.h @@ -19,7 +19,6 @@ namespace phi { template -// XKTODO (change name) void KLDivLossGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& label, diff --git a/paddle/phi/kernels/kthvalue_grad_kernel.h b/paddle/phi/kernels/kthvalue_grad_kernel.h index 488dde8237b08..c2eac0a3e3de9 100644 --- a/paddle/phi/kernels/kthvalue_grad_kernel.h +++ b/paddle/phi/kernels/kthvalue_grad_kernel.h @@ -20,9 +20,9 @@ namespace phi { template void KthvalueGradKernel(const Context& dev_ctx, - const DenseTensor& d_out, const DenseTensor& x, const DenseTensor& indices, + const DenseTensor& d_out, int k, int axis, bool keepdim, diff --git a/paddle/phi/kernels/lgamma_grad_kernel.h b/paddle/phi/kernels/lgamma_grad_kernel.h index 94173cc29c7a7..d7f0ef399eaa0 100644 --- a/paddle/phi/kernels/lgamma_grad_kernel.h +++ b/paddle/phi/kernels/lgamma_grad_kernel.h @@ -21,7 +21,7 @@ namespace phi { template void LgammaGradKernel(const Context& dev_ctx, - const DenseTensor& d_out, const DenseTensor& x, + const DenseTensor& d_out, DenseTensor* d_x); } // namespace phi diff --git a/paddle/phi/kernels/logsumexp_grad_kernel.h b/paddle/phi/kernels/logsumexp_grad_kernel.h index d68c447aa65cb..170f1c6c557ea 100644 --- a/paddle/phi/kernels/logsumexp_grad_kernel.h +++ 
b/paddle/phi/kernels/logsumexp_grad_kernel.h @@ -23,7 +23,7 @@ void LogsumexpGradKernel(const Context& ctx, const DenseTensor& in, const DenseTensor& out, const DenseTensor& out_grad, - const std::vector& axis, + const std::vector& axis, bool keepdim, bool reduce_all, DenseTensor* in_grad); diff --git a/paddle/phi/kernels/masked_select_grad_kernel.h b/paddle/phi/kernels/masked_select_grad_kernel.h index f9db1fcd2acc7..db7d105093d2a 100644 --- a/paddle/phi/kernels/masked_select_grad_kernel.h +++ b/paddle/phi/kernels/masked_select_grad_kernel.h @@ -19,9 +19,9 @@ namespace phi { template void MaskedSelectGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& mask, + const DenseTensor& out_grad, DenseTensor* x_grad); } // namspace phi diff --git a/paddle/phi/kernels/nll_loss_grad_kernel.h b/paddle/phi/kernels/nll_loss_grad_kernel.h index 127dc2f961f10..c06f0726899ee 100644 --- a/paddle/phi/kernels/nll_loss_grad_kernel.h +++ b/paddle/phi/kernels/nll_loss_grad_kernel.h @@ -22,8 +22,8 @@ template void NllLossGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& label, - const DenseTensor& total_weight, paddle::optional weight, + const DenseTensor& total_weight, const DenseTensor& d_out, int64_t ignore_index, const std::string& reduction, diff --git a/paddle/phi/kernels/norm_grad_kernel.h b/paddle/phi/kernels/norm_grad_kernel.h index 55714b8a4a091..a67e757ba510f 100644 --- a/paddle/phi/kernels/norm_grad_kernel.h +++ b/paddle/phi/kernels/norm_grad_kernel.h @@ -21,7 +21,7 @@ namespace phi { template void NormGradKernel(const Context& ctx, const DenseTensor& x, - const DenseTensor& out, + const DenseTensor& norm, const DenseTensor& out_grad, int axis, float epsilon, diff --git a/paddle/phi/kernels/prelu_grad_kernel.h b/paddle/phi/kernels/prelu_grad_kernel.h index 15917e2e1f02e..d36f529640d7d 100644 --- a/paddle/phi/kernels/prelu_grad_kernel.h +++ b/paddle/phi/kernels/prelu_grad_kernel.h @@ -24,8 +24,8 @@ void PReluGradKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& alpha, const DenseTensor& out_grad, - const std::string& mode, const std::string& data_format, + const std::string& mode, DenseTensor* x_grad, DenseTensor* alpha_grad); } // namespace phi diff --git a/paddle/phi/kernels/prelu_kernel.h b/paddle/phi/kernels/prelu_kernel.h index 251332a8158dc..7e273ecfd2fa1 100644 --- a/paddle/phi/kernels/prelu_kernel.h +++ b/paddle/phi/kernels/prelu_kernel.h @@ -22,7 +22,7 @@ template void PReluKernel(const Context& dev_ctx, const DenseTensor& x, const DenseTensor& alpha, - const std::string& mode, const std::string& data_format, + const std::string& mode, DenseTensor* out); } // namespace phi diff --git a/paddle/phi/kernels/reshape_grad_kernel.cc b/paddle/phi/kernels/reshape_grad_kernel.cc index 38132966407dc..129a69d4e4e0f 100644 --- a/paddle/phi/kernels/reshape_grad_kernel.cc +++ b/paddle/phi/kernels/reshape_grad_kernel.cc @@ -30,6 +30,7 @@ void ReshapeGradKernel(const Context& dev_ctx, template void ReshapeDoubleGradKernel(const Context& dev_ctx, + const DenseTensor& out_grad, const DenseTensor& x_grad_grad, DenseTensor* out_grad_grad) { ReshapeGradKernel(dev_ctx, x_grad_grad, out_grad_grad); diff --git a/paddle/phi/kernels/reshape_grad_kernel.h b/paddle/phi/kernels/reshape_grad_kernel.h index 4eb3f68337aff..06ec3de15ab22 100644 --- a/paddle/phi/kernels/reshape_grad_kernel.h +++ b/paddle/phi/kernels/reshape_grad_kernel.h @@ -25,6 +25,7 @@ void ReshapeGradKernel(const Context& dev_ctx, template void 
ReshapeDoubleGradKernel(const Context& dev_ctx, + const DenseTensor& out_grad, const DenseTensor& x_grad_grad, DenseTensor* out_grad_grad); diff --git a/paddle/phi/kernels/selected_rows/clip_kernel.h b/paddle/phi/kernels/selected_rows/clip_kernel.h new file mode 100644 index 0000000000000..ec56d92c513ea --- /dev/null +++ b/paddle/phi/kernels/selected_rows/clip_kernel.h @@ -0,0 +1,34 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/operators/math/selected_rows_functor.h" +#include "paddle/phi/common/scalar.h" +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/device_context.h" +#include "paddle/phi/core/selected_rows.h" +#include "paddle/phi/kernels/impl/clip_kernel_impl.h" + +namespace phi { +namespace sr { + +template +void ClipSparseKernel(const Context& dev_ctx, + const SelectedRows& x, + const Scalar& min, + const Scalar& max, + SelectedRows* out); +} // namespace sr +} // namespace phi diff --git a/paddle/phi/kernels/selected_rows/cpu/clip_kernel.cc b/paddle/phi/kernels/selected_rows/cpu/clip_kernel.cc new file mode 100644 index 0000000000000..0098bf13f2b2f --- /dev/null +++ b/paddle/phi/kernels/selected_rows/cpu/clip_kernel.cc @@ -0,0 +1,28 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/selected_rows/clip_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/selected_rows/impl/clip_kernel_impl.h" + +PD_REGISTER_KERNEL(clip_sr, + CPU, + ALL_LAYOUT, + phi::sr::ClipSparseKernel, + float, + double, + int, + int64_t) {} diff --git a/paddle/phi/kernels/selected_rows/gpu/clip_kernel.cu b/paddle/phi/kernels/selected_rows/gpu/clip_kernel.cu new file mode 100644 index 0000000000000..a8d659559e19e --- /dev/null +++ b/paddle/phi/kernels/selected_rows/gpu/clip_kernel.cu @@ -0,0 +1,30 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/kernels/selected_rows/clip_kernel.h" + +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/common/float16.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/selected_rows/impl/clip_kernel_impl.h" + +PD_REGISTER_KERNEL(clip_sr, + GPU, + ALL_LAYOUT, + phi::sr::ClipSparseKernel, + float, + double, + int, + int64_t, + phi::dtype::float16) {} diff --git a/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.cc b/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.cc index 80b2a1f6678a2..1660601bbd36e 100644 --- a/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.cc +++ b/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.cc @@ -40,11 +40,11 @@ void HierarchicalSigmoidGradKernel(const Context& ctx, const DenseTensor& x, const DenseTensor& w, const DenseTensor& label, - const DenseTensor& pre_out, - const DenseTensor& out_grad, paddle::optional path, paddle::optional code, paddle::optional bias, + const DenseTensor& pre_out, + const DenseTensor& out_grad, int num_classes, bool remote_prefetch, int trainer_id, @@ -70,11 +70,11 @@ void HierarchicalSigmoidGradKernel(const Context& ctx, x, w, label, - pre_out, - out_grad, path, code, bias, + pre_out, + out_grad, num_classes, remote_prefetch, trainer_id, diff --git a/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.h b/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.h index 557c8b1bc5eed..4c03b83d80fff 100644 --- a/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.h +++ b/paddle/phi/kernels/selected_rows/hierarchical_sigmoid_grad_kernel.h @@ -25,11 +25,11 @@ void HierarchicalSigmoidGradKernel(const Context& ctx, const DenseTensor& x, const DenseTensor& w, const DenseTensor& label, - const DenseTensor& pre_out, - const DenseTensor& out_grad, paddle::optional path, paddle::optional code, paddle::optional bias, + const DenseTensor& pre_out, + const DenseTensor& out_grad, int num_classes, bool remote_prefetch, int trainer_id, diff --git a/paddle/phi/kernels/selected_rows/impl/clip_kernel_impl.h b/paddle/phi/kernels/selected_rows/impl/clip_kernel_impl.h new file mode 100644 index 0000000000000..1d95e633b93a6 --- /dev/null +++ b/paddle/phi/kernels/selected_rows/impl/clip_kernel_impl.h @@ -0,0 +1,62 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#pragma once + +#include "paddle/phi/kernels/selected_rows/clip_kernel.h" + +#include "paddle/fluid/operators/math/selected_rows_functor.h" +#include "paddle/phi/common/scalar.h" +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/device_context.h" +#include "paddle/phi/core/selected_rows.h" + +namespace phi { +namespace sr { + +template +void ClipSparseKernel(const Context& dev_ctx, + const SelectedRows& x, + const Scalar& min, + const Scalar& max, + SelectedRows* out) { + auto max_ = max.to(); + auto min_ = min.to(); + + PADDLE_ENFORCE_LE( + min_, + max_, + errors::InvalidArgument("max should be greater than or equal to min. " + "But received min = %f, max = %f", + static_cast(min_), + static_cast(max_))); + + PADDLE_ENFORCE_NE(&x, + out, + errors::InvalidArgument("Inplace clip is not allowed " + "when x is SelectedRows")); + paddle::operators::math::scatter::MergeAdd merge_func; + merge_func(dev_ctx, x, out); + auto* out_tensor = out->mutable_value(); + auto* out_data = out_tensor->data(); + int64_t numel = out_tensor->numel(); + paddle::platform::Transform trans; + trans(dev_ctx, + out_data, + out_data + numel, + out_data, + ClipFunctor(min_, max_)); +} +} // namespace sr +} // namespace phi diff --git a/paddle/phi/kernels/sparse/convolution_grad_kernel.h b/paddle/phi/kernels/sparse/convolution_grad_kernel.h index 23e059c72e776..eebfcddfc7a9e 100644 --- a/paddle/phi/kernels/sparse/convolution_grad_kernel.h +++ b/paddle/phi/kernels/sparse/convolution_grad_kernel.h @@ -25,37 +25,37 @@ namespace sparse { template void Conv3dGradKernel(const Context& dev_ctx, const SparseCooTensor& x, - const DenseTensor& rulebook, const DenseTensor& kernel, - const DenseTensor& out_grad, + const DenseTensor& rulebook, + const SparseCooTensor& out_grad, const std::vector& paddings, const std::vector& dilations, const std::vector& strides, const int groups, const bool subm, - DenseTensor* x_grad, + SparseCooTensor* x_grad, DenseTensor* kernel_grad); template -std::vector Conv3dGrad(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& rulebook, - const DenseTensor& kernel, - const DenseTensor& out_grad, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - const int groups, - const bool subm) { - DenseTensor x_grad = - phi::Empty(dev_ctx, DenseTensorMeta(x.dtype(), {1}, x.layout())); - DenseTensor kernel_grad = phi::Empty( - dev_ctx, DenseTensorMeta(kernel.dtype(), {1}, kernel.layout())); +std::tuple Conv3dGrad( + const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const DenseTensor& rulebook, + const SparseCooTensor& out_grad, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm) { + SparseCooTensor x_grad; + DenseTensor kernel_grad; + // TODO(zhangkaihuo): call InferMeta func here Conv3dGradKernel(dev_ctx, x, - rulebook, kernel, + rulebook, out_grad, paddings, dilations, @@ -64,10 +64,7 @@ std::vector Conv3dGrad(const Context& dev_ctx, subm, &x_grad, &kernel_grad); - std::vector out(2); - out[0] = x_grad; - out[1] = kernel_grad; - return out; + return std::make_tuple(x_grad, kernel_grad); } } // namespace sparse diff --git a/paddle/phi/kernels/sparse/convolution_kernel.h b/paddle/phi/kernels/sparse/convolution_kernel.h index ff2cf94edb5a3..6120d6339a7eb 100644 --- a/paddle/phi/kernels/sparse/convolution_kernel.h +++ b/paddle/phi/kernels/sparse/convolution_kernel.h @@ -45,11 +45,7 @@ SparseCooTensor Conv3d(const 
Context& dev_ctx, const int groups, const bool subm, DenseTensor* rulebook) { - DenseTensor indices = phi::Empty( - dev_ctx, DenseTensorMeta(DataType::INT32, {1}, DataLayout::NCHW)); - DenseTensor values = - phi::Empty(dev_ctx, DenseTensorMeta(x.dtype(), {1}, x.layout())); - SparseCooTensor coo(indices, values, x.dims()); + SparseCooTensor coo; Conv3dKernel(dev_ctx, x, kernel, diff --git a/paddle/phi/kernels/sparse/copy_kernel.cc b/paddle/phi/kernels/sparse/copy_kernel.cc new file mode 100644 index 0000000000000..705c19e020c84 --- /dev/null +++ b/paddle/phi/kernels/sparse/copy_kernel.cc @@ -0,0 +1,99 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/kernels/sparse/copy_kernel.h" + +#include "paddle/phi/backends/cpu/cpu_context.h" +#include "paddle/phi/backends/gpu/gpu_context.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/core/sparse_coo_tensor.h" +#include "paddle/phi/core/sparse_csr_tensor.h" +#include "paddle/phi/kernels/copy_kernel.h" + +namespace phi { +namespace sparse { + +template +void CopyCoo(const Context& dev_ctx, + const SparseCooTensor& src, + Place dst_place, + bool blocking, + SparseCooTensor* dst) { + phi::Copy(dev_ctx, + src.non_zero_indices(), + dst_place, + blocking, + dst->mutable_non_zero_indices()); + + phi::Copy(dev_ctx, + src.non_zero_elements(), + dst_place, + blocking, + dst->mutable_non_zero_elements()); + dst->set_dims(src.dims()); +} + +template +void CopyCsr(const Context& dev_ctx, + const SparseCsrTensor& src, + Place dst_place, + bool blocking, + SparseCsrTensor* dst) { + phi::Copy(dev_ctx, + src.non_zero_crows(), + dst_place, + blocking, + dst->mutable_non_zero_crows()); + + phi::Copy(dev_ctx, + src.non_zero_cols(), + dst_place, + blocking, + dst->mutable_non_zero_cols()); + + phi::Copy(dev_ctx, + src.non_zero_elements(), + dst_place, + blocking, + dst->mutable_non_zero_elements()); + dst->set_dims(src.dims()); +} + +} // namespace sparse +} // namespace phi + +PD_REGISTER_GENERAL_KERNEL(copy_sparse_coo, + CPU, + ALL_LAYOUT, + phi::sparse::CopyCoo, + ALL_DTYPE) {} + +PD_REGISTER_GENERAL_KERNEL(copy_sparse_csr, + CPU, + ALL_LAYOUT, + phi::sparse::CopyCsr, + ALL_DTYPE) {} + +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) +PD_REGISTER_GENERAL_KERNEL(copy_sparse_coo, + GPU, + ALL_LAYOUT, + phi::sparse::CopyCoo, + ALL_DTYPE) {} +PD_REGISTER_GENERAL_KERNEL(copy_sparse_csr, + GPU, + ALL_LAYOUT, + phi::sparse::CopyCsr, + ALL_DTYPE) {} +#endif diff --git a/paddle/phi/kernels/sparse/copy_kernel.h b/paddle/phi/kernels/sparse/copy_kernel.h new file mode 100644 index 0000000000000..a43621a4dfeed --- /dev/null +++ b/paddle/phi/kernels/sparse/copy_kernel.h @@ -0,0 +1,41 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
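
A note on the copy kernels registered above: CopyCoo and CopyCsr simply forward each component tensor (indices and values, or crows/cols/values) to phi::Copy and then carry the dims over to the destination. A minimal usage sketch follows; dev_ctx and src are assumed to already exist and are not part of this patch.

// Sketch: copying a SparseCooTensor with the new kernel (illustrative only).
phi::SparseCooTensor dst;
phi::sparse::CopyCoo(dev_ctx, src, dev_ctx.GetPlace(), /*blocking=*/false, &dst);
// dst now holds copies of src.non_zero_indices() and src.non_zero_elements(),
// and dst.dims() matches src.dims().
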
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/phi/api/lib/utils/storage.h" +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/sparse_coo_tensor.h" +#include "paddle/phi/core/sparse_csr_tensor.h" +#include "paddle/phi/kernels/empty_kernel.h" + +namespace phi { +namespace sparse { + +template +void CopyCoo(const Context& dev_ctx, + const SparseCooTensor& src, + Place dst_place, + bool blocking, + SparseCooTensor* dst); + +template +void CopyCsr(const Context& dev_ctx, + const SparseCsrTensor& src, + Place dst_place, + bool blocking, + SparseCsrTensor* dst); + +} // namespace sparse +} // namespace phi diff --git a/paddle/phi/kernels/sparse/cpu/convolution.h b/paddle/phi/kernels/sparse/cpu/convolution.h index 93a335e2f1c35..b2544619774c2 100644 --- a/paddle/phi/kernels/sparse/cpu/convolution.h +++ b/paddle/phi/kernels/sparse/cpu/convolution.h @@ -31,7 +31,7 @@ using Dims4D = phi::funcs::sparse::Dims4D; // such as: kernel(3, 3, 3), kernel_size = 27 // counter_per_weight: (kernel_size) // TODO(zhangkaihuo): optimize performance with multithreading -template +template void ProductRuleBook(const Context& dev_ctx, const SparseCooTensor& x, const std::vector& kernel_sizes, @@ -44,7 +44,7 @@ void ProductRuleBook(const Context& dev_ctx, DenseTensor* counter_per_kernel) { const int64_t non_zero_num = x.nnz(); const auto& non_zero_indices = x.non_zero_indices(); - const int* indices_ptr = non_zero_indices.data(); + const IntT* indices_ptr = non_zero_indices.data(); int* counter_ptr = counter_per_kernel->data(); int kernel_size = kernel_sizes[0] * kernel_sizes[1] * kernel_sizes[2]; memset(counter_ptr, 0, kernel_size * sizeof(int)); @@ -60,33 +60,33 @@ void ProductRuleBook(const Context& dev_ctx, const Dims4D c_strides(1, strides[2], strides[1], strides[0]); const Dims4D c_dilations(1, dilations[2], dilations[1], dilations[0]); - std::set hash_in; + std::set hash_in; if (subm) { for (int i = 0; i < non_zero_num; i++) { - int batch = indices_ptr[i]; - int in_z = indices_ptr[i + non_zero_num]; - int in_y = indices_ptr[i + 2 * non_zero_num]; - int in_x = indices_ptr[i + 3 * non_zero_num]; - int index = phi::funcs::sparse::PointToIndex( + IntT batch = indices_ptr[i]; + IntT in_z = indices_ptr[i + non_zero_num]; + IntT in_y = indices_ptr[i + 2 * non_zero_num]; + IntT in_x = indices_ptr[i + 3 * non_zero_num]; + IntT index = phi::funcs::sparse::PointToIndex( batch, in_x, in_y, in_z, x_dims); hash_in.insert(index); } } - auto f_calc_rulebook = [&](int* rulebook_ptr) { + auto f_calc_rulebook = [&](IntT* rulebook_ptr) { int kernel_index = 0, rulebook_index = 0; for (int kz = 0; kz < kernel_sizes[0]; kz++) { for (int ky = 0; ky < kernel_sizes[1]; ky++) { for (int kx = 0; kx < kernel_sizes[2]; kx++) { ++kernel_index; for (int64_t i = 0; i < non_zero_num; i++) { - int batch = indices_ptr[i]; - int in_z = indices_ptr[i + non_zero_num]; - int in_y = indices_ptr[i + 2 * non_zero_num]; - int in_x = indices_ptr[i + 3 * non_zero_num]; - int out_z = (in_z + paddings[0] - kz * dilations[0]) / strides[0]; - int out_y = (in_y + paddings[1] - ky * dilations[1]) / strides[1]; - int out_x = (in_x + paddings[2] - kx * 
dilations[2]) / strides[2]; + IntT batch = indices_ptr[i]; + IntT in_z = indices_ptr[i + non_zero_num]; + IntT in_y = indices_ptr[i + 2 * non_zero_num]; + IntT in_x = indices_ptr[i + 3 * non_zero_num]; + IntT out_z = (in_z + paddings[0] - kz * dilations[0]) / strides[0]; + IntT out_y = (in_y + paddings[1] - ky * dilations[1]) / strides[1]; + IntT out_x = (in_x + paddings[2] - kx * dilations[2]) / strides[2]; if (phi::funcs::sparse::Check(c_x_dims, c_kernel_dims, c_paddings, @@ -99,7 +99,7 @@ void ProductRuleBook(const Context& dev_ctx, ky, kz)) { if (subm) { - int out_index = phi::funcs::sparse::PointToIndex( + IntT out_index = phi::funcs::sparse::PointToIndex( batch, out_x, out_y, out_z, out_dims); if (hash_in.find(out_index) == hash_in.end()) { continue; @@ -126,15 +126,16 @@ void ProductRuleBook(const Context& dev_ctx, f_calc_rulebook(nullptr); // alloc the rulebook - DenseTensorMeta rulebook_meta( - DataType::INT32, {3, rulebook_len}, DataLayout::NCHW); - rulebook->set_meta(rulebook_meta); - dev_ctx.Alloc(rulebook, rulebook->dtype(), rulebook->numel() * sizeof(int)); - int* rulebook_ptr = rulebook->data(); + *rulebook = phi::Empty( + dev_ctx, + DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), + {3, rulebook_len}, + DataLayout::NCHW)); + IntT* rulebook_ptr = rulebook->data(); f_calc_rulebook(rulebook_ptr); } -template +template void UpdateRulebookAndOutIndex(const Context& dev_ctx, const SparseCooTensor& x, const int kernel_size, @@ -142,9 +143,9 @@ void UpdateRulebookAndOutIndex(const Context& dev_ctx, const DDim& out_dims, DenseTensor* rulebook, SparseCooTensor* out) { - std::set out_indexs; + std::set out_indexs; int n = rulebook->dims()[1]; - int* rulebook_ptr = rulebook->data(); + IntT* rulebook_ptr = rulebook->data(); for (int i = 0; i < n; i++) { out_indexs.insert(rulebook_ptr[i + n * 2]); } @@ -152,16 +153,19 @@ void UpdateRulebookAndOutIndex(const Context& dev_ctx, int out_non_zero_num = out_indexs.size(); const int64_t sparse_dim = 4; DenseTensorMeta indices_meta( - DataType::INT32, {sparse_dim, out_non_zero_num}, DataLayout::NCHW); - DenseTensorMeta values_meta( - x.dtype(), {out_non_zero_num, out_channels}, x.layout()); + paddle::experimental::CppTypeToDataType::Type(), + {sparse_dim, out_non_zero_num}, + DataLayout::NCHW); + DenseTensorMeta values_meta(x.dtype(), + {out_non_zero_num, out_channels}, + x.non_zero_elements().layout()); phi::DenseTensor out_indices = phi::Empty(dev_ctx, std::move(indices_meta)); phi::DenseTensor out_values = phi::Empty(dev_ctx, std::move(values_meta)); - int* out_indices_ptr = out_indices.data(); + IntT* out_indices_ptr = out_indices.data(); int i = 0; for (auto it = out_indexs.begin(); it != out_indexs.end(); it++, i++) { - const int index = *it; - int batch, x, y, z; + const IntT index = *it; + IntT batch, x, y, z; phi::funcs::sparse::IndexToPoint(index, out_dims, &batch, &x, &y, &z); out_indices_ptr[i] = batch; out_indices_ptr[i + out_non_zero_num] = z; @@ -169,7 +173,7 @@ void UpdateRulebookAndOutIndex(const Context& dev_ctx, out_indices_ptr[i + out_non_zero_num * 3] = x; } for (i = 0; i < n; i++) { - int out_index = rulebook_ptr[i + n * 2]; + IntT out_index = rulebook_ptr[i + n * 2]; rulebook_ptr[i + n * 2] = std::distance(out_indexs.begin(), out_indexs.find(out_index)); } @@ -177,20 +181,20 @@ void UpdateRulebookAndOutIndex(const Context& dev_ctx, out->SetMember(out_indices, out_values, out_dims, true); } -template +template void Gather( - const T* x, const int* indexs, const int n, const int channels, T* out) { + const T* 
x, const IntT* indexs, const int n, const int channels, T* out) { for (int i = 0; i < n; i++) { - int real_i = indexs[i]; + IntT real_i = indexs[i]; memcpy(out + i * channels, x + real_i * channels, channels * sizeof(T)); } } -template +template void Scatter( - const T* x, const int* indexs, const int n, const int channels, T* out) { + const T* x, const IntT* indexs, const int n, const int channels, T* out) { for (int i = 0; i < n; i++) { - int real_i = indexs[i]; + IntT real_i = indexs[i]; for (int j = 0; j < channels; j++) { out[real_i * channels + j] += x[i * channels + j]; } diff --git a/paddle/phi/kernels/sparse/cpu/convolution_grad_kernel.cc b/paddle/phi/kernels/sparse/cpu/convolution_grad_kernel.cc index 3348d81cf6b4b..80693c90d1e7f 100644 --- a/paddle/phi/kernels/sparse/cpu/convolution_grad_kernel.cc +++ b/paddle/phi/kernels/sparse/cpu/convolution_grad_kernel.cc @@ -13,10 +13,13 @@ See the License for the specific language governing permissions and limitations under the License. */ #include "paddle/phi/kernels/sparse/convolution_grad_kernel.h" +#include "paddle/phi/kernels/copy_kernel.h" #include "paddle/phi/kernels/funcs/blas/blas.h" #include "paddle/phi/kernels/funcs/math_function.h" #include "paddle/phi/kernels/sparse/cpu/convolution.h" +#include "paddle/phi/api/ext/dispatch.h" + namespace phi { namespace sparse { @@ -28,24 +31,24 @@ namespace sparse { //] // x_grad = out_grad * transpose(kenrel) // kernel_grad = transpose(x) * out_grad -template -void Conv3dGradKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& rulebook, - const DenseTensor& kernel, - const DenseTensor& out_grad, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - const int groups, - const bool subm, - DenseTensor* x_grad, - DenseTensor* kernel_grad) { +template +void Conv3dGradCPUKernel(const CPUContext& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const DenseTensor& rulebook, + const SparseCooTensor& out_grad, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* x_grad, + DenseTensor* kernel_grad) { const auto& kernel_dims = kernel.dims(); const int kernel_size = kernel_dims[0] * kernel_dims[1] * kernel_dims[2]; const int in_channels = kernel_dims[3]; const int out_channels = kernel_dims[4]; - const int* rulebook_ptr = rulebook.data(); + const IntT* rulebook_ptr = rulebook.data(); const int rulebook_len = rulebook.dims()[1]; @@ -65,25 +68,30 @@ void Conv3dGradKernel(const Context& dev_ctx, T* in_features_ptr = in_features.data(); T* d_x_features_ptr = d_x_features.data(); T* out_grad_features_ptr = out_grad_features.data(); - kernel_grad->Resize(kernel_dims); - dev_ctx.Alloc( - kernel_grad, kernel_grad->dtype(), kernel_grad->numel() * sizeof(T)); + *kernel_grad = phi::EmptyLike(dev_ctx, kernel); T* d_kernel_ptr = kernel_grad->data(); memset(d_kernel_ptr, 0, sizeof(T) * kernel_grad->numel()); int half_kernel_size = kernel_size / 2; - auto blas = phi::funcs::GetBlas(dev_ctx); - x_grad->Resize(x.non_zero_elements().dims()); - dev_ctx.Alloc(x_grad, x_grad->dtype(), sizeof(T) * x_grad->numel()); - T* x_grad_values_ptr = x_grad->data(); - memset(x_grad_values_ptr, 0, sizeof(T) * x_grad->numel()); + auto blas = phi::funcs::GetBlas(dev_ctx); + DenseTensor x_grad_indices = + phi::EmptyLike(dev_ctx, x.non_zero_indices()); + DenseTensor x_grad_values = phi::EmptyLike(dev_ctx, x.non_zero_elements()); + T* x_grad_values_ptr = 
x_grad_values.data(); + memset(x_grad_values_ptr, 0, sizeof(T) * x_grad_values.numel()); memset(d_x_features_ptr, 0, sizeof(T) * d_x_features.numel()); - - std::vector offsets(kernel_size + 1), counter(kernel_size, 0); + phi::Copy(dev_ctx, + x.non_zero_indices(), + dev_ctx.GetPlace(), + false, + &x_grad_indices); + x_grad->SetMember(x_grad_indices, x_grad_values, x.dims(), true); + + std::vector offsets(kernel_size + 1), counter(kernel_size, 0); for (int i = 0; i < rulebook_len; i++) { counter[rulebook_ptr[i]] += 1; } - int offset = 0, max_count = 0; + IntT offset = 0, max_count = 0; for (int i = 0; i < kernel_size; i++) { offsets[i] = offset; offset += counter[i]; @@ -94,30 +102,31 @@ void Conv3dGradKernel(const Context& dev_ctx, offsets[kernel_size] = offset; if (subm) { - phi::funcs::sparse::SubmPreProcess(dev_ctx, - x, - kernel, - out_grad, - in_channels, - out_channels, - half_kernel_size, - kernel_grad, - x_grad); + phi::funcs::sparse::SubmPreProcess( + dev_ctx, + x, + kernel, + out_grad.non_zero_elements(), + in_channels, + out_channels, + half_kernel_size, + kernel_grad, + &x_grad_values); if (max_count == 0) { return; } } - Gather(x.non_zero_elements().data(), - rulebook_ptr + rulebook_len, - rulebook_len, - in_channels, - in_features_ptr); - Gather(out_grad.data(), - rulebook_ptr + rulebook_len * 2, - rulebook_len, - out_channels, - out_grad_features_ptr); + Gather(x.non_zero_elements().data(), + rulebook_ptr + rulebook_len, + rulebook_len, + in_channels, + in_features_ptr); + Gather(out_grad.non_zero_elements().data(), + rulebook_ptr + rulebook_len * 2, + rulebook_len, + out_channels, + out_grad_features_ptr); const T* kernel_ptr = kernel.data(); for (int i = 0; i < kernel_size; i++) { @@ -162,11 +171,41 @@ void Conv3dGradKernel(const Context& dev_ctx, } // 4. scatter - Scatter(d_x_features_ptr, - rulebook.data() + rulebook_len, - rulebook_len, - in_channels, - x_grad_values_ptr); + Scatter(d_x_features_ptr, + rulebook.data() + rulebook_len, + rulebook_len, + in_channels, + x_grad_values_ptr); +} + +template +void Conv3dGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const DenseTensor& rulebook, + const SparseCooTensor& out_grad, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* x_grad, + DenseTensor* kernel_grad) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "Conv3dGradCPUKernel", ([&] { + Conv3dGradCPUKernel(dev_ctx, + x, + kernel, + rulebook, + out_grad, + paddings, + dilations, + strides, + groups, + subm, + x_grad, + kernel_grad); + })); } } // namespace sparse diff --git a/paddle/phi/kernels/sparse/cpu/convolution_kernel.cc b/paddle/phi/kernels/sparse/cpu/convolution_kernel.cc index f022e4ef4bb63..a1c8cf014c7fb 100644 --- a/paddle/phi/kernels/sparse/cpu/convolution_kernel.cc +++ b/paddle/phi/kernels/sparse/cpu/convolution_kernel.cc @@ -17,6 +17,8 @@ limitations under the License. 
*/ #include "paddle/phi/core/tensor_meta.h" #include "paddle/phi/kernels/funcs/blas/blas.h" +#include "paddle/phi/api/ext/dispatch.h" + namespace phi { namespace sparse { @@ -25,17 +27,17 @@ namespace sparse { * kernel: (D, H, W, C, OC) * out: (N, D, H, W, OC) **/ -template -void Conv3dKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& kernel, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - const int groups, - const bool subm, - SparseCooTensor* out, - DenseTensor* rulebook) { +template +void Conv3dCPUKernel(const CPUContext& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* out, + DenseTensor* rulebook) { // update padding and dilation // Currently, only support x.layout is NDHWC, groups = 1 // if x.layout != NDHWC then transpose(x), transpose(weight) @@ -66,18 +68,18 @@ void Conv3dKernel(const Context& dev_ctx, DataType::INT32, {kernel_size}, DataLayout::NCHW); DenseTensor counter_per_kernel = phi::Empty(dev_ctx, std::move(counter_meta)); - ProductRuleBook(dev_ctx, - x, - kernel_sizes, - subm_paddings, - dilations, - subm_strides, - out_dims, - subm, - rulebook, - &counter_per_kernel); - - UpdateRulebookAndOutIndex( + ProductRuleBook(dev_ctx, + x, + kernel_sizes, + subm_paddings, + dilations, + subm_strides, + out_dims, + subm, + rulebook, + &counter_per_kernel); + + UpdateRulebookAndOutIndex( dev_ctx, x, kernel_size, out_channels, out_dims, rulebook, out); int n = rulebook->dims()[1]; @@ -95,14 +97,14 @@ void Conv3dKernel(const Context& dev_ctx, T* in_features_ptr = in_features.data(); T* out_features_ptr = out_features.data(); - Gather(x.non_zero_elements().data(), - rulebook->data() + n, - n, - in_channels, - in_features_ptr); + Gather(x.non_zero_elements().data(), + rulebook->data() + n, + n, + in_channels, + in_features_ptr); // 3. call gemm for every werght - auto blas = phi::funcs::GetBlas(dev_ctx); + auto blas = phi::funcs::GetBlas(dev_ctx); std::vector offsets(kernel_size + 1); int offset = 0; for (int i = 0; i < kernel_size; i++) { @@ -139,11 +141,37 @@ void Conv3dKernel(const Context& dev_ctx, // 4. scatter T* out_values_ptr = out->mutable_non_zero_elements()->data(); memset(out_values_ptr, 0, sizeof(T) * out->nnz() * out_channels); - Scatter(out_features_ptr, - rulebook->data() + n * 2, - n, - out_channels, - out_values_ptr); + Scatter(out_features_ptr, + rulebook->data() + n * 2, + n, + out_channels, + out_values_ptr); +} + +template +void Conv3dKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* out, + DenseTensor* rulebook) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "Conv3dCPUKernel", ([&] { + Conv3dCPUKernel(dev_ctx, + x, + kernel, + paddings, + dilations, + strides, + groups, + subm, + out, + rulebook); + })); } } // namespace sparse diff --git a/paddle/phi/kernels/sparse/cpu/sparse_mask_kernel.cc b/paddle/phi/kernels/sparse/cpu/sparse_mask_kernel.cc new file mode 100644 index 0000000000000..0a5e145312e0e --- /dev/null +++ b/paddle/phi/kernels/sparse/cpu/sparse_mask_kernel.cc @@ -0,0 +1,103 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/kernels/sparse/sparse_mask_kernel.h" +#include "paddle/phi/core/ddim.h" +#include "paddle/phi/core/enforce.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/empty_kernel.h" +#include "paddle/phi/kernels/funcs/math_function.h" + +#include "paddle/phi/api/ext/dispatch.h" + +namespace phi { +namespace sparse { + +template +void SparseMaskCPUKernel(const CPUContext& dev_ctx, + const DenseTensor& x, + const SparseCooTensor& mask, + SparseCooTensor* out) { + const DDim& dims = x.dims(); + PADDLE_ENFORCE_EQ( + x.dims(), + mask.dims(), + phi::errors::InvalidArgument("the input x and mask must have the shape")); + const DenseTensor& indices = mask.non_zero_indices(); + const DenseTensor& values = mask.non_zero_elements(); + int sparse_dim = indices.dims().size(); + std::vector sparse_offsets(sparse_dim); + int64_t offset = 1; + for (int i = sparse_dim - 1; i >= 0; i--) { + sparse_offsets[i] = offset; + offset *= dims[i]; + } + + DenseTensor out_indices = phi::EmptyLike(dev_ctx, indices); + DenseTensor out_values = phi::EmptyLike(dev_ctx, values); + + // the out_indices is same as indices of mask + phi::Copy(dev_ctx, indices, dev_ctx.GetPlace(), false, &out_indices); + + const IntT* indices_ptr = indices.data(); + T* out_values_ptr = out_values.data(); + const T* x_ptr = x.data(); + + const int64_t non_zero_num = mask.nnz(); + auto dims_2d = flatten_to_2d(dims, sparse_dim); + const int cols = dims_2d[1]; + + for (int64_t i = 0; i < non_zero_num; i++) { + int64_t index = 0; + for (int j = 0; j < sparse_dim; j++) { + index += indices_ptr[j * non_zero_num + i] * sparse_offsets[j]; + } + memcpy(out_values_ptr + i * cols, x_ptr + index * cols, cols * sizeof(T)); + } + out->SetMember(out_indices, out_values, dims, true); +} + +/** + * @brief Filter the DenseTensor x by the + * mask.non_zero_indices() and output a SparseCooTensor + * x and mask must have the same shape. +**/ +template +void SparseMaskKernel(const Context& dev_ctx, + const DenseTensor& x, + const SparseCooTensor& mask, + SparseCooTensor* out) { + PD_DISPATCH_INTEGRAL_TYPES( + mask.non_zero_indices().dtype(), "SparseMaskCPUKernel", ([&] { + SparseMaskCPUKernel(dev_ctx, x, mask, out); + })); +} + +} // namespace sparse +} // namespace phi + +PD_REGISTER_KERNEL(sparse_mask, + CPU, + ALL_LAYOUT, + phi::sparse::SparseMaskKernel, + float, + double, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(1).SetDataLayout(phi::DataLayout::SPARSE_COO); +} diff --git a/paddle/phi/kernels/sparse/cpu/sparse_pool_grad_kernel.cc b/paddle/phi/kernels/sparse/cpu/sparse_pool_grad_kernel.cc index 3010d480b55c9..30221975e7756 100644 --- a/paddle/phi/kernels/sparse/cpu/sparse_pool_grad_kernel.cc +++ b/paddle/phi/kernels/sparse/cpu/sparse_pool_grad_kernel.cc @@ -14,24 +14,28 @@ limitations under the License. 
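
The behaviour of the SparseMaskCPUKernel above can be summarised with a tiny made-up example: the output reuses the mask's indices unchanged and gathers the corresponding entries of the dense input (the mask's own values are ignored).

// Illustration only (numbers invented), assuming row-major layout:
//   x    = [[1, 2],
//           [3, 4]]                        dense, shape {2, 2}
//   mask = SparseCooTensor with indices (0, 1) and (1, 0)
// sparse_mask(x, mask) produces a SparseCooTensor with
//   indices: (0, 1), (1, 0)
//   values : 2, 3
// i.e. out_values[i] = x[indices[:, i]] for every nonzero of the mask.
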
*/ #include "paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h" #include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/empty_kernel.h" #include "paddle/phi/kernels/funcs/pooling.h" #include "paddle/phi/kernels/funcs/sparse/convolution.h" +#include "paddle/phi/api/ext/dispatch.h" + namespace phi { namespace sparse { -template -void MaxPoolGradKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& rulebook, - const SparseCooTensor& out, - const DenseTensor& out_grad, - const std::vector& kernel_sizes, - DenseTensor* x_grad) { +template +void MaxPoolGradCPUKernel(const CPUContext& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& rulebook, + const SparseCooTensor& out, + const SparseCooTensor& out_grad, + const std::vector& kernel_sizes, + SparseCooTensor* x_grad) { int kernel_size = kernel_sizes[0] * kernel_sizes[1] * kernel_sizes[2]; const int channels = x.dims()[4]; int rulebook_len = rulebook.dims()[1]; - const int* rulebook_ptr = rulebook.data(); + const IntT* rulebook_ptr = rulebook.data(); std::vector offsets(kernel_size + 1), counter(kernel_size, 0); for (int i = 0; i < rulebook_len; i++) { counter[rulebook_ptr[i]] += 1; @@ -40,15 +44,25 @@ void MaxPoolGradKernel(const Context& dev_ctx, const T* in_features_ptr = x.non_zero_elements().data(); const T* out_features_ptr = out.non_zero_elements().data(); - const T* out_grad_ptr = out_grad.data(); - T* x_grad_ptr = x_grad->data(); + const T* out_grad_ptr = out_grad.non_zero_elements().data(); + // TODO(zhangkaihuo): call phi::sparse::EmptyLike + DenseTensor x_grad_indices = + phi::EmptyLike(dev_ctx, x.non_zero_indices()); + DenseTensor x_grad_values = phi::EmptyLike(dev_ctx, x.non_zero_elements()); + x_grad->SetMember(x_grad_indices, x_grad_values, x.dims(), true); + T* x_grad_ptr = x_grad_values.data(); memset(x_grad_ptr, 0, sizeof(T) * x_grad->numel()); + phi::Copy(dev_ctx, + x.non_zero_indices(), + dev_ctx.GetPlace(), + false, + &x_grad_indices); phi::funcs::MaxPoolGrad grad_functor; for (int i = 0; i < kernel_size; i++) { for (int j = 0; j < counter[i]; j++) { - int in_i = rulebook_ptr[rulebook_len + offsets[i] + j]; - int out_i = rulebook_ptr[rulebook_len * 2 + offsets[i] + j]; + IntT in_i = rulebook_ptr[rulebook_len + offsets[i] + j]; + IntT out_i = rulebook_ptr[rulebook_len * 2 + offsets[i] + j]; for (int c = 0; c < channels; c++) { grad_functor.compute(in_features_ptr[in_i * channels + c], out_features_ptr[out_i * channels + c], @@ -60,6 +74,21 @@ void MaxPoolGradKernel(const Context& dev_ctx, } } +template +void MaxPoolGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& rulebook, + const SparseCooTensor& out, + const SparseCooTensor& out_grad, + const std::vector& kernel_sizes, + SparseCooTensor* x_grad) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "MaxPoolGradCPUKernel", ([&] { + MaxPoolGradCPUKernel( + dev_ctx, x, rulebook, out, out_grad, kernel_sizes, x_grad); + })); +} + } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/sparse/cpu/sparse_pool_kernel.cc b/paddle/phi/kernels/sparse/cpu/sparse_pool_kernel.cc index 86971242df5ae..ed6e0200587e8 100644 --- a/paddle/phi/kernels/sparse/cpu/sparse_pool_kernel.cc +++ b/paddle/phi/kernels/sparse/cpu/sparse_pool_kernel.cc @@ -19,6 +19,8 @@ limitations under the License. 
*/ #include "paddle/phi/kernels/funcs/sparse/convolution.h" #include "paddle/phi/kernels/sparse/cpu/convolution.h" +#include "paddle/phi/api/ext/dispatch.h" + namespace phi { namespace sparse { @@ -27,15 +29,15 @@ namespace sparse { * kernel: (D, H, W, C, OC) * out: (N, D, H, W, OC) **/ -template -void MaxPoolKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const std::vector& kernel_sizes, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - SparseCooTensor* out, - DenseTensor* rulebook) { +template +void MaxPoolCPUKernel(const CPUContext& dev_ctx, + const SparseCooTensor& x, + const std::vector& kernel_sizes, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + SparseCooTensor* out, + DenseTensor* rulebook) { const auto& x_dims = x.dims(); int kernel_size = kernel_sizes[0] * kernel_sizes[1] * kernel_sizes[2]; const std::vector& real_kernel_sizes = @@ -51,22 +53,22 @@ void MaxPoolKernel(const Context& dev_ctx, const T* in_features_ptr = x.non_zero_elements().data(); // 1. product rule book - ProductRuleBook(dev_ctx, - x, - real_kernel_sizes, - paddings, - dilations, - strides, - out_dims, - false, - rulebook, - &counter_per_kernel); - - UpdateRulebookAndOutIndex( + ProductRuleBook(dev_ctx, + x, + real_kernel_sizes, + paddings, + dilations, + strides, + out_dims, + false, + rulebook, + &counter_per_kernel); + + UpdateRulebookAndOutIndex( dev_ctx, x, kernel_size, in_channels, out_dims, rulebook, out); int rulebook_len = rulebook->dims()[1]; - const int* rulebook_ptr = rulebook->data(); + const IntT* rulebook_ptr = rulebook->data(); const int* counter_ptr = counter_per_kernel.data(); std::vector offsets(kernel_size + 1); @@ -78,8 +80,8 @@ void MaxPoolKernel(const Context& dev_ctx, phi::funcs::MaxPool max_pool_functor; for (int i = 0; i < kernel_size; i++) { for (int j = 0; j < counter_ptr[i]; j++) { - int in_i = rulebook_ptr[rulebook_len + offsets[i] + j]; - int out_i = rulebook_ptr[rulebook_len * 2 + offsets[i] + j]; + IntT in_i = rulebook_ptr[rulebook_len + offsets[i] + j]; + IntT out_i = rulebook_ptr[rulebook_len * 2 + offsets[i] + j]; if (!out_flags[out_i]) { out_flags[out_i] = true; memcpy(&out_features_ptr[out_i * in_channels], @@ -95,6 +97,28 @@ void MaxPoolKernel(const Context& dev_ctx, } } +template +void MaxPoolKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const std::vector& kernel_sizes, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + SparseCooTensor* out, + DenseTensor* rulebook) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "MaxPoolCPUKernel", ([&] { + MaxPoolCPUKernel(dev_ctx, + x, + kernel_sizes, + paddings, + dilations, + strides, + out, + rulebook); + })); +} + } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/sparse/cpu/sparse_utils_kernel.cc b/paddle/phi/kernels/sparse/cpu/sparse_utils_kernel.cc index 50e95ee0b8a48..acc834269663d 100644 --- a/paddle/phi/kernels/sparse/cpu/sparse_utils_kernel.cc +++ b/paddle/phi/kernels/sparse/cpu/sparse_utils_kernel.cc @@ -121,7 +121,8 @@ void SparseCsrToCooKernel(const Context& dev_ctx, const auto place = dev_ctx.GetPlace(); DenseTensorMeta indices_meta( DataType::INT64, {sparse_dim, non_zero_num}, DataLayout::NCHW); - DenseTensorMeta values_meta(x.dtype(), {non_zero_num}, x.layout()); + DenseTensorMeta values_meta( + x.dtype(), {non_zero_num}, x.non_zero_elements().layout()); phi::DenseTensor indices = phi::Empty(dev_ctx, 
std::move(indices_meta)); phi::DenseTensor values = phi::Empty(dev_ctx, std::move(values_meta)); int64_t* coo_indices = indices.mutable_data(place); @@ -174,7 +175,8 @@ void SparseCooToCsrKernel(const Context& dev_ctx, DenseTensorMeta crows_meta( DataType::INT64, {batchs * (rows + 1)}, DataLayout::NCHW); DenseTensorMeta cols_meta(DataType::INT64, {non_zero_num}, DataLayout::NCHW); - DenseTensorMeta values_meta(x.dtype(), {non_zero_num}, x.layout()); + DenseTensorMeta values_meta( + x.dtype(), {non_zero_num}, x.non_zero_elements().layout()); phi::DenseTensor non_zero_crows( phi::make_intrusive(place), std::move(crows_meta)); @@ -362,3 +364,33 @@ PD_REGISTER_KERNEL(sparse_csr_to_dense, int16_t, int, int64_t) {} + +PD_REGISTER_KERNEL(coo_values, + CPU, + ALL_LAYOUT, + phi::sparse::CooValuesKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} + +PD_REGISTER_KERNEL(csr_values, + CPU, + ALL_LAYOUT, + phi::sparse::CsrValuesKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} diff --git a/paddle/phi/kernels/sparse/gpu/convolution.cu.h b/paddle/phi/kernels/sparse/gpu/convolution.cu.h index 5b928817f64d7..5662a4fac71c5 100644 --- a/paddle/phi/kernels/sparse/gpu/convolution.cu.h +++ b/paddle/phi/kernels/sparse/gpu/convolution.cu.h @@ -98,21 +98,21 @@ __global__ void ScatterKernel(const T* input, } } -template -inline int* SortedAndUniqueIndex(const Context& dev_ctx, - const int* rulebook_ptr, - const int len, - DenseTensor* out_index, - DenseTensor* unique_key, - DenseTensor* unique_value) { +template +inline IntT* SortedAndUniqueIndex(const Context& dev_ctx, + const IntT* rulebook_ptr, + const int len, + DenseTensor* out_index, + DenseTensor* unique_key, + DenseTensor* unique_value) { phi::IndexKernel>( dev_ctx, out_index, kps::IdentityFunctor()); phi::IndexKernel>( dev_ctx, unique_value, kps::IdentityFunctor()); - phi::backends::gpu::GpuMemcpyAsync(unique_key->data(), + phi::backends::gpu::GpuMemcpyAsync(unique_key->data(), rulebook_ptr, - sizeof(int) * len, + sizeof(IntT) * len, #ifdef PADDLE_WITH_HIP hipMemcpyDeviceToDevice, #else @@ -126,19 +126,19 @@ inline int* SortedAndUniqueIndex(const Context& dev_ctx, #else thrust::sort_by_key(thrust::cuda::par.on(dev_ctx.stream()), #endif - unique_key->data(), - unique_key->data() + len, + unique_key->data(), + unique_key->data() + len, out_index->data()); // 4. 
unique - thrust::pair new_end = + thrust::pair new_end = #ifdef PADDLE_WITH_HIP thrust::unique_by_key(thrust::hip::par.on(dev_ctx.stream()), #else thrust::unique_by_key(thrust::cuda::par.on(dev_ctx.stream()), #endif - unique_key->data(), - unique_key->data() + len, + unique_key->data(), + unique_key->data() + len, unique_value->data()); return new_end.first; } @@ -159,7 +159,7 @@ __global__ void SetFlagAndUpdateCounterKernel(const int* indexs, for (int i = tid; i < n; i += gridDim.x * blockDim.x) { int index = indexs[i]; - int kernel_index = rulebook_ptr[index]; + T kernel_index = rulebook_ptr[index]; rulebook_ptr[index + rulebook_len] = -1; rulebook_ptr[index + 2 * rulebook_len] = -1; rulebook_ptr[index] = -1; @@ -183,18 +183,18 @@ __global__ void SetFlagAndUpdateCounterKernel(const int* indexs, * rulebook_out_indexs: the output index in rulebook **/ template -__global__ void UpdateIndexKernel(const int* unique_keys, +__global__ void UpdateIndexKernel(const T* unique_keys, const int* unique_values, const int* out_indexs, - const int non_zero_num, + const int64_t non_zero_num, const int rulebook_len, const Dims4D out_dims, T* out_indices, T* rulebook_out_indexs) { int tid = threadIdx.x + blockIdx.x * blockDim.x; for (int i = tid; i < non_zero_num; i += gridDim.x * blockDim.x) { - const int index = unique_keys[i]; - int batch, x, y, z; + const T index = unique_keys[i]; + T batch, x, y, z; phi::funcs::sparse::IndexToPoint( index, out_dims, &batch, &x, &y, &z); // get out indices @@ -207,7 +207,7 @@ __global__ void UpdateIndexKernel(const int* unique_keys, int start = unique_values[i]; int end = i == non_zero_num - 1 ? rulebook_len : unique_values[i + 1]; // max(end-start) = kernel_size - for (int j = start; j < end; j++) { + for (T j = start; j < end; j++) { rulebook_out_indexs[out_indexs[j]] = i; } } @@ -215,7 +215,7 @@ __global__ void UpdateIndexKernel(const int* unique_keys, // brief: calculation the distance between start and end template -__global__ void DistanceKernel(const T* start, const T* end, int* distance) { +__global__ void DistanceKernel(const T* start, const T* end, T* distance) { if (threadIdx.x == 0) { *distance = end - start; } @@ -249,7 +249,7 @@ __global__ void ProductRuleBookKernel(const T* x_indices, const bool subm, T* rulebook, int* counter, - int* in_indexs) { + T* in_indexs) { int tid = threadIdx.x + blockIdx.x * blockDim.x; extern __shared__ int counter_buf[]; // kernel_size const int kernel_size = kernel_dims[3] * kernel_dims[2] * kernel_dims[1]; @@ -261,10 +261,10 @@ __global__ void ProductRuleBookKernel(const T* x_indices, for (int i = tid; i < non_zero_num; i += gridDim.x * blockDim.x) { int kernel_index = 0; - int batch = x_indices[i]; - int in_z = x_indices[i + non_zero_num]; - int in_y = x_indices[i + 2 * non_zero_num]; - int in_x = x_indices[i + 3 * non_zero_num]; + T batch = x_indices[i]; + T in_z = x_indices[i + non_zero_num]; + T in_y = x_indices[i + 2 * non_zero_num]; + T in_x = x_indices[i + 3 * non_zero_num]; if (subm) { in_indexs[i] = PointToIndex(batch, in_x, in_y, in_z, x_dims); } @@ -283,9 +283,9 @@ __global__ void ProductRuleBookKernel(const T* x_indices, kx, ky, kz)) { - int out_z = (in_z + paddings[1] - kz * dilations[1]) / strides[1]; - int out_y = (in_y + paddings[2] - ky * dilations[2]) / strides[2]; - int out_x = (in_x + paddings[3] - kx * dilations[3]) / strides[3]; + T out_z = (in_z + paddings[1] - kz * dilations[1]) / strides[1]; + T out_y = (in_y + paddings[2] - ky * dilations[2]) / strides[2]; + T out_x = (in_x + paddings[3] - kx * 
dilations[3]) / strides[3]; in_i = i; out_index = phi::funcs::sparse::PointToIndex( batch, out_x, out_y, out_z, out_dims); @@ -321,7 +321,7 @@ __global__ void ProductRuleBookKernel(const T* x_indices, // 5. update the out_index by unique_key, uniqe_value and the index of // unique_value: // the new out_index: 0, 2, 3, 2, 3, 0, 1 -template +template int ProductRuleBook(const Context& dev_ctx, const SparseCooTensor& x, const std::vector& kernel_sizes, @@ -334,23 +334,26 @@ int ProductRuleBook(const Context& dev_ctx, DenseTensor* counter_per_kernel, DenseTensor* offsets_per_kernel, DenseTensor* out_index, - DenseTensor* unique_key, DenseTensor* unique_value, SparseCooTensor* out, std::vector* h_counter, std::vector* h_offsets) { + // TODO(zhangkaihuo): use PD_DISPATCH_INTEGRAL_TYPES for secondary dispatch + auto indices_dtype = paddle::experimental::CppTypeToDataType::Type(); const int64_t non_zero_num = x.nnz(); const auto& non_zero_indices = x.non_zero_indices(); - const int* indices_ptr = non_zero_indices.data(); + const IntT* indices_ptr = non_zero_indices.data(); DenseTensor in_indexs = phi::Empty( - dev_ctx, DenseTensorMeta(DataType::INT32, {x.nnz()}, DataLayout::NCHW)); + dev_ctx, DenseTensorMeta(indices_dtype, {x.nnz()}, DataLayout::NCHW)); int* counter_ptr = counter_per_kernel->data(); int* offsets_ptr = offsets_per_kernel->data(); int kernel_size = kernel_sizes[0] * kernel_sizes[1] * kernel_sizes[2]; const int rulebook_rows = 3; const int rulebook_cols = kernel_size * non_zero_num; - rulebook->ResizeAndAllocate({rulebook_rows, rulebook_cols}); - int* rulebook_ptr = rulebook->data(); + DenseTensorMeta rulebook_meta( + indices_dtype, {rulebook_rows, rulebook_cols}, DataLayout::NCHW); + *rulebook = phi::Empty(dev_ctx, std::move(rulebook_meta)); + IntT* rulebook_ptr = rulebook->data(); const auto x_dims = x.dims(); Dims4D d_x_dims(x_dims[0], x_dims[3], x_dims[2], x_dims[1]); @@ -366,39 +369,39 @@ int ProductRuleBook(const Context& dev_ctx, auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, non_zero_num, 1); - ProductRuleBookKernel<<>>(indices_ptr, - d_x_dims, - d_kernel_dims, - d_out_dims, - non_zero_num, - d_paddings, - d_dilations, - d_strides, - subm, - rulebook_ptr, - counter_ptr, - in_indexs.data()); + ProductRuleBookKernel<<>>(indices_ptr, + d_x_dims, + d_kernel_dims, + d_out_dims, + non_zero_num, + d_paddings, + d_dilations, + d_strides, + subm, + rulebook_ptr, + counter_ptr, + in_indexs.data()); // 2. remove -1 #ifdef PADDLE_WITH_HIP - int* last = thrust::remove(thrust::hip::par.on(dev_ctx.stream()), + IntT* last = thrust::remove(thrust::hip::par.on(dev_ctx.stream()), #else - int* last = thrust::remove(thrust::cuda::par.on(dev_ctx.stream()), + IntT* last = thrust::remove(thrust::cuda::par.on(dev_ctx.stream()), #endif - rulebook_ptr, - rulebook_ptr + rulebook_rows * rulebook_cols, - -1); + rulebook_ptr, + rulebook_ptr + rulebook_rows * rulebook_cols, + -1); - DistanceKernel<<<1, 1, 0, dev_ctx.stream()>>>( + DistanceKernel<<<1, 1, 0, dev_ctx.stream()>>>( rulebook_ptr, last, rulebook_ptr + 3 * kernel_size * non_zero_num - 1); - int rulebook_len = 0; + IntT rulebook_len = 0; phi::backends::gpu::GpuMemcpyAsync( &rulebook_len, rulebook_ptr + 3 * kernel_size * non_zero_num - 1, - sizeof(int), + sizeof(IntT), #ifdef PADDLE_WITH_HIP hipMemcpyDeviceToHost, #else @@ -415,11 +418,10 @@ int ProductRuleBook(const Context& dev_ctx, // and then the intermediate output index is subtracted from the input index // to obain the rulebook. 
// get difference - int32_t* A_key_ptr = rulebook_ptr + 2 * rulebook_len; - int32_t* B_key_ptr = in_indexs.data(); - DenseTensor A_val = phi::Empty( - dev_ctx, - DenseTensorMeta(DataType::INT32, {rulebook_len}, DataLayout::NCHW)); + IntT* A_key_ptr = rulebook_ptr + 2 * rulebook_len; + IntT* B_key_ptr = in_indexs.data(); + DenseTensorMeta val_meta(DataType::INT32, {rulebook_len}, DataLayout::NCHW); + DenseTensor A_val = phi::Empty(dev_ctx, std::move(val_meta)); DenseTensor B_val = phi::Empty( dev_ctx, DenseTensorMeta(DataType::INT32, {x.nnz()}, DataLayout::NCHW)); phi::IndexKernel>( @@ -428,10 +430,8 @@ int ProductRuleBook(const Context& dev_ctx, dev_ctx, &B_val, kps::IdentityFunctor()); DenseTensor key_result = phi::Empty( dev_ctx, - DenseTensorMeta(DataType::INT32, {rulebook_len + 1}, DataLayout::NCHW)); - DenseTensor val_result = phi::Empty( - dev_ctx, - DenseTensorMeta(DataType::INT32, {rulebook_len}, DataLayout::NCHW)); + DenseTensorMeta(indices_dtype, {rulebook_len + 1}, DataLayout::NCHW)); + DenseTensor val_result = phi::Empty(dev_ctx, std::move(val_meta)); #ifdef PADDLE_WITH_HIP thrust::exclusive_scan(thrust::hip::par.on(dev_ctx.stream()), @@ -454,7 +454,7 @@ int ProductRuleBook(const Context& dev_ctx, dev_ctx.stream()); dev_ctx.Wait(); - thrust::pair end; + thrust::pair end; // Because set_diff does not support duplicate data, set_diff is performed // separately for each segment of data. // TODO(zhangkaihuo): Using hashtable here may get better performance, @@ -462,7 +462,7 @@ int ProductRuleBook(const Context& dev_ctx, for (int i = 0; i < kernel_size; i++) { int start = offsets[i]; int stop = i == kernel_size - 1 ? rulebook_len : offsets[i + 1]; - int* key_result_start = (i == 0 ? key_result.data() : end.first); + IntT* key_result_start = (i == 0 ? key_result.data() : end.first); int* val_result_start = i == 0 ? 
val_result.data() : end.second; end = #ifdef PADDLE_WITH_HIP @@ -480,14 +480,14 @@ int ProductRuleBook(const Context& dev_ctx, val_result_start); } - DistanceKernel<<<1, 1, 0, dev_ctx.stream()>>>( - key_result.data(), + DistanceKernel<<<1, 1, 0, dev_ctx.stream()>>>( + key_result.data(), end.first, - key_result.data() + rulebook_len); - int len = 0; + key_result.data() + rulebook_len); + IntT len = 0; phi::backends::gpu::GpuMemcpyAsync(&len, - key_result.data() + rulebook_len, - sizeof(int), + key_result.data() + rulebook_len, + sizeof(IntT), #ifdef PADDLE_WITH_HIP hipMemcpyDeviceToHost, #else @@ -497,10 +497,10 @@ int ProductRuleBook(const Context& dev_ctx, dev_ctx.Wait(); // set the diff value = -1, and update counter auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, len, 1); - SetFlagAndUpdateCounterKernel<<>>( + SetFlagAndUpdateCounterKernel<<>>( val_result.data(), len, rulebook_len, @@ -509,18 +509,18 @@ int ProductRuleBook(const Context& dev_ctx, counter_ptr); // remove -1 #ifdef PADDLE_WITH_HIP - int* last = thrust::remove(thrust::hip::par.on(dev_ctx.stream()), + IntT* last = thrust::remove(thrust::hip::par.on(dev_ctx.stream()), #else - int* last = thrust::remove(thrust::cuda::par.on(dev_ctx.stream()), + IntT* last = thrust::remove(thrust::cuda::par.on(dev_ctx.stream()), #endif - rulebook_ptr, - rulebook_ptr + 3 * rulebook_len, - -1); - DistanceKernel<<<1, 1, 0, dev_ctx.stream()>>>( - rulebook_ptr, last, key_result.data() + rulebook_len); + rulebook_ptr, + rulebook_ptr + 3 * rulebook_len, + -1); + DistanceKernel<<<1, 1, 0, dev_ctx.stream()>>>( + rulebook_ptr, last, key_result.data() + rulebook_len); phi::backends::gpu::GpuMemcpyAsync(&rulebook_len, - key_result.data() + rulebook_len, - sizeof(int), + key_result.data() + rulebook_len, + sizeof(IntT), #ifdef PADDLE_WITH_HIP hipMemcpyDeviceToHost, #else @@ -563,42 +563,47 @@ int ProductRuleBook(const Context& dev_ctx, cudaMemcpyDeviceToHost, dev_ctx.stream()); #endif - rulebook->Resize({rulebook_rows, rulebook_len}); + rulebook->Resize({rulebook_rows, static_cast(rulebook_len)}); // 3. 
sorted or merge the out index - out_index->ResizeAndAllocate({rulebook_len}); - unique_value->ResizeAndAllocate({rulebook_len}); - unique_key->ResizeAndAllocate({rulebook_len}); + out_index->ResizeAndAllocate({static_cast(rulebook_len)}); + unique_value->ResizeAndAllocate({static_cast(rulebook_len)}); + DenseTensor unique_key = phi::Empty( + dev_ctx, + DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), + {static_cast(rulebook_len)}, + DataLayout::NCHW)); int* out_index_ptr = out_index->data(); int* unique_value_ptr = unique_value->data(); - int* unique_key_ptr = unique_key->data(); - - int* new_end = SortedAndUniqueIndex(dev_ctx, - rulebook_ptr + 2 * rulebook_len, - rulebook_len, - out_index, - unique_key, - unique_value); + IntT* unique_key_ptr = unique_key.data(); + + IntT* new_end = + SortedAndUniqueIndex(dev_ctx, + rulebook_ptr + 2 * rulebook_len, + rulebook_len, + out_index, + &unique_key, + unique_value); // thrust::distance doesn't support stream parameters // const int out_non_zero_num = thrust::distance(unique_key_ptr, // new_end.first); - DistanceKernel<<<1, 1>>>( + DistanceKernel<<<1, 1>>>( unique_key_ptr, new_end, rulebook_ptr + rulebook_rows * rulebook_cols - 1); - int out_non_zero_num = 0; + IntT out_non_zero_num = 0; #ifdef PADDLE_WITH_HIP phi::backends::gpu::GpuMemcpyAsync( &out_non_zero_num, rulebook_ptr + rulebook_rows * rulebook_cols - 1, - sizeof(int), + sizeof(IntT), hipMemcpyDeviceToHost, dev_ctx.stream()); #else phi::backends::gpu::GpuMemcpyAsync( &out_non_zero_num, rulebook_ptr + rulebook_rows * rulebook_cols - 1, - sizeof(int), + sizeof(IntT), cudaMemcpyDeviceToHost, dev_ctx.stream()); #endif @@ -607,27 +612,29 @@ int ProductRuleBook(const Context& dev_ctx, // 5. update out_indices and rulebook by unique_value_ptr const int64_t sparse_dim = 4; DenseTensorMeta indices_meta( - DataType::INT32, {sparse_dim, out_non_zero_num}, DataLayout::NCHW); - DenseTensorMeta values_meta( - x.dtype(), {out_non_zero_num, kernel_sizes[4]}, x.layout()); + indices_dtype, {sparse_dim, out_non_zero_num}, DataLayout::NCHW); + DenseTensorMeta values_meta(x.dtype(), + {out_non_zero_num, kernel_sizes[4]}, + x.non_zero_elements().layout()); phi::DenseTensor out_indices = phi::Empty(dev_ctx, std::move(indices_meta)); phi::DenseTensor out_values = phi::Empty(dev_ctx, std::move(values_meta)); - int* out_indices_ptr = out_indices.data(); + IntT* out_indices_ptr = out_indices.data(); config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, out_non_zero_num, 1); - UpdateIndexKernel<<>>(unique_key_ptr, - unique_value_ptr, - out_index_ptr, - out_non_zero_num, - rulebook_len, - d_out_dims, - out_indices_ptr, - rulebook_ptr + 2 * rulebook_len); + UpdateIndexKernel<<>>( + unique_key_ptr, + unique_value_ptr, + out_index_ptr, + out_non_zero_num, + rulebook_len, + d_out_dims, + out_indices_ptr, + rulebook_ptr + 2 * rulebook_len); out->SetMember(out_indices, out_values, out_dims, true); return rulebook_len; } diff --git a/paddle/phi/kernels/sparse/gpu/convolution_grad_kernel.cu b/paddle/phi/kernels/sparse/gpu/convolution_grad_kernel.cu index 4db0a0b0011b5..2b61be7289646 100644 --- a/paddle/phi/kernels/sparse/gpu/convolution_grad_kernel.cu +++ b/paddle/phi/kernels/sparse/gpu/convolution_grad_kernel.cu @@ -12,16 +12,20 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
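
A brief reminder of the rulebook layout produced by ProductRuleBook above, since the gather/GEMM/scatter steps in the kernels below index into it: the rulebook is a {3, rulebook_len} tensor whose rows hold, per contribution, the kernel offset, the input nonzero index, and the output nonzero index. The forward kernel gathers input features through row 1 and scatters output features through row 2; the backward pass additionally gathers out_grad through row 2 and scatters x_grad back through row 1. Sketch with invented numbers:

// rulebook (shape {3, rulebook_len}) as used by Gather/Scatter in this patch:
//   row 0: kernel offset index   e.g. 0  0  1  2 ...
//   row 1: input nonzero index   e.g. 0  3  1  0 ...   -> rulebook_ptr + rulebook_len
//   row 2: output nonzero index  e.g. 2  0  1  2 ...   -> rulebook_ptr + 2 * rulebook_len
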
*/ +#include "glog/logging.h" #include "paddle/phi/backends/gpu/gpu_context.h" #include "paddle/phi/backends/gpu/gpu_info.h" #include "paddle/phi/backends/gpu/gpu_launch_config.h" #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/core/tensor_meta.h" +#include "paddle/phi/kernels/copy_kernel.h" #include "paddle/phi/kernels/funcs/blas/blas.h" #include "paddle/phi/kernels/funcs/math_function.h" #include "paddle/phi/kernels/sparse/convolution_grad_kernel.h" #include "paddle/phi/kernels/sparse/gpu/convolution.cu.h" +#include "paddle/phi/api/ext/dispatch.h" + namespace phi { namespace sparse { @@ -33,24 +37,24 @@ namespace sparse { //] // x_grad = out_grad * transpose(kenrel) // kernel_grad = transpose(x) * out_grad -template -void Conv3dGradKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& rulebook, - const DenseTensor& kernel, - const DenseTensor& out_grad, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - const int groups, - const bool subm, - DenseTensor* x_grad, - DenseTensor* kernel_grad) { +template +void Conv3dGradGPUKernel(const GPUContext& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const DenseTensor& rulebook, + const SparseCooTensor& out_grad, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* x_grad, + DenseTensor* kernel_grad) { const auto& kernel_dims = kernel.dims(); const int kernel_size = kernel_dims[0] * kernel_dims[1] * kernel_dims[2]; const int in_channels = kernel_dims[3]; const int out_channels = kernel_dims[4]; - const int* rulebook_ptr = rulebook.data(); + const IntT* rulebook_ptr = rulebook.data(); const int rulebook_len = rulebook.dims()[1]; @@ -70,23 +74,31 @@ void Conv3dGradKernel(const Context& dev_ctx, T* in_features_ptr = in_features.data(); T* d_x_features_ptr = d_x_features.data(); T* out_grad_features_ptr = out_grad_features.data(); - kernel_grad->ResizeAndAllocate(kernel_dims); + *kernel_grad = phi::EmptyLike(dev_ctx, kernel); T* d_kernel_ptr = kernel_grad->data(); - phi::funcs::SetConstant set_zero; + phi::funcs::SetConstant set_zero; set_zero(dev_ctx, kernel_grad, static_cast(0.0f)); int half_kernel_size = kernel_size / 2; - auto blas = phi::funcs::GetBlas(dev_ctx); - x_grad->ResizeAndAllocate(x.non_zero_elements().dims()); - T* x_grad_values_ptr = x_grad->data(); - set_zero(dev_ctx, x_grad, static_cast(0.0f)); + auto blas = phi::funcs::GetBlas(dev_ctx); + DenseTensor x_grad_indices = + phi::EmptyLike(dev_ctx, x.non_zero_indices()); + DenseTensor x_grad_values = phi::EmptyLike(dev_ctx, x.non_zero_elements()); + T* x_grad_values_ptr = x_grad_values.data(); + set_zero(dev_ctx, &x_grad_values, static_cast(0.0f)); set_zero(dev_ctx, &d_x_features, static_cast(0.0f)); + phi::Copy(dev_ctx, + x.non_zero_indices(), + dev_ctx.GetPlace(), + false, + &x_grad_indices); + x_grad->SetMember(x_grad_indices, x_grad_values, x.dims(), true); - std::vector offsets(kernel_size + 1), counter(kernel_size, 0), + std::vector offsets(kernel_size + 1), counter(kernel_size, 0), h_counter(rulebook_len, 0); phi::backends::gpu::GpuMemcpyAsync(&h_counter[0], rulebook_ptr, - rulebook_len * sizeof(int), + rulebook_len * sizeof(IntT), #ifdef PADDLE_WITH_HIP hipMemcpyDeviceToHost, #else @@ -99,7 +111,7 @@ void Conv3dGradKernel(const Context& dev_ctx, for (int i = 0; i < rulebook_len; i++) { counter[h_counter[i]] += 1; } - int offset = 0, max_count = 0; + IntT offset = 0, max_count = 
0; for (int i = 0; i < kernel_size; i++) { offsets[i] = offset; offset += counter[i]; @@ -110,15 +122,16 @@ void Conv3dGradKernel(const Context& dev_ctx, offsets[kernel_size] = offset; if (subm) { - phi::funcs::sparse::SubmPreProcess(dev_ctx, - x, - kernel, - out_grad, - in_channels, - out_channels, - half_kernel_size, - kernel_grad, - x_grad); + phi::funcs::sparse::SubmPreProcess( + dev_ctx, + x, + kernel, + out_grad.non_zero_elements(), + in_channels, + out_channels, + half_kernel_size, + kernel_grad, + &x_grad_values); if (max_count == 0) { return; } @@ -126,25 +139,26 @@ void Conv3dGradKernel(const Context& dev_ctx, auto config = phi::backends::gpu::GetGpuLaunchConfig1D( dev_ctx, rulebook_len * in_channels, 1); - GatherKernel<<>>(x.non_zero_elements().data(), - rulebook_ptr + rulebook_len, - in_features_ptr, - rulebook_len, - in_channels); + GatherKernel<<>>(x.non_zero_elements().data(), + rulebook_ptr + rulebook_len, + in_features_ptr, + rulebook_len, + in_channels); config = phi::backends::gpu::GetGpuLaunchConfig1D( dev_ctx, rulebook_len * out_channels, 1); - GatherKernel<<>>(out_grad.data(), - rulebook_ptr + rulebook_len * 2, - out_grad_features_ptr, - rulebook_len, - out_channels); + GatherKernel<<>>( + out_grad.non_zero_elements().data(), + rulebook_ptr + rulebook_len * 2, + out_grad_features_ptr, + rulebook_len, + out_channels); const T* kernel_ptr = kernel.data(); for (int i = 0; i < kernel_size; i++) { @@ -189,18 +203,22 @@ void Conv3dGradKernel(const Context& dev_ctx, } // 4. scatter - x_grad->ResizeAndAllocate(x.non_zero_elements().dims()); + // x_grad->ResizeAndAllocate(x.non_zero_elements().dims()); DenseTensorMeta index_meta(DataType::INT32, {rulebook_len}, DataLayout::NCHW); DenseTensor out_index = phi::Empty(dev_ctx, std::move(index_meta)); - DenseTensor unique_key = phi::Empty(dev_ctx, std::move(index_meta)); + DenseTensor unique_key = phi::Empty( + dev_ctx, + DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), + {rulebook_len}, + DataLayout::NCHW)); DenseTensor unique_value = phi::Empty(dev_ctx, std::move(index_meta)); - SortedAndUniqueIndex(dev_ctx, - rulebook_ptr + rulebook_len, - rulebook_len, - &out_index, - &unique_key, - &unique_value); + SortedAndUniqueIndex(dev_ctx, + rulebook_ptr + rulebook_len, + rulebook_len, + &out_index, + &unique_key, + &unique_value); config = phi::backends::gpu::GetGpuLaunchConfig1D( dev_ctx, rulebook_len * in_channels, 1); @@ -218,6 +236,36 @@ void Conv3dGradKernel(const Context& dev_ctx, subm); } +template +void Conv3dGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const DenseTensor& rulebook, + const SparseCooTensor& out_grad, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* x_grad, + DenseTensor* kernel_grad) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "Conv3dGradGPUKernel", ([&] { + Conv3dGradGPUKernel(dev_ctx, + x, + kernel, + rulebook, + out_grad, + paddings, + dilations, + strides, + groups, + subm, + x_grad, + kernel_grad); + })); +} + } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/sparse/gpu/convolution_kernel.cu b/paddle/phi/kernels/sparse/gpu/convolution_kernel.cu index 214e689e9370a..2d212eadffac1 100644 --- a/paddle/phi/kernels/sparse/gpu/convolution_kernel.cu +++ b/paddle/phi/kernels/sparse/gpu/convolution_kernel.cu @@ -19,29 +19,25 @@ limitations under the License. 
*/ #include "paddle/phi/kernels/sparse/convolution_kernel.h" #include "paddle/phi/kernels/sparse/gpu/convolution.cu.h" +#include "paddle/phi/api/ext/dispatch.h" + namespace phi { namespace sparse { -/** - * x: (N, D, H, W, C) - * kernel: (D, H, W, C, OC) - * out: (N, D, H, W, OC) -**/ -template -void Conv3dKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& kernel, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - const int groups, - const bool subm, - SparseCooTensor* out, - DenseTensor* rulebook) { +template +void Conv3dGPUKernel(const GPUContext& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* out, + DenseTensor* rulebook) { // update padding and dilation // Currently, only support x.layout is NDHWC, groups = 1 // if x.layout != NDHWC then transpose(x), transpose(weight) - const auto& x_dims = x.dims(); const auto& kernel_dims = kernel.dims(); int kernel_size = kernel_dims[0] * kernel_dims[1] * kernel_dims[2]; @@ -67,7 +63,6 @@ void Conv3dKernel(const Context& dev_ctx, DenseTensor offsets_per_kernel = phi::Empty(dev_ctx, std::move(offsets_meta)); DenseTensorMeta index_meta(DataType::INT32, {1}, DataLayout::NCHW); DenseTensor out_index = phi::Empty(dev_ctx, std::move(index_meta)); - DenseTensor unique_key = phi::Empty(dev_ctx, std::move(index_meta)); DenseTensor unique_value = phi::Empty(dev_ctx, std::move(index_meta)); std::vector subm_paddings(paddings), subm_strides(strides); @@ -75,28 +70,26 @@ void Conv3dKernel(const Context& dev_ctx, phi::funcs::sparse::ResetSubmKernelSizeAndStrides( kernel.dims(), &subm_paddings, &subm_strides); } - - int n = ProductRuleBook(dev_ctx, - x, - kernel_sizes, - subm_paddings, - dilations, - subm_strides, - out_dims, - subm, - rulebook, - &counter_per_kernel, - &offsets_per_kernel, - &out_index, - &unique_key, - &unique_value, - out, - &h_counter, - &offsets); + int n = ProductRuleBook(dev_ctx, + x, + kernel_sizes, + subm_paddings, + dilations, + subm_strides, + out_dims, + subm, + rulebook, + &counter_per_kernel, + &offsets_per_kernel, + &out_index, + &unique_value, + out, + &h_counter, + &offsets); const int* counter_ptr = counter_per_kernel.data(); const int* offsets_ptr = counter_per_kernel.data(); - const int* rulebook_ptr = rulebook->data(); + const IntT* rulebook_ptr = rulebook->data(); // 2. gather DenseTensorMeta in_features_meta( @@ -109,22 +102,22 @@ void Conv3dKernel(const Context& dev_ctx, phi::Empty(dev_ctx, std::move(out_features_meta)); T* in_features_ptr = in_features.data(); T* out_features_ptr = out_features.data(); - phi::funcs::SetConstant set_zero; + phi::funcs::SetConstant set_zero; set_zero(dev_ctx, &out_features, static_cast(0.0f)); auto config = phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, n * in_channels, 1); - GatherKernel<<>>(x.non_zero_elements().data(), - rulebook_ptr + n, - in_features_ptr, - n, - in_channels); + GatherKernel<<>>(x.non_zero_elements().data(), + rulebook_ptr + n, + in_features_ptr, + n, + in_channels); // 3. 
call gemm for every werght - auto blas = phi::funcs::GetBlas(dev_ctx); + auto blas = phi::funcs::GetBlas(dev_ctx); auto* out_values = out->mutable_non_zero_elements(); T* out_values_ptr = out_values->data(); @@ -168,6 +161,36 @@ void Conv3dKernel(const Context& dev_ctx, out_channels, out_values_ptr); } +/** + * x: (N, D, H, W, C) + * kernel: (D, H, W, C, OC) + * out: (N, D, H, W, OC) +**/ +template +void Conv3dKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& kernel, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + const int groups, + const bool subm, + SparseCooTensor* out, + DenseTensor* rulebook) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "Conv3dGPUKernel", ([&] { + Conv3dGPUKernel(dev_ctx, + x, + kernel, + paddings, + dilations, + strides, + groups, + subm, + out, + rulebook); + })); +} } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/sparse/gpu/sparse_mask_kernel.cu b/paddle/phi/kernels/sparse/gpu/sparse_mask_kernel.cu new file mode 100644 index 0000000000000..d206d6bbc195c --- /dev/null +++ b/paddle/phi/kernels/sparse/gpu/sparse_mask_kernel.cu @@ -0,0 +1,140 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
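For reference, the gather, GEMM, scatter structure of the forward kernel can be restated schematically as the loop below. The names mirror the variables of Conv3dGPUKernel but are schematic, and this is a simplified outline rather than the literal device code: for every kernel offset i the rulebook lists h_counter[i] input/output pairs starting at row offsets[i], so the gathered input block is a (counter x in_channels) matrix multiplied by the (in_channels x out_channels) weight slice of that offset.

for (int i = 0; i < kernel_size; i++) {
  if (h_counter[i] <= 0) continue;
  const int M = h_counter[i];                 // rulebook entries for offset i
  const int K = in_channels;
  const int N = out_channels;
  const T* tmp_in = in_features_ptr + offsets[i] * K;   // gathered inputs (M x K)
  const T* tmp_kernel = kernel_ptr + i * K * N;          // weight slice    (K x N)
  T* tmp_out = out_features_ptr + offsets[i] * N;        // partial outputs (M x N)
  blas.GEMM(CblasNoTrans, CblasNoTrans, M, N, K,
            static_cast<T>(1), tmp_in, tmp_kernel,
            static_cast<T>(0), tmp_out);
}

The partial outputs are then scattered (summed) back into out->non_zero_elements() using the output half of the rulebook.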
*/ + +#include "paddle/phi/backends/gpu/gpu_info.h" +#include "paddle/phi/backends/gpu/gpu_launch_config.h" +#include "paddle/phi/core/ddim.h" +#include "paddle/phi/core/enforce.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/empty_kernel.h" +#include "paddle/phi/kernels/funcs/math_function.h" +#include "paddle/phi/kernels/sparse/sparse_mask_kernel.h" + +#include "paddle/phi/api/ext/dispatch.h" + +namespace phi { +namespace sparse { + +template +__global__ void MaskKernel(const T* x_ptr, + const IntT* indices_ptr, + const int64_t* sparse_offsets, + const int64_t non_zero_num, + const int cols, + const int sparse_dim, + T* out_values_ptr) { + CUDA_KERNEL_LOOP_TYPE(i, non_zero_num * cols, int64_t) { + int64_t out_i = i / cols; + int64_t col_i = i - out_i * cols; + int64_t index = 0; + for (int j = 0; j < sparse_dim; j++) { + index += indices_ptr[j * non_zero_num + i] * sparse_offsets[j]; + } + out_values_ptr[out_i * cols + col_i] = x_ptr[index * cols + col_i]; + } +} + +template +void SparseMaskGPUKernel(const GPUContext& dev_ctx, + const DenseTensor& x, + const SparseCooTensor& mask, + SparseCooTensor* out) { + const DDim& dims = x.dims(); + PADDLE_ENFORCE_EQ( + x.dims(), + mask.dims(), + phi::errors::InvalidArgument("the input x and mask must have the shape")); + const DenseTensor& indices = mask.non_zero_indices(); + const DenseTensor& values = mask.non_zero_elements(); + int sparse_dim = indices.dims().size(); + DenseTensor sparse_offsets = phi::Empty( + dev_ctx, + DenseTensorMeta(DataType::INT64, {sparse_dim}, DataLayout::NCHW)); + std::vector h_sparse_offsets(sparse_dim); + int64_t offset = 1; + for (int i = sparse_dim - 1; i >= 0; i--) { + h_sparse_offsets[i] = offset; + offset *= dims[i]; + } + + phi::backends::gpu::GpuMemcpyAsync(sparse_offsets.data(), + &h_sparse_offsets[0], + sizeof(int64_t) * sparse_dim, +#ifdef PADDLE_WITH_HIP + hipMemcpyHostToDevice, +#else + cudaMemcpyHostToDevice, +#endif + dev_ctx.stream()); + + DenseTensor out_indices = phi::EmptyLike(dev_ctx, indices); + DenseTensor out_values = phi::EmptyLike(dev_ctx, values); + + phi::Copy(dev_ctx, indices, dev_ctx.GetPlace(), false, &out_indices); + + const IntT* indices_ptr = indices.data(); + T* out_values_ptr = out_values.data(); + const T* x_ptr = x.data(); + const int64_t non_zero_num = mask.nnz(); + auto dims_2d = flatten_to_2d(dims, sparse_dim); + const int cols = dims_2d[1]; + + auto config = + phi::backends::gpu::GetGpuLaunchConfig1D(dev_ctx, non_zero_num * cols, 1); + MaskKernel<<>>( + x_ptr, + indices_ptr, + sparse_offsets.data(), + non_zero_num, + cols, + sparse_dim, + out_values_ptr); + + out->SetMember(out_indices, out_values, dims, true); +} + +/** + * @brief Filter the DenseTensor x by the + * mask.non_zero_indices() and output a SparseCooTensor + * x and mask must have the same shape. 
+**/ +template +void SparseMaskKernel(const Context& dev_ctx, + const DenseTensor& x, + const SparseCooTensor& mask, + SparseCooTensor* out) { + PD_DISPATCH_INTEGRAL_TYPES( + mask.non_zero_indices().dtype(), "SparseMaskGPUKernel", ([&] { + SparseMaskGPUKernel(dev_ctx, x, mask, out); + })); +} + +} // namespace sparse +} // namespace phi + +PD_REGISTER_KERNEL(sparse_mask, + GPU, + ALL_LAYOUT, + phi::sparse::SparseMaskKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(1).SetDataLayout(phi::DataLayout::SPARSE_COO); +} diff --git a/paddle/phi/kernels/sparse/gpu/sparse_pool_grad_kernel.cu b/paddle/phi/kernels/sparse/gpu/sparse_pool_grad_kernel.cu index 1048dd1be0c01..8657e7319d8ca 100644 --- a/paddle/phi/kernels/sparse/gpu/sparse_pool_grad_kernel.cu +++ b/paddle/phi/kernels/sparse/gpu/sparse_pool_grad_kernel.cu @@ -12,24 +12,28 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. */ +#include "paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h" + #include "paddle/phi/backends/gpu/gpu_context.h" #include "paddle/phi/backends/gpu/gpu_info.h" #include "paddle/phi/backends/gpu/gpu_launch_config.h" #include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/copy_kernel.h" +#include "paddle/phi/kernels/empty_kernel.h" #include "paddle/phi/kernels/funcs/math_function.h" #include "paddle/phi/kernels/funcs/pooling.h" #include "paddle/phi/kernels/funcs/sparse/convolution.h" -#include "paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h" +#include "paddle/phi/api/ext/dispatch.h" namespace phi { namespace sparse { -template +template __global__ void MaxPoolGradCudaKernel(const T* in_features_ptr, const T* out_features_ptr, const T* out_grad_ptr, - const int* rulebook_ptr, + const IntT* rulebook_ptr, const int n, const int rulebook_len, const int channels, @@ -38,8 +42,8 @@ __global__ void MaxPoolGradCudaKernel(const T* in_features_ptr, CUDA_KERNEL_LOOP_TYPE(i, n * channels, int64_t) { int real_i = i / channels; int c = i - real_i * channels; - int in_i = rulebook_ptr[real_i]; - int out_i = rulebook_ptr[real_i + rulebook_len]; + IntT in_i = rulebook_ptr[real_i]; + IntT out_i = rulebook_ptr[real_i + rulebook_len]; grad_functor.compute(in_features_ptr[in_i * channels + c], out_features_ptr[out_i * channels + c], out_grad_ptr[out_i * channels + c], @@ -48,23 +52,23 @@ __global__ void MaxPoolGradCudaKernel(const T* in_features_ptr, } } -template -void MaxPoolGradKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& rulebook, - const SparseCooTensor& out, - const DenseTensor& out_grad, - const std::vector& kernel_sizes, - DenseTensor* x_grad) { +template +void MaxPoolGradGPUKernel(const GPUContext& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& rulebook, + const SparseCooTensor& out, + const SparseCooTensor& out_grad, + const std::vector& kernel_sizes, + SparseCooTensor* x_grad) { int kernel_size = kernel_sizes[0] * kernel_sizes[1] * kernel_sizes[2]; const int in_channels = x.dims()[4]; int rulebook_len = rulebook.dims()[1]; - const int* rulebook_ptr = rulebook.data(); - std::vector offsets(kernel_size + 1), counter(kernel_size, 0), + const IntT* rulebook_ptr = rulebook.data(); + std::vector offsets(kernel_size + 1), counter(kernel_size, 0), h_counter(kernel_size); phi::backends::gpu::GpuMemcpyAsync(&h_counter[0], rulebook_ptr, - rulebook_len * sizeof(int), + rulebook_len 
* sizeof(IntT), #ifdef PADDLE_WITH_HIP hipMemcpyDeviceToHost, #else @@ -80,10 +84,20 @@ void MaxPoolGradKernel(const Context& dev_ctx, const T* in_features_ptr = x.non_zero_elements().data(); const T* out_features_ptr = out.non_zero_elements().data(); - const T* out_grad_ptr = out_grad.data(); - T* x_grad_ptr = x_grad->data(); - phi::funcs::SetConstant set_zero; - set_zero(dev_ctx, x_grad, static_cast(0.0f)); + const T* out_grad_ptr = out_grad.non_zero_elements().data(); + // TODO(zhangkaihuo): call phi::sparse::EmptyLike + DenseTensor x_grad_indices = + phi::EmptyLike(dev_ctx, x.non_zero_indices()); + DenseTensor x_grad_values = phi::EmptyLike(dev_ctx, x.non_zero_elements()); + x_grad->SetMember(x_grad_indices, x_grad_values, x.dims(), true); + T* x_grad_ptr = x_grad_values.data(); + phi::funcs::SetConstant set_zero; + set_zero(dev_ctx, &x_grad_values, static_cast(0.0f)); + phi::Copy(dev_ctx, + x.non_zero_indices(), + dev_ctx.GetPlace(), + false, + &x_grad_indices); for (int i = 0; i < kernel_size; i++) { if (counter[i] <= 0) { @@ -92,10 +106,10 @@ void MaxPoolGradKernel(const Context& dev_ctx, auto config = phi::backends::gpu::GetGpuLaunchConfig1D( dev_ctx, counter[i] * in_channels, 1); - MaxPoolGradCudaKernel<<>>( + MaxPoolGradCudaKernel<<>>( in_features_ptr, out_features_ptr, out_grad_ptr, @@ -107,6 +121,21 @@ void MaxPoolGradKernel(const Context& dev_ctx, } } +template +void MaxPoolGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& rulebook, + const SparseCooTensor& out, + const SparseCooTensor& out_grad, + const std::vector& kernel_sizes, + SparseCooTensor* x_grad) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "MaxPoolGradGPUKernel", ([&] { + MaxPoolGradGPUKernel( + dev_ctx, x, rulebook, out, out_grad, kernel_sizes, x_grad); + })); +} + } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/sparse/gpu/sparse_pool_kernel.cu b/paddle/phi/kernels/sparse/gpu/sparse_pool_kernel.cu index 0f6a0d13b1ddb..a59cd3c7a5a78 100644 --- a/paddle/phi/kernels/sparse/gpu/sparse_pool_kernel.cu +++ b/paddle/phi/kernels/sparse/gpu/sparse_pool_kernel.cu @@ -12,19 +12,22 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
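The element-wise rule applied by MaxPoolGradCudaKernel above routes the upstream gradient back through the rulebook: each rulebook entry pairs one input non-zero with the output it competed for, and the grad functor compares the stored input value with the pooled output value. The standard max-pool backward rule this corresponds to is sketched below; it is a simplification, and the exact phi::funcs::MaxPoolGrad implementation may differ in details such as how ties or accumulation are handled.

template <typename T>
inline void MaxPoolGradScalarSketch(T in, T out, T out_grad, T* in_grad) {
  // only an input that attained the pooled maximum receives the gradient
  *in_grad += (in == out) ? out_grad : static_cast<T>(0);
}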
*/ +#include "paddle/phi/kernels/sparse/sparse_pool_kernel.h" + #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/core/tensor_meta.h" #include "paddle/phi/kernels/funcs/pooling.h" #include "paddle/phi/kernels/funcs/sparse/convolution.h" #include "paddle/phi/kernels/sparse/gpu/convolution.cu.h" -#include "paddle/phi/kernels/sparse/sparse_pool_kernel.h" + +#include "paddle/phi/api/ext/dispatch.h" namespace phi { namespace sparse { -template +template __global__ void MaxPoolCudaKernel(const T* in_features_ptr, - const int* rulebook_ptr, + const IntT* rulebook_ptr, const int n, const int rulebook_len, const int channels, @@ -33,8 +36,8 @@ __global__ void MaxPoolCudaKernel(const T* in_features_ptr, CUDA_KERNEL_LOOP_TYPE(i, n * channels, int64_t) { int real_i = i / channels; int channel_i = i - real_i * channels; - int in_i = rulebook_ptr[real_i]; - int out_i = rulebook_ptr[real_i + rulebook_len]; + IntT in_i = rulebook_ptr[real_i]; + IntT out_i = rulebook_ptr[real_i + rulebook_len]; max_pool_functor.compute(in_features_ptr[in_i * channels + channel_i], &out_features_ptr[out_i * channels + channel_i]); } @@ -45,15 +48,15 @@ __global__ void MaxPoolCudaKernel(const T* in_features_ptr, * kernel: (D, H, W, C, OC) * out: (N, D, H, W, OC) **/ -template -void MaxPoolKernel(const Context& dev_ctx, - const SparseCooTensor& x, - const std::vector& kernel_sizes, - const std::vector& paddings, - const std::vector& dilations, - const std::vector& strides, - SparseCooTensor* out, - DenseTensor* rulebook) { +template +void MaxPoolGPUKernel(const GPUContext& dev_ctx, + const SparseCooTensor& x, + const std::vector& kernel_sizes, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + SparseCooTensor* out, + DenseTensor* rulebook) { const auto& x_dims = x.dims(); int kernel_size = kernel_sizes[0] * kernel_sizes[1] * kernel_sizes[2]; const std::vector& real_kernel_sizes = @@ -70,29 +73,27 @@ void MaxPoolKernel(const Context& dev_ctx, DenseTensor offsets_per_kernel = phi::Empty(dev_ctx, std::move(counter_meta)); DenseTensorMeta index_meta(DataType::INT32, {1}, DataLayout::NCHW); DenseTensor out_index = phi::Empty(dev_ctx, std::move(index_meta)); - DenseTensor unique_key = phi::Empty(dev_ctx, std::move(index_meta)); DenseTensor unique_value = phi::Empty(dev_ctx, std::move(index_meta)); // 1. 
product rulebook - int rulebook_len = ProductRuleBook(dev_ctx, - x, - real_kernel_sizes, - paddings, - dilations, - strides, - out_dims, - false, - rulebook, - &counter_per_kernel, - &offsets_per_kernel, - &out_index, - &unique_key, - &unique_value, - out, - &counter, - &offsets); - - const int* rulebook_ptr = rulebook->data(); + int rulebook_len = ProductRuleBook(dev_ctx, + x, + real_kernel_sizes, + paddings, + dilations, + strides, + out_dims, + false, + rulebook, + &counter_per_kernel, + &offsets_per_kernel, + &out_index, + &unique_value, + out, + &counter, + &offsets); + + const IntT* rulebook_ptr = rulebook->data(); T* out_features_ptr = out->mutable_non_zero_elements()->data(); const T* in_features_ptr = x.non_zero_elements().data(); @@ -113,10 +114,10 @@ void MaxPoolKernel(const Context& dev_ctx, auto config = phi::backends::gpu::GetGpuLaunchConfig1D( dev_ctx, counter[i] * in_channels, 1); - MaxPoolCudaKernel<<>>( + MaxPoolCudaKernel<<>>( in_features_ptr, rulebook_ptr + offsets[i] + rulebook_len, counter[i], @@ -126,6 +127,28 @@ void MaxPoolKernel(const Context& dev_ctx, } } +template +void MaxPoolKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const std::vector& kernel_sizes, + const std::vector& paddings, + const std::vector& dilations, + const std::vector& strides, + SparseCooTensor* out, + DenseTensor* rulebook) { + PD_DISPATCH_INTEGRAL_TYPES( + x.non_zero_indices().dtype(), "MaxPoolGPUKernel", ([&] { + MaxPoolGPUKernel(dev_ctx, + x, + kernel_sizes, + paddings, + dilations, + strides, + out, + rulebook); + })); +} + } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/sparse/gpu/sparse_utils_kernel.cu b/paddle/phi/kernels/sparse/gpu/sparse_utils_kernel.cu index 8048180e425ea..1109baf92e302 100644 --- a/paddle/phi/kernels/sparse/gpu/sparse_utils_kernel.cu +++ b/paddle/phi/kernels/sparse/gpu/sparse_utils_kernel.cu @@ -257,7 +257,8 @@ void SparseCsrToCooKernel(const Context& dev_ctx, const auto place = dev_ctx.GetPlace(); DenseTensorMeta indices_meta( DataType::INT64, {sparse_dim, non_zero_num}, DataLayout::NCHW); - DenseTensorMeta values_meta(x.dtype(), {non_zero_num}, x.layout()); + DenseTensorMeta values_meta( + x.dtype(), {non_zero_num}, x.non_zero_elements().layout()); DenseTensorMeta offsets_meta(DataType::INT32, {batchs}, DataLayout::NCHW); DenseTensor indices = phi::Empty(dev_ctx, std::move(indices_meta)); DenseTensor values = phi::Empty(dev_ctx, std::move(values_meta)); @@ -385,7 +386,8 @@ void SparseCooToCsrKernel(const Context& dev_ctx, DenseTensorMeta crows_meta( DataType::INT64, {batchs * (rows + 1)}, DataLayout::NCHW); DenseTensorMeta cols_meta(DataType::INT64, {non_zero_num}, DataLayout::NCHW); - DenseTensorMeta values_meta(x.dtype(), {non_zero_num}, x.layout()); + DenseTensorMeta values_meta( + x.dtype(), {non_zero_num}, x.non_zero_elements().layout()); phi::DenseTensor non_zero_crows( phi::make_intrusive(place), std::move(crows_meta)); @@ -633,3 +635,33 @@ PD_REGISTER_KERNEL(sparse_csr_to_dense, int16_t, int, int64_t) {} + +PD_REGISTER_KERNEL(coo_values, + GPU, + ALL_LAYOUT, + phi::sparse::CooValuesKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} + +PD_REGISTER_KERNEL(csr_values, + GPU, + ALL_LAYOUT, + phi::sparse::CsrValuesKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} diff --git 
a/paddle/phi/kernels/sparse/sparse_mask_kernel.h b/paddle/phi/kernels/sparse/sparse_mask_kernel.h new file mode 100644 index 0000000000000..210412abd8620 --- /dev/null +++ b/paddle/phi/kernels/sparse/sparse_mask_kernel.h @@ -0,0 +1,30 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once + +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/sparse_coo_tensor.h" + +namespace phi { +namespace sparse { + +template +void SparseMaskKernel(const Context& dev_ctx, + const DenseTensor& x, + const SparseCooTensor& mask, + SparseCooTensor* out); + +} // namespace sparse +} // namespace phi diff --git a/paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h b/paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h index 572ade76281bc..2f7366a010aaa 100644 --- a/paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h +++ b/paddle/phi/kernels/sparse/sparse_pool_grad_kernel.h @@ -26,20 +26,18 @@ void MaxPoolGradKernel(const Context& dev_ctx, const SparseCooTensor& x, const DenseTensor& rulebook, const SparseCooTensor& out, - const DenseTensor& out_grad, + const SparseCooTensor& out_grad, const std::vector& kernel_sizes, - DenseTensor* x_grad); + SparseCooTensor* x_grad); template -DenseTensor MaxPoolGrad(const Context& dev_ctx, - const SparseCooTensor& x, - const DenseTensor& rulebook, - const SparseCooTensor& out, - const DenseTensor& out_grad, - const std::vector& kernel_sizes) { - DenseTensor x_grad = phi::Empty( - dev_ctx, - DenseTensorMeta(x.dtype(), x.non_zero_elements().dims(), x.layout())); +SparseCooTensor MaxPoolGrad(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& rulebook, + const SparseCooTensor& out, + const SparseCooTensor& out_grad, + const std::vector& kernel_sizes) { + SparseCooTensor x_grad; MaxPoolGradKernel( dev_ctx, x, rulebook, out, out_grad, kernel_sizes, &x_grad); return x_grad; diff --git a/paddle/phi/kernels/sparse/sparse_pool_kernel.h b/paddle/phi/kernels/sparse/sparse_pool_kernel.h index bfadbf72e300f..d5248a1ad250e 100644 --- a/paddle/phi/kernels/sparse/sparse_pool_kernel.h +++ b/paddle/phi/kernels/sparse/sparse_pool_kernel.h @@ -39,11 +39,7 @@ SparseCooTensor MaxPool(const Context& dev_ctx, const std::vector& dilations, const std::vector& strides, DenseTensor* rulebook) { - DenseTensor indices = phi::Empty( - dev_ctx, DenseTensorMeta(DataType::INT32, {1}, DataLayout::NCHW)); - DenseTensor values = - phi::Empty(dev_ctx, DenseTensorMeta(x.dtype(), {1}, x.layout())); - SparseCooTensor coo(indices, values, x.dims()); + SparseCooTensor coo; MaxPoolKernel( dev_ctx, x, kernel_sizes, paddings, dilations, strides, &coo, rulebook); return coo; diff --git a/paddle/phi/kernels/sparse/sparse_utils_grad_kernel.cc b/paddle/phi/kernels/sparse/sparse_utils_grad_kernel.cc new file mode 100644 index 0000000000000..35329807e7798 --- /dev/null +++ b/paddle/phi/kernels/sparse/sparse_utils_grad_kernel.cc @@ -0,0 +1,98 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/kernels/sparse/sparse_utils_grad_kernel.h" +#include "paddle/phi/core/kernel_registry.h" +#include "paddle/phi/kernels/sparse/sparse_mask_kernel.h" + +namespace phi { +namespace sparse { + +template +void CooValuesGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& out_grad, + SparseCooTensor* x_grad) { + x_grad->SetMember(x.non_zero_indices(), out_grad, x.dims(), true); +} + +template +void SparseCooToDenseGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& out_grad, + SparseCooTensor* x_grad) { + SparseMaskKernel(dev_ctx, out_grad, x, x_grad); +} + +} // namespace sparse +} // namespace phi + +PD_REGISTER_KERNEL(coo_values_grad, + CPU, + ALL_LAYOUT, + phi::sparse::CooValuesGradKernel, + float, + double, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} + +PD_REGISTER_KERNEL(sparse_coo_to_dense_grad, + CPU, + ALL_LAYOUT, + phi::sparse::SparseCooToDenseGradKernel, + float, + double, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} + +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) +PD_REGISTER_KERNEL(coo_values_grad, + GPU, + ALL_LAYOUT, + phi::sparse::CooValuesGradKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} +PD_REGISTER_KERNEL(sparse_coo_to_dense_grad, + GPU, + ALL_LAYOUT, + phi::sparse::SparseCooToDenseGradKernel, + float, + double, + phi::dtype::float16, + uint8_t, + int8_t, + int16_t, + int, + int64_t) { + kernel->InputAt(0).SetDataLayout(phi::DataLayout::SPARSE_COO); +} +#endif diff --git a/paddle/phi/kernels/sparse/sparse_utils_grad_kernel.h b/paddle/phi/kernels/sparse/sparse_utils_grad_kernel.h new file mode 100644 index 0000000000000..0775582bf1fb8 --- /dev/null +++ b/paddle/phi/kernels/sparse/sparse_utils_grad_kernel.h @@ -0,0 +1,36 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
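SparseCooToDenseGradKernel above is a one-liner because converting COO to dense only scatters the stored values; its backward therefore just samples the dense out_grad at x's non-zero coordinates via SparseMaskKernel. A small worked example (values illustrative) for a 2x2 tensor with non-zeros at (0,0) and (1,1):

  x.non_zero_indices() = [[0, 1],
                          [0, 1]]        // coordinates (0,0) and (1,1)
  out_grad             = [[g00, g01],
                          [g10, g11]]    // dense gradient
  x_grad               = COO(indices = x.non_zero_indices(),
                             values  = [g00, g11])   // g01, g10 are dropped

Positions outside x's sparsity pattern do not propagate gradient because they were never represented in the forward input.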
*/ + +#pragma once + +#include "paddle/phi/core/dense_tensor.h" +#include "paddle/phi/core/sparse_coo_tensor.h" + +namespace phi { +namespace sparse { + +template +void CooValuesGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& out_grad, + SparseCooTensor* x_grad); + +template +void SparseCooToDenseGradKernel(const Context& dev_ctx, + const SparseCooTensor& x, + const DenseTensor& out_grad, + SparseCooTensor* x_grad); + +} // namespace sparse +} // namespace phi diff --git a/paddle/phi/kernels/sparse/sparse_utils_kernel.h b/paddle/phi/kernels/sparse/sparse_utils_kernel.h index da05eb3d3cf76..961cd9f829eb2 100644 --- a/paddle/phi/kernels/sparse/sparse_utils_kernel.h +++ b/paddle/phi/kernels/sparse/sparse_utils_kernel.h @@ -133,5 +133,19 @@ DenseTensor SparseCsrToDense(const Context& dev_ctx, const SparseCsrTensor& x) { return dense; } +template +void CooValuesKernel(const Context& dev_ctx, + const SparseCooTensor& x, + DenseTensor* out) { + *out = x.non_zero_elements(); +} + +template +void CsrValuesKernel(const Context& dev_ctx, + const SparseCsrTensor& x, + DenseTensor* out) { + *out = x.non_zero_elements(); +} + } // namespace sparse } // namespace phi diff --git a/paddle/phi/kernels/strided_slice_grad_kernel.cc b/paddle/phi/kernels/strided_slice_grad_kernel.cc new file mode 100644 index 0000000000000..38dd360ea66c2 --- /dev/null +++ b/paddle/phi/kernels/strided_slice_grad_kernel.cc @@ -0,0 +1,69 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
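CooValuesKernel (added to sparse_utils_kernel.h above) just aliases x.non_zero_elements(), so CooValuesGradKernel only needs to re-attach x's indices to the incoming dense gradient. A minimal usage sketch, assuming a phi::GPUContext dev_ctx, a SparseCooTensor x and a dense gradient d_values with the same shape as the values already exist (all three are assumptions for illustration):

  // forward: expose the non-zero values as a dense tensor
  phi::DenseTensor values;
  phi::sparse::CooValuesKernel<float>(dev_ctx, x, &values);

  // backward: wrap the dense gradient of the values back into COO form,
  // reusing x's indices and dims
  phi::SparseCooTensor values_grad;
  phi::sparse::CooValuesGradKernel<float>(dev_ctx, x, d_values, &values_grad);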
+ +#include "paddle/phi/kernels/strided_slice_grad_kernel.h" + +#include "paddle/phi/core/kernel_registry.h" + +namespace phi { + +template +void StridedSliceGradKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& out_grad, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + DenseTensor* x_grad) { + std::vector infer_flags(axes.size(), 1); + std::vector decrease_axis; + StridedSliceRawGradKernel(dev_ctx, + x, + out_grad, + axes, + starts, + ends, + strides, + infer_flags, + decrease_axis, + x_grad); +} + +} // namespace phi + +PD_REGISTER_KERNEL(strided_slice_grad, + CPU, + ALL_LAYOUT, + phi::StridedSliceGradKernel, + bool, + int, + int64_t, + float, + double, + phi::dtype::complex, + phi::dtype::complex) {} +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) +PD_REGISTER_KERNEL(strided_slice_grad, + GPU, + ALL_LAYOUT, + phi::StridedSliceGradKernel, + bool, + int, + int64_t, + float, + double, + phi::dtype::complex, + phi::dtype::complex) {} +#endif diff --git a/paddle/phi/kernels/strided_slice_grad_kernel.h b/paddle/phi/kernels/strided_slice_grad_kernel.h index 07fba9d27bfe9..21d01310b662f 100644 --- a/paddle/phi/kernels/strided_slice_grad_kernel.h +++ b/paddle/phi/kernels/strided_slice_grad_kernel.h @@ -19,6 +19,18 @@ namespace phi { +template +void StridedSliceRawGradKernel(const Context& dev_ctx, + const DenseTensor& x, + const DenseTensor& out_grad, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + const std::vector& infer_flags, + const std::vector& decrease_axis, + DenseTensor* x_grad); + template void StridedSliceGradKernel(const Context& dev_ctx, const DenseTensor& x, @@ -27,8 +39,6 @@ void StridedSliceGradKernel(const Context& dev_ctx, const IntArray& starts, const IntArray& ends, const IntArray& strides, - const std::vector& infer_flags, - const std::vector& decrease_axis, DenseTensor* x_grad); template diff --git a/paddle/phi/kernels/strided_slice_kernel.cc b/paddle/phi/kernels/strided_slice_kernel.cc new file mode 100644 index 0000000000000..547d574cd78d0 --- /dev/null +++ b/paddle/phi/kernels/strided_slice_kernel.cc @@ -0,0 +1,60 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include "paddle/phi/kernels/strided_slice_kernel.h" + +#include "paddle/phi/core/kernel_registry.h" + +namespace phi { + +template +void StridedSliceKernel(const Context& dev_ctx, + const DenseTensor& x, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + DenseTensor* out) { + std::vector infer_flags(axes.size(), 1); + std::vector decrease_axis; + StridedSliceRawKernel( + dev_ctx, x, axes, starts, ends, strides, infer_flags, decrease_axis, out); +} + +} // namespace phi + +PD_REGISTER_KERNEL(strided_slice, + CPU, + ALL_LAYOUT, + phi::StridedSliceKernel, + bool, + int, + int64_t, + float, + double, + phi::dtype::complex, + phi::dtype::complex) {} +#if defined(PADDLE_WITH_CUDA) || defined(PADDLE_WITH_HIP) +PD_REGISTER_KERNEL(strided_slice, + GPU, + ALL_LAYOUT, + phi::StridedSliceKernel, + bool, + int, + int64_t, + float, + double, + phi::dtype::complex, + phi::dtype::complex) {} +#endif diff --git a/paddle/phi/kernels/strided_slice_kernel.h b/paddle/phi/kernels/strided_slice_kernel.h index fd90d81b8556c..2c8b373bf03a8 100644 --- a/paddle/phi/kernels/strided_slice_kernel.h +++ b/paddle/phi/kernels/strided_slice_kernel.h @@ -19,6 +19,17 @@ namespace phi { +template +void StridedSliceRawKernel(const Context& dev_ctx, + const DenseTensor& x, + const std::vector& axes, + const IntArray& starts, + const IntArray& ends, + const IntArray& strides, + const std::vector& infer_flags, + const std::vector& decrease_axis, + DenseTensor* out); + template void StridedSliceKernel(const Context& dev_ctx, const DenseTensor& x, @@ -26,8 +37,6 @@ void StridedSliceKernel(const Context& dev_ctx, const IntArray& starts, const IntArray& ends, const IntArray& strides, - const std::vector& infer_flags, - const std::vector& decrease_axis, DenseTensor* out); template diff --git a/paddle/phi/kernels/top_k_grad_kernel.h b/paddle/phi/kernels/top_k_grad_kernel.h index f577b982c575d..e4fde92dad8fd 100644 --- a/paddle/phi/kernels/top_k_grad_kernel.h +++ b/paddle/phi/kernels/top_k_grad_kernel.h @@ -14,16 +14,17 @@ #pragma once +#include "paddle/phi/common/scalar.h" #include "paddle/phi/core/dense_tensor.h" namespace phi { template void TopkGradKernel(const Context& dev_ctx, - const DenseTensor& out_grad, const DenseTensor& x, const DenseTensor& indices, - int k, + const DenseTensor& out_grad, + const Scalar& k, int axis, bool largest, bool sorted, diff --git a/paddle/phi/kernels/trace_grad_kernel.h b/paddle/phi/kernels/trace_grad_kernel.h index ef17986e75593..4884e53b4efe5 100644 --- a/paddle/phi/kernels/trace_grad_kernel.h +++ b/paddle/phi/kernels/trace_grad_kernel.h @@ -20,8 +20,8 @@ namespace phi { template void TraceGradKernel(const Context& ctx, - const DenseTensor& out_grad, const DenseTensor& x, + const DenseTensor& out_grad, int offset, int axis1, int axis2, diff --git a/paddle/phi/kernels/truncated_gaussian_random_kernel.h b/paddle/phi/kernels/truncated_gaussian_random_kernel.h index 2781b79520a5d..773bfc8c71eac 100644 --- a/paddle/phi/kernels/truncated_gaussian_random_kernel.h +++ b/paddle/phi/kernels/truncated_gaussian_random_kernel.h @@ -14,149 +14,11 @@ #pragma once -#include -#include - #include "paddle/phi/common/int_array.h" #include "paddle/phi/core/dense_tensor.h" -#include "paddle/phi/core/device_context.h" -#include "paddle/phi/infermeta/nullary.h" namespace phi { -// reference: https://gist.github.com/lakshayg/d80172fe5ae3c5d2c2aedb53c250320e -template -T Erfinv(T x) { - if (x < -1 || x > 1) { - return std::numeric_limits::quiet_NaN(); - } else if 
(x == 1.0) { - return std::numeric_limits::infinity(); - } else if (x == -1.0) { - return -std::numeric_limits::infinity(); - } - - const T LN2 = 6.931471805599453094172321214581e-1; - - const T A0 = 1.1975323115670912564578e0; - const T A1 = 4.7072688112383978012285e1; - const T A2 = 6.9706266534389598238465e2; - const T A3 = 4.8548868893843886794648e3; - const T A4 = 1.6235862515167575384252e4; - const T A5 = 2.3782041382114385731252e4; - const T A6 = 1.1819493347062294404278e4; - const T A7 = 8.8709406962545514830200e2; - - const T B0 = 1.0000000000000000000e0; - const T B1 = 4.2313330701600911252e1; - const T B2 = 6.8718700749205790830e2; - const T B3 = 5.3941960214247511077e3; - const T B4 = 2.1213794301586595867e4; - const T B5 = 3.9307895800092710610e4; - const T B6 = 2.8729085735721942674e4; - const T B7 = 5.2264952788528545610e3; - - const T C0 = 1.42343711074968357734e0; - const T C1 = 4.63033784615654529590e0; - const T C2 = 5.76949722146069140550e0; - const T C3 = 3.64784832476320460504e0; - const T C4 = 1.27045825245236838258e0; - const T C5 = 2.41780725177450611770e-1; - const T C6 = 2.27238449892691845833e-2; - const T C7 = 7.74545014278341407640e-4; - - const T D0 = 1.4142135623730950488016887e0; - const T D1 = 2.9036514445419946173133295e0; - const T D2 = 2.3707661626024532365971225e0; - const T D3 = 9.7547832001787427186894837e-1; - const T D4 = 2.0945065210512749128288442e-1; - const T D5 = 2.1494160384252876777097297e-2; - const T D6 = 7.7441459065157709165577218e-4; - const T D7 = 1.4859850019840355905497876e-9; - - const T E0 = 6.65790464350110377720e0; - const T E1 = 5.46378491116411436990e0; - const T E2 = 1.78482653991729133580e0; - const T E3 = 2.96560571828504891230e-1; - const T E4 = 2.65321895265761230930e-2; - const T E5 = 1.24266094738807843860e-3; - const T E6 = 2.71155556874348757815e-5; - const T E7 = 2.01033439929228813265e-7; - - const T F0 = 1.414213562373095048801689e0; - const T F1 = 8.482908416595164588112026e-1; - const T F2 = 1.936480946950659106176712e-1; - const T F3 = 2.103693768272068968719679e-2; - const T F4 = 1.112800997078859844711555e-3; - const T F5 = 2.611088405080593625138020e-5; - const T F6 = 2.010321207683943062279931e-7; - const T F7 = 2.891024605872965461538222e-15; - - T abs_x = abs(x); - - if (abs_x <= 0.85) { - T r = 0.180625 - 0.25 * x * x; - T num = - (((((((A7 * r + A6) * r + A5) * r + A4) * r + A3) * r + A2) * r + A1) * - r + - A0); - T den = - (((((((B7 * r + B6) * r + B5) * r + B4) * r + B3) * r + B2) * r + B1) * - r + - B0); - return x * num / den; - } - - T r = sqrt(LN2 - log(1.0 - abs_x)); - - T num, den; - if (r <= 5.0) { - r = r - 1.6; - num = - (((((((C7 * r + C6) * r + C5) * r + C4) * r + C3) * r + C2) * r + C1) * - r + - C0); - den = - (((((((D7 * r + D6) * r + D5) * r + D4) * r + D3) * r + D2) * r + D1) * - r + - D0); - } else { - r = r - 5.0; - num = - (((((((E7 * r + E6) * r + E5) * r + E4) * r + E3) * r + E2) * r + E1) * - r + - E0); - den = - (((((((F7 * r + F6) * r + F5) * r + F4) * r + F3) * r + F2) * r + F1) * - r + - F0); - } - - if (x < 0) { - return -num / den; - } else { - return num / den; - } -} - -template -struct TruncatedNormal { - T mean, std; - T a_normal_cdf; - T b_normal_cdf; - TruncatedNormal(T mean, T std) : mean(mean), std(std) { - auto normal_cdf = [](T x) { - return (1.0 + std::erf(x / std::sqrt(2.0))) / 2.0; - }; - a_normal_cdf = normal_cdf(-2.0); - b_normal_cdf = normal_cdf(2.0); - } - - T operator()(T value) const { - auto p = a_normal_cdf + (b_normal_cdf - a_normal_cdf) * value; - 
return std::sqrt(2.0) * Erfinv(2 * p - 1) * std + mean; - } -}; - template void TruncatedGaussianRandomKernel(const Context& dev_ctx, const std::vector& shape, diff --git a/paddle/phi/ops/compat/activation_sig.cc b/paddle/phi/ops/compat/activation_sig.cc index 34f830abe7ea3..8add832c366cf 100644 --- a/paddle/phi/ops/compat/activation_sig.cc +++ b/paddle/phi/ops/compat/activation_sig.cc @@ -139,13 +139,13 @@ KernelSignature TanhTripleGradOpArgumentMapping( KernelSignature SigmoidDoubleGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature( - "sigmoid_double_grad", {"Out", "DDX", "DOut"}, {}, {"DOutNew", "DDOut"}); + "sigmoid_double_grad", {"Out", "DOut", "DDX"}, {}, {"DOutNew", "DDOut"}); } KernelSignature SigmoidTripleGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature("sigmoid_triple_grad", - {"Out", "DDX", "DOut", "D_DDOut", "D_DOut_New"}, + {"Out", "DOut", "DDX", "D_DOut_New", "D_DDOut"}, {}, {"D_OutNew", "D_DOut", "D_DDx"}); } diff --git a/paddle/phi/ops/compat/batch_norm_sig.cc b/paddle/phi/ops/compat/batch_norm_sig.cc index 803bb50b438a5..cfd9f4def933a 100644 --- a/paddle/phi/ops/compat/batch_norm_sig.cc +++ b/paddle/phi/ops/compat/batch_norm_sig.cc @@ -59,15 +59,17 @@ KernelSignature BatchNormGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature( "batch_norm_grad", - {GradVarName("Y"), - "X", - "Scale", - "Bias", - "SavedMean", - "SavedVariance", - "ReserveSpace", - "Mean", - "Variance"}, + { + "X", + "Scale", + "Bias", + "Mean", + "Variance", + "SavedMean", + "SavedVariance", + "ReserveSpace", + GradVarName("Y"), + }, {"momentum", "epsilon", "data_layout", diff --git a/paddle/phi/ops/compat/clip_sig.cc b/paddle/phi/ops/compat/clip_sig.cc new file mode 100644 index 0000000000000..78fa6c36a5149 --- /dev/null +++ b/paddle/phi/ops/compat/clip_sig.cc @@ -0,0 +1,88 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#include "paddle/phi/core/compat/op_utils.h" +#include "paddle/utils/small_vector.h" + +namespace phi { + +KernelSignature ClipOpArgumentMapping(const ArgumentMappingContext& ctx) { + paddle::SmallVector attr_names; + attr_names.emplace_back(ctx.HasInput("Min") ? "Min" : "min"); + attr_names.emplace_back(ctx.HasInput("Max") ? 
"Max" : "max"); + if (ctx.IsDenseTensorInput("X")) { + if (ctx.HasInput("Min")) { + if (ctx.HasInput("Max")) { + return KernelSignature("clip", {"X"}, {"Min", "Max"}, {"Out"}); + } else { + return KernelSignature("clip", {"X"}, {"Min", "max"}, {"Out"}); + } + } else { + if (ctx.HasInput("Max")) { + return KernelSignature("clip", {"X"}, {"min", "Max"}, {"Out"}); + } else { + return KernelSignature("clip", {"X"}, {"min", "max"}, {"Out"}); + } + } + } else if (ctx.IsSelectedRowsInput("X")) { + if (ctx.HasInput("Min")) { + if (ctx.HasInput("Max")) { + return KernelSignature("clip_sr", {"X"}, {"Min", "Max"}, {"Out"}); + } else { + return KernelSignature("clip_sr", {"X"}, {"Min", "max"}, {"Out"}); + } + } else { + if (ctx.HasInput("Max")) { + return KernelSignature("clip_sr", {"X"}, {"min", "Max"}, {"Out"}); + } else { + return KernelSignature("clip_sr", {"X"}, {"min", "max"}, {"Out"}); + } + } + } + + return KernelSignature("unregistered", {}, {}, {}); +} + +KernelSignature ClipGradOpArgumentMapping(const ArgumentMappingContext& ctx) { + if (ctx.HasInput("Min")) { + if (ctx.HasInput("Max")) { + return KernelSignature("clip_grad", + {"X", GradVarName("Out")}, + {"Min", "Max"}, + {GradVarName("X")}); + } else { + return KernelSignature("clip_grad", + {"X", GradVarName("Out")}, + {"Min", "max"}, + {GradVarName("X")}); + } + } else { + if (ctx.HasInput("Max")) { + return KernelSignature("clip_grad", + {"X", GradVarName("Out")}, + {"min", "Max"}, + {GradVarName("X")}); + } else { + return KernelSignature("clip_grad", + {"X", GradVarName("Out")}, + {"min", "max"}, + {GradVarName("X")}); + } + } +} + +} // namespace phi + +PD_REGISTER_ARG_MAPPING_FN(clip, phi::ClipOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(clip_grad, phi::ClipGradOpArgumentMapping); diff --git a/paddle/phi/ops/compat/concat_sig.cc b/paddle/phi/ops/compat/concat_sig.cc index 21e653ccfe90f..d443f521c6146 100644 --- a/paddle/phi/ops/compat/concat_sig.cc +++ b/paddle/phi/ops/compat/concat_sig.cc @@ -23,6 +23,20 @@ KernelSignature ConcatOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature("concat", {"X"}, {"axis"}, {"Out"}); } +KernelSignature ConcatGradOpArgumentMapping(const ArgumentMappingContext& ctx) { + if (ctx.HasInput("AxisTensor")) { + return KernelSignature("concat_grad", + {"X", {GradVarName("Out")}}, + {"AxisTensor"}, + {{GradVarName("X")}}); + } + return KernelSignature("concat_grad", + {"X", {GradVarName("Out")}}, + {"axis"}, + {{GradVarName("X")}}); +} + } // namespace phi PD_REGISTER_ARG_MAPPING_FN(concat, phi::ConcatOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(concat_grad, phi::ConcatGradOpArgumentMapping); diff --git a/paddle/phi/ops/compat/conv2d_sig.cc b/paddle/phi/ops/compat/conv2d_sig.cc index 67b99f1dd619c..19e20fddcb811 100644 --- a/paddle/phi/ops/compat/conv2d_sig.cc +++ b/paddle/phi/ops/compat/conv2d_sig.cc @@ -46,7 +46,7 @@ KernelSignature Conv2dOpArgumentMapping(const ArgumentMappingContext& ctx) { KernelSignature Conv2dGradOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature("conv2d_grad", - {GradVarName("Output"), "Input", "Filter"}, + {"Input", "Filter", GradVarName("Output")}, {"strides", "paddings", "padding_algorithm", diff --git a/paddle/phi/ops/compat/conv3d_sig.cc b/paddle/phi/ops/compat/conv3d_sig.cc index a036afac82a8d..b24c08b60c950 100644 --- a/paddle/phi/ops/compat/conv3d_sig.cc +++ b/paddle/phi/ops/compat/conv3d_sig.cc @@ -33,7 +33,7 @@ KernelSignature Conv3dOpArgumentMapping(const ArgumentMappingContext& ctx) { KernelSignature 
Conv3dGradOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature("conv2d_grad", - {GradVarName("Output"), "Input", "Filter"}, + {"Input", "Filter", GradVarName("Output")}, {"strides", "paddings", "padding_algorithm", diff --git a/paddle/phi/ops/compat/depthwise_conv2d_sig.cc b/paddle/phi/ops/compat/depthwise_conv2d_sig.cc index e2b6801f73bcd..d2d7451ecafce 100644 --- a/paddle/phi/ops/compat/depthwise_conv2d_sig.cc +++ b/paddle/phi/ops/compat/depthwise_conv2d_sig.cc @@ -36,7 +36,7 @@ KernelSignature DepthwiseConv2dOpArgumentMapping( KernelSignature DepthwiseConv2dGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature("depthwise_conv2d_grad", - {GradVarName("Output"), "Input", "Filter"}, + {"Input", "Filter", GradVarName("Output")}, {"strides", "paddings", "padding_algorithm", diff --git a/paddle/phi/ops/compat/flatten_sig.cc b/paddle/phi/ops/compat/flatten_sig.cc index b72ad05ea09d8..3e8119c38cf51 100644 --- a/paddle/phi/ops/compat/flatten_sig.cc +++ b/paddle/phi/ops/compat/flatten_sig.cc @@ -31,7 +31,7 @@ KernelSignature FlattenOpArgumentMapping(const ArgumentMappingContext& ctx) { KernelSignature FlattenGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature( - "flatten_grad", {GradVarName("Out"), "XShape"}, {}, {GradVarName("X")}); + "flatten_grad", {"XShape", GradVarName("Out")}, {}, {GradVarName("X")}); } } // namespace phi diff --git a/paddle/phi/ops/compat/graph_reindex_sig.cc b/paddle/phi/ops/compat/graph_reindex_sig.cc new file mode 100644 index 0000000000000..4e1e7ccedc19d --- /dev/null +++ b/paddle/phi/ops/compat/graph_reindex_sig.cc @@ -0,0 +1,30 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/core/compat/op_utils.h" + +namespace phi { + +KernelSignature GraphReindexOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature( + "graph_reindex", + {"X", "Neighbors", "Count", "HashTable_Value", "HashTable_Index"}, + {"flag_buffer_hashtable"}, + {"Reindex_Src", "Reindex_Dst", "Out_Nodes"}); +} + +} // namespace phi + +PD_REGISTER_ARG_MAPPING_FN(graph_reindex, phi::GraphReindexOpArgumentMapping); diff --git a/paddle/phi/ops/compat/graph_sample_neighbors_sig.cc b/paddle/phi/ops/compat/graph_sample_neighbors_sig.cc new file mode 100644 index 0000000000000..dd8aaa95c583d --- /dev/null +++ b/paddle/phi/ops/compat/graph_sample_neighbors_sig.cc @@ -0,0 +1,30 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/core/compat/op_utils.h" + +namespace phi { + +KernelSignature GraphSampleNeighborsOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("graph_sample_neighbors", + {"Row", "Col_Ptr", "X", "Eids", "Perm_Buffer"}, + {"sample_size", "return_eids", "flag_perm_buffer"}, + {"Out", "Out_Count", "Out_Eids"}); +} + +} // namespace phi + +PD_REGISTER_ARG_MAPPING_FN(graph_sample_neighbors, + phi::GraphSampleNeighborsOpArgumentMapping); diff --git a/paddle/phi/ops/compat/graph_send_recv_sig.cc b/paddle/phi/ops/compat/graph_send_recv_sig.cc index fa4da0704c987..cf36b9baa2d03 100644 --- a/paddle/phi/ops/compat/graph_send_recv_sig.cc +++ b/paddle/phi/ops/compat/graph_send_recv_sig.cc @@ -28,7 +28,7 @@ KernelSignature GraphSendRecvGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature( "graph_send_recv_grad", - {GradVarName("Out"), "X", "Out", "Src_index", "Dst_index", "Dst_count"}, + {"X", "Src_index", "Dst_index", "Out", "Dst_count", GradVarName("Out")}, {"pool_type"}, {GradVarName("X")}); } diff --git a/paddle/phi/ops/compat/hierarchical_sigmoid_sig.cc b/paddle/phi/ops/compat/hierarchical_sigmoid_sig.cc index 20183d1a9b066..58c190fb657bb 100644 --- a/paddle/phi/ops/compat/hierarchical_sigmoid_sig.cc +++ b/paddle/phi/ops/compat/hierarchical_sigmoid_sig.cc @@ -38,11 +38,11 @@ KernelSignature HierarchicalSigmoidGradOpArgumentMapping( {"X", "W", "Label", - "PreOut", - GradVarName("Out"), "PathTable", "PathCode", - "Bias"}, + "Bias", + "PreOut", + GradVarName("Out")}, {"num_classes", "remote_prefetch", "trainer_id", @@ -57,11 +57,11 @@ KernelSignature HierarchicalSigmoidGradOpArgumentMapping( {"X", "W", "Label", - "PreOut", - GradVarName("Out"), "PathTable", "PathCode", - "Bias"}, + "Bias", + "PreOut", + GradVarName("Out")}, {"num_classes", "remote_prefetch", "trainer_id", diff --git a/paddle/phi/ops/compat/index_sample_sig.cc b/paddle/phi/ops/compat/index_sample_sig.cc index 0d2aed68a72a5..3b7e3f063d6c1 100644 --- a/paddle/phi/ops/compat/index_sample_sig.cc +++ b/paddle/phi/ops/compat/index_sample_sig.cc @@ -19,7 +19,7 @@ namespace phi { KernelSignature IndexSampleGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature("index_sample_grad", - {GradVarName("Out"), "X", "Index"}, + {"X", "Index", GradVarName("Out")}, {}, {GradVarName("X")}); } diff --git a/paddle/phi/ops/compat/interpolate_sig.cc b/paddle/phi/ops/compat/interpolate_sig.cc new file mode 100644 index 0000000000000..ba0e971e4ab00 --- /dev/null +++ b/paddle/phi/ops/compat/interpolate_sig.cc @@ -0,0 +1,194 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
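The signature edits above (index_sample_grad, graph_send_recv_grad, hierarchical_sigmoid_grad, and so on) all follow the same convention: forward inputs first, then forward outputs, then the gradients of outputs, matching the parameter order of the corresponding phi grad kernels. A mapping for a hypothetical operator written in that style would look like this; the op name, variable names and attribute are made up for illustration:

KernelSignature FooGradOpArgumentMapping(const ArgumentMappingContext& ctx) {
  return KernelSignature("foo_grad",
                         {"X", "Out", GradVarName("Out")},  // inputs, outputs, dOut
                         {"axis"},                          // attributes
                         {GradVarName("X")});               // grad kernel outputs
}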
*/ + +#include "paddle/phi/core/compat/op_utils.h" + +namespace phi { + +KernelSignature BilinearInterpOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("bilinear_interp_v2", + {"X", "OutSize", "SizeTensor", "Scale"}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {"Out"}); +} + +KernelSignature NearestInterpOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("nearest_interp_v2", + {"X", "OutSize", "SizeTensor", "Scale"}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {"Out"}); +} +KernelSignature TrilinearInterpOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("trilinear_interp_v2", + {"X", "OutSize", "SizeTensor", "Scale"}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {"Out"}); +} + +KernelSignature LinearInterpOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("linear_interp_v2", + {"X", "OutSize", "SizeTensor", "Scale"}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {"Out"}); +} + +KernelSignature BicubicInterpOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("bicubic_interp_v2", + {"X", "OutSize", "SizeTensor", "Scale"}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {"Out"}); +} + +KernelSignature BilinearInterpGradOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature( + "bilinear_interp_v2_grad", + {"X", "OutSize", "SizeTensor", "Scale", GradVarName("Out")}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {GradVarName("X")}); +} + +KernelSignature NearestInterpGradOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature( + "nearest_interp_v2_grad", + {"X", "OutSize", "SizeTensor", "Scale", GradVarName("Out")}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {GradVarName("X")}); +} +KernelSignature TrilinearInterpGradOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature( + "trilinear_interp_v2_grad", + {"X", "OutSize", "SizeTensor", "Scale", GradVarName("Out")}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {GradVarName("X")}); +} + +KernelSignature LinearInterpGradOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature( + "linear_interp_v2_grad", + {"X", "OutSize", "SizeTensor", "Scale", GradVarName("Out")}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {GradVarName("X")}); +} + +KernelSignature BicubicInterpGradOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature( + "bicubic_interp_v2_grad", + {"X", "OutSize", "SizeTensor", "Scale", GradVarName("Out")}, + {"data_layout", + "out_d", + "out_h", + "out_w", + "scale", + "interp_method", + "align_corners", + "align_mode"}, + {GradVarName("X")}); +} + +} // namespace phi + +PD_REGISTER_ARG_MAPPING_FN(bilinear_interp_v2, + phi::BilinearInterpOpArgumentMapping); 
+PD_REGISTER_ARG_MAPPING_FN(nearest_interp_v2, + phi::NearestInterpOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(trilinear_interp_v2, + phi::TrilinearInterpOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(linear_interp_v2, + phi::LinearInterpOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(bicubic_interp_v2, + phi::BicubicInterpOpArgumentMapping); + +PD_REGISTER_ARG_MAPPING_FN(bilinear_interp_v2_grad, + phi::BilinearInterpGradOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(nearest_interp_v2_grad, + phi::NearestInterpGradOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(trilinear_interp_v2_grad, + phi::TrilinearInterpGradOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(linear_interp_v2_grad, + phi::LinearInterpGradOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(bicubic_interp_v2_grad, + phi::BicubicInterpGradOpArgumentMapping); diff --git a/paddle/phi/ops/compat/kthvalue_sig.cc b/paddle/phi/ops/compat/kthvalue_sig.cc index e59e9de1e4382..3b1a6a45f9a0a 100644 --- a/paddle/phi/ops/compat/kthvalue_sig.cc +++ b/paddle/phi/ops/compat/kthvalue_sig.cc @@ -20,7 +20,7 @@ namespace phi { KernelSignature KthvalueGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature("kthvalue_grad", - {GradVarName("Out"), "X", "Indices"}, + {"X", "Indices", GradVarName("Out")}, {"k", "axis", "keepdim"}, {GradVarName("X")}); } diff --git a/paddle/phi/ops/compat/lgamma_sig.cc b/paddle/phi/ops/compat/lgamma_sig.cc index 968ad4923ba7b..452ba5e2b45a1 100644 --- a/paddle/phi/ops/compat/lgamma_sig.cc +++ b/paddle/phi/ops/compat/lgamma_sig.cc @@ -18,7 +18,7 @@ namespace phi { KernelSignature LgammaGradOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature( - "lgamma_grad", {GradVarName("Out"), "X"}, {}, {GradVarName("X")}); + "lgamma_grad", {"X", GradVarName("Out")}, {}, {GradVarName("X")}); } } // namespace phi diff --git a/paddle/phi/ops/compat/masked_select_sig.cc b/paddle/phi/ops/compat/masked_select_sig.cc index 8083b123bcff5..ec0eb90315bc1 100644 --- a/paddle/phi/ops/compat/masked_select_sig.cc +++ b/paddle/phi/ops/compat/masked_select_sig.cc @@ -24,7 +24,7 @@ KernelSignature MaskedSelectOpArgumentMapping( KernelSignature MaskedSelectGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature("masked_select_grad", - {GradVarName("Y"), "X", "Mask"}, + {"X", "Mask", GradVarName("Y")}, {}, {GradVarName("X")}); } diff --git a/paddle/phi/ops/compat/nll_loss_sig.cc b/paddle/phi/ops/compat/nll_loss_sig.cc index f274d7f77c5c0..87a060ce7a672 100644 --- a/paddle/phi/ops/compat/nll_loss_sig.cc +++ b/paddle/phi/ops/compat/nll_loss_sig.cc @@ -29,7 +29,7 @@ KernelSignature NllLossGradOpArgumentMapping( const ArgumentMappingContext& ctx) { return KernelSignature( "nll_loss_grad", - {"X", "Label", "Total_weight", "Weight", GradVarName("Out")}, + {"X", "Label", "Weight", "Total_weight", GradVarName("Out")}, {"ignore_index", "reduction"}, {GradVarName("X")}); } diff --git a/paddle/phi/ops/compat/prelu_sig.cc b/paddle/phi/ops/compat/prelu_sig.cc index bd296c5e95318..43e5f20a92676 100644 --- a/paddle/phi/ops/compat/prelu_sig.cc +++ b/paddle/phi/ops/compat/prelu_sig.cc @@ -16,13 +16,19 @@ namespace phi { +KernelSignature PReluOpArgumentMapping(const ArgumentMappingContext& ctx) { + return KernelSignature( + "prelu", {"X", "Alpha"}, {"data_format", "mode"}, {"Out"}); +} + KernelSignature PReluGradOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature("prelu_grad", {"X", "Alpha", GradVarName("Out")}, - {"mode", "data_format"}, + {"data_format", 
"mode"}, {GradVarName("X"), GradVarName("Alpha")}); } } // namespace phi +PD_REGISTER_ARG_MAPPING_FN(prelu, phi::PReluOpArgumentMapping); PD_REGISTER_ARG_MAPPING_FN(prelu_grad, phi::PReluGradOpArgumentMapping); diff --git a/paddle/phi/ops/compat/range_sig.cc b/paddle/phi/ops/compat/range_sig.cc new file mode 100644 index 0000000000000..d48898bd8487c --- /dev/null +++ b/paddle/phi/ops/compat/range_sig.cc @@ -0,0 +1,17 @@ +/* Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/phi/core/compat/op_utils.h" + +PD_REGISTER_BASE_KERNEL_NAME(range, arange); diff --git a/paddle/phi/ops/compat/reshape_sig.cc b/paddle/phi/ops/compat/reshape_sig.cc index 6b528efe6d056..04f64e4035273 100644 --- a/paddle/phi/ops/compat/reshape_sig.cc +++ b/paddle/phi/ops/compat/reshape_sig.cc @@ -47,7 +47,7 @@ KernelSignature ReshapeGradOpArgumentMapping( KernelSignature ReshapeDoubleGradOpArgumentMapping( const ArgumentMappingContext& ctx) { - return KernelSignature("reshape_double_grad", {"DDX"}, {}, {"DDOut"}); + return KernelSignature("reshape_double_grad", {"DOut", "DDX"}, {}, {"DDOut"}); } } // namespace phi diff --git a/paddle/phi/ops/compat/softmax_with_cross_entropy_sig.cc b/paddle/phi/ops/compat/softmax_with_cross_entropy_sig.cc new file mode 100644 index 0000000000000..9cfc5ded90a49 --- /dev/null +++ b/paddle/phi/ops/compat/softmax_with_cross_entropy_sig.cc @@ -0,0 +1,53 @@ +// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. 
+ +#include "paddle/phi/core/compat/op_utils.h" + +namespace phi { + +KernelSignature SoftmaxWithCrossEntropyOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("cross_entropy_with_softmax", + {"Logits", "Label"}, + {"soft_label", + "use_softmax", + "numeric_stable_mode", + "ignore_index", + "axis"}, + {"Softmax", "Loss"}); +} + +KernelSignature SoftmaxWithCrossEntropyGradOpArgumentMapping( + const ArgumentMappingContext& ctx) { + return KernelSignature("cross_entropy_with_softmax_grad", + {"Label", "Softmax", GradVarName("Loss")}, + {"soft_label", + "use_softmax", + "numeric_stable_mode", + "ignore_index", + "axis"}, + {GradVarName("Logits")}); +} + +} // namespace phi + +PD_REGISTER_BASE_KERNEL_NAME(softmax_with_cross_entropy, + cross_entropy_with_softmax); +PD_REGISTER_BASE_KERNEL_NAME(softmax_with_cross_entropy_grad, + cross_entropy_with_softmax_grad); + +PD_REGISTER_ARG_MAPPING_FN(softmax_with_cross_entropy, + phi::SoftmaxWithCrossEntropyOpArgumentMapping); +PD_REGISTER_ARG_MAPPING_FN(softmax_with_cross_entropy_grad, + phi::SoftmaxWithCrossEntropyGradOpArgumentMapping); diff --git a/paddle/phi/ops/compat/strided_slice_sig.cc b/paddle/phi/ops/compat/strided_slice_sig.cc index 70ce2e3e07ce9..9fb70af0dea51 100644 --- a/paddle/phi/ops/compat/strided_slice_sig.cc +++ b/paddle/phi/ops/compat/strided_slice_sig.cc @@ -57,14 +57,14 @@ KernelSignature StridedSliceOpArgumentMapping( "decrease_axis"}; paddle::SmallVector outputs = {"Out"}; - std::string op_type; + std::string kernel_name; if (ctx.IsDenseTensorVectorInput("Input")) { - op_type = "strided_slice_array"; + kernel_name = "strided_slice_array"; } else { - op_type = "strided_slice"; + kernel_name = "strided_slice_raw"; } // NOTE(dev): Use this to avoid regularization. - KernelSignature sig(op_type, inputs, attrs, outputs); + KernelSignature sig(kernel_name, inputs, attrs, outputs); return sig; } @@ -106,15 +106,15 @@ KernelSignature StridedSliceGradOpArgumentMapping( "decrease_axis"}; paddle::SmallVector outputs = {GradVarName("Input")}; - std::string op_type; + std::string kernel_name; if (ctx.IsDenseTensorVectorInput("Input")) { - op_type = "strided_slice_array_grad"; + kernel_name = "strided_slice_array_grad"; } else { - op_type = "strided_slice_grad"; + kernel_name = "strided_slice_raw_grad"; } // NOTE(dev): Use this to avoid regularization. 
- KernelSignature sig(op_type, inputs, attrs, outputs); + KernelSignature sig(kernel_name, inputs, attrs, outputs); return sig; } @@ -132,573 +132,273 @@ NOTE: The following codes are for 'get_compat_kernel_signature.py' ############################ Forward ############################ -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "EndsTensor", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "EndsTensor", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "EndsTensor", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "EndsTensorList", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "EndsTensorList", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "EndsTensorList", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "ends", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "ends", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensor", "ends", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "EndsTensor", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "EndsTensor", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "EndsTensor", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "EndsTensorList", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "EndsTensorList", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "EndsTensorList", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", 
"ends", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "ends", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "StartsTensorList", "ends", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "EndsTensor", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "EndsTensor", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "EndsTensor", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "EndsTensorList", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "EndsTensorList", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "EndsTensorList", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "ends", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "ends", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice}", {"Input"}, +return KernelSignature("strided_slice_raw", {"Input"}, {"axes", "starts", "ends", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "EndsTensor", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "EndsTensor", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "EndsTensor", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "EndsTensorList", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "EndsTensorList", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "EndsTensorList", "starts","infer_flags", 
"decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "ends", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "ends", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensor", "ends", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "EndsTensor", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "EndsTensor", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "EndsTensor", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "EndsTensorList", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "EndsTensorList", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "EndsTensorList", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "ends", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "ends", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "StartsTensorList", "ends", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "EndsTensor", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "EndsTensor", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "EndsTensor", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "EndsTensorList", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return 
KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "EndsTensorList", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "EndsTensorList", "starts","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "ends", "StartsTensor","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "ends", "StartsTensorList","infer_flags", "decrease_axis"}, {"Out"}); -return KernelSignature("{strided_slice_array}", {"Input"}, +return KernelSignature("strided_slice_array", {"Input"}, {"axes", "starts", "ends", "starts","infer_flags", "decrease_axis"}, {"Out"}); - -############################ Backward ############################ - - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensor", -"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensor", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensor", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensorList", -"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensorList", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensorList", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "ends", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "ends", "StartsTensorList","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensor", "ends", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensor", -"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensor", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensor", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensorList", 
-"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensorList", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensorList", -"starts","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "ends", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "ends", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "StartsTensorList", "ends", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "EndsTensor", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "EndsTensor", "StartsTensorList","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "EndsTensor", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "EndsTensorList", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "EndsTensorList", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "EndsTensorList", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "ends", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "ends", "StartsTensorList","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_grad}", {"Input", GradVarName("Out")}, - {"axes", "starts", "ends", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensor", -"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensor", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensor", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensorList", 
-"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensorList", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "EndsTensorList", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "ends", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "ends", "StartsTensorList","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensor", "ends", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensor", -"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensor", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensor", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensorList", -"StartsTensor","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensorList", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "EndsTensorList", -"starts","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "ends", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "ends", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "StartsTensorList", "ends", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "EndsTensor", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "EndsTensor", "StartsTensorList","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "EndsTensor", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - 
-return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "EndsTensorList", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "EndsTensorList", -"StartsTensorList","infer_flags", "decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "EndsTensorList", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "ends", "StartsTensor","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "ends", "StartsTensorList","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); - -return KernelSignature("{strided_slice_array_grad}", {"Input", -GradVarName("Out")}, - {"axes", "starts", "ends", "starts","infer_flags", -"decrease_axis"}, - {GradVarName("Input")}); */ diff --git a/paddle/phi/ops/compat/top_k_sig.cc b/paddle/phi/ops/compat/top_k_sig.cc index 9bf922b3d1b58..8488a18e34ce1 100644 --- a/paddle/phi/ops/compat/top_k_sig.cc +++ b/paddle/phi/ops/compat/top_k_sig.cc @@ -29,7 +29,7 @@ KernelSignature TopkOpArgumentMapping(const ArgumentMappingContext& ctx) { KernelSignature TopkGradOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature("top_k_grad", - {GradVarName("Out"), "X", "Indices"}, + {"X", "Indices", GradVarName("Out")}, {"k", "axis", "largest", "sorted"}, {GradVarName("X")}); } diff --git a/paddle/phi/ops/compat/trace_sig.cc b/paddle/phi/ops/compat/trace_sig.cc index 44fd53db98a3c..c3f5d6d287551 100644 --- a/paddle/phi/ops/compat/trace_sig.cc +++ b/paddle/phi/ops/compat/trace_sig.cc @@ -23,7 +23,7 @@ KernelSignature TraceOpArgumentMapping(const ArgumentMappingContext& ctx) { KernelSignature TraceGradOpArgumentMapping(const ArgumentMappingContext& ctx) { return KernelSignature("trace_grad", - {GradVarName("Out"), "Input"}, + {"Input", GradVarName("Out")}, {"offset", "axis1", "axis2"}, {GradVarName("Input")}); } diff --git a/paddle/phi/tests/api/CMakeLists.txt b/paddle/phi/tests/api/CMakeLists.txt index cc05c0194804a..94378aceff58c 100644 --- a/paddle/phi/tests/api/CMakeLists.txt +++ b/paddle/phi/tests/api/CMakeLists.txt @@ -12,7 +12,6 @@ cc_test(test_dot_api SRCS test_dot_api.cc DEPS ${COMMON_API_TEST_DEPS}) cc_test(test_matmul_api SRCS test_matmul_api.cc DEPS ${COMMON_API_TEST_DEPS}) cc_test(test_empty_api SRCS test_empty_api.cc DEPS ${COMMON_API_TEST_DEPS}) cc_test(test_fill_api SRCS test_fill_api.cc DEPS ${COMMON_API_TEST_DEPS}) -cc_test(test_flatten_api SRCS test_flatten_api.cc DEPS ${COMMON_API_TEST_DEPS}) cc_test(test_elementwise_api SRCS test_elementwise_api.cc DEPS ${COMMON_API_TEST_DEPS}) cc_test(test_cast_api SRCS test_cast_api.cc DEPS ${COMMON_API_TEST_DEPS}) cc_test(test_reshape_api SRCS test_reshape_api.cc DEPS ${COMMON_API_TEST_DEPS}) diff --git a/paddle/phi/tests/api/test_flatten_api.cc b/paddle/phi/tests/api/test_flatten_api.cc deleted file mode 100644 index f1c8935e26640..0000000000000 --- a/paddle/phi/tests/api/test_flatten_api.cc +++ /dev/null @@ -1,75 +0,0 @@ -/* Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
- -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include -#include - -#include "paddle/phi/api/include/api.h" - -#include "paddle/phi/api/lib/utils/allocator.h" -#include "paddle/phi/core/dense_tensor.h" -#include "paddle/phi/core/kernel_registry.h" - -PD_DECLARE_KERNEL(flatten, CPU, ALL_LAYOUT); - -namespace paddle { -namespace tests { - -namespace framework = paddle::framework; -using DDim = phi::DDim; - -// TODO(chenweihang): Remove this test after the API is used in the dygraph -TEST(API, flatten) { - // 1. create tensor - const auto alloc = std::make_unique( - paddle::platform::CPUPlace()); - auto dense_x = std::make_shared( - alloc.get(), - phi::DenseTensorMeta(phi::DataType::FLOAT32, - phi::make_ddim({3, 2, 2, 3}), - phi::DataLayout::NCHW)); - auto* dense_x_data = - dense_x->mutable_data(paddle::platform::CPUPlace()); - - for (int i = 0; i < dense_x->numel(); i++) { - dense_x_data[i] = i; - } - - paddle::experimental::Tensor x(dense_x); - int start_axis = 1, stop_axis = 2; - // 2. test API - auto out = paddle::experimental::flatten(x, start_axis, stop_axis); - - // 3. check result - std::vector expect_shape = {3, 4, 3}; - ASSERT_EQ(out.dims()[0], expect_shape[0]); - ASSERT_EQ(out.dims()[1], expect_shape[1]); - ASSERT_EQ(out.dims()[2], expect_shape[2]); - ASSERT_EQ(out.numel(), 36); - ASSERT_EQ(out.is_cpu(), true); - ASSERT_EQ(out.type(), phi::DataType::FLOAT32); - ASSERT_EQ(out.layout(), phi::DataLayout::NCHW); - ASSERT_EQ(out.initialized(), true); - bool value_equal = true; - auto dense_out = std::dynamic_pointer_cast(out.impl()); - auto* dense_out_data = dense_out->data(); - for (int i = 0; i < dense_x->numel(); i++) { - if (std::abs(dense_x_data[i] - dense_out_data[i]) > 1e-6f) - value_equal = false; - } - ASSERT_EQ(value_equal, true); -} - -} // namespace tests -} // namespace paddle diff --git a/paddle/phi/tests/kernels/test_sparse_conv3d_dev_api.cc b/paddle/phi/tests/kernels/test_sparse_conv3d_dev_api.cc index 4800e1402ba56..9fb0e5692645d 100644 --- a/paddle/phi/tests/kernels/test_sparse_conv3d_dev_api.cc +++ b/paddle/phi/tests/kernels/test_sparse_conv3d_dev_api.cc @@ -48,13 +48,13 @@ std::vector cast(const std::vector& in) { return out; } -template -void TestConv3dBase(const std::vector& indices, +template +void TestConv3dBase(const std::vector& indices, const std::vector& features, const DDim& x_dims, const std::vector& kernel, const DDim& kernel_dims, - const std::vector& correct_out_indices, + const std::vector& correct_out_indices, const std::vector& correct_out_features, const DDim& correct_out_dims, const int non_zero_num, @@ -71,16 +71,22 @@ void TestConv3dBase(const std::vector& indices, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(paddle::platform::CPUPlace()) .get()); + dev_ctx_cpu.SetHostAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CPUPlace()) + .get()); dev_ctx_cpu.Init(); const int in_channels = kernel_dims[3]; const int out_channels = kernel_dims[4]; + auto indices_dtype = 
paddle::experimental::CppTypeToDataType::Type(); DenseTensor indices_tensor = phi::Empty( dev_ctx_cpu, - DenseTensorMeta(DataType::INT32, {4, non_zero_num}, DataLayout::NCHW)); - memcpy( - indices_tensor.data(), indices.data(), indices.size() * sizeof(int)); + DenseTensorMeta(indices_dtype, {4, non_zero_num}, DataLayout::NCHW)); + memcpy(indices_tensor.data(), + indices.data(), + indices.size() * sizeof(IntT)); DenseTensor features_tensor = phi::Empty( dev_ctx_cpu, DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), @@ -107,7 +113,7 @@ void TestConv3dBase(const std::vector& indices, if (!std::is_same::value) { DenseTensor rulebook = phi::Empty( - dev_ctx_cpu, DenseTensorMeta(DataType::INT32, {1}, DataLayout::NCHW)); + dev_ctx_cpu, DenseTensorMeta(indices_dtype, {1}, DataLayout::NCHW)); SparseCooTensor out = sparse::Conv3d(dev_ctx_cpu, x_tensor, kernel_tensor, @@ -125,26 +131,26 @@ void TestConv3dBase(const std::vector& indices, ASSERT_EQ((int64_t)correct_out_features.size() / out_channels, out.nnz()); int cmp_indices = memcmp(correct_out_indices.data(), - out.non_zero_indices().data(), - correct_out_indices.size() * sizeof(int)); + out.non_zero_indices().data(), + correct_out_indices.size() * sizeof(IntT)); ASSERT_EQ(cmp_indices, 0); f_verify(out.non_zero_elements().data(), correct_out_features); if (backward) { - std::vector grads = + std::tuple grads = sparse::Conv3dGrad(dev_ctx_cpu, x_tensor, - rulebook, kernel_tensor, - out.non_zero_elements(), + rulebook, + out, paddings, dilations, strides, 1, subm); - f_verify(grads[0].data(), features_grad); - f_verify(grads[1].data(), kernel_grad); + f_verify(std::get<0>(grads).non_zero_elements().data(), features_grad); + f_verify(std::get<1>(grads).data(), kernel_grad); } } @@ -160,11 +166,15 @@ void TestConv3dBase(const std::vector& indices, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); DenseTensor d_indices_tensor = phi::Empty( dev_ctx_gpu, - DenseTensorMeta(DataType::INT32, {4, non_zero_num}, DataLayout::NCHW)); + DenseTensorMeta(indices_dtype, {4, non_zero_num}, DataLayout::NCHW)); phi::Copy( dev_ctx_gpu, indices_tensor, phi::GPUPlace(), true, &d_indices_tensor); @@ -187,7 +197,7 @@ void TestConv3dBase(const std::vector& indices, dev_ctx_gpu, kernel_tensor, phi::GPUPlace(), true, &d_kernel_tensor); DenseTensor d_rulebook = phi::Empty( - dev_ctx_gpu, DenseTensorMeta(DataType::INT32, {1}, DataLayout::NCHW)); + dev_ctx_gpu, DenseTensorMeta(indices_dtype, {1}, DataLayout::NCHW)); SparseCooTensor d_out = sparse::Conv3d(dev_ctx_gpu, d_x_tensor, d_kernel_tensor, @@ -206,7 +216,7 @@ void TestConv3dBase(const std::vector& indices, DenseTensor h_indices_tensor = phi::Empty( dev_ctx_cpu, - DenseTensorMeta(DataType::INT32, {4, d_out.nnz()}, DataLayout::NCHW)); + DenseTensorMeta(indices_dtype, {4, d_out.nnz()}, DataLayout::NCHW)); phi::Copy(dev_ctx_gpu, d_out.non_zero_indices(), phi::CPUPlace(), @@ -214,15 +224,12 @@ void TestConv3dBase(const std::vector& indices, &h_indices_tensor); int cmp_indices2 = memcmp(correct_out_indices.data(), - h_indices_tensor.data(), - correct_out_indices.size() * sizeof(int)); + h_indices_tensor.data(), + correct_out_indices.size() * sizeof(IntT)); ASSERT_EQ(cmp_indices2, 0); - DenseTensor h_features_tensor = phi::Empty( - dev_ctx_cpu, - 
DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), - {d_out.nnz()}, - d_out.layout())); + DenseTensor h_features_tensor = + phi::EmptyLike(dev_ctx_cpu, d_out.non_zero_elements()); phi::Copy(dev_ctx_gpu, d_out.non_zero_elements(), @@ -232,38 +239,40 @@ void TestConv3dBase(const std::vector& indices, f_verify(h_features_tensor.data(), correct_out_features); if (backward) { - std::vector grads = + std::tuple grads = sparse::Conv3dGrad(dev_ctx_gpu, d_x_tensor, - d_rulebook, d_kernel_tensor, - d_out.non_zero_elements(), + d_rulebook, + d_out, paddings, dilations, strides, 1, subm); - DenseTensor h_features_grad = phi::Empty( - dev_ctx_cpu, - DenseTensorMeta(grads[0].dtype(), grads[0].dims(), grads[0].layout())); - phi::Copy(dev_ctx_gpu, grads[0], phi::CPUPlace(), true, &h_features_grad); + DenseTensor d_features_grad = std::get<0>(grads).non_zero_elements(); + DenseTensor d_kernel_grad = std::get<1>(grads); + DenseTensor h_features_grad = + phi::EmptyLike(dev_ctx_cpu, d_features_grad); + phi::Copy( + dev_ctx_gpu, d_features_grad, phi::CPUPlace(), true, &h_features_grad); f_verify(h_features_grad.data(), features_grad); - DenseTensor h_kernel_grad = phi::Empty( - dev_ctx_cpu, - DenseTensorMeta(grads[1].dtype(), grads[1].dims(), grads[1].layout())); - phi::Copy(dev_ctx_gpu, grads[1], phi::CPUPlace(), true, &h_kernel_grad); + DenseTensor h_kernel_grad = phi::EmptyLike(dev_ctx_cpu, d_kernel_grad); + phi::Copy( + dev_ctx_gpu, std::get<1>(grads), phi::CPUPlace(), true, &h_kernel_grad); f_verify(h_kernel_grad.data(), kernel_grad); } #endif } -void TestConv3d(const std::vector& indices, +template +void TestConv3d(const std::vector& indices, const std::vector& features, const DDim& x_dims, const std::vector& kernel, const DDim& kernel_dims, - const std::vector& correct_out_indices, + const std::vector& correct_out_indices, const std::vector& correct_out_features, const DDim& correct_out_dims, const int non_zero_num, @@ -276,41 +285,41 @@ void TestConv3d(const std::vector& indices, const std::vector kernel_grad = {}, const bool subm = false) { // test float - TestConv3dBase(indices, - features, - x_dims, - kernel, - kernel_dims, - correct_out_indices, - correct_out_features, - correct_out_dims, - non_zero_num, - paddings, - strides, - dilations, - diff, - backward, - features_grad, - kernel_grad, - subm); + TestConv3dBase(indices, + features, + x_dims, + kernel, + kernel_dims, + correct_out_indices, + correct_out_features, + correct_out_dims, + non_zero_num, + paddings, + strides, + dilations, + diff, + backward, + features_grad, + kernel_grad, + subm); // test double - TestConv3dBase(indices, - cast(features), - x_dims, - cast(kernel), - kernel_dims, - correct_out_indices, - cast(correct_out_features), - correct_out_dims, - non_zero_num, - paddings, - strides, - dilations, - diff, - backward, - cast(features_grad), - cast(kernel_grad), - subm); + TestConv3dBase(indices, + cast(features), + x_dims, + cast(kernel), + kernel_dims, + correct_out_indices, + cast(correct_out_features), + correct_out_dims, + non_zero_num, + paddings, + strides, + dilations, + diff, + backward, + cast(features_grad), + cast(kernel_grad), + subm); } TEST(DEV_API, sparse_conv3d) { @@ -610,6 +619,51 @@ TEST(DEV_API, sparse_conv2d) { dilations); } +TEST(DEV_API, sparse_conv2d_int64) { + const int in_channels = 1; + const int out_channels = 1; + DDim x_dims = {1, 1, 5, 5, in_channels}; + DDim kernel_dims = {1, 3, 3, in_channels, out_channels}; + DDim out_dims = {1, 1, 3, 3, out_channels}; + std::vector paddings = {0, 
0, 0}; + std::vector strides = {1, 1, 1}; + std::vector dilations = {1, 1, 1}; + + const int non_zero_num = 3; + std::vector indices_flatten = {0, 0, 0, 0, 0, 0, 0, 4, 0, 3, 2, 4}; + + std::vector features = {-0.79394531, -0.3125, -0.55029297}; + // 3*3*3=27 + std::vector kernel = {0.65820312, + 0.75048828, + 0.21411133, + 0.17370605, + 0.85546875, + 0.53076172, + 0.28833008, + 0.71044922, + 0.00659943}; + + std::vector out_indices_flatten = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, + 0, 0, 2, 2, 2, 1, 2, 0, 1, 2}; + + std::vector out_features = { + -0.17004, -0.71338, -0.00206, -0.22205, -0.09009}; + + TestConv3d(indices_flatten, + features, + x_dims, + kernel, + kernel_dims, + out_indices_flatten, + out_features, + out_dims, + non_zero_num, + paddings, + strides, + dilations); +} + TEST(DEV_API, sparse_conv3d_backward) { const int in_channels = 1; const int out_channels = 1; diff --git a/paddle/phi/tests/kernels/test_sparse_pool_dev_api.cc b/paddle/phi/tests/kernels/test_sparse_pool_dev_api.cc index 27673704168c9..8f7288d70d7d0 100644 --- a/paddle/phi/tests/kernels/test_sparse_pool_dev_api.cc +++ b/paddle/phi/tests/kernels/test_sparse_pool_dev_api.cc @@ -36,11 +36,11 @@ std::vector cast(const std::vector& in) { } return out; } -template -void TestMaxPoolBase(const std::vector& indices, +template +void TestMaxPoolBase(const std::vector& indices, const std::vector& features, const DDim& x_dims, - const std::vector& correct_out_indices, + const std::vector& correct_out_indices, const std::vector& correct_out_features, const DDim& correct_out_dims, const int non_zero_num, @@ -56,16 +56,22 @@ void TestMaxPoolBase(const std::vector& indices, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(paddle::platform::CPUPlace()) .get()); + dev_ctx_cpu.SetHostAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(phi::CPUPlace()) + .get()); dev_ctx_cpu.Init(); const int in_channels = x_dims[4]; const int out_channels = in_channels; + auto indices_dtype = paddle::experimental::CppTypeToDataType::Type(); DenseTensor indices_tensor = phi::Empty( dev_ctx_cpu, - DenseTensorMeta(DataType::INT32, {4, non_zero_num}, DataLayout::NCHW)); - memcpy( - indices_tensor.data(), indices.data(), indices.size() * sizeof(int)); + DenseTensorMeta(indices_dtype, {4, non_zero_num}, DataLayout::NCHW)); + memcpy(indices_tensor.data(), + indices.data(), + indices.size() * sizeof(IntT)); DenseTensor features_tensor = phi::Empty( dev_ctx_cpu, DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), @@ -84,8 +90,7 @@ void TestMaxPoolBase(const std::vector& indices, }; if (!std::is_same::value) { - DenseTensor rulebook = phi::Empty( - dev_ctx_cpu, DenseTensorMeta(DataType::INT32, {1}, DataLayout::NCHW)); + DenseTensor rulebook; SparseCooTensor out = sparse::MaxPool(dev_ctx_cpu, x_tensor, kernel_sizes, @@ -101,20 +106,16 @@ void TestMaxPoolBase(const std::vector& indices, ASSERT_EQ((int64_t)correct_out_features.size() / out_channels, out.nnz()); int cmp_indices = memcmp(correct_out_indices.data(), - out.non_zero_indices().data(), - correct_out_indices.size() * sizeof(int)); + out.non_zero_indices().data(), + correct_out_indices.size() * sizeof(IntT)); ASSERT_EQ(cmp_indices, 0); f_verify(out.non_zero_elements().data(), correct_out_features); if (backward) { - DenseTensor x_grad = sparse::MaxPoolGrad(dev_ctx_cpu, - x_tensor, - rulebook, - out, - out.non_zero_elements(), - kernel_sizes); - f_verify(x_grad.data(), features_grad); + SparseCooTensor x_grad = sparse::MaxPoolGrad( + dev_ctx_cpu, 
x_tensor, rulebook, out, out, kernel_sizes); + f_verify(x_grad.non_zero_elements().data(), features_grad); } } @@ -130,26 +131,26 @@ void TestMaxPoolBase(const std::vector& indices, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); DenseTensor d_indices_tensor = phi::Empty( dev_ctx_gpu, - DenseTensorMeta(DataType::INT32, {4, non_zero_num}, DataLayout::NCHW)); + DenseTensorMeta(indices_dtype, {4, non_zero_num}, DataLayout::NCHW)); phi::Copy( dev_ctx_gpu, indices_tensor, phi::GPUPlace(), true, &d_indices_tensor); - DenseTensor d_features_tensor = phi::Empty( - dev_ctx_gpu, - DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), - {non_zero_num, in_channels}, - DataLayout::NHWC)); + DenseTensor d_features_tensor = + phi::EmptyLike(dev_ctx_gpu, features_tensor); phi::Copy( dev_ctx_gpu, features_tensor, phi::GPUPlace(), true, &d_features_tensor); SparseCooTensor d_x_tensor(d_indices_tensor, d_features_tensor, x_dims); - DenseTensor d_rulebook = phi::Empty( - dev_ctx_gpu, DenseTensorMeta(DataType::INT32, {1}, DataLayout::NCHW)); + DenseTensor d_rulebook; SparseCooTensor d_out = sparse::MaxPool(dev_ctx_gpu, d_x_tensor, kernel_sizes, @@ -166,7 +167,7 @@ void TestMaxPoolBase(const std::vector& indices, DenseTensor h_indices_tensor = phi::Empty( dev_ctx_cpu, - DenseTensorMeta(DataType::INT32, {4, d_out.nnz()}, DataLayout::NCHW)); + DenseTensorMeta(indices_dtype, {4, d_out.nnz()}, DataLayout::NCHW)); phi::Copy(dev_ctx_gpu, d_out.non_zero_indices(), phi::CPUPlace(), @@ -174,15 +175,12 @@ void TestMaxPoolBase(const std::vector& indices, &h_indices_tensor); int cmp_indices2 = memcmp(correct_out_indices.data(), - h_indices_tensor.data(), - correct_out_indices.size() * sizeof(int)); + h_indices_tensor.data(), + correct_out_indices.size() * sizeof(IntT)); ASSERT_EQ(cmp_indices2, 0); - DenseTensor h_features_tensor = phi::Empty( - dev_ctx_cpu, - DenseTensorMeta(paddle::experimental::CppTypeToDataType::Type(), - {d_out.nnz()}, - d_out.layout())); + DenseTensor h_features_tensor = + phi::EmptyLike(dev_ctx_cpu, d_out.non_zero_elements()); phi::Copy(dev_ctx_gpu, d_out.non_zero_elements(), @@ -192,25 +190,25 @@ void TestMaxPoolBase(const std::vector& indices, f_verify(h_features_tensor.data(), correct_out_features); if (backward) { - DenseTensor x_grad = sparse::MaxPoolGrad(dev_ctx_gpu, - d_x_tensor, - d_rulebook, - d_out, - d_out.non_zero_elements(), - kernel_sizes); - DenseTensor h_features_grad = phi::Empty( - dev_ctx_cpu, - DenseTensorMeta(x_grad.dtype(), x_grad.dims(), x_grad.layout())); - phi::Copy(dev_ctx_gpu, x_grad, phi::CPUPlace(), true, &h_features_grad); + SparseCooTensor x_grad = sparse::MaxPoolGrad( + dev_ctx_gpu, d_x_tensor, d_rulebook, d_out, d_out, kernel_sizes); + DenseTensor h_features_grad = + phi::EmptyLike(dev_ctx_cpu, x_grad.non_zero_elements()); + phi::Copy(dev_ctx_gpu, + x_grad.non_zero_elements(), + phi::CPUPlace(), + true, + &h_features_grad); f_verify(h_features_grad.data(), features_grad); } #endif } -void TestMaxPool(const std::vector& indices, +template +void TestMaxPool(const std::vector& indices, const std::vector& features, const DDim& x_dims, - const std::vector& correct_out_indices, + const std::vector& correct_out_indices, const std::vector& correct_out_features, const DDim& correct_out_dims, const int non_zero_num, @@ -222,35 
+220,35 @@ void TestMaxPool(const std::vector& indices, const bool backward = false, const std::vector features_grad = {}) { // test float - TestMaxPoolBase(indices, - features, - x_dims, - correct_out_indices, - correct_out_features, - correct_out_dims, - non_zero_num, - kernel_sizes, - paddings, - strides, - dilations, - diff, - backward, - features_grad); + TestMaxPoolBase(indices, + features, + x_dims, + correct_out_indices, + correct_out_features, + correct_out_dims, + non_zero_num, + kernel_sizes, + paddings, + strides, + dilations, + diff, + backward, + features_grad); // test double - TestMaxPoolBase(indices, - cast(features), - x_dims, - correct_out_indices, - cast(correct_out_features), - correct_out_dims, - non_zero_num, - kernel_sizes, - paddings, - strides, - dilations, - diff, - backward, - cast(features_grad)); + TestMaxPoolBase(indices, + cast(features), + x_dims, + correct_out_indices, + cast(correct_out_features), + correct_out_dims, + non_zero_num, + kernel_sizes, + paddings, + strides, + dilations, + diff, + backward, + cast(features_grad)); } TEST(DEV_API, sparse_maxpool) { diff --git a/paddle/phi/tests/kernels/test_sparse_utils_dev_api.cc b/paddle/phi/tests/kernels/test_sparse_utils_dev_api.cc index b8f214b79e290..93728ad31b0d6 100644 --- a/paddle/phi/tests/kernels/test_sparse_utils_dev_api.cc +++ b/paddle/phi/tests/kernels/test_sparse_utils_dev_api.cc @@ -117,6 +117,10 @@ void TestDenseToSparseCoo(const DenseTensor& dense_x, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); const auto cuda_alloc = @@ -328,6 +332,10 @@ void TestSparseCsrToCoo(const DDim& dense_dims, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); const auto cuda_alloc = @@ -511,6 +519,10 @@ void TestCooToCsr(const DDim& dense_dims, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); phi::DenseTensor d_indices(cuda_alloc.get(), indices_meta); phi::DenseTensor d_values(cuda_alloc.get(), values_meta); @@ -611,6 +623,10 @@ void TestDenseToSparseCsr(const DenseTensor& dense_x, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); phi::Copy(dev_ctx_gpu, dense_x, phi::GPUPlace(), true, &d_dense_x); auto sparse_out = sparse::DenseToSparseCsr(dev_ctx_gpu, d_dense_x); @@ -741,6 +757,10 @@ void TestSparseCooToDense(const DDim& dense_dims, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); DenseTensor 
d_dense_indices(cuda_alloc.get(), dense_indices.meta()); DenseTensor d_dense_elements(cuda_alloc.get(), dense_elements.meta()); @@ -886,6 +906,10 @@ void TestSparseCsrToDense(const DDim& dense_dims, paddle::memory::allocation::AllocatorFacade::Instance() .GetAllocator(phi::CPUPlace()) .get()); + dev_ctx_gpu.SetPinnedAllocator( + paddle::memory::allocation::AllocatorFacade::Instance() + .GetAllocator(paddle::platform::CUDAPinnedPlace()) + .get()); dev_ctx_gpu.PartialInitWithAllocator(); phi::DenseTensor d_crows(cuda_alloc.get(), crows_meta); phi::DenseTensor d_cols(cuda_alloc.get(), cols_meta); diff --git a/paddle/scripts/infrt_build.sh b/paddle/scripts/infrt_build.sh index 6634f5396ac74..ef753200971b3 100755 --- a/paddle/scripts/infrt_build.sh +++ b/paddle/scripts/infrt_build.sh @@ -20,6 +20,9 @@ set -e +# TARGET: CPU/GPU/TensorRt +TARGET=GPU + if [ -z ${BRANCH} ]; then BRANCH="develop" fi @@ -32,8 +35,14 @@ function update_pd_ops() { # compile and install paddle rm -rf ${PADDLE_ROOT}/build && mkdir -p ${PADDLE_ROOT}/build cd ${PADDLE_ROOT}/build - cmake .. -DWITH_PYTHON=ON -DWITH_MKL=OFF -DWITH_GPU=OFF -DPYTHON_EXECUTABLE=`which python3` -DWITH_XBYAK=OFF -DWITH_NCCL=OFF -DWITH_RCCL=OFF -DWITH_CRYPTO=OFF - make -j8 paddle_python print_pten_kernels kernel_signature_generator + + INFER_WITH_GPU=OFF + if [ "${TARGET}" == "GPU" ] || [ "${TARGET}" == "gpu" ] || [ "${TARGET}" == "TensorRt" ] || [ "${TARGET}" == "tensorrt" ]; then + INFER_WITH_GPU=ON + fi + + cmake .. -DWITH_PYTHON=ON -DWITH_MKL=OFF -DWITH_GPU=$INFER_WITH_GPU -DPYTHON_EXECUTABLE=`which python3` -DWITH_XBYAK=OFF -DWITH_NCCL=OFF -DWITH_RCCL=OFF -DWITH_CRYPTO=OFF + make -j24 paddle_python print_pten_kernels kernel_signature_generator cd ${PADDLE_ROOT}/build ./paddle/phi/tools/print_pten_kernels > ../tools/infrt/kernels.json ./paddle/fluid/pybind/kernel_signature_generator > ../tools/infrt/kernel_signature.json @@ -81,6 +90,7 @@ function init() { } function infrt_gen_and_build() { + parallel_number=24 if [ "$1" != "" ]; then parallel_number=$1 fi @@ -93,7 +103,13 @@ function infrt_gen_and_build() { # step2. compile infrt cd ${PADDLE_ROOT}/build rm -f infrt_summary.txt - cmake .. -DWITH_MKL=OFF -DWITH_GPU=OFF -DWITH_CRYPTO=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_INFRT=ON -DWITH_PYTHON=OFF -DWITH_TESTING==${WITH_TESTING:-ON}; build_error=$? + + INFER_WITH_GPU=OFF + if [ "${TARGET}" == "GPU" ] || [ "${TARGET}" == "gpu" ] || [ "${TARGET}" == "TensorRt" ]; then + INFER_WITH_GPU=ON + fi + + cmake .. -DWITH_MKL=OFF -DWITH_GPU=${INFER_WITH_GPU} -DWITH_TENSORRT=ON -DWITH_CRYPTO=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_INFRT=ON -DINFRT_WITH_GPU=ON -DINFRT_WITH_TRT=ON -DWITH_PYTHON=OFF -DWITH_TESTING==${WITH_TESTING:-ON}; build_error=$? if [ "$build_error" != 0 ];then exit 7; fi @@ -156,28 +172,34 @@ function main() { echo " (2)bash infrt_build.sh build_only" echo " (3)bash infrt_build.sh test_only" echo " optional command: --update_pd_ops : pd_ops.td will be updated according to paddle's code." + echo " --target= : GPU/gpu/CPU/cpu/TensorRt/tensorrt, default value is GPU." 
exit 0 fi init - case $CMD in - build_and_test) - infrt_gen_and_build ${parallel_number} - test_infrt - ;; - build_only) - infrt_gen_and_build ${parallel_number} - ;; - test_only) - test_infrt - ;; - *) - print_usage - exit 1 - ;; - esac - + for i in "$@"; do + case $i in + --target=*) + TARGET="${i#*=}" + shift + ;; + build_and_test) + infrt_gen_and_build ${parallel_number} + test_infrt + ;; + build_only) + infrt_gen_and_build ${parallel_number} + ;; + test_only) + test_infrt + ;; + *) + print_usage + exit 1 + ;; + esac + done set +x if [[ -f ${PADDLE_ROOT}/build/infrt_summary.txt ]];then echo "=====================build summary======================" diff --git a/paddle/scripts/paddle_build.bat b/paddle/scripts/paddle_build.bat index 4092922d01322..21df60e972121 100644 --- a/paddle/scripts/paddle_build.bat +++ b/paddle/scripts/paddle_build.bat @@ -79,6 +79,10 @@ if not defined NIGHTLY_MODE set PRECISION_TEST=OFF if not defined retry_times set retry_times=1 if not defined PYTHON_ROOT set PYTHON_ROOT=C:\Python37 if not defined BUILD_DIR set BUILD_DIR=build +if not defined NEW_RELEASE_ALL set NEW_RELEASE_ALL=ON +if not defined NEW_RELEASE_PYPI set NEW_RELEASE_PYPI=OFF +if not defined NEW_RELEASE_JIT set NEW_RELEASE_JIT=OFF + set task_name=%1 set UPLOAD_TP_FILE=OFF @@ -251,7 +255,6 @@ set MSVC_STATIC_CRT=ON set ON_INFER=ON set WITH_TENSORRT=ON set WITH_INFERENCE_API_TEST=ON -set vcvars64_dir="D:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvars64.bat" call :cmake || goto cmake_error call :build || goto build_error @@ -327,8 +330,12 @@ set PreferredToolArchitecture=x64 for /F %%# in ('wmic os get localdatetime^|findstr 20') do set start=%%# set start=%start:~4,10% -if not defined CUDA_TOOLKIT_ROOT_DIR set CUDA_TOOLKIT_ROOT_DIR=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2 -set PATH=%TENSORRT_ROOT:/=\%\lib;%CUDA_TOOLKIT_ROOT_DIR%\bin;%CUDA_TOOLKIT_ROOT_DIR%\libnvvp;%PATH% +if not defined CUDA_TOOLKIT_ROOT_DIR set CUDA_TOOLKIT_ROOT_DIR=C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2 +set PATH=%TENSORRT_ROOT:/=\%\lib;%CUDA_TOOLKIT_ROOT_DIR:/=\%\bin;%CUDA_TOOLKIT_ROOT_DIR:/=\%\libnvvp;%PATH% + +rem CUDA_TOOLKIT_ROOT_DIR in cmake must use / rather than \ +set TENSORRT_ROOT=%TENSORRT_ROOT:\=/% +set CUDA_TOOLKIT_ROOT_DIR=%CUDA_TOOLKIT_ROOT_DIR:\=/% rem install ninja if GENERATOR is Ninja if %GENERATOR% == "Ninja" ( @@ -427,14 +434,18 @@ echo cmake .. -G %GENERATOR% -DCMAKE_BUILD_TYPE=Release -DWITH_AVX=%WITH_AVX% -D -DWITH_INFERENCE_API_TEST=%WITH_INFERENCE_API_TEST% -DTHIRD_PARTY_PATH=%THIRD_PARTY_PATH% ^ -DINFERENCE_DEMO_INSTALL_DIR=%INFERENCE_DEMO_INSTALL_DIR% -DWITH_STATIC_LIB=%WITH_STATIC_LIB% ^ -DWITH_TENSORRT=%WITH_TENSORRT% -DTENSORRT_ROOT="%TENSORRT_ROOT%" -DMSVC_STATIC_CRT=%MSVC_STATIC_CRT% ^ --DWITH_UNITY_BUILD=%WITH_UNITY_BUILD% -DCUDA_ARCH_NAME=%CUDA_ARCH_NAME% -DCUB_PATH=%THIRD_PARTY_HOME%/cub +-DWITH_UNITY_BUILD=%WITH_UNITY_BUILD% -DCUDA_ARCH_NAME=%CUDA_ARCH_NAME% -DCUB_PATH=%THIRD_PARTY_HOME%/cub ^ +-DCUDA_TOOLKIT_ROOT_DIR="%CUDA_TOOLKIT_ROOT_DIR%" -DNEW_RELEASE_ALL=%NEW_RELEASE_ALL% -DNEW_RELEASE_PYPI=%NEW_RELEASE_PYPI% ^ +-DNEW_RELEASE_JIT=%NEW_RELEASE_JIT% cmake .. 
-G %GENERATOR% -DCMAKE_BUILD_TYPE=Release -DWITH_AVX=%WITH_AVX% -DWITH_GPU=%WITH_GPU% -DWITH_MKL=%WITH_MKL% ^ -DWITH_TESTING=%WITH_TESTING% -DWITH_PYTHON=%WITH_PYTHON% -DPYTHON_EXECUTABLE=%PYTHON_EXECUTABLE% -DON_INFER=%ON_INFER% ^ -DWITH_INFERENCE_API_TEST=%WITH_INFERENCE_API_TEST% -DTHIRD_PARTY_PATH=%THIRD_PARTY_PATH% ^ -DINFERENCE_DEMO_INSTALL_DIR=%INFERENCE_DEMO_INSTALL_DIR% -DWITH_STATIC_LIB=%WITH_STATIC_LIB% ^ -DWITH_TENSORRT=%WITH_TENSORRT% -DTENSORRT_ROOT="%TENSORRT_ROOT%" -DMSVC_STATIC_CRT=%MSVC_STATIC_CRT% ^ --DWITH_UNITY_BUILD=%WITH_UNITY_BUILD% -DCUDA_ARCH_NAME=%CUDA_ARCH_NAME% -DCUB_PATH=%THIRD_PARTY_HOME%/cub +-DWITH_UNITY_BUILD=%WITH_UNITY_BUILD% -DCUDA_ARCH_NAME=%CUDA_ARCH_NAME% -DCUB_PATH=%THIRD_PARTY_HOME%/cub ^ +-DCUDA_TOOLKIT_ROOT_DIR="%CUDA_TOOLKIT_ROOT_DIR%" -DNEW_RELEASE_ALL=%NEW_RELEASE_ALL% -DNEW_RELEASE_PYPI=%NEW_RELEASE_PYPI% ^ +-DNEW_RELEASE_JIT=%NEW_RELEASE_JIT% goto:eof :cmake_error @@ -646,7 +657,6 @@ for /F %%# in ('wmic os get localdatetime^|findstr 20') do set start=%%# set start=%start:~4,10% set FLAGS_call_stack_level=2 -set FLAGS_use_curand=True dir %THIRD_PARTY_PATH:/=\%\install\openblas\lib dir %THIRD_PARTY_PATH:/=\%\install\openblas\bin dir %THIRD_PARTY_PATH:/=\%\install\zlib\bin @@ -699,7 +709,9 @@ echo cmake .. -G %GENERATOR% -DCMAKE_BUILD_TYPE=Release -DWITH_AVX=%WITH_AVX% -D -DWITH_INFERENCE_API_TEST=%WITH_INFERENCE_API_TEST% -DTHIRD_PARTY_PATH=%THIRD_PARTY_PATH% ^ -DINFERENCE_DEMO_INSTALL_DIR=%INFERENCE_DEMO_INSTALL_DIR% -DWITH_STATIC_LIB=%WITH_STATIC_LIB% ^ -DWITH_TENSORRT=%WITH_TENSORRT% -DTENSORRT_ROOT="%TENSORRT_ROOT%" -DMSVC_STATIC_CRT=%MSVC_STATIC_CRT% ^ --DWITH_UNITY_BUILD=%WITH_UNITY_BUILD% -DCUDA_ARCH_NAME=%CUDA_ARCH_NAME% >> %work_dir%\win_cmake.sh +-DWITH_UNITY_BUILD=%WITH_UNITY_BUILD% -DCUDA_ARCH_NAME=%CUDA_ARCH_NAME% -DCUB_PATH=%THIRD_PARTY_HOME%/cub ^ +-DCUDA_TOOLKIT_ROOT_DIR="%CUDA_TOOLKIT_ROOT_DIR%" -DNEW_RELEASE_ALL=%NEW_RELEASE_ALL% -DNEW_RELEASE_PYPI=%NEW_RELEASE_PYPI% ^ +-DNEW_RELEASE_JIT=%NEW_RELEASE_JIT% >> %work_dir%\win_cmake.sh %cache_dir%\tools\busybox64.exe bash %work_dir%\tools\windows\run_unittests.sh %NIGHTLY_MODE% %PRECISION_TEST% %WITH_GPU% @@ -739,7 +751,7 @@ for /F %%i in ("%libsize%") do ( ) cd /d %work_dir%\paddle\fluid\inference\api\demo_ci -%cache_dir%\tools\busybox64.exe bash run.sh %work_dir:\=/% %WITH_MKL% %WITH_GPU% %cache_dir:\=/%/inference_demo %WITH_TENSORRT% %TENSORRT_ROOT% %WITH_ONNXRUNTIME% %MSVC_STATIC_CRT% +%cache_dir%\tools\busybox64.exe bash run.sh %work_dir:\=/% %WITH_MKL% %WITH_GPU% %cache_dir:\=/%/inference_demo %WITH_TENSORRT% %TENSORRT_ROOT% %WITH_ONNXRUNTIME% %MSVC_STATIC_CRT% "%CUDA_TOOLKIT_ROOT_DIR%" goto:eof :test_inference_error @@ -770,7 +782,7 @@ echo Step 7. Testing fluid library with infer_ut for inference ... 
echo ======================================== cd /d %work_dir%\paddle\fluid\inference\tests\infer_ut -%cache_dir%\tools\busybox64.exe bash run.sh %work_dir:\=/% %WITH_MKL% %WITH_GPU% %cache_dir:\=/%/inference_demo %TENSORRT_ROOT% %WITH_ONNXRUNTIME% %MSVC_STATIC_CRT% +%cache_dir%\tools\busybox64.exe bash run.sh %work_dir:\=/% %WITH_MKL% %WITH_GPU% %cache_dir:\=/%/inference_demo %TENSORRT_ROOT% %WITH_ONNXRUNTIME% %MSVC_STATIC_CRT% "%CUDA_TOOLKIT_ROOT_DIR%" goto:eof :test_inference_ut_error diff --git a/paddle/scripts/paddle_build.sh b/paddle/scripts/paddle_build.sh index d1220e4537582..e8bde467e085d 100755 --- a/paddle/scripts/paddle_build.sh +++ b/paddle/scripts/paddle_build.sh @@ -61,8 +61,6 @@ function init() { # NOTE(chenweihang): For easy debugging, CI displays the C++ error stacktrace by default export FLAGS_call_stack_level=2 - export FLAGS_use_curand=True - # set CI_SKIP_CPP_TEST if only *.py changed # In order to avoid using in some CI(such as daily performance), the current # branch must not be `${BRANCH}` which is usually develop. diff --git a/python/paddle/__init__.py b/python/paddle/__init__.py index d21c1b2803a34..63f16c4eb78f1 100755 --- a/python/paddle/__init__.py +++ b/python/paddle/__init__.py @@ -48,7 +48,10 @@ from .framework.dtype import bool # noqa: F401 from .framework.dtype import complex64 # noqa: F401 from .framework.dtype import complex128 # noqa: F401 -from .framework import VarBase as Tensor # noqa: F401 +if fluid.framework._in_eager_mode_: + Tensor = framework.core.eager.Tensor +else: + from .framework import VarBase as Tensor # noqa: F401 Tensor.__qualname__ = 'Tensor' # noqa: F401 import paddle.compat # noqa: F401 import paddle.distributed # noqa: F401 @@ -72,6 +75,7 @@ import paddle.reader # noqa: F401 import paddle.static # noqa: F401 import paddle.vision # noqa: F401 +import paddle.sparse # noqa: F401 from .tensor.attribute import is_complex # noqa: F401 from .tensor.attribute import is_integer # noqa: F401 @@ -210,6 +214,7 @@ from .tensor.math import stanh # noqa: F401 from .tensor.math import sum # noqa: F401 from .tensor.math import nansum # noqa: F401 +from .tensor.math import nanmean # noqa: F401 from .tensor.math import tanh # noqa: F401 from .tensor.math import tanh_ # noqa: F401 from .tensor.math import add_n # noqa: F401 @@ -543,6 +548,7 @@ 'not_equal', 'sum', 'nansum', + 'nanmean', 'tile', 'greater_equal', 'isfinite', diff --git a/python/paddle/autograd/__init__.py b/python/paddle/autograd/__init__.py index 7aab7117de905..b13a4591b4ef2 100644 --- a/python/paddle/autograd/__init__.py +++ b/python/paddle/autograd/__init__.py @@ -1,11 +1,11 @@ # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. @@ -13,12 +13,18 @@ # limitations under the License. from ..fluid.dygraph.base import grad # noqa: F401 +from ..fluid.dygraph.base import no_grad_ as no_grad # noqa: F401 +from ..framework import is_grad_enabled, set_grad_enabled # noqa: F401 from . 
import backward_mode # noqa: F401 from .backward_mode import backward # noqa: F401 from .py_layer import PyLayer, PyLayerContext, EagerPyLayer, EagerPyLayerContext # noqa: F401 from ..framework import set_grad_enabled, is_grad_enabled # noqa: F401 from ..fluid.dygraph.base import no_grad_ as no_grad # noqa: F401 -from .functional import jacobian, hessian, batch_jacobian, batch_hessian # noqa: F401 -from .functional import vjp, jvp, vhp # noqa: F401 +from .functional import vjp, jvp, Jacobian, Hessian # noqa: F401 +from .functional import jacobian, hessian, batch_jacobian, batch_hessian, vhp # noqa: F401 -__all__ = ['backward', 'PyLayer', 'PyLayerContext'] +__all__ = [ # noqa + 'backward', + 'PyLayer', + 'PyLayerContext', +] diff --git a/python/paddle/autograd/functional.py b/python/paddle/autograd/functional.py index c663d37e7f2ab..8e027c270b700 100644 --- a/python/paddle/autograd/functional.py +++ b/python/paddle/autograd/functional.py @@ -12,236 +12,686 @@ # See the License for the specific language governing permissions and # limitations under the License. -import contextlib +import functools +import typing + import paddle -from paddle.static import gradients -from ..fluid import framework -from ..fluid.dygraph import grad -from ..tensor.creation import assign -from ..tensor import reshape, zeros_like, to_tensor -from .utils import _tensors, _stack_tensor_or_return_none, _replace_none_with_zero_tensor - - -@contextlib.contextmanager -def gradient_scope(*var_lists, create_graph=False, allow_unused=False): - def grad_fn(ys, xs, v=None, create_graph=create_graph): - if v is not None: - assert len(ys) == len(v), ( - f'The argument {v} is expected to be of the same size as the output. ' - f'Here the output is {ys}, and `v` is {v}.') - if allow_unused: - ys = [ - to_tensor( - [0.0], stop_gradient=False) if y is None else y for y in ys - ] - return grad( - ys, xs, v, create_graph=create_graph, allow_unused=allow_unused) - - def return_fn(out): - if isinstance(out, paddle.Tensor): - if not create_graph: - out = out.detach() - return out - if isinstance(out, list): - return list(return_fn(x) for x in out) - elif isinstance(out, tuple): - return tuple(return_fn(x) for x in out) - else: - assert out is None - return out - - def process(vl): - if vl is None: - return None - out = [] - # If v is treated as constant in the outer scope, its gradient is guaranteed - # not to be taken beyond this scope. Within this scope, however, v's gradient - # may be computed. We only need to detach v in this case. - # Otherwise, v's gradient is valid, and is subject to update beyond this scope. - # In this case we must not confuse the gradient in the outer scope with the - # inner one's. Moreover, we need to make sure that the result from the inner - # scope can flow back to the outer scope. This can be satisfied by extending - # the original variable with a duplication operation v1 = v so that v still - # maintains the complete lineage. 
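# The assign-vs-detach rule below (duplicate ``v`` when its gradient may still be needed, otherwise detach it and re-enable gradients) is broadly the same rule that the new ``_separate`` / ``_single_separate`` helpers later in this file apply through ``paddle.clone`` and ``detach``.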
- for v in vl: - if v is None: - out.append(v) - continue - if create_graph and not v.stop_gradient: - v = assign(v) - else: - v = v.detach() - v.stop_gradient = False - out.append(v) - return out - - try: - var_lists = [process(vl) for vl in var_lists] - bundle = var_lists + [grad_fn, return_fn] - yield bundle - finally: - pass +from paddle.fluid import framework -@framework.dygraph_only -def vjp(func, inputs, v=None, create_graph=False, allow_unused=False): +def vjp(func, xs, v=None): r"""Computes the Vector-Jacobian product, a functional form of reverse mode automatic differentiation. + Warning: + This API is in beta, the signatures could be changed in future version. + Args: - func(Callable): `func` takes as input a tensor or a list/tuple - of tensors and returns a tensor or a list/tuple of tensors. - inputs(list[Tensor]|tuple[Tensor]|Tensor): used as positional - arguments to evaluate `func`. `inputs` is accepted as one - tensor or a list of tensors. - v(list[Tensor]|tuple[Tensor]|Tensor|None, optional): the - cotangent vector invovled in the VJP computation. `v` matches - the size and shape of `func`'s output. Default value is None - and in this case is equivalent to all ones the same size - of `func`'s output. - create_graph(bool, optional): if `True`, gradients can be - evaluated on the results. If `False`, taking gradients on - the results is invalid. Default value is False. - allow_unused(bool, optional): In case that some Tensors of - `inputs` do not contribute to the computation of the output. - If `allow_unused` is False, an error will be raised, - Otherwise, the gradients of the said inputs are returned - None. Default value is False. + func(Callable): A function that takes ``xs`` as inputs parameter and + returns a sequence of Tensors or a Tensor. + xs(Tensor|Sequence[Tensor]): Used as positional arguments to evaluate + ``func``. ``xs`` is accepted as one Tensor or a sequence of Tensors. + v(Tensor|Sequence[Tensor]|None, optional): The cotangent vector invovled + in the VJP computation. ``v`` matches the size and shape of + ``func`` 's output. Defaults to None, which is equivalent to all + ones the same size of ``func`` 's output. Returns: output(tuple): - func_out(list[Tensor]|tuple[Tensor]|Tensor): the output of - `func(inputs)` - vjp(list[Tensor]): the pullback results of `v` on `func` + + - func_out(Tensor|tuple[Tensor]): The output of ``func(xs)`` . + - vjp(Tensor|tuple[Tensor]): The vjp result. Examples: - .. code-block:: python - - def func(x): - return paddle.matmul(x, x) - - x = paddle.ones(shape=[2, 2], dtype='float32') - output, inputs_grad = vjp(func, x) - print(inputs_grad) - # [Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=True, - # [[4., 4.], - # [4., 4.]])] - - v = paddle.to_tensor([[1.0, 0.0], [0.0, 0.0]]) - output, inputs_grad = vjp(func, x, v) - print(inputs_grad) - # [Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=True, - # [[2., 1.], - # [1., 0.]])] - - output, inputs_grad = vjp(func, x, v, create_graph=True) - print(inputs_grad) - # [Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=False, - # [[2., 1.], - # [1., 0.]])] - - y = paddle.ones(shape=[2, 2], dtype='float32') - def func_unused(x, y): - return paddle.matmul(x, x) - - output, inputs_grad = vjp(func, [x, y], v) - # ValueError: (InvalidArgument) The 1-th input does not appear in the backward graph. - # Please check the input variable or set allow_unused=True to get None result. 
- # [Hint: Expected allow_unused_ == true, but received allow_unused_:0 != true:1.] - - output, inputs_grad = vjp(func, [x, y], v, allow_unused=True) - print(inputs_grad) - # [Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=True, - # [[2., 1.], - # [1., 0.]]), None] + + .. code-block:: python + + import paddle + + def func(x): + return paddle.matmul(x, x) + + x = paddle.ones(shape=[2, 2], dtype='float32') + _, vjp_result = paddle.incubate.autograd.vjp(func, x) + print(vjp_result) + # Tensor(shape=[2, 2], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [[4., 4.], + # [4., 4.]]) + + v = paddle.to_tensor([[1.0, 0.0], [0.0, 0.0]]) + _, vjp_result = paddle.incubate.autograd.vjp(func, x, v) + print(vjp_result) + # Tensor(shape=[2, 2], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [[2., 1.], + # [1., 0.]]) """ - xs = _tensors(inputs, "inputs") - if v is not None: - v = _tensors(v, "v") + _check_inputs(func, xs, v) - with gradient_scope( - xs, v, create_graph=create_graph, - allow_unused=allow_unused) as [xs, v, grad_fn, return_fn]: - outputs = func(*xs) - ys = _tensors(outputs, "outputs") - grads = grad_fn(ys, xs, v) - outputs, grads = return_fn(outputs), return_fn(grads) + # ``_seprate`` breaks the dependencies between ``xs`` and other + # variables. See more ``_seprate`` . + xs, v = _separate(xs), _separate(v) + ys = func(*xs) if isinstance(xs, typing.Sequence) else func(xs) + _check_v_shape(v, ys) - return outputs, grads + return ys, _grad(ys, xs, v) -@framework.dygraph_only -def jvp(func, inputs, v=None, create_graph=False, allow_unused=False): +def jvp(func, xs, v=None): r""" Computes the Jacobian-Vector product for a function at the given inputs and a vector in the tangent space induced by the inputs. - .. note:: - **This API is ONLY available in imperative mode.** + Warning: + This API is in beta, the signatures could be changed in future version. Args: - func(Callable): `func` takes as input a tensor or a list/tuple - of tensors and returns a tensor or a list/tuple of tensors. - inputs(list[Tensor]|tuple[Tensor]|Tensor): used as positional - arguments to evaluate `func`. `inputs` is accepted as one - tensor or a list/tuple of tensors. - v(list[Tensor]|tuple[Tensor]|Tensor|None, optional): the - tangent vector invovled in the JVP computation. `v` matches - the size and shape of `inputs`. `v` is Optional if `func` - returns a single tensor. Default value is None and in this - case is equivalent to all ones the same size of `inputs`. - create_graph(bool, optional): if `True`, gradients can - be evaluated on the results. If `False`, taking gradients - on the results is invalid. Default value is False. - allow_unused(bool, optional): In case that some Tensors of - `inputs` do not contribute to the computation of the output. - If `allow_unused` is False, an error will be raised, - Otherwise, the gradients of the said inputs are returned - None. Default value is False. + func(Callable): The ``func`` takes as input a Tensor or a Sequence + of Tensors and returns a Tensor or a Sequence of Tensors. + xs(Tensor|Sequence[Tensor]): Used as positional arguments to + evaluate ``func``. The ``xs`` is accepted as one Tensor or a + Sequence of Tensors. + v(Tensor|Sequence[Tensor]|None, Optional): The tangent vector invovled + in the JVP computation. The ``v`` matches the size and shape of + ``xs`` . Default value is None and in this case is equivalent to + all ones the same size of ``xs`` . 
Returns: output(tuple): - func_out(list[Tensor]|tuple[Tensor]|Tensor): the output of - `func(inputs)` - jvp(list[Tensor]): the pullback results of `v` on `func` + + - func_out(Tensor|tuple[Tensor]): The output of ``func(xs)`` . + - jvp(Tensor|tuple[Tensor]): The jvp result. + + Examples: + + .. code-block:: python + + import paddle + + + def func(x): + return paddle.matmul(x, x) + + + x = paddle.ones(shape=[2, 2], dtype='float32') + _, jvp_result = paddle.incubate.autograd.jvp(func, x) + print(jvp_result) + # Tensor(shape=[2, 2], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [[4., 4.], + # [4., 4.]]) + v = paddle.to_tensor([[1.0, 0.0], [0.0, 0.0]]) + _, jvp_result = paddle.incubate.autograd.jvp(func, x, v) + print(jvp_result) + # Tensor(shape=[2, 2], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [[2., 1.], + # [1., 0.]]) + + """ + _check_inputs(func, xs, v) + # ``_seprate`` breaks the dependencies between ``xs`` and other + # variables. See more ``_seprate`` . + xs, v = _separate(xs), _separate(v) + ys = func(*xs) if isinstance(xs, typing.Sequence) else func(xs) + _check_v_shape(v, xs) + return ys, _double_backward_trick(ys, xs, v) + + +def _double_backward_trick(ys, xs, v): + """Double backward trick for computing ``jvp`` by ``vjp`` + see details: https://j-towns.github.io/2017/06/12/A-new-trick.html + """ + # The value of ys_grad is not important, it can be any random value in + # theory, but it's required to set stop_gradient=False. + ys_grad = _zeros_like_with_grad(ys) + xs_grad = _grad(ys, xs, ys_grad) + return _grad(xs_grad, ys_grad, v) + + +def _zeros_like_with_grad(xs): + """Create a zero or zeros sequence Tensor like ``xs`` with a flag + ``stop_graident=False`` . + """ + if not isinstance(xs, typing.Sequence): + ys = paddle.zeros_like(xs) + ys.stop_gradient = False + else: + ys = [] + for x in xs: + y = paddle.zeros_like(x) + y.stop_gradient = False + ys.append(y) + return ys + + +class Jacobian(object): + r""" + Computes the Jacobian matrix of a given function. + + If the function has multiple inputs and multiple outputs, during internal + implementation, all input tensors are concatenated after being flatten, + the batch dimension is retained, and the output is subject to the same + processing rules. + + Once the Jacobian ``J`` is constructed, you can use a multidimensional index + to retrieve the submatrix of ``J``, as same as slicing a Tensor. The + submatrix is lazily evaluated along row axis, and will be cached once + evaluated. + + For examples, supposing ``is_batched=True``, you can retrieve the submatrix + by following methods: + + * J[:], retrieving the full matrix. + * J[:, :, j], retrieving the partial derivatives w.r.t. the j'th input + variable. + * J[:, i, :], retrieving the partial derivatives w.r.t. the i'th output + variable. + * J[:, i, j], retrieving the partial derivatives w.r.t. the i'th output + variable and the j'th input variable. + + Notes: + + Eclipsis index is not supported currently. + + Warning: + This API is in beta, the signatures could be changed in future version. + + Args: + + func (Callable): A python function that takes a Tensor or a sequence of + Tensors as inputs(the first dimension is batch size) and + returns a Tensor a sequence of Tensors. + xs (Tensor|Sequence[Tensor]): The input to the function ``func`` . + is_batched (bool): If true, the first axis is batch axis. Defaults to + False. + + Returns: + + Jacobian (Object): A python object retains the Jacobian matrix. + + Examples: + + .. 
code-block:: python + + import paddle + + + def func(x, y): + return paddle.matmul(x, y) + + + x = paddle.to_tensor([[1., 2.], [3., 4.]]) + J = paddle.incubate.autograd.Jacobian(func, [x, x]) + print(J[:, :]) + # Tensor(shape=[4, 8], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [[1., 3., 0., 0., 1., 0., 2., 0.], + # [2., 4., 0., 0., 0., 1., 0., 2.], + # [0., 0., 1., 3., 3., 0., 4., 0.], + # [0., 0., 2., 4., 0., 3., 0., 4.]]) + + print(J[0, :]) + # Tensor(shape=[8], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [1., 3., 0., 0., 1., 0., 2., 0.]) + print(J[:, 0]) + # Tensor(shape=[4], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [1., 2., 0., 0.]) + + """ + + def __init__(self, func, xs, is_batched=False): + if not is_batched: + self._jacobian = _JacobianNoBatch(func, xs) + else: + self._jacobian = _JacobianBatchFirst(func, xs) + + def __getitem__(self, indexes): + return self._jacobian[indexes] + + @property + def shape(self): + """The shape of flattened Jacobian matrix. + """ + return self._jacobian.shape + + +class Hessian(object): + """ + Computes the Hessian matrix with a given ``func`` with respect to ``xs`` . + + If the function has multiple inputs, during internal implementation, + all input tensors are concatenated after being flatten, the batch dimension + is retained. + + The Hessian submatrix is lazily evaluated, and can be retrieved with a + multidimensional indexes. See details ``Jacobian`` . + + Warning: + This API is in beta, the signatures could be changed in future version. + + Args: + func (Callable): A python function that takes a Tensor or a Tensor + sequence as inputs and returns a Tensor with shape + ``[batch_size, 1]`` with batch or ``[1]`` without batch. + xs (Tensor|Sequence(Tensor)): The input Tensor or Tensor sequence of + the function ``func``. + is_batched (bool): If true, the first axis is batch axis. Defaults to + False. + + Returns: + + Hessian (Object): A python object retains the Hessian matrix. + Examples: + .. 
code-block:: python - def func(x): - return paddle.matmul(x, x) + import paddle - x = paddle.ones(shape=[2, 2], dtype='float32') - output, inputs_grad = jvp(func, x) - print(inputs_grad) - # [Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=True, - # [[2., 2.], - # [2., 2.]])] + def reducer(x): + return paddle.sum(x * x) - v = paddle.to_tensor([[1.0, 0.0], [0.0, 0.0]]) - output, inputs_grad = vjp(func, x, v) - print(inputs_grad) - # [Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=True, - # [[1., 1.], - # [0., 0.]])] + x = paddle.rand([2, 2]) + h = paddle.incubate.autograd.Hessian(reducer, x) + print(h[:]) + # Tensor(shape=[4, 4], dtype=float32, place=Place(gpu:0), stop_gradient=False, + # [[2., 0., 0., 0.], + # [0., 2., 0., 0.], + # [0., 0., 2., 0.], + # [0., 0., 0., 2.]]) """ - xs = _tensors(inputs, "inputs") - if v is not None: - v = _tensors(v, "v") - with gradient_scope( - xs, v, create_graph=create_graph, - allow_unused=allow_unused) as [xs, v, grad_fn, return_fn]: - outputs = func(*xs) - ys = _tensors(outputs, "outputs") - ys_grad = [zeros_like(y) for y in ys] - xs_grad = grad_fn(ys, xs, ys_grad, create_graph=True) - ys_grad = grad_fn(xs_grad, ys_grad, v) - outputs, ys_grad = return_fn(outputs), return_fn(ys_grad) + def __init__(self, func, xs, is_batched=False): + def _jac_func(*xs): + jac = Jacobian(func, xs, is_batched=is_batched) + if (is_batched and jac.shape[1] != 1) or (not is_batched and + jac.shape[0] != 1): + raise RuntimeError( + "The function given to Hessian shoud return as single element Tensor or batched single element Tensor." + ) + return jac[:, 0, :] if is_batched else jac[0, :] + + self.symbolic = Jacobian(_jac_func, xs, is_batched=is_batched) + + def __getitem__(self, indexes): + return self.symbolic[indexes] - return outputs, ys_grad + @property + def shape(self): + """The shape of flattened Hessian matrix. + """ + return self.symbolic.shape + + +class _Jacobian(object): + """The base class for computing Jacobian matrix. + + ``_Jacobian`` implementes the core logic of multidimensional index and lazy + evaluation for Jacobian matrix, subclass only need to overwrite following + methods: + + * ``_lazy_axis()``, return the axis along which will be lazy + evaluating. + * ``_flatten(xs)``, flattens the inputs ``xs``. + * ``_evaluate(index)``, evaluates one slice along ``_lazy_axis`` . + + Notes: + + Because currently PaddlePaddle only support reverse differentiation by + ``paddle.grad``, so lazy evaluation is only supported along the row of + Jacobian matrix, which means that slicing along row will get better + performance. 
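A minimal sketch of the practical effect, using the public ``Jacobian`` wrapper defined above; the function and sizes here are illustrative assumptions, not part of the patch:

    .. code-block:: python

        import paddle

        def func(x):
            return paddle.tanh(x)

        x = paddle.rand([3])
        J = paddle.incubate.autograd.Jacobian(func, x)
        row = J[0, :]  # evaluates and caches a single row (one backward pass)
        col = J[:, 0]  # forces every row to be evaluated and cached first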
+ + """ + + def __init__(self, func, xs): + self._xs = _separate(xs) + self._ys = func(*_as_tensors(self._xs)) + self._flatten_xs = self._flatten(_as_tensors(self._xs)) + self._flatten_ys = self._flatten(_as_tensors(self._ys)) + self._cache = {} + + @property + def shape(self): + raise NotImplementedError + + @property + def _lazy_axis(self): + """"The axis of lazily evaluated.""" + raise NotImplementedError + + def _lazy_indexes(self, indexes): + idx = indexes[self._lazy_axis] + return (idx, ) if isinstance( + idx, int) else tuple(range(idx.start, idx.stop, idx.step)) + + def _flatten(self, xs): + raise NotImplementedError + + def _shifted_indexes(self, indexes, lazy_axis_size=0): + idx = indexes[self._lazy_axis] + shifted_lazy_axis_idx = 0 if isinstance( + idx, int) else slice(0, lazy_axis_size, 1) + return indexes[:self._lazy_axis] + (shifted_lazy_axis_idx, + ) + indexes[self._lazy_axis + 1:] + + def __getitem__(self, indexes): + indexes = _multi_index(indexes, self.shape) + + if isinstance(indexes[self._lazy_axis], int): + other_indexes = indexes[:self._lazy_axis] + \ + indexes[self._lazy_axis+1:] + return self._cached_evaluate(indexes[self._lazy_axis])[ + other_indexes] + lazy_indexes = self._lazy_indexes(indexes) + part_jac = paddle.stack( + [self._cached_evaluate(i) for i in lazy_indexes], + axis=self._lazy_axis) + return part_jac[self._shifted_indexes(indexes, len(lazy_indexes))] + + def _cached_evaluate(self, k): + v = self._cache.get(k) + if v is None: + v = self._evaluate(k) + self._cache[k] = v + return v + + def _evaluate(self, index): + """Evaluate one slice at along lazy axis.""" + raise NotImplementedError + + +class _JacobianNoBatch(_Jacobian): + """Compute Jacobian matrix without batch dimension. + Suppose the mapping is :math:`f: R^M \to R^N`, the output shape is + ``(N, M)`` . + """ + + def __init__(self, func, xs): + super(_JacobianNoBatch, self).__init__(func, xs) + + @property + def shape(self): + return (self._flatten_ys.shape[0], self._flatten_xs.shape[0]) + + @property + def _lazy_axis(self): + return 0 + + def _flatten(self, xs): + return paddle.concat(tuple(x.reshape((-1, )) for x in xs)) + + def _evaluate(self, row_index): + return self._flatten(_grad( + self._flatten_ys[row_index], + self._xs, )) + + +class _JacobianBatchLast(_Jacobian): + """Compute Jacobian matrix with batch at last axis. + Suppose the mapping is :math:`f: R^{M,B} \to R^{N,B}`, the output shape is + ``(N, M, B)`` . + """ + + def __init__(self, func, xs): + super(_JacobianBatchLast, self).__init__(func, xs) + + @property + def shape(self): + return (self._flatten_ys.shape[0], self._flatten_xs.shape[0], + self._flatten_xs.shape[1]) + + @property + def _lazy_axis(self): + return 0 + + def _flatten(self, xs): + return paddle.concat( + tuple(x.reshape((-1, x.shape[-1])) for x in _as_tensors(xs)), 0) + + def _evaluate(self, row): + return self._flatten(_grad(self._flatten_ys[row, :], self._xs)) + + +class _JacobianBatchFirst(_Jacobian): + """Compute Jacobian matrix with batch at first axis. + Suppose the mapping is :math:`f: R^{B,M} \to R^{B,N}`, the output shape is + ``(B, N, M)`` . 
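A rough shape sketch, assuming the public ``Jacobian`` wrapper is used with ``is_batched=True`` (the sizes are arbitrary):

    .. code-block:: python

        import paddle

        def func(x):
            return paddle.tanh(x)

        x = paddle.rand([4, 3])  # B=4, M=3
        J = paddle.incubate.autograd.Jacobian(func, x, is_batched=True)
        print(J.shape)  # (4, 3, 3), i.e. (B, N, M)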
+ """ + + def __init__(self, func, xs): + super(_JacobianBatchFirst, self).__init__(func, xs) + + @property + def shape(self): + return (self._flatten_xs.shape[0], self._flatten_ys.shape[1], + self._flatten_xs.shape[1]) + + @property + def _lazy_axis(self): + return 1 + + def _flatten(self, xs): + return paddle.concat( + tuple(x.reshape((x.shape[0], -1)) for x in _as_tensors(xs)), 1) + + def _evaluate(self, row_index): + return self._flatten(_grad(self._flatten_ys[:, row_index], self._xs)) + + +def _multi_index(indexes, shape): + """A tool for parsing N-dimensional index into a standard format. + + Currently supporting the following input format: + * ([positive|negative|slice], ...), the right-most elements can be + omitted. + + The standard format after conversion is a slice tuple which contains N elements: + * ([positive|slice], ..., [positive|slice]) + + Notes: + Ellipsis indexes such as ``(..., i), (i, ...)`` are not supported. + + Args: + indexes (tuple): The input indexes. + shape (tuple): The input shape. + + Returns: + tuple: The standard format index as described above. + """ + indexes = indexes if isinstance(indexes, typing.Sequence) else (indexes, ) + if any(isinstance(i, type(Ellipsis)) for i in indexes): + raise IndexError('Ellipsis index currently is not supported.') + # Fill the right-most elements. + indexes = indexes + (slice(0, None, None), ) * (len(shape) - len(indexes)) + # Convert to positive index. + positive_indexes = [] + for i, index in enumerate(indexes): + if isinstance(index, slice): + index = slice(index.start or 0, index.stop or shape[i], + index.step or 1) + positive_indexes.append( + slice( + index.start + shape[i] if index.start < 0 else index.start, + index.stop + shape[i] if index.stop < 0 else index.stop, + # Negative step means index backward, no need to convert to + # positive integer. + index.step)) + elif isinstance(index, int): + positive_indexes.append(index + shape[i] if index < 0 else index) + else: + raise TypeError(f'Not supported index type {index}.') + return tuple(positive_indexes) + + +def _as_tensors(xs): + return (xs, ) if isinstance(xs, framework.Variable) else xs + + +def _stack_tensor_or_return_none(origin_list): + assert len(origin_list) > 0, "Cannot stack an empty list" + return paddle.stack( + origin_list, axis=0) if isinstance( + origin_list[0], paddle.fluid.framework.Variable) else None + + +def _replace_none_with_zero_tensor(xs, refs): + if xs is None: + xs = paddle.zeros_like(refs) + xs.stop_gradient = refs.stop_gradient + return xs + elif isinstance(xs, typing.Sequence): + return tuple( + _replace_none_with_zero_tensor(x, refs[i]) + for i, x in enumerate(xs)) + else: + return xs + + +def _grad(ys, xs, v=None): + """A gradient function that can be used in dynamic graph and static graph. + + The ``grad`` combines ``paddle.grad`` used in dynamic graph and + ``paddle.static.gradients`` used in static graph, and makes the following changes: + + * The ``allow_unused`` flag is removed and defaults to True internally, + None in outputs will be replaced by zero tensor. + * The ``create_graph`` flag is removed and defaults to True internally, + which only makes sense in dynamic graph. + * When xs is a single Tensor, ``paddle.grad`` returns a list which only + contains one Tensor. It may confuse users, thus in this case we improve + to return a single Tensor from the _grad interface. + + Args: + ys (Tensor|Sequence[Tensor]): The output tensor or tensor sequence of + the graph to compute gradients.
+ xs (Tensor|Sequence[Tensor]): The input tensor or tensor sequence of the graph to + compute gradients. The returned values of this API are the + gradients of inputs . + v (Tensor|Sequence[Tensor]|None, optional): The initial gradient values + of outputs . If grad_outputs is None, the initial gradient values of + outputs would be Tensors filled with 1; if grad_outputs is not None, + it must have the same length as outputs , and in this case, the + initial gradient value of the i-th outputs would be: (1) a Tensor + filled with 1 when the i-th element of grad_outputs is None; + (2) the i-th element of grad_outputs when the i-th element of + grad_outputs is a Tensor. Default None. + + Returns: + Tensor|tuple[Tensor]: Tensor or a tuple of Tensors, whose length is the + same as the Tensor number inside inputs, and the i-th returned + Tensor is the sum of gradients of outputs with respect to the i-th + inputs. + """ + if paddle.fluid._non_static_mode(): + xs_grad = paddle.grad(ys, xs, v, create_graph=True, allow_unused=True) + else: + xs_grad = paddle.static.gradients(ys, xs, v) + + if isinstance(xs, paddle.fluid.framework.Variable): + xs_grad = xs_grad[0] + + return _replace_none_with_zero_tensor(xs_grad, xs) + + +def _separate(xs): + """ + ``_separate`` separates ``xs`` from the computation graph through ``clone`` + or ``detach`` . + + Internally, ``paddle.grad(xs, ys)`` is a stateful API implemented on top of the + computational graph, which will reduce gradients along all paths from ys to xs. + + However, functional autograd APIs such as ``vjp`` and ``jvp`` are stateless, and + only compute gradients with a given ``func`` . + + For example, given a ``func`` :math:`y0=f(x0)`, supposing forward path is: + ``x0 -> y0``, ``x0 -> x1 -> y0`` . + ``paddle.grad(y0, x0)`` will reduce gradients along ``y0->x0`` and + ``y0->x1->x0``, and ``vjp`` only needs to reduce along ``y0->x0``. + + So xs needs to be cloned or detached to break the dependencies with + other variables. + + Examples: + + .. code-block:: python + + import paddle + from paddle.autograd.functional import _separate + + + def func(x, y): + return x * y + + + x = paddle.ones((1,)) + x.stop_gradient = False + + y = func(x, x) + print(paddle.grad(y, x)) + # [Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True, + # [2.])] + + x1, x2 = _separate((x, x)) + y = func(x1, x2) + print(paddle.grad(y, x1)) + # [Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=True, + # [1.])] + + """ + if isinstance(xs, typing.Sequence): + return tuple(_single_separate(x) for x in xs) + else: + return _single_separate(xs) + + +def _single_separate(x): + if x is None: # x may be None because the grad input v defaults to None. + return x + if not x.stop_gradient: + return paddle.clone(x) + else: # use detach to share memory when gradients are not needed.
+ x = x.detach() + x.stop_gradient = False + return x + return x + + +def _check_inputs(func, xs, v=None): + if not callable(func): + raise TypeError(f"Expected 'func' is Callable, but got {type(func)}.") + + if not isinstance(xs, (framework.Variable, typing.Sequence)): + raise TypeError(f"Expected 'xs' is a Tensor|Sequence[Tensor], " + f"but got {type(xs)}.") + if isinstance(xs, typing.Sequence) and not all( + isinstance(x, framework.Variable) for x in xs): + raise TypeError("All elements of 'xs' should be Tensor.") + + if not isinstance(v, (framework.Variable, typing.Sequence, type(None))): + raise TypeError( + f"Expected 'v' is Tensor|Sequence[Tensor]|None, but got {type(v)}.") + + if isinstance(v, typing.Sequence) and not all( + isinstance(e, framework.Variable) for e in v): + raise TypeError("All elements of 'v' should be Tensor.") + + +def _check_v_shape(v, refs): + if v is None: + return + + v, refs = _as_tensors(v), _as_tensors(refs) + if len(refs) != len(v): + raise RuntimeError(f"The argument v is a tuple of invalid length: " + f"should be {len(refs)} but got {len(v)}.") + + for index, (element_v, element_ref) in enumerate(zip(v, refs)): + if element_v.shape != element_ref.shape: + raise RuntimeError( + f"The v[{index}] has invalid shape: should " + f"be {element_ref.shape} but got {element_v.shape}.") @framework.dygraph_only @@ -354,16 +804,18 @@ def func(x, y): # [0., 0., 0., 2.]]), None)) ''' - inputs = _tensors(inputs, "inputs") - outputs = _tensors(func(*inputs), "outputs") + inputs = _as_tensors(inputs) + outputs = _as_tensors(func(*inputs)) fin_size = len(inputs) fout_size = len(outputs) - flat_outputs = tuple(reshape(output, shape=[-1]) for output in outputs) + flat_outputs = tuple( + paddle.reshape( + output, shape=[-1]) for output in outputs) jacobian = tuple() for i, flat_output in enumerate(flat_outputs): jac_i = list([] for _ in range(fin_size)) for k in range(len(flat_output)): - row_k = grad( + row_k = paddle.grad( flat_output[k], inputs, create_graph=create_graph, @@ -371,7 +823,7 @@ def func(x, y): allow_unused=allow_unused) for j in range(fin_size): jac_i[j].append( - reshape( + paddle.reshape( row_k[j], shape=[-1]) if isinstance(row_k[j], paddle.Tensor) else None) jacobian += (tuple( @@ -419,7 +871,7 @@ def batch_jacobian(func, inputs, create_graph=False, allow_unused=False): be a tuple of Tensors. If both of inputs and outputs are Tensor list/tuple, then the Jacobian will be a tuple of tuple of Tensors. Noted that the first dimension of inputs is batch size.
- + For example, the inputs shape and outputs shape of function ``func` is [batch_size, num] and [batch_size, num] respectively, then the Jacobian will be a Tensor with @@ -489,10 +941,10 @@ def func(x, y): # [0., 1., 0., 1., 0., 1., 0., 1.]]), Tensor(shape=[2, 8], dtype=float64, place=CUDAPlace(0), stop_gradient=True, # [[1., 0., 1., 0., 1., 0., 1., 0.], # [0., 1., 0., 1., 0., 1., 0., 1.]])) - + ''' - inputs = _tensors(inputs, "inputs") - outputs = _tensors(func(*inputs), "outputs") + inputs = _as_tensors(inputs) + outputs = _as_tensors(func(*inputs)) batch_size = inputs[0].shape[0] for input in inputs: assert input.shape[ @@ -503,13 +955,13 @@ def func(x, y): fin_size = len(inputs) fout_size = len(outputs) flat_outputs = tuple( - reshape( + paddle.reshape( output, shape=[batch_size, -1]) for output in outputs) jacobian = tuple() for i, flat_output in enumerate(flat_outputs): jac_i = list([] for _ in range(fin_size)) for k in range(flat_output.shape[1]): - row_k = grad( + row_k = paddle.grad( flat_output[:, k], inputs, create_graph=create_graph, @@ -517,7 +969,7 @@ def func(x, y): allow_unused=allow_unused) for j in range(fin_size): jac_i[j].append( - reshape( + paddle.reshape( row_k[j], shape=[-1]) if isinstance(row_k[j], paddle.Tensor) else None) jacobian += (tuple( @@ -569,7 +1021,7 @@ def batch_hessian(func, inputs, create_graph=False, allow_unused=False): the inputs shape and outputs shape of function ``func` is [batch_size, num] and [batch_size, 1] respectively, then the batched Hessian will be a Tensor with a shape of [num, batch_size * num]. - + Why the final shape in this case is that? because batch_hessian will create a inner func(the wrapper of paddle.grad() func) to computes the sum of gradients of `outputs` with respect to each `inputs`, @@ -579,7 +1031,7 @@ def batch_hessian(func, inputs, create_graph=False, allow_unused=False): matrix of the ``i``th column output(Noted that this output means the first order differentiation) and the ``j``th input and will have same dtype and device as the corresponding input. Other situations can be deduced by analogy. - + Examples 1: .. code-block:: python @@ -592,8 +1044,8 @@ def batch_hessian(func, inputs, create_graph=False, allow_unused=False): def func(x): return paddle.matmul(x * x, weight)[:, 0:1] - - + + x.stop_gradient = False batch_hessian = paddle.autograd.batch_hessian(func, x) print(batch_hessian) @@ -612,7 +1064,7 @@ def func(x): def func(x, y): return paddle.matmul(x * x * y * y, weight)[:, 0:1] - + x.stop_gradient = False y.stop_gradient = False batch_hessian = paddle.autograd.batch_hessian(func, [x, y]) @@ -629,7 +1081,7 @@ def func(x, y): # Tensor(shape=[2, 8], dtype=float64, place=CUDAPlace(0), stop_gradient=True, # [[2., 0., 2., 0., 2., 0., 2., 0.], # [0., 2., 0., 2., 0., 2., 0., 2.]]))) - + Examples 3: .. 
code-block:: python @@ -639,7 +1091,7 @@ def func(x, y): x = paddle.ones(shape=(4, 2), dtype='float64') weight = paddle.ones(shape=(2, 4), dtype='float64') y = paddle.ones(shape=(4, 2), dtype='float64') - + def func(x, y): return paddle.matmul(x * x, weight)[:, 0:1] @@ -652,7 +1104,7 @@ def func(x, y): # [0., 2., 0., 2., 0., 2., 0., 2.]]), None), (None, None)) ''' - inputs = _tensors(inputs, "inputs") + inputs = _as_tensors(inputs) outputs = func(*inputs) batch_size = inputs[0].shape[0] for input in inputs: @@ -663,7 +1115,7 @@ def func(x, y): ], "The function to compute batched Hessian matrix should return a Tensor of shape [batch_size, 1]" def jac_func(*ins): - grad_inputs = grad( + grad_inputs = paddle.grad( outputs, ins, create_graph=True, @@ -715,7 +1167,7 @@ def hessian(func, inputs, create_graph=False, allow_unused=False): def func(x): return paddle.sum(paddle.matmul(x, x)) - + x = paddle.ones(shape=[2, 2], dtype='float32') x.stop_gradient = False hessian = paddle.autograd.hessian(func, x) @@ -733,7 +1185,7 @@ def func(x): def func(x, y): return paddle.sum(paddle.matmul(x, y)) - + x = paddle.ones(shape=[2, 2], dtype='float32') y = paddle.ones(shape=[2, 2], dtype='float32') x.stop_gradient = False @@ -768,7 +1220,7 @@ def func(x, y): def func(x, y): return paddle.sum(paddle.matmul(x, x)) - + x = paddle.ones(shape=[2, 2], dtype='float32') y = paddle.ones(shape=[2, 2], dtype='float32') x.stop_gradient = False @@ -782,14 +1234,14 @@ def func(x, y): # [0., 1., 1., 2.]]), None), (None, None)) ''' - inputs = _tensors(inputs, "inputs") + inputs = _as_tensors(inputs) outputs = func(*inputs) assert isinstance(outputs, paddle.Tensor) and outputs.shape == [ 1 ], "The function to compute Hessian matrix should return a Tensor with a single element" def jac_func(*ins): - grad_inputs = grad( + grad_inputs = paddle.grad( outputs, ins, create_graph=True, @@ -803,7 +1255,6 @@ def jac_func(*ins): jac_func, inputs, create_graph=create_graph, allow_unused=allow_unused) -@framework.dygraph_only def vhp(func, inputs, v=None, create_graph=False, allow_unused=False): ''' .. 
note:: @@ -839,7 +1290,7 @@ def vhp(func, inputs, v=None, create_graph=False, allow_unused=False): import paddle def func(x): return paddle.sum(paddle.matmul(x, x)) - + x = paddle.ones(shape=[2, 2], dtype='float32') x.stop_gradient = False vx = paddle.ones(shape=[2, 2], dtype='float32') * 2 @@ -856,7 +1307,7 @@ def func(x): import paddle def func(x): return paddle.sum(paddle.matmul(x, x)) - + x = paddle.ones(shape=[2, 2], dtype='float32') x.stop_gradient = False vhp_rslt = paddle.autograd.vhp(func, x) @@ -872,7 +1323,7 @@ def func(x): import paddle def func(x, y): return paddle.sum(paddle.matmul(x, x)) - + x = paddle.ones(shape=[2, 2], dtype='float32') x.stop_gradient = False y = paddle.ones(shape=[2, 2], dtype='float32') @@ -887,177 +1338,17 @@ def func(x, y): # [[8., 8.], # [8., 8.]]), None]) ''' - xs = _tensors(inputs, "inputs") + xs = _as_tensors(inputs) if v is not None: - v = _tensors(v, "v") - - with gradient_scope( - xs, v, create_graph=create_graph, - allow_unused=allow_unused) as [xs, v, grad_fn, return_fn]: - outputs = func(*xs) - ys = _tensors(outputs, "outputs") - assert len(ys) == 1 and isinstance( - ys[0], paddle.Tensor - ) and ys[0].shape == [ - 1 - ], "The function to compute vhp should return a Tensor with a single element" - jac = grad_fn(ys, xs, create_graph=True) - vhp = grad_fn(jac, xs, v) - outputs, vhp = return_fn(outputs), return_fn(vhp) + v = _as_tensors(v) + xs, v = _separate(xs), _separate(v) + outputs = func(*xs) + ys = _as_tensors(outputs) + assert len(ys) == 1 and isinstance( + ys[0], framework.Variable + ) and ys[0].shape == [ + 1 + ], "The function to compute vhp should return a Tensor with a single element" + jac = _grad(ys, xs) + vhp = _grad(jac, xs, v) return outputs, vhp - - -class Jacobian(object): - r""" - Computes the Jacobian matrix of function `func`, which may take as input - single or multiple tensor typed arguments and output a single tensor or - multiple tensors. - - In case `func` is multi-input and multi-output, i.e., - - func: Callable[[Tensor, ...], [Tensor, ...]] - - `func` is treated as a vector valued function with all its inputs flattened - into a single one dimensional tensor, or a two dimensional tensor with the - first dimension retained as the batching dimension. The same rule applies to - the function outputs. - - Once the Jacobian J is constructed, there are four ways to retrieve the - partial derivatives. - - - J[:], retrieving the full matrix. - - - J[:, j], retrieving the partial derivatives w.r.t. the j'th input - variable. - - - J[i, :], retrieving the partial derivatives w.r.t. the i'th output - variable. - - - J[i, j], retrieving the partial derivatives w.r.t. the i'th output - variable and the j'th input variable. - - Examples: - .. code-block:: python - import paddle - import numpy as np - - def func(xs): - x, y = xs - return paddle.matmul(x, y) - - main = fluid.Program() - startup = fluid.Program() - with fluid.program_guard(main, startup): - x = paddle.static.data(name='x', shape=[2, 2], dtype='float32') - JJ = paddle.autograd.functional.Jacobian(func, [x, x]) - nrow, ncol = JJ.shape() - full_jacobian = JJ[:] - place = fluid.CUDAPlace(0) - exe = fluid.Executor(place) - exe.run(startup) - - feeds = {'x': np.array([[2., 2.], [2., 1.]]).astype('float32')} - jacobian = exe.run(main, feed=feeds, fetch_list=[full_jacobian])[0] - print(jacobian) - # [[4. 2. 2. 0. 4. 2. 2. 0.] - # [2. 3. 0. 2. 2. 3. 0. 2.] - # [2. 0. 3. 2. 2. 0. 3. 2.] - # [0. 2. 2. 2. 0. 2. 2. 
2.]] - """ - - def __init__(self, func, inputs, batch=False): - r"""Constructing a Jacobian matrix. - - Parameters: - func (Callable): a Python function that takes as input a Tensor - or a Tensor list and outputs a Tensor or a Tensor list. - inputs (Tensor|list[Tensor]): a Tensor or a list of Tensors as - `func`'s input. - batch (bool): if True the 0'th axis is considered the batch - dimension, both on input and output. - """ - - def enable_grads(inputs): - if isinstance(inputs, (list, tuple)): - for x in inputs: - x.stop_gradient = False - else: - assert isinstance(inputs, paddle.fluid.framework.Variable), ( - f"Expecting {inputs} to be paddle.fluid.framework.Variable," - f" however it's found to be a(n) {type(inputs)}.") - inputs.stop_gradient = False - return inputs - - self.batch = batch - self.xs = enable_grads(inputs) - ys = func(inputs) - if not isinstance(ys, list): - ys = [ys] - self.y = self.flatten_all(ys) - self.ydim = self.y.shape[-1] - self.xdim = self.flatten_all(inputs).shape[-1] - self.bdim = self.y.shape[0] - self.jacobian = {} - - def flatten(self, x): - to = [x.shape[0], -1] if self.batch else [-1] - return x.reshape(to) - - def flatten_all(self, xs): - if isinstance(xs, (list, tuple)): - return paddle.concat([self.flatten(x) for x in xs], axis=-1) - else: - return self.flatten(xs) - - def shape(self): - return (self.ydim, self.xdim) - - def __getitem__(self, tup): - if hasattr(tup, '__iter__'): - i, j = tup - else: - i, j = tup, None - - full = isinstance(i, slice) - - if full: - if 'full' not in self.jacobian: - rows = [ - self.flatten_all(gradients(self.y[..., i], self.xs)) - for i in range(self.ydim) - ] - self.jacobian['full'] = full_jacobian = paddle.stack(rows) - else: - full_jacobian = self.jacobian['full'] - - return full_jacobian[i] if j is None else full_jacobian[i][..., j] - - assert 0 <= i < self.ydim, f"Jacobian index i={i} is not valid." - assert j is None or isinstance(j, slice) or (0 <= j < self.xdim), ( - f"Jacobian index j={j} is not valid.") - if 'full' in self.jacobian: - JJ = self.jacobian['full'] - else: - JJ = self.jacobian - if i not in self.jacobian: - self.jacobian[i] = self.flatten_all( - gradients(self.y[..., i], self.xs)) - - if j is None: - return JJ[i] - else: - return JJ[i][..., j] - - -class Hessian(object): - def __init__(self, func, inputs, batch=False): - f_x = lambda xs: Jacobian(func, xs, batch=batch)[0] - self.symbolic = Jacobian(f_x, inputs, batch=batch) - self.xs = inputs - self.batch = batch - - def __getitem__(self, tup): - return self.symbolic[tup] - - def shape(self): - return self.symbolic.shape() diff --git a/python/paddle/autograd/utils.py b/python/paddle/autograd/utils.py deleted file mode 100644 index 710c9ee18dfbf..0000000000000 --- a/python/paddle/autograd/utils.py +++ /dev/null @@ -1,45 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import paddle - - -def _tensors(ts, name): - if isinstance(ts, (list, tuple)): - assert len(ts) > 0, "{} connot be empty".format(name) - for each_t in ts: - assert isinstance( - each_t, paddle.Tensor - ) or each_t is None, "Elements of {} must be paddle.Tensor or None".format( - name) - return list(ts) - else: - assert isinstance(ts, paddle.Tensor), "{} must be Tensor".format(name) - return [ts] - - -def _stack_tensor_or_return_none(origin_list): - assert len(origin_list) > 0, "Can't not stack an empty list" - return paddle.stack( - origin_list, axis=0) if isinstance(origin_list[0], - paddle.Tensor) else None - - -def _replace_none_with_zero_tensor(t, spec_t): - if t is None: - zero_t = paddle.zeros(shape=spec_t.shape, dtype=spec_t.dtype) - zero_t.stop_gradient = spec_t.stop_gradient - return zero_t - else: - return t diff --git a/python/paddle/common_ops_import.py b/python/paddle/common_ops_import.py index 9897480858946..de8056f280a39 100644 --- a/python/paddle/common_ops_import.py +++ b/python/paddle/common_ops_import.py @@ -22,7 +22,7 @@ from paddle.fluid import core, dygraph_utils from paddle.fluid.data_feeder import check_type, check_dtype, check_variable_and_dtype, convert_dtype from paddle.fluid.layers import fill_constant, utils, scale -from paddle.fluid.layers.layer_function_generator import templatedoc +from paddle.tensor.layer_function_generator import templatedoc import paddle.fluid as fluid import numpy import warnings diff --git a/python/paddle/dataset/image.py b/python/paddle/dataset/image.py index c36213282c59c..a094529edf575 100644 --- a/python/paddle/dataset/image.py +++ b/python/paddle/dataset/image.py @@ -54,7 +54,10 @@ if retcode != 0: cv2 = None else: - import cv2 + try: + import cv2 + except ImportError: + cv2 = None else: try: import cv2 diff --git a/python/paddle/distributed/collective.py b/python/paddle/distributed/collective.py index 6dbd7d228eefa..a5ea528d13450 100644 --- a/python/paddle/distributed/collective.py +++ b/python/paddle/distributed/collective.py @@ -17,6 +17,7 @@ from datetime import timedelta from ..fluid.layer_helper import LayerHelper from ..fluid.framework import Variable +from ..fluid.framework import in_dygraph_mode from ..fluid.framework import OpProtoHolder from ..fluid.framework import _non_static_mode from ..fluid.framework import convert_np_dtype_to_dtype_ @@ -142,6 +143,16 @@ def _get_global_env(): _default_backend = None +def _set_default_backend(backend): + global _default_backend + _default_backend = backend + + +def _set_default_store(store): + global _default_store + _default_store = store + + def _get_group_map(): global _group_map if not _group_map: @@ -157,25 +168,31 @@ def _get_global_group(): def _get_group_map_by_name(): global _group_map_by_name - assert _default_group_name in _group_map_by_name, ( - "Call paddle.distributed.init_parallel_env first " - "to initialize the distributed environment.") return _group_map_by_name def _get_default_group(): + global _group_map_by_name assert _default_group_name in _group_map_by_name, ( "Call paddle.distributed.init_parallel_env first " "to initialize the distributed environment.") return _get_group_map_by_name()[_default_group_name] -def _new_ring_id(): - return len(_get_group_map()) + max(_get_global_env().nrings, 9) +def _set_group_map(gid, group): + global _group_map + assert gid not in _group_map + _group_map[gid] = group + + +def _set_group_map_by_name(name, group): + global _group_map_by_name + assert name not in _group_map_by_name + _group_map_by_name[name] = group -def 
_new_group_name_id(): - return len(_get_group_map_by_name()) + max(_get_global_env().nrings, 9) +def _new_ring_id(): + return len(_get_group_map()) + max(_get_global_env().nrings, 9) def get_group(id=0): @@ -202,194 +219,25 @@ def get_group(id=0): return gm[id] if id in gm else None -def _new_process_group_impl(backend, store, rank, world_size, group_name, - pg_options): - if backend == "gloo": - gloo_store = core.GlooStore(store) - +def _new_process_group_impl(backend, + store, + rank, + world_size, + group_name, + pg_options, + group_id=0): pg = None + assert backend in _valid_backend_list, "Unsupported backend: %s." % backend if backend == "gloo": - pg = core.ProcessGroupGloo(gloo_store, rank, world_size) + pg = core.ProcessGroupGloo(store, rank, world_size, group_id) elif backend == "nccl": - pg = core.ProcessGroupNCCL(store, rank, world_size) + pg = core.ProcessGroupNCCL(store, rank, world_size, group_id) elif backend == "hccl": - pg = core.ProcessGroupHCCL(store, rank, world_size) + pg = core.ProcessGroupHCCL(store, rank, world_size, group_id) return pg -def _init_parallel_env(rank=None, - world_size=None, - backend="nccl", - timeout=timedelta(0), - pg_options=None): - """ - - Initializes the default distributed environment. - - Args: - rank (int, optional): the rank of the current process or device from 0 to world_size (exclusive). - If you launch your training with paddle.distributed.run or - paddle.distributed.launch module, None can be given. Default: None. - world_size (int, optional): total number of processes or devices. - If you launch your training with paddle.distributed.run or - paddle.distributed.launch module, None can be given. Default: None. - backend (str, optional): the name of the backend used to initialize - the distributed environment. The value can be one of 'nccl' for - GPU, 'gloo' for CPU or 'hccl' for NPU. Default: 'nccl'. - timeout (datetime.timedelta, optional): timeout used for operations of - the group. Default: datetime.timedelta(0) which means no timeout. - pg_options (dict, optional): options for the group. Default: None. - - Returns: - Group: a group. - - Examples: - - .. code-block:: python - - # filename: train.py - import paddle - paddle.distributed.init_parallel_env(0, 1) - - # how to start - # python paddle.distributed.run --gpus="0,1" train.py - - """ - - global _group_map_by_name - global _default_group_name - assert _default_group_name not in _group_map_by_name, ( - "The default distributed environment has been initialized.") - - assert backend in _valid_backend_list, ( - "Backend must be one of {}, but the given one is: {}".format( - _valid_backend_list, backend)) - _default_backend = backend - - assert isinstance(timeout, timedelta), ( - "timeout must be of the type datetime.timedelta.") - - if rank is None or world_size is None: - assert rank is None and world_size is None, ( - "rank and world_size should be unset at the same time.") - trainer_id = os.getenv("PADDLE_TRAINER_ID", None) - trainer_num = os.getenv("PADDLE_TRAINERS_NUM", None) - if trainer_id is None or trainer_num is None: - warnings.warn("If rank and world_size are both None, please start " - "your training with paddle.distributed.run or " - "paddle.distributed.launch module. Otherwise, " - "init_parallel_env will do nothing.") - return None - rank = int(trainer_id) - world_size = int(trainer_num) - - assert rank >= 0 and world_size > rank and world_size > 1, ( - "rank must be non-negative and world_size must be the " - "maximum rank plus one. 
Moreover, at least two processes are " - "required to create a process group.") - - master_addr = os.getenv("MASTER_ADDR", None) - master_port = os.getenv("MASTER_PORT", None) - if not master_addr or not master_port: - endpoints = os.getenv("PADDLE_MASTER", None) - if endpoints is None: - endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS", None) - if not endpoints: - raise ValueError( - "The environment variable 'MASTER_ADDR' and 'MASTER_PORT' " - "must be specified, for example 'export MASTER_ADDR=127.0.0.1' " - "and 'export MASTER_ADDR=54612'. Or you can start your training" - "with paddle.distributed.run or " - "paddle.distributed.luanch module.") - if ',' in endpoints: - endpoints = endpoints.split(',')[0] - master_addr, master_port = endpoints.split(":") - - master_port = int(master_port) - - is_master = rank == 0 - global _default_store - _default_store = core.TCPStore(master_addr, master_port, is_master, - world_size, timeout) - - pg = _new_process_group_impl(backend, _default_store, rank, world_size, - _default_group_name, pg_options) - ranks = list(range(world_size)) - group = Group( - rank, world_size, id=0, ranks=ranks, pg=pg, name=_default_group_name) - - paddle.fluid.dygraph.parallel_helper._set_parallel_ctx(True) - _group_map_by_name[_default_group_name] = group - return group - - -def _new_group(ranks=None, - backend=None, - group_name=None, - timeout=timedelta(0), - pg_options=None): - """ - Create a new process group. - - Args: - ranks (list, optional): list of ranks for the new group. If None is given, - all processes is used. Default: None. - backend (str, optional): the name of the backend used to initialize - the distributed environment. Default: the one for init_parallel_env. - timeout (datetime.timedelta, optional): timeout used for operations of - the group. Default: datetime.timedelta(0). - pg_options (dict, optional): options for the group. Default: None. - - Examples: - - .. code-block:: python - - import paddle - paddle.distributed.init_parallel_env(0, 1) - paddle.distributed.new_group([0, 1]) - - # how to start - # python paddle.distributed.run --gpus="0,1" train.py - - """ - global _default_group_name - if group_name is None: - group_name = _default_group_name + str(_new_group_name_id()) - if group_name == _default_group_name: - raise ValueError("group_name must be specified and it cannot be '{}' " - "which is used for the default process group created " - "by init_parallel_env.".format(_default_group_name)) - global_group = _get_default_group() - global_rank = global_group.rank - global_ranks = global_group.ranks - if ranks is None: - ranks = global_ranks - assert len(ranks) <= len(global_ranks), ( - "Size of new group must be less than or " - "equal to that of the default global group.") - size = len(ranks) - assert size > 1, "A group must have at least two memebers." 
- ranks = sorted(ranks) - if global_rank in ranks: - rank = ranks.index(global_rank) - pg = _new_process_group_impl(backend, _default_store, rank, size, - group_name, pg_options) - else: - rank = -1 - pg = None - group = Group( - rank, - size, - id=_new_group_name_id(), - ranks=ranks, - pg=pg, - name=group_name) - _group_map_by_name[group_name] = group - - return group - - def barrier(group=None): """ @@ -414,6 +262,12 @@ def barrier(group=None): if group is not None and not group.is_member(): return + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + task = group.process_group.barrier() + task.wait() + return + ring_id = 0 if group is None else group.id temp = fill_constant([1], dtype="int32", value="1") @@ -455,6 +309,40 @@ def new_group(ranks=None, backend=None): paddle.distributed.all_reduce(tindata, group=gp, use_calc_stream=False) """ + global _group_map + if in_dygraph_mode(): + global _default_group_name + gid = _new_ring_id() + group_name = _default_group_name + str(gid) + global_group = _get_default_group() + global_rank = global_group.rank + global_ranks = global_group.ranks + backend = _default_backend if backend is None else backend + if ranks is None: + ranks = global_ranks + assert len(ranks) <= len(global_ranks), ( + "Size of new group must be less than or " + "equal to that of the default global group.") + size = len(ranks) + ranks = sorted(ranks) + if global_rank in ranks and size > 1: + rank = ranks.index(global_rank) + pg = _new_process_group_impl( + backend, + _default_store, + rank, + size, + group_name, + pg_options=None, + group_id=gid) + else: + rank = -1 + pg = None + group = Group(rank, size, id=gid, ranks=ranks, pg=pg, name=group_name) + _group_map_by_name[group_name] = group + _group_map[gid] = group + + return group if not backend: backend = 'nccl' @@ -465,7 +353,6 @@ def new_group(ranks=None, backend=None): ring_id = _new_ring_id() - global _group_map if global_rank not in ranks: gp = Group(-1, -1, ring_id, ranks) _group_map[ring_id] = gp @@ -628,7 +515,18 @@ def broadcast(tensor, src, group=None, use_calc_stream=True): if not isinstance(src, int): raise ValueError("src should be int.") - ring_id = 0 if group is None else group.id + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + gsrc = group.get_group_rank(src) + assert gsrc >= 0, ("src rank out of group, need global rank") + task = group.process_group.broadcast(tensor, gsrc) + if use_calc_stream: + task.wait() + return None + else: + return task + + ring_id = ring_id = 0 if group is None else group.id gsrc = src if group is None else group.get_group_rank(src) assert gsrc >= 0, ("src rank out of group, need global rank") @@ -701,6 +599,23 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=None, use_calc_stream=True): if group is not None and not group.is_member(): return + if in_dygraph_mode(): + if op == ReduceOp.SUM: + op_type = core.ReduceOp.SUM + elif op == ReduceOp.MAX: + op_type = core.ReduceOp.MAX + elif op == ReduceOp.MIN: + op_type = core.ReduceOp.MIN + else: + raise ValueError("Unknown reduce_op type for allreduce.") + group = _get_default_group() if group is None else group + task = group.process_group.allreduce(tensor, op_type) + if use_calc_stream: + task.wait() + return None + else: + return task + ring_id = 0 if group is None else group.id if _non_static_mode(): if op == ReduceOp.SUM: @@ -721,9 +636,6 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=None, use_calc_stream=True): check_variable_and_dtype( tensor, 'tensor', 
['float16', 'float32', 'float64', 'int32', 'int64'], 'all_reduce') - if not op in [ReduceOp.SUM, ReduceOp.MAX, ReduceOp.MIN, ReduceOp.PROD]: - raise ValueError("The op for all_reduce must be one of educeOp.PROD, " - "ReduceOp.SUM, ReduceOp.MAX, ReduceOp.MIN.") if op == ReduceOp.SUM: op_type = 'c_allreduce_sum' elif op == ReduceOp.MAX: @@ -789,8 +701,24 @@ def reduce(tensor, dst, op=ReduceOp.SUM, group=None, use_calc_stream=True): if group is not None and not group.is_member(): return - if not isinstance(dst, int): - raise ValueError("dst should be int.") + if in_dygraph_mode(): + if op == ReduceOp.SUM: + op_type = core.ReduceOp.SUM + elif op == ReduceOp.MAX: + op_type = core.ReduceOp.MAX + elif op == ReduceOp.MIN: + op_type = core.ReduceOp.MIN + else: + raise ValueError("Unknown reduce_op type for reduce.") + group = _get_default_group() if group is None else group + gdst = group.get_group_rank(dst) + assert gdst >= 0, ("dst rank out of group, need global rank") + task = group.process_group.reduce(tensor, gdst, op_type) + if use_calc_stream: + task.wait() + return None + else: + return task ring_id = 0 if group is None else group.id gdst = dst if group is None else group.get_group_rank(dst) @@ -820,9 +748,6 @@ def reduce(tensor, dst, op=ReduceOp.SUM, group=None, use_calc_stream=True): check_variable_and_dtype( tensor, 'tensor', ['float16', 'float32', 'float64', 'int32', 'int64'], 'all_reduce') - if not op in [ReduceOp.SUM, ReduceOp.MAX, ReduceOp.MIN, ReduceOp.PROD]: - raise ValueError("The op for reduce must be one of educeOp.PROD, " - "ReduceOp.SUM, ReduceOp.MAX, ReduceOp.MIN.") if op == ReduceOp.SUM: op_type = 'c_reduce_sum' @@ -897,6 +822,15 @@ def all_gather(tensor_list, tensor, group=None, use_calc_stream=True): if group is not None and not group.is_member(): return + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + out = paddle.concat(tensor_list) + task = group.process_group.all_gather(tensor, out) + task.wait() + tensor_list.clear() + tensor_list.extend(paddle.split(out, group.nranks, 0)) + return + ring_id = 0 if group is None else group.id nranks = _get_global_group().nranks if group is None else group.nranks @@ -985,17 +919,31 @@ def scatter(tensor, tensor_list=None, src=0, group=None, use_calc_stream=True): if not isinstance(src, int): raise ValueError("src should be int.") - ring_id = 0 if group is None else group.id - gsrc = src if group is None else group.get_group_rank(src) + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + gsrc = group.get_group_rank(src) + rank = group.rank + nranks = group.nranks + else: + ring_id = 0 if group is None else group.id + gsrc = src if group is None else group.get_group_rank(src) + rank = _get_global_group().rank if group is None else group.rank + nranks = _get_global_group().nranks if group is None else group.nranks assert gsrc >= 0, ("src rank out of group, need global rank") - rank = _get_global_group().rank if group is None else group.rank - nranks = _get_global_group().nranks if group is None else group.nranks if rank != gsrc: tensor_list = [] for _ in range(nranks): tensor_list.append(tensor) temp = paddle.concat(tensor_list, axis=0) + if in_dygraph_mode(): + task = group.process_group.scatter(temp, tensor, gsrc) + if use_calc_stream: + task.wait() + return None + else: + return task + if _non_static_mode(): return _C_ops.c_scatter(temp, tensor, 'use_calc_stream', use_calc_stream, 'ring_id', ring_id, 'nranks', @@ -1070,11 +1018,12 @@ def _c_concat(tensor, group=None): 
""" if group is not None and not group.is_member(): return - ring_id = 0 if group is None else group.id + group = _get_default_group() if group is None else group + ring_id = group.id global_rank = _get_global_env().rank - rank = global_rank if group is None else group.get_group_rank(global_rank) - nranks = _get_global_env().world_size if group is None else group.nranks + rank = group.rank + nranks = group.nranks if _non_static_mode(): return _C_ops.c_concat(tensor, 'ring_id', ring_id, 'use_calc_stream', @@ -1765,9 +1714,21 @@ def alltoall(in_tensor_list, out_tensor_list, group=None, use_calc_stream=True): if group is not None and not group.is_member(): return - ring_id = 0 if group is None else group.id + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + else: + ring_id = 0 if group is None else group.id + temp = paddle.concat(in_tensor_list, axis=0) nranks = len(in_tensor_list) + if in_dygraph_mode(): + out = paddle.concat(out_tensor_list, axis=0) + task = group.process_group.alltoall(temp, out) + task.wait() + out_tensor_list.clear() + out_tensor_list.extend(paddle.split(out, nranks, 0)) + return + if _non_static_mode(): out = _C_ops.alltoall(temp, 'use_calc_stream', use_calc_stream, 'ring_id', ring_id) @@ -1834,6 +1795,16 @@ def send(tensor, dst=0, group=None, use_calc_stream=True): """ if group is not None and not group.is_member(): return + + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + task = group.process_group.send(tensor, dst) + if use_calc_stream: + task.wait() + return None + else: + return task + ring_id = 0 if group is None else group.id if _non_static_mode(): @@ -1887,6 +1858,16 @@ def recv(tensor, src=0, group=None, use_calc_stream=True): """ if group is not None and not group.is_member(): return + + if in_dygraph_mode(): + group = _get_default_group() if group is None else group + task = group.process_group.recv(tensor, src) + if use_calc_stream: + task.wait() + return None + else: + return task + ring_id = 0 if group is None else group.id if _non_static_mode(): diff --git a/python/paddle/distributed/fleet/base/fleet_base.py b/python/paddle/distributed/fleet/base/fleet_base.py index ae1a63d72a5cf..4e975e74bdb14 100755 --- a/python/paddle/distributed/fleet/base/fleet_base.py +++ b/python/paddle/distributed/fleet/base/fleet_base.py @@ -1668,7 +1668,7 @@ def _minimize_impl(self, opt_info["mpi_rank"] = self.worker_index() for k, v in self._user_defined_strategy.trainer_desc_configs.items( ): - if v: + if v or k not in opt_info: opt_info[k] = v program._fleet_opt = opt_info @@ -1745,7 +1745,7 @@ def _minimize_losses_impl(self, opt_info["mpi_rank"] = self.worker_index() for k, v in self._user_defined_strategy.trainer_desc_configs.items( ): - if v: + if v or k not in opt_info: opt_info[k] = v program._fleet_opt = opt_info # print("fleet base opt info:", id(program), program._fleet_opt) diff --git a/python/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py b/python/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py index 52468ab533496..c4d42f90615fc 100755 --- a/python/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py +++ b/python/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py @@ -1621,13 +1621,8 @@ def _create_gm_cond(self, main_block): persistable=True, force_cpu=True) - cond_var = layers.create_global_var( - name="gradient_merge_cond", - shape=[1], - value=bool(0), - dtype='bool', - persistable=False, - force_cpu=True) + cond_var = main_block.create_var( + 
name="gradient_merge_cond", shape=[1], dtype='bool') with device_guard("cpu"): # step_var = (step_var + 1) % k_step diff --git a/python/paddle/distributed/fleet/runtime/the_one_ps.py b/python/paddle/distributed/fleet/runtime/the_one_ps.py index 47e1c64f9954d..c90fab6af5c15 100644 --- a/python/paddle/distributed/fleet/runtime/the_one_ps.py +++ b/python/paddle/distributed/fleet/runtime/the_one_ps.py @@ -984,7 +984,7 @@ def _get_tables(): table_proto.accessor) else: table.type = "PS_DENSE_TABLE" - table.table_class = "CommonDenseTable" + table.table_class = "MemoryDenseTable" table.shard_num = 256 common.table_name = "MergedDense" diff --git a/python/paddle/distributed/fleet/utils/hybrid_parallel_util.py b/python/paddle/distributed/fleet/utils/hybrid_parallel_util.py index 75aa9766e7b28..1285e1f3323ff 100644 --- a/python/paddle/distributed/fleet/utils/hybrid_parallel_util.py +++ b/python/paddle/distributed/fleet/utils/hybrid_parallel_util.py @@ -20,6 +20,7 @@ import paddle from paddle.fluid import core from paddle.fluid.dygraph.parallel import _split_tensors, sync_params_buffers, build_groups +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph from collections import OrderedDict from .log_util import logger @@ -58,6 +59,30 @@ def _apply_collective_grads(parameters, comm_group): _split_tensors(coalesced_grads_and_vars) +def _apply_collective_grads_eager(parameters, comm_group): + grad_var_set = set() + grad_vars = [] + + for param in parameters: + if param.trainable and (param._grad_ivar() is not None): + g_var = param._grad_ivar() + assert not g_var.is_sparse( + ), "Now, it doesn't support sparse parameters" + grad_vars.append(g_var) + assert g_var not in grad_var_set + grad_var_set.add(g_var) + + coalesced_grads_and_vars = build_groups(grad_vars, 128 * 1024 * 1024) + + div_factor = 1.0 / comm_group.nranks + for coalesced_grad, _, _ in coalesced_grads_and_vars: + # need to div nranks + coalesced_grad.scale_(div_factor) + paddle.distributed.all_reduce(coalesced_grad, group=comm_group) + + _split_tensors(coalesced_grads_and_vars) + + def _broadcast_data_help(data, shape, dtype, hcg): model_parallel_group = hcg.get_model_parallel_group() src_rank = hcg.get_model_parallel_group_src_rank() @@ -115,10 +140,17 @@ def broadcast_dp_parameters(model, hcg): def fused_allreduce_gradients(parameter_list, hcg): - data_parallel_group = None if hcg is None else hcg.get_data_parallel_group() - logger.debug("dp start fuse allreduce gradients") - with framework.no_grad(): - _apply_collective_grads(parameter_list, data_parallel_group) + if _in_legacy_dygraph(): + data_parallel_group = None if hcg is None else hcg.get_data_parallel_group( + ) + logger.debug("dp start fuse allreduce gradients") + with framework.no_grad(): + _apply_collective_grads(parameter_list, data_parallel_group) + elif in_dygraph_mode(): + assert hcg is None, "It's not support to use hcg in EagerDygraph now." 
+ data_parallel_group = paddle.distributed.collective._get_default_group() + with framework.no_grad(): + _apply_collective_grads_eager(parameter_list, data_parallel_group) def sharding_reduce_gradients(parameter_list, hcg): diff --git a/python/paddle/distributed/fleet/utils/recompute.py b/python/paddle/distributed/fleet/utils/recompute.py index 4ccb48ef72e71..c767be77d8384 100755 --- a/python/paddle/distributed/fleet/utils/recompute.py +++ b/python/paddle/distributed/fleet/utils/recompute.py @@ -14,9 +14,11 @@ import paddle from paddle.fluid import core -from paddle.autograd import PyLayer +from paddle.autograd import PyLayer, EagerPyLayer + from paddle.fluid import framework import contextlib +from paddle.fluid.framework import in_dygraph_mode import logging logger = logging.getLogger(__name__) @@ -32,7 +34,7 @@ def detach_variable(inputs): out = [] for inp in inputs: - if not isinstance(inp, core.VarBase): + if not isinstance(inp, (core.eager.Tensor, core.VarBase)): out.append(inp) continue @@ -44,7 +46,7 @@ def detach_variable(inputs): def check_recompute_necessary(inputs): if not any(input_.stop_gradient == False for input_ in inputs - if isinstance(input_, paddle.Tensor)): + if isinstance(input_, (core.eager.Tensor, paddle.Tensor))): logger.warn( "[Recompute]: None of the inputs to current recompute block need grad, " "therefore there is NO need to recompute this block in backward !") @@ -60,6 +62,140 @@ def swith_rng_state(rng_state): paddle.set_cuda_rng_state(orig_cuda_rng_state) +class EagerRecomputeFunction(EagerPyLayer): + @staticmethod + def forward(ctx, run_function, preserve_rng_state, *args): + if framework._dygraph_tracer()._has_grad: + check_recompute_necessary(args) + + # store for recomputing + ctx.run_function = run_function + ctx.preserve_rng_state = preserve_rng_state + + # NOTE the number of outputs of backward() should be equal to the number of tensors in forward()'s input + # the order of tensors in backward()'s output should be the same as tensors in forward()'s input + # None tensor inputs will be filtered in backward inputs. + + # save input for backward + ctx.inputs = [] + ctx.tensor_indices = [] + tensor_inputs = [] + for i, arg in enumerate(args): + if paddle.is_tensor(arg): + tensor_inputs.append(arg) + ctx.tensor_indices.append(i) + ctx.inputs.append(None) + else: + ctx.inputs.append(arg) + ctx.save_for_backward(*tensor_inputs) + + # NOTE recompute with restore RNG only support one senario where one process for one cuda gpu. + # one process with multiple gpu and mix-gpu-cpu senarios are not support + if ctx.preserve_rng_state: + cur_device = paddle.get_device() + if 'gpu:' not in cur_device: + raise RuntimeError( + "Recompute with RNG perserve is not support current device: {}.". 
+ format(cur_device)) + ctx.fw_cuda_rng_state = paddle.get_cuda_rng_state() + + # TODO support AMP + tracer = framework._dygraph_tracer() + ctx.is_fw_autocast = False if tracer._amp_level == core.AmpLevel.O0 else True + if tracer._amp_level == core.AmpLevel.O2: + ctx.amp_level = 'O2' + elif tracer._amp_level in (core.AmpLevel.O1, core.AmpLevel.O0): + ctx.amp_level = 'O1' + else: + raise ValueError("unsupported amp level: {}".format( + tracer._amp_level)) + + if tracer._amp_dtype == 'float16': + ctx.amp_dtype = 'float16' + elif tracer._amp_dtype in ('bfloat16', 'float32'): + ctx.amp_dtype = 'bfloat16' + else: + raise ValueError("unsupported amp dtype: {}".format( + tracer._amp_dtype)) + + ctx.amp_white_list, ctx.amp_black_list = tracer._get_amp_op_list() + + with paddle.no_grad(): + outputs = run_function(*args) + return outputs + + @staticmethod + def backward(ctx, *args): + with paddle.fluid.dygraph.guard(): + # TODO need to check the recompute calling is vaild or not + + # Restore inputs + inputs = list(ctx.inputs) + tensor_indices = ctx.tensor_indices + tensors = ctx.saved_tensor() + for i, idx in enumerate(tensor_indices): + inputs[idx] = tensors[i] + + # paddle.enable_grad() + tracer = framework._dygraph_tracer() + tracer._has_grad = True + + # NOTE support AMP + # need restore auto_cast state as well as w/b list + if ctx.preserve_rng_state: + with swith_rng_state(ctx.fw_cuda_rng_state): + with paddle.amp.auto_cast( + enable=ctx.is_fw_autocast, + custom_white_list=ctx.amp_white_list, + custom_black_list=ctx.amp_black_list, + level=ctx.amp_level, + dtype=ctx.amp_dtype): + detached_inputs = detach_variable(tuple(inputs)) + outputs = ctx.run_function(*detached_inputs) + else: + with paddle.amp.auto_cast( + enable=ctx.is_fw_autocast, + custom_white_list=ctx.amp_white_list, + custom_black_list=ctx.amp_black_list, + level=ctx.amp_level, + dtype=ctx.amp_dtype): + detached_inputs = detach_variable(tuple(inputs)) + outputs = ctx.run_function(*detached_inputs) + + if isinstance(outputs, core.eager.Tensor): + outputs = (outputs, ) + assert len(outputs) == len(args) + + # run backward() with only tensor that requires grad + forward_outputs_with_grad = [] + # NOTE In Transformer-like network, if user put the attention mask into the recompute segment output, + # pylayer will force the stop_gradient of attention mask to be False, which will make the number of + # tensor that need grad does not match. + # the following backward_inputs_with_grad is used to avoid this case. 
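# --- Illustration only, not part of the patch ---
# Usage sketch of the public recompute API that drives this PyLayer; layer sizes and
# input shapes are arbitrary, and a GPU is assumed (pass preserve_rng_state=False on CPU).
import paddle
from paddle.distributed.fleet.utils import recompute

block = paddle.nn.Sequential(paddle.nn.Linear(64, 64), paddle.nn.ReLU())
x = paddle.randn([8, 64])
x.stop_gradient = False
y = recompute(block, x)   # activations inside `block` are recomputed during backward
y.sum().backward()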
+ backward_inputs_with_grad = [] + for i in range(len(outputs)): + if isinstance( + outputs[i], + core.eager.Tensor) and not outputs[i].stop_gradient: + forward_outputs_with_grad.append(outputs[i]) + backward_inputs_with_grad.append(args[i]) + + if len(forward_outputs_with_grad) == 0: + raise RuntimeError( + "none of output has requires_grad=True, this recompute() is not necessary" + ) + + # actually backward + with paddle.amp.auto_cast(enable=False): + paddle.autograd.backward(forward_outputs_with_grad, + backward_inputs_with_grad) + + grads = tuple( + inp.grad for inp in detached_inputs + if isinstance(inp, core.eager.Tensor)) + return grads + + class RecomputeFunction(PyLayer): @staticmethod def forward(ctx, run_function, preserve_rng_state, *args): @@ -315,4 +451,7 @@ def run_model(cuda_state, recompute_block=[], recompute_kwargs={}): raise ValueError("Unexpected keyword arguments: " + ",".join( arg for arg in kwargs)) - return RecomputeFunction.apply(function, preserve, *args) + if in_dygraph_mode(): + return EagerRecomputeFunction.apply(function, preserve, *args) + else: + return RecomputeFunction.apply(function, preserve, *args) diff --git a/python/paddle/distributed/launch/context/device.py b/python/paddle/distributed/launch/context/device.py index c2f6896ab6c04..30b8cc1538590 100644 --- a/python/paddle/distributed/launch/context/device.py +++ b/python/paddle/distributed/launch/context/device.py @@ -142,4 +142,4 @@ def detect_device(self): if __name__ == '__main__': d = Device.parse_device() - print(d.get_selected_flag()) + print(d.get_selected_devices()) diff --git a/python/paddle/distributed/launch/context/node.py b/python/paddle/distributed/launch/context/node.py index 2fa8b892275a0..8082541ffe06c 100644 --- a/python/paddle/distributed/launch/context/node.py +++ b/python/paddle/distributed/launch/context/node.py @@ -44,6 +44,7 @@ def get_ports_occupied(self): return self.free_ports def get_free_port(self): + # for loop to avoid port conflict for _ in range(100): with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s: diff --git a/python/paddle/distributed/launch/controllers/collective.py b/python/paddle/distributed/launch/controllers/collective.py index bbcb7c81d6e65..3763bac041451 100644 --- a/python/paddle/distributed/launch/controllers/collective.py +++ b/python/paddle/distributed/launch/controllers/collective.py @@ -93,7 +93,7 @@ def build_pod(self): "PADDLE_RANK_IN_NODE": str(i), } if self.pod.replicas == 1: - e.update({selected_dev_key: selected_dev_list}) + e.update({selected_dev_key: ",".join(selected_dev_list)}) else: e.update({selected_dev_key: selected_dev_list[i]}) self.add_container(envs=e, log_tag=i) @@ -134,7 +134,7 @@ def run(self): if ok: self.job.replicas = replicas else: - self.ctx.logger.warnning("peer not ready {}".format(self.job)) + self.ctx.logger.warning("peer not ready {}".format(self.job)) break self.ctx.logger.debug("Run {}".format(self.job)) diff --git a/python/paddle/distributed/launch/controllers/controller.py b/python/paddle/distributed/launch/controllers/controller.py index fbe9df4c9a223..9527ae35c4b6b 100644 --- a/python/paddle/distributed/launch/controllers/controller.py +++ b/python/paddle/distributed/launch/controllers/controller.py @@ -75,8 +75,9 @@ def watch(self) -> bool: while not self.ctx.status.is_done(): status = self.pod.watch(timeout=2) - if self.ctx.continous_log(): - self.pod.logs() + #if self.ctx.continous_log(): + # default to print log + self.pod.logs() # completed if status == self.ctx.status.COMPLETED: diff --git 
a/python/paddle/distributed/launch/job/container.py b/python/paddle/distributed/launch/job/container.py index 7105cae9024f2..a1ad6dbe24e8e 100644 --- a/python/paddle/distributed/launch/job/container.py +++ b/python/paddle/distributed/launch/job/container.py @@ -145,31 +145,34 @@ def __str__(self): self.errfile, self._env, ) - def logs(self, fn=None, offset=0, whence=1, lines=1000): + def logs(self, fn=None, offset=0, whence=1, limit=1000): if not self._log_handler: self._log_handler = open(self._out) if fn is None: fn = sys.stdout - self._log_handler.seek(offset, whence) - try: - idx = 0 - for line in self._log_handler: - fn.write(line) - idx += 1 - if idx > lines: + if offset != 0 or whence != 1: + self._log_handler.seek(offset, whence) + + for _ in range(limit): + line = self._log_handler.readline() + if not line: break - finally: - return self._log_handler.tell() + fn.write(line) + except: + return def tail(self, length=3000): if not self._log_handler: self._log_handler = open(self._out) - self._log_handler.seek(0, 2) - ed = self._log_handler.tell() + try: + self._log_handler.seek(0, 2) + ed = self._log_handler.tell() + except: + pass if ed > length: self.logs(offset=ed - length, whence=0) diff --git a/python/paddle/distributed/launch/main.py b/python/paddle/distributed/launch/main.py index dd7edba35a474..400a447260252 100644 --- a/python/paddle/distributed/launch/main.py +++ b/python/paddle/distributed/launch/main.py @@ -40,7 +40,7 @@ def launch(): - ``--rank``: The rank of the node, can be auto assigned by master. Default ``--rank=-1``. - - ``--log_level``: The log level to set for logging.setLevel which can be CRITICAL/ERROR/WARNING/INFO/DEBUG/NOTSET, case insensitive. The rank 0 log will not print in the terminal by default, while you can enable it by adding --log_level=debug. Default ``--log_level=INFO``. + - ``--log_level``: The log level to set for logging.setLevel which can be CRITICAL/ERROR/WARNING/INFO/DEBUG/NOTSET, case insensitive. Default ``--log_level=INFO``. - ``--nnodes``: The number of nodes for a distributed job, it can be a range in elastic mode, e.g., ``--nnodes=2:3``. Default ``--nnodes=1``. 
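# --- Illustration only, not part of the patch ---
# Minimal sketch of the log-tailing behaviour the container change above implements:
# seek back at most `length` bytes from the end and emit at most `limit` lines,
# ignoring seek errors on logs that are still being written. The path is hypothetical.
def tail_log(path, length=3000, limit=1000):
    with open(path, "rb") as f:            # binary mode so byte offsets are safe
        try:
            f.seek(0, 2)                   # jump to the end of the file
            end = f.tell()
            if end > length:
                f.seek(end - length, 0)    # back up at most `length` bytes
        except OSError:
            pass
        for _ in range(limit):
            line = f.readline()
            if not line:
                break
            print(line.decode("utf-8", errors="replace"), end="")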
diff --git a/python/paddle/distributed/parallel.py b/python/paddle/distributed/parallel.py index 16ed528b64f0c..f0365cab8c896 100644 --- a/python/paddle/distributed/parallel.py +++ b/python/paddle/distributed/parallel.py @@ -24,11 +24,23 @@ # deprecated module import from paddle.fluid import core +from paddle.fluid.framework import in_dygraph_mode from paddle.fluid.framework import _set_expected_place from paddle.fluid.dygraph import parallel_helper from paddle.distributed.fleet.launch_utils import check_backend from paddle.fluid.dygraph.parallel import ParallelEnv from paddle.distributed.fleet.base.private_helper_function import wait_server_ready # noqa: F401 +from paddle.distributed import collective +from paddle.distributed.collective import _set_group_map +from paddle.distributed.collective import _set_group_map_by_name +from paddle.distributed.collective import _get_group_map_by_name +from paddle.distributed.collective import _group_map_by_name +from paddle.distributed.collective import _default_group_name +from paddle.distributed.collective import _valid_backend_list +from paddle.distributed.collective import _set_default_backend +from paddle.distributed.collective import _set_default_store +from paddle.distributed.collective import _new_process_group_impl +from paddle.distributed.collective import Group __all__ = [] @@ -159,18 +171,90 @@ def train(): if not is_cpu_only and core.is_compiled_with_cuda(): _check_var_exists("FLAGS_selected_gpus") + backend = "nccl" if backend == "auto" else backend elif not is_cpu_only and core.is_compiled_with_xpu(): _check_var_exists('FLAGS_selected_xpus') + backend = "bkcl" if backend == "auto" else backend elif not is_cpu_only and core.is_compiled_with_npu(): _check_var_exists('FLAGS_selected_npus') + backend = "hccl" if backend == "auto" else backend elif not is_cpu_only and core.is_compiled_with_mlu(): _check_var_exists('FLAGS_selected_mlus') + backend = "cncl" if backend == "auto" else backend _check_var_exists("PADDLE_TRAINER_ID") _check_var_exists("PADDLE_CURRENT_ENDPOINT") _check_var_exists("PADDLE_TRAINERS_NUM") _check_var_exists("PADDLE_TRAINER_ENDPOINTS") + # NOTE(chenweihang): [ why config global place here? ] + # the dygraph mode will be set to default mode, + # users will not call `dygraph.guard` or `enable_dygraph` + # directly, if they want to switch default place, + # they need to call a function to change default place, + # here just set correctly place to users + if is_cpu_only: + place = core.CPUPlace() + elif core.is_compiled_with_cuda(): + place = core.CUDAPlace(parallel_env.device_id) + elif core.is_compiled_with_xpu(): + place = core.XPUPlace(parallel_env.device_id) + elif core.is_compiled_with_npu(): + place = core.NPUPlace(parallel_env.device_id) + elif core.is_compiled_with_mlu(): + place = core.MLUPlace(parallel_env.device_id) + + _set_expected_place(place) + + group = None + if backend in _valid_backend_list and in_dygraph_mode(): + if _default_group_name in _get_group_map_by_name(): + return _get_group_map_by_name()[_default_group_name] + _set_default_backend(backend) + rank = int(os.getenv("PADDLE_TRAINER_ID")) + world_size = int(os.getenv("PADDLE_TRAINERS_NUM")) + assert rank >= 0 and world_size > rank and world_size > 1, ( + "rank must be non-negative and world_size must be the " + "maximum rank plus one. 
Moreover, at least two processes are " + "required to create a process group.") + master_addr = os.getenv("MASTER_ADDR", None) + master_port = os.getenv("MASTER_PORT", None) + endpoints = None + if not master_addr or not master_port: + endpoints = os.getenv("PADDLE_MASTER", None) + if endpoints is None: + endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS").split(',')[0] + assert endpoints, ( + "The environment variable 'MASTER_ADDR' and 'MASTER_PORT' " + "must be specified, for example 'export MASTER_ADDR=127.0.0.1' " + "and 'export MASTER_ADDR=54612'. Or you can start your training" + "with paddle.distributed.run module.") + master_addr, master_port = endpoints.split(":") + master_port = int(master_port) + is_master = rank == 0 + default_store = core.TCPStore(master_addr, master_port, is_master, + world_size) + _set_default_store(default_store) + pg = _new_process_group_impl( + backend, + default_store, + rank, + world_size, + _default_group_name, + pg_options=None) + ranks = list(range(world_size)) + group = Group( + rank, + world_size, + id=0, + ranks=ranks, + pg=pg, + name=_default_group_name) + _set_group_map_by_name(_default_group_name, group) + _set_group_map(0, group) + parallel_helper._set_parallel_ctx(True) + return group + node_num = set([i.split(":")[0] for i in parallel_env.trainer_endpoints]) # 3: init gloo context (step 1: httpsever start) init_gloo = int(os.getenv("PADDLE_WITH_GLOO", "0")) @@ -202,24 +286,6 @@ def train(): strategy.current_endpoint = parallel_env.current_endpoint strategy.nrings = parallel_env.nrings - # NOTE(chenweihang): [ why config global place here? ] - # the dygraph mode will be set to default mode, - # users will not call `dygraph.guard` or `enable_dygraph` - # directly, if they want to switch default place, - # they need to call a function to change default place, - # here just set correctly place to users - if is_cpu_only: - place = core.CPUPlace() - elif core.is_compiled_with_cuda(): - place = core.CUDAPlace(parallel_env.device_id) - elif core.is_compiled_with_xpu(): - place = core.XPUPlace(parallel_env.device_id) - elif core.is_compiled_with_npu(): - place = core.NPUPlace(parallel_env.device_id) - elif core.is_compiled_with_mlu(): - place = core.MLUPlace(parallel_env.device_id) - - _set_expected_place(place) # init nccl or hccl or bkcl or heter context if is_cpu_only: parallel_helper._set_parallel_ctx( @@ -274,6 +340,7 @@ def train(): if parallel_env.rank == 0: http_server_d["running"] = False http_server.join() + return group def get_rank(): diff --git a/python/paddle/distributed/passes/auto_parallel_gradient_merge.py b/python/paddle/distributed/passes/auto_parallel_gradient_merge.py index 7668dff36207e..accac81133825 100644 --- a/python/paddle/distributed/passes/auto_parallel_gradient_merge.py +++ b/python/paddle/distributed/passes/auto_parallel_gradient_merge.py @@ -107,13 +107,8 @@ def _get_gm_cond_var(main_program, k_steps, dist_context): force_cpu=True) set_var_dist_attr(dist_context, step_var, [-1], world_process_group.ranks) - cond_var = layers.create_global_var( - name="gradient_merge_cond", - shape=[1], - value=bool(0), - dtype='bool', - persistable=False, - force_cpu=True) + cond_var = main_block.create_var( + name="gradient_merge_cond", shape=[1], dtype='bool') set_var_dist_attr(dist_context, cond_var, [-1], world_process_group.ranks) with device_guard("cpu"): diff --git a/python/paddle/distributed/ps/the_one_ps.py b/python/paddle/distributed/ps/the_one_ps.py index 00daaf986bfa0..1d23567b72abe 100755 --- 
a/python/paddle/distributed/ps/the_one_ps.py +++ b/python/paddle/distributed/ps/the_one_ps.py @@ -16,7 +16,7 @@ import os import paddle.fluid as fluid -import paddle.distributed.fleet as fleet +from paddle.distributed import fleet from paddle.fluid import core from paddle.distributed.ps.utils.public import * from paddle.fluid.framework import Program @@ -26,7 +26,7 @@ from paddle.fluid.framework import Variable, Parameter from paddle.distributed.fleet.runtime.runtime_base import RuntimeBase from paddle.distributed.fleet.base.private_helper_function import wait_server_ready -import paddle.distributed.fleet.proto.the_one_ps_pb2 as ps_pb2 +from paddle.distributed.fleet.proto import the_one_ps_pb2 from paddle.fluid.communicator import Communicator, HeterClient from google.protobuf import text_format @@ -518,7 +518,7 @@ def _set(self, table_proto): table_proto.table_id = self.idx table_proto.table_class = 'BarrierTable' table_proto.shard_num = 256 - table_proto.type = ps_pb2.PS_OTHER_TABLE + table_proto.type = the_one_ps_pb2.PS_OTHER_TABLE table_proto.accessor.accessor_class = "CommMergeAccessor" table_proto.accessor.fea_dim = 0 @@ -544,7 +544,7 @@ def __init__(self, idx, tensor_dict, role_maker): def _set(self, table_proto): table_proto.table_id = self.idx - table_proto.type = ps_pb2.PS_OTHER_TABLE + table_proto.type = the_one_ps_pb2.PS_OTHER_TABLE table_proto.table_class = self.tensor_dict.get("tensor_table_class", '') table_proto.accessor.accessor_class = "CommMergeAccessor" @@ -573,7 +573,7 @@ def _set(self, table_proto): return table_proto.table_id = ctx.table_id() table_proto.table_class = self.table_class - table_proto.type = ps_pb2.PS_SPARSE_TABLE + table_proto.type = the_one_ps_pb2.PS_SPARSE_TABLE table_proto.shard_num = self.shard_num self.common.table_name = self.context['grad_name_to_param_name'][ @@ -621,7 +621,7 @@ def _set(self, table_proto): class GeoSparseTable(SparseTable): def __init__(self, context, send_ctx): super(GeoSparseTable, self).__init__(context, send_ctx) - self.table_class = "SparseGeoTable" + self.table_class = "MemorySparseGeoTable" if self.context['ps_mode'] != DistributedMode.GEO: raise ValueError("not geo sparse table!") @@ -632,7 +632,7 @@ def _set(self, table_proto): return table_proto.table_id = ctx.table_id() table_proto.table_class = self.table_class - table_proto.type = ps_pb2.PS_SPARSE_TABLE + table_proto.type = the_one_ps_pb2.PS_SPARSE_TABLE table_proto.shard_num = self.shard_num table_proto.accessor.accessor_class = 'CommMergeAccessor' @@ -664,8 +664,8 @@ def _set(self, table_proto): table_proto.table_id = ctx.table_id() - table_proto.type = ps_pb2.PS_DENSE_TABLE - table_proto.table_class = "CommonDenseTable" + table_proto.type = the_one_ps_pb2.PS_DENSE_TABLE + table_proto.table_class = "MemoryDenseTable" table_proto.shard_num = 256 table_proto.accessor.accessor_class = 'CommMergeAccessor' @@ -748,7 +748,7 @@ def __init__(self, context): self.service = self._get_service() self.fs_client = self._get_fs_client() - self.ps_desc = ps_pb2.PSParameter() + self.ps_desc = the_one_ps_pb2.PSParameter() def _get_tensor_tables(self): program_idx = 0 @@ -806,7 +806,7 @@ def build_server_desc(self): table_proto = self.ps_desc.server_param.downpour_server_param.downpour_table_param.add( ) table._set(table_proto) - if table_proto.type == ps_pb2.PS_SPARSE_TABLE and table_proto.common is not None: + if table_proto.type == the_one_ps_pb2.PS_SPARSE_TABLE and table_proto.common is not None: self.sparse_table_maps[ table_proto.common.table_name] = table_proto.table_id 
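# --- Illustration only, not part of the patch ---
# Usage sketch of the eager-mode init_parallel_env change earlier in this patch:
# it now returns the default Group, whose process_group can be handed to
# paddle.DataParallel (the spawn.py example below is updated the same way).
import paddle
import paddle.distributed as dist

def train():
    group = dist.init_parallel_env()
    process_group = group.process_group if group else None

    layer = paddle.nn.Linear(10, 10)
    dp_layer = paddle.DataParallel(layer, process_group=process_group)

    loss = dp_layer(paddle.randn([4, 10])).mean()
    loss.backward()

# typically launched with: dist.spawn(train, nprocs=2)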
diff --git a/python/paddle/distributed/spawn.py b/python/paddle/distributed/spawn.py index cea831d9d90b5..4adb19830522b 100644 --- a/python/paddle/distributed/spawn.py +++ b/python/paddle/distributed/spawn.py @@ -452,11 +452,12 @@ def forward(self, x): def train(print_result=False): # 1. initialize parallel environment - dist.init_parallel_env() + group = dist.init_parallel_env() + process_group = group.process_group if group else None # 2. create data parallel layer & optimizer layer = LinearNet() - dp_layer = paddle.DataParallel(layer) + dp_layer = paddle.DataParallel(layer, process_group=process_group) loss_fn = nn.MSELoss() adam = opt.Adam( diff --git a/python/paddle/fluid/backward.py b/python/paddle/fluid/backward.py index 0988f6709552b..ba7692b442f82 100755 --- a/python/paddle/fluid/backward.py +++ b/python/paddle/fluid/backward.py @@ -1337,6 +1337,7 @@ def _append_backward_vars_(block, start_op_idx, grad_to_var, grad_info_map): continue grad_info_map[grad_to_var[grad_var_name]] = (grad_var_name, block) # infer_shape and infer_type + op_desc.check_attrs() op_desc.infer_var_type(block.desc) op_desc.infer_shape(block.desc) diff --git a/python/paddle/fluid/contrib/layers/nn.py b/python/paddle/fluid/contrib/layers/nn.py index fde42e35e0739..c73ea8b5b0e1a 100644 --- a/python/paddle/fluid/contrib/layers/nn.py +++ b/python/paddle/fluid/contrib/layers/nn.py @@ -530,21 +530,44 @@ def fused_seqpool_cvm(input, use_cvm=True, cvm_offset=2): """ - **Embedding Sequence pool** + :api_attr: Static Graph - This layer is the fusion of sequence_pool and continuous_value_model. + This OP is the fusion of sequence_pool and continuous_value_model op. - **Notes: The Op only receives List of LoDTensor as input, only support SUM pooling now. + **Note:** The Op only receives List of LoDTensor as input, only support SUM pooling now. Args: input(Variable|list of Variable): Input is List of LoDTensor. pool_type(str): pooling type, only support SUM pooling now. cvm(Variable): cvm Variable. - pad_value(float): padding value of sequence pool. - use_cvm(bool): use cvm or not. + pad_value(float, optional): padding value of sequence pool. Default: 0.0. + use_cvm(bool, optional): use cvm or not. Default: True. + cvm_offset(int, optional): cvm offset. Default: 2, which means cvm contains show, click. + Returns: Variable|list of Variable: The tensor variable storing sequence pool and cvm of input. + + Examples: + .. 
code-block:: python + + import paddle + import paddle.fluid as fluid + paddle.enable_static() + + data = paddle.static.data(name='x', shape=[-1, 1], dtype='int64', lod_level=1) + data2 = paddle.static.data(name='y', shape=[-1, 1], dtype='int64', lod_level=1) + inputs = [data, data2] + embs = fluid.layers.nn._pull_box_sparse(input=inputs, size=11, is_distributed=True, is_sparse=True) + + label = paddle.static.data(name="label", shape=[-1, 1], dtype="int64", lod_level=1) + ones = fluid.layers.fill_constant_batch_size_like(input=label, shape=[-1, 1], dtype="int64", value=1) + show_clk = paddle.cast(paddle.concat([ones, label], axis=1), dtype='float32') + show_clk.stop_gradient = True + + cvms = fluid.contrib.layers.fused_seqpool_cvm(embs, 'sum', show_clk) + + """ helper = LayerHelper('fused_seqpool_cvm', **locals()) diff --git a/python/paddle/fluid/contrib/slim/quantization/imperative/qat.py b/python/paddle/fluid/contrib/slim/quantization/imperative/qat.py index a3fdca5e40669..059cb7b0dd1bf 100644 --- a/python/paddle/fluid/contrib/slim/quantization/imperative/qat.py +++ b/python/paddle/fluid/contrib/slim/quantization/imperative/qat.py @@ -28,6 +28,7 @@ from paddle.fluid.initializer import Constant from paddle.fluid.dygraph.io import INFER_MODEL_SUFFIX, INFER_PARAMS_SUFFIX from paddle.fluid.io import load_inference_model, save_inference_model +from ..quantization_pass import ReplaceFakeQuantDequantPass, QuantWeightPass from paddle.fluid.log_helper import get_logger from .. import quantization_pass from . import utils @@ -431,7 +432,12 @@ def apply(self, model): setattr(parent_layer, sub_name, cur_quant_layer) - def save_quantized_model(self, model, path, input_spec=None, **config): + def save_quantized_model(self, + model, + path, + input_spec=None, + onnx_format=False, + **config): """ Save the quantized model for the inference. @@ -444,6 +450,8 @@ def save_quantized_model(self, model, path, input_spec=None, **config): InputSpec or example Tensor. If None, all input variables of the original Layer's forward method would be the inputs of the saved model. Default None. + onnx_format (bool, optional): Whether to export the quantized model + with format of ONNX. Default is False. **configs (dict, optional): Other save configuration options for compatibility. We do not recommend using these configurations, they may be removed in the future. 
If not necessary, DO NOT use @@ -498,6 +506,18 @@ def save_quantized_model(self, model, path, input_spec=None, **config): self._set_skip_quant_attr(infer_program) + clip_extra = False + if onnx_format: + graph = IrGraph(core.Graph(infer_program.desc), for_test=False) + transform_pass = ReplaceFakeQuantDequantPass(scope, place) + transform_pass.apply(graph) + + quant_weight_pass = QuantWeightPass(scope, place) + quant_weight_pass.apply(graph) + infer_program = graph.to_program() + + clip_extra = True + save_inference_model( dirname=dirname, feeded_var_names=feed_target_names, @@ -506,7 +526,7 @@ def save_quantized_model(self, model, path, input_spec=None, **config): main_program=infer_program.clone(), model_filename=model_filename, params_filename=params_filename, - clip_extra=False) + clip_extra=clip_extra) if is_dynamic_mode: paddle.disable_static() diff --git a/python/paddle/fluid/contrib/slim/quantization/imperative/utils.py b/python/paddle/fluid/contrib/slim/quantization/imperative/utils.py index 009ce372b4f29..758928f8dafe8 100644 --- a/python/paddle/fluid/contrib/slim/quantization/imperative/utils.py +++ b/python/paddle/fluid/contrib/slim/quantization/imperative/utils.py @@ -18,10 +18,7 @@ import paddle import paddle.nn.quant.quant_layers as quant_layers -from ..quantization_pass import _get_op_input_var_names -from ..quantization_pass import _get_op_output_var_names -from ..quantization_pass import _get_output_name_index -from ..quantization_pass import _get_input_name_index +from ..utils import _get_op_input_var_names, _get_op_output_var_names, _get_output_name_index, _get_input_name_index layer_name_map = { 'Conv2DTranspose': paddle.nn.Conv2DTranspose, diff --git a/python/paddle/fluid/contrib/slim/quantization/post_training_quantization.py b/python/paddle/fluid/contrib/slim/quantization/post_training_quantization.py index b1b645e85e75d..a4c7a2a2bf8df 100644 --- a/python/paddle/fluid/contrib/slim/quantization/post_training_quantization.py +++ b/python/paddle/fluid/contrib/slim/quantization/post_training_quantization.py @@ -25,18 +25,10 @@ from ....executor import global_scope, Executor from ....framework import IrGraph from ....log_helper import get_logger -from .quantization_pass import QuantizationTransformPass -from .quantization_pass import QuantizationFreezePass -from .quantization_pass import AddQuantDequantPass -from .quantization_pass import _out_scale_op_list -from .quantization_pass import _get_op_input_var_names -from .quantization_pass import _get_op_output_var_names -from .quantization_pass import _get_output_name_index -from .quantization_pass import _get_input_name_index -from .quantization_pass import _channelwise_quant_axis1_ops +from .quantization_pass import QuantizationTransformPass, QuantizationTransformPassV2, QuantizationFreezePass, QuantWeightPass, AddQuantDequantPass, AddQuantDequantPassV2 from .cal_kl_threshold import cal_kl_threshold from .adaround import run_adaround -from .utils import load_variable_data, set_variable_data +from . import utils __all__ = ['PostTrainingQuantization', 'WeightQuantization'] @@ -131,6 +123,7 @@ def __init__(self, weight_bits=8, activation_quantize_type='range_abs_max', weight_quantize_type='channel_wise_abs_max', + onnx_format=False, optimize_model=False, is_use_cache_file=False, cache_dir=None): @@ -203,6 +196,8 @@ def __init__(self, the fake ops in saving quantized model, and we save the scale obtained by post training quantization in fake ops. 
Compared to 'abs_max', the model accuracy is usually higher when it is 'channel_wise_abs_max'. + onnx_format(bool): Whether to export the quantized model with format of ONNX. + Default is False. optimize_model(bool, optional): If set optimize_model as True, it applies some passes to the model before quantization, and it supports `conv2d/depthwise_conv2d + bn` pass so far. Some targets require the @@ -265,8 +260,8 @@ def __init__(self, self._learning_rate = learning_rate self._dynamic_quantize_op_type = ['lstm'] self._support_quantize_op_type = \ - list(set(QuantizationTransformPass._supported_quantizable_op_type + - AddQuantDequantPass._supported_quantizable_op_type + + list(set(utils._weight_supported_quantizable_op_type + + utils._act_supported_quantizable_op_type + self._dynamic_quantize_op_type)) # Check inputs @@ -305,6 +300,7 @@ def __init__(self, self._weight_bits = weight_bits self._activation_quantize_type = activation_quantize_type self._weight_quantize_type = weight_quantize_type + self._onnx_format = onnx_format self._is_full_quantize = is_full_quantize if is_full_quantize: self._quantizable_op_type = self._support_quantize_op_type @@ -322,7 +318,7 @@ def __init__(self, self._fetch_list = None self._data_loader = data_loader - self._out_scale_op_list = _out_scale_op_list + self._out_scale_op_list = utils._out_scale_op_list self._quantized_weight_var_name = set() self._quantized_act_var_name = set() self._weight_op_pairs = {} @@ -391,22 +387,27 @@ def quantize(self): break _logger.info("Finish sampling stage, all batch: " + str(batch_id)) - if self._round_type == 'adaround': - self._adaround_apply() - - self._reset_activation_persistable() if self._algo == 'avg': for var_name in self._quantized_act_var_name: self._quantized_threshold[var_name] = \ np.array(self._quantized_var_avg[var_name]).mean() if self._algo in ["KL", "hist"]: self._calculate_kl_hist_threshold() - if self._algo in ["KL", "abs_max", "hist", "avg", "mse", "emd"]: - self._update_program() - else: + + if self._round_type == 'adaround': + self._adaround_apply() + + self._reset_activation_persistable() + + if self._algo is 'min_max': self._save_input_threhold() + else: + self._update_program() + + # save out_threshold for quantized ops. + if not self._onnx_format: + self._save_output_threshold() - self._save_output_threshold() if any(op_type in self._quantizable_op_type for op_type in self._dynamic_quantize_op_type): self._collect_dynamic_quantize_op_threshold( @@ -431,6 +432,7 @@ def quantize(self): return self._program def _adaround_apply(self): + assert self._algo != "min_max", "The algo should not be min_max." 
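# --- Illustration only, not part of the patch ---
# Usage sketch of the new onnx_format switch on post-training quantization; the model
# directory and calibration data_loader below are hypothetical placeholders.
import paddle
from paddle.fluid.contrib.slim.quantization import PostTrainingQuantization

paddle.enable_static()
exe = paddle.static.Executor(paddle.CPUPlace())

ptq = PostTrainingQuantization(
    executor=exe,
    model_dir="./fp32_infer_model",     # hypothetical inference model directory
    data_loader=calib_loader,           # hypothetical calibration data loader
    algo="KL",
    quantizable_op_type=["conv2d", "depthwise_conv2d", "mul"],
    onnx_format=True)                   # use the V2 passes and clip_extra on save
ptq.quantize()
ptq.save_quantized_model("./quant_model")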
if self._algo in ["KL", "hist"]: scale_dict = self._quantized_var_threshold else: @@ -466,6 +468,7 @@ def save_quantized_model(self, Returns: None ''' + clip_extra = True if self._onnx_format else False io.save_inference_model( dirname=save_model_path, model_filename=model_filename, @@ -473,7 +476,8 @@ def save_quantized_model(self, feeded_var_names=self._feed_list, target_vars=self._fetch_list, executor=self._executor, - main_program=self._program) + main_program=self._program, + clip_extra=clip_extra) _logger.info("The quantized model is saved in " + save_model_path) def _load_model_data(self): @@ -551,22 +555,22 @@ def collect_var_name(var_name_list, persistable_var_names, op_type): # For quantized ops, sample inputs and outputs if op_type in self._quantizable_op_type: collect_var_name( - _get_op_input_var_names(op), persistable_var_names, - op_type) + utils._get_op_input_var_names(op), + persistable_var_names, op_type) collect_var_name( - _get_op_output_var_names(op), persistable_var_names, - op_type) + utils._get_op_output_var_names(op), + persistable_var_names, op_type) # collect quanted op output var name - for out_var_name in _get_op_output_var_names(op): - for in_var_name in _get_op_input_var_names(op): + for out_var_name in utils._get_op_output_var_names(op): + for in_var_name in utils._get_op_input_var_names(op): if in_var_name in persistable_var_names: self._quantized_op_pairs[ in_var_name] = out_var_name # For other op, only sample output scale elif op_type in self._out_scale_op_list: collect_var_name( - _get_op_output_var_names(op), persistable_var_names, - op_type) + utils._get_op_output_var_names(op), + persistable_var_names, op_type) def _set_activation_persistable(self): ''' @@ -608,13 +612,13 @@ def _sampling(self): def _sample_mse(self): if self._quantized_threshold == {}: for var_name in self._quantized_weight_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) if self._weight_quantize_type == "abs_max": abs_max_value = float(np.max(np.abs(var_tensor))) elif self._weight_quantize_type == "channel_wise_abs_max": abs_max_value = [] if self._weight_op_pairs[ - var_name] in _channelwise_quant_axis1_ops: + var_name] in utils._channelwise_quant_axis1_ops: for i in range(var_tensor.shape[1]): abs_max_value.append( float(np.max(np.abs(var_tensor[:, i])))) @@ -625,7 +629,7 @@ def _sample_mse(self): self._quantized_threshold[var_name] = abs_max_value _logger.info("MSE searching stage ...") for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) var_tensor = var_tensor.flatten() abs_max_value = float(np.max(np.abs(var_tensor))) abs_max_value = 1e-8 if abs_max_value == 0.0 else abs_max_value @@ -647,13 +651,13 @@ def _sample_mse(self): def _sample_emd(self): if self._quantized_threshold == {}: for var_name in self._quantized_weight_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) if self._weight_quantize_type == "abs_max": abs_max_value = float(np.max(np.abs(var_tensor))) elif self._weight_quantize_type == "channel_wise_abs_max": abs_max_value = [] if self._weight_op_pairs[ - var_name] in _channelwise_quant_axis1_ops: + var_name] in utils._channelwise_quant_axis1_ops: for i in range(var_tensor.shape[1]): abs_max_value.append( float(np.max(np.abs(var_tensor[:, i])))) @@ -664,7 +668,7 @@ def _sample_emd(self): 
self._quantized_threshold[var_name] = abs_max_value _logger.info("EMD searching stage ...") for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) var_tensor = var_tensor.flatten() abs_max_value = float(np.max(np.abs(var_tensor))) abs_max_value = 1e-8 if abs_max_value == 0.0 else abs_max_value @@ -688,13 +692,13 @@ def _sample_emd(self): def _sample_avg(self): if self._quantized_threshold == {}: for var_name in self._quantized_weight_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) if self._weight_quantize_type == "abs_max": abs_max_value = float(np.max(np.abs(var_tensor))) elif self._weight_quantize_type == "channel_wise_abs_max": abs_max_value = [] if self._weight_op_pairs[ - var_name] in _channelwise_quant_axis1_ops: + var_name] in utils._channelwise_quant_axis1_ops: for i in range(var_tensor.shape[1]): abs_max_value.append( float(np.max(np.abs(var_tensor[:, i])))) @@ -705,7 +709,7 @@ def _sample_avg(self): self._quantized_threshold[var_name] = abs_max_value for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) abs_max_value = float(np.max(np.abs(var_tensor))) if (var_name not in self._quantized_var_avg): self._quantized_var_avg[var_name] = [] @@ -717,13 +721,13 @@ def _sample_avg(self): def _sample_abs_max(self): if self._quantized_threshold == {}: for var_name in self._quantized_weight_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) if self._weight_quantize_type == "abs_max": abs_max_value = float(np.max(np.abs(var_tensor))) elif self._weight_quantize_type == "channel_wise_abs_max": abs_max_value = [] if self._weight_op_pairs[ - var_name] in _channelwise_quant_axis1_ops: + var_name] in utils._channelwise_quant_axis1_ops: for i in range(var_tensor.shape[1]): abs_max_value.append( float(np.max(np.abs(var_tensor[:, i])))) @@ -734,7 +738,7 @@ def _sample_abs_max(self): self._quantized_threshold[var_name] = abs_max_value for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) abs_max_value = float(np.max(np.abs(var_tensor))) if (var_name not in self._quantized_threshold) or \ (abs_max_value > self._quantized_threshold[var_name]): @@ -743,7 +747,7 @@ def _sample_abs_max(self): def _sample_min_max(self): if self._quantized_var_min == {} and self._quantized_var_max == {}: for var_name in self._quantized_weight_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) if self._weight_quantize_type == "abs_max": min_value = float(np.min(var_tensor)) max_value = float(np.max(var_tensor)) @@ -751,7 +755,7 @@ def _sample_min_max(self): min_value = [] max_value = [] if self._weight_op_pairs[ - var_name] in _channelwise_quant_axis1_ops: + var_name] in utils._channelwise_quant_axis1_ops: for i in range(var_tensor.shape[1]): min_value.append(float(np.min(var_tensor[:, i]))) max_value.append(float(np.max(var_tensor[:, i]))) @@ -763,7 +767,7 @@ def _sample_min_max(self): self._quantized_var_max[var_name] = max_value for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = 
utils.load_variable_data(self._scope, var_name) min_value = float(np.min(var_tensor)) max_value = float(np.max(var_tensor)) if (var_name not in self._quantized_var_min) or \ @@ -775,7 +779,7 @@ def _sample_min_max(self): def _sample_histogram(self): for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) var_tensor_abs = np.abs(var_tensor) bins = self._sampling_act_histogram[var_name][1] hist, _ = np.histogram(var_tensor_abs, bins=bins) @@ -790,7 +794,7 @@ def _save_input_threhold(self): for block_id in range(len(self._program.blocks)): for op in self._program.blocks[block_id].ops: if op.type in self._quantizable_op_type: - for var_name in _get_op_input_var_names(op): + for var_name in utils._get_op_input_var_names(op): assert var_name in self._quantized_var_min assert var_name in self._quantized_var_max op._set_attr(var_name + ".min", @@ -805,7 +809,7 @@ def _collect_activation_abs_min_max(self): get the min and max value, and then calculate the threshold. ''' for var_name in self._quantized_act_var_name: - var_tensor = load_variable_data(self._scope, var_name) + var_tensor = utils.load_variable_data(self._scope, var_name) var_tensor = np.abs(var_tensor) min_value = float(np.min(var_tensor)) max_value = float(np.max(var_tensor)) @@ -839,13 +843,13 @@ def _calculate_kl_hist_threshold(self): # Abs_max threshold for weights for var_name in self._quantized_weight_var_name: - weight_data = load_variable_data(self._scope, var_name) + weight_data = utils.load_variable_data(self._scope, var_name) if self._weight_quantize_type == "abs_max": weight_threshold = float(np.max(np.abs(weight_data))) elif self._weight_quantize_type == "channel_wise_abs_max": weight_threshold = [] if self._weight_op_pairs[ - var_name] in _channelwise_quant_axis1_ops: + var_name] in utils._channelwise_quant_axis1_ops: for i in range(weight_data.shape[1]): weight_threshold.append( float(np.max(np.abs(weight_data[:, i])))) @@ -876,17 +880,27 @@ def _update_program(self): # use QuantizationTransformPass to insert fake_quant/fake_dequantize op major_quantizable_op_types = [] - for op_type in QuantizationTransformPass._supported_quantizable_op_type: + for op_type in utils._weight_supported_quantizable_op_type: if op_type in self._quantizable_op_type: major_quantizable_op_types.append(op_type) - transform_pass = QuantizationTransformPass( - scope=self._scope, - place=self._place, - weight_bits=self._weight_bits, - activation_bits=self._activation_bits, - activation_quantize_type=self._activation_quantize_type, - weight_quantize_type=self._weight_quantize_type, - quantizable_op_type=major_quantizable_op_types) + if not self._onnx_format: + transform_pass = QuantizationTransformPass( + scope=self._scope, + place=self._place, + weight_bits=self._weight_bits, + activation_bits=self._activation_bits, + activation_quantize_type=self._activation_quantize_type, + weight_quantize_type=self._weight_quantize_type, + quantizable_op_type=major_quantizable_op_types) + else: + transform_pass = QuantizationTransformPassV2( + scope=self._scope, + place=self._place, + weight_bits=self._weight_bits, + activation_bits=self._activation_bits, + activation_quantize_type=self._activation_quantize_type, + weight_quantize_type=self._weight_quantize_type, + quantizable_op_type=major_quantizable_op_types) for sub_graph in graph.all_sub_graphs(): # Insert fake_quant/fake_dequantize op must in test graph, so @@ -896,13 +910,20 @@ def 
_update_program(self): # use AddQuantDequantPass to insert fake_quant_dequant op minor_quantizable_op_types = [] - for op_type in AddQuantDequantPass._supported_quantizable_op_type: + for op_type in utils._act_supported_quantizable_op_type: if op_type in self._quantizable_op_type: minor_quantizable_op_types.append(op_type) - add_quant_dequant_pass = AddQuantDequantPass( - scope=self._scope, - place=self._place, - quantizable_op_type=minor_quantizable_op_types) + if not self._onnx_format: + add_quant_dequant_pass = AddQuantDequantPass( + scope=self._scope, + place=self._place, + quantizable_op_type=minor_quantizable_op_types) + else: + add_quant_dequant_pass = AddQuantDequantPassV2( + scope=self._scope, + place=self._place, + quantizable_op_type=minor_quantizable_op_types, + is_full_quantized=self._is_full_quantize) for sub_graph in graph.all_sub_graphs(): sub_graph._for_test = True @@ -914,33 +935,39 @@ def _update_program(self): else: scale_dict = self._quantized_threshold for key, val in scale_dict.items(): - set_variable_data( + utils.set_variable_data( self._scope, self._place, key + ".scale", np.array( [val], dtype=np.float32)) - set_variable_data( + utils.set_variable_data( self._scope, self._place, key + ".quant_dequant.scale", np.array( [val], dtype=np.float32)) - # apply QuantizationFreezePass, and obtain the final quant model - freeze_pass = QuantizationFreezePass( - scope=self._scope, - place=self._place, - bias_correction=self._bias_correction, - weight_bits=self._weight_bits, - round_type=self._round_type, - activation_bits=self._activation_bits, - weight_quantize_type=self._weight_quantize_type, - quantizable_op_type=major_quantizable_op_types) - - for sub_graph in graph.all_sub_graphs(): - sub_graph._for_test = True - freeze_pass.apply(sub_graph) + if not self._onnx_format: + # apply QuantizationFreezePass, and obtain the final quant model + freeze_pass = QuantizationFreezePass( + scope=self._scope, + place=self._place, + bias_correction=self._bias_correction, + weight_bits=self._weight_bits, + round_type=self._round_type, + activation_bits=self._activation_bits, + weight_quantize_type=self._weight_quantize_type, + quantizable_op_type=major_quantizable_op_types) + + for sub_graph in graph.all_sub_graphs(): + sub_graph._for_test = True + freeze_pass.apply(sub_graph) + else: + quant_weight_pass = QuantWeightPass(self._scope, self._place) + for sub_graph in graph.all_sub_graphs(): + sub_graph._for_test = True + quant_weight_pass.apply(sub_graph) self._program = graph.to_program() @@ -960,7 +987,7 @@ def save_info(op_node, out_var_name, threshold_map, out_info_name, op._set_attr("quantization_type", quantized_type) def analysis_and_save_info(op_node, out_var_name): - argname_index = _get_output_name_index(op_node, out_var_name) + argname_index = utils._get_output_name_index(op_node, out_var_name) assert argname_index is not None, \ out_var_name + " is not the output of the op" if self._algo == "KL": @@ -997,7 +1024,7 @@ def analysis_and_save_info(op_node, out_var_name): for op in self._program.blocks[block_id].ops: if op.type in ( self._quantizable_op_type + self._out_scale_op_list): - out_var_names = _get_op_output_var_names(op) + out_var_names = utils._get_op_output_var_names(op) for var_name in out_var_names: analysis_and_save_info(op, var_name) @@ -1020,11 +1047,11 @@ def _collect_dynamic_quantize_op_threshold(self, target_ops_type): quantization_type = str("post_" + self._algo).lower() persistable_var_names = _all_persistable_var_names(self._program) for op in target_ops: - 
for var_name in _get_op_input_var_names(op): + for var_name in utils._get_op_input_var_names(op): if var_name in persistable_var_names: - var_data = load_variable_data(self._scope, var_name) + var_data = utils.load_variable_data(self._scope, var_name) threshold = float(np.max(np.abs(var_data))) - argname, index = _get_input_name_index(op, var_name) + argname, index = utils._get_input_name_index(op, var_name) op._set_attr(argname + str(index) + "_threshold", threshold) op._set_attr("quantization_type", quantization_type) op._set_attr("bit_length", self._weight_bits) @@ -1268,7 +1295,7 @@ def _weight_abs_max_quantization(self, scope, place, weight_bits, save_weight_dtype = np.int8 if weight_bits == 8 else np.int16 # Get quantized scale and weight data - weight_data = load_variable_data(scope, var_name) + weight_data = utils.load_variable_data(scope, var_name) if abs(threshold_rate) < 1e-10: threshold_value = np.max(np.abs(weight_data)) else: @@ -1282,11 +1309,13 @@ def _weight_abs_max_quantization(self, scope, place, weight_bits, # Set weight data if not for_test: - set_variable_data(scope, place, var_name, quantized_weight_data) + utils.set_variable_data(scope, place, var_name, + quantized_weight_data) else: dequantized_weight_data = \ (quantized_weight_data * scale).astype(np.float32) - set_variable_data(scope, place, var_name, dequantized_weight_data) + utils.set_variable_data(scope, place, var_name, + dequantized_weight_data) # Save info op._set_attr('quantization_type', 'post_weight_abs_max') @@ -1303,7 +1332,7 @@ def _weight_channel_wise_abs_max_quantization( save_weight_dtype = np.int8 if weight_bits == 8 else np.int16 # Get quantized scale and weight data - weight_data = load_variable_data(scope, var_name) + weight_data = utils.load_variable_data(scope, var_name) if op.type == "mul": scales, quantized_weight_data = \ self._mul_channel_wise_quantization(weight_data, @@ -1317,7 +1346,8 @@ def _weight_channel_wise_abs_max_quantization( # Set weight data if not for_test: - set_variable_data(scope, place, var_name, quantized_weight_data) + utils.set_variable_data(scope, place, var_name, + quantized_weight_data) else: if op.type == "mul": dequantized_weight_data = \ @@ -1328,7 +1358,8 @@ def _weight_channel_wise_abs_max_quantization( else: _logger.error(op.type + " is not supported by weight quantization") - set_variable_data(scope, place, var_name, dequantized_weight_data) + utils.set_variable_data(scope, place, var_name, + dequantized_weight_data) # Save info op._set_attr('quantization_type', 'post_weight_channel_wise_abs_max') diff --git a/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py b/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py index 6d7c91fddeb77..17ddedd9d300a 100644 --- a/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py +++ b/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py @@ -26,12 +26,20 @@ from ....layers import mean from ....executor import scope_guard from ....framework import _get_paddle_place -from .utils import _channelwise_quant_axis1_ops, quant_tensor +from . 
import utils __all__ = [ - 'QuantizationTransformPass', 'QuantizationFreezePass', 'ConvertToInt8Pass', - 'TransformForMobilePass', 'OutScaleForTrainingPass', - 'OutScaleForInferencePass', 'AddQuantDequantPass' + 'QuantizationTransformPass', + 'QuantizationFreezePass', + 'ConvertToInt8Pass', + 'TransformForMobilePass', + 'OutScaleForTrainingPass', + 'OutScaleForInferencePass', + 'AddQuantDequantPass', + 'QuantizationTransformPassV2', + 'AddQuantDequantPassV2', + 'ReplaceFakeQuantDequantPass', + 'QuantWeightPass', ] _fake_quant_op_list = [ @@ -44,278 +52,13 @@ ] _fake_quant_dequant_op_list = [ - 'fake_quantize_dequantize_moving_average_abs_max' + 'fake_quantize_dequantize_moving_average_abs_max', + "fake_channel_wise_quantize_dequantize_abs_max", ] -_out_scale_op_list = [ - "conv2d", - "depthwise_conv2d", - "mul", - "matmul", - "matmul_v2", - "relu", - "leaky_relu", - "relu6", - "sigmoid", - "tanh", - "prelu", - "swish", - "dropout", - "softmax", - "batch_norm", - "layer_norm", - "elementwise_add", - "pool2d", - "reshape2", - "transpose2", - "concat", - "elementwise_mul", - "elementwise_pow", - "elementwise_sub", - "scale", - "slice", - "hard_swish", - "hard_sigmoid", - "conv2d_transpose", - "gru", - "bilinear_interp", - "nearest_interp", - "trilinear_interp", - "flatten", - "flatten2", - "transpose", - "pad2d", - "pad3d", - "reshape", - "split", - "flatten_contiguous_range", - "squeeze", - "squeeze2", - "nearest_interp_v2", - "fill_constant_batch_size_like", - "bilinear_interp", - "bilinear_interp_v2", - "arg_max", - "abs", - "assign", - "cast", - "clip", - "box_coder", - "crop", - "cumsum", - "equal", - "expand_v2", - "fill_any_like", - "fill_constant", - "gelu", - "instance_norm", - "lookup_table", - "lookup_table_v2", - "norm", - "p_norm", - "pow", - "reduce_mean", - "stack", - "top_k_v2", - "unsqueeze", - "unsqueeze2", - "logical_and", - "logical_not", - "meshgrid", - "roi_align", - "strided_slice", - "where", - "grid_sampler", - "tile", - "group_norm", - "reduce_sum", - "square", - "softplus", - "gather", - "shuffle_channel", -] - -# list op real input and output names, to avoid processing input such as AxisTensor. 
-_op_real_in_out_name = { - "conv2d": [["Input", "Filter"], ["Output"]], - "depthwise_conv2d": [["Input", "Filter"], ["Output"]], - "conv2d_transpose": [["Input", "Filter"], ["Output"]], - "mul": [["X", "Y"], ["Out"]], - "matmul": [["X", "Y"], ["Out"]], - "matmul_v2": [["X", "Y"], ["Out"]], - "pool2d": [["X"], ["Out"]], - "elementwise_add": [["X", "Y"], ["Out"]], - "concat": [["X"], ["Out"]], - "softmax": [["X"], ["Out"]], - "argmax": [["X"], ["Out"]], - "transpose": [["X"], ["Out"]], - "equal": [["X", "Y"], ["Out"]], - "gather": [["X"], ["Out"]], - "greater_equal": [["X", "Y"], ["Out"]], - "greater_than": [["X", "Y"], ["Out"]], - "less_equal": [["X", "Y"], ["Out"]], - "less_than": [["X", "Y"], ["Out"]], - "mean": [["X"], ["Out"]], - "not_equal": [["X", "Y"], ["Out"]], - "reshape": [["X"], ["Out"]], - "reshape2": [["X"], ["Out"]], - "transpose2": [["X"], ["Out"]], - "bilinear_interp": [["X"], ["Out"]], - "nearest_interp": [["X"], ["Out"]], - "trilinear_interp": [["X"], ["Out"]], - "slice": [["Input"], ["Out"]], - "squeeze": [["X"], ["Out"]], - "elementwise_sub": [["X", "Y"], ["Out"]], - "relu": [["X"], ["Out"]], - "relu6": [["X"], ["Out"]], - "leaky_relu": [["X"], ["Out"]], - "prelu": [["X", "Alpha"], ["Out"]], - "tanh": [["X"], ["Out"]], - "swish": [["X"], ["Out"]], - "dropout": [["X"], ["Out"]], - "batch_norm": [["X"], ["Y"]], - "layer_norm": [["X"], ["Y"]], - "sigmoid": [["X"], ["Out"]], - "elementwise_mul": [["X", "Y"], ["Out"]], - "elementwise_pow": [["X", "Y"], ["Out"]], - "scale": [["X"], ["Out"]], - "hard_swish": [["X"], ["Out"]], - "hard_sigmoid": [["X"], ["Out"]], - "gru": [["Input", "Weight"], ["Hidden"]], - "lstm": [["Input", "Weight"], ["Hidden"]], - "pad2d": [["X"], ["Out"]], - "pad3d": [["X"], ["Out"]], - "flatten": [["X"], ["Out"]], - "flatten2": [["X"], ["Out"]], - "unsqueeze2": [["X"], ["Out"]], - "unsqueeze2": [["X"], ["Out"]], - "flatten_contiguous_range": [["X"], ["Out"]], - "split": [["X"], ["Out"]], - "squeeze2": [["X"], ["Out"]], - "nearest_interp_v2": [["X"], ["Out"]], - "bilinear_interp": [["X"], ["Out"]], - "bilinear_interp_v2": [["X"], ["Out"]], - "fill_constant_batch_size_like": [["Input"], ["Out"]], - "arg_max": [["X"], ["Out"]], - "abs": [["X"], ["Out"]], - "assign": [["X"], ["Out"]], - "cast": [["X"], ["Out"]], - "clip": [["X"], ["Out"]], - "box_coder": [["PriorBox"], ["OutputBox"]], - "crop": [["X"], ["Out"]], - "cumsum": [["X"], ["Out"]], - "expand_v2": [["X"], ["Out"]], - "fill_any_like": [["X"], ["Out"]], - "fill_constant": [[], ["Out"]], - "gelu": [["X"], ["Out"]], - "instance_norm": [["X"], ["Out"]], - "lookup_table": [["W", "Ids"], ["Out"]], - "lookup_table_v2": [["W", "Ids"], ["Out"]], - "norm": [["X"], ["Norm"]], - "p_norm": [["X"], ["Out"]], - "pow": [["X"], ["Out"]], - "reduce_mean": [["X"], ["Out"]], - "stack": [["X"], ["Y"]], - "top_k_v2": [["X"], ["Out", "Indices"]], - "logical_and": [["X", "Y"], ["Out"]], - "logical_not": [["X"], ["Out"]], - "meshgrid": [["X"], ["Out"]], - "roi_align": [["X", "ROIs"], ["Out"]], - "strided_slice": [["Input"], ["Out"]], - "where": [["Condition", "X", "Y"], ["Out"]], - "grid_sampler": [["X", "Grid"], ["Output"]], - "tile": [["X"], ["Out"]], - "group_norm": [["X"], ["Y", "Mean", "Variance"]], - "reduce_sum": [["X"], ["Out"]], - "square": [["X"], ["Out"]], - "softplus": [["X"], ["Out"]], - "shuffle_channel": [["X"], ["Out"]], -} - _conv_ops = ['conv2d', 'depthwise_conv2d', 'conv2d_transpose'] - -def _get_op_input_var_names(op): - """ - Get the input var names of the op. - Args: - op(IrNode, Operator): the input op. 
- Returns: - input_var_names or None. - """ - assert isinstance(op, (IrNode, Operator)), \ - "The input op should be IrNode or Operator." - var_names = [] - op_name = op.name() if isinstance(op, IrNode) \ - else op.type - if op_name not in _op_real_in_out_name: - return [] - - name_list = _op_real_in_out_name[op_name][0] - for name in name_list: - var_name = op.input(name) - if isinstance(var_name, list): - var_names.extend(var_name) - else: - var_names.append(var_name) - return var_names - - -def _get_input_name_index(op, input_var_name): - """Get the input name and index of the var_name in the op""" - assert isinstance(op, (IrNode, Operator)), \ - "The input op should be IrNode or Operator." - op_name = op.name() if isinstance(op, IrNode) \ - else op.type - if op_name not in _op_real_in_out_name: - return None - - res = None - for argname in _op_real_in_out_name[op_name][0]: - var_names = op.input(argname) - for index, name in enumerate(var_names): - if name == input_var_name: - res = (argname, index) - return res - - -def _get_op_output_var_names(op): - """ """ - assert isinstance(op, (IrNode, Operator)), \ - "The input op should be IrNode or Operator." - var_names = [] - op_name = op.name() if isinstance(op, IrNode) \ - else op.type - if op_name not in _op_real_in_out_name: - return [] - - name_list = _op_real_in_out_name[op_name][1] - for name in name_list: - var_name = op.output(name) - if isinstance(var_name, list): - var_names.extend(var_name) - else: - var_names.append(var_name) - return var_names - - -def _get_output_name_index(op, output_var_name): - """Get the output name and index of the var_name in the op""" - assert isinstance(op, (IrNode, Operator)), \ - "The input op should be IrNode or Operator." - op_name = op.name() if isinstance(op, IrNode) \ - else op.type - if op_name not in _op_real_in_out_name: - return None - - name_list = _op_real_in_out_name[op_name][1] - res = None - for name in name_list: - var_name = op.output(name) - for index, val in enumerate(var_name): - if val == output_var_name: - res = (name, index) - return res +_SCALE_DEFAULT_VALUE = 0.001 def _init_var_node(var_node, value, scope, place): @@ -334,7 +77,7 @@ def _is_input_all_not_persistable(graph, op_node): Analyse the real inputs of the op node are all not persistable. ''' is_input_all_not_persistable = True - for var_name in _get_op_input_var_names(op_node): + for var_name in utils._get_op_input_var_names(op_node): in_node = graph._find_node_by_name(op_node.inputs, var_name) is_input_all_not_persistable = (is_input_all_not_persistable and \ (not in_node.persistable())) @@ -360,10 +103,6 @@ class QuantizationTransformPass(object): Quantize the ops that have weights. Add quant and dequant ops for the quantized ops's inputs. """ - _supported_quantizable_op_type = [ - 'conv2d', 'depthwise_conv2d', 'conv2d_transpose', 'mul', 'matmul', - 'matmul_v2' - ] def __init__(self, scope=None, @@ -493,7 +232,7 @@ def __init__(self, self._quantizable_ops = quantizable_op_type for op in self._quantizable_ops: - assert op in QuantizationTransformPass._supported_quantizable_op_type, \ + assert op in utils._weight_supported_quantizable_op_type, \ op + " is not supported for quantization." 
self._quantizable_grad_ops = [ '%s_grad' % (op) for op in self._quantizable_ops @@ -588,7 +327,7 @@ def _transform_forward(graph, op): else self._activation_quantize_type if quant_type == 'channel_wise_abs_max': # Weight quantization quant_axis = 1 if op.name() in \ - _channelwise_quant_axis1_ops else 0 + utils._channelwise_quant_axis1_ops else 0 quant_var_node, scale_var_node = self._insert_channel_quant_op( graph, var_node, name, quant_bits, quant_axis) dequant_var_node = self._insert_channel_dequant_op( @@ -753,7 +492,7 @@ def _insert_quant_range_abs_max_op(self, graph, var_node, name, quant_bits): _init_var_node( scale_in_node, np.array( - [0.001], dtype=data_type), + [_SCALE_DEFAULT_VALUE], dtype=data_type), self._scope, self._place) @@ -821,7 +560,7 @@ def _insert_quant_moving_average_abs_max_op(self, graph, var_node, name, _init_var_node( scale_in_node, np.array( - [0.001], dtype=data_type), + [_SCALE_DEFAULT_VALUE], dtype=data_type), self._scope, self._place) @@ -1289,17 +1028,21 @@ def apply(self, graph): if self._round_type == 'round': if any( _check_grandchild_op_node(op_node, op) - for op in _channelwise_quant_axis1_ops): + for op in utils._channelwise_quant_axis1_ops): quant_axis = 1 else: quant_axis = 0 - quantized_param_v = quant_tensor(param_v.copy(), - scale_v, quant_axis, - self._weight_bits) + quantized_param_v = utils.quant_tensor( + param_v.copy(), scale_v, quant_axis, + self._weight_bits) quantized_param_v = np.round(quantized_param_v) if self._bias_correction == True: - quantized_param_v = self._bias_correction_w( - param_v, quantized_param_v, scale_v, quant_axis) + quantized_param_v = utils.bias_correction_w( + param_v, + quantized_param_v, + scale_v, + quant_axis, + weight_bits=self._weight_bits) quantized_param_v = np.round(quantized_param_v) self._restore_var(input_arg_name, quantized_param_v) self._remove_fake_quant_and_dequant_op(graph, op_node) @@ -1319,7 +1062,7 @@ def apply(self, graph): op_node_desc.attr("quantization_type") == "qat_with_weight": if self._weight_quantize_type == 'channel_wise_abs_max': quant_axis = 1 if op_node.name() in \ - _channelwise_quant_axis1_ops else 0 + utils._channelwise_quant_axis1_ops else 0 self._insert_post_channel_dequant_op(graph, op_node, quant_axis) else: @@ -1519,46 +1262,6 @@ def _is_float(self, v): return isinstance(v, float) or isinstance(v, np.float32) \ or isinstance(v, np.float64) - def _bias_correction_w(self, x, x_quant, scale_v, quant_axis): - ''' - Bias correction for weight - ''' - eps = 1e-8 - bnt = (1 << (self._weight_bits - 1)) - 1 - x_dequant = x_quant.copy() - if isinstance(scale_v, list): - if quant_axis == 0: - for i, s in enumerate(scale_v): - x_dequant[i] = x_dequant[i] * s / bnt - quant_bias = x - x_dequant - mean_bias = quant_bias.reshape(quant_bias.shape[0], -1).mean(-1) - std_orig = x.reshape(x.shape[0], -1).std(-1) - std_quant = x_dequant.reshape(x_dequant.shape[0], -1).std(-1) - std_bias = std_orig / (std_quant + eps) - else: - for i, s in enumerate(scale_v): - x_dequant[:, i] = x_quant[:, i] * s / bnt - quant_bias = x - x_dequant - mean_bias = np.array([ - quant_bias[:, i].mean() for i in range(quant_bias.shape[1]) - ]) - std_orig = np.array([x[:, i].std() for i in range(x.shape[1])]) - std_quant = np.array( - [x_dequant[:, i].std() for i in range(x_dequant.shape[1])]) - std_bias = std_orig / (std_quant + eps) - else: - x_dequant = x_quant * scale_v / bnt - mean_bias = (x - x_dequant).mean() - std_bias = x.std() / (x_dequant.std() + eps) - if mean_bias.ndim == 1: - std_bias = np.resize(std_bias, 
x.shape) - mean_bias = np.resize(mean_bias, x.shape) - - x_dequant = (mean_bias + x_dequant) * std_bias - quantized_param_v = quant_tensor(x_dequant, scale_v, quant_axis, - self._weight_bits) - return quantized_param_v - class ConvertToInt8Pass(object): def __init__(self, scope, place, quantizable_op_type=None): @@ -1707,7 +1410,7 @@ def __init__(self, scope=None, place=None, moving_rate=0.9): self._place = _get_paddle_place(place) self._moving_rate = moving_rate self._is_test = None - self._teller_set = _out_scale_op_list + self._teller_set = utils._out_scale_op_list def apply(self, graph): """ @@ -1725,7 +1428,7 @@ def apply(self, graph): if op.name() in self._teller_set: target_ops.append(op) for op in target_ops: - for output_var_name in _get_op_output_var_names(op): + for output_var_name in utils._get_op_output_var_names(op): in_node = graph._find_node_by_name(op.outputs, output_var_name) if in_node.dtype() not in \ [core.VarDesc.VarType.FP64, core.VarDesc.VarType.FP32]: @@ -1796,14 +1499,13 @@ def apply(self, graph): graph.link_to(accum_in_node, scale_op_node) graph.link_to(scale_op_node, state_out_node) graph.link_to(scale_op_node, accum_out_node) - graph.resolve_hazard() return graph def _scale_name(self, var_name): """ Return the scale name for the var named `var_name`. """ - return "%s@scale" % (var_name) + return "%s.scale" % (var_name) class OutScaleForInferencePass(object): @@ -1816,7 +1518,7 @@ def __init__(self, scope=None): scope(fluid.Scope): The scope is used to initialize these new parameters. """ self._scope = scope - self._teller_set = _out_scale_op_list + self._teller_set = utils._out_scale_op_list def apply(self, graph): """ @@ -1831,7 +1533,7 @@ def apply(self, graph): op_nodes = graph.all_op_nodes() for op_node in op_nodes: if op_node.name() in self._teller_set: - var_names = _get_op_output_var_names(op_node) + var_names = utils._get_op_output_var_names(op_node) for var_name in var_names: in_node = graph._find_node_by_name(op_node.outputs, var_name) @@ -1848,7 +1550,8 @@ def apply(self, graph): # For compatibility, we save output threshold by two methods. op_node.op()._set_attr("out_threshold", float(scale_value)) - argname_index = _get_output_name_index(op_node, var_name) + argname_index = utils._get_output_name_index(op_node, + var_name) assert argname_index is not None, \ var_name + " is not the output of the op" op_node.op()._set_attr(argname_index[0] + str(argname_index[1]) \ @@ -1861,7 +1564,7 @@ def _scale_name(self, var_name): """ Return the scale name for the var named `var_name`. """ - return "%s@scale" % (var_name) + return "%s.scale" % (var_name) class AddQuantDequantPass(object): @@ -1869,95 +1572,6 @@ class AddQuantDequantPass(object): Quantize the ops that do not have weights, and add quant_dequant op for the quantized ops's inputs. 
""" - _supported_quantizable_op_type = [ - "pool2d", - "elementwise_add", - "concat", - "softmax", - "argmax", - "transpose", - "equal", - "gather", - "greater_equal", - "greater_than", - "less_equal", - "less_than", - "mean", - "not_equal", - "reshape", - "reshape2", - "dropout", - "bilinear_interp", - "nearest_interp", - "trilinear_interp", - "slice", - "squeeze", - "elementwise_sub", - "mul", - "matmul", - "relu", - "relu6", - "leaky_relu", - "tanh", - "swish", - "scale", - "transpose", - "transpose2", - "sigmoid", - "pad2d", - "flatten", - "flatten2", - "batch_norm", - "layer_norm", - "matmul_v2", - "split", - "flatten_contiguous_range", - "squeeze2", - "nearest_interp_v2", - "bilinear_interp", - "bilinear_interp_v2", - "fill_constant_batch_size_like", - "arg_max", - "abs", - "assign", - "cast", - "clip", - "box_coder", - "crop", - "cumsum", - "elementwise_mul", - "elementwise_pow", - "expand_v2", - "fill_any_like", - "fill_constant", - "gelu", - "hard_sigmoid", - "hard_swish", - "instance_norm", - "lookup_table", - "lookup_table_v2", - "norm", - "p_norm", - "pad3d", - "pow", - "prelu", - "reduce_mean", - "unsqueeze", - "unsqueeze2", - "logical_and", - "logical_not", - "meshgrid", - "roi_align", - "strided_slice", - "where", - "grid_sampler", - "tile", - "group_norm", - "reduce_sum", - "square", - "softplus", - "shuffle_channel", - ] # To be compatible with PaddleSlim, not remove _activation_type for now _activation_type = ["relu", "relu6", "leaky_relu", "tanh", "swish"] @@ -2000,12 +1614,11 @@ def __init__(self, self._skip_pattern = skip_pattern if is_full_quantized: - self._quantizable_op_type = \ - AddQuantDequantPass._supported_quantizable_op_type + self._quantizable_op_type = utils._act_supported_quantizable_op_type else: self._quantizable_op_type = quantizable_op_type for op_type in quantizable_op_type: - assert op_type in AddQuantDequantPass._supported_quantizable_op_type, \ + assert op_type in utils._act_supported_quantizable_op_type, \ op_type + " is not supported for quantization." self._quantizable_grad_op_type = [ '%s_grad' % (op) for op in self._quantizable_op_type @@ -2050,7 +1663,7 @@ def apply(self, graph): "qat_without_weight") op_node.op()._set_attr("activation_bits", self._quant_bits) op_node.op()._set_attr("with_quant_attr", True) - arg_names = _get_op_input_var_names(op_node) + arg_names = utils._get_op_input_var_names(op_node) for arg_name in arg_names: in_node = graph._find_node_by_name(op_node.inputs, arg_name) if arg_name in dequantized_vars_map: @@ -2095,7 +1708,7 @@ def _inser_quant_dequant_moving_average_abs_max_op(self, graph, var_node, _init_var_node( scale_in_node, np.array( - [0.001], dtype=data_type), + [_SCALE_DEFAULT_VALUE], dtype=data_type), self._scope, self._place) @@ -2162,3 +1775,870 @@ def _inser_quant_dequant_moving_average_abs_max_op(self, graph, var_node, graph.link_to(quant_op_node, accum_out_node) return quant_var_node, scale_out_node + + +class InsertQuantizeLinear(object): + """ + Insert quantize_linear and dequantize_linear op before ops. + + Args: + place(paddle.CPUPlace|paddle.CUDAPlace|str): place is used to restore the weight tensors. + If it's string, It can be ``cpu``, and ``gpu:x``, where ``x`` is the index of the GPUs. + scope(paddle.Scope): scope is used to get the weight tensor values. + quant_bits(int, optional): quantization bit number for weight. Default is 8. + quant_axis(int, optional): quantization dimension of channels. 
When it is greater than or + equal to 0, it will quantization with per channel, else quantization with per layer. + Default is -1. + channel_wise(bool, optional): Whether quantization with per channel or not. Default is False. + is_test(bool, optional): Whether quantization with training or not. Default is True. + """ + + def __init__(self, + place, + scope, + quant_bits=8, + quant_axis=-1, + channel_wise=False, + is_test=True): + self._place = place + self._scope = scope + self.quant_bits = quant_bits + self.quant_axis = quant_axis + self.channel_wise = channel_wise + self._is_test = is_test + + def insert_quant_op(self, graph, var_node): + assert var_node.is_var(), '{} is not a var'.format(var_node.name()) + + quant_var_node = graph.create_var_node( + name=self._quantized_var_name(var_node.name()), + var_type=var_node.type(), + shape=var_node.shape(), + var_dtype=var_node.dtype()) + data_type = 'float64' if var_node.dtype( + ) == core.VarDesc.VarType.FP64 else 'float32' + if self.channel_wise: + scale_var_shape = var_node.shape()[self.quant_axis] + scale_var_type = core.VarDesc.VarType.LOD_TENSOR + init_scale_value = np.zeros(scale_var_shape, dtype=data_type) + else: + scale_var_shape = 1 + scale_var_type = var_node.type() + init_scale_value = np.array([_SCALE_DEFAULT_VALUE], dtype=data_type) + scale_var_node = graph.create_persistable_node( + name=self._quantized_scale_name(var_node.name()), + var_type=scale_var_type, + shape=[scale_var_shape], + var_dtype=var_node.dtype()) + _init_var_node(scale_var_node, init_scale_value, self._scope, + self._place) + + zero_point_node = None + if zero_point_node is None: + zero_point_node = graph.create_persistable_node( + name=self._zero_point_name(quant_var_node.name()), + var_type=core.VarDesc.VarType.LOD_TENSOR, + shape=scale_var_node.shape(), + var_dtype=core.VarDesc.VarType.INT32) + _init_var_node( + zero_point_node, + np.zeros( + scale_var_node.shape(), dtype="int32"), + self._scope, + self._place) + + inputs = {"X": var_node, "Scale": scale_var_node} + if zero_point_node is not None: + inputs["ZeroPoint"] = zero_point_node + + attrs = {"quant_axis": self.quant_axis, "bit_length": self.quant_bits} + outputs = {"Y": quant_var_node} + if not self._is_test: + attrs["is_test"] = self._is_test + attrs["op_role"] = core.op_proto_and_checker_maker.OpRole.Forward + scale_out_node = graph.create_var_node_from_desc(scale_var_node.var( + )) + outputs["OutScale"] = scale_out_node + + quant_op_node = graph.create_op_node( + op_type="quantize_linear", + attrs=attrs, + inputs=inputs, + outputs=outputs) + + graph.link_to(var_node, quant_op_node) + graph.link_to(scale_var_node, quant_op_node) + if zero_point_node is not None: + graph.link_to(zero_point_node, quant_op_node) + graph.link_to(quant_op_node, quant_var_node) + if not self._is_test: + graph.link_to(quant_op_node, scale_out_node) + return quant_var_node, scale_var_node + + def insert_dequant_op(self, graph, var_node, scale_var_node): + assert var_node.is_var(), '{} is not a var'.format(var_node.name()) + + dequant_var_node = graph.create_var_node( + name=self._dequantized_var_name(var_node.name()), + var_type=var_node.type(), + shape=var_node.shape(), + var_dtype=var_node.dtype()) + + zero_point_node = None + if zero_point_node is None: + zero_point_node = graph.create_persistable_node( + name=self._zero_point_name(dequant_var_node.name()), + var_type=core.VarDesc.VarType.LOD_TENSOR, + shape=scale_var_node.shape(), + var_dtype=core.VarDesc.VarType.INT32) + _init_var_node( + zero_point_node, + 
np.zeros( + scale_var_node.shape(), dtype="int32"), + self._scope, + self._place) + + inputs = {"X": var_node, "Scale": scale_var_node} + if zero_point_node is not None: + inputs["ZeroPoint"] = zero_point_node + + attrs = {"quant_axis": self.quant_axis, "bit_length": self.quant_bits} + if not self._is_test: + attrs["op_role"] = core.op_proto_and_checker_maker.OpRole.Forward + + quant_op_node = graph.create_op_node( + op_type="dequantize_linear", + attrs=attrs, + inputs=inputs, + outputs={"Y": dequant_var_node}) + + graph.link_to(var_node, quant_op_node) + graph.link_to(scale_var_node, quant_op_node) + if zero_point_node is not None: + graph.link_to(zero_point_node, quant_op_node) + graph.link_to(quant_op_node, dequant_var_node) + return dequant_var_node + + def _quantized_var_name(self, var_name): + """ + Return quantized variable name for the input `var_name`. + """ + return "%s.quantized" % (var_name) + + def _dequantized_var_name(self, var_name): + """ + Return dequantized variable name for the input `var_name`. + """ + return "%s.dequantized" % (var_name) + + def _quantized_scale_name(self, var_name): + """ + Return the scale name of quantized variable for the input `var_name`. + """ + return "%s.scale" % (var_name) + + def _zero_point_name(self, var_name): + """ + Return the scale name for the var named `var_name`. + """ + return "%s@zero_point" % (var_name) + + +class QuantizationTransformPassV2(object): + """ + Quantize the ops that have weights. Add quant and dequant ops for + the quantized ops's inputs. + """ + + def __init__(self, + scope=None, + place=None, + weight_bits=8, + activation_bits=8, + activation_quantize_type='abs_max', + weight_quantize_type='abs_max', + window_size=10000, + moving_rate=0.9, + skip_pattern=['skip_quant'], + quantizable_op_type=['conv2d', 'depthwise_conv2d', 'mul'], + weight_quantize_func=None, + act_quantize_func=None, + weight_preprocess_func=None, + act_preprocess_func=None, + optimizer_func=None, + executor=None): + r""" + Args: + scope(paddle.Scope): When activation use 'range_abs_max' as the quantize + type, this pass will create some new parameters. The scope is used to + initialize these new parameters. + place(paddle.CPUPlace|paddle.CUDAPlace|str): place is used to initialize new + parameters described above. If it's string, It can be ``cpu``, and ``gpu:x``, + where ``x`` is the index of the GPUs. + weight_bits(int): quantization bit number for weights, + the bias is not quantized. + activation_bits(int): quantization bit number for activation. + activation_quantize_type(str): quantization type for activation, + now support 'abs_max', 'range_abs_max' and 'moving_average_abs_max'. + If use 'abs_max' mode, the quantization scale will be calculated + dynamically each step in both training and testing period. If use + 'range_abs_max', a static quantization scale will be calculated + during training and used in inference. + weight_quantize_type(str): quantization type for weights, + support 'abs_max' and 'channel_wise_abs_max'. The 'range_abs_max' + usually is not used for weight, since weights are fixed once the + model is well trained. + window_size(int): the window size for 'range_abs_max' quantization. + moving_rate(float): the param for 'moving_average_abs_max' quantization. + skip_pattern(str or str list): The user-defined quantization skip pattern, which + will be presented in the name scope of an op. When the skip pattern is + detected in an op's name scope, the corresponding op will not be quantized. 
+            quantizable_op_type(list[str]): List the type of ops that will be quantized.
+                Default is ["conv2d", "depthwise_conv2d", "mul"]. The quantizable_op_type in
+                QuantizationFreezePass and ConvertToInt8Pass must be the same as this.
+            weight_quantize_func(function): Function that defines how to quantize weight.
+                Using this can quickly test if user's quantization method works or not.
+                In this function, user should both define quantization function and
+                dequantization function, that is, the function's input is non-quantized
+                weight and function returns dequantized weight. If None, will use
+                quantization op defined by 'weight_quantize_type'. Default is None.
+            act_quantize_func(function): Function that defines how to quantize activation.
+                Using this can quickly test if user's quantization method works or not.
+                In this function, user should both define quantization and dequantization
+                process, that is, the function's input is non-quantized activation and
+                function returns dequantized activation. If None, will use quantization
+                op defined by 'activation_quantize_type'. Default is None.
+            weight_preprocess_func(function): Function that defines how to preprocess
+                weight before quantization. Using this can quickly test if user's preprocess
+                method works or not. The function's input is non-quantized weight and
+                function returns processed weight to be quantized. If None, the weight will
+                be quantized directly. Default is None.
+            act_preprocess_func(function): Function that defines how to preprocess
+                activation before quantization. Using this can quickly test if user's
+                preprocess method works or not. The function's input is non-quantized
+                activation and function returns processed activation to be quantized.
+                If None, the activation will be quantized directly. Default is None.
+            optimizer_func(function): Function that returns an optimizer. When 'is_test' is
+                False and the user wants to use a self-defined quantization function and
+                preprocess function, this function must be set. Default is None.
+            executor(paddle.Executor): If the user wants to use a self-defined quantization
+                function and preprocess function, executor must be set for initialization.
+                Default is None.
+
+        Examples:
+        .. code-block:: python
+            # The original graph will be rewritten.
+            import paddle
+            from paddle.fluid.contrib.slim.quantization \
+                import QuantizationTransformPassV2
+            from paddle.fluid.contrib.slim.graph import IrGraph
+            from paddle.fluid import core
+
+            graph = IrGraph(core.Graph(program.desc), for_test=False)
+            place = paddle.CPUPlace()
+            scope = paddle.static.global_scope()
+            transform_pass = QuantizationTransformPassV2(scope, place)
+            transform_pass.apply(graph)
+        """
+        self._scope = scope
+        self._place = _get_paddle_place(place)
+        self._weight_bits = weight_bits
+        self._activation_bits = activation_bits
+        self._skip_pattern = skip_pattern
+        self._weight_quantize_func = weight_quantize_func
+        self._act_quantize_func = act_quantize_func
+        self._weight_preprocess_func = weight_preprocess_func
+        self._act_preprocess_func = act_preprocess_func
+        self._optimizer = optimizer_func
+        self._exe = executor
+        quant_type = [
+            'abs_max', 'channel_wise_abs_max', 'range_abs_max',
+            'moving_average_abs_max'
+        ]
+        assert activation_quantize_type != 'channel_wise_abs_max', \
+            "The activation quantization type does not support 'channel_wise_abs_max'."
+        if activation_quantize_type not in quant_type:
+            raise ValueError(
+                "Unknown activation_quantize_type: '%s'. 
It can only be "
+                "'abs_max' or 'range_abs_max' or 'moving_average_abs_max'." %
+                (str(activation_quantize_type)))
+        if weight_quantize_type not in quant_type:
+            raise ValueError(
+                "Unknown weight_quantize_type: '%s'. It can only be "
+                "'abs_max' or 'channel_wise_abs_max' or 'range_abs_max' "
+                "or 'moving_average_abs_max'." % (str(weight_quantize_type)))
+
+        self._activation_quantize_type = activation_quantize_type
+        self._weight_quantize_type = weight_quantize_type
+        self._window_size = window_size
+        self._moving_rate = moving_rate
+
+        self._quantizable_ops = quantizable_op_type
+        for op in self._quantizable_ops:
+            assert op in utils._weight_supported_quantizable_op_type, \
+                op + " is not supported for quantization."
+        self._quantizable_grad_ops = [
+            '%s_grad' % (op) for op in self._quantizable_ops
+        ]
+        self._is_test = None
+        self._global_step = None
+
+        self.create_var_map = {}
+        self.create_op_map = {}
+
+        # marked the variable which has been dequantized.
+        self.dequantized_vars = collections.OrderedDict()
+        self.persistable_vars = []
+        self.processed_vars = []
+
+    def _quant_preprocess(self, op_node):
+        user_skipped = False
+        if isinstance(self._skip_pattern, list):
+            user_skipped = op_node.op().has_attr("op_namescope") and \
+                any(pattern in op_node.op().attr("op_namescope") \
+                    for pattern in self._skip_pattern)
+        elif isinstance(self._skip_pattern, str):
+            user_skipped = op_node.op().has_attr("op_namescope") and \
+                op_node.op().attr("op_namescope").find(
+                    self._skip_pattern) != -1
+
+        if user_skipped:
+            op_node.op()._set_attr("skip_quant", True)
+            op_node.op()._set_attr("with_quant_attr", True)
+
+    def _transform_forward(self, graph, op):
+        op.op()._set_attr("quantization_type", "qat_with_weight")
+        inputs = op.inputs
+        for var_node in inputs:
+            if var_node.name() not in op.input_arg_names():
+                continue
+            if var_node.name() in self.dequantized_vars:
+                dequant_var_node = self.dequantized_vars[var_node.name()]
+            else:
+                name = var_node.name()
+                if name in self.processed_vars:
+                    continue
+                is_weight = True if var_node.name() in self.persistable_vars \
+                    else False
+
+                # if var node is weight and weight_preprocess_func is not None,
+                # will insert weight preprocess func
+                # to preprocess weight before quantization
+                # if var node is activation and act_preprocess_func is not None,
+                # will insert activation preprocess func
+                # to preprocess activation before quantization
+                if is_weight and self._weight_preprocess_func is not None:
+                    var_node = self._insert_func(
+                        graph, self._weight_preprocess_func, var_node, op)
+                elif not is_weight and self._act_preprocess_func is not None:
+                    var_node = self._insert_func(
+                        graph, self._act_preprocess_func, var_node, op)
+
+                # if var node is weight and weight_quantize_func is not None,
+                # will insert weight quantize func to quantize and dequantize weight
+                # if var node is activation and act_quantize_func is not None,
+                # will insert act quantize func to quantize and dequantize activation
+                if is_weight and self._weight_quantize_func is not None:
+                    target_out_node = self._insert_func(
+                        graph, self._weight_quantize_func, var_node, op)
+                    self.processed_vars.append(name)
+                    continue
+                elif not is_weight and self._act_quantize_func is not None:
+                    target_out_node = self._insert_func(
+                        graph, self._act_quantize_func, var_node, op)
+                    self.processed_vars.append(name)
+                    continue
+
+                quant_bits = self._weight_bits if var_node.name() in self.persistable_vars \
+                    else self._activation_bits
+                quant_type = self._weight_quantize_type if is_weight \
+                    else 
self._activation_quantize_type + quant_axis = -1 + channel_wise = False + if quant_type == 'channel_wise_abs_max': # Weight quantization + channel_wise = True + quant_axis = 1 if op.name() in \ + utils._channelwise_quant_axis1_ops else 0 + insert_quant_pass = InsertQuantizeLinear( + self._place, + self._scope, + quant_bits=quant_bits, + quant_axis=quant_axis, + channel_wise=channel_wise, + is_test=self._is_test) + quant_var_node, scale_var_node = insert_quant_pass.insert_quant_op( + graph, var_node) + dequant_var_node = insert_quant_pass.insert_dequant_op( + graph, quant_var_node, scale_var_node) + + self.dequantized_vars[name] = dequant_var_node + graph.update_input_link(var_node, dequant_var_node, op) + + def _transform_backward(self, graph, op): + for var_node in op.inputs: + if var_node.name() not in op.input_arg_names(): + continue + if var_node.name() in self.dequantized_vars: + dequant_var_node = self.dequantized_vars[var_node.name()] + graph.update_input_link(var_node, dequant_var_node, op) + + def _has_weight(self, op): + has_weight = False + for var_node in op.inputs: + if var_node.name() not in op.input_arg_names(): + continue + name = var_node.name() + if var_node.name() in self.persistable_vars: + has_weight = True + return has_weight + + def _is_skip_quant(self, graph, op_node): + """ + Analyse whether the op node skips quantization. + """ + is_skip = False + if op_node.op().has_attr("skip_quant") and \ + op_node.op().attr("skip_quant"): + is_skip = True + # if the inputs of mul and matmul are not all persistable, use + # AddQuantDequantPassV2 to quantize them. + if op_node.name() in ["mul", "matmul", "matmul_v2"] and \ + _is_input_all_not_persistable(graph, op_node): + is_skip = True + if op_node.op().has_attr("quantization_type") and \ + op_node.op().attr("quantization_type") == "qat_without_weight": + is_skip = True + return is_skip + + def apply(self, graph): + """ + Quantize the graph for training process. According to weight and + activation quantization type, the graph will be added some fake + quantize operators and fake dequantize operators. + + Args: + graph(IrGraph): the applied graph. + Returns: + None + """ + assert isinstance(graph, + IrGraph), 'graph must be the instance of IrGraph.' + self._is_test = graph.is_test() + + self.persistable_vars = [ + p.name() for p in graph.all_persistable_nodes() + ] + + ops = graph.all_op_nodes() + # Do the preproccess of quantization, such as skipping some ops + # for not being quantized. + for op in ops: + if op.name() in self._quantizable_ops or \ + op.name() in self._quantizable_grad_ops: + self._quant_preprocess(op) + # Insert mapping table to solve the problem in saving inference model. + graph.out_node_mapping_table = dict() + # The process of _transform_forward and _transform_backward is needed in two for loops. + # The loop for transforming the forward graph: + for op in ops: + if op.name() in self._quantizable_ops: + if not self._is_skip_quant(graph, op) and self._has_weight(op): + self._transform_forward(graph, op) + # The loop for renaming the inputs of backward op. + for op in ops: + if op.name() in self._quantizable_grad_ops and self._has_weight(op): + self._transform_backward(graph, op) + return graph + + +class AddQuantDequantPassV2(object): + """ + Quantize the ops that do not have weights, and add quant_linear and dequant_linear + op for the quantized ops's inputs. 
+ """ + + # To be compatible with PaddleSlim, not remove _activation_type for now + _activation_type = ["relu", "relu6", "leaky_relu", "tanh", "swish"] + + def __init__(self, + scope=None, + place=None, + moving_rate=0.9, + quant_bits=8, + skip_pattern=["skip_quant"], + quantizable_op_type=["elementwise_add", "pool2d"], + is_full_quantized=False): + """ + Args: + scope(paddle.Scope): The scope is used to initialize these new parameters. + place(paddle.CPUPlace|paddle.CUDAPlace|str): place is used to initialize new + parameters described above. If ``place`` is string, it can be It can be ``cpu`` + or ``gpu:x``, where ``x`` is the index of the GPUs. + moving_rate(float, optional): the param for 'quant_dequant_moving_average_abs_max' + quantization. Default is 0.9. + quant_bits(int, optional): quantization bit number for activation. Default is 8. + skip_pattern(str, optional): The user-defined quantization skip pattern, which + will be presented in the name scope of an op. When the skip pattern is + detected in an op's name scope, the corresponding op will not be quantized. + Default is 'skip_quant'. + quantizable_op_type(list[str], optional): List the type of ops that will be + quantized. Default is ["elementwise_add", "pool2d"]. + is_full_quantized(bool, optional): If set is_full_quantized as True, apply + quantization to all supported quantizable op type. If set is_full_quantized + as False, only apply quantization to the op type according to the input + quantizable_op_type. + + Examples: + .. code-block:: python + # The original graph will be rewrite. + import paddle + from paddle.fluid.contrib.slim.quantization \ + import AddQuantDequantPassV2 + from paddle.fluid.contrib.slim.graph import IrGraph + from paddle.fluid import core + + graph = IrGraph(core.Graph(program.desc), for_test=False) + place = paddle.CPUPlace() + scope = paddle.static.global_scope() + add_quant_dequant_pass = AddQuantDequantPassV2(scope, place) + add_quant_dequant_pass.apply(graph) + """ + self._scope = scope + self._place = _get_paddle_place(place) + self._moving_rate = moving_rate + self._quant_bits = quant_bits + self._is_test = None + self._skip_pattern = skip_pattern + + if is_full_quantized: + self._quantizable_op_type = utils._act_supported_quantizable_op_type + else: + self._quantizable_op_type = quantizable_op_type + for op_type in quantizable_op_type: + assert op_type in utils._act_supported_quantizable_op_type, \ + op_type + " is not supported for quantization." + self._quantizable_grad_op_type = [ + '%s_grad' % (op) for op in self._quantizable_op_type + ] + + assert self._scope != None, "scope must not be None." + assert self._place != None, "place must not be None." + self.persistable_vars = [] + + def apply(self, graph): + """ + Add quant_dequant before some ops, such as the 'elementwise_add' and + 'pool2d' op. + + Args: + graph(IrGraph): the target graph. + Returns: + None + """ + assert isinstance(graph, + IrGraph), 'graph must be the instance of IrGraph.' 
+ self._is_test = graph.is_test() + dequantized_vars_map = collections.OrderedDict() + + self.persistable_vars = [ + p.name() for p in graph.all_persistable_nodes() + ] + + # Forward stage, insert quant_dequant op + all_op_nodes = graph.all_op_nodes() + for op_node in all_op_nodes: + if op_node.name() in self._quantizable_op_type: + is_skip = False + if isinstance(self._skip_pattern, list): + is_skip = op_node.op().has_attr("op_namescope") and \ + any(pattern in op_node.op().attr("op_namescope") for pattern in self._skip_pattern) + elif isinstance(self._skip_pattern, str): + is_skip = op_node.op().has_attr("op_namescope") and \ + op_node.op().attr("op_namescope").find(self._skip_pattern) != -1 + is_quantized = op_node.op().has_attr("quantization_type") and \ + op_node.op().attr("quantization_type") == "qat_with_weight" + if is_skip or is_quantized: + continue + + op_node.op()._set_attr("quantization_type", + "qat_without_weight") + arg_names = utils._get_op_input_var_names(op_node) + for arg_name in arg_names: + in_node = graph._find_node_by_name(op_node.inputs, arg_name) + if in_node.persistable(): + continue + if arg_name in dequantized_vars_map: + dequant_var_node = dequantized_vars_map[arg_name] + else: + insert_quant_pass = InsertQuantizeLinear( + self._place, + self._scope, + quant_bits=self._quant_bits, + quant_axis=-1, + channel_wise=False, + is_test=self._is_test) + quant_var_node, scale_var_node = insert_quant_pass.insert_quant_op( + graph, in_node) + dequant_var_node = insert_quant_pass.insert_dequant_op( + graph, quant_var_node, scale_var_node) + dequantized_vars_map[arg_name] = dequant_var_node + graph.update_input_link(in_node, dequant_var_node, op_node) + + # Backward stage, update input link + for op_node in all_op_nodes: + if op_node.name() in self._quantizable_grad_op_type: + for input_name in op_node.input_arg_names(): + if input_name in dequantized_vars_map: + in_node = graph._find_node_by_name(op_node.inputs, + input_name) + dequant_var_node = dequantized_vars_map[input_name] + graph.update_input_link(in_node, dequant_var_node, + op_node) + + return graph + + +class ReplaceFakeQuantDequantPass(object): + """ + replace quant-dequant ops with quantize_linear and dequantize_linear ops. + """ + + def __init__(self, scope, place): + r""" + Args: + scope(paddle.Scope): The scope is used to initialize these new parameters. + place(paddle.CPUPlace|paddle.CUDAPlace|str): place is used to initialize new + parameters described above. If ``place`` is string, it can be It can be ``cpu`` + or ``gpu:x``, where ``x`` is the index of the GPUs. + + Examples: + .. code-block:: python + # The original graph will be rewrite. + import paddle + from paddle.fluid.contrib.slim.quantization \ + import ReplaceFakeQuantDequantPass + from paddle.fluid.contrib.slim.graph import IrGraph + from paddle.fluid import core + + graph = IrGraph(core.Graph(program.desc), for_test=False) + place = paddle.CPUPlace() + scope = paddle.static.global_scope() + replace_pass = ReplaceFakeQuantDequantPass(scope, place) + replace_pass.apply(graph) + """ + self._place = _get_paddle_place(place) + self._scope = scope + assert self._scope != None, "scope must not be None." + assert self._place != None, "place must not be None." + + def apply(self, graph): + assert isinstance(graph, + IrGraph), 'graph must be the instance of IrGraph.' 
+ fake_quant_dequant_ops = [] + + for op in graph.all_op_nodes(): + if op.name() in _fake_quant_dequant_op_list: + fake_quant_dequant_ops.append(op) + + for _op in fake_quant_dequant_ops: + self._replace_op(graph, _op) + graph.safe_remove_nodes(_op) + + graph.resolve_hazard() + return graph + + def _replace_op(self, graph, op): + x_node = graph._find_node_by_name(op.inputs, op.input("X")[0]) + out_node = graph._find_node_by_name(op.outputs, op.output("Out")[0]) + scale_node = graph._find_node_by_name(op.outputs, + op.output("OutScale")[0]) + + quant_axis = op.op().attr("quant_axis") if op.op().has_attr( + "quant_axis") else -1 + bit_length = op.op().attr("bit_length") if op.op().has_attr( + "bit_length") else 8 + + zero_point_node = None + quanted_node = x_node + if zero_point_node is None: + zero_point_node = graph.create_persistable_node( + name=self._zero_point_name(quanted_node.name()), + var_type=core.VarDesc.VarType.LOD_TENSOR, + shape=scale_node.shape(), + var_dtype=core.VarDesc.VarType.INT32) + _init_var_node( + zero_point_node, + np.zeros( + scale_node.shape(), dtype="int32"), + self._scope, + self._place) + + quant_var_node = graph.create_var_node( + name=self._quantized_var_name(x_node.name()), + var_type=x_node.type(), + shape=x_node.shape(), + var_dtype=x_node.dtype()) + quant_op_node = graph.create_op_node( + op_type="quantize_linear", + attrs={"quant_axis": quant_axis, + "bit_length": bit_length}, + inputs={ + "X": x_node, + "Scale": scale_node, + "ZeroPoint": zero_point_node + }, + outputs={"Y": quant_var_node}) + graph.link_to(x_node, quant_op_node) + graph.link_to(scale_node, quant_op_node) + if zero_point_node is not None: + graph.link_to(zero_point_node, quant_op_node) + graph.link_to(quant_op_node, quant_var_node) + dequant_op_node = graph.create_op_node( + op_type="dequantize_linear", + attrs={"quant_axis": quant_axis, + "bit_length": bit_length}, + inputs={ + "X": quant_var_node, + "Scale": scale_node, + "ZeroPoint": zero_point_node + }, + outputs={"Y": out_node}) + graph.link_to(quant_var_node, dequant_op_node) + graph.link_to(scale_node, dequant_op_node) + if zero_point_node is not None: + graph.link_to(zero_point_node, dequant_op_node) + graph.link_to(dequant_op_node, out_node) + + def _quantized_var_name(self, var_name): + """ + Return quantized variable name for the input `var_name`. + """ + return "%s.quantized" % (var_name) + + def _zero_point_name(self, var_name): + """ + Return the scale name for the var named `var_name`. + """ + return "%s@zero_point" % (var_name) + + +class QuantWeightPass(object): + """ + quant weights and remove weights input quantize_linear node. for example: + `weight -> quant -> dequant -> conv2d` will be frozen into `weight -> dequant -> conv2d`, + and weight will be scaled offline. + + Args: + scope(paddle.Scope): scope is used to get the weight tensor values. + place(paddle.CPUPlace|paddle.CUDAPlace|str): place is used to restore the weight tensors. + If it's string, It can be ``cpu``, and ``gpu:x``, where ``x`` is the index of the GPUs. + bias_correction(bool): whether use bias correction for post-training quantization. + https://arxiv.org/abs/1810.05723. + quant_bits(int, optional): quantization bit number for weight. Default is 8. + save_int_weight(bool, optional): Whether the type saving the weight is int. Default is True. + + Examples: + .. code-block:: python + # The original graph will be rewrite. 
+ import paddle + from paddle.fluid.contrib.slim.quantization \ + import QuantWeightPass + from paddle.fluid.contrib.slim.graph import IrGraph + from paddle.fluid import core + + graph = IrGraph(core.Graph(program.desc), for_test=False) + place = paddle.CPUPlace() + scope = paddle.static.global_scope() + quant_weight_pass = QuantWeightPass(scope, place) + quant_weight_pass.apply(graph) + """ + + def __init__(self, + scope, + place, + bias_correction=False, + quant_bits=8, + save_int_weight=True): + self._place = _get_paddle_place(place) + self._scope = scope + self._bias_correction = bias_correction + self._quant_bits = quant_bits + self._save_int_weight = save_int_weight + assert self._scope != None, "scope must not be None." + assert self._place != None, "place must not be None." + + def apply(self, graph): + assert isinstance(graph, + IrGraph), 'graph must be the instance of IrGraph.' + fake_quant_ops_for_weight = [] + + fake_quant_ops = [ + op for op in graph.all_op_nodes() if op.name() == "quantize_linear" + ] + for _op in fake_quant_ops: + x_node = graph._find_node_by_name(_op.inputs, _op.input("X")[0]) + if x_node.persistable(): + scale_node = graph._find_node_by_name(_op.inputs, + _op.input("Scale")[0]) + zero_point_node = graph._find_node_by_name( + _op.inputs, _op.input("ZeroPoint")[0]) + out_node = graph._find_node_by_name(_op.outputs, + _op.output("Y")[0]) + + scale_v = self._load_var(scale_node.name()) + assert scale_v.ndim in [1, 2 + ], "the dim of scale_v should be 1 or 2" + if scale_v.ndim == 2: + scale_v = scale_v[0] + if scale_v.size == 1 and _op.name() == 'abs_max': + scale_v = scale_v[0] + else: + scale_v = scale_v.tolist() + param_v = self._load_var(x_node.name()) + quant_axis = _op.op().attr("quant_axis") + bits_length = _op.op().attr("bit_length") + quantized_param_v = utils.quant_tensor(param_v.copy(), scale_v, + quant_axis, bits_length) + if self._bias_correction == True: + quantized_param_v = utils.bias_correction_w( + param_v, + quantized_param_v, + scale_v, + quant_axis, + weight_bits=bits_length) + if self._save_int_weight: + # cast weight type to int + if self._quant_bits == 8: + save_weight_dtype = np.int8 + quantized_param_v = quantized_param_v.astype( + save_weight_dtype) + self._restore_var(x_node.name(), quantized_param_v) + + for next_op_node in out_node.outputs: + graph.update_input_link(out_node, x_node, next_op_node) + graph.safe_remove_nodes(out_node) + self._remove_unused_var_nodes(graph) + + def _remove_unused_var_nodes(self, graph): + all_used_vars = set() + ops = graph.all_op_nodes() + for op_node in ops: + for input_node in op_node.inputs: + all_used_vars.add(input_node) + for output_node in op_node.outputs: + all_used_vars.add(output_node) + + all_used_vars = {n.node for n in all_used_vars} + all_unused_vars = { + n + for n in filter(lambda node: node.node not in all_used_vars, + graph.all_var_nodes()) + } + graph.safe_remove_nodes(all_unused_vars) + + def _load_var(self, name): + return np.array(self._scope.find_var(name).get_tensor()) + + def _restore_var(self, name, array): + tensor = self._scope.find_var(name).get_tensor() + tensor.set(array, self._place) diff --git a/python/paddle/fluid/contrib/slim/quantization/utils.py b/python/paddle/fluid/contrib/slim/quantization/utils.py index 43f33f33c3138..608844dd55da7 100644 --- a/python/paddle/fluid/contrib/slim/quantization/utils.py +++ b/python/paddle/fluid/contrib/slim/quantization/utils.py @@ -13,11 +13,292 @@ # limitations under the License. 
import numpy as np +from ....framework import IrNode +from ....framework import Operator + +_weight_supported_quantizable_op_type = [ + 'conv2d', 'depthwise_conv2d', 'conv2d_transpose', 'mul', 'matmul', + 'matmul_v2' +] + +_act_supported_quantizable_op_type = [ + "pool2d", + "elementwise_add", + "concat", + "softmax", + "argmax", + "transpose", + "equal", + "gather", + "greater_equal", + "greater_than", + "less_equal", + "less_than", + "mean", + "not_equal", + "reshape", + "reshape2", + "dropout", + "bilinear_interp", + "nearest_interp", + "trilinear_interp", + "slice", + "squeeze", + "elementwise_sub", + "mul", + "matmul", + "relu", + "relu6", + "leaky_relu", + "tanh", + "swish", + "scale", + "transpose", + "transpose2", + "sigmoid", + "pad2d", + "flatten", + "flatten2", + "batch_norm", + "layer_norm", + "matmul_v2", + "split", + "flatten_contiguous_range", + "squeeze2", + "nearest_interp_v2", + "bilinear_interp", + "bilinear_interp_v2", + "fill_constant_batch_size_like", + "arg_max", + "abs", + "assign", + "cast", + "clip", + "box_coder", + "crop", + "cumsum", + "elementwise_mul", + "elementwise_pow", + "expand_v2", + "fill_any_like", + "fill_constant", + "gelu", + "hard_sigmoid", + "hard_swish", + "instance_norm", + "lookup_table", + "lookup_table_v2", + "norm", + "p_norm", + "pad3d", + "pow", + "prelu", + "reduce_mean", + "unsqueeze", + "unsqueeze2", + "logical_and", + "logical_not", + "meshgrid", + "roi_align", + "strided_slice", + "where", + "grid_sampler", + "tile", + "group_norm", + "reduce_sum", + "square", + "softplus", + "shuffle_channel", +] + +_out_scale_op_list = list( + set(_weight_supported_quantizable_op_type + + _act_supported_quantizable_op_type)) _channelwise_quant_axis1_ops = [ 'conv2d_transpose', 'mul', 'matmul', 'matmul_v2' ] +# list op real input and output names, to avoid processing input such as AxisTensor. 
+_op_real_in_out_name = { + "conv2d": [["Input", "Filter"], ["Output"]], + "depthwise_conv2d": [["Input", "Filter"], ["Output"]], + "conv2d_transpose": [["Input", "Filter"], ["Output"]], + "mul": [["X", "Y"], ["Out"]], + "matmul": [["X", "Y"], ["Out"]], + "matmul_v2": [["X", "Y"], ["Out"]], + "pool2d": [["X"], ["Out"]], + "elementwise_add": [["X", "Y"], ["Out"]], + "concat": [["X"], ["Out"]], + "softmax": [["X"], ["Out"]], + "argmax": [["X"], ["Out"]], + "transpose": [["X"], ["Out"]], + "equal": [["X", "Y"], ["Out"]], + "gather": [["X"], ["Out"]], + "greater_equal": [["X", "Y"], ["Out"]], + "greater_than": [["X", "Y"], ["Out"]], + "less_equal": [["X", "Y"], ["Out"]], + "less_than": [["X", "Y"], ["Out"]], + "mean": [["X"], ["Out"]], + "not_equal": [["X", "Y"], ["Out"]], + "reshape": [["X"], ["Out"]], + "reshape2": [["X"], ["Out"]], + "transpose2": [["X"], ["Out"]], + "bilinear_interp": [["X"], ["Out"]], + "nearest_interp": [["X"], ["Out"]], + "trilinear_interp": [["X"], ["Out"]], + "slice": [["Input"], ["Out"]], + "squeeze": [["X"], ["Out"]], + "elementwise_sub": [["X", "Y"], ["Out"]], + "relu": [["X"], ["Out"]], + "relu6": [["X"], ["Out"]], + "leaky_relu": [["X"], ["Out"]], + "prelu": [["X", "Alpha"], ["Out"]], + "tanh": [["X"], ["Out"]], + "swish": [["X"], ["Out"]], + "dropout": [["X"], ["Out"]], + "batch_norm": [["X"], ["Y"]], + "layer_norm": [["X"], ["Y"]], + "sigmoid": [["X"], ["Out"]], + "elementwise_mul": [["X", "Y"], ["Out"]], + "elementwise_pow": [["X", "Y"], ["Out"]], + "scale": [["X"], ["Out"]], + "hard_swish": [["X"], ["Out"]], + "hard_sigmoid": [["X"], ["Out"]], + "gru": [["Input", "Weight"], ["Hidden"]], + "lstm": [["Input", "Weight"], ["Hidden"]], + "pad2d": [["X"], ["Out"]], + "pad3d": [["X"], ["Out"]], + "flatten": [["X"], ["Out"]], + "flatten2": [["X"], ["Out"]], + "unsqueeze2": [["X"], ["Out"]], + "unsqueeze2": [["X"], ["Out"]], + "flatten_contiguous_range": [["X"], ["Out"]], + "split": [["X"], ["Out"]], + "squeeze2": [["X"], ["Out"]], + "nearest_interp_v2": [["X"], ["Out"]], + "bilinear_interp": [["X"], ["Out"]], + "bilinear_interp_v2": [["X"], ["Out"]], + "fill_constant_batch_size_like": [["Input"], ["Out"]], + "arg_max": [["X"], ["Out"]], + "abs": [["X"], ["Out"]], + "assign": [["X"], ["Out"]], + "cast": [["X"], ["Out"]], + "clip": [["X"], ["Out"]], + "box_coder": [["PriorBox"], ["OutputBox"]], + "crop": [["X"], ["Out"]], + "cumsum": [["X"], ["Out"]], + "expand_v2": [["X"], ["Out"]], + "fill_any_like": [["X"], ["Out"]], + "fill_constant": [[], ["Out"]], + "gelu": [["X"], ["Out"]], + "instance_norm": [["X"], ["Out"]], + "lookup_table": [["W", "Ids"], ["Out"]], + "lookup_table_v2": [["W", "Ids"], ["Out"]], + "norm": [["X"], ["Norm"]], + "p_norm": [["X"], ["Out"]], + "pow": [["X"], ["Out"]], + "reduce_mean": [["X"], ["Out"]], + "stack": [["X"], ["Y"]], + "top_k_v2": [["X"], ["Out", "Indices"]], + "logical_and": [["X", "Y"], ["Out"]], + "logical_not": [["X"], ["Out"]], + "meshgrid": [["X"], ["Out"]], + "roi_align": [["X", "ROIs"], ["Out"]], + "strided_slice": [["Input"], ["Out"]], + "where": [["Condition", "X", "Y"], ["Out"]], + "grid_sampler": [["X", "Grid"], ["Output"]], + "tile": [["X"], ["Out"]], + "group_norm": [["X"], ["Y", "Mean", "Variance"]], + "reduce_sum": [["X"], ["Out"]], + "square": [["X"], ["Out"]], + "softplus": [["X"], ["Out"]], + "shuffle_channel": [["X"], ["Out"]], +} + + +def _get_op_input_var_names(op): + """ + Get the input var names of the op. + Args: + op(IrNode, Operator): the input op. + Returns: + input_var_names or None. 
+ """ + assert isinstance(op, (IrNode, Operator)), \ + "The input op should be IrNode or Operator." + var_names = [] + op_name = op.name() if isinstance(op, IrNode) \ + else op.type + if op_name not in _op_real_in_out_name: + return [] + + name_list = _op_real_in_out_name[op_name][0] + for name in name_list: + var_name = op.input(name) + if isinstance(var_name, list): + var_names.extend(var_name) + else: + var_names.append(var_name) + return var_names + + +def _get_op_output_var_names(op): + """ """ + assert isinstance(op, (IrNode, Operator)), \ + "The input op should be IrNode or Operator." + var_names = [] + op_name = op.name() if isinstance(op, IrNode) \ + else op.type + if op_name not in _op_real_in_out_name: + return [] + + name_list = _op_real_in_out_name[op_name][1] + for name in name_list: + var_name = op.output(name) + if isinstance(var_name, list): + var_names.extend(var_name) + else: + var_names.append(var_name) + return var_names + + +def _get_input_name_index(op, input_var_name): + """Get the input name and index of the var_name in the op""" + assert isinstance(op, (IrNode, Operator)), \ + "The input op should be IrNode or Operator." + op_name = op.name() if isinstance(op, IrNode) \ + else op.type + if op_name not in _op_real_in_out_name: + return None + + res = None + for argname in _op_real_in_out_name[op_name][0]: + var_names = op.input(argname) + for index, name in enumerate(var_names): + if name == input_var_name: + res = (argname, index) + return res + + +def _get_output_name_index(op, output_var_name): + """Get the output name and index of the var_name in the op""" + assert isinstance(op, (IrNode, Operator)), \ + "The input op should be IrNode or Operator." + op_name = op.name() if isinstance(op, IrNode) \ + else op.type + if op_name not in _op_real_in_out_name: + return None + + name_list = _op_real_in_out_name[op_name][1] + res = None + for name in name_list: + var_name = op.output(name) + for index, val in enumerate(var_name): + if val == output_var_name: + res = (name, index) + return res + def load_variable_data(scope, var_name): ''' @@ -84,6 +365,46 @@ def dequant_tensor(x, scale, quant_axis=0, weight_bits=8): return x +def bias_correction_w(x, x_quant, scale_v, quant_axis, weight_bits=8): + ''' + Bias correction for weight + ''' + eps = 1e-8 + bnt = (1 << (weight_bits - 1)) - 1 + x_dequant = x_quant.copy() + if isinstance(scale_v, list): + if quant_axis == 0: + for i, s in enumerate(scale_v): + x_dequant[i] = x_dequant[i] * s / bnt + quant_bias = x - x_dequant + mean_bias = quant_bias.reshape(quant_bias.shape[0], -1).mean(-1) + std_orig = x.reshape(x.shape[0], -1).std(-1) + std_quant = x_dequant.reshape(x_dequant.shape[0], -1).std(-1) + std_bias = std_orig / (std_quant + eps) + else: + for i, s in enumerate(scale_v): + x_dequant[:, i] = x_quant[:, i] * s / bnt + quant_bias = x - x_dequant + mean_bias = np.array( + [quant_bias[:, i].mean() for i in range(quant_bias.shape[1])]) + std_orig = np.array([x[:, i].std() for i in range(x.shape[1])]) + std_quant = np.array( + [x_dequant[:, i].std() for i in range(x_dequant.shape[1])]) + std_bias = std_orig / (std_quant + eps) + else: + x_dequant = x_quant * scale_v / bnt + mean_bias = (x - x_dequant).mean() + std_bias = x.std() / (x_dequant.std() + eps) + if mean_bias.ndim == 1: + std_bias = np.resize(std_bias, x.shape) + mean_bias = np.resize(mean_bias, x.shape) + + x_dequant = (mean_bias + x_dequant) * std_bias + quantized_param_v = quant_tensor(x_dequant, scale_v, quant_axis, + weight_bits) + return quantized_param_v + 
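# A minimal standalone numpy sketch of the per-layer branch of bias_correction_w
# above, written out so the correction step is easy to follow. The weight shape,
# the random values and the 8-bit setting are assumptions made only for this
# example; the final re-quantization approximates quant_tensor (clipping omitted).
import numpy as np

weight_bits = 8
bnt = (1 << (weight_bits - 1)) - 1                  # 127 for signed 8-bit

x = np.random.randn(64, 32).astype(np.float32)      # float weights of one layer
scale = float(np.max(np.abs(x)))                    # per-layer abs_max scale

x_quant = np.round(x / scale * bnt)                 # fake-quantized weights
x_dequant = x_quant * scale / bnt                   # dequantized weights

# Quantization shifts the mean and changes the std of the weights; estimate both
# effects and undo them before re-quantizing, mirroring the per-layer branch above.
eps = 1e-8
mean_bias = (x - x_dequant).mean()
std_bias = x.std() / (x_dequant.std() + eps)

x_corrected = (mean_bias + x_dequant) * std_bias    # bias-corrected dequantized weights
x_quant_corrected = np.round(x_corrected / scale * bnt)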
+ def stable_sigmoid(x): sig = np.where(x < 0, np.exp(x) / (1 + np.exp(x)), 1 / (1 + np.exp(-x))) return sig diff --git a/python/paddle/fluid/contrib/slim/tests/save_quant_model.py b/python/paddle/fluid/contrib/slim/tests/save_quant_model.py index f97c2778c0918..73ec8cf3e023d 100644 --- a/python/paddle/fluid/contrib/slim/tests/save_quant_model.py +++ b/python/paddle/fluid/contrib/slim/tests/save_quant_model.py @@ -88,8 +88,8 @@ def transform_and_save_int8_model(original_path, debug=False, quant_model_filename='', quant_params_filename='', - save_model_filename='', - save_params_filename=''): + save_model_filename="__model__", + save_params_filename=None): place = fluid.CPUPlace() exe = fluid.Executor(place) inference_scope = fluid.executor.global_scope() diff --git a/python/paddle/fluid/contrib/slim/tests/test_imperative_qat.py b/python/paddle/fluid/contrib/slim/tests/test_imperative_qat.py index 5db720b028ffe..015ecb3d4a4e9 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_imperative_qat.py +++ b/python/paddle/fluid/contrib/slim/tests/test_imperative_qat.py @@ -53,7 +53,9 @@ class TestImperativeQat(unittest.TestCase): def set_vars(self): self.weight_quantize_type = 'abs_max' self.activation_quantize_type = 'moving_average_abs_max' - print('weight_quantize_type', self.weight_quantize_type) + self.onnx_format = False + self.check_export_model_accuracy = True + self.diff_threshold = 0.01 def func_qat(self): self.set_vars() @@ -159,9 +161,13 @@ def func_qat(self): data = next(test_reader()) test_data = np.array([x[0].reshape(1, 28, 28) for x in data]).astype('float32') + y_data = np.array( + [x[1] for x in data]).astype('int64').reshape(-1, 1) test_img = fluid.dygraph.to_variable(test_data) + label = fluid.dygraph.to_variable(y_data) lenet.eval() - before_save = lenet(test_img) + fp32_out = lenet(test_img) + fp32_acc = fluid.layers.accuracy(fp32_out, label).numpy() with tempfile.TemporaryDirectory(prefix="qat_save_path_") as tmpdir: # save inference quantized model @@ -171,7 +177,8 @@ def func_qat(self): input_spec=[ paddle.static.InputSpec( shape=[None, 1, 28, 28], dtype='float32') - ]) + ], + onnx_format=self.onnx_format) print('Quantized model saved in %s' % tmpdir) if core.is_compiled_with_cuda(): @@ -185,13 +192,15 @@ def func_qat(self): executor=exe, model_filename="lenet" + INFER_MODEL_SUFFIX, params_filename="lenet" + INFER_PARAMS_SUFFIX) - after_save, = exe.run(inference_program, - feed={feed_target_names[0]: test_data}, - fetch_list=fetch_targets) - # check - self.assertTrue( - np.allclose(after_save, before_save.numpy()), - msg='Failed to save the inference quantized model.') + quant_out, = exe.run(inference_program, + feed={feed_target_names[0]: test_data}, + fetch_list=fetch_targets) + paddle.disable_static() + quant_out = fluid.dygraph.to_variable(quant_out) + quant_acc = fluid.layers.accuracy(quant_out, label).numpy() + paddle.enable_static() + delta_value = fp32_acc - quant_acc + self.assertLess(delta_value, self.diff_threshold) def test_qat(self): with _test_eager_guard(): @@ -199,5 +208,13 @@ def test_qat(self): self.func_qat() +class TestImperativeQatONNXFormat(unittest.TestCase): + def set_vars(self): + self.weight_quantize_type = 'abs_max' + self.activation_quantize_type = 'moving_average_abs_max' + self.onnx_format = True + self.diff_threshold = 0.025 + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/contrib/slim/tests/test_imperative_qat_channelwise.py b/python/paddle/fluid/contrib/slim/tests/test_imperative_qat_channelwise.py index 
1a6c9c41638db..ff40b170345a8 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_imperative_qat_channelwise.py +++ b/python/paddle/fluid/contrib/slim/tests/test_imperative_qat_channelwise.py @@ -41,6 +41,17 @@ class TestImperativeQatChannelWise(TestImperativeQat): def set_vars(self): self.weight_quantize_type = 'channel_wise_abs_max' self.activation_quantize_type = 'moving_average_abs_max' + self.diff_threshold = 0.01 + self.onnx_format = False + print('weight_quantize_type', self.weight_quantize_type) + + +class TestImperativeQatChannelWiseONNXFormat(TestImperativeQat): + def set_vars(self): + self.weight_quantize_type = 'channel_wise_abs_max' + self.activation_quantize_type = 'moving_average_abs_max' + self.onnx_format = True + self.diff_threshold = 0.025 print('weight_quantize_type', self.weight_quantize_type) diff --git a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_lstm_model.py b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_lstm_model.py index 58a430eb96406..85cabb6b5e9b7 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_lstm_model.py +++ b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_lstm_model.py @@ -173,7 +173,8 @@ def generate_quantized_model(self, is_use_cache_file=False, is_optimize_model=False, batch_size=10, - batch_nums=10): + batch_nums=10, + onnx_format=False): place = fluid.CPUPlace() exe = fluid.Executor(place) @@ -190,14 +191,28 @@ def generate_quantized_model(self, round_type=round_type, is_full_quantize=is_full_quantize, optimize_model=is_optimize_model, + onnx_format=onnx_format, is_use_cache_file=is_use_cache_file) ptq.quantize() ptq.save_quantized_model(self.int8_model_path) - def run_test(self, model_name, model_url, model_md5, data_name, data_url, - data_md5, algo, round_type, quantizable_op_type, - is_full_quantize, is_use_cache_file, is_optimize_model, - diff_threshold, infer_iterations, quant_iterations): + def run_test(self, + model_name, + model_url, + model_md5, + data_name, + data_url, + data_md5, + algo, + round_type, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + infer_iterations, + quant_iterations, + onnx_format=False): fp32_model_path = self.download_model(model_url, model_md5, model_name) fp32_model_path = os.path.join(fp32_model_path, model_name) @@ -211,10 +226,10 @@ def run_test(self, model_name, model_url, model_md5, data_name, data_url, print("Start post training quantization for {0} on {1} samples ...". 
format(model_name, quant_iterations)) - self.generate_quantized_model(fp32_model_path, data_path, algo, - round_type, quantizable_op_type, - is_full_quantize, is_use_cache_file, - is_optimize_model, quant_iterations) + self.generate_quantized_model( + fp32_model_path, data_path, algo, round_type, quantizable_op_type, + is_full_quantize, is_use_cache_file, is_optimize_model, + quant_iterations, onnx_format) print("Start INT8 inference for {0} on {1} samples ...".format( model_name, infer_iterations)) @@ -278,5 +293,42 @@ def test_post_training_kl(self): diff_threshold, infer_iterations, quant_iterations) +class TestPostTrainingKLForMnistONNXFormat(TestPostTrainingQuantization): + def test_post_training_kl_onnx_format(self): + model_name = "nlp_lstm_fp32_model" + model_url = "https://paddle-inference-dist.cdn.bcebos.com/int8/unittest_model_data/nlp_lstm_fp32_model.tar.gz" + model_md5 = "519b8eeac756e7b4b7bcb2868e880452" + data_name = "quant_lstm_input_data" + data_url = "https://paddle-inference-dist.cdn.bcebos.com/int8/unittest_model_data/quant_lstm_input_data.tar.gz" + data_md5 = "add84c754e9b792fea1fbd728d134ab7" + algo = "KL" + round_type = "round" + quantizable_op_type = ["mul", "lstm"] + is_full_quantize = False + is_use_cache_file = False + is_optimize_model = False + diff_threshold = 0.01 + infer_iterations = 100 + quant_iterations = 10 + onnx_format = True + self.run_test( + model_name, + model_url, + model_md5, + data_name, + data_url, + data_md5, + algo, + round_type, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + infer_iterations, + quant_iterations, + onnx_format=onnx_format) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mnist.py b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mnist.py index 74198da11fb2c..c219d2fbf89a9 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mnist.py +++ b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mnist.py @@ -116,7 +116,8 @@ def generate_quantized_model(self, is_use_cache_file=False, is_optimize_model=False, batch_size=10, - batch_nums=10): + batch_nums=10, + onnx_format=False): place = fluid.CPUPlace() exe = fluid.Executor(place) @@ -134,6 +135,7 @@ def generate_quantized_model(self, round_type=round_type, is_full_quantize=is_full_quantize, optimize_model=is_optimize_model, + onnx_format=onnx_format, is_use_cache_file=is_use_cache_file) ptq.quantize() ptq.save_quantized_model(self.int8_model_path) @@ -151,7 +153,8 @@ def run_test(self, diff_threshold, batch_size=10, infer_iterations=10, - quant_iterations=5): + quant_iterations=5, + onnx_format=False): origin_model_path = self.download_model(data_url, data_md5, model_name) origin_model_path = os.path.join(origin_model_path, model_name) @@ -166,7 +169,7 @@ def run_test(self, self.generate_quantized_model(origin_model_path, algo, round_type, quantizable_op_type, is_full_quantize, is_use_cache_file, is_optimize_model, - batch_size, quant_iterations) + batch_size, quant_iterations, onnx_format) print("Start INT8 inference for {0} on {1} images ...".format( model_name, infer_iterations * batch_size)) @@ -335,5 +338,72 @@ def test_post_training_mse(self): infer_iterations, quant_iterations) +class TestPostTrainingmseForMnistONNXFormat(TestPostTrainingQuantization): + def test_post_training_mse_onnx_format(self): + model_name = "mnist_model" + data_url = 
"http://paddle-inference-dist.bj.bcebos.com/int8/mnist_model.tar.gz" + data_md5 = "be71d3997ec35ac2a65ae8a145e2887c" + algo = "mse" + round_type = "round" + quantizable_op_type = ["conv2d", "depthwise_conv2d", "mul"] + is_full_quantize = False + is_use_cache_file = False + is_optimize_model = True + onnx_format = True + diff_threshold = 0.01 + batch_size = 10 + infer_iterations = 50 + quant_iterations = 5 + self.run_test( + model_name, + data_url, + data_md5, + algo, + round_type, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + batch_size, + infer_iterations, + quant_iterations, + onnx_format=onnx_format) + + +class TestPostTrainingmseForMnistONNXFormatFullQuant( + TestPostTrainingQuantization): + def test_post_training_mse_onnx_format_full_quant(self): + model_name = "mnist_model" + data_url = "http://paddle-inference-dist.bj.bcebos.com/int8/mnist_model.tar.gz" + data_md5 = "be71d3997ec35ac2a65ae8a145e2887c" + algo = "mse" + round_type = "round" + quantizable_op_type = ["conv2d", "depthwise_conv2d", "mul"] + is_full_quantize = True + is_use_cache_file = False + is_optimize_model = False + onnx_format = True + diff_threshold = 0.01 + batch_size = 10 + infer_iterations = 50 + quant_iterations = 5 + self.run_test( + model_name, + data_url, + data_md5, + algo, + round_type, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + batch_size, + infer_iterations, + quant_iterations, + onnx_format=onnx_format) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mobilenetv1.py b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mobilenetv1.py index 312a0c9e4b40e..498a1ec46cacd 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mobilenetv1.py +++ b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_mobilenetv1.py @@ -243,7 +243,8 @@ def generate_quantized_model(self, round_type="round", is_full_quantize=False, is_use_cache_file=False, - is_optimize_model=False): + is_optimize_model=False, + onnx_format=False): try: os.system("mkdir " + self.int8_model) except Exception as e: @@ -265,13 +266,23 @@ def generate_quantized_model(self, round_type=round_type, is_full_quantize=is_full_quantize, optimize_model=is_optimize_model, + onnx_format=onnx_format, is_use_cache_file=is_use_cache_file) ptq.quantize() ptq.save_quantized_model(self.int8_model) - def run_test(self, model, algo, round_type, data_urls, data_md5s, - quantizable_op_type, is_full_quantize, is_use_cache_file, - is_optimize_model, diff_threshold): + def run_test(self, + model, + algo, + round_type, + data_urls, + data_md5s, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + onnx_format=False): infer_iterations = self.infer_iterations batch_size = self.batch_size sample_iterations = self.sample_iterations @@ -285,9 +296,10 @@ def run_test(self, model, algo, round_type, data_urls, data_md5s, print("Start INT8 post training quantization for {0} on {1} images ...". 
format(model, sample_iterations * batch_size)) - self.generate_quantized_model( - model_cache_folder + "/model", quantizable_op_type, algo, - round_type, is_full_quantize, is_use_cache_file, is_optimize_model) + self.generate_quantized_model(model_cache_folder + "/model", + quantizable_op_type, algo, round_type, + is_full_quantize, is_use_cache_file, + is_optimize_model, onnx_format) print("Start INT8 inference for {0} on {1} images ...".format( model, infer_iterations * batch_size)) @@ -517,5 +529,38 @@ def test_post_training_avg_mobilenetv1(self): is_optimize_model, diff_threshold) +class TestPostTrainingAvgONNXFormatForMobilenetv1(TestPostTrainingQuantization): + def test_post_training_onnx_format_mobilenetv1(self): + model = "MobileNet-V1" + algo = "avg" + round_type = "round" + data_urls = [ + 'http://paddle-inference-dist.bj.bcebos.com/int8/mobilenetv1_int8_model.tar.gz' + ] + data_md5s = ['13892b0716d26443a8cdea15b3c6438b'] + quantizable_op_type = [ + "conv2d", + "depthwise_conv2d", + "mul", + ] + is_full_quantize = False + is_use_cache_file = False + is_optimize_model = True + onnx_format = True + diff_threshold = 0.05 + self.run_test( + model, + algo, + round_type, + data_urls, + data_md5s, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + onnx_format=onnx_format) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_resnet50.py b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_resnet50.py index a26dcb51c724a..dc12026a21ab1 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_resnet50.py +++ b/python/paddle/fluid/contrib/slim/tests/test_post_training_quantization_resnet50.py @@ -39,5 +39,34 @@ def test_post_training_resnet50(self): is_optimize_model, diff_threshold) +class TestPostTrainingForResnet50ONNXFormat(TestPostTrainingQuantization): + def test_post_training_resnet50(self): + model = "ResNet-50" + algo = "min_max" + round_type = "round" + data_urls = [ + 'http://paddle-inference-dist.bj.bcebos.com/int8/resnet50_int8_model.tar.gz' + ] + data_md5s = ['4a5194524823d9b76da6e738e1367881'] + quantizable_op_type = ["conv2d", "mul"] + is_full_quantize = False + is_use_cache_file = False + is_optimize_model = False + diff_threshold = 0.025 + onnx_format = True + self.run_test( + model, + algo, + round_type, + data_urls, + data_md5s, + quantizable_op_type, + is_full_quantize, + is_use_cache_file, + is_optimize_model, + diff_threshold, + onnx_format=onnx_format) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py b/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py index ccd23485c3d9a..fe261237f1227 100644 --- a/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py +++ b/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py @@ -21,6 +21,7 @@ import paddle from paddle.fluid.framework import IrGraph from paddle.fluid.contrib.slim.quantization import QuantizationTransformPass +from paddle.fluid.contrib.slim.quantization import QuantizationTransformPassV2 from paddle.fluid.contrib.slim.quantization import QuantizationFreezePass from paddle.fluid.contrib.slim.quantization import ConvertToInt8Pass from paddle.fluid.contrib.slim.quantization import TransformForMobilePass @@ -686,5 +687,129 @@ def test_residual_block_skip_pattern_1(self): for_ci=True) +class 
TestQuantizationTransformPassV2(unittest.TestCase): + def setUp(self): + self.quantizable_op_and_inputs = { + 'conv2d': ['Input', 'Filter'], + 'depthwise_conv2d': ['Input', 'Filter'], + 'mul': ['X', 'Y'] + } + self.quantizable_grad_op_inputs = { + 'conv2d_grad': ['Input', 'Filter'], + 'depthwise_conv2d_grad': ['Input', 'Filter'], + 'mul_grad': ['X', 'Y'] + } + + def check_program(self, program): + quantized_ops = set() + for block in program.blocks: + for op in block.ops: + # check forward + if op.type in self.quantizable_op_and_inputs: + for arg_name in op.input_arg_names: + self.assertTrue( + arg_name.endswith('.quantized.dequantized')) + quantized_ops.add(arg_name) + + for op in block.ops: + # check backward + if op.type in self.quantizable_grad_op_inputs: + for pname in self.quantizable_grad_op_inputs[op.type]: + arg_name = op.input(pname)[0] + self.assertTrue( + arg_name.endswith('.quantized.dequantized')) + self.assertTrue(arg_name in quantized_ops) + + def linear_fc_quant(self, + activation_quant_type, + weight_quantize_type, + for_ci=True): + main = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(main, startup): + loss = linear_fc(3) + opt = fluid.optimizer.Adam(learning_rate=0.001) + opt.minimize(loss) + place = fluid.CPUPlace() + graph = IrGraph(core.Graph(main.desc), for_test=False) + transform_pass = QuantizationTransformPassV2( + scope=fluid.global_scope(), + place=place, + activation_quantize_type=activation_quant_type, + weight_quantize_type=weight_quantize_type) + transform_pass.apply(graph) + if not for_ci: + marked_nodes = set() + for op in graph.all_op_nodes(): + if op.name().find('quantize') > -1: + marked_nodes.add(op) + graph.draw('.', 'quantize_fc_' + activation_quant_type, + marked_nodes) + program = graph.to_program() + self.check_program(program) + val_graph = IrGraph(core.Graph(program.desc), for_test=False) + if not for_ci: + val_marked_nodes = set() + for op in val_graph.all_op_nodes(): + if op.name().find('quantize') > -1: + val_marked_nodes.add(op) + val_graph.draw('.', 'val_fc_' + activation_quant_type, + val_marked_nodes) + + def test_linear_fc_quant_abs_max(self): + self.linear_fc_quant('abs_max', 'abs_max', for_ci=True) + + def test_linear_fc_quant_channel_wise_abs_max(self): + self.linear_fc_quant('abs_max', 'channel_wise_abs_max', for_ci=True) + + def residual_block_quant(self, + activation_quant_type, + weight_quantize_type, + quantizable_op_type, + for_ci=True): + main = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(main, startup): + loss = residual_block(2) + opt = fluid.optimizer.Adam(learning_rate=0.001) + opt.minimize(loss) + place = fluid.CPUPlace() + graph = IrGraph(core.Graph(main.desc), for_test=False) + transform_pass = QuantizationTransformPass( + scope=fluid.global_scope(), + place=place, + activation_quantize_type=activation_quant_type, + weight_quantize_type=weight_quantize_type, + quantizable_op_type=quantizable_op_type) + transform_pass.apply(graph) + if not for_ci: + marked_nodes = set() + for op in graph.all_op_nodes(): + if op.name().find('quantize') > -1: + marked_nodes.add(op) + graph.draw('.', 'quantize_residual_' + activation_quant_type, + marked_nodes) + program = graph.to_program() + self.check_program(program) + val_graph = IrGraph(core.Graph(program.desc), for_test=False) + if not for_ci: + val_marked_nodes = set() + for op in val_graph.all_op_nodes(): + if op.name().find('quantize') > -1: + val_marked_nodes.add(op) + val_graph.draw('.', 'val_residual_' + activation_quant_type, 
+ val_marked_nodes) + + def test_residual_block_abs_max(self): + quantizable_op_type = ['conv2d', 'depthwise_conv2d', 'mul', 'matmul'] + self.residual_block_quant( + 'abs_max', 'abs_max', quantizable_op_type, for_ci=True) + + def test_residual_block_channel_wise_abs_max(self): + quantizable_op_type = ['conv2d', 'depthwise_conv2d', 'mul', 'matmul'] + self.residual_block_quant( + 'abs_max', 'channel_wise_abs_max', quantizable_op_type, for_ci=True) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/dataloader/collate.py b/python/paddle/fluid/dataloader/collate.py index 2086827258128..0bf041007eb38 100644 --- a/python/paddle/fluid/dataloader/collate.py +++ b/python/paddle/fluid/dataloader/collate.py @@ -57,7 +57,7 @@ def default_collate_fn(batch): if isinstance(sample, np.ndarray): batch = np.stack(batch, axis=0) return batch - elif isinstance(sample, paddle.Tensor): + elif isinstance(sample, (paddle.Tensor, core.eager.Tensor)): return layers.stack(batch, axis=0) elif isinstance(sample, numbers.Number): batch = np.array(batch) @@ -99,7 +99,7 @@ def default_convert_fn(batch): Batched data: batched each number, numpy array and paddle.Tensor in input data. """ - if isinstance(batch, (paddle.Tensor, np.ndarray)): + if isinstance(batch, (paddle.Tensor, np.ndarray, core.eager.Tensor)): return batch elif isinstance(batch, (str, bytes)): return batch diff --git a/python/paddle/fluid/dataloader/dataloader_iter.py b/python/paddle/fluid/dataloader/dataloader_iter.py index 0dc733440fada..bbf2a4377c767 100644 --- a/python/paddle/fluid/dataloader/dataloader_iter.py +++ b/python/paddle/fluid/dataloader/dataloader_iter.py @@ -229,7 +229,7 @@ def _thread_loop(self, legacy_expected_place): # pack as LoDTensorArray array = core.LoDTensorArray() for slot in batch: - if isinstance(slot, paddle.Tensor): + if isinstance(slot, (paddle.Tensor, core.eager.Tensor)): slot = slot.value().get_tensor() elif not isinstance(slot, core.LoDTensor): tmp = core.LoDTensor() @@ -543,7 +543,8 @@ def _thread_loop(self, legacy_expected_place): # LoDTensor not in shared memory is not # serializable, cannot be create in workers for slot in batch: - if isinstance(slot, paddle.Tensor): + if isinstance(slot, (paddle.Tensor, + core.eager.Tensor)): slot = slot.value().get_tensor() elif not isinstance(slot, core.LoDTensor): tmp = core.LoDTensor() diff --git a/python/paddle/fluid/dataloader/flat.py b/python/paddle/fluid/dataloader/flat.py index 32c8ef02dd915..5baf4cc853e27 100644 --- a/python/paddle/fluid/dataloader/flat.py +++ b/python/paddle/fluid/dataloader/flat.py @@ -36,7 +36,8 @@ def _flatten_batch(batch): def _flatten(batch, flat_batch, structure, field_idx): if isinstance(batch, Sequence): for field in batch: - if isinstance(field, (np.ndarray, paddle.Tensor)): + if isinstance(field, (np.ndarray, paddle.Tensor, + paddle.fluid.core.eager.Tensor)): structure.append('{}{}'.format(FIELD_PREFIX, field_idx)) flat_batch.append(field) field_idx += 1 @@ -54,7 +55,8 @@ def _flatten(batch, flat_batch, structure, field_idx): structure.append(field) elif isinstance(batch, Mapping): for k, field in batch.items(): - if isinstance(field, (np.ndarray, paddle.Tensor)): + if isinstance(field, (np.ndarray, paddle.Tensor, + paddle.fluid.core.eager.Tensor)): structure[k] = '{}{}'.format(FIELD_PREFIX, field_idx) flat_batch.append(field) field_idx += 1 diff --git a/python/paddle/fluid/dygraph/math_op_patch.py b/python/paddle/fluid/dygraph/math_op_patch.py index 98de5949ba422..8b80444fe9011 100644 --- 
a/python/paddle/fluid/dygraph/math_op_patch.py +++ b/python/paddle/fluid/dygraph/math_op_patch.py @@ -270,7 +270,10 @@ def __impl__(self, other_var): # 4. calculation axis = -1 - math_op = getattr(_C_ops, op_type) + if framework._in_eager_mode_ and op_type == 'elementwise_add': + math_op = getattr(_C_ops, 'final_state_add') + else: + math_op = getattr(_C_ops, op_type) return math_op(self, other_var, 'axis', axis) comment = OpProtoHolder.instance().get_op_proto(op_type).comment diff --git a/python/paddle/fluid/dygraph/nn.py b/python/paddle/fluid/dygraph/nn.py index 531adc9e456b8..89fcbe1a5d18d 100644 --- a/python/paddle/fluid/dygraph/nn.py +++ b/python/paddle/fluid/dygraph/nn.py @@ -21,7 +21,7 @@ from ..layers import nn as F from .. import dygraph_utils from . import layers -from ..framework import Variable, _non_static_mode, OpProtoHolder, Parameter, _dygraph_tracer, _varbase_creator, default_main_program, _global_flags, in_dygraph_mode +from ..framework import Variable, _non_static_mode, OpProtoHolder, Parameter, _dygraph_tracer, _varbase_creator, default_main_program, _global_flags, in_dygraph_mode, _in_legacy_dygraph from ..data_feeder import convert_dtype, check_variable_and_dtype, check_type, check_dtype from ..param_attr import ParamAttr from ..initializer import Normal, Constant, NumpyArrayInitializer @@ -240,6 +240,18 @@ def _get_default_param_initializer(): is_bias=True) def forward(self, input): + if in_dygraph_mode() and self._l_type == "conv2d": + pre_bias = _C_ops.final_state_conv2d( + input, self.weight, self._stride, self._padding, "EXPLICIT", + self._groups if self._groups else 1, self._dilation, "NCHW", + False, -1, False) + if self.bias is not None: + pre_act = F.elementwise_add(pre_bias, self.bias, axis=1) + else: + pre_act = pre_bias + return dygraph_utils._append_activation_in_dygraph( + pre_act, self._act, use_mkldnn=self._use_mkldnn) + if _non_static_mode() and (self._l_type == 'conv2d' or self._l_type == 'depthwise_conv2d'): attrs = ('strides', self._stride, 'paddings', self._padding, @@ -1339,15 +1351,26 @@ def forward(self, input): variance_out = self._variance if _non_static_mode(): - attrs = ("momentum", self._momentum, "epsilon", self._epsilon, - "is_test", not self.training, "data_layout", - self._data_layout, "use_mkldnn", self._use_mkldnn, - "fuse_with_relu", self._fuse_with_relu, "use_global_stats", - self._use_global_stats, 'trainable_statistics', - self._trainable_statistics) - batch_norm_out, _, _, _, _, _ = _C_ops.batch_norm( - input, self.weight, self.bias, self._mean, self._variance, - mean_out, variance_out, *attrs) + if in_dygraph_mode(): + batch_norm_out, t1, t2, t3, t4, _ = _C_ops.final_state_batch_norm( + input, self.weight, self.bias, self._mean, self._variance, + self._momentum, self._epsilon, self._data_layout, + not self.training, self._use_global_stats, + self._trainable_statistics, False) + return dygraph_utils._append_activation_in_dygraph( + batch_norm_out, act=self._act, use_mkldnn=self._use_mkldnn) + + elif _in_legacy_dygraph(): + attrs = ("momentum", self._momentum, "epsilon", self._epsilon, + "is_test", not self.training, "data_layout", + self._data_layout, "use_mkldnn", self._use_mkldnn, + "fuse_with_relu", self._fuse_with_relu, + "use_global_stats", self._use_global_stats, + 'trainable_statistics', self._trainable_statistics) + batch_norm_out, _, _, _, _, _ = _C_ops.batch_norm( + input, self.weight, self.bias, self._mean, self._variance, + None, mean_out, variance_out, *attrs) + return dygraph_utils._append_activation_in_dygraph( 
batch_norm_out, act=self._act, use_mkldnn=self._use_mkldnn) diff --git a/python/paddle/fluid/dygraph/parallel.py b/python/paddle/fluid/dygraph/parallel.py index 64388aadb2f02..e6b891cbe00bf 100644 --- a/python/paddle/fluid/dygraph/parallel.py +++ b/python/paddle/fluid/dygraph/parallel.py @@ -22,6 +22,7 @@ from contextlib import contextmanager import paddle +from paddle import _C_ops from paddle.fluid import core from paddle.fluid import framework from paddle.fluid.dygraph import layers @@ -307,17 +308,28 @@ def _reshape_inplace(x, shape): @framework.dygraph_only def _split_tensors(coalesced_grads_and_grad_vars): - for coalesced_grad, origin_grad_vars, grad_shapes in coalesced_grads_and_grad_vars: - grad_var_len = [np.prod(g_shape) for g_shape in grad_shapes] - framework._dygraph_tracer().trace_op( - type='split', - inputs={'X': coalesced_grad}, - outputs={'Out': origin_grad_vars}, - attrs={'sections': grad_var_len, - 'axis': 0}) - for g_var, g_shape in zip(origin_grad_vars, grad_shapes): - _reshape_inplace(x=g_var, shape=g_shape) - assert g_var.shape == g_shape + if _in_legacy_dygraph(): + for coalesced_grad, origin_grad_vars, grad_shapes in coalesced_grads_and_grad_vars: + grad_var_len = [np.prod(g_shape) for g_shape in grad_shapes] + framework._dygraph_tracer().trace_op( + type='split', + inputs={'X': coalesced_grad}, + outputs={'Out': origin_grad_vars}, + attrs={'sections': grad_var_len, + 'axis': 0}) + for g_var, g_shape in zip(origin_grad_vars, grad_shapes): + _reshape_inplace(x=g_var, shape=g_shape) + assert g_var.shape == g_shape + elif in_dygraph_mode(): + for coalesced_grad, origin_grad_vars, grad_shapes in coalesced_grads_and_grad_vars: + grad_var_len = [np.prod(g_shape) for g_shape in grad_shapes] + attrs = () + attrs += ('sections', grad_var_len) + attrs += ('axis', 0) + _C_ops.split(coalesced_grad, origin_grad_vars, *attrs) + for g_var, g_shape in zip(origin_grad_vars, grad_shapes): + g_var.reshape_(shape=g_shape) + assert g_var.shape == g_shape def scale_loss(loss): @@ -360,9 +372,10 @@ def sync_params_buffers(model, is_model_parallel=False): model_vars = [] for _, param in model._obtain_parameters_buffers().items(): - if not isinstance(param, core.VarBase): - raise TypeError("The data type of '%s' must be Varbase" % - param.name) + if not isinstance(param, (core.VarBase, core.eager.Tensor)): + raise TypeError( + "The data type of '%s' must be Varbase or eager.Tensor" % + param.name) # is_distributed param not need to sync when in mp mode if isinstance(param, ParamBase): @@ -397,16 +410,6 @@ def sync_params_buffers(model, 'axis': 0}) -@imperative_base.no_grad -@framework.dygraph_only -def sync_eager_params(model, comm_group=None, src_rank=0): - for _, param in model._obtain_parameters_buffers().items(): - if not isinstance(param, core.eager.Tensor): - raise TypeError("The data type of '%s' must be '%s'" % - (param.name, core.eager.Tensor)) - comm_group.broadcast(param, src_rank).synchronize() - - class DataParallel(layers.Layer): """ Run the dygraph module with data parallelism. 
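A minimal usage sketch for the reworked group argument of DataParallel below, assuming the script is started through python -m paddle.distributed.launch; the layer and tensor shapes are illustrative and not part of the patch.

    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()
    layer = paddle.nn.Linear(8, 8)
    # With this change, group=None resolves to the default collective group
    # in eager mode instead of raising an error.
    dp_layer = paddle.DataParallel(layer, group=None)
    loss = dp_layer(paddle.randn([4, 8])).mean()
    loss.backward()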
@@ -574,7 +577,7 @@ def __init__(self, comm_buffer_size=25, last_comm_buffer_size=1, find_unused_parameters=False, - process_group=None): + group=None): super(DataParallel, self).__init__(layers.full_name() + "_data_parallel") @@ -584,7 +587,7 @@ def __init__(self, self._layers = layers self.find_unused_parameters = find_unused_parameters self.grad_need_sync = True - self.process_group = process_group + self.group = group self.var_dtype = core.eager.Tensor if in_dygraph_mode( ) else core.VarBase @@ -603,20 +606,18 @@ def __init__(self, "ParallelContext must be initialized before. You should use init_parallel_env() before" \ "constructing the DataParallel." - if self.process_group is None and in_dygraph_mode(): - raise RuntimeError( - "Process group should be built for DataParallel in eager mode." - ) + if in_dygraph_mode(): + self.group = paddle.distributed.collective._get_default_group( + ) if self.group is None else self.group + + assert isinstance(self.group, paddle.distributed.collective.Group), \ + "ProcessGroup must be an instance of Group in DataParallel." # sync buffer and params # TODO(liuyuhui) Currently not support xpu. xpu is # still broadcasting parameters when calling layer if not paddle.is_compiled_with_xpu(): - if in_dygraph_mode(): - sync_eager_params( - self._layers, comm_group=self.process_group) - elif _in_legacy_dygraph(): - sync_params_buffers(self._layers) + sync_params_buffers(self._layers) self.comm_buffer_size = int(comm_buffer_size * 1024 * 1024) # NOTE(shenliang03): We can set environment variables to control @@ -677,7 +678,7 @@ def check_layer_sparse(sublayer): self._reducer = core.EagerReducer( trainable_parameters, list(reversed(self.group_indices)), is_sparse_gradient, - self.process_group, + self.group.process_group, [self.last_comm_buffer_size, self.comm_buffer_size], self.find_unused_parameters) elif _in_legacy_dygraph(): diff --git a/python/paddle/fluid/dygraph/tracer.py b/python/paddle/fluid/dygraph/tracer.py index e1fabf9aeda10..6e1ed6b0a1dec 100644 --- a/python/paddle/fluid/dygraph/tracer.py +++ b/python/paddle/fluid/dygraph/tracer.py @@ -22,6 +22,14 @@ from paddle import _C_ops final_state_name_mapping = { + "graph_send_recv": { + "final_op_name": "final_state_graph_send_recv", + "x": "X", + "src_index": "Src_index", + "dst_index": "Dst_index", + "out": "Out", + "dst_count": "Dst_count" + }, "matmul_v2": { "final_op_name": "final_state_matmul", "transpose_x": "trans_x", @@ -40,12 +48,12 @@ "x": "X", "out": "Out", }, - "pool2d": { - "final_op_name": "final_state_pool2d", - "x": "X", - "kernel_size": "ksize", - "out": "Out", - }, + # "pool2d": { + # "final_op_name": "final_state_pool2d", + # "x": "X", + # "kernel_size": "ksize", + # "out": "Out", + # }, "abs": { "final_op_name": "final_state_abs", "x": "X", @@ -64,12 +72,12 @@ "axis2": "axis2", "out": "Out", }, - "one_hot": { - "final_op_name": "final_state_one_hot", - "x": "X", - "num_class": "depth", - "out": "Out", - } + # "one_hot": { + # "final_op_name": "final_state_one_hot", + # "x": "X", + # "num_class": "depth", + # "out": "Out", + # } } @@ -110,6 +118,9 @@ def eager_trace_op(self, arg_list = [] for i in range(len(op_args)): + # initialized with None + arg_to_append = None + arg_name = op_args[i] arg_type = op_args_type[i] if arg_name in inputs.keys(): @@ -117,14 +128,20 @@ def eager_trace_op(self, elif arg_name in outputs.keys(): arg_to_append = outputs[arg_name] else: - if "Num" in arg_name: + if "Num" in arg_name[-3:]: # Remove "Num" suffix to get out_name out_name = arg_name[:-3] assert 
out_name in outputs.keys() num_outs = len(outputs[out_name]) arg_to_append = num_outs - else: - arg_to_append = None + # NOTE(dev): For MasterParam/MasterParamOut in optimzer op + elif "Var" in arg_name[-3:]: + out_name = arg_name[:-3] + print(out_name) + if out_name in outputs.keys(): + arg_to_append = outputs[out_name] + elif out_name in inputs.keys(): + arg_to_append = inputs[out_name] if arg_to_append is None: arg_list.append(arg_to_append) diff --git a/python/paddle/fluid/dygraph/varbase_patch_methods.py b/python/paddle/fluid/dygraph/varbase_patch_methods.py index 60c144d550028..72aee0ba87e58 100644 --- a/python/paddle/fluid/dygraph/varbase_patch_methods.py +++ b/python/paddle/fluid/dygraph/varbase_patch_methods.py @@ -20,7 +20,7 @@ import paddle from .. import framework -from ..framework import convert_np_dtype_to_dtype_ +from ..framework import convert_np_dtype_to_dtype_, _in_legacy_dygraph from .. import core from .. import unique_name from ..framework import Variable, Parameter, ParamBase, _getitem_impl_, _setitem_impl_, EagerParamBase @@ -99,7 +99,7 @@ def _to_static_var(self, to_parameter=False, **kwargs): # Note: getattr(self, attr, None) will call x.grad=x.gradient(), but gradient() only available in dygraph. # It will fail. So, for propery that different between dynamic and static graph, should not getattr(self, attr, None). - attr_not_need_keys = ['grad', 'T'] + attr_not_need_keys = ['grad', 'T', 'place', '_place_str'] if isinstance(self, (ParamBase, EagerParamBase)): attr_kwargs = self.__dict__.copy() else: @@ -798,7 +798,11 @@ def _set_grad_ivar(self, value): @framework.dygraph_only def clone(self): - return _C_ops.assign(self) + if _in_legacy_dygraph(): + output = core.VarBase() + else: + output = core.eager.Tensor() + return _C_ops.assign(self, output) @framework.dygraph_only def value(self): @@ -812,6 +816,25 @@ def _slice(self, begin_idx, end_idx): def _numel(self): return self.get_tensor()._numel() + @framework.dygraph_only + def _uva(self, device_id=0): + ''' + Returns self tensor with the UVA(unified virtual addressing). + + Args: + device_id(int, optional): The destination GPU device id. Default: None, means current device. + + Examples: + .. 
code-block:: python + + # required: gpu + import paddle + x = paddle.to_tensor([1, 2, 3], place=paddle.CPUPlace()) + x._uva() + print(x) + ''' + self._tensor_uva(device_id) + @framework.dygraph_only def cpu(self): if self.place.is_cpu_place(): @@ -823,7 +846,11 @@ def cpu(self): return res @framework.dygraph_only - def cuda(self, device_id, blocking): + def cuda(self, device_id=0, blocking=True): + if device_id is None: + device_id = 0 + if not isinstance(device_id, int): + raise ValueError("\'device_id\' must be a positive integer") if self.place.is_gpu_place(): return self else: @@ -832,6 +859,48 @@ def cuda(self, device_id, blocking): res.persistable = self.persistable return res + @framework.dygraph_only + def pin_memory(self): + if self.place.is_cuda_pinned_place(): + return self + else: + res = self._copy_to(core.CUDAPinnedPlace(), True) + res.stop_gradient = self.stop_gradient + res.persistable = self.persistable + return res + + @framework.dygraph_only + def values(self): + if self.is_sparse_coo(): + return _C_ops.final_state_sparse_coo_values(self) + elif self.is_sparse_csr(): + return _C_ops.final_state_sparse_csr_values(self) + else: + raise ValueError( + "only SparseCooTensor and SparseCsrTensor have method values") + + @framework.dygraph_only + def to_dense(self): + if self.is_sparse_coo(): + return _C_ops.final_state_sparse_coo_to_dense(self) + elif self.is_sparse_csr(): + return _C_ops.final_state_sparse_to_dense(self) + else: + return self + + @framework.dygraph_only + def to_sparse_coo(self, sparse_dim): + if self.is_sparse_csr(): + return _C_ops.final_state_sparse_to_sparse_coo(self, sparse_dim) + elif self.is_sparse_coo(): + return self + elif self.is_selected_rows(): + raise ValueError( + "SelectedRows does not support to_sparse_coo method") + else: + #is dense tensor + return _C_ops.final_state_sparse_dense_to_coo(self, sparse_dim) + if framework._in_eager_mode_ and not hasattr(core, "eager"): return @@ -844,7 +913,8 @@ def cuda(self, device_id, blocking): ("__repr__", __str__), ("__deepcopy__", __deepcopy__), ("__module__", "paddle"), ("__array__", __array__), ("__getitem__", __getitem__), ("item", item), - ("__setitem__", __setitem__), ("_to", _to)): + ("__setitem__", __setitem__), ("_to", _to), ("values", values), + ("to_dense", to_dense), ("to_sparse_coo", to_sparse_coo)): if framework._in_eager_mode_: setattr(core.eager.Tensor, method_name, method) else: @@ -857,8 +927,10 @@ def cuda(self, device_id, blocking): setattr(core.eager.Tensor, "value", value) setattr(core.eager.Tensor, "cpu", cpu) setattr(core.eager.Tensor, "cuda", cuda) + setattr(core.eager.Tensor, "pin_memory", pin_memory) setattr(core.eager.Tensor, "_slice", _slice) setattr(core.eager.Tensor, "_numel", _numel) + setattr(core.eager.Tensor, "_uva", _uva) else: setattr(core.VarBase, "__name__", "Tensor") setattr(core.VarBase, "grad", grad) diff --git a/python/paddle/fluid/executor.py b/python/paddle/fluid/executor.py index a7971763f53e1..2232c34e63bd0 100644 --- a/python/paddle/fluid/executor.py +++ b/python/paddle/fluid/executor.py @@ -394,9 +394,11 @@ def _is_enable_standalone_executor(): Whether to use experimental executor `StandaloneExecutor`. 
""" flag = False - env_val = os.environ.get('FLAGS_USE_STANDALONE_EXECUTOR', None) + + env_val = os.environ.get('FLAGS_USE_STANDALONE_EXECUTOR', '1') if env_val in [1, '1', True, 'True', 'true']: flag = True + return flag @@ -1274,7 +1276,7 @@ def run(self, """ try: - return self._run_impl( + res = self._run_impl( program=program, feed=feed, fetch_list=fetch_list, @@ -1285,6 +1287,8 @@ def run(self, use_program_cache=use_program_cache, use_prune=use_prune, return_merged=return_merged) + core.update_autotune_status() + return res except Exception as e: six.reraise(*sys.exc_info()) @@ -1375,6 +1379,10 @@ def _run_impl(self, program, feed, fetch_list, feed_var_name, program = pruned_program def _can_use_interpreter_core(program, place): + if core.is_compiled_with_npu() or core.is_compiled_with_xpu( + ) or core.is_compiled_with_mlu() or core.is_compiled_with_ipu(): + return False + compiled = isinstance(program, compiler.CompiledProgram) # NOTE(zhiqiu): do not support compiled program now if compiled: @@ -1385,6 +1393,8 @@ def _can_use_interpreter_core(program, place): # else: # return False else: + if isinstance(program._graph, compiler.CompiledProgram): + return False assert isinstance(program, Program) return True diff --git a/python/paddle/fluid/framework.py b/python/paddle/fluid/framework.py index 6d32632f2b445..a329610eeae83 100644 --- a/python/paddle/fluid/framework.py +++ b/python/paddle/fluid/framework.py @@ -106,14 +106,59 @@ # to make sure in most case, we find new dygraph mode first with only one if statement. +def _update_monkey_methods(is_eager): + """ + Update monkey methods of VarBase or eager.Tensor while + switching eager mode and legacy mode. + """ + from paddle import _C_ops + from .dygraph.varbase_patch_methods import monkey_patch_varbase + from .dygraph import monkey_patch_math_varbase + + global _already_patch_eager_tensor + global _already_patch_varbase + + assert isinstance(is_eager, bool) + # switch into eager mode + if is_eager: + _C_ops.switch_to_eager_ops() + if not _already_patch_eager_tensor: + monkey_patch_varbase() + monkey_patch_math_varbase() + + _already_patch_eager_tensor = True + # switch back into legacy mode + else: + _C_ops.switch_to_core_ops() + if not _already_patch_varbase: + monkey_patch_varbase() + monkey_patch_math_varbase() + + _already_patch_varbase = True + + # switch Paddle.Tensor bind type + _switch_tensor_bind_type(is_eager) + + +def _switch_tensor_bind_type(is_eager): + import paddle + if is_eager: + paddle.Tensor = core.eager.Tensor + else: + paddle.Tensor = core.VarBase + paddle.Tensor.__qualname__ = 'Tensor' + + def _enable_legacy_dygraph(): global _in_eager_mode_ _in_eager_mode_ = False + _update_monkey_methods(is_eager=False) def _disable_legacy_dygraph(): global _in_eager_mode_ _in_eager_mode_ = True + _update_monkey_methods(is_eager=True) def _in_eager_without_dygraph_check(): @@ -162,25 +207,10 @@ def _non_static_mode(): @signature_safe_contextmanager def _test_eager_guard(place=None): _disable_legacy_dygraph() - from paddle import _C_ops - _C_ops.switch_to_eager_ops() - global _already_patch_eager_tensor - global _already_patch_varbase - from .dygraph.varbase_patch_methods import monkey_patch_varbase - from .dygraph import monkey_patch_math_varbase - if not _already_patch_eager_tensor: - monkey_patch_varbase() - monkey_patch_math_varbase() - _already_patch_eager_tensor = True try: yield finally: _enable_legacy_dygraph() - if not _already_patch_varbase: - monkey_patch_varbase() - monkey_patch_math_varbase() - _already_patch_varbase 
= True - _C_ops.switch_to_core_ops() global_ipu_index = None diff --git a/python/paddle/fluid/initializer.py b/python/paddle/fluid/initializer.py index a416d139a9111..b3baedc401504 100644 --- a/python/paddle/fluid/initializer.py +++ b/python/paddle/fluid/initializer.py @@ -17,7 +17,7 @@ import math from . import framework from . import core -from .framework import _non_static_mode, default_main_program +from .framework import _non_static_mode, in_dygraph_mode, _in_legacy_dygraph, default_main_program, _current_expected_place import numpy as np from .core import VarDesc from . import unique_name @@ -417,7 +417,18 @@ def __call__(self, var, block=None): out_dtype = var.dtype out_var = var - if framework._non_static_mode(): + if in_dygraph_mode(): + out_var = _C_ops.final_state_truncated_gaussian_random( + var.shape, self._mean, self._std_dev, self._seed, out_dtype, + _current_expected_place()) + if var.dtype in [VarDesc.VarType.FP16, VarDesc.VarType.BF16]: + var_tmp = _C_ops.final_state_cast(out_var, var.dtype) + var_tmp._share_underline_tensor_to(var) + else: + out_var._share_underline_tensor_to(var) + return None + + if _in_legacy_dygraph(): out_var = _C_ops.truncated_gaussian_random( 'shape', var.shape, 'dtype', out_dtype, 'mean', self._mean, 'std', self._std_dev, 'seed', self._seed) @@ -550,12 +561,12 @@ def __call__(self, var, block=None): if framework._non_static_mode(): if self._uniform: - limit = np.sqrt(6.0 / float(fan_in + fan_out)) + limit = math.sqrt(6.0 / float(fan_in + fan_out)) out_var = _C_ops.uniform_random('shape', out_var.shape, 'min', -limit, 'max', limit, 'seed', self._seed, 'dtype', out_dtype) else: - std = np.sqrt(2.0 / float(fan_in + fan_out)) + std = math.sqrt(2.0 / float(fan_in + fan_out)) out_var = _C_ops.gaussian_random( 'shape', out_var.shape, 'dtype', out_dtype, 'mean', 0.0, 'std', std, 'seed', self._seed) @@ -570,7 +581,7 @@ def __call__(self, var, block=None): return None else: if self._uniform: - limit = np.sqrt(6.0 / float(fan_in + fan_out)) + limit = math.sqrt(6.0 / float(fan_in + fan_out)) op = block.append_op( type="uniform_random", inputs={}, @@ -584,7 +595,7 @@ def __call__(self, var, block=None): }, stop_gradient=True) else: - std = np.sqrt(2.0 / float(fan_in + fan_out)) + std = math.sqrt(2.0 / float(fan_in + fan_out)) op = block.append_op( type="gaussian_random", outputs={"Out": out_var}, @@ -702,13 +713,13 @@ def __call__(self, var, block=None): if framework._non_static_mode(): if self._uniform: - limit = np.sqrt(6.0 / float(fan_in)) + limit = math.sqrt(6.0 / float(fan_in)) out_var = _C_ops.uniform_random('shape', out_var.shape, 'min', -limit, 'max', limit, 'seed', self._seed, 'dtype', int(out_dtype)) else: - std = np.sqrt(2.0 / float(fan_in)) + std = math.sqrt(2.0 / float(fan_in)) out_var = _C_ops.gaussian_random( 'shape', out_var.shape, 'dtype', int(out_dtype), 'mean', 0.0, 'std', std, 'seed', self._seed) @@ -723,7 +734,7 @@ def __call__(self, var, block=None): return None else: if self._uniform: - limit = np.sqrt(6.0 / float(fan_in)) + limit = math.sqrt(6.0 / float(fan_in)) op = block.append_op( type="uniform_random", inputs={}, @@ -738,7 +749,7 @@ def __call__(self, var, block=None): stop_gradient=True) else: - std = np.sqrt(2.0 / float(fan_in)) + std = math.sqrt(2.0 / float(fan_in)) op = block.append_op( type="gaussian_random", outputs={"Out": out_var}, @@ -876,9 +887,9 @@ def __call__(self, var, block=None): raise ValueError("The size of input is too big. 
") if framework._non_static_mode(): - out_var = _C_ops.assign_value('shape', - list(shape), 'dtype', out_dtype, - value_name, values) + _C_ops.assign_value(out_var, 'shape', + list(shape), 'dtype', out_dtype, value_name, + values) if var.dtype in [ VarDesc.VarType.FP16, VarDesc.VarType.BF16, VarDesc.VarType.FP64 @@ -985,9 +996,9 @@ def __call__(self, var, block=None): "saving it to file and 'load_op' to load it") if framework._non_static_mode(): - out_var = _C_ops.assign_value('shape', - list(self._value.shape), 'dtype', - out_dtype, value_name, values) + _C_ops.assign_value(out_var, 'shape', + list(self._value.shape), 'dtype', out_dtype, + value_name, values) if var.dtype in [VarDesc.VarType.FP16, VarDesc.VarType.BF16]: var_tmp = _C_ops.cast(out_var, 'in_dtype', out_var.dtype, 'out_dtype', var.dtype) diff --git a/python/paddle/fluid/layers/control_flow.py b/python/paddle/fluid/layers/control_flow.py index 138e968a0b385..184453a6fcb2b 100755 --- a/python/paddle/fluid/layers/control_flow.py +++ b/python/paddle/fluid/layers/control_flow.py @@ -18,7 +18,7 @@ from .layer_function_generator import autodoc, templatedoc from .tensor import assign, cast, fill_constant from .. import core -from ..framework import Program, Variable, Operator, _non_static_mode, static_only +from ..framework import Program, Variable, Operator, _non_static_mode, static_only, _in_legacy_dygraph, in_dygraph_mode from ..layer_helper import LayerHelper, unique_name from .nn import logical_and, logical_not, logical_or from .utils import assert_same_structure, map_structure, hold_mutable_vars, copy_mutable_vars @@ -974,6 +974,19 @@ def get_inputs_outputs_in_block(current_block, inner_inputs, inner_outputs, :return: inner_inputs, inner_outputs """ + def is_ignore_vars(op, var_name): + # NOTE(dev): There are some persistable var created in some non-standard API + # such as "contrib.layers.shuffle_batch". It create a "Seed" used both in + # Input and Output. This var shall not be considered as a loop_var in + # control_flow. + IGNORE_VAR_NAMES = {"shuffle_batch": ["shuffle_batch_seed"]} + if op.type in IGNORE_VAR_NAMES: + var_names = IGNORE_VAR_NAMES[op.type] + for name in var_names: + if name in var_name: + return True + return False + # Step1: update inner_inputs and inner_outputs # NOTE: Here assumes that all variables are input or output of Ops, # but some variables are created without appendding a real op. 
@@ -982,7 +995,8 @@ def get_inputs_outputs_in_block(current_block, inner_inputs, inner_outputs, assert isinstance(op, Operator) for iname in op.input_names: for in_var_name in op.input(iname): - if in_var_name not in inner_outputs: + if in_var_name not in inner_outputs and not is_ignore_vars( + op, in_var_name): inner_inputs.add(in_var_name) for oname in op.output_names: @@ -3853,7 +3867,9 @@ def is_empty(x, name=None): # - data: [0]) """ - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_is_empty(x) + if _in_legacy_dygraph(): return _C_ops.is_empty(x) check_variable_and_dtype(x, 'x', ['float32', 'float64', 'int32', 'int64'], diff --git a/python/paddle/fluid/layers/layer_function_generator.py b/python/paddle/fluid/layers/layer_function_generator.py index a99838cb27d4c..ec99f7c64f36f 100755 --- a/python/paddle/fluid/layers/layer_function_generator.py +++ b/python/paddle/fluid/layers/layer_function_generator.py @@ -23,6 +23,7 @@ from ..framework import OpProtoHolder, Variable, core, convert_np_dtype_to_dtype_, _non_static_mode, in_dygraph_mode, _in_legacy_dygraph from ..layer_helper import LayerHelper from ..data_feeder import check_variable_and_dtype +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph from paddle import _C_ops __all__ = [ diff --git a/python/paddle/fluid/layers/loss.py b/python/paddle/fluid/layers/loss.py index 01706741f0e15..b78865a0ece4e 100644 --- a/python/paddle/fluid/layers/loss.py +++ b/python/paddle/fluid/layers/loss.py @@ -21,7 +21,7 @@ from . import nn from .layer_function_generator import templatedoc from ..layer_helper import LayerHelper -from ..framework import Variable, _non_static_mode, static_only +from ..framework import Variable, _non_static_mode, static_only, _in_legacy_dygraph, in_dygraph_mode from .. 
import core from ..data_feeder import check_variable_and_dtype, check_type from ..param_attr import ParamAttr @@ -1101,6 +1101,25 @@ def sampled_softmax_with_cross_entropy(logits, out = fluid.layers.sampled_softmax_with_cross_entropy( logits=fc, label=label, num_samples=25) """ + if _non_static_mode(): + sample_logits_attrs = ('use_customized_samples', use_customized_samples, + 'uniq', True, 'remove_accidental_hits', + remove_accidental_hits, 'num_samples', + num_samples, 'seed', seed) + _, _, _, _, sampled_logits_out, sampled_label_out = _C_ops.sample_logits( + logits, label, *sample_logits_attrs) + depth = num_samples + 1 + sampled_softlabel_out = _C_ops.one_hot(sampled_label_out, 'depth', + depth) + + softmax_with_cross_entropy_attrs = ('soft_label', True, + 'numeric_stable_mode', False) + + _, loss = _C_ops.softmax_with_cross_entropy( + sampled_logits_out, sampled_softlabel_out, + *softmax_with_cross_entropy_attrs) + return loss / num_true + helper = LayerHelper('sample_logits', **locals()) samples = customized_samples if use_customized_samples else helper.create_variable_for_type_inference( dtype='int64') @@ -1267,10 +1286,15 @@ def softmax_with_cross_entropy(logits, ignore_index, 'numeric_stable_mode', numeric_stable_mode, 'axis', axis) else: - softmax, loss = _C_ops.softmax_with_cross_entropy( - logits, label, 'soft_label', soft_label, 'ignore_index', - ignore_index, 'numeric_stable_mode', numeric_stable_mode, - 'axis', axis) + if in_dygraph_mode(): + softmax, loss = _C_ops.final_state_cross_entropy_with_softmax( + logits, label, soft_label, True, numeric_stable_mode, + ignore_index, axis) + if _in_legacy_dygraph(): + softmax, loss = _C_ops.softmax_with_cross_entropy( + logits, label, 'soft_label', soft_label, 'ignore_index', + ignore_index, 'numeric_stable_mode', numeric_stable_mode, + 'axis', axis) if not return_softmax: return loss else: @@ -1458,6 +1482,10 @@ def sigmoid_cross_entropy_with_logits(x, ignore_index=-1, normalize=True) print(loss) """ + + if in_dygraph_mode(): + return _C_ops.final_state_sigmoid_cross_entropy_with_logits( + x, label, normalize, int(ignore_index)) check_variable_and_dtype(x, 'input', ['float16', 'float32', 'float64'], 'sigmoid_cross_entropy_with_logits') @@ -1582,6 +1610,10 @@ def huber_loss(input, label, delta): HuberLoss, = exe.run(feed={'input':input_data ,'label':label_data}, fetch_list=[loss.name]) print(HuberLoss) #[[1.5], [0.5], [0.5], [0. ]], dtype=float32 """ + if in_dygraph_mode(): + out, residual = _C_ops.final_state_huber_loss(input, label, delta) + return out + helper = LayerHelper('huber_loss', **locals()) check_variable_and_dtype(input, 'input', ['float32', 'float64'], 'huber_loss') diff --git a/python/paddle/fluid/layers/metric_op.py b/python/paddle/fluid/layers/metric_op.py index 5f91596f25d4d..7616e49c48ffc 100644 --- a/python/paddle/fluid/layers/metric_op.py +++ b/python/paddle/fluid/layers/metric_op.py @@ -20,7 +20,7 @@ import warnings from ..layer_helper import LayerHelper from ..initializer import Normal, Constant -from ..framework import Variable, _non_static_mode, _varbase_creator +from ..framework import Variable, _non_static_mode, _varbase_creator, _in_legacy_dygraph, in_dygraph_mode from .. import core from ..param_attr import ParamAttr from . 
import nn diff --git a/python/paddle/fluid/layers/nn.py b/python/paddle/fluid/layers/nn.py index d1ef9d6d8b4ea..311a6278a89f8 100755 --- a/python/paddle/fluid/layers/nn.py +++ b/python/paddle/fluid/layers/nn.py @@ -25,6 +25,7 @@ import paddle from ..layer_helper import LayerHelper +from paddle.fluid.framework import _in_legacy_dygraph from ..initializer import Normal, Constant, NumpyArrayInitializer from ..framework import Variable, OpProtoHolder, _non_static_mode, dygraph_only, _dygraph_tracer, default_main_program, _varbase_creator, static_only, _global_flags, _in_legacy_dygraph, in_dygraph_mode from .. import dygraph_utils @@ -204,7 +205,6 @@ def _elementwise_op_in_dygraph(x, op_name=None): op = getattr(_C_ops, op_name) out = op(x, y, 'axis', axis, 'use_mkldnn', use_mkldnn) - return dygraph_utils._append_activation_in_dygraph( out, act, use_mkldnn=use_mkldnn) @@ -2947,6 +2947,38 @@ def batch_norm(input, mean_out = mean # variance and variance_out share the same memory variance_out = variance + + if in_dygraph_mode(): + inputs_has_MomemtumTensor = False + attrs_has_momentum = False + tmp_tensor_type = core.eager.Tensor + if isinstance(momentum, tmp_tensor_type): + inputs_has_MomemtumTensor = True + else: + attrs_has_momentum = True + + attrs_ = () + if attrs_has_momentum: + attrs_ = ('momentum', momentum, 'epsilon', epsilon, 'is_test', + is_test, 'data_layout', data_layout, 'use_mkldnn', False, + 'fuse_with_relu', False, 'use_global_stats', + use_global_stats) + else: + attrs_ = ('epsilon', epsilon, 'is_test', is_test, 'data_layout', + data_layout, 'use_mkldnn', False, 'fuse_with_relu', False, + 'use_global_stats', use_global_stats) + if inputs_has_MomemtumTensor: + batch_norm_out, _, _, _, _, _ = _C_ops.batch_norm( + input, scale, bias, mean, variance, momentum, mean_out, + variance_out, *attrs_) + else: + batch_norm_out, _, _, _, _, _ = _C_ops.batch_norm( + input, scale, bias, mean, variance, None, mean_out, + variance_out, *attrs_) + + return dygraph_utils._append_activation_in_dygraph( + batch_norm_out, act=act, use_mkldnn=False) + saved_mean = helper.create_variable_for_type_inference( dtype=dtype, stop_gradient=True) saved_variance = helper.create_variable_for_type_inference( @@ -2964,7 +2996,9 @@ def batch_norm(input, "Scale": scale, "Bias": bias, "Mean": mean, - "Variance": variance + "Variance": variance, + "MeanOut": mean_out, + "VarianceOut": variance_out } attrs = { "epsilon": epsilon, @@ -3142,13 +3176,46 @@ def inplace_abn(input, mean_out = mean # variance and variance out share the same memory variance_out = variance + # batch_norm_out and input share the same memory + batch_norm_out = input + + if in_dygraph_mode(): + inputs_has_MomemtumTensor = False + attrs_has_momentum = False + tmp_tensor_type = core.eager.Tensor + if isinstance(momentum, tmp_tensor_type): + inputs_has_MomemtumTensor = True + else: + attrs_has_momentum = True + + attrs__ = () + if attrs_has_momentum: + attrs__ = ('momentum', momentum, 'epsilon', epsilon, 'is_test', + is_test, 'data_layout', data_layout, 'use_mkldnn', False, + 'fuse_with_relu', False, 'use_global_stats', + use_global_stats, 'activation', act, 'alpha', act_alpha) + else: + attrs__ = ('epsilon', epsilon, 'is_test', is_test, 'data_layout', + data_layout, 'use_mkldnn', False, 'fuse_with_relu', + False, 'use_global_stats', use_global_stats, + 'activation', act, 'alpha', act_alpha) + if inputs_has_MomemtumTensor: + batch_norm_out, _, _, _, _, _ = _C_ops.inplace_abn_( + input, scale, bias, mean, variance, momentum, mean_out, + variance_out, 
*attrs__) + return batch_norm_out + else: + batch_norm_out, _, _, _, _, _ = _C_ops.inplace_abn_( + input, scale, bias, mean, variance, None, mean_out, + variance_out, *attrs__) + return batch_norm_out + saved_mean = helper.create_variable_for_type_inference( dtype=dtype, stop_gradient=True) saved_variance = helper.create_variable_for_type_inference( dtype=dtype, stop_gradient=True) reserve_space = helper.create_variable_for_type_inference( dtype=dtype, stop_gradient=True) - batch_norm_out = input inputs = { "X": input, @@ -4790,7 +4857,7 @@ def reduce_prod(input, dim=None, keep_dim=False, name=None): fluid.layers.reduce_prod(y, dim=[1, 2]) # [24.0, 1680.0] fluid.layers.reduce_prod(y, dim=[0, 1]) # [105.0, 384.0] """ - helper = LayerHelper('reduce_prod', **locals()) + if dim is not None and not isinstance(dim, list): if isinstance(dim, tuple): dim = list(dim) @@ -4800,6 +4867,12 @@ def reduce_prod(input, dim=None, keep_dim=False, name=None): raise TypeError( "The type of axis must be int, list or tuple, but received {}". format(type(dim))) + if in_dygraph_mode(): + return _C_ops.final_state_reduce_prod( + input, dim if dim != None and dim != [] else [0], keep_dim, True if + dim == None or dim == [] or len(dim) == len(input.shape) else False) + + helper = LayerHelper('reduce_prod', **locals()) check_variable_and_dtype( input, 'input', ['float32', 'float64', 'int32', 'int64'], 'reduce_prod') out = helper.create_variable_for_type_inference(dtype=helper.input_dtype()) @@ -5019,7 +5092,9 @@ def split(input, num_or_sections, dim=-1, name=None): raise TypeError( "The type of 'num_or_sections' in split must be int, list or tuple in imperative mode, but " "received %s." % (type(num_or_sections))) - return _C_ops.split(input, num, *attrs) + out = [_varbase_creator() for n in range(num)] + _C_ops.split(input, out, *attrs) + return out check_variable_and_dtype( input, 'input', @@ -5134,7 +5209,6 @@ def l2_normalize(x, axis, epsilon=1e-12, name=None): # [-0.33972208 -0.43014923 0.31772556 0.76617881 -0.10761525]] """ - if len(x.shape) == 1: axis = 0 if _non_static_mode(): @@ -6427,7 +6501,9 @@ def squeeze(input, axes, name=None): y = layers.squeeze(input=x, axes=[2]) # y.shape=[None, 5, 10] """ - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_squeeze(input, axes)[1] + if _in_legacy_dygraph(): out, _ = _C_ops.squeeze2(input, 'axes', axes) return out @@ -6488,8 +6564,10 @@ def unsqueeze(input, axes, name=None): item.numpy().item(0) if isinstance(item, Variable) else item for item in axes ] - out, _ = _C_ops.unsqueeze2(input, 'axes', axes) - return out + if _in_legacy_dygraph(): + out, _ = _C_ops.unsqueeze2(input, 'axes', axes) + return out + return _C_ops.final_state_unsqueeze(input, axes)[1] check_type(axes, 'axis/axes', (int, list, tuple, Variable), 'unsqueeze') check_variable_and_dtype(input, 'input', [ @@ -7017,6 +7095,10 @@ def label_smooth(label, smooth_label = layers.label_smooth( label=one_hot_label, epsilon=0.1, dtype="float32") """ + if in_dygraph_mode(): + return _C_ops.final_state_label_smooth(label, prior_dist, + float(epsilon)) + if epsilon > 1. 
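# Editor's note (illustrative sketch): the reduce_prod eager branch above folds
# dim=None / dim=[] / "all axes listed" into a reduce-over-everything call.  A numpy
# sketch of that normalization; numpy is only used for illustration, the real call
# goes to _C_ops.final_state_reduce_prod:

import numpy as np

def reduce_prod_like(x, dim=None, keep_dim=False):
    x = np.asarray(x)
    reduce_all = dim is None or dim == [] or len(dim) == len(x.shape)
    axes = None if reduce_all else tuple(dim)
    return np.prod(x, axis=axes, keepdims=keep_dim)

y = np.array([[1., 2., 3.], [4., 5., 6.]])
assert reduce_prod_like(y) == 720.0                      # dim=None -> reduce all
assert list(reduce_prod_like(y, dim=[1])) == [6., 120.]  # per-row product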
or epsilon < 0.: raise ValueError("The value of epsilon must be between 0 and 1.") @@ -8761,8 +8843,7 @@ def scatter_nd_add(ref, index, updates, name=None): """ if in_dygraph_mode(): - op = getattr(_C_ops, 'scatter_nd_add') - return op(ref, index, updates) + return _C_ops.final_state_scatter_nd_add(ref, index, updates) else: if _in_legacy_dygraph(): op = getattr(_C_ops, 'scatter_nd_add') @@ -8910,7 +8991,9 @@ def log(x, name=None): res = paddle.log(x) # [[0.693147, 1.09861, 1.38629], [1.94591, 2.07944, 2.19722]] """ - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_log(x) + if _in_legacy_dygraph(): return _C_ops.log(x) check_variable_and_dtype(x, 'x', ['float32', 'float64'], "log") @@ -10297,7 +10380,10 @@ def stack(x, axis=0, name=None): """ axis = 0 if axis is None else axis - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_stack(x, axis) + + if _in_legacy_dygraph(): return _C_ops.stack(x, 'axis', axis) if not isinstance(x, list) and not isinstance(x, tuple): @@ -11186,18 +11272,15 @@ def slice(input, axes, starts, ends): infer_flags = list(1 for i in range(len(axes))) tmp_tensor_type = core.eager.Tensor - if isinstance(starts, (list, tuple)): starts = [ item.numpy().item(0) if isinstance(item, tmp_tensor_type) else item for item in starts ] - attrs += ('starts', starts) elif isinstance(starts, tmp_tensor_type): - starts_tensor = starts - starts.stop_gradient = True - infer_flags = list(-1 for i in range(len(axes))) + tensor_t = starts.numpy() + starts = [ele for ele in tensor_t] if isinstance(ends, (list, tuple)): ends = [ @@ -11206,12 +11289,11 @@ def slice(input, axes, starts, ends): ] attrs += ('ends', ends) elif isinstance(ends, tmp_tensor_type): - ends_tensor = ends - ends_tensor.stop_gradient = True - infer_flags = list(-1 for i in range(len(axes))) + tensor_t = ends.numpy() + ends = [ele for ele in tensor_t] - return _C_ops.slice(input, starts_tensor, ends_tensor, None, None, - 'axes', axes, 'infer_flags', infer_flags, *attrs) + return _C_ops.final_state_slice(input, axes, starts, ends, infer_flags, + []) else: if _in_legacy_dygraph(): attrs = () @@ -11426,6 +11508,10 @@ def strided_slice(input, axes, starts, ends, strides): sliced_2 = fluid.layers.strided_slice(input, axes=axes, starts=[minus_3, 0, 2], ends=ends, strides=strides_2) # sliced_2 is input[:, 0:3:1, 0:2:1, 2:4:2]. 
""" + if in_dygraph_mode(): + return _C_ops.final_state_strided_slice(input, axes, starts, ends, + strides) + helper = LayerHelper('strided_slice', **locals()) check_variable_and_dtype(input, 'input', @@ -11590,7 +11676,11 @@ def shape(input): res = exe.run(fluid.default_main_program(), feed={'x':img}, fetch_list=[output]) print(res) # [array([ 3, 100, 100], dtype=int32)] """ - if _non_static_mode(): + if in_dygraph_mode(): + out = _C_ops.final_state_shape(input) + out.stop_gradient = True + return out + if _in_legacy_dygraph(): out = _C_ops.shape(input) out.stop_gradient = True return out @@ -11666,8 +11756,12 @@ def size(input): rank = layers.size(input) # 300 """ - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_size(input) + + if _in_legacy_dygraph(): return _C_ops.size(input) + check_variable_and_dtype( input, 'input', ['bool', 'float16', 'float32', 'float64', 'int32', 'int64'], "size") @@ -11754,6 +11848,9 @@ def scale(x, scale=1.0, bias=0.0, bias_after_scale=True, act=None, name=None): """ + if in_dygraph_mode(): + out = _C_ops.final_state_scale(x, scale, float(bias), bias_after_scale) + return dygraph_utils._append_activation_in_dygraph(out) if _non_static_mode(): _scale = scale.numpy().item(0) if isinstance(scale, Variable) else scale out = _C_ops.scale(x, 'scale', @@ -12529,6 +12626,9 @@ def logical_and(x, y, out=None, name=None): res = paddle.logical_and(x, y) print(res) # [True False True False] """ + if in_dygraph_mode(): + return _C_ops.final_state_logical_and(x, y) + return _logical_op( op_name="logical_and", x=x, y=y, name=name, out=out, binary_op=True) @@ -12568,6 +12668,8 @@ def logical_or(x, y, out=None, name=None): res = paddle.logical_or(x, y) print(res) # [[ True True] [ True False]] """ + if in_dygraph_mode(): + return _C_ops.final_state_logical_or(x, y) return _logical_op( op_name="logical_or", x=x, y=y, name=name, out=out, binary_op=True) @@ -12607,6 +12709,9 @@ def logical_xor(x, y, out=None, name=None): res = paddle.logical_xor(x, y) print(res) # [[False, True], [ True, False]] """ + if in_dygraph_mode(): + return _C_ops.final_state_logical_xor(x, y) + return _logical_op( op_name="logical_xor", x=x, y=y, name=name, out=out, binary_op=True) @@ -12639,7 +12744,8 @@ def logical_not(x, out=None, name=None): res = paddle.logical_not(x) print(res) # [False True False True] """ - + if in_dygraph_mode(): + return _C_ops.final_state_logical_not(x) return _logical_op( op_name="logical_not", x=x, y=None, name=name, out=out, binary_op=False) @@ -12771,8 +12877,10 @@ def mean(x, name=None): mean = fluid.layers.mean(input) """ - if _non_static_mode(): + if _in_legacy_dygraph(): return _C_ops.mean(x) + if in_dygraph_mode(): + return _C_ops.final_state_mean_all(x) helper = LayerHelper("mean", **locals()) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'mean') @@ -13415,6 +13523,9 @@ def log_loss(input, label, epsilon=1e-4, name=None): prob = paddle.randn((10,1)) cost = F.log_loss(input=prob, label=label) """ + if in_dygraph_mode(): + return _C_ops.final_state_log_loss(input, label, epsilon) + helper = LayerHelper('log_loss', **locals()) check_variable_and_dtype(input, 'input', ['float32'], 'log_loss') check_variable_and_dtype(label, 'label', ['float32'], 'log_loss') @@ -14430,7 +14541,10 @@ def where(condition): out = layers.where(condition) # [[]] """ - if _non_static_mode(): + + if in_dygraph_mode(): + return _C_ops.final_state_where_index(condition) + if _in_legacy_dygraph(): return _C_ops.where_index(condition) helper = 
LayerHelper("where_index", **locals()) @@ -14923,6 +15037,10 @@ def unfold(x, kernel_sizes, strides=1, paddings=0, dilations=1, name=None): "Unexpected type of paddings, it should be either an integer or a list" "of 2 or 4 integers") + if in_dygraph_mode(): + return _C_ops.final_state_unfold(x, kernel_sizes, strides, paddings, + dilations) + out = helper.create_variable_for_type_inference(dtype=x.dtype) helper.append_op( type="unfold", @@ -15150,6 +15268,10 @@ def shard_index(input, index_num, nshards, shard_id, ignore_value=-1): print(shard_label) # [[-1], [1]] """ + if in_dygraph_mode(): + return _C_ops.final_state_shard_index(input, index_num, nshards, + shard_id, ignore_value) + check_variable_and_dtype(input, 'input', ['int64', 'int32'], 'shard_index') op_type = 'shard_index' helper = LayerHelper(op_type, **locals()) @@ -15301,7 +15423,9 @@ def mish(x, threshold=20, name=None): out, = exe.run(feed={'x':x_data}, fetch_list=[y.name]) print(out) # [[0.66666667, 1.66666667, 3., 4.]] """ - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_mish(x, threshold) + if _in_legacy_dygraph(): return _C_ops.mish(x, 'threshold', threshold) check_variable_and_dtype(x, 'x', ['float32', 'float64'], 'mish') diff --git a/python/paddle/fluid/layers/ops.py b/python/paddle/fluid/layers/ops.py index 05c0b03a025c6..d8cd7f6abf6df 100755 --- a/python/paddle/fluid/layers/ops.py +++ b/python/paddle/fluid/layers/ops.py @@ -16,9 +16,10 @@ import os from .layer_function_generator import generate_layer_fn, generate_activation_fn, generate_inplace_fn, add_sample_code from .. import core -from ..framework import convert_np_dtype_to_dtype_, Variable +from ..framework import convert_np_dtype_to_dtype_, Variable, in_dygraph_mode from ..data_feeder import convert_dtype, check_variable_and_dtype, check_type, check_dtype from paddle.utils import deprecated +from paddle import _C_ops __deprecated_func_name__ = { 'tanh_shrink': 'tanhshrink', @@ -794,6 +795,9 @@ def gelu(x, approximate=False): def erf(x, name=None): + if in_dygraph_mode(): + return _C_ops.final_state_erf(x) + locals_var = locals().copy() kwargs = dict() for name, val in locals_var.items(): diff --git a/python/paddle/fluid/layers/sequence_lod.py b/python/paddle/fluid/layers/sequence_lod.py index 1aa3e357c4fd7..80dc990af4556 100644 --- a/python/paddle/fluid/layers/sequence_lod.py +++ b/python/paddle/fluid/layers/sequence_lod.py @@ -15,10 +15,11 @@ from __future__ import print_function from .layer_function_generator import templatedoc -from ..framework import Variable, _non_static_mode +from ..framework import core, Variable, _non_static_mode, in_dygraph_mode, _in_legacy_dygraph, convert_np_dtype_to_dtype_ from ..layer_helper import LayerHelper from ..data_feeder import check_variable_and_dtype, check_type, check_dtype from ..core import VarDesc +from paddle import _C_ops __all__ = [ 'sequence_conv', @@ -1380,6 +1381,20 @@ def sequence_mask(x, maxlen=None, dtype='int64', name=None): # [1 1 1 1 1 1 1 1 0 0]] """ + + if in_dygraph_mode(): + if not isinstance(dtype, core.VarDesc.VarType): + dtype = convert_np_dtype_to_dtype_(dtype) + if maxlen is not None: + if isinstance(maxlen, core.eager.Tensor): + attrs = ('out_dtype', dtype) + out = _C_ops.sequence_mask(x, maxlen, *attrs) + else: + attrs = ('out_dtype', dtype, 'maxlen', maxlen) + out = _C_ops.sequence_mask(x, None, *attrs) + out.stop_gradient = True + return out + helper = LayerHelper('sequence_mask', **locals()) out = helper.create_variable_for_type_inference(dtype=dtype) diff --git 
a/python/paddle/fluid/layers/tensor.py b/python/paddle/fluid/layers/tensor.py index f9f65ffb57f90..e302371988739 100644 --- a/python/paddle/fluid/layers/tensor.py +++ b/python/paddle/fluid/layers/tensor.py @@ -21,7 +21,7 @@ from ..layer_helper import LayerHelper from ..param_attr import ParamAttr from ..initializer import Initializer -from ..framework import convert_np_dtype_to_dtype_, _non_static_mode, _varbase_creator, device_guard, _in_legacy_dygraph +from ..framework import _current_expected_place, convert_np_dtype_to_dtype_, _non_static_mode, _varbase_creator, device_guard, _in_legacy_dygraph, in_dygraph_mode from ..framework import Variable from ..initializer import Constant from ..core import VarDesc @@ -243,6 +243,11 @@ def cast(x, dtype): x = paddle.to_tensor([2, 3, 4], 'float64') y = paddle.cast(x, 'uint8') """ + if in_dygraph_mode(): + if not isinstance(dtype, core.VarDesc.VarType): + dtype = convert_np_dtype_to_dtype_(dtype) + return _C_ops.final_state_cast(x, dtype) + if _non_static_mode(): if not isinstance(dtype, core.VarDesc.VarType): dtype = convert_np_dtype_to_dtype_(dtype) @@ -318,13 +323,23 @@ def concat(input, axis=0, name=None): # [14 15 16]] """ - if _non_static_mode(): + if in_dygraph_mode(): if isinstance(axis, Variable): axis = axis.numpy() axis = axis.item(0) if not isinstance(input, Variable): input = [t for t in input if t.shape.count(0) == 0] - return _C_ops.concat(input, 'axis', axis) + return _C_ops.final_state_concat(input, axis) + + if _in_legacy_dygraph(): + if isinstance(axis, Variable): + axis = axis.numpy() + axis = axis.item(0) + if not isinstance(input, Variable): + input = [t for t in input if t.shape.count(0) == 0] + out = _varbase_creator() + _C_ops.concat(input, out, 'axis', axis) + return out check_type(input, 'input', (list, tuple, Variable), 'concat') if not isinstance(input, Variable): @@ -606,15 +621,24 @@ def assign(input, output=None): # isinstance(VarBase, Variable) == False. It will cause return None # after this api. if isinstance(input, (Variable, core.VarBase)): - check_dtype(input.dtype, 'input', [ - 'float16', 'uint16', 'float32', 'float64', 'int32', 'int64', - 'uint8', 'bool' - ], 'assign', '(When the type of input in assign is Variable.)') - if output is None: - output = helper.create_variable_for_type_inference( - dtype=input.dtype) - helper.append_op( - type='assign', inputs={'X': [input]}, outputs={'Out': [output]}) + if _non_static_mode(): + if output is None: + if _in_legacy_dygraph(): + output = core.VarBase() + else: + output = core.eager.Tensor() + _C_ops.assign(input, output) + else: + check_dtype(input.dtype, 'input', [ + 'float16', 'uint16', 'float32', 'float64', 'int32', 'int64', + 'uint8', 'bool' + ], 'assign', '(When the type of input in assign is Variable.)') + if output is None: + output = helper.create_variable_for_type_inference( + dtype=input.dtype) + helper.append_op( + type='assign', inputs={'X': [input]}, + outputs={'Out': [output]}) elif isinstance(input, numpy.ndarray): # Not support [var, var, ...] currently. 
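# Editor's note (illustrative sketch): both concat branches above drop inputs that
# have a 0 in their shape before calling the kernel.  The same filtering in numpy
# form (the real code operates on paddle Tensors, not ndarrays):

import numpy as np

def concat_like(tensors, axis=0):
    tensors = [t for t in tensors if 0 not in t.shape]   # skip empty inputs
    return np.concatenate(tensors, axis=axis)

a = np.ones((2, 3))
b = np.zeros((0, 3))                                     # contributes nothing
assert concat_like([a, b, a]).shape == (4, 3)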
if len(input.shape) > 0 and any(isinstance(x, Variable) for x in input): @@ -663,9 +687,7 @@ def assign(input, output=None): }) if is_inplace and _non_static_mode(): - # TODO(jiabin): Remove this when we support inplace - if _in_legacy_dygraph(): - output._bump_inplace_version() + output._bump_inplace_version() return output @@ -824,6 +846,18 @@ def fill_constant_batch_size_like(input, input=like, shape=[1], value=0, dtype='int64') #like=[[10, 10]] data=[0] """ + if in_dygraph_mode(): + if not isinstance(dtype, core.VarDesc.VarType): + dtype = convert_np_dtype_to_dtype_(dtype) + + place = _current_expected_place() + if force_cpu: + place = core.CPUPlace() + out = _C_ops.final_state_full_batch_size_like( + input, shape, dtype, value, input_dim_idx, output_dim_idx, place) + out.stop_gradient = True + return out + helper = LayerHelper("fill_constant_batch_size_like", **locals()) out = helper.create_variable_for_type_inference(dtype=dtype) attrs = { @@ -1439,7 +1473,11 @@ def range(start, end, step, dtype, name=None): elif step.dtype != dtype: step = cast(step, dtype) - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_arange(start, end, step, dtype, + _current_expected_place()) + + if _in_legacy_dygraph(): out = _C_ops.range(start, end, step) out.stop_gradient = True return out @@ -1510,10 +1548,12 @@ def linspace(start, stop, num, dtype=None, name=None): if not isinstance(num, Variable): with device_guard("cpu"): tensor_num = fill_constant([1], 'int32', num) - if _non_static_mode(): + if _in_legacy_dygraph(): return _C_ops.linspace(tensor_start, tensor_stop, tensor_num, 'dtype', dtype) - + if in_dygraph_mode(): + return _C_ops.final_state_linspace(tensor_start, tensor_stop, + tensor_num, dtype) helper = LayerHelper("linspace", **locals()) start_dtype = convert_dtype(tensor_start.dtype) @@ -1698,10 +1738,12 @@ def eye(num_rows, else: num_columns = num_rows - if _non_static_mode(): + if in_dygraph_mode(): + out = _C_ops.final_state_eye(num_rows, num_columns, dtype, + _current_expected_place()) + elif _in_legacy_dygraph(): out = _C_ops.eye('dtype', dtype, 'num_rows', num_rows, 'num_columns', num_columns) - else: helper = LayerHelper("eye", **locals()) check_dtype(dtype, 'dtype', diff --git a/python/paddle/fluid/optimizer.py b/python/paddle/fluid/optimizer.py index 7bf4608de89c9..8242d8e3392ec 100755 --- a/python/paddle/fluid/optimizer.py +++ b/python/paddle/fluid/optimizer.py @@ -7098,13 +7098,8 @@ def _get_gm_cond_var(self, main_block): persistable=True, force_cpu=True) - cond_var = layers.create_global_var( - name="gradient_merge_cond", - shape=[1], - value=bool(0), - dtype='bool', - persistable=False, - force_cpu=True) + cond_var = main_block.create_var( + name="gradient_merge_cond", shape=[1], dtype='bool') with device_guard("cpu"): # step_var = (step_var + 1) % k_step diff --git a/python/paddle/fluid/tests/unittests/CMakeLists.txt b/python/paddle/fluid/tests/unittests/CMakeLists.txt index c816a8c4c231f..2e4259d2085c5 100755 --- a/python/paddle/fluid/tests/unittests/CMakeLists.txt +++ b/python/paddle/fluid/tests/unittests/CMakeLists.txt @@ -39,9 +39,7 @@ if (WITH_GPU OR WITH_XPU OR WITH_ASCEND OR WITH_ASCEND_CL) endif() list(APPEND DIST_TEST_OPS test_parallel_dygraph_unused_variables) list(APPEND DIST_TEST_OPS test_parallel_dygraph_control_flow) -list(APPEND DIST_TEST_OPS test_parallel_dygraph_control_flow_in_eager_mode) list(APPEND DIST_TEST_OPS test_parallel_dygraph_no_sync) -list(APPEND DIST_TEST_OPS test_parallel_dygraph_no_sync_in_eager_mode) list(APPEND 
DIST_TEST_OPS test_parallel_dygraph_no_sync_gradient_check) list(APPEND DIST_TEST_OPS test_parallel_dygraph_dataparallel) list(APPEND DIST_TEST_OPS test_parallel_dygraph_pipeline_parallel) @@ -60,6 +58,7 @@ list(APPEND DIST_TEST_OPS test_auto_parallel_data_unshard) list(APPEND DIST_TEST_OPS test_auto_parallel_save_load) list(APPEND DIST_TEST_OPS test_auto_parallel_autoconvert) list(APPEND DIST_TEST_OPS test_collective_process_group) +list(APPEND DIST_TEST_OPS test_eager_dist_api) set(MIXED_DIST_TEST_OPS ${DIST_TEST_OPS}) #remove distribute unittests. list(APPEND MIXED_DIST_TEST_OPS test_dgc_op) @@ -233,6 +232,7 @@ endif() if(WIN32) LIST(REMOVE_ITEM TEST_OPS test_complex_matmul) + LIST(REMOVE_ITEM TEST_OPS test_ops_nms) endif() LIST(REMOVE_ITEM TEST_OPS test_fleet_checkpoint) @@ -277,9 +277,7 @@ if ((NOT WITH_GPU) AND (NOT WITH_ROCM)) LIST(REMOVE_ITEM TEST_OPS test_parallel_dygraph_transformer) LIST(REMOVE_ITEM TEST_OPS test_parallel_dygraph_sync_batch_norm) list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_control_flow) - list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_control_flow_in_eager_mode) list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_no_sync) - list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_no_sync_in_eager_mode) list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_no_sync_gradient_check) list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_dataparallel) list(REMOVE_ITEM TEST_OPS test_parallel_dygraph_pipeline_parallel) @@ -311,6 +309,7 @@ if ((NOT WITH_GPU) AND (NOT WITH_ROCM)) LIST(REMOVE_ITEM TEST_OPS test_auto_parallel_save_load) LIST(REMOVE_ITEM TEST_OPS test_auto_parallel_autoconvert) LIST(REMOVE_ITEM TEST_OPS test_collective_process_group) + LIST(REMOVE_ITEM TEST_OPS test_eager_dist_api) elseif(WITH_GPU) if (${CUDNN_VERSION} VERSION_LESS 7100) LIST(REMOVE_ITEM TEST_OPS test_conv2d_fusion_op) @@ -597,6 +596,13 @@ foreach(TEST_OP ${TEST_OPS_WITH_GC}) py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS ${GC_ENVS}) endforeach() +# Switch some dy2st UT to eager mode +set(TEST_EAGER_OPS test_jit_save_load test_translated_layer) +foreach(TEST_OP ${TEST_EAGER_OPS}) + list(REMOVE_ITEM TEST_OPS ${TEST_OP}) + py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS FLAGS_enable_eager_mode=1) +endforeach() + if ((NOT WITH_GPU) AND (NOT WITH_XPU) AND NOT (WITH_ASCEND OR WITH_ASCEND_CL)) list(REMOVE_ITEM TEST_OPS "test_fleet_graph_execution_meta_optimizer") list(REMOVE_ITEM TEST_OPS "test_gen_nccl_id_op") @@ -851,7 +857,7 @@ endif() # dist xpu tests: if (WITH_XPU_BKCL) - py_test(test_collective_reduce_api_xpu SRCS "test_collective_reduce_api.py") + #py_test(test_collective_reduce_api_xpu SRCS "test_collective_reduce_api.py") py_test(test_collective_allreduce_api_xpu SRCS "test_collective_allreduce_api.py") endif() @@ -949,7 +955,7 @@ endif() if (WITH_DISTRIBUTE AND NOT APPLE) if(WITH_GPU OR WITH_ROCM) set_tests_properties(test_c_comm_init_op PROPERTIES TIMEOUT 120) - set_tests_properties(test_dist_mnist_gradient_merge PROPERTIES TIMEOUT 160) + set_tests_properties(test_dist_mnist_gradient_merge PROPERTIES TIMEOUT 360) endif() endif() @@ -967,6 +973,7 @@ set_tests_properties(test_nearest_interp_op PROPERTIES TIMEOUT 120) set_tests_properties(test_profiler PROPERTIES TIMEOUT 120) set_tests_properties(test_inplace_softmax_with_cross_entropy PROPERTIES TIMEOUT 120) set_tests_properties(test_cross_entropy2_op PROPERTIES TIMEOUT 120) +set_tests_properties(test_cross_entropy_loss PROPERTIES TIMEOUT 180) set_tests_properties(test_fetch_unmerged PROPERTIES TIMEOUT 120) set_tests_properties(test_gru_unit_op 
PROPERTIES TIMEOUT 120) set_tests_properties(test_activation_nn_grad PROPERTIES TIMEOUT 200) @@ -1030,7 +1037,7 @@ set_tests_properties(test_parallel_executor_seresnext_with_reduce_gpu PROPERTIES set_tests_properties(test_dropout_op PROPERTIES TIMEOUT 120) set_tests_properties(test_argsort_op PROPERTIES TIMEOUT 120) set_tests_properties(test_gather_nd_op PROPERTIES TIMEOUT 120) -set_tests_properties(test_nn_grad PROPERTIES TIMEOUT 120) +set_tests_properties(test_nn_grad PROPERTIES TIMEOUT 120 ENVIRONMENT FLAGS_USE_STANDALONE_EXECUTOR=0) set_tests_properties(test_elementwise_sub_op PROPERTIES TIMEOUT 120) set_tests_properties(test_row_conv_op PROPERTIES TIMEOUT 120) set_tests_properties(test_parallel_executor_seresnext_with_fuse_all_reduce_gpu PROPERTIES TIMEOUT 120) @@ -1069,13 +1076,13 @@ set_tests_properties(test_space_to_depth_op PROPERTIES TIMEOUT 200) set_tests_properties(test_dyn_rnn PROPERTIES TIMEOUT 120) set_tests_properties(test_sgd_op PROPERTIES TIMEOUT 250) set_tests_properties(test_parallel_executor_seresnext_base_gpu PROPERTIES TIMEOUT 120) -set_tests_properties(test_norm_nn_grad PROPERTIES TIMEOUT 120) +set_tests_properties(test_norm_nn_grad PROPERTIES TIMEOUT 120 ENVIRONMENT FLAGS_USE_STANDALONE_EXECUTOR=0) set_tests_properties(test_matrix_nms_op PROPERTIES TIMEOUT 120) set_tests_properties(test_generator_dataloader PROPERTIES TIMEOUT 120) set_tests_properties(test_partial_concat_op PROPERTIES TIMEOUT 120) set_tests_properties(test_fuse_optimizer_pass PROPERTIES TIMEOUT 120) -set_tests_properties(test_softmax_with_cross_entropy_op PROPERTIES TIMEOUT 120) -set_tests_properties(test_reduce_op PROPERTIES TIMEOUT 120) +set_tests_properties(test_softmax_with_cross_entropy_op PROPERTIES TIMEOUT 220) +set_tests_properties(test_reduce_op PROPERTIES TIMEOUT 500) set_tests_properties(test_adam_optimizer_fp32_fp64 PROPERTIES TIMEOUT 120) set_tests_properties(test_elementwise_nn_grad PROPERTIES TIMEOUT 120) set_tests_properties(test_buffer_shared_memory_reuse_pass_and_fuse_optimization_op_pass PROPERTIES TIMEOUT 120) @@ -1124,12 +1131,11 @@ set_tests_properties(test_cumprod_op PROPERTIES TIMEOUT 120) set_tests_properties(test_split_program PROPERTIES TIMEOUT 120) if(WITH_DISTRIBUTE AND WITH_GPU AND WITH_NCCL) set_tests_properties(test_parallel_dygraph_dataparallel PROPERTIES TIMEOUT 120) - set_tests_properties(test_parallel_dygraph_mnist PROPERTIES TIMEOUT 120) - set_tests_properties(test_parallel_dygraph_unused_variables PROPERTIES TIMEOUT 150) - set_tests_properties(test_parallel_dygraph_control_flow PROPERTIES TIMEOUT 200) - set_tests_properties(test_parallel_dygraph_control_flow_in_eager_mode PROPERTIES TIMEOUT 150) - set_tests_properties(test_parallel_dygraph_no_sync PROPERTIES TIMEOUT 150) - set_tests_properties(test_parallel_dygraph_no_sync_in_eager_mode PROPERTIES TIMEOUT 150) + set_tests_properties(test_parallel_dygraph_mnist PROPERTIES TIMEOUT 200) + set_tests_properties(test_parallel_dygraph_se_resnext PROPERTIES TIMEOUT 200) + set_tests_properties(test_parallel_dygraph_unused_variables PROPERTIES TIMEOUT 350) + set_tests_properties(test_parallel_dygraph_control_flow PROPERTIES TIMEOUT 350) + set_tests_properties(test_parallel_dygraph_no_sync PROPERTIES TIMEOUT 300) set_tests_properties(test_parallel_dygraph_no_sync_gradient_check PROPERTIES TIMEOUT 30) set_tests_properties(test_parallel_dygraph_pipeline_parallel PROPERTIES TIMEOUT 200) set_tests_properties(test_parallel_dygraph_tensor_parallel PROPERTIES TIMEOUT 200) @@ -1147,10 +1153,12 @@ if(WITH_DISTRIBUTE AND WITH_GPU 
AND WITH_NCCL) set_tests_properties(test_auto_parallel_save_load PROPERTIES TIMEOUT 120) set_tests_properties(test_auto_parallel_autoconvert PROPERTIES TIMEOUT 120) set_tests_properties(test_collective_process_group PROPERTIES TIMEOUT 120) + set_tests_properties(test_eager_dist_api PROPERTIES TIMEOUT 300) if(${NCCL_VERSION} VERSION_GREATER_EQUAL 2212) - set_tests_properties(test_parallel_dygraph_sparse_embedding PROPERTIES TIMEOUT 120) - set_tests_properties(test_parallel_dygraph_transformer PROPERTIES TIMEOUT 120) + set_tests_properties(test_parallel_dygraph_sparse_embedding PROPERTIES TIMEOUT 200) + set_tests_properties(test_parallel_dygraph_transformer PROPERTIES TIMEOUT 200) + set_tests_properties(test_parallel_dygraph_sparse_embedding_over_height PROPERTIES TIMEOUT 150) endif() endif() diff --git a/python/paddle/fluid/tests/unittests/auto_parallel/test_tunable_variable.py b/python/paddle/fluid/tests/unittests/auto_parallel/test_tunable_variable.py index c36fca7a9d09a..ade228f6c494b 100644 --- a/python/paddle/fluid/tests/unittests/auto_parallel/test_tunable_variable.py +++ b/python/paddle/fluid/tests/unittests/auto_parallel/test_tunable_variable.py @@ -76,7 +76,7 @@ def test_float_range(self): "float_range", start=0.4, stop=4.4, default=2.0) float_range = tv.FloatRange.from_state(float_range.get_state()) self.assertEqual(float_range.default, 2.0) - self.assertGreater(float_range.random(), 0.4) + self.assertGreaterEqual(float_range.random(), 0.4) self.assertLess(float_range.random(1234), 4.4) self.assertNotAlmostEqual(float_range.random(), 1) self.assertNotAlmostEqual(float_range.random(), 4.4) @@ -90,7 +90,7 @@ def test_float_range(self): endpoint=True) float_range = tv.FloatRange.from_state(float_range.get_state()) self.assertEqual(float_range.default, 3.0) - self.assertGreater(float_range.random(), 0.4) + self.assertGreaterEqual(float_range.random(), 0.4) self.assertLessEqual(float_range.random(1234), 8.4) self.assertNotAlmostEqual(float_range.random(), 2) diff --git a/python/paddle/fluid/tests/unittests/autograd/CMakeLists.txt b/python/paddle/fluid/tests/unittests/autograd/CMakeLists.txt index 6d9625483ea82..1f69abac01ac6 100644 --- a/python/paddle/fluid/tests/unittests/autograd/CMakeLists.txt +++ b/python/paddle/fluid/tests/unittests/autograd/CMakeLists.txt @@ -6,6 +6,5 @@ foreach(TEST_OP ${TEST_OPS}) py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS ${GC_ENVS}) endforeach(TEST_OP) -set_tests_properties(test_jacobian PROPERTIES TIMEOUT 50) -set_tests_properties(test_hessian PROPERTIES TIMEOUT 50) -set_tests_properties(test_vhp PROPERTIES TIMEOUT 50) +set_tests_properties(test_autograd_functional_dynamic PROPERTIES TIMEOUT 100) +set_tests_properties(test_autograd_functional_static PROPERTIES TIMEOUT 100) diff --git a/python/paddle/fluid/tests/unittests/autograd/config.py b/python/paddle/fluid/tests/unittests/autograd/config.py new file mode 100644 index 0000000000000..311ca49d39555 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/autograd/config.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import paddle + +DEVICES = [paddle.CPUPlace()] +if paddle.is_compiled_with_cuda(): + DEVICES.append(paddle.CUDAPlace(0)) + +DEFAULT_DTYPE = 'float64' + +# The numerical tolerance of different dtype of different order different +# derivative. It's a empirical value provided by Paddle Science team. +TOLERANCE = { + "float32": { + "first_order_grad": { + "rtol": 1e-3, + "atol": 1e-3, + "eps": 1e-4 + }, + "second_order_grad": { + "rtol": 1e-2, + "atol": 1e-2, + "eps": 1e-2 + } + }, + "float64": { + "first_order_grad": { + "rtol": 1e-7, + "atol": 1e-7, + "eps": 1e-7 + }, + "second_order_grad": { + "rtol": 1e-5, + "atol": 1e-5, + "eps": 1e-5 + } + } +} diff --git a/python/paddle/fluid/tests/unittests/autograd/test_autograd_functional_dynamic.py b/python/paddle/fluid/tests/unittests/autograd/test_autograd_functional_dynamic.py new file mode 100644 index 0000000000000..8c725fe24e59c --- /dev/null +++ b/python/paddle/fluid/tests/unittests/autograd/test_autograd_functional_dynamic.py @@ -0,0 +1,1326 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import collections +import typing +import unittest + +import numpy as np +import paddle +import paddle.compat as cpt +import paddle.nn.functional as F +from paddle.autograd.functional import _as_tensors +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph, _in_eager_without_dygraph_check + +import config +import utils +from utils import (_compute_numerical_batch_hessian, _compute_numerical_hessian, + _compute_numerical_vhp, _compute_numerical_jacobian, + _compute_numerical_batch_jacobian) +from utils import matmul, mul, nested, o2, pow, reduce, reduce_dim, unuse + + +def make_v(f, inputs): + outputs = _as_tensors(f(*inputs)) + return [paddle.ones_like(x) for x in outputs] + + +class TestAutogradFunctional(unittest.TestCase): + @classmethod + def setUpClass(cls): + cls.RAW_INPUTS = { + 'a': [1.0], + 'b': [1.0, 2.0], + 'c': [3.0, 4.0], + 'd': [[2.0], [3.0]], + 'A': [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]], + 'B': [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]], + } + + def setUp(self): + pass + + def gen_input(self, inp, stop_gradient=False): + if isinstance(inp, paddle.Tensor): + return inp + return paddle.to_tensor( + self.RAW_INPUTS[inp], stop_gradient=stop_gradient) + + def gen_inputs(self, inputs): + if isinstance(inputs, list): + inputs = [self.gen_input(x) for x in inputs] + else: + inputs = [self.gen_input(inputs)] + return inputs + + def gen_test_pairs(self, + func, + inputs, + v=None, + create_graph=False, + allow_unused=False): + def vjp_test(): + nonlocal v + xs = self.gen_inputs(inputs) + if v is not None: + v = self.gen_inputs(v) + outputs, inputs_grad = paddle.autograd.vjp(func, xs, v) + else: + outputs, inputs_grad = paddle.autograd.vjp(func, xs) + return outputs, inputs_grad + + def grad_test(): + nonlocal v + xs = self.gen_inputs(inputs) + if v is not None: + v = 
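# Editor's note: the TOLERANCE table in config.py above is keyed first by dtype,
# then by derivative order, then by the individual tolerance name.  A sketch of the
# lookup pattern the new tests use; `config` here refers to the config.py module
# added in this diff:

import config

rtol = config.TOLERANCE.get('float32').get('first_order_grad').get('rtol')
atol = config.TOLERANCE.get('float32').get('first_order_grad').get('atol')
assert (rtol, atol) == (1e-3, 1e-3)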
self.gen_inputs(v) + outputs = func(*xs) + if v is not None: + inputs_grad = paddle.grad( + outputs, + xs, + v, + create_graph=create_graph, + allow_unused=allow_unused) + else: + inputs_grad = paddle.grad( + outputs, + xs, + create_graph=create_graph, + allow_unused=allow_unused) + return outputs, inputs_grad + + return vjp_test, grad_test + + def gen_jvp_tests(self, + func, + inputs, + v=None, + create_graph=False, + allow_unused=False): + def jvp_test(): + nonlocal v + xs = self.gen_inputs(inputs) + if v is not None: + v = self.gen_inputs(v) + outputs, outputs_grad = paddle.autograd.jvp( + func, + xs, + v, + create_graph=create_graph, + allow_unused=allow_unused) + else: + outputs, outputs_grad = paddle.autograd.jvp( + func, + xs, + create_graph=create_graph, + allow_unused=allow_unused) + return outputs, outputs_grad + + return jvp_test + + def check_results(self, ref, res): + type_error = 'Result is different than expected in shape or type' + value_error = 'Result is different than expected values' + if ref is None: + self.assertTrue(res is None, type_error) + elif isinstance(ref, paddle.Tensor): + self.assertTrue(isinstance(res, paddle.Tensor), type_error) + np.testing.assert_allclose(res, ref) + else: + self.assertTrue(len(res) == len(ref), type_error) + for i in range(len(ref)): + self.check_results(ref[i], res[i]) + return True + + +class TestVJP(TestAutogradFunctional): + def func_vjp_i1o1(self): + test_cases = [ + [reduce, 'A'], # noqa + [reduce_dim, 'A'], # noqa + ] # noqa + for f, inputs in test_cases: + vjp, grad = self.gen_test_pairs(f, inputs) + vjp_result, grad_result = vjp(), grad() + self.check_results(grad_result, vjp_result) + + def func_vjp_i2o1(self): + test_cases = [ + [matmul, ['A', 'B']], # noqa + [mul, ['b', 'c']], # noqa + ] # noqa + for f, inputs in test_cases: + vjp, grad = self.gen_test_pairs(f, inputs) + vjp_result, grad_result = vjp(), grad() + self.check_results(grad_result, vjp_result) + + def func_vjp_i2o2(self): + test_cases = [ + [o2, ['A', 'A']], # noqa + ] # noqa + for f, inputs in test_cases: + inputs = self.gen_inputs(inputs) + v = make_v(f, inputs) + vjp, grad = self.gen_test_pairs(f, inputs, v=v) + vjp_result, grad_result = vjp(), grad() + self.check_results(grad_result, vjp_result) + + def func_vjp_i2o2_omitting_v(self): + test_cases = [ + [o2, ['A', 'A']], # noqa + ] # noqa + for f, inputs in test_cases: + inputs = self.gen_inputs(inputs) + vjp, grad = self.gen_test_pairs(f, inputs) + vjp_result, grad_result = vjp(), grad() + self.check_results(grad_result, vjp_result) + + def func_vjp_nested(self): + x = self.gen_input('a') + test_cases = [ + [nested(x), 'a'], # noqa + ] + for f, inputs in test_cases: + vjp, grad = self.gen_test_pairs(f, inputs) + vjp_result, grad_result = vjp(), grad() + self.check_results(grad_result, vjp_result) + + def func_vjp_aliased_input(self): + x = self.gen_input('a') + ref = self.gen_test_pairs(nested(x), 'a')[0] + aliased = self.gen_test_pairs(nested(x), x)[0] + ref_result, aliased_result = ref(), aliased() + self.check_results(ref_result, aliased_result) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_vjp_i1o1() + self.func_vjp_i2o1() + self.func_vjp_i2o2() + self.func_vjp_i2o2_omitting_v() + self.func_vjp_nested() + self.func_vjp_aliased_input() + + +@utils.place(config.DEVICES) +@utils.parameterize( + (utils.TEST_CASE_NAME, 'fun', 'xs', 'v', 'expected_exception'), ( + ('v_shape_not_equal_ys', utils.square, np.random.rand(3), + np.random.rand(1), RuntimeError), )) +class 
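# Editor's note: gen_test_pairs above checks paddle.autograd.vjp against a plain
# paddle.grad call with the same cotangent v.  For orientation, a numpy sketch of
# what a VJP is for the elementwise square f(x) = x**2: v @ J == v * 2x.

import numpy as np

x = np.array([1., 2., 3.])
v = np.array([1., 1., 1.])
J = np.diag(2 * x)                 # Jacobian of the elementwise square
assert np.allclose(v @ J, 2 * x * v)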
TestVJPException(unittest.TestCase): + def func_vjp(self): + with self.assertRaises(self.expected_exception): + paddle.autograd.vjp(self.fun, + paddle.to_tensor(self.xs), + paddle.to_tensor(self.v)) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_vjp() + + +def jac(grad_fn, f, inputs): + assert grad_fn in [paddle.autograd.vjp, paddle.autograd.jvp] + if grad_fn is paddle.autograd.jvp: + vs = [paddle.zeros_like(x) for x in inputs] + else: + outputs = f(*inputs) + if isinstance(outputs, paddle.Tensor): + outputs = [outputs] + vs = [paddle.zeros_like(y) for y in outputs] + JJ_cols = [] + for i, v in enumerate(vs): + v = v.flatten() + for j in range(len(v)): + _v = paddle.zeros_like(v).detach() + _v[j] = 1.0 + _v = _v.reshape(vs[i].shape) + _vs = vs.copy() + _vs[i] = _v + _, grads = grad_fn(f, inputs, _vs) + d_outs = paddle.concat([d_out.flatten() for d_out in grads]) + JJ_cols.append(d_outs) + # JJ is the fully unrolled jacobian + JJ = paddle.stack(JJ_cols) + if grad_fn is paddle.autograd.vjp: + JJ = JJ.t() + return JJ + + +class TestJVP(TestAutogradFunctional): + def func_jvp_i1o1(self): + test_cases = [ + [reduce, 'A'], # noqa + [reduce_dim, 'A'], # noqa + ] # noqa + for f, inputs in test_cases: + inputs = self.gen_inputs(inputs) + forward_jac = jac(paddle.autograd.jvp, f, inputs) + reverse_jac = jac(paddle.autograd.vjp, f, inputs) + self.check_results(forward_jac, reverse_jac) + + def func_jvp_i2o1(self): + test_cases = [ # noqa + [matmul, ['A', 'B']], # noqa + ] # noqa + for f, inputs in test_cases: + inputs = self.gen_inputs(inputs) + forward_jac = jac(paddle.autograd.jvp, f, inputs) + reverse_jac = jac(paddle.autograd.vjp, f, inputs) + self.check_results(forward_jac, reverse_jac) + + def func_jvp_i2o2(self): + test_cases = [ # noqa + [o2, ['A', 'A']], # noqa + ] # noqa + for f, inputs in test_cases: + inputs = self.gen_inputs(inputs) + forward_jac = jac(paddle.autograd.jvp, f, inputs) + reverse_jac = jac(paddle.autograd.vjp, f, inputs) + self.check_results(forward_jac, reverse_jac) + + def func_jvp_i2o2_omitting_v(self): + test_cases = [ # noqa + [o2, ['A', 'A']], # noqa + ] # noqa + for f, inputs in test_cases: + inputs = self.gen_inputs(inputs) + results_omitting_v = paddle.autograd.jvp(f, inputs) + v = [paddle.ones_like(x) for x in inputs] + results_with_v = paddle.autograd.jvp(f, inputs, v) + self.check_results(results_omitting_v, results_with_v) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_jvp_i1o1() + self.func_jvp_i2o1() + self.func_jvp_i2o2() + self.func_jvp_i2o2_omitting_v() + + +@utils.place(config.DEVICES) +@utils.parameterize((utils.TEST_CASE_NAME, 'func', 'xs'), ( + ('1d_in_1d_out', utils.square, np.array([2., 3.])), + ('3d_in_3d_out', utils.square, np.random.rand(2, 3, 4)), + ('single_in_single_out', utils.square, np.random.rand(2, 3)), + ('multi_in_single_out', paddle.matmul, + (np.random.rand(2, 2), np.random.rand(2, 2))), )) +class TestJacobianClassNoBatch(unittest.TestCase): + def setUp(self): + self._dtype = self.xs[0].dtype if isinstance( + self.xs, typing.Sequence) else self.xs.dtype + self._eps = config.TOLERANCE.get(str(self._dtype)).get( + "first_order_grad").get("eps") + self._rtol = config.TOLERANCE.get(str(self._dtype)).get( + "first_order_grad").get("rtol") + self._atol = config.TOLERANCE.get(str(self._dtype)).get( + "first_order_grad").get("atol") + + self.xs = [paddle.to_tensor(x) for x in self.xs] if isinstance( + self.xs, typing.Sequence) else paddle.to_tensor(self.xs) + self._actual = 
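# Editor's note: the jac() helper above unrolls a full Jacobian by pushing one-hot
# vectors through jvp (one column at a time) or vjp (one row at a time) and stacking
# the results; the tests then check that the two agree.  A numpy sketch of the same
# unrolling for a linear map f(x) = A @ x, whose Jacobian is A itself:

import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])
jvp = lambda v: A @ v              # forward-mode product
vjp = lambda u: A.T @ u            # reverse-mode product

forward = np.stack([jvp(e) for e in np.eye(2)], axis=1)   # column by column
reverse = np.stack([vjp(e) for e in np.eye(3)], axis=0)   # row by row
assert np.allclose(forward, A) and np.allclose(reverse, A)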
paddle.autograd.Jacobian(self.func, self.xs, False) + self._expected = self._expected() + + def func_jacobian(self): + Index = collections.namedtuple('Index', ('type', 'value')) + indexes = (Index('all', (slice(0, None, None), slice(0, None, None))), + Index('row', (0, slice(0, None, None))), + Index('col', (slice(0, None, None), 0)), + Index('multi-row', (slice(0, 2, 1), slice(0, None, None)))) + self.assertEqual(self._actual[:].numpy().dtype, self._expected.dtype) + for index in indexes: + np.testing.assert_allclose( + self._actual.__getitem__(index.value), + self._expected.__getitem__(index.value), + rtol=self._rtol, + atol=self._atol, + err_msg=f'Testcase {index.type} index not passed, value is {index.value}' + ) + + def _expected(self): + jac = utils._compute_numerical_jacobian(self.func, self.xs, self._eps, + self._dtype) + return utils._np_concat_matrix_sequence(jac, utils.MatrixFormat.NM) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_jacobian() + + +@utils.place(config.DEVICES) +@utils.parameterize((utils.TEST_CASE_NAME, 'func', 'xs'), ( + ('1d_in_1d_out', utils.square, np.array([[1., 2., 3.], [3., 4., 3.]])), + ('3d_in_3d_out', utils.square, np.random.rand(2, 3, 4)), + ('multi_in_single_out', utils.square, np.random.rand(2, 3)), )) +class TestJacobianClassBatchFirst(unittest.TestCase): + def setUp(self): + self._dtype = self.xs[0].dtype if isinstance( + self.xs, typing.Sequence) else self.xs.dtype + self._eps = config.TOLERANCE.get(str(self._dtype)).get( + "first_order_grad").get("eps") + self._rtol = config.TOLERANCE.get(str(self._dtype)).get( + "first_order_grad").get("rtol") + self._atol = config.TOLERANCE.get(str(self._dtype)).get( + "first_order_grad").get("atol") + + self.xs = [paddle.to_tensor(x) for x in self.xs] if isinstance( + self.xs, typing.Sequence) else paddle.to_tensor(self.xs) + self._actual = paddle.autograd.Jacobian(self.func, self.xs, True) + self._expected = self._expected() + + def func_jacobian(self): + Index = collections.namedtuple('Index', ('type', 'value')) + indexes = ( + Index('all', (slice(0, None, None), slice(0, None, None), + slice(0, None, None))), + Index('row', (slice(0, None, None), 0, slice(0, None, None))), + Index('col', + (slice(0, None, None), slice(0, None, None), 0)), Index( + 'batch', (slice(0, 2, None), slice(0, None, None), + slice(0, None, None))), + Index('multi_row', + (slice(0, 1, None), slice(0, 2, 1), slice(0, None, None)))) + self.assertEqual(self._actual[:].numpy().dtype, self._expected.dtype) + for index in indexes: + np.testing.assert_allclose( + self._actual.__getitem__(index.value), + self._expected.__getitem__(index.value), + rtol=self._rtol, + atol=self._atol, + err_msg=f'Testcase {index.type} index not passed, value is {index.value}' + ) + + def _expected(self): + jac = utils._compute_numerical_batch_jacobian( + self.func, self.xs, self._eps, self._dtype, False) + jac = utils._np_concat_matrix_sequence(jac, utils.MatrixFormat.NBM) + return utils._np_transpose_matrix_format(jac, utils.MatrixFormat.NBM, + utils.MatrixFormat.BNM) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_jacobian() + + +class TestHessianClassNoBatch(unittest.TestCase): + @classmethod + def setUpClass(self): + self.shape = (2, 2) + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("eps") + self.rtol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("rtol") + self.atol = 
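# Editor's note: the expected values in these Jacobian tests come from utils helpers
# that are not part of this hunk; a central-difference Jacobian of the shape those
# tests compare against would look roughly like this (illustrative sketch, not the
# utils implementation):

import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x))
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        d = np.zeros(x.size)
        d[j] = eps
        d = d.reshape(x.shape)
        J[:, j] = (np.asarray(f(x + d)) - np.asarray(f(x - d))).reshape(-1) / (2 * eps)
    return J

J = numerical_jacobian(lambda x: x ** 2, np.array([1., 2., 3.]))
assert np.allclose(J, np.diag([2., 4., 6.]), atol=1e-4)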
config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("atol") + self.x = paddle.rand(shape=self.shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.shape, dtype=self.dtype) + + def func_single_input(self): + def func(x): + return paddle.sum(paddle.matmul(x, x)) + + numerical_hessian = utils._compute_numerical_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + numerical_hessian = utils._np_concat_matrix_sequence(numerical_hessian) + + self.x.stop_gradient = False + hessian = paddle.autograd.Hessian(func, self.x) + np.testing.assert_allclose(hessian[:].numpy(), numerical_hessian, + self.rtol, self.atol) + + def func_multi_input(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, y)) + + numerical_hessian = utils._compute_numerical_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + numerical_hessian = utils._np_concat_matrix_sequence(numerical_hessian) + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.Hessian(func, [self.x, self.y]) + np.testing.assert_allclose( + hessian[:].numpy(), + numerical_hessian, + rtol=self.rtol, + atol=self.atol) + + def func_allow_unused_true(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, x)) + + numerical_hessian = utils._compute_numerical_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + numerical_hessian = utils._np_concat_matrix_sequence(numerical_hessian) + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.Hessian(func, [self.x, self.y]) + np.testing.assert_allclose(hessian[:].numpy(), numerical_hessian, + self.rtol, self.atol) + + def func_create_graph_true(self): + def func(x): + return paddle.sum(F.sigmoid(x)) + + numerical_hessian = utils._compute_numerical_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + numerical_hessian = utils._np_concat_matrix_sequence(numerical_hessian) + self.x.stop_gradient = False + hessian = paddle.autograd.Hessian(func, self.x) + assert hessian[:].stop_gradient == False + np.testing.assert_allclose(hessian[:].numpy(), numerical_hessian, + self.rtol, self.atol) + + def func_out_not_single(self): + def func(x): + return x * x + + with self.assertRaises(RuntimeError): + paddle.autograd.Hessian(func, paddle.ones([3])) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_single_input() + self.func_multi_input() + self.func_allow_unused_true() + self.func_create_graph_true() + self.func_out_not_single() + + +class TestHessianClassBatchFirst(unittest.TestCase): + @classmethod + def setUpClass(self): + self.x_shape = (5, 2) + self.weight_shape = (2, 4) + self.y_shape = (5, 2) + self.nbatch, self.nrow = 5, 2 + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('eps') + self.rtol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('rtol') + self.atol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('atol') + self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) + self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) + + def func_single_input(self): + def func(x): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + expected = utils._compute_numerical_batch_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + + H = paddle.autograd.Hessian(func, self.x, is_batched=True) + actual = utils._np_transpose_matrix_format( + 
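# Editor's note: the batch-first Hessian tests above shuttle results between an
# (N, B, M) layout and a batch-first (B, N, M) layout via utils helpers not shown
# in this diff; the underlying operation is just an axis transpose, e.g.:

import numpy as np

nbm = np.random.rand(4, 5, 3)            # (N, B, M)
bnm = np.transpose(nbm, (1, 0, 2))       # (B, N, M)
assert bnm.shape == (5, 4, 3)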
H[:].numpy(), utils.MatrixFormat.BNM, utils.MatrixFormat.NBM) + actual = actual.reshape((H.shape[1], -1)) + + np.testing.assert_allclose(actual, expected, self.rtol, self.atol) + + def func_multi_input(self): + def func(x, y): + return paddle.matmul(x * x * y * y, self.weight)[:, 0:1] + + xs_len = 2 + expected = utils._compute_numerical_batch_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + expected = np.reshape( + np.array(expected), + (xs_len, xs_len, self.nrow, self.nbatch, self.nrow)) + expected = [[n for n in row] for row in expected] + expected = utils._np_concat_matrix_sequence(expected) + + self.x.stop_gradient = False + self.y.stop_gradient = False + H = paddle.autograd.Hessian(func, [self.x, self.y], is_batched=True) + actual = utils._np_transpose_matrix_format( + H[:].numpy(), utils.MatrixFormat.BNM, utils.MatrixFormat.NBM) + + np.testing.assert_allclose(actual, expected, self.rtol, self.atol) + + def func_allow_unused(self): + def func(x, y): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + xs_len = 2 + expected = utils._compute_numerical_batch_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + expected = np.reshape( + np.array(expected), + (xs_len, xs_len, self.nrow, self.nbatch, self.nrow)) + expected = [[n for n in row] for row in expected] + expected = utils._np_concat_matrix_sequence(expected) + expected = utils._np_transpose_matrix_format( + expected, utils.MatrixFormat.NBM, utils.MatrixFormat.BNM) + + actual = paddle.autograd.Hessian( + func, [self.x, self.y], is_batched=True)[:] + + np.testing.assert_allclose( + actual, expected, rtol=self.rtol, atol=self.atol) + + def func_stop_gradient(self): + def func(x): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + expected = utils._compute_numerical_batch_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + + x = self.x.clone() + x.stop_gradient = True + H = paddle.autograd.Hessian(func, self.x, is_batched=True)[:] + actual = utils._np_transpose_matrix_format( + H[:].numpy(), utils.MatrixFormat.BNM, utils.MatrixFormat.NBM) + actual = actual.reshape((H.shape[1], -1)) + + np.testing.assert_allclose(actual, expected, self.rtol, self.atol) + + def func_out_not_single(self): + def func(x): + return (x * x) + + with self.assertRaises(RuntimeError): + paddle.autograd.Hessian(func, paddle.ones((3, 3)), is_batched=True) + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_single_input() + self.func_multi_input() + self.func_allow_unused() + self.func_stop_gradient() + self.func_out_not_single() + + +class TestHessian(unittest.TestCase): + @classmethod + def setUpClass(self): + self.shape = (2, 2) + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("eps") + self.rtol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("rtol") + self.atol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("atol") + self.x = paddle.rand(shape=self.shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.shape, dtype=self.dtype) + + def func_single_input(self): + def func(x): + return paddle.sum(paddle.matmul(x, x)) + + numerical_hessian = _compute_numerical_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + hessian = paddle.autograd.hessian(func, self.x) + np.testing.assert_allclose(hessian.numpy(), numerical_hessian[0][0], + self.rtol, self.atol) + + def func_multi_input(self): + def func(x, y): + 
return paddle.sum(paddle.matmul(x, y)) + + numerical_hessian = _compute_numerical_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.hessian(func, [self.x, self.y]) + for i in range(len(hessian)): + for j in range(len(hessian[0])): + np.testing.assert_allclose(hessian[i][j].numpy(), + numerical_hessian[i][j], self.rtol, + self.atol) + + def func_allow_unused_false(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, x)) + + try: + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.hessian(func, [self.x, self.y]) + except ValueError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("allow_unused") > 0 + + def func_allow_unused_true(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, x)) + + numerical_hessian = _compute_numerical_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.hessian( + func, [self.x, self.y], allow_unused=True) + for i in range(len(hessian)): + for j in range(len(hessian[0])): + if i == j == 0: + np.testing.assert_allclose(hessian[i][j].numpy(), + numerical_hessian[i][j], + self.rtol, self.atol) + else: + assert hessian[i][j] is None + + def func_create_graph_false(self): + def func(x): + return paddle.sum(paddle.matmul(x, x)) + + numerical_hessian = _compute_numerical_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + hessian = paddle.autograd.hessian(func, self.x) + assert hessian.stop_gradient == True + np.testing.assert_allclose(hessian.numpy(), numerical_hessian[0][0], + self.rtol, self.atol) + try: + paddle.grad(hessian, self.x) + except RuntimeError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("has no gradient") > 0 + + def func_create_graph_true(self): + def func(x): + return paddle.sum(F.sigmoid(x)) + + numerical_hessian = _compute_numerical_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + hessian = paddle.autograd.hessian(func, self.x, create_graph=True) + assert hessian.stop_gradient == False + np.testing.assert_allclose(hessian.numpy(), numerical_hessian[0][0], + self.rtol, self.atol) + triple_grad = paddle.grad(hessian, self.x) + assert triple_grad is not None + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_single_input() + self.func_multi_input() + self.func_allow_unused_false() + self.func_allow_unused_true() + self.func_create_graph_false() + self.func_create_graph_true() + + +class TestHessianFloat64(TestHessian): + @classmethod + def setUpClass(self): + self.shape = (2, 2) + self.dtype = 'float64' + self.np_dtype = np.float64 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("eps") + self.rtol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("rtol") + self.atol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("atol") + self.x = paddle.rand(shape=self.shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.shape, dtype=self.dtype) + + +class TestBatchHessian(unittest.TestCase): + @classmethod + def setUpClass(self): + self.x_shape = (5, 2) + self.weight_shape = (2, 4) + self.y_shape = (5, 2) + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("eps") + 
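# Editor's note: these Hessian tests compare paddle.autograd.hessian /
# batch_hessian against finite-difference values from utils.  For orientation, a
# numpy sketch of a second-order central difference for f(x) = sum(x**2), whose
# Hessian is 2*I (illustrative only, not the utils implementation):

import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

H = numerical_hessian(lambda x: np.sum(x ** 2), np.array([1., 2.]))
assert np.allclose(H, 2 * np.eye(2), atol=1e-3)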
self.rtol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("rtol") + self.atol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("atol") + self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) + self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) + + def func_single_input(self): + def func(x): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + numerical_hessian = _compute_numerical_batch_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + hessian = paddle.autograd.batch_hessian(func, self.x, create_graph=True) + np.testing.assert_allclose(hessian, numerical_hessian, self.rtol, + self.atol) + + def func_multi_input(self): + def func(x, y): + return paddle.matmul(x * x * y * y, self.weight)[:, 0:1] + + numerical_hessian = _compute_numerical_batch_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.batch_hessian(func, [self.x, self.y]) + + shape_tensor = paddle.to_tensor(numerical_hessian).astype("float64") + hessian_reshape = np.reshape(hessian, (shape_tensor.shape)) + np.testing.assert_allclose(hessian_reshape, numerical_hessian, + self.rtol, self.atol) + + def func_allow_unused_false(self): + def func(x, y): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + try: + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.batch_hessian(func, [self.x, self.y]) + except ValueError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("allow_unused") > 0 + + def func_allow_unused_true(self): + def func(x, y): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + numerical_hessian = _compute_numerical_batch_hessian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + hessian = paddle.autograd.batch_hessian( + func, [self.x, self.y], allow_unused=True) + + for i in range(len(hessian)): + for j in range(len(hessian[0])): + if i == j == 0: + numerical_hessian = np.stack( + (numerical_hessian[i][j], numerical_hessian[i][j + 1]), + axis=0) + np.testing.assert_allclose(hessian[i][j], numerical_hessian, + self.rtol, self.atol) + else: + assert hessian[i][j] is None + + def func_create_graph_false(self): + def func(x): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + numerical_hessian = _compute_numerical_batch_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + hessian = paddle.autograd.batch_hessian(func, self.x) + assert hessian.stop_gradient == True + np.testing.assert_allclose(hessian.numpy(), numerical_hessian, + self.rtol, self.atol) + try: + paddle.grad(hessian, self.x) + except RuntimeError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("has no gradient") > 0 + + def func_create_graph_true(self): + def func(x): + return paddle.matmul(x * x, self.weight)[:, 0:1] + + numerical_hessian = _compute_numerical_batch_hessian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + hessian = paddle.autograd.batch_hessian(func, self.x, create_graph=True) + assert hessian.stop_gradient == False + np.testing.assert_allclose(hessian.numpy(), numerical_hessian, + self.rtol, self.atol) + triple_grad = paddle.grad(hessian, self.x) + assert triple_grad is not None + + def test_all_cases(self): + if 
_in_legacy_dygraph(): + self.func_single_input() + self.func_multi_input() + self.func_allow_unused_false() + self.func_allow_unused_true() + self.func_create_graph_false() + self.func_create_graph_true() + + +class TestBatchHessianFloat64(TestBatchHessian): + @classmethod + def setUpClass(self): + self.x_shape = (5, 2) + self.weight_shape = (2, 4) + self.y_shape = (5, 2) + self.dtype = 'float64' + self.np_dtype = np.float64 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("eps") + self.rtol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("rtol") + self.atol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("atol") + self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) + self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) + + +class TestVHP(unittest.TestCase): + @classmethod + def setUpClass(self): + self.shape = (2, 2) + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("eps") + self.rtol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("rtol") + self.atol = config.TOLERANCE.get(self.dtype).get( + "second_order_grad").get("atol") + self.x = paddle.rand(shape=self.shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.shape, dtype=self.dtype) + self.vx = paddle.rand(shape=self.shape, dtype=self.dtype) + self.vy = paddle.rand(shape=self.shape, dtype=self.dtype) + + def func_single_input(self): + def func(x): + return paddle.sum(paddle.matmul(x, x)) + + numerical_func_output = func(self.x).numpy() + numerical_vhp = _compute_numerical_vhp( + func, self.x, self.vx, self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + func_output, vhp = paddle.autograd.vhp(func, self.x, self.vx) + np.testing.assert_allclose(func_output.numpy(), numerical_func_output, + self.rtol, self.atol) + np.testing.assert_allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, + self.atol) + + def func_multi_input(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, y)) + + numerical_func_output = func(self.x, self.y).numpy() + numerical_vhp = _compute_numerical_vhp( + func, [self.x, self.y], [self.vx, self.vy], self.numerical_delta, + self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + func_output, vhp = paddle.autograd.vhp(func, [self.x, self.y], + [self.vx, self.vy]) + np.testing.assert_allclose(func_output.numpy(), numerical_func_output, + self.rtol, self.atol) + for i in range(len(vhp)): + np.testing.assert_allclose(vhp[i].numpy(), numerical_vhp[i], + self.rtol, self.atol) + + def func_v_default(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, y)) + + numerical_func_output = func(self.x, self.y).numpy() + vx = paddle.ones(self.vx.shape, dtype=self.vx.dtype) + vy = paddle.ones(self.vy.shape, dtype=self.vy.dtype) + numerical_vhp = _compute_numerical_vhp(func, [self.x, self.y], + [vx, vy], self.numerical_delta, + self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + func_output, vhp = paddle.autograd.vhp(func, [self.x, self.y]) + np.testing.assert_allclose(func_output.numpy(), numerical_func_output, + self.rtol, self.atol) + for i in range(len(vhp)): + np.testing.assert_allclose(vhp[i].numpy(), numerical_vhp[i], + self.rtol, self.atol) + + def func_allow_unused_true(self): + def func(x, y): + return paddle.sum(paddle.matmul(x, x)) + + numerical_func_output = 
func(self.x, self.y).numpy() + numerical_vhp = _compute_numerical_vhp( + func, [self.x, self.y], [self.vx, self.vy], self.numerical_delta, + self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + func_output, vhp = paddle.autograd.vhp(func, [self.x, self.y], + [self.vx, self.vy]) + np.testing.assert_allclose(func_output.numpy(), numerical_func_output, + self.rtol, self.atol) + np.testing.assert_allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, + self.atol) + + def func_create_graph_true(self): + def func(x): + return paddle.sum(F.sigmoid(x)) + + numerical_func_output = func(self.x).numpy() + numerical_vhp = _compute_numerical_vhp( + func, self.x, self.vx, self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + func_output, vhp = paddle.autograd.vhp(func, self.x, self.vx) + np.testing.assert_allclose(func_output.numpy(), numerical_func_output, + self.rtol, self.atol) + assert vhp[0].stop_gradient == False + np.testing.assert_allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, + self.atol) + triple_grad = paddle.grad(vhp, self.x) + assert triple_grad is not None + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_v_default() + self.func_multi_input() + self.func_single_input() + self.func_allow_unused_true() + self.func_create_graph_true() + + +class TestJacobian(unittest.TestCase): + @classmethod + def setUpClass(self): + self.shape = (4, 4) + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = 1e-4 + self.rtol = 1e-3 + self.atol = 1e-3 + self.x = paddle.rand(shape=self.shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.shape, dtype=self.dtype) + + def func_single_input_and_single_output(self): + def func(x): + return paddle.matmul(x, x) + + numerical_jacobian = _compute_numerical_jacobian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + jacobian = paddle.autograd.jacobian(func, self.x) + np.testing.assert_allclose(jacobian.numpy(), numerical_jacobian[0][0], + self.rtol, self.atol) + + def func_single_input_and_multi_output(self): + def func(x): + return paddle.matmul(x, x), x * x + + numerical_jacobian = _compute_numerical_jacobian( + func, self.x, self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + jacobian = paddle.autograd.jacobian(func, self.x) + for i in range(len(jacobian)): + np.testing.assert_allclose(jacobian[i].numpy(), + numerical_jacobian[i][0], self.rtol, + self.atol) + + def func_multi_input_and_single_output(self): + def func(x, y): + return paddle.matmul(x, y) + + numerical_jacobian = _compute_numerical_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) + for j in range(len(jacobian)): + np.testing.assert_allclose(jacobian[j].numpy(), + numerical_jacobian[0][j], self.rtol, + self.atol) + + def func_multi_input_and_multi_output(self): + def func(x, y): + return paddle.matmul(x, y), x * y + + numerical_jacobian = _compute_numerical_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) + for i in range(len(jacobian)): + for j in range(len(jacobian[0])): + np.testing.assert_allclose(jacobian[i][j].numpy(), + numerical_jacobian[i][j], self.rtol, + self.atol) + + def func_allow_unused_false(self): + def func(x, y): + return paddle.matmul(x, x) + 
+ try: + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) + except ValueError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("allow_unused") > 0 + + def func_allow_unused_true(self): + def func(x, y): + return paddle.matmul(x, x) + + numerical_jacobian = _compute_numerical_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.jacobian( + func, [self.x, self.y], allow_unused=True) + np.testing.assert_allclose( + jacobian[0].numpy(), numerical_jacobian[0][0], self.rtol, self.atol) + assert jacobian[1] is None + + def func_create_graph_false(self): + def func(x, y): + return paddle.matmul(x, y) + + numerical_jacobian = _compute_numerical_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) + for j in range(len(jacobian)): + assert jacobian[j].stop_gradient == True + np.testing.assert_allclose(jacobian[j].numpy(), + numerical_jacobian[0][j], self.rtol, + self.atol) + try: + paddle.grad(jacobian[0], [self.x, self.y]) + except RuntimeError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("has no gradient") > 0 + + def func_create_graph_true(self): + def func(x, y): + return paddle.matmul(x, y) + + numerical_jacobian = _compute_numerical_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.jacobian( + func, [self.x, self.y], create_graph=True) + for j in range(len(jacobian)): + assert jacobian[j].stop_gradient == False + np.testing.assert_allclose(jacobian[j].numpy(), + numerical_jacobian[0][j], self.rtol, + self.atol) + double_grad = paddle.grad(jacobian[0], [self.x, self.y]) + assert double_grad is not None + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_multi_input_and_multi_output() + self.func_multi_input_and_single_output() + self.func_single_input_and_multi_output() + self.func_single_input_and_single_output() + self.func_allow_unused_false() + self.func_allow_unused_true() + self.func_create_graph_false() + self.func_create_graph_true() + + +class TestJacobianFloat64(TestJacobian): + @classmethod + def setUpClass(self): + self.shape = (4, 4) + self.dtype = 'float64' + self.np_dtype = np.float64 + self.numerical_delta = 1e-7 + self.rtol = 1e-7 + self.atol = 1e-7 + self.x = paddle.rand(shape=self.shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.shape, dtype=self.dtype) + + +class TestJacobianBatch(unittest.TestCase): + @classmethod + def setUpClass(self): + self.x_shape = (4, 2) + self.weight_shape = (2, 4) + self.y_shape = (4, 2) + self.dtype = 'float32' + self.np_dtype = np.float32 + self.numerical_delta = 1e-4 + self.rtol = 1e-3 + self.atol = 1e-3 + self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) + self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) + + def func_batch_single_input_and_batch_single_output(self): + def func(x): + return paddle.matmul(paddle.matmul(x, self.weight), self.y) + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x], self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + batch_jacobian = paddle.autograd.batch_jacobian( + func, + 
self.x, ) + + self.assertTrue( + np.allclose(batch_jacobian.numpy().all(), numerical_jacobian[0][0] + .all())) + + def func_batch_single_input_and_batch_multi_output(self): + def func(x): + return paddle.matmul(paddle.matmul(x, self.weight), self.y), x * x + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x], self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + batch_jacobian = paddle.autograd.batch_jacobian( + func, + self.x, ) + + for i in range(len(batch_jacobian)): + np.testing.assert_allclose(batch_jacobian[i].numpy(), + numerical_jacobian[i][0], self.rtol, + self.atol) + + def func_batch_multi_input_and_batch_single_output(self): + def func(x, y): + return x * y + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + batch_jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) + + for j in range(len(batch_jacobian)): + np.testing.assert_allclose(batch_jacobian[j].numpy(), + numerical_jacobian[0][j], self.rtol, + self.atol) + + def func_batch_multi_input_and_batch_multi_output(self): + def func(x, y): + return x * y, x * y + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + + self.x.stop_gradient = False + self.y.stop_gradient = False + batch_jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) + + for i in range(len(batch_jacobian)): + np.testing.assert_allclose(batch_jacobian[i], numerical_jacobian[i], + self.rtol, self.atol) + + def func_allow_unused_false(self): + def func(x, y): + return x * x + + try: + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) + except ValueError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("allow_unused") > 0 + + def func_allow_unused_true(self): + def func(x, y): + return x * x + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.batch_jacobian( + func, [self.x, self.y], allow_unused=True) + + np.testing.assert_allclose( + jacobian[0].numpy(), numerical_jacobian[0][0], self.rtol, self.atol) + assert jacobian[1] is None + + def func_create_graph_false(self): + def func(x, y): + return x * y + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) + for j in range(len(jacobian)): + assert jacobian[j].stop_gradient == True + np.testing.assert_allclose(jacobian[j].numpy(), + numerical_jacobian[0][j], self.rtol, + self.atol) + try: + paddle.grad(jacobian[0], [self.x, self.y]) + except RuntimeError as e: + error_msg = cpt.get_exception_message(e) + assert error_msg.find("has no gradient") > 0 + + def func_create_graph_true(self): + def func(x, y): + return x * y + + numerical_jacobian = _compute_numerical_batch_jacobian( + func, [self.x, self.y], self.numerical_delta, self.np_dtype) + self.x.stop_gradient = False + self.y.stop_gradient = False + jacobian = paddle.autograd.batch_jacobian( + func, [self.x, self.y], create_graph=True) + for j in range(len(jacobian)): + assert jacobian[j].stop_gradient == False + 
np.testing.assert_allclose(jacobian[j].numpy(), + numerical_jacobian[0][j], self.rtol, + self.atol) + double_grad = paddle.grad(jacobian[0], [self.x, self.y]) + assert double_grad is not None + + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_batch_single_input_and_batch_single_output() + self.func_batch_single_input_and_batch_multi_output() + self.func_batch_multi_input_and_batch_single_output() + self.func_batch_multi_input_and_batch_multi_output() + self.func_allow_unused_false() + self.func_allow_unused_true() + self.func_create_graph_false() + self.func_create_graph_true() + + +class TestJacobianBatchFloat64(TestJacobianBatch): + @classmethod + def setUpClass(self): + self.x_shape = (12, 2) + self.weight_shape = (2, 12) + self.y_shape = (12, 2) + self.dtype = 'float64' + self.np_dtype = np.float64 + self.numerical_delta = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('eps') + self.rtol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('rtol') + self.atol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('atol') + self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) + self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) + self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/test_autograd_functional_static.py b/python/paddle/fluid/tests/unittests/autograd/test_autograd_functional_static.py new file mode 100644 index 0000000000000..8801664fdca9a --- /dev/null +++ b/python/paddle/fluid/tests/unittests/autograd/test_autograd_functional_static.py @@ -0,0 +1,455 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import typing +import unittest + +import numpy as np +import paddle +import paddle.fluid as fluid + +import config +import utils +from utils import (_compute_numerical_batch_jacobian, + _compute_numerical_jacobian) +from paddle.autograd.functional import _as_tensors + +paddle.enable_static() + + +@utils.place(config.DEVICES) +@utils.parameterize((utils.TEST_CASE_NAME, 'fun', 'xs', 'v', 'stop_gradient'), ( + ('tensor_input', utils.reduce, np.random.rand(2, 3), None, False), + ('tensor_sequence_input', utils.reduce, np.random.rand(2, 3), None, False), + ('v_not_none', utils.reduce, np.random.rand(2, 3), np.random.rand(1), + False), + ('xs_stop_gradient', utils.reduce, np.random.rand(2, 3), np.random.rand(1), + True), + ('func_mutmul', utils.matmul, (np.random.rand(3, 2), np.random.rand(2, 3)), + None, False), + ('func_mul', utils.mul, (np.random.rand(3, 3), np.random.rand(3, 3)), None, + False), + ('func_out_two', utils.o2, (np.random.rand(10), np.random.rand(10)), None, + False), )) +class TestVJP(unittest.TestCase): + def setUp(self): + self.dtype = str(self.xs[0].dtype) if isinstance( + self.xs, typing.Sequence) else str(self.xs.dtype) + self._rtol = config.TOLERANCE.get(str(self.dtype)).get( + "first_order_grad").get("rtol") + self._atol = config.TOLERANCE.get(str(self.dtype)).get( + "first_order_grad").get("atol") + + def _vjp(self): + exe = paddle.static.Executor() + sp = paddle.static.Program() + mp = paddle.static.Program() + with paddle.static.program_guard(mp, sp): + feed, static_xs, static_v = gen_static_data_and_feed( + self.xs, self.v, stop_gradient=self.stop_gradient) + ys, xs_grads = paddle.autograd.vjp(self.fun, static_xs, static_v) + exe.run(sp) + return exe.run(mp, feed=feed, fetch_list=[ys, xs_grads]) + + def _expected_vjp(self): + exe = paddle.static.Executor() + sp = paddle.static.Program() + mp = paddle.static.Program() + with paddle.static.program_guard(mp, sp): + feed, static_xs, static_v = gen_static_data_and_feed(self.xs, + self.v, False) + ys = self.fun(*static_xs) if isinstance( + static_xs, typing.Sequence) else self.fun(static_xs) + xs_grads = paddle.static.gradients(ys, static_xs, static_v) + exe.run(sp) + return exe.run(mp, feed=feed, fetch_list=[ys, xs_grads]) + + def test_vjp(self): + actual = self._vjp() + expected = self._expected_vjp() + self.assertEqual(len(actual), len(expected)) + for i in range(len(actual)): + np.testing.assert_allclose( + actual[i], expected[i], rtol=self._rtol, atol=self._atol) + + +@utils.place(config.DEVICES) +@utils.parameterize( + (utils.TEST_CASE_NAME, 'fun', 'xs', 'v', 'expected_exception'), ( + ('v_shape_not_equal_ys', utils.square, np.random.rand(3), + np.random.rand(1), RuntimeError), )) +class TestVJPException(unittest.TestCase): + def setUp(self): + self.exe = paddle.static.Executor() + + def _vjp(self): + sp = paddle.static.Program() + mp = paddle.static.Program() + with paddle.static.program_guard(mp, sp): + feed, static_xs, static_v = gen_static_data_and_feed(self.xs, + self.v) + ys, xs_grads = paddle.autograd.vjp(self.fun, static_xs, static_v) + self.exe.run(sp) + return self.exe.run(mp, feed, fetch_list=[ys, xs_grads]) + + def test_vjp(self): + with self.assertRaises(self.expected_exception): + self._vjp() + + +def gen_static_data_and_feed(xs, v, stop_gradient=True): + feed = {} + if isinstance(xs, typing.Sequence): + static_xs = [] + for i, x in enumerate(xs): + x = paddle.static.data(f"x{i}", x.shape, x.dtype) + x.stop_gradient = stop_gradient + static_xs.append(x) + feed.update({f'x{idx}': value for idx, value 
in enumerate(xs)}) + else: + static_xs = paddle.static.data('x', xs.shape, xs.dtype) + static_xs.stop_gradient = stop_gradient + feed.update({'x': xs}) + + if isinstance(v, typing.Sequence): + static_v = [] + for i, e in enumerate(v): + e = paddle.static.data(f'v{i}', e.shape, e.dtype) + e.stop_gradient = stop_gradient + static_v.append(e) + feed.update({f'v{idx}': value for idx, value in enumerate(v)}) + elif v is not None: + static_v = paddle.static.data('v', v.shape, v.dtype) + static_v.stop_gradient = stop_gradient + feed.update({'v': v}) + else: + static_v = v + + return feed, static_xs, static_v + + + def approx_jacobian(f, xs, dtype, eps=1e-5, batch=False): + r"""Computes an approximate Jacobian matrix of a multi-valued function + using finite differences. + + The function input is required to be an np array or a list of np + arrays. + """ + + def flatten(x): + if len(x.shape) > 0: + to = [x.shape[0], -1] if batch else [-1] + return x.reshape(to) + else: + return x + + def flatten_all(xs): + if isinstance(xs, list): + flattened = np.concatenate([flatten(x) for x in xs], axis=-1) + else: + flattened = flatten(xs) + return flattened + + def x_like(x, orig_x): + return x.reshape(orig_x.shape) + + def _f(x): + if multi_inps: + _xs = np.split(x, splits, axis=-1) + _xs = [x_like(_x, _o) for _x, _o in zip(_xs, xs)] + outs = f(_xs) + else: + outs = f(x) + return flatten_all(outs) + + multi_inps = False if isinstance(xs, np.ndarray) else True + x = flatten_all(xs) + xdim = x.shape[-1] + splits = [] + + if multi_inps: + split = 0 + for inp in xs: + split += flatten(inp).shape[-1] + splits.append(split) + + ds = eps * np.eye(xdim, dtype=dtype) + + fprimes_by_x = [(0.5 * (_f(x + d) - _f(x - d)) / eps) for d in ds] + fprimes_by_y = np.stack(fprimes_by_x, axis=-1) + return np.transpose(fprimes_by_y, [1, 0, 2]) if batch else fprimes_by_y + + + def make_tensors(inps): + if isinstance(inps, list): + xs = [ + paddle.static.data( + f'x{i}', inp.shape, dtype=inp.dtype) + for i, inp in enumerate(inps) + ] + else: + xs = paddle.static.data(name='x', shape=inps.shape, dtype=inps.dtype) + return xs + + + all_data_shapes = { + 'A': [[1., 2.]], + 'B': [[1., 2.], [2., 1.]], + 'C': [[2., 2.], [2., 1.]], + 'D': [[[2., 2.], [2., 1.]], [[1., 2.], [2., 1.]]], + 'E': [[[3., 4.], [2., 3.]], [[2., 1.], [1., 3.]]], + } + + + def prepare_data(test, input_shapes, dtype): + for name, shape in input_shapes.items(): + setattr(test, name, np.array(shape, dtype=dtype)) + + + class TestJacobianFloat32(unittest.TestCase): + @classmethod + def setUpClass(self): + paddle.enable_static() + if fluid.core.is_compiled_with_cuda(): + self.place = fluid.CUDAPlace(0) + else: + self.place = fluid.CPUPlace() + self.dtype = 'float32' + self.np_dtype = np.float32 + prepare_data(self, all_data_shapes, self.dtype) + self.eps = config.TOLERANCE.get(self.dtype).get('first_order_grad').get( + 'eps') + # self.rtol = config.TOLERANCE.get(self.dtype).get('first_order_grad').get('rtol') + # self.atol = config.TOLERANCE.get(self.dtype).get('first_order_grad').get('atol') + # Don't use the tolerances from config here; they make this test case fail. 
+ self.rtol = 1e-2 + self.atol = 1e-2 + + def run_test_by_fullmatrix(self, pd_f, np_f, inps, batch=False): + main = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(main, startup): + xs = make_tensors(inps) + JJ = paddle.autograd.functional.Jacobian(pd_f, xs, is_batched=batch) + if batch: + _, nrow, ncol = JJ.shape + else: + nrow, ncol = JJ.shape + full_jacobian = JJ[:] + exe = fluid.Executor(self.place) + exe.run(startup) + if isinstance(inps, list): + feeds = {f'x{i}': x for i, x in enumerate(inps)} + else: + feeds = {'x': inps} + pd_jacobians = exe.run(main, feed=feeds, fetch_list=[full_jacobian])[0] + np_jacobians = approx_jacobian( + np_f, inps, self.dtype, self.eps, batch=batch) + if batch: + np_jacobians = utils._np_transpose_matrix_format( + np_jacobians, utils.MatrixFormat.NBM, utils.MatrixFormat.BNM) + + np.testing.assert_allclose(pd_jacobians, np_jacobians, self.rtol, + self.atol) + + def run_test_by_rows(self, pd_f, np_f, inps, batch=False): + main = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(main, startup): + xs = make_tensors(inps) + JJ = paddle.autograd.functional.Jacobian(pd_f, xs, is_batched=batch) + if batch: + nbatch, nrow, ncol = JJ.shape + rows = [JJ[:, i, :] for i in range(nrow)] + else: + nrow, ncol = JJ.shape + rows = [JJ[i, :] for i in range(nrow)] + + exe = fluid.Executor(self.place) + exe.run(startup) + if isinstance(inps, list): + feeds = {f'x{i}': x for i, x in enumerate(inps)} + else: + feeds = {'x': inps} + pd_jac = exe.run(main, feed=feeds, fetch_list=[rows]) + np_jac = approx_jacobian(np_f, inps, self.dtype, self.eps, batch=batch) + for i in range(nrow): + np.testing.assert_allclose(pd_jac[i], np_jac[i], self.rtol, + self.atol) + + def run_test_by_entries(self, pd_f, np_f, inps, batch=False): + main = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(main, startup): + xs = make_tensors(inps) + JJ = paddle.autograd.functional.Jacobian(pd_f, xs, is_batched=batch) + if batch: + nbatch, nrow, ncol = JJ.shape + entries = [ + JJ[:, i, j] for i in range(nrow) for j in range(ncol) + ] + else: + nrow, ncol = JJ.shape + entries = [JJ[i, j] for i in range(nrow) for j in range(ncol)] + exe = fluid.Executor(self.place) + exe.run(startup) + if isinstance(inps, list): + feeds = {f'x{i}': x for i, x in enumerate(inps)} + else: + feeds = {'x': inps} + pd_entries = exe.run(main, feed=feeds, fetch_list=[entries]) + np_jac = approx_jacobian(np_f, inps, self.dtype, self.eps, batch=batch) + np_entries = [ + np_jac[i, ..., j] for i in range(nrow) for j in range(ncol) + ] + for pd_entry, np_entry in zip(pd_entries, np_entries): + np.testing.assert_allclose(pd_entry, np_entry, self.rtol, self.atol) + + def test_square(self): + def pd_f(x): + return paddle.multiply(x, x) + + def np_f(x): + return np.multiply(x, x) + + self.run_test_by_fullmatrix(pd_f, np_f, self.A) + self.run_test_by_rows(pd_f, np_f, self.A) + self.run_test_by_entries(pd_f, np_f, self.A) + + def test_mul(self): + def pd_f(x, y): + return paddle.multiply(x, y) + + def np_f(xs): + x, y = xs + return np.multiply(x, y) + + self.run_test_by_fullmatrix( + pd_f, + np_f, + [self.B, self.C], ) + self.run_test_by_rows(pd_f, np_f, [self.B, self.C]) + self.run_test_by_entries(pd_f, np_f, [self.B, self.C]) + + def test_matmul(self): + def pd_f(x, y): + return paddle.matmul(x, y) + + def np_f(xs): + x, y = xs + return np.matmul(x, y) + + self.run_test_by_fullmatrix(pd_f, np_f, [self.B, self.C]) + self.run_test_by_rows(pd_f, np_f, [self.B, self.C]) + 
self.run_test_by_entries(pd_f, np_f, [self.B, self.C]) + + def test_batch_matmul(self): + def pd_f(x, y): + return paddle.matmul(x, y) + + def np_f(xs): + x, y = xs + return np.matmul(x, y) + + self.run_test_by_fullmatrix(pd_f, np_f, [self.D, self.E], batch=True) + self.run_test_by_rows(pd_f, np_f, [self.D, self.E], batch=True) + self.run_test_by_entries(pd_f, np_f, [self.D, self.E], batch=True) + + +class TestJacobianFloat64(TestJacobianFloat32): + @classmethod + def setUpClass(self): + paddle.enable_static() + if fluid.core.is_compiled_with_cuda(): + self.place = fluid.CUDAPlace(0) + else: + self.place = fluid.CPUPlace() + self.dtype = 'float64' + prepare_data(self, all_data_shapes, self.dtype) + self.eps = config.TOLERANCE.get(self.dtype).get('first_order_grad').get( + 'eps') + self.rtol = config.TOLERANCE.get(self.dtype).get( + 'first_order_grad').get('rtol') + self.atol = config.TOLERANCE.get(self.dtype).get( + 'first_order_grad').get('atol') + + +class TestHessianFloat32(unittest.TestCase): + @classmethod + def setUpClass(self): + paddle.enable_static() + if fluid.core.is_compiled_with_cuda(): + self.place = fluid.CUDAPlace(0) + else: + self.place = fluid.CPUPlace() + self.dtype = 'float32' + prepare_data(self, all_data_shapes, self.dtype) + self.eps = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('eps') + self.rtol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('rtol') + self.atol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('atol') + + def run_test_by_fullmatrix(self, pd_f, inps, np_hess, batch=False): + main = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(main, startup): + xs = make_tensors(inps) + HH = paddle.autograd.functional.Hessian(pd_f, xs, is_batched=batch) + nrow, ncol = HH.shape + full_hessian = HH[:] + exe = fluid.Executor(self.place) + exe.run(startup) + if isinstance(inps, list): + feeds = {f'x{i}': x for i, x in enumerate(inps)} + else: + feeds = {'x': inps} + pd_hess = exe.run(main, feed=feeds, fetch_list=[full_hessian])[0] + np.testing.assert_allclose(pd_hess, np_hess, self.rtol, self.atol) + + def test_square(self): + def pd_f(x): + """Input is a square matrix.""" + return paddle.matmul(x, x.T).flatten().sum() + + def np_hess(x): + dim = x.shape[0] + upperleft = 2 * np.eye(dim, dtype=self.dtype) + upper = np.concatenate((upperleft, upperleft)) + return np.concatenate((upper, upper), axis=1) + + self.run_test_by_fullmatrix(pd_f, self.B, np_hess(self.B)) + + +class TestHessianFloat64(TestHessianFloat32): + @classmethod + def setUpClass(self): + paddle.enable_static() + if fluid.core.is_compiled_with_cuda(): + self.place = fluid.CUDAPlace(0) + else: + self.place = fluid.CPUPlace() + self.dtype = 'float64' + prepare_data(self, all_data_shapes, self.dtype) + self.eps = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('eps') + self.rtol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('rtol') + self.atol = config.TOLERANCE.get(self.dtype).get( + 'second_order_grad').get('atol') + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/test_autograd_static.py b/python/paddle/fluid/tests/unittests/autograd/test_autograd_static.py deleted file mode 100644 index 60dc9d06b8a7f..0000000000000 --- a/python/paddle/fluid/tests/unittests/autograd/test_autograd_static.py +++ /dev/null @@ -1,308 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import unittest -import numpy as np -import paddle -import paddle.fluid as fluid -from utils import _compute_numerical_jacobian, _compute_numerical_batch_jacobian - - -def approx_jacobian(f, xs, dtype, eps=1e-5, batch=False): - r"""Computes an approximate Jacobian matrix of a multi-valued function - using finite differences. - - The function input is required to be an np array or a list of list of np - arrays. - """ - - def flatten(x): - if len(x.shape) > 0: - to = [x.shape[0], -1] if batch else [-1] - return x.reshape(to) - else: - return x - - def flatten_all(xs): - if isinstance(xs, list): - flattened = np.concatenate([flatten(x) for x in xs], axis=-1) - else: - flattened = flatten(xs) - return flattened - - def x_like(x, orig_x): - return x.reshape(orig_x.shape) - - def _f(x): - if multi_inps: - _xs = np.split(x, splits, axis=-1) - _xs = [x_like(_x, _o) for _x, _o in zip(_xs, xs)] - outs = f(_xs) - else: - outs = f(x) - return flatten_all(outs) - - multi_inps = False if isinstance(xs, np.ndarray) else True - x = flatten_all(xs) - xdim = x.shape[-1] - splits = [] - - if multi_inps: - split = 0 - for inp in xs: - split += flatten(inp).shape[-1] - splits.append(split) - - ds = eps * np.eye(xdim, dtype=dtype) - - fprimes_by_x = [(0.5 * (_f(x + d) - _f(x - d)) / eps) for d in ds] - fprimes_by_y = np.stack(fprimes_by_x, axis=-1) - return np.transpose(fprimes_by_y, [1, 0, 2]) if batch else fprimes_by_y - - -def make_tensors(inps): - if isinstance(inps, list): - xs = [ - paddle.static.data( - f'x{i}', inp.shape, dtype=inp.dtype) - for i, inp in enumerate(inps) - ] - else: - xs = paddle.static.data(name='x', shape=inps.shape, dtype=inps.dtype) - return xs - - -all_data_shapes = { - 'A': [[1., 2.]], - 'B': [[1., 2.], [2., 1.]], - 'C': [[2., 2.], [2., 1.]], - 'D': [[[2., 2.], [2., 1.]], [[1., 2.], [2., 1.]]], - 'E': [[[3., 4.], [2., 3.]], [[2., 1.], [1., 3.]]], -} - - -def prepare_data(test, input_shapes, dtype): - for name, shape in input_shapes.items(): - setattr(test, name, np.array(shape, dtype=dtype)) - - -class TestJacobianFloat32(unittest.TestCase): - @classmethod - def setUpClass(self): - paddle.enable_static() - if fluid.core.is_compiled_with_cuda(): - self.place = fluid.CUDAPlace(0) - else: - self.place = fluid.CPUPlace() - self.dtype = 'float32' - prepare_data(self, all_data_shapes, self.dtype) - self.eps = 1e-4 - self.rtol = 1e-2 - self.atol = 1e-2 - - def run_test_by_fullmatrix(self, pd_f, np_f, inps, batch=False): - main = fluid.Program() - startup = fluid.Program() - with fluid.program_guard(main, startup): - xs = make_tensors(inps) - JJ = paddle.autograd.functional.Jacobian(pd_f, xs, batch=batch) - nrow, ncol = JJ.shape() - full_jacobian = JJ[:] - exe = fluid.Executor(self.place) - exe.run(startup) - if isinstance(inps, list): - feeds = {f'x{i}': x for i, x in enumerate(inps)} - else: - feeds = {'x': inps} - pd_jacobians = exe.run(main, feed=feeds, fetch_list=[full_jacobian])[0] - np_jacobians = approx_jacobian( - np_f, inps, 
self.dtype, self.eps, batch=batch) - self.assertTrue( - np.allclose(pd_jacobians, np_jacobians, self.rtol, self.atol)) - - def run_test_by_rows(self, pd_f, np_f, inps, batch=False): - main = fluid.Program() - startup = fluid.Program() - with fluid.program_guard(main, startup): - xs = make_tensors(inps) - JJ = paddle.autograd.functional.Jacobian(pd_f, xs, batch=batch) - nrow, ncol = JJ.shape() - rows = [JJ[i] for i in range(nrow)] - exe = fluid.Executor(self.place) - exe.run(startup) - if isinstance(inps, list): - feeds = {f'x{i}': x for i, x in enumerate(inps)} - else: - feeds = {'x': inps} - pd_jac = exe.run(main, feed=feeds, fetch_list=[rows]) - np_jac = approx_jacobian(np_f, inps, self.dtype, self.eps, batch=batch) - for i in range(nrow): - self.assertTrue( - np.allclose(pd_jac[i], np_jac[i], self.rtol, self.atol)) - - def run_test_by_entries(self, pd_f, np_f, inps, batch=False): - main = fluid.Program() - startup = fluid.Program() - with fluid.program_guard(main, startup): - xs = make_tensors(inps) - JJ = paddle.autograd.functional.Jacobian(pd_f, xs, batch=batch) - nrow, ncol = JJ.shape() - entries = [JJ[i, j] for i in range(nrow) for j in range(ncol)] - exe = fluid.Executor(self.place) - exe.run(startup) - if isinstance(inps, list): - feeds = {f'x{i}': x for i, x in enumerate(inps)} - else: - feeds = {'x': inps} - pd_entries = exe.run(main, feed=feeds, fetch_list=[entries]) - np_jac = approx_jacobian(np_f, inps, self.dtype, self.eps, batch=batch) - np_entries = [ - np_jac[i, ..., j] for i in range(nrow) for j in range(ncol) - ] - for pd_entry, np_entry in zip(pd_entries, np_entries): - self.assertTrue( - np.allclose(pd_entry, np_entry, self.rtol, self.atol)) - - def test_square(self): - def pd_f(x): - return paddle.multiply(x, x) - - def np_f(x): - return np.multiply(x, x) - - self.run_test_by_fullmatrix(pd_f, np_f, self.A) - self.run_test_by_rows(pd_f, np_f, self.A) - self.run_test_by_entries(pd_f, np_f, self.A) - - def test_mul(self): - def pd_f(xs): - x, y = xs - return paddle.multiply(x, y) - - def np_f(xs): - x, y = xs - return np.multiply(x, y) - - self.run_test_by_fullmatrix( - pd_f, - np_f, - [self.B, self.C], ) - self.run_test_by_rows(pd_f, np_f, [self.B, self.C]) - self.run_test_by_entries(pd_f, np_f, [self.B, self.C]) - - def test_matmul(self): - def pd_f(xs): - x, y = xs - return paddle.matmul(x, y) - - def np_f(xs): - x, y = xs - return np.matmul(x, y) - - self.run_test_by_fullmatrix(pd_f, np_f, [self.B, self.C]) - self.run_test_by_rows(pd_f, np_f, [self.B, self.C]) - self.run_test_by_entries(pd_f, np_f, [self.B, self.C]) - - def test_batch_matmul(self): - def pd_f(xs): - x, y = xs - return paddle.matmul(x, y) - - def np_f(xs): - x, y = xs - return np.matmul(x, y) - - self.run_test_by_fullmatrix(pd_f, np_f, [self.D, self.E], batch=True) - self.run_test_by_rows(pd_f, np_f, [self.D, self.E], batch=True) - self.run_test_by_entries(pd_f, np_f, [self.D, self.E], batch=True) - - -class TestJacobianFloat64(TestJacobianFloat32): - @classmethod - def setUpClass(self): - paddle.enable_static() - if fluid.core.is_compiled_with_cuda(): - self.place = fluid.CUDAPlace(0) - else: - self.place = fluid.CPUPlace() - self.dtype = 'float64' - prepare_data(self, all_data_shapes, self.dtype) - self.eps = 1e-7 - self.rtol = 1e-6 - self.atol = 1e-6 - - -class TestHessianFloat64(unittest.TestCase): - @classmethod - def setUpClass(self): - paddle.enable_static() - if fluid.core.is_compiled_with_cuda(): - self.place = fluid.CUDAPlace(0) - else: - self.place = fluid.CPUPlace() - self.dtype = 
'float64' - prepare_data(self, all_data_shapes, self.dtype) - self.eps = 1e-7 - self.rtol = 1e-6 - self.atol = 1e-6 - - def run_test_by_fullmatrix(self, pd_f, inps, np_hess, batch=False): - main = fluid.Program() - startup = fluid.Program() - with fluid.program_guard(main, startup): - xs = make_tensors(inps) - HH = paddle.autograd.functional.Hessian(pd_f, xs, batch=batch) - nrow, ncol = HH.shape() - full_hessian = HH[:] - exe = fluid.Executor(self.place) - exe.run(startup) - if isinstance(inps, list): - feeds = {f'x{i}': x for i, x in enumerate(inps)} - else: - feeds = {'x': inps} - pd_hess = exe.run(main, feed=feeds, fetch_list=[full_hessian])[0] - self.assertTrue(np.allclose(pd_hess, np_hess, self.rtol, self.atol)) - - def test_square(self): - def pd_f(x): - """Input is a square matrix.""" - return paddle.matmul(x, x.T) - - def np_hess(x): - dim = x.shape[0] - f_xx_upperleft = 2 * np.eye(dim, dtype=self.dtype) - f_xx = np.zeros([dim * dim, dim * dim], dtype=self.dtype) - f_xx[:dim, :dim] = f_xx_upperleft - return f_xx - - self.run_test_by_fullmatrix(pd_f, self.B, np_hess(self.B)) - - def test_batch_square(self): - def pd_f(x): - """Input is a square matrix.""" - return paddle.matmul(x, paddle.transpose(x, [0, 2, 1])) - - def np_hess(x): - bat, dim, _ = x.shape - f_xx_upperleft = 2 * np.eye(dim, dtype=self.dtype) - f_xx = np.zeros([bat, dim * dim, dim * dim], dtype=self.dtype) - f_xx[..., :dim, :dim] = f_xx_upperleft - return f_xx - - self.run_test_by_fullmatrix( - pd_f, self.E, np_hess(self.E), batch=True) - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/test_hessian.py b/python/paddle/fluid/tests/unittests/autograd/test_hessian.py deleted file mode 100644 index 7b3bd9fd55932..0000000000000 --- a/python/paddle/fluid/tests/unittests/autograd/test_hessian.py +++ /dev/null @@ -1,263 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import unittest -import numpy as np -import paddle -import paddle.compat as cpt -import paddle.nn.functional as F -from utils import _compute_numerical_hessian, _compute_numerical_batch_hessian - - -class TestHessian(unittest.TestCase): - @classmethod - def setUpClass(self): - self.shape = (2, 2) - self.dtype = 'float32' - self.np_dtype = np.float32 - self.numerical_delta = 1e-2 - self.rtol = 1e-2 - self.atol = 1e-2 - self.x = paddle.rand(shape=self.shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.shape, dtype=self.dtype) - - def test_single_input(self): - def func(x): - return paddle.sum(paddle.matmul(x, x)) - - numerical_hessian = _compute_numerical_hessian( - func, self.x, self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - hessian = paddle.autograd.hessian(func, self.x) - assert np.allclose(hessian.numpy(), numerical_hessian[0][0], self.rtol, - self.atol) - - def test_multi_input(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, y)) - - numerical_hessian = _compute_numerical_hessian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - hessian = paddle.autograd.hessian(func, [self.x, self.y]) - for i in range(len(hessian)): - for j in range(len(hessian[0])): - assert np.allclose(hessian[i][j].numpy(), - numerical_hessian[i][j], self.rtol, - self.atol) - - def test_allow_unused_false(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, x)) - - try: - self.x.stop_gradient = False - self.y.stop_gradient = False - hessian = paddle.autograd.hessian(func, [self.x, self.y]) - except ValueError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("allow_unused") > 0 - - def test_allow_unused_true(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, x)) - - numerical_hessian = _compute_numerical_hessian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - hessian = paddle.autograd.hessian( - func, [self.x, self.y], allow_unused=True) - for i in range(len(hessian)): - for j in range(len(hessian[0])): - if i == j == 0: - assert np.allclose(hessian[i][j].numpy(), - numerical_hessian[i][j], self.rtol, - self.atol) - else: - assert hessian[i][j] is None - - def test_create_graph_false(self): - def func(x): - return paddle.sum(paddle.matmul(x, x)) - - numerical_hessian = _compute_numerical_hessian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - hessian = paddle.autograd.hessian(func, self.x) - assert hessian.stop_gradient == True - assert np.allclose(hessian.numpy(), numerical_hessian[0][0], self.rtol, - self.atol) - try: - paddle.grad(hessian, self.x) - except RuntimeError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("has no gradient") > 0 - - def test_create_graph_true(self): - def func(x): - return paddle.sum(F.sigmoid(x)) - - numerical_hessian = _compute_numerical_hessian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - hessian = paddle.autograd.hessian(func, self.x, create_graph=True) - assert hessian.stop_gradient == False - assert np.allclose(hessian.numpy(), numerical_hessian[0][0], self.rtol, - self.atol) - triple_grad = paddle.grad(hessian, self.x) - assert triple_grad is not None - - -class TestHessianFloat64(TestHessian): - @classmethod - def setUpClass(self): - self.shape = (2, 2) - self.dtype = 'float64' - self.np_dtype = np.float64 - 
self.numerical_delta = 1e-5 - self.rtol = 1e-5 - self.atol = 1e-5 - self.x = paddle.rand(shape=self.shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.shape, dtype=self.dtype) - - -class TestBatchHessian(unittest.TestCase): - @classmethod - def setUpClass(self): - self.x_shape = (5, 2) - self.weight_shape = (2, 4) - self.y_shape = (5, 2) - self.dtype = 'float32' - self.np_dtype = np.float32 - self.numerical_delta = 1e-2 - self.rtol = 1e-3 - self.atol = 1e-3 - self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) - self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) - - def test_single_input(self): - def func(x): - return paddle.matmul(x * x, self.weight)[:, 0:1] - - numerical_hessian = _compute_numerical_batch_hessian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - hessian = paddle.autograd.batch_hessian(func, self.x, create_graph=True) - assert np.allclose(hessian, numerical_hessian, self.rtol, self.atol) - - def test_multi_input(self): - def func(x, y): - return paddle.matmul(x * x * y * y, self.weight)[:, 0:1] - - numerical_hessian = _compute_numerical_batch_hessian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - hessian = paddle.autograd.batch_hessian(func, [self.x, self.y]) - - shape_tensor = paddle.to_tensor(numerical_hessian).astype("float64") - hessian_reshape = np.reshape(hessian, (shape_tensor.shape)) - assert np.allclose(hessian_reshape, numerical_hessian, self.rtol, - self.atol) - - def test_allow_unused_false(self): - def func(x, y): - return paddle.matmul(x * x, self.weight)[:, 0:1] - - try: - self.x.stop_gradient = False - self.y.stop_gradient = False - hessian = paddle.autograd.batch_hessian(func, [self.x, self.y]) - except ValueError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("allow_unused") > 0 - - def test_allow_unused_true(self): - def func(x, y): - return paddle.matmul(x * x, self.weight)[:, 0:1] - - numerical_hessian = _compute_numerical_batch_hessian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - hessian = paddle.autograd.batch_hessian( - func, [self.x, self.y], allow_unused=True) - - for i in range(len(hessian)): - for j in range(len(hessian[0])): - if i == j == 0: - numerical_hessian = np.stack( - (numerical_hessian[i][j], numerical_hessian[i][j + 1]), - axis=0) - assert np.allclose(hessian[i][j], numerical_hessian, - self.rtol, self.atol) - else: - assert hessian[i][j] is None - - def test_create_graph_false(self): - def func(x): - return paddle.matmul(x * x, self.weight)[:, 0:1] - - numerical_hessian = _compute_numerical_batch_hessian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - hessian = paddle.autograd.batch_hessian(func, self.x) - assert hessian.stop_gradient == True - assert np.allclose(hessian.numpy(), numerical_hessian, self.rtol, - self.atol) - try: - paddle.grad(hessian, self.x) - except RuntimeError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("has no gradient") > 0 - - def test_create_graph_true(self): - def func(x): - return paddle.matmul(x * x, self.weight)[:, 0:1] - - numerical_hessian = _compute_numerical_batch_hessian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - hessian = paddle.autograd.batch_hessian(func, self.x, 
create_graph=True) - assert hessian.stop_gradient == False - assert np.allclose(hessian.numpy(), numerical_hessian, self.rtol, - self.atol) - triple_grad = paddle.grad(hessian, self.x) - assert triple_grad is not None - - -class TestBatchHessianFloat64(TestBatchHessian): - @classmethod - def setUpClass(self): - self.x_shape = (5, 2) - self.weight_shape = (2, 4) - self.y_shape = (5, 2) - self.dtype = 'float64' - self.np_dtype = np.float64 - self.numerical_delta = 1e-4 - self.rtol = 1e-5 - self.atol = 1e-5 - self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) - self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/test_jacobian.py b/python/paddle/fluid/tests/unittests/autograd/test_jacobian.py deleted file mode 100644 index 335ea4e519bef..0000000000000 --- a/python/paddle/fluid/tests/unittests/autograd/test_jacobian.py +++ /dev/null @@ -1,319 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import unittest -import numpy as np -import paddle -import paddle.compat as cpt -from utils import _compute_numerical_jacobian, _compute_numerical_batch_jacobian - - -class TestJacobian(unittest.TestCase): - @classmethod - def setUpClass(self): - self.shape = (4, 4) - self.dtype = 'float32' - self.np_dtype = np.float32 - self.numerical_delta = 1e-4 - self.rtol = 1e-3 - self.atol = 1e-3 - self.x = paddle.rand(shape=self.shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.shape, dtype=self.dtype) - - def test_single_input_and_single_output(self): - def func(x): - return paddle.matmul(x, x) - - numerical_jacobian = _compute_numerical_jacobian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - jacobian = paddle.autograd.jacobian(func, self.x) - assert np.allclose(jacobian.numpy(), numerical_jacobian[0][0], - self.rtol, self.atol) - - def test_single_input_and_multi_output(self): - def func(x): - return paddle.matmul(x, x), x * x - - numerical_jacobian = _compute_numerical_jacobian( - func, self.x, self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - jacobian = paddle.autograd.jacobian(func, self.x) - for i in range(len(jacobian)): - assert np.allclose(jacobian[i].numpy(), numerical_jacobian[i][0], - self.rtol, self.atol) - - def test_multi_input_and_single_output(self): - def func(x, y): - return paddle.matmul(x, y) - - numerical_jacobian = _compute_numerical_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) - for j in range(len(jacobian)): - assert np.allclose(jacobian[j].numpy(), numerical_jacobian[0][j], - self.rtol, self.atol) - - def test_multi_input_and_multi_output(self): - def func(x, y): - return paddle.matmul(x, y), x * y - - 
numerical_jacobian = _compute_numerical_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) - for i in range(len(jacobian)): - for j in range(len(jacobian[0])): - assert np.allclose(jacobian[i][j].numpy(), - numerical_jacobian[i][j], self.rtol, - self.atol) - - def test_allow_unused_false(self): - def func(x, y): - return paddle.matmul(x, x) - - try: - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) - except ValueError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("allow_unused") > 0 - - def test_allow_unused_true(self): - def func(x, y): - return paddle.matmul(x, x) - - numerical_jacobian = _compute_numerical_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.jacobian( - func, [self.x, self.y], allow_unused=True) - assert np.allclose(jacobian[0].numpy(), numerical_jacobian[0][0], - self.rtol, self.atol) - assert jacobian[1] is None - - def test_create_graph_false(self): - def func(x, y): - return paddle.matmul(x, y) - - numerical_jacobian = _compute_numerical_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.jacobian(func, [self.x, self.y]) - for j in range(len(jacobian)): - assert jacobian[j].stop_gradient == True - assert np.allclose(jacobian[j].numpy(), numerical_jacobian[0][j], - self.rtol, self.atol) - try: - paddle.grad(jacobian[0], [self.x, self.y]) - except RuntimeError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("has no gradient") > 0 - - def test_create_graph_true(self): - def func(x, y): - return paddle.matmul(x, y) - - numerical_jacobian = _compute_numerical_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.jacobian( - func, [self.x, self.y], create_graph=True) - for j in range(len(jacobian)): - assert jacobian[j].stop_gradient == False - assert np.allclose(jacobian[j].numpy(), numerical_jacobian[0][j], - self.rtol, self.atol) - double_grad = paddle.grad(jacobian[0], [self.x, self.y]) - assert double_grad is not None - - -class TestJacobianFloat64(TestJacobian): - @classmethod - def setUpClass(self): - self.shape = (4, 4) - self.dtype = 'float64' - self.np_dtype = np.float64 - self.numerical_delta = 1e-7 - self.rtol = 1e-7 - self.atol = 1e-7 - self.x = paddle.rand(shape=self.shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.shape, dtype=self.dtype) - - -class TestJacobianBatch(unittest.TestCase): - @classmethod - def setUpClass(self): - self.x_shape = (4, 2) - self.weight_shape = (2, 4) - self.y_shape = (4, 2) - self.dtype = 'float32' - self.np_dtype = np.float32 - self.numerical_delta = 1e-4 - self.rtol = 1e-3 - self.atol = 1e-3 - self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) - self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) - - def test_batch_single_input_and_batch_single_output(self): - def func(x): - return paddle.matmul(paddle.matmul(x, self.weight), self.y) - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x], self.numerical_delta, 
self.np_dtype) - - self.x.stop_gradient = False - batch_jacobian = paddle.autograd.batch_jacobian( - func, - self.x, ) - - self.assertTrue( - np.allclose(batch_jacobian.numpy().all(), numerical_jacobian[0][0] - .all())) - - def test_batch_single_input_and_batch_multi_output(self): - def func(x): - return paddle.matmul(paddle.matmul(x, self.weight), self.y), x * x - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x], self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - batch_jacobian = paddle.autograd.batch_jacobian( - func, - self.x, ) - - for i in range(len(batch_jacobian)): - assert np.allclose(batch_jacobian[i].numpy(), - numerical_jacobian[i][0], self.rtol, self.atol) - - def test_batch_multi_input_and_batch_single_output(self): - def func(x, y): - return x * y - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - batch_jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) - - for j in range(len(batch_jacobian)): - assert np.allclose(batch_jacobian[j].numpy(), - numerical_jacobian[0][j], self.rtol, self.atol) - - def test_batch_multi_input_and_batch_multi_output(self): - def func(x, y): - return x * y, x * y - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - batch_jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) - - for i in range(len(batch_jacobian)): - assert np.allclose(batch_jacobian[i], numerical_jacobian[i], - self.rtol, self.atol) - - def test_allow_unused_false(self): - def func(x, y): - return x * x - - try: - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) - except ValueError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("allow_unused") > 0 - - def test_allow_unused_true(self): - def func(x, y): - return x * x - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.batch_jacobian( - func, [self.x, self.y], allow_unused=True) - - assert np.allclose(jacobian[0].numpy(), numerical_jacobian[0][0], - self.rtol, self.atol) - assert jacobian[1] is None - - def test_create_graph_false(self): - def func(x, y): - return x * y - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.batch_jacobian(func, [self.x, self.y]) - for j in range(len(jacobian)): - assert jacobian[j].stop_gradient == True - assert np.allclose(jacobian[j].numpy(), numerical_jacobian[0][j], - self.rtol, self.atol) - try: - paddle.grad(jacobian[0], [self.x, self.y]) - except RuntimeError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("has no gradient") > 0 - - def test_create_graph_true(self): - def func(x, y): - return x * y - - numerical_jacobian = _compute_numerical_batch_jacobian( - func, [self.x, self.y], self.numerical_delta, self.np_dtype) - self.x.stop_gradient = False - self.y.stop_gradient = False - jacobian = paddle.autograd.batch_jacobian( - func, [self.x, self.y], create_graph=True) - for j in range(len(jacobian)): - assert 
jacobian[j].stop_gradient == False - assert np.allclose(jacobian[j].numpy(), numerical_jacobian[0][j], - self.rtol, self.atol) - double_grad = paddle.grad(jacobian[0], [self.x, self.y]) - assert double_grad is not None - - -class TestJacobianBatchFloat64(TestJacobianBatch): - @classmethod - def setUpClass(self): - self.x_shape = (12, 2) - self.weight_shape = (2, 12) - self.y_shape = (12, 2) - self.dtype = 'float64' - self.np_dtype = np.float64 - self.numerical_delta = 1e-7 - self.rtol = 1e-7 - self.atol = 1e-7 - self.x = paddle.rand(shape=self.x_shape, dtype=self.dtype) - self.weight = paddle.rand(shape=self.weight_shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.y_shape, dtype=self.dtype) - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/test_vhp.py b/python/paddle/fluid/tests/unittests/autograd/test_vhp.py deleted file mode 100644 index 09b25203e04a4..0000000000000 --- a/python/paddle/fluid/tests/unittests/autograd/test_vhp.py +++ /dev/null @@ -1,182 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import unittest -import numpy as np -import paddle -import paddle.compat as cpt -import paddle.nn.functional as F -from utils import _compute_numerical_vhp - - -class TestVHP(unittest.TestCase): - @classmethod - def setUpClass(self): - self.shape = (2, 2) - self.dtype = 'float32' - self.np_dtype = np.float32 - self.numerical_delta = 1e-2 - self.rtol = 1e-2 - self.atol = 1e-2 - self.x = paddle.rand(shape=self.shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.shape, dtype=self.dtype) - self.vx = paddle.rand(shape=self.shape, dtype=self.dtype) - self.vy = paddle.rand(shape=self.shape, dtype=self.dtype) - - def test_single_input(self): - def func(x): - return paddle.sum(paddle.matmul(x, x)) - - numerical_func_output = func(self.x).numpy() - numerical_vhp = _compute_numerical_vhp( - func, self.x, self.vx, self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - func_output, vhp = paddle.autograd.vhp(func, self.x, self.vx) - assert np.allclose(func_output.numpy(), numerical_func_output, - self.rtol, self.atol) - assert np.allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, - self.atol) - - def test_multi_input(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, y)) - - numerical_func_output = func(self.x, self.y).numpy() - numerical_vhp = _compute_numerical_vhp( - func, [self.x, self.y], [self.vx, self.vy], self.numerical_delta, - self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - func_output, vhp = paddle.autograd.vhp(func, [self.x, self.y], - [self.vx, self.vy]) - assert np.allclose(func_output.numpy(), numerical_func_output, - self.rtol, self.atol) - for i in range(len(vhp)): - assert np.allclose(vhp[i].numpy(), numerical_vhp[i], self.rtol, - self.atol) - - def test_v_default(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, y)) - - numerical_func_output = func(self.x, 
self.y).numpy() - vx = paddle.ones(self.vx.shape, dtype=self.vx.dtype) - vy = paddle.ones(self.vy.shape, dtype=self.vy.dtype) - numerical_vhp = _compute_numerical_vhp(func, [self.x, self.y], - [vx, vy], self.numerical_delta, - self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - func_output, vhp = paddle.autograd.vhp(func, [self.x, self.y]) - assert np.allclose(func_output.numpy(), numerical_func_output, - self.rtol, self.atol) - for i in range(len(vhp)): - assert np.allclose(vhp[i].numpy(), numerical_vhp[i], self.rtol, - self.atol) - - def test_allow_unused_false(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, x)) - - try: - self.x.stop_gradient = False - self.y.stop_gradient = False - _ = paddle.autograd.vhp(func, [self.x, self.y]) - except ValueError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("allow_unused") > 0 - - def test_allow_unused_true(self): - def func(x, y): - return paddle.sum(paddle.matmul(x, x)) - - numerical_func_output = func(self.x, self.y).numpy() - numerical_vhp = _compute_numerical_vhp( - func, [self.x, self.y], [self.vx, self.vy], self.numerical_delta, - self.np_dtype) - - self.x.stop_gradient = False - self.y.stop_gradient = False - func_output, vhp = paddle.autograd.vhp(func, [self.x, self.y], - [self.vx, self.vy], - allow_unused=True) - assert np.allclose(func_output.numpy(), numerical_func_output, - self.rtol, self.atol) - assert np.allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, - self.atol) - assert vhp[1] is None - - def test_create_graph_false(self): - def func(x): - return paddle.sum(F.sigmoid(x)) - - numerical_func_output = func(self.x).numpy() - numerical_vhp = _compute_numerical_vhp( - func, self.x, self.vx, self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - func_output, vhp = paddle.autograd.vhp(func, self.x, self.vx) - assert np.allclose(func_output.numpy(), numerical_func_output, - self.rtol, self.atol) - assert vhp[0].stop_gradient == True - assert np.allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, - self.atol) - try: - paddle.grad(vhp, self.x) - except RuntimeError as e: - error_msg = cpt.get_exception_message(e) - assert error_msg.find("has no gradient") > 0 - - def test_create_graph_true(self): - def func(x): - return paddle.sum(F.sigmoid(x)) - - numerical_func_output = func(self.x).numpy() - numerical_vhp = _compute_numerical_vhp( - func, self.x, self.vx, self.numerical_delta, self.np_dtype) - - self.x.stop_gradient = False - func_output, vhp = paddle.autograd.vhp(func, - self.x, - self.vx, - create_graph=True) - assert np.allclose(func_output.numpy(), numerical_func_output, - self.rtol, self.atol) - assert vhp[0].stop_gradient == False - assert np.allclose(vhp[0].numpy(), numerical_vhp[0], self.rtol, - self.atol) - triple_grad = paddle.grad(vhp, self.x) - assert triple_grad is not None - - -class TestVHPFloat64(TestVHP): - @classmethod - def setUpClass(self): - self.shape = (2, 2) - self.dtype = 'float64' - self.np_dtype = np.float64 - self.numerical_delta = 1e-5 - self.rtol = 1e-5 - self.atol = 1e-5 - self.x = paddle.rand(shape=self.shape, dtype=self.dtype) - self.y = paddle.rand(shape=self.shape, dtype=self.dtype) - self.vx = paddle.rand(shape=self.shape, dtype=self.dtype) - self.vy = paddle.rand(shape=self.shape, dtype=self.dtype) - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/test_vjp_jvp.py b/python/paddle/fluid/tests/unittests/autograd/test_vjp_jvp.py deleted file mode 100644 
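For reference, the deleted Jacobian/VHP tests above all validate paddle.autograd results against a finite-difference baseline: _compute_numerical_jacobian in utils.py perturbs each input element by ±delta and divides the change in the outputs by 2*delta. A rough standalone sketch of that central-difference idea in plain NumPy follows; the function and variable names are illustrative only and are not part of the original patch.

import numpy as np

def numerical_jacobian(func, x, delta=1e-4):
    """Central-difference estimate of d func(x) / d x, flattened to a 2-D matrix."""
    x = np.array(x, dtype=np.float64)                     # work on a copy of the input
    y = np.asarray(func(x), dtype=np.float64).ravel()     # baseline output, flattened
    jac = np.zeros((y.size, x.size))
    for j in range(x.size):
        orig = x.flat[j]
        x.flat[j] = orig + delta
        y_pos = np.asarray(func(x), dtype=np.float64).ravel()
        x.flat[j] = orig - delta
        y_neg = np.asarray(func(x), dtype=np.float64).ravel()
        x.flat[j] = orig                                   # restore the perturbed element
        jac[:, j] = (y_pos - y_neg) / (2.0 * delta)        # one Jacobian column per input element
    return jac

# Usage sketch: numerical_jacobian(lambda m: m @ m, np.random.rand(4, 4)) returns a (16, 16)
# matrix, which the deleted tests would compare against paddle.autograd.jacobian with rtol/atol.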
index c228ad79321d4..0000000000000 --- a/python/paddle/fluid/tests/unittests/autograd/test_vjp_jvp.py +++ /dev/null @@ -1,315 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import unittest -import paddle - -from paddle.autograd.functional import vjp, jvp, _tensors -from paddle import grad, ones_like, zeros_like - - -def reduce(x): - return paddle.sum(x) - - -def reduce_dim(x): - return paddle.sum(x, axis=0) - - -def matmul(x, y): - return paddle.matmul(x, y) - - -def mul(x, y): - return x * y - - -def pow(x, y): - return paddle.pow(x, y) - - -def o2(x, y): - return paddle.multiply(x, y), paddle.matmul(x, y.t()) - - -def unuse(x, y): - return paddle.sum(x) - - -def nested(x): - def inner(y): - return x * y - - return inner - - -def make_v(f, inputs): - outputs = _tensors(f(*inputs), "outputs") - return [ones_like(x) for x in outputs] - - -class TestAutogradFunctional(unittest.TestCase): - @classmethod - def setUpClass(cls): - cls.RAW_INPUTS = { - 'a': [1.0], - 'b': [1.0, 2.0], - 'c': [3.0, 4.0], - 'd': [[2.0], [3.0]], - 'A': [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]], - 'B': [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]], - } - - def setUp(self): - pass - - def gen_input(self, inp, stop_gradient=False): - if isinstance(inp, paddle.Tensor): - return inp - return paddle.to_tensor( - self.RAW_INPUTS[inp], stop_gradient=stop_gradient) - - def gen_inputs(self, inputs): - if isinstance(inputs, list): - inputs = [self.gen_input(x) for x in inputs] - else: - inputs = [self.gen_input(inputs)] - return inputs - - def gen_test_pairs(self, - func, - inputs, - v=None, - create_graph=False, - allow_unused=False): - def vjp_test(): - nonlocal v - xs = self.gen_inputs(inputs) - if v is not None: - v = self.gen_inputs(v) - outputs, inputs_grad = vjp(func, - xs, - v, - create_graph=create_graph, - allow_unused=allow_unused) - else: - outputs, inputs_grad = vjp(func, - xs, - create_graph=create_graph, - allow_unused=allow_unused) - return outputs, inputs_grad - - def grad_test(): - nonlocal v - xs = self.gen_inputs(inputs) - if v is not None: - v = self.gen_inputs(v) - outputs = func(*xs) - if v is not None: - inputs_grad = grad( - outputs, - xs, - v, - create_graph=create_graph, - allow_unused=allow_unused) - else: - inputs_grad = grad( - outputs, - xs, - create_graph=create_graph, - allow_unused=allow_unused) - return outputs, inputs_grad - - return vjp_test, grad_test - - def gen_jvp_tests(self, - func, - inputs, - v=None, - create_graph=False, - allow_unused=False): - def jvp_test(): - nonlocal v - xs = self.gen_inputs(inputs) - if v is not None: - v = self.gen_inputs(v) - outputs, outputs_grad = jvp(func, - xs, - v, - create_graph=create_graph, - allow_unused=allow_unused) - else: - outputs, outputs_grad = jvp(func, - xs, - create_graph=create_graph, - allow_unused=allow_unused) - return outputs, outputs_grad - - return jvp_test - - def check_results(self, ref, res): - type_error = 'Result is different than expected in shape or type' - value_error = 
'Result is different than expected values' - if ref is None: - self.assertTrue(res is None, type_error) - elif isinstance(ref, paddle.Tensor): - self.assertTrue(isinstance(res, paddle.Tensor), type_error) - self.assertTrue(paddle.allclose(res, ref), value_error) - else: - self.assertTrue(len(res) == len(ref), type_error) - for i in range(len(ref)): - self.check_results(ref[i], res[i]) - return True - - -class TestVJP(TestAutogradFunctional): - def test_vjp_i1o1_no_create_graph(self): - test_cases = [ - [reduce, 'A'], #noqa - [reduce_dim, 'A'], #noqa - ] #noqa - for f, inputs in test_cases: - vjp, grad = self.gen_test_pairs(f, inputs) - vjp_result, grad_result = vjp(), grad() - self.check_results(grad_result, vjp_result) - - def test_vjp_i2o1_no_create_graph(self): - test_cases = [ - [matmul, ['A', 'B']], #noqa - [mul, ['b', 'c']], #noqa - ] #noqa - for f, inputs in test_cases: - vjp, grad = self.gen_test_pairs(f, inputs) - vjp_result, grad_result = vjp(), grad() - self.check_results(grad_result, vjp_result) - - def test_vjp_i2o2_no_create_graph(self): - test_cases = [ - [o2, ['A', 'A']], #noqa - ] #noqa - for f, inputs in test_cases: - inputs = self.gen_inputs(inputs) - v = make_v(f, inputs) - vjp, grad = self.gen_test_pairs(f, inputs, v=v) - vjp_result, grad_result = vjp(), grad() - self.check_results(grad_result, vjp_result) - - def test_vjp_i2o2_omitting_v_no_create_graph(self): - test_cases = [ - [o2, ['A', 'A']], #noqa - ] #noqa - for f, inputs in test_cases: - inputs = self.gen_inputs(inputs) - vjp, grad = self.gen_test_pairs(f, inputs) - vjp_result, grad_result = vjp(), grad() - self.check_results(grad_result, vjp_result) - - def test_vjp_nested_no_create_graph(self): - x = self.gen_input('a') - test_cases = [ - [nested(x), 'a'], #noqa - ] - for f, inputs in test_cases: - vjp, grad = self.gen_test_pairs(f, inputs) - vjp_result, grad_result = vjp(), grad() - self.check_results(grad_result, vjp_result) - - def test_vjp_aliased_input_no_create_graph(self): - x = self.gen_input('a') - ref = self.gen_test_pairs(nested(x), 'a')[0] - aliased = self.gen_test_pairs(nested(x), x)[0] - ref_result, aliased_result = ref(), aliased() - self.check_results(ref_result, aliased_result) - - def test_vjp_allowunused_no_create_graph(self): - x, y = self.gen_input('A'), self.gen_input('a') - vjp, grad = self.gen_test_pairs(unuse, [x, y], allow_unused=True) - vjp_result, grad_result = vjp(), grad() - self.check_results(grad_result, vjp_result) - - -def jac(grad_fn, f, inputs): - assert grad_fn in [vjp, jvp] - if grad_fn is jvp: - vs = [zeros_like(x) for x in inputs] - else: - outputs = f(*inputs) - if isinstance(outputs, paddle.Tensor): - outputs = [outputs] - vs = [zeros_like(y) for y in outputs] - JJ_cols = [] - for i, v in enumerate(vs): - v = v.flatten() - for j in range(len(v)): - _v = zeros_like(v).detach() - _v[j] = 1.0 - _v = _v.reshape(vs[i].shape) - _vs = vs.copy() - _vs[i] = _v - _, grads = grad_fn(f, inputs, vs) - d_outs = paddle.concat([d_out.flatten() for d_out in grads]) - JJ_cols.append(d_outs) - # JJ is the fully unrolled jacobian - JJ = paddle.stack(JJ_cols) - if grad_fn is vjp: - JJ = JJ.t() - return JJ - - -class TestJVP(TestAutogradFunctional): - def test_jvp_i1o1_no_create_graph(self): - test_cases = [ - [reduce, 'A'], #noqa - [reduce_dim, 'A'], #noqa - ] #noqa - for f, inputs in test_cases: - inputs = self.gen_inputs(inputs) - forward_jac = jac(jvp, f, inputs) - reverse_jac = jac(vjp, f, inputs) - self.check_results(forward_jac, reverse_jac) - - def 
test_jvp_i2o1_no_create_graph(self): - test_cases = [ #noqa - [matmul, ['A', 'B']], #noqa - ] #noqa - for f, inputs in test_cases: - inputs = self.gen_inputs(inputs) - forward_jac = jac(jvp, f, inputs) - reverse_jac = jac(vjp, f, inputs) - self.check_results(forward_jac, reverse_jac) - - def test_jvp_i2o2_no_create_graph(self): - test_cases = [ #noqa - [o2, ['A', 'A']], #noqa - ] #noqa - for f, inputs in test_cases: - inputs = self.gen_inputs(inputs) - forward_jac = jac(jvp, f, inputs) - reverse_jac = jac(vjp, f, inputs) - self.check_results(forward_jac, reverse_jac) - - def test_jvp_i2o2_omitting_v_no_create_graph(self): - test_cases = [ #noqa - [o2, ['A', 'A']], #noqa - ] #noqa - for f, inputs in test_cases: - inputs = self.gen_inputs(inputs) - results_omitting_v = jvp(f, inputs) - v = [ones_like(x) for x in inputs] - results_with_v = jvp(f, inputs, v) - self.check_results(results_omitting_v, results_with_v) - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/autograd/utils.py b/python/paddle/fluid/tests/unittests/autograd/utils.py index b06ce6ed7cca3..0816b57fbf70b 100644 --- a/python/paddle/fluid/tests/unittests/autograd/utils.py +++ b/python/paddle/fluid/tests/unittests/autograd/utils.py @@ -1,22 +1,33 @@ # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import typing +import enum +import sys +import re +import inspect +import functools +import contextlib +import collections import numpy as np import paddle -from paddle.autograd.functional import _tensors +from paddle.autograd.functional import _as_tensors +########################################################## +# Finite Difference Utils +########################################################## def _product(t): if isinstance(t, int): return t @@ -25,7 +36,9 @@ def _product(t): def _get_item(t, idx): - assert isinstance(t, paddle.Tensor), "The first argument t must be Tensor." + assert isinstance( + t, + paddle.fluid.framework.Variable), "The first argument t must be Tensor." assert isinstance(idx, int), "The second argument idx must be an int number." flat_t = paddle.reshape(t, [-1]) @@ -33,7 +46,9 @@ def _get_item(t, idx): def _set_item(t, idx, value): - assert isinstance(t, paddle.Tensor), "The first argument t must be Tensor." + assert isinstance( + t, + paddle.fluid.framework.Variable), "The first argument t must be Tensor." assert isinstance(idx, int), "The second argument idx must be an int number." 
flat_t = paddle.reshape(t, [-1]) @@ -42,8 +57,8 @@ def _set_item(t, idx, value): def _compute_numerical_jacobian(func, xs, delta, np_dtype): - xs = _tensors(xs, "xs") - ys = _tensors(func(*xs), "ys") + xs = list(_as_tensors(xs)) + ys = list(_as_tensors(func(*xs))) fin_size = len(xs) fout_size = len(ys) jacobian = list([] for _ in range(fout_size)) @@ -59,11 +74,11 @@ def _compute_numerical_jacobian(func, xs, delta, np_dtype): orig = _get_item(xs[j], q) x_pos = orig + delta xs[j] = _set_item(xs[j], q, x_pos) - ys_pos = _tensors(func(*xs), "ys_pos") + ys_pos = _as_tensors(func(*xs)) x_neg = orig - delta xs[j] = _set_item(xs[j], q, x_neg) - ys_neg = _tensors(func(*xs), "ys_neg") + ys_neg = _as_tensors(func(*xs)) xs[j] = _set_item(xs[j], q, orig) @@ -76,8 +91,8 @@ def _compute_numerical_jacobian(func, xs, delta, np_dtype): def _compute_numerical_hessian(func, xs, delta, np_dtype): - xs = _tensors(xs, "xs") - ys = _tensors(func(*xs), "ys") + xs = list(_as_tensors(xs)) + ys = list(_as_tensors(func(*xs))) fin_size = len(xs) hessian = list([] for _ in range(fin_size)) for i in range(fin_size): @@ -107,10 +122,22 @@ def _compute_numerical_hessian(func, xs, delta, np_dtype): return hessian -def _compute_numerical_batch_jacobian(func, xs, delta, np_dtype): +def concat_to_matrix(xs, is_batched=False): + """Concats a tuple of tuple of Jacobian/Hessian matrix into one matrix""" + rows = [] + for i in range(len(xs)): + rows.append(np.concatenate([x for x in xs[i]], -1)) + return np.concatenate(rows, 1) if is_batched else np.concatenate(rows, 0) + + +def _compute_numerical_batch_jacobian(func, + xs, + delta, + np_dtype, + merge_batch=True): no_batch_jacobian = _compute_numerical_jacobian(func, xs, delta, np_dtype) - xs = _tensors(xs, "xs") - ys = _tensors(func(*xs), "ys") + xs = list(_as_tensors(xs)) + ys = list(_as_tensors(func(*xs))) fin_size = len(xs) fout_size = len(ys) bs = xs[0].shape[0] @@ -128,7 +155,8 @@ def _compute_numerical_batch_jacobian(func, xs, delta, np_dtype): for b in range(bs): for q in range(in_size): batch_jac_i_j[p][b][q] = jac[b][p][b][q] - batch_jac_i_j = np.reshape(batch_jac_i_j, (out_size, -1)) + if merge_batch: + batch_jac_i_j = np.reshape(batch_jac_i_j, (out_size, -1)) batch_jac_i.append(batch_jac_i_j) bat_jac.append(batch_jac_i) @@ -136,7 +164,7 @@ def _compute_numerical_batch_jacobian(func, xs, delta, np_dtype): def _compute_numerical_batch_hessian(func, xs, delta, np_dtype): - xs = _tensors(xs, "xs") + xs = list(_as_tensors(xs)) batch_size = xs[0].shape[0] fin_size = len(xs) hessian = [] @@ -175,8 +203,10 @@ def _compute_numerical_batch_hessian(func, xs, delta, np_dtype): def _compute_numerical_vjp(func, xs, v, delta, np_dtype): - xs = _tensors(xs, "xs") + xs = _as_tensors(xs) jacobian = np.array(_compute_numerical_jacobian(func, xs, delta, np_dtype)) + if v is None: + v = [paddle.ones_like(x) for x in xs] flat_v = np.array([v_el.numpy().reshape(-1) for v_el in v]) vjp = [np.zeros((_product(x.shape)), dtype=np_dtype) for x in xs] for j in range(len(xs)): @@ -188,7 +218,7 @@ def _compute_numerical_vjp(func, xs, v, delta, np_dtype): def _compute_numerical_vhp(func, xs, v, delta, np_dtype): - xs = _tensors(xs, "xs") + xs = list(_as_tensors(xs)) hessian = np.array(_compute_numerical_hessian(func, xs, delta, np_dtype)) flat_v = np.array([v_el.numpy().reshape(-1) for v_el in v]) vhp = [np.zeros((_product(x.shape)), dtype=np_dtype) for x in xs] @@ -198,3 +228,166 @@ def _compute_numerical_vhp(func, xs, v, delta, np_dtype): flat_v) vhp = [vhp[j].reshape(xs[j].shape) for j in 
range(len(xs))] return vhp + + +########################################################## +# TestCases of different function. +########################################################## +def reduce(x): + return paddle.sum(x) + + +def reduce_dim(x): + return paddle.sum(x, axis=0) + + +def matmul(x, y): + return paddle.matmul(x, y) + + +def mul(x, y): + return x * y + + +def pow(x, y): + return paddle.pow(x, y) + + +def o2(x, y): + return paddle.multiply(x, y), paddle.matmul(x, y.t()) + + +def unuse(x, y): + return paddle.sum(x) + + +def nested(x): + def inner(y): + return x * y + + return inner + + +def square(x): + return x * x + + +########################################################## +# Parameterized Test Utils. +########################################################## + +TEST_CASE_NAME = 'suffix' + + +def place(devices, key='place'): + """A Decorator for a class which will make the class running on different + devices . + + Args: + devices (Sequence[Paddle.CUDAPlace|Paddle.CPUPlace]): Device list. + key (str, optional): Defaults to 'place'. + """ + + def decorate(cls): + module = sys.modules[cls.__module__].__dict__ + raw_classes = { + k: v + for k, v in module.items() if k.startswith(cls.__name__) + } + + for raw_name, raw_cls in raw_classes.items(): + for d in devices: + test_cls = dict(raw_cls.__dict__) + test_cls.update({key: d}) + new_name = raw_name + '.' + d.__class__.__name__ + module[new_name] = type(new_name, (raw_cls, ), test_cls) + del module[raw_name] + return cls + + return decorate + + +def parameterize(fields, values=None): + """Decorator for a unittest class which make the class running on different + test cases. + + Args: + fields (Sequence): The feild name sequence of test cases. + values (Sequence, optional): The test cases sequence. Defaults to None. + + """ + fields = [fields] if isinstance(fields, str) else fields + params = [dict(zip(fields, vals)) for vals in values] + + def decorate(cls): + test_cls_module = sys.modules[cls.__module__].__dict__ + for i, values in enumerate(params): + test_cls = dict(cls.__dict__) + values = { + k: staticmethod(v) if callable(v) else v + for k, v in values.items() + } + test_cls.update(values) + name = cls.__name__ + str(i) + name = name + '.' + \ + values.get('suffix') if values.get('suffix') else name + + test_cls_module[name] = type(name, (cls, ), test_cls) + + for m in list(cls.__dict__): + if m.startswith("test"): + delattr(cls, m) + return cls + + return decorate + + +########################################################## +# Utils for transpose different Jacobian/Hessian matrix format. +########################################################## + +# B is batch size, N is row size, M is column size. 
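The place and parameterize decorators added to utils.py above multiply one TestCase class into several module-level classes: one copy per device and per parameter tuple, with the generated names built from the 'suffix' field. A hypothetical usage sketch follows; the class name, fields, and assertion below are made up for illustration and do not appear in the patch.

import unittest
import numpy as np
import paddle
from utils import place, parameterize    # assumes the utils.py shown above is importable

@place([paddle.CPUPlace()])               # outer: re-registers every generated class per device
@parameterize(('suffix', 'dtype'), (      # inner: registers TestSquare0.float32, TestSquare1.float64
    ('float32', np.float32),
    ('float64', np.float64),
))
class TestSquare(unittest.TestCase):
    def test_forward(self):
        x = paddle.to_tensor(np.arange(4, dtype=self.dtype))
        np.testing.assert_allclose((x * x).numpy(),
                                   np.arange(4, dtype=self.dtype) ** 2)

if __name__ == '__main__':
    unittest.main()

Each generated class also carries a place attribute holding the device it was registered for, which a test body may consult when building tensors.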
+MatrixFormat = enum.Enum('MatrixFormat', ('NBM', 'BNM', 'NMB', 'NM')) + + +def _np_transpose_matrix_format(src, src_format, des_format): + """Transpose Jacobian/Hessian matrix format.""" + supported_format = (MatrixFormat.NBM, MatrixFormat.BNM, MatrixFormat.NMB) + if src_format not in supported_format or des_format not in supported_format: + raise ValueError( + f"Supported Jacobian format is {supported_format}, but got src: {src_format}, des: {des_format}" + ) + + src_axis = {c: i for i, c in enumerate(src_format.name)} + dst_axis = tuple(src_axis[c] for c in des_format.name) + + return np.transpose(src, dst_axis) + + +def _np_concat_matrix_sequence(src, src_format=MatrixFormat.NM): + """Convert a sequence of sequence of Jacobian/Hessian matrix into one huge + matrix.""" + + def concat_col(xs): + if src_format in (MatrixFormat.NBM, MatrixFormat.BNM, MatrixFormat.NM): + return np.concatenate(xs, axis=-1) + else: + return np.concatenate(xs, axis=1) + + def concat_row(xs): + if src_format in (MatrixFormat.NBM, MatrixFormat.NM, MatrixFormat.NMB): + return np.concatenate(xs, axis=0) + else: + return np.concatenate(xs, axis=1) + + supported_format = (MatrixFormat.NBM, MatrixFormat.BNM, MatrixFormat.NMB, + MatrixFormat.NM) + if src_format not in supported_format: + raise ValueError( + f"Supported Jacobian format is {supported_format}, but got {src_format}" + ) + if not isinstance(src, typing.Sequence): + return src + if not isinstance(src[0], typing.Sequence): + src = [src] + return concat_row(tuple(concat_col(xs) for xs in src)) diff --git a/python/paddle/fluid/tests/unittests/check_nan_inf_base.py b/python/paddle/fluid/tests/unittests/check_nan_inf_base.py index 1c5db616306ca..13a7ff6860e4d 100644 --- a/python/paddle/fluid/tests/unittests/check_nan_inf_base.py +++ b/python/paddle/fluid/tests/unittests/check_nan_inf_base.py @@ -103,6 +103,14 @@ def check(use_cuda): if __name__ == '__main__': + try: + check(use_cuda=False) + assert False + except Exception as e: + print(e) + print(type(e)) + assert type(e) == RuntimeError + if core.is_compiled_with_cuda(): try: check(use_cuda=True) @@ -113,10 +121,3 @@ def check(use_cuda): # Note. 
Enforce in cuda kernel may not catch in paddle, and # Exception type will be RuntimeError assert type(e) == OSError or type(e) == RuntimeError - try: - check(use_cuda=False) - assert False - except Exception as e: - print(e) - print(type(e)) - assert type(e) == RuntimeError diff --git a/python/paddle/fluid/tests/unittests/collective_reducescatter.py b/python/paddle/fluid/tests/unittests/collective_reducescatter.py index 8b989c73d4deb..00d4a1c4cf6bd 100644 --- a/python/paddle/fluid/tests/unittests/collective_reducescatter.py +++ b/python/paddle/fluid/tests/unittests/collective_reducescatter.py @@ -48,6 +48,7 @@ def get_model(self, main_prog, startup_program): tindata = layers.data( name="tindata", shape=[10, 1000], dtype='float32') toutdata = fluid.layers.collective._c_reducescatter(tindata, nranks) + toutdata = fluid.layers.collective._c_sync_comm_stream(toutdata, 0) return toutdata diff --git a/python/paddle/fluid/tests/unittests/distributed_passes/dist_pass_test_base.py b/python/paddle/fluid/tests/unittests/distributed_passes/dist_pass_test_base.py index 488e7c809fc39..f0ed2cdc04950 100644 --- a/python/paddle/fluid/tests/unittests/distributed_passes/dist_pass_test_base.py +++ b/python/paddle/fluid/tests/unittests/distributed_passes/dist_pass_test_base.py @@ -32,7 +32,7 @@ def prepare_python_path_and_return_module(path): assert filename.endswith(py_suffix), filename env_name = 'PYTHONPATH' - python_path = env_name + python_path = os.environ.get(env_name, '') if python_path: paths = [p for p in python_path.split(":") if p] if dirname not in paths: @@ -41,6 +41,7 @@ def prepare_python_path_and_return_module(path): else: python_path = path os.environ[env_name] = python_path + print('GLOG_v=', os.environ.get('GLOG_v', None), flush=1) return filename[:-len(py_suffix)] diff --git a/python/paddle/fluid/tests/unittests/dygraph_fleet_api.py b/python/paddle/fluid/tests/unittests/dygraph_fleet_api.py index 2a9d74e4afd4b..de4457a58fb0f 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_fleet_api.py +++ b/python/paddle/fluid/tests/unittests/dygraph_fleet_api.py @@ -57,4 +57,6 @@ def test_dygraph_fleet_api(self): if __name__ == "__main__": + with _test_eager_guard(): + pass unittest.main() diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/CMakeLists.txt b/python/paddle/fluid/tests/unittests/dygraph_to_static/CMakeLists.txt index 7ee5e83e76d6e..f046c7b73927e 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/CMakeLists.txt +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/CMakeLists.txt @@ -1,12 +1,17 @@ file(GLOB TEST_OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "test_*.py") string(REPLACE ".py" "" TEST_OPS "${TEST_OPS}") set(GC_ENVS FLAGS_eager_delete_tensor_gb=0.0) +set(DY2ST_EAGER_TEST_ENVS ${GC_ENVS} FLAGS_enable_eager_mode=1) +set(TEST_EAGER_OPS test_bmn test_break_continue test_ifelse test_loop test_mnist_amp + test_mnist_pure_fp16 test_mobile_net test_program_translator test_ptb_lm test_reinforcement_learning + test_resnet test_resnet_amp test_resnet_pure_fp16 test_se_resnet test_sentiment test_seq2seq + test_tsm test_word2vec test_yolov3 test_bert test_cycle_gan test_lstm test_simnet) list(REMOVE_ITEM TEST_OPS test_lac) # NOTE(Aurelius84): In case of Windows CI, if open ON_INFER, RWLOCK of Scope will # be removed and will cause some random failed in multi-thread. 
if(NOT ON_INFER) - py_test_modules(test_lac MODULES test_lac) + py_test_modules(test_lac MODULES test_lac ENVS FLAGS_enable_eager_mode=1) set_tests_properties(test_lac PROPERTIES TIMEOUT 120) endif() @@ -15,7 +20,12 @@ if(WIN32 AND NOT WITH_GPU) endif() foreach(TEST_OP ${TEST_OPS}) - py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS ${GC_ENVS}) + list(FIND TEST_EAGER_OPS ${TEST_OP} WAS_FOUND) + if (NOT WAS_FOUND EQUAL -1) + py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS ${DY2ST_EAGER_TEST_ENVS}) + else() + py_test_modules(${TEST_OP} MODULES ${TEST_OP} ENVS ${GC_ENVS}) + endif() endforeach(TEST_OP) set_tests_properties(test_se_resnet PROPERTIES TIMEOUT 900) diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bert.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bert.py index 6c26189a4adb3..a9e94ef09b9ac 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bert.py +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bert.py @@ -220,4 +220,5 @@ def verify_predict(self): if __name__ == '__main__': - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bmn.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bmn.py index f69abb1e37669..00af9c96ba9cc 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bmn.py +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_bmn.py @@ -773,4 +773,5 @@ def predict_analysis_inference(self, data): if __name__ == "__main__": - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_mobile_net.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_mobile_net.py index c6f491a5484d9..f58041cbb6c8d 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_mobile_net.py +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_mobile_net.py @@ -628,4 +628,5 @@ def verify_predict(self): if __name__ == '__main__': - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_program_translator.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_program_translator.py index d2c43c31a8839..b0ffbac88fb42 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_program_translator.py +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_program_translator.py @@ -363,4 +363,5 @@ def test_remove_comment(self): if __name__ == '__main__': - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_resnet_v2.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_resnet_v2.py index ae7a588579059..0cf96b7159579 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_resnet_v2.py +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_resnet_v2.py @@ -14,6 +14,8 @@ from __future__ import print_function +import os +os.environ["FLAGS_enable_eager_mode"] = "0" import math import time import unittest diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_sentiment.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_sentiment.py index 74d415cc3eed7..b72894fb14764 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_sentiment.py +++ 
b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_sentiment.py @@ -358,4 +358,5 @@ def test_train(self): if __name__ == '__main__': - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_word2vec.py b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_word2vec.py index dd6cc7e9d3206..f270c5672afc3 100644 --- a/python/paddle/fluid/tests/unittests/dygraph_to_static/test_word2vec.py +++ b/python/paddle/fluid/tests/unittests/dygraph_to_static/test_word2vec.py @@ -303,4 +303,5 @@ def test_dygraph_static_same_loss(self): if __name__ == '__main__': - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/init_process_group.py b/python/paddle/fluid/tests/unittests/init_process_group.py index 90926b1a021d3..c9c957572c515 100644 --- a/python/paddle/fluid/tests/unittests/init_process_group.py +++ b/python/paddle/fluid/tests/unittests/init_process_group.py @@ -37,11 +37,15 @@ def config(self): pass def test_init_process_group(self): - paddle.distributed.collective._init_parallel_env() - paddle.distributed.collective._new_group() - with self.assertRaises(ValueError): - paddle.distributed.collective._new_group( - backend="gloo", group_name="_default_pg") + with _test_eager_guard(): + paddle.distributed.init_parallel_env() + paddle.distributed.new_group() + group = paddle.distributed.new_group([-1, -2]) + assert group.process_group == None + + group = paddle.distributed.collective.Group(-1, 2, 0, [-1, -2]) + ret = paddle.distributed.barrier(group) + assert ret == None print("test ok\n") diff --git a/python/paddle/fluid/tests/unittests/interpreter/test_standalone_executor.py b/python/paddle/fluid/tests/unittests/interpreter/test_standalone_executor.py index cff4f7f41d02b..c07d4cc15bee0 100644 --- a/python/paddle/fluid/tests/unittests/interpreter/test_standalone_executor.py +++ b/python/paddle/fluid/tests/unittests/interpreter/test_standalone_executor.py @@ -277,6 +277,18 @@ def test_compiled_program(self): for x, y in zip(gt, res): self.assertTrue(np.array_equal(x, y)) + def test_empty_program(self): + program = paddle.static.Program() + exe = paddle.static.Executor(self.place) + for i in range(10): + out = exe.run() # old executor + + for i in range(10): + print(i, flush=1) + os.environ['FLAGS_USE_STANDALONE_EXECUTOR'] = '1' + out = exe.run(program, feed=None) + del os.environ['FLAGS_USE_STANDALONE_EXECUTOR'] + class TestException(unittest.TestCase): def setUp(self): diff --git a/python/paddle/fluid/tests/unittests/ipu/test_cast_op_ipu.py b/python/paddle/fluid/tests/unittests/ipu/test_cast_op_ipu.py index 5f0eeaa2f99ab..2de23d95e1c96 100644 --- a/python/paddle/fluid/tests/unittests/ipu/test_cast_op_ipu.py +++ b/python/paddle/fluid/tests/unittests/ipu/test_cast_op_ipu.py @@ -94,6 +94,105 @@ def test_base(self): self.assertTrue(res0.shape == res1.shape) +class TestEnableFp16(TestBase): + def set_atol(self): + self.atol = 1e-10 + + def set_data_feed(self): + self.feed = {"x": np.array([1, 200, 3000, 40000]).astype('int32'), } + + def set_op_attrs(self): + self.attrs = {} + self.attrs['dtype'] = 'float32' + + def _test_base(self, run_ipu=True): + scope = paddle.static.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + + with paddle.static.scope_guard(scope): + with paddle.static.program_guard(main_prog, 
startup_prog): + x = paddle.static.data( + name=self.feed_list[0], + shape=self.feed_shape[0], + dtype=self.feed_dtype[0]) + out = paddle.cast(x, **self.attrs) + fetch_list = [out.name] + + if run_ipu: + place = paddle.IPUPlace() + else: + place = paddle.CPUPlace() + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if run_ipu: + feed_list = self.feed_list + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=self.is_training) + ipu_strategy.set_precision_config(enable_fp16=True) + program = paddle.static.IpuCompiledProgram( + main_prog, + ipu_strategy=ipu_strategy).compile(feed_list, fetch_list) + else: + program = main_prog + + result = exe.run(program, feed=self.feed, fetch_list=fetch_list) + return result[0] + + +class TestDisableTransferCast(TestEnableFp16): + def set_atol(self): + self.atol = 1e-10 + + def set_data_feed(self): + self.feed = {"x": np.array([1, 200, 3000, 40000]).astype('int32'), } + + def set_op_attrs(self): + self.attrs = {} + self.attrs['dtype'] = 'float32' + + def _test_base(self, run_ipu=True): + scope = paddle.static.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + + with paddle.static.scope_guard(scope): + with paddle.static.program_guard(main_prog, startup_prog): + x = paddle.static.data( + name=self.feed_list[0], + shape=self.feed_shape[0], + dtype=self.feed_dtype[0]) + out = paddle.cast(x, **self.attrs) + fetch_list = [out.name] + + if run_ipu: + place = paddle.IPUPlace() + else: + place = paddle.CPUPlace() + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if run_ipu: + feed_list = self.feed_list + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=self.is_training) + ipu_strategy.set_precision_config(enable_fp16=True) + ipu_strategy.set_options({"transfer_cast_op": False}) + program = paddle.static.IpuCompiledProgram( + main_prog, + ipu_strategy=ipu_strategy).compile(feed_list, fetch_list) + else: + program = main_prog + + result = exe.run(program, feed=self.feed, fetch_list=fetch_list) + return result[0] + + class TestCase2(TestBase): def set_atol(self): self.atol = 1e-10 diff --git a/python/paddle/fluid/tests/unittests/ipu/test_eval_model_ipu.py b/python/paddle/fluid/tests/unittests/ipu/test_eval_model_ipu.py new file mode 100644 index 0000000000000..30a4a5370790a --- /dev/null +++ b/python/paddle/fluid/tests/unittests/ipu/test_eval_model_ipu.py @@ -0,0 +1,126 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +import numpy as np +import paddle +import paddle.static +from paddle.fluid.tests.unittests.ipu.op_test_ipu import IPUOpTest + + +@unittest.skipIf(not paddle.is_compiled_with_ipu(), + "core is not compiled with IPU") +class TestBase(IPUOpTest): + def setUp(self): + self.set_atol() + self.set_data_feed() + self.set_feed_attr() + self.set_attrs() + + def set_atol(self): + self.atol = 1e-4 + + def set_data_feed(self): + self.feed = { + "image": np.random.uniform(size=[1, 3, 10, 10]).astype('float32'), + } + + def set_feed_attr(self): + self.feed_shape = [x.shape for x in self.feed.values()] + self.feed_list = list(self.feed.keys()) + self.feed_dtype = [x.dtype for x in self.feed.values()] + + def set_attrs(self): + self.attrs = { + "optimizer": 'lamb', + "weight_decay": 2.0, + } + + def _test_optimizer(self, run_ipu=True): + scope = paddle.static.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + np.random.seed(self.SEED) + + with paddle.static.scope_guard(scope): + with paddle.static.program_guard(main_prog, startup_prog): + image = paddle.static.data( + name='image', shape=[1, 3, 10, 10], dtype='float32') + conv1 = paddle.static.nn.conv2d( + image, num_filters=3, filter_size=3, bias_attr=False) + loss = paddle.mean(conv1) + + weight_decay = self.attrs['weight_decay'] + opt = paddle.optimizer.SGD(learning_rate=1e-1, + weight_decay=weight_decay) + if self.attrs['optimizer'] == 'adam': + opt = paddle.optimizer.Adam( + learning_rate=1e-1, weight_decay=weight_decay) + elif self.attrs['optimizer'] == 'lamb': + + opt = paddle.optimizer.Lamb( + learning_rate=1e-1, lamb_weight_decay=weight_decay) + opt.minimize(loss) + + if run_ipu: + place = paddle.IPUPlace() + else: + place = paddle.CPUPlace() + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if run_ipu: + feed_list = [image.name] + fetch_list = [loss.name] + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=True) + ipu_strategy.set_options({"runtime_options.enable_eval": True}) + program = paddle.static.IpuCompiledProgram( + main_prog, ipu_strategy=ipu_strategy).compile(feed_list, + fetch_list) + else: + program = main_prog + + result = [] + if run_ipu: + for epoch in range(200): + if epoch == 100: + ipu_strategy.set_options({ + "runtime_options.enable_eval": False + }) + loss_res = exe.run(program, + feed=self.feed, + fetch_list=[loss]) + result.append(loss_res) + else: + for epoch in range(100): + loss_res = exe.run(program, + feed=self.feed, + fetch_list=[loss]) + result.append(loss_res) + return np.array(result) + + def test(self): + # cpu and ipu dimenstion mismatch, cpu:(100, 1, 1), ipu:(100, 1) + ipu_loss = self._test_optimizer(True).flatten() + cpu_loss = self._test_optimizer(False).flatten() + self.assertTrue(ipu_loss[0] == ipu_loss[99]) + self.assertTrue(np.allclose(ipu_loss[100:], cpu_loss, atol=self.atol)) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ipu/test_fp16_support_ipu.py b/python/paddle/fluid/tests/unittests/ipu/test_fp16_support_ipu.py new file mode 100644 index 0000000000000..71742deefcd2c --- /dev/null +++ b/python/paddle/fluid/tests/unittests/ipu/test_fp16_support_ipu.py @@ -0,0 +1,211 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest + +import numpy as np +import paddle +import paddle.fluid as fluid +import paddle.static +from paddle.fluid.tests.unittests.ipu.op_test_ipu import IPUOpTest, ExecutionMode + + +@unittest.skipIf(not paddle.is_compiled_with_ipu(), + "core is not compiled with IPU") +class TestBase(IPUOpTest): + def setUp(self): + self.set_atol() + self.set_training() + self.set_data_feed() + self.set_feed_attr() + self.set_op_attrs() + + @property + def fp16_enabled(self): + return True + + def set_atol(self): + self.atol = 5e-6 + self.rtol = 1e-5 + self.atol_fp16 = 1e-2 + self.rtol_fp16 = 1e-3 + + def set_data_feed(self): + np_data = np.random.uniform(low=-1, high=1, size=[1, 3, 100, 100]) + self.feed_fp32 = {"x": np_data.astype('float32')} + self.feed_fp16 = {"x": np_data.astype('float16')} + + def set_feed_attr(self): + self.feed_shape = [x.shape for x in self.feed_fp32.values()] + self.feed_list = list(self.feed_fp32.keys()) + self.feed_dtype = [x.dtype for x in self.feed_fp32.values()] + + def set_op_attrs(self): + self.attrs = {} + + def _test_base(self, exec_mode): + scope = fluid.core.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + + with fluid.scope_guard(scope): + with paddle.static.program_guard(main_prog, startup_prog): + x = paddle.static.data( + name=self.feed_list[0], + shape=self.feed_shape[0], + dtype='float32') + + conv1 = paddle.static.nn.conv2d( + x, num_filters=3, filter_size=3, bias_attr=False) + conv2 = paddle.static.nn.conv2d( + x, num_filters=3, filter_size=3, bias_attr=False) + add1 = conv1 + conv2 + conv3 = paddle.static.nn.conv2d( + add1, num_filters=8, filter_size=8, bias_attr=False) + out = paddle.fluid.layers.relu(conv3, **self.attrs) + + fetch_list = [out.name] + + if exec_mode == ExecutionMode.CPU_FP32: + place = paddle.CPUPlace() + else: + place = paddle.IPUPlace() + + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if exec_mode != ExecutionMode.CPU_FP32: + feed_list = self.feed_list + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=self.is_training) + if exec_mode == ExecutionMode.IPU_POPART_FP16: + ipu_strategy.set_precision_config(enable_fp16=True) + program = paddle.static.IpuCompiledProgram( + main_prog, + ipu_strategy=ipu_strategy).compile(feed_list, fetch_list) + else: + program = main_prog + + feed = self.feed_fp32 + if exec_mode > ExecutionMode.IPU_FP32: + feed = self.feed_fp16 + + result = exe.run(program, feed=feed, fetch_list=fetch_list) + return result[0] + + def test_base(self): + output_dict = {} + for mode in ExecutionMode: + if mode > ExecutionMode.IPU_FP32 and not self.fp16_enabled: + break + output_dict[mode] = self._test_base(mode).flatten() + + self.check(output_dict) + + +class TestIntInput(IPUOpTest): + def setUp(self): + self.set_atol() + self.set_training() + self.set_data_feed() + self.set_feed_attr() + 
self.set_op_attrs() + + @property + def fp16_enabled(self): + return True + + def set_data_feed(self): + embedding = np.random.uniform(size=[10, 20]) + indice = np.array([1, 3, 5]).astype(np.int32) + self.feed_fp32 = { + "embedding": embedding.astype(np.float32), + "indice": indice, + } + self.feed_fp16 = { + "embedding": embedding.astype(np.float16), + "indice": indice, + } + + def set_feed_attr(self): + self.feed_shape = [x.shape for x in self.feed_fp32.values()] + self.feed_list = list(self.feed_fp32.keys()) + self.feed_dtype = [x.dtype for x in self.feed_fp32.values()] + + def set_op_attrs(self): + self.attrs = {} + + def _test_base(self, exec_mode): + scope = fluid.core.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + + with fluid.scope_guard(scope): + with paddle.static.program_guard(main_prog, startup_prog): + x = paddle.static.data( + name=self.feed_list[0], + shape=self.feed_shape[0], + dtype='float32') + y = paddle.static.data( + name=self.feed_list[1], + shape=self.feed_shape[1], + dtype='int32') + + out = paddle.fluid.layers.gather(x, index=y) + + fetch_list = [out.name] + + if exec_mode == ExecutionMode.CPU_FP32: + place = paddle.CPUPlace() + else: + place = paddle.IPUPlace() + + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if exec_mode != ExecutionMode.CPU_FP32: + feed_list = self.feed_list + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=self.is_training) + if exec_mode == ExecutionMode.IPU_POPART_FP16: + ipu_strategy.set_precision_config(enable_fp16=True) + program = paddle.static.IpuCompiledProgram( + main_prog, + ipu_strategy=ipu_strategy).compile(feed_list, fetch_list) + else: + program = main_prog + + feed = self.feed_fp32 + if exec_mode > ExecutionMode.IPU_FP32: + feed = self.feed_fp16 + + result = exe.run(program, feed=feed, fetch_list=fetch_list) + return np.array(result) + + def test_base(self): + output_dict = {} + for mode in ExecutionMode: + if mode > ExecutionMode.IPU_FP32 and not self.fp16_enabled: + break + output_dict[mode] = self._test_base(mode).flatten() + + self.check(output_dict) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ipu/test_gradient_clip_ipu.py b/python/paddle/fluid/tests/unittests/ipu/test_gradient_clip_ipu.py new file mode 100644 index 0000000000000..281baeca09e47 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/ipu/test_gradient_clip_ipu.py @@ -0,0 +1,140 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import unittest + +import numpy as np +import paddle +import paddle.static +from paddle.fluid.tests.unittests.ipu.op_test_ipu import IPUOpTest + + +@unittest.skipIf(not paddle.is_compiled_with_ipu(), + "core is not compiled with IPU") +class TestBase(IPUOpTest): + def setUp(self): + self.set_atol() + self.set_data_feed() + self.set_feed_attr() + self.set_attrs() + + def set_atol(self): + self.atol = 1e-6 + + def set_data_feed(self): + self.feed = { + "image": np.random.uniform(size=[1, 3, 10, 10]).astype('float32'), + } + + def set_feed_attr(self): + self.feed_shape = [x.shape for x in self.feed.values()] + self.feed_list = list(self.feed.keys()) + self.feed_dtype = [x.dtype for x in self.feed.values()] + + def set_attrs(self): + self.attrs = { + "optimizer": 'sgd', + "weight_decay": 0.0, + } + + def _test_optimizer(self, run_ipu=True): + scope = paddle.static.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + np.random.seed(self.SEED) + + with paddle.static.scope_guard(scope): + with paddle.static.program_guard(main_prog, startup_prog): + image = paddle.static.data( + name='image', shape=[1, 3, 10, 10], dtype='float32') + conv1 = paddle.static.nn.conv2d( + image, num_filters=3, filter_size=3, bias_attr=False) + loss = paddle.mean(conv1) + + weight_decay = self.attrs['weight_decay'] + # Only support ClipGradByGlobalNorm + clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0) + if self.attrs['optimizer'] == 'sgd': + opt = paddle.optimizer.SGD(learning_rate=1e-1, + weight_decay=weight_decay, + grad_clip=clip) + elif self.attrs['optimizer'] == 'adam': + opt = paddle.optimizer.Adam( + learning_rate=1e-1, + weight_decay=weight_decay, + grad_clip=clip) + elif self.attrs['optimizer'] == 'lamb': + opt = paddle.optimizer.Lamb( + learning_rate=1e-1, + lamb_weight_decay=weight_decay, + grad_clip=clip) + else: + raise ValueError( + f"Not supported optimizer {self.attrs['optimizer']} for test" + ) + opt.minimize(loss) + + if run_ipu: + place = paddle.IPUPlace() + else: + place = paddle.CPUPlace() + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if run_ipu: + feed_list = [image.name] + fetch_list = [loss.name] + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=True) + program = paddle.static.IpuCompiledProgram( + main_prog, ipu_strategy=ipu_strategy).compile(feed_list, + fetch_list) + else: + program = main_prog + + result = [] + for epoch in range(100): + loss_res = exe.run(program, feed=self.feed, fetch_list=[loss]) + result.append(loss_res) + + return np.array(result) + + def test(self): + # cpu and ipu dimenstion mismatch, cpu:(100, 1, 1), ipu:(100, 1) + ipu_loss = self._test_optimizer(True).flatten() + cpu_loss = self._test_optimizer(False).flatten() + + self.assertTrue(np.allclose(ipu_loss, cpu_loss, atol=self.atol)) + + +class TestAdam(TestBase): + def set_attrs(self): + self.attrs = { + "optimizer": 'adam', + "weight_decay": 0.0, + } + + +class TestLamb(TestBase): + def set_attrs(self): + self.attrs = { + "optimizer": 'lamb', + "weight_decay": 0.1, + } + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ipu/test_ipu_shard_api.py b/python/paddle/fluid/tests/unittests/ipu/test_ipu_shard_api.py new file mode 100644 index 0000000000000..a306a3f7725b5 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/ipu/test_ipu_shard_api.py @@ -0,0 +1,111 @@ +# Copyright (c) 2022 PaddlePaddle 
Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest + +import numpy as np +import paddle + +paddle.enable_static() + + +@unittest.skipIf(not paddle.is_compiled_with_ipu(), + "core is not compiled with IPU") +class TestIpuShard(unittest.TestCase): + def _test(self): + # build graph + a = paddle.static.data(name='data', shape=[None, 1], dtype='int32') + b = a + 2 # scale : scale * x + bias, ipu_index : no + + with paddle.static.ipu_shard_guard(index=1): + c = b + 1 # scale, ipu_index : 1 + with paddle.static.ipu_shard_guard(index=2): + d = c * 2 # scale, ipu_index : 2 + with paddle.static.ipu_shard_guard(index=3): + e = d + 3 # scale, ipu_index : 3 + with paddle.static.ipu_shard_guard(index=1): + e = e + 3 # scale, ipu_index : 1 + with paddle.static.ipu_shard_guard(index=2): + e = e + 3 # scale, ipu_index : 2 + + with paddle.static.ipu_shard_guard(index=1): + f = paddle.tensor.pow(e, 2.0) # pow, ipu_index : 1 + + with paddle.static.ipu_shard_guard(index=2): + g = f - 1 # scale, ipu_index : 2 + + h = g + 1 # scale, ipu_index : no + + ipu_index_list = [] + main_prog = paddle.static.default_main_program() + for op in main_prog.global_block().ops: + if op.desc.has_attr("ipu_index"): + ipu_index_list.append(op.desc.attr("ipu_index")) + + return ipu_index_list + + def test_ipu_shard(self): + ipu_index_list = self._test() + expected_ipu_index_list = [1, 2, 3, 1, 2, 1, 2] + self.assertTrue( + np.allclose( + ipu_index_list, expected_ipu_index_list, atol=0)) + + +@unittest.skipIf(not paddle.is_compiled_with_ipu(), + "core is not compiled with IPU") +class TestIpuPipeline(unittest.TestCase): + def _test(self): + # build graph + a = paddle.static.data(name='data', shape=[None, 1], dtype='int32') + b = a + 2 # scale : scale * x + bias, ipu_stage : no + + with paddle.static.ipu_shard_guard(stage=1): + c = b + 1 # scale, ipu_stage : 1 + with paddle.static.ipu_shard_guard(stage=2): + d = c * 2 # scale, ipu_stage : 2 + with paddle.static.ipu_shard_guard(stage=3): + e = d + 3 # scale, ipu_stage : 3 + with paddle.static.ipu_shard_guard(stage=1): + e = e + 3 # scale, ipu_stage : 1 + with paddle.static.ipu_shard_guard(stage=2): + e = e + 3 # scale, ipu_stage : 2 + + with paddle.static.ipu_shard_guard(stage=1): + f = paddle.tensor.pow(e, 2.0) # pow, ipu_stage : 1 + + with paddle.static.ipu_shard_guard(stage=2): + g = f - 1 # scale, ipu_stage : 2 + + h = g + 1 # scale, ipu_stage : no + + ipu_index_list = [] + main_prog = paddle.static.default_main_program() + for op in main_prog.global_block().ops: + if op.desc.has_attr("ipu_stage"): + ipu_index_list.append(op.desc.attr("ipu_stage")) + + return ipu_index_list + + def test_ipu_shard(self): + ipu_index_list = self._test() + expected_ipu_index_list = [1, 2, 3, 1, 2, 1, 2] + + self.assertTrue( + np.allclose( + ipu_index_list, expected_ipu_index_list, atol=0)) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ipu/test_matmul_serilize_ipu.py 
b/python/paddle/fluid/tests/unittests/ipu/test_matmul_serilize_ipu.py new file mode 100644 index 0000000000000..ddb06400540e3 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/ipu/test_matmul_serilize_ipu.py @@ -0,0 +1,109 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest + +import numpy as np +import paddle +import paddle.static +from paddle.fluid.tests.unittests.ipu.op_test_ipu import IPUOpTest + + +def set_serialize_factor(serialize_factor): + main_prog = paddle.static.default_main_program() + op = main_prog.current_block().ops[-1] + op._set_attr('serialize_factor', serialize_factor) + + +@unittest.skipIf(not paddle.is_compiled_with_ipu() or IPUOpTest.use_ipumodel(), + "core is not compiled with IPU") +class TestBase(IPUOpTest): + def setUp(self): + self.set_atol() + self.set_training() + self.set_data_feed() + self.set_feed_attr() + self.set_op_attrs() + + def set_data_feed(self): + self.feed = { + "x": np.random.uniform(size=[2048, 3072]).astype('float32'), + "y": np.random.uniform(size=[3072, 2048]).astype('float32'), + } + + def set_feed_attr(self): + self.feed_shape = [x.shape for x in self.feed.values()] + self.feed_list = list(self.feed.keys()) + self.feed_dtype = [x.dtype for x in self.feed.values()] + + def set_op_attrs(self): + self.attrs = {"transpose_x": False, "transpose_y": False} + + def _test_base(self, run_ipu=True): + scope = paddle.static.Scope() + main_prog = paddle.static.Program() + startup_prog = paddle.static.Program() + main_prog.random_seed = self.SEED + startup_prog.random_seed = self.SEED + + with paddle.static.scope_guard(scope): + with paddle.static.program_guard(main_prog, startup_prog): + x = paddle.static.data( + name=self.feed_list[0], + shape=self.feed_shape[0], + dtype=self.feed_dtype[0]) + y = paddle.static.data( + name=self.feed_list[1], + shape=self.feed_shape[1], + dtype=self.feed_dtype[1]) + + # decrator maybe the best choice, but need to modify api + out = paddle.matmul(x, y, **self.attrs) + set_serialize_factor(4) + + fetch_list = [out.name] + + if run_ipu: + place = paddle.IPUPlace() + else: + place = paddle.CPUPlace() + exe = paddle.static.Executor(place) + exe.run(startup_prog) + + if run_ipu: + feed_list = self.feed_list + ipu_strategy = paddle.static.IpuStrategy() + ipu_strategy.set_graph_config(is_training=self.is_training) + program = paddle.static.IpuCompiledProgram( + main_prog, + ipu_strategy=ipu_strategy).compile(feed_list, fetch_list) + else: + program = main_prog + + result = exe.run(program, feed=self.feed, fetch_list=fetch_list) + return result[0] + + def test_base(self): + res0 = self._test_base(False) + res1 = self._test_base(True) + + self.assertTrue( + np.allclose( + res0.flatten(), res1.flatten(), atol=self.atol)) + + self.assertTrue(res0.shape == res1.shape) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ipu/test_optimizer_ipu.py 
b/python/paddle/fluid/tests/unittests/ipu/test_optimizer_ipu.py index bc9d05c4a87ec..43f54b52b5c55 100644 --- a/python/paddle/fluid/tests/unittests/ipu/test_optimizer_ipu.py +++ b/python/paddle/fluid/tests/unittests/ipu/test_optimizer_ipu.py @@ -90,7 +90,9 @@ def _test_optimizer(self, run_ipu=True): fetch_list = [loss.name] ipu_strategy = paddle.static.IpuStrategy() ipu_strategy.set_graph_config(is_training=True) - ipu_strategy.loss_scaling = self.attrs["loss_scaling"] + ipu_strategy.set_options({ + 'loss_scaling': self.attrs["loss_scaling"] + }) if "use_no_bias_optimizer" in self.attrs.keys(): ipu_strategy.set_options({ "use_no_bias_optimizer": diff --git a/python/paddle/fluid/tests/unittests/ir/inference/CMakeLists.txt b/python/paddle/fluid/tests/unittests/ir/inference/CMakeLists.txt index 5be531258edac..808821f06cbae 100755 --- a/python/paddle/fluid/tests/unittests/ir/inference/CMakeLists.txt +++ b/python/paddle/fluid/tests/unittests/ir/inference/CMakeLists.txt @@ -91,7 +91,7 @@ set_tests_properties(test_trt_matmul_quant_dequant PROPERTIES TIMEOUT 100) set_tests_properties(test_trt_conv3d_op PROPERTIES TIMEOUT 60) set_tests_properties(test_trt_conv3d_transpose_op PROPERTIES TIMEOUT 60) set_tests_properties(test_trt_nearest_interp_v2_op PROPERTIES TIMEOUT 30) -set_tests_properties(test_trt_multiclass_nms3_op PROPERTIES TIMEOUT 60) +set_tests_properties(test_trt_multiclass_nms3_op PROPERTIES TIMEOUT 60 ENVIRONMENT FLAGS_USE_STANDALONE_EXECUTOR=0) if (WITH_MKLDNN AND TENSORRT_FOUND AND WITH_GPU) set_tests_properties(test_emb_eltwise_layernorm_fuse_pass PROPERTIES TIMEOUT 120) diff --git a/python/paddle/fluid/tests/unittests/ir/inference/test_mkldnn_conv_elementwise_add_fuse_pass.py b/python/paddle/fluid/tests/unittests/ir/inference/test_mkldnn_conv_elementwise_add_fuse_pass.py index 2e84607e2f5c2..58d09a880619c 100644 --- a/python/paddle/fluid/tests/unittests/ir/inference/test_mkldnn_conv_elementwise_add_fuse_pass.py +++ b/python/paddle/fluid/tests/unittests/ir/inference/test_mkldnn_conv_elementwise_add_fuse_pass.py @@ -26,7 +26,7 @@ # the two inputs of elementwise_add are tensor -class TestConvElementwiseAddMkldnnFusePass1(PassAutoScanTest): +class TestConvElementwiseAddMkldnnFusePass(PassAutoScanTest): def is_program_valid(self, program_config: ProgramConfig) -> bool: attrs = [ program_config.ops[i].attrs @@ -125,139 +125,5 @@ def test(self): quant=False, passes=["conv_elementwise_add_mkldnn_fuse_pass"]) -''' -class TestConvElementwiseAddMkldnnFusePass(PassAutoScanTest): - def is_program_valid(self, program_config: ProgramConfig) -> bool: - attrs = [ - program_config.ops[i].attrs - for i in range(len(program_config.ops)) - ] - if "elementwise_weight" in program_config.weights: - if program_config.weights["elementwise_weight"].shape[0] == program_config.inputs["input_data1"].shape[1]: - if attrs[2]['axis'] != 1: - return False - if program_config.weights["elementwise_weight"].shape[0] == program_config.inputs["input_data1"].shape[3]: - if attrs[2]['axis'] != -1: - return False - return True - - def sample_program_config(self, draw): - data_format = draw(st.sampled_from(["NCHW", "NHWC"])) - dilations = draw(st.sampled_from([[1, 1], [2, 2], [1, 2]])) - padding_algorithm = draw(st.sampled_from(["EXPLICIT", "SAME", "VALID"])) - groups = draw(st.sampled_from([1, 2, 4])) - paddings = draw(st.sampled_from([[0, 3], [1, 1], [1, 2, 3, 4]])) - strides = draw(st.sampled_from([[1, 1], [2, 2], [1, 2]])) - axis = draw(st.sampled_from([-1, 0, 1])) - batch_size = draw(st.integers(min_value=1, 
max_value=4)) - - def generate_input1(): - if data_format == "NCHW": - return np.random.random( - [batch_size, 48, 64, 64]).astype(np.float32) - else: - return np.random.random( - [batch_size, 64, 64, 48]).astype(np.float32) - - def generate_weight1(): - return np.random.random( - [48, int(48 / groups), 3, 3]).astype(np.float32) - - def compute_out_shape(padding_alg): - import paddle - import paddle.nn as nn - - x_var = paddle.uniform( - (batch_size, 48, 64, 64), dtype='float32', min=-1., max=1.) - if padding_alg == "EXPLICIT": - conv = nn.Conv2D(48, 48, (3, 3), strides, paddings, dilations, - 1) - else: - conv = nn.Conv2D(48, 48, (3, 3), strides, padding_alg, - dilations, 1) - y_var = conv(x_var) - return y_var.shape - - def generate_weight2(): - return np.random.random([48]).astype(np.float32) - - if compute_out_shape(padding_algorithm) != (batch_size, 48, 64, 64): - axis = 1 - - relu_op = OpConfig( - type="relu", - inputs={"X": ["input_data1"]}, - outputs={"Out": ["sigmoid_out"]}, - attrs={}) - - conv2d_op = OpConfig( - type="conv2d", - inputs={"Input": ["sigmoid_out"], - "Filter": ["conv_weight"]}, - outputs={"Output": ["conv_output"]}, - attrs={ - "data_format": data_format, - "dilations": dilations, - "padding_algorithm": padding_algorithm, - "groups": groups, - "paddings": paddings, - "strides": strides - }) - - if axis == 0: - elt_op = OpConfig( - type="elementwise_add", - inputs={"X": ["input_data1"], - "Y": ["conv_output"]}, - outputs={"Out": ["elementwise_output"]}, - attrs={'axis': axis}) - else: - elt_op = OpConfig( - type="elementwise_add", - inputs={"X": ["conv_output"], - "Y": ["elementwise_weight"]}, - outputs={"Out": ["elementwise_output"]}, - attrs={'axis': axis}) - - model_net = [relu_op, conv2d_op, elt_op] - - if axis == 0: - program_config = ProgramConfig( - ops=model_net, - weights={ - "conv_weight": - TensorConfig(data_gen=partial(generate_weight1)) - }, - inputs={ - "input_data1": - TensorConfig(data_gen=partial(generate_input1)) - }, - outputs=["elementwise_output"]) - else: - program_config = ProgramConfig( - ops=model_net, - weights={ - "conv_weight": - TensorConfig(data_gen=partial(generate_weight1)), - "elementwise_weight": - TensorConfig(data_gen=partial(generate_weight2)) - }, - inputs={ - "input_data1": - TensorConfig(data_gen=partial(generate_input1)) - }, - outputs=["elementwise_output"]) - - return program_config - - def sample_predictor_configs(self, program_config): - config = self.create_inference_config(use_mkldnn=True) - yield config, ["relu", "conv2d"], (1e-5, 1e-5) - - def test(self): - self.run_and_statis( - quant=False, passes=["conv_elementwise_add_mkldnn_fuse_pass"]) -''' - if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_elementwise.py b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_elementwise.py index 047a6094ec1e1..e849496621a10 100644 --- a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_elementwise.py +++ b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_elementwise.py @@ -397,5 +397,139 @@ def test(self): self.run_test() +class TrtConvertElementwiseTest_one_input_corner_case(TrtLayerAutoScanTest): + def is_program_valid(self, program_config: ProgramConfig) -> bool: + return True + + def sample_program_configs(self): + def generate_input(shape): + return np.random.random(shape).astype(np.float32) + + def generate_weight(): + return np.random.randn(32).astype(np.float32) + + for batch in [1, 2, 4]: + for shape 
in [[32], [batch, 32], [batch, 32, 32], + [batch, 32, 16, 32]]: + for op_type in [ + "elementwise_add", "elementwise_mul", "elementwise_sub", + "elementwise_div" + ]: + for axis in [-1 if len(shape) == 1 else 1]: + self.dims = len(shape) + dics = [{"axis": axis}] + ops_config = [{ + "op_type": op_type, + "op_inputs": { + "X": ["weight"], + "Y": ["input_data"] + }, + "op_outputs": { + "Out": ["output_data"] + }, + "op_attrs": dics[0] + }] + ops = self.generate_op_config(ops_config) + + program_config = ProgramConfig( + ops=ops, + weights={ + "weight": + TensorConfig(data_gen=partial(generate_weight)) + }, + inputs={ + "input_data": TensorConfig( + data_gen=partial(generate_input, shape)), + }, + outputs=["output_data"]) + + yield program_config + + def sample_predictor_configs( + self, program_config) -> (paddle_infer.Config, List[int], float): + def generate_dynamic_shape(attrs): + # The input.dims[1] must be equal to the weight's length. + if self.dims == 1: + self.dynamic_shape.min_input_shape = {"input_data": [4]} + self.dynamic_shape.max_input_shape = {"input_data": [256]} + self.dynamic_shape.opt_input_shape = {"input_data": [16]} + elif self.dims == 2: + self.dynamic_shape.min_input_shape = {"input_data": [1, 32]} + self.dynamic_shape.max_input_shape = {"input_data": [4, 32]} + self.dynamic_shape.opt_input_shape = {"input_data": [2, 32]} + elif self.dims == 3: + self.dynamic_shape.min_input_shape = {"input_data": [1, 32, 4]} + self.dynamic_shape.max_input_shape = { + "input_data": [4, 32, 256] + } + self.dynamic_shape.opt_input_shape = {"input_data": [2, 32, 16]} + elif self.dims == 4: + self.dynamic_shape.min_input_shape = { + "input_data": [1, 32, 4, 4] + } + self.dynamic_shape.max_input_shape = { + "input_data": [4, 32, 128, 256] + } + self.dynamic_shape.opt_input_shape = { + "input_data": [2, 32, 32, 16] + } + + def clear_dynamic_shape(): + self.dynamic_shape.max_input_shape = {} + self.dynamic_shape.min_input_shape = {} + self.dynamic_shape.opt_input_shape = {} + + def generate_trt_nodes_num(attrs, dynamic_shape): + if self.dims == 1: + return 0, 3 + return 1, 2 + + attrs = [ + program_config.ops[i].attrs + for i in range(len(program_config.ops)) + ] + + # for static_shape + clear_dynamic_shape() + self.trt_param.precision = paddle_infer.PrecisionType.Float32 + yield self.create_inference_config(), generate_trt_nodes_num( + attrs, False), 1e-5 + self.trt_param.precision = paddle_infer.PrecisionType.Half + yield self.create_inference_config(), generate_trt_nodes_num( + attrs, False), 1e-5 + + # for dynamic_shape + generate_dynamic_shape(attrs) + self.trt_param.precision = paddle_infer.PrecisionType.Float32 + yield self.create_inference_config(), generate_trt_nodes_num(attrs, + True), 1e-5 + self.trt_param.precision = paddle_infer.PrecisionType.Half + yield self.create_inference_config(), generate_trt_nodes_num(attrs, + True), 1e-5 + + def add_skip_trt_case(self): + def teller1(program_config, predictor_config): + input_x_names = program_config.ops[0].inputs["X"] + for weight_name in program_config.weights: + if weight_name in input_x_names: + return True + op_type = program_config.ops[0].type + if op_type in ["elementwise_sub", "elementwise_div"]: + input_y_names = program_config.ops[0].inputs["Y"] + for weight_name in program_config.weights: + if weight_name in input_y_names: + return True + return False + + self.add_skip_case( + teller1, SkipReasons.TRT_NOT_SUPPORT, + "Input X should not be parameters in elementwise op and Input Y should not be parameters in elementwise_sub or 
elementwise_div op" + ) + + def test(self): + self.add_skip_trt_case() + self.run_test() + + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multiclass_nms3.py b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multiclass_nms3.py new file mode 100644 index 0000000000000..b6a3f0c9cb1c6 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multiclass_nms3.py @@ -0,0 +1,181 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from trt_layer_auto_scan_test import TrtLayerAutoScanTest, SkipReasons +from program_config import TensorConfig, ProgramConfig +import numpy as np +import paddle.inference as paddle_infer +from functools import partial +from typing import Optional, List, Callable, Dict, Any, Set +import unittest + + +class TrtConvertMulticlassNMS3Test(TrtLayerAutoScanTest): + def is_program_valid(self, program_config: ProgramConfig) -> bool: + return True + + def create_inference_config(self, use_trt=True) -> paddle_infer.Config: + if use_trt: + config = paddle_infer.Config() + config.disable_glog_info() + config.enable_use_gpu(100, 0) + config.set_optim_cache_dir(self.cache_dir) + config.switch_ir_debug() + config.enable_tensorrt_engine( + max_batch_size=self.trt_param.max_batch_size, + workspace_size=self.trt_param.workspace_size, + min_subgraph_size=self.trt_param.min_subgraph_size, + precision_mode=self.trt_param.precision, + use_static=self.trt_param.use_static, + use_calib_mode=self.trt_param.use_calib_mode) + if len(self.dynamic_shape.min_input_shape + ) != 0 and self.dynamic_shape.min_input_shape.keys( + ) == self.dynamic_shape.max_input_shape.keys( + ) and self.dynamic_shape.min_input_shape.keys( + ) == self.dynamic_shape.opt_input_shape.keys(): + config.set_trt_dynamic_shape_info( + self.dynamic_shape.min_input_shape, + self.dynamic_shape.max_input_shape, + self.dynamic_shape.opt_input_shape, + self.dynamic_shape.disable_trt_plugin_fp16) + return config + else: + config = paddle_infer.Config() + config.switch_ir_debug(True) + config.set_optim_cache_dir(self.cache_dir) + config.disable_glog_info() + return config + + def sample_program_configs(self): + def generate_boxes(batch, num_boxes): + return np.arange( + batch * num_boxes * 4, + dtype=np.float32).reshape([batch, num_boxes, 4]) + + def generate_scores(batch, num_boxes, num_classes): + return np.arange( + batch * num_classes * num_boxes, + dtype=np.float32).reshape([batch, num_classes, num_boxes]) + # return np.random.rand(batch, num_classes, num_boxes).astype(np.float32) + + for batch in [1, 2]: + for num_boxes in [4, 12]: + for num_classes in [2, 6]: + for score_threshold in [0.01, ]: + ops_config = [{ + "op_type": "multiclass_nms3", + "op_inputs": { + "BBoxes": ["input_bboxes"], + "Scores": ["input_scores"], + }, + "op_outputs": { + "Out": ["nms_output_boxes"], + "Index": ["nms_output_index"], + "NmsRoisNum": ["nms_output_num"] + }, + 
"op_attrs": { + "background_label": -1, + "score_threshold": score_threshold, + "nms_top_k": num_boxes, + "keep_top_k": num_boxes, + "nms_threshold": 0.3, + "normalized": False, + "nms_eta": 1.1 + } + }] + ops = self.generate_op_config(ops_config) + program_config = ProgramConfig( + ops=ops, + weights={}, + inputs={ + "input_bboxes": TensorConfig(data_gen=partial( + generate_boxes, batch, num_boxes)), + "input_scores": TensorConfig( + data_gen=partial(generate_scores, batch, + num_boxes, num_classes)) + }, + outputs=[ + "nms_output_boxes", "nms_output_num", + "nms_output_index" + ]) + + yield program_config + + def sample_predictor_configs( + self, program_config) -> (paddle_infer.Config, List[int], float): + def clear_dynamic_shape(): + self.dynamic_shape.min_input_shape = {} + self.dynamic_shape.max_input_shape = {} + self.dynamic_shape.opt_input_shape = {} + + def generate_trt_nodes_num(attrs, dynamic_shape): + return 1, 2 + + attrs = [ + program_config.ops[i].attrs + for i in range(len(program_config.ops)) + ] + + # for static_shape + clear_dynamic_shape() + self.trt_param.precision = paddle_infer.PrecisionType.Float32 + yield self.create_inference_config(), generate_trt_nodes_num( + attrs, False), 1e-5 + self.trt_param.precision = paddle_infer.PrecisionType.Half + yield self.create_inference_config(), generate_trt_nodes_num( + attrs, False), 1e-2 + + def assert_tensors_near(self, + atol: float, + rtol: float, + tensor: Dict[str, np.array], + baseline: Dict[str, np.array]): + # the order of tensorrt outputs are not consistent with paddle + for key, arr in tensor.items(): + if key == "nms_output_index": + continue + if key == "nms_output_boxes": + basline_arr = np.array( + sorted( + baseline[key].reshape((-1, 6)), + key=lambda i: [i[0], i[1]])) + arr = np.array( + sorted( + arr.reshape((-1, 6)), key=lambda i: [i[0], i[1]])) + else: + basline_arr = np.array(baseline[key].reshape((-1, 1))) + arr = np.array(arr.reshape((-1, 1))) + + self.assertTrue( + basline_arr.shape == arr.shape, + "The output shapes are not equal, the baseline shape is " + + str(basline_arr.shape) + ', but got ' + str(arr.shape)) + diff = abs(basline_arr - arr) + self.assertTrue( + np.allclose( + basline_arr, arr, atol=atol, rtol=rtol), + "Output has diff, Maximum absolute error: {}".format( + np.amax(diff))) + + def assert_op_size(self, trt_engine_num, paddle_op_num): + # tensorrt op num is not consistent with paddle + return True + + def test(self): + self.trt_param.workspace_size = 1 << 20 + self.run_test() + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multihead_matmul.py b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multihead_matmul.py index 97a94ef348a67..26066be7dc787 100644 --- a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multihead_matmul.py +++ b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_multihead_matmul.py @@ -491,8 +491,7 @@ def generate_weight2(): "x_num_col_dims": 2, "y_num_col_dims": 1, "enable_int8": True, - "X_scale": 1.0, - "weight_scale": [1.0], + "Input_scale": 1.0, }, { "axis": 2, "out_threshold": 1.0, @@ -504,8 +503,7 @@ def generate_weight2(): "x_num_col_dims": 2, "y_num_col_dims": 1, "enable_int8": True, - "X_scale": 1.0, - "weight_scale": [1.0], + "Input_scale": 1.0, }, { "axis": 2, "out_threshold": 1.0, @@ -517,8 +515,7 @@ def generate_weight2(): "x_num_col_dims": 2, "y_num_col_dims": 1, "enable_int8": True, - "X_scale": 1.0, - "weight_scale": 
[1.0], + "Input_scale": 1.0, }, { "axis": 2, "out_threshold": 1.0, diff --git a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_nearest_interp_v2.py b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_nearest_interp_v2.py index 57d7d70c66a5b..cf8b7b3516b37 100644 --- a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_nearest_interp_v2.py +++ b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_convert_nearest_interp_v2.py @@ -41,7 +41,9 @@ def generate_input(): "data_layout": "NCHW", "interp_method": "nearest", "align_corners": False, + "align_mode": 1, "scale": [2., 2.], + "out_d": 0, "out_h": 0, "out_w": 0 } diff --git a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_multiclass_nms_op.py b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_multiclass_nms_op.py index 3ca6985985985..045261fabb020 100644 --- a/python/paddle/fluid/tests/unittests/ir/inference/test_trt_multiclass_nms_op.py +++ b/python/paddle/fluid/tests/unittests/ir/inference/test_trt_multiclass_nms_op.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. diff --git a/python/paddle/fluid/tests/unittests/mkldnn/test_activation_mkldnn_op.py b/python/paddle/fluid/tests/unittests/mkldnn/test_activation_mkldnn_op.py index 4e4fe69d914fa..44263b89e1616 100644 --- a/python/paddle/fluid/tests/unittests/mkldnn/test_activation_mkldnn_op.py +++ b/python/paddle/fluid/tests/unittests/mkldnn/test_activation_mkldnn_op.py @@ -113,6 +113,7 @@ def setUp(self): super(TestMKLDNNSwishDim2, self).setUp() self.attrs["use_mkldnn"] = True + self.check_eager = False def init_dtype(self): self.dtype = np.float32 @@ -284,6 +285,7 @@ def setUp(self): self.inputs = {'X': OpTest.np_dtype_to_fluid_dtype(x)} self.outputs = {'Out': out} self.attrs = {"use_mkldnn": True, "beta": beta} + self.check_eager = False def init_dtype(self): self.dtype = np.float32 diff --git a/python/paddle/fluid/tests/unittests/op_test.py b/python/paddle/fluid/tests/unittests/op_test.py index ae74fbd1c1e09..cfe0d4e32ef7a 100644 --- a/python/paddle/fluid/tests/unittests/op_test.py +++ b/python/paddle/fluid/tests/unittests/op_test.py @@ -710,6 +710,8 @@ def _calc_python_api_output(self, place, egr_inps=None, egr_oups=None): def prepare_python_api_arguments(api, op_proto_ins, op_proto_attrs, kernel_sig): """ map from `op proto inputs and attrs` to `api input list and api attrs dict` + + NOTE: the op_proto_attrs and op_proto_ins is a default dict. default value is [] """ class Empty: @@ -770,7 +772,9 @@ def parse_attri_value(name, op_inputs, op_attrs): api_params), "Error happens. contack xiongkun03 to solve." inputs_sig, attrs_sig, outputs_sig = kernel_sig inputs_and_attrs = inputs_sig + attrs_sig - input_arguments = [op_proto_ins[name] for name in inputs_sig] + [ + input_arguments = [ + op_proto_ins.get(name, Empty()) for name in inputs_sig + ] + [ parse_attri_value(name, op_proto_ins, op_proto_attrs) for name in attrs_sig ] @@ -781,10 +785,12 @@ def parse_attri_value(name, op_inputs, op_attrs): if arg_name in api_ignore_param_list: results.append(get_default(idx, api_defaults)) else: - assert idx_of_op_proto_arguments < len( - input_arguments), "Assert False." 
- tmp = input_arguments[idx_of_op_proto_arguments] - idx_of_op_proto_arguments += 1 + if (idx_of_op_proto_arguments < len(input_arguments)): + tmp = input_arguments[idx_of_op_proto_arguments] + idx_of_op_proto_arguments += 1 + else: + tmp = Empty() # use the default value + if isinstance(tmp, Empty): results.append(get_default(idx, api_defaults)) else: @@ -812,16 +818,19 @@ def assumption_assert_and_transform(args, inp_num): transform inputs by the following rules: 1. [Tensor] -> Tensor 2. [Tensor, Tensor, ...] -> list of Tensors + 3. None -> None + 4. Others: raise Error only support "X" is list of Tensor, currently don't support other structure like dict. """ - for inp in args[:inp_num]: + inp_args = [[inp] if inp is None else inp + for inp in args[:inp_num]] # convert None -> [None] + for inp in inp_args: assert isinstance( inp, list ), "currently only support `X` is [Tensor], don't support other structure." - args = [ - inp[0] if len(inp) == 1 else inp for inp in args[:inp_num] - ] + args[inp_num:] + args = [inp[0] if len(inp) == 1 else inp + for inp in inp_args] + args[inp_num:] return args def _get_kernel_signature(eager_tensor_inputs, eager_tensor_outputs, @@ -1356,6 +1365,9 @@ def __init__(self, op_test, expect_dict): self.op_test = op_test # stop the op_test object. self.op_type = op_test.op_type + def init(self): + pass + def convert_uint16_to_float(self, actual_np, expect_np): raise NotImplementedError("base class, not implement!") @@ -1387,7 +1399,7 @@ def _compare_numpy(self, name, actual_np, expect_np): rtol=self.rtol if hasattr(self, 'rtol') else 1e-5, equal_nan=equal_nan), "Output (" + name + ") has diff at " + str(place) + " in " + - self.checker_name + " checker") + self.checker_name) def _compare_list(self, name, actual, expect): """ if expect is a tuple, we need to compare list. 
@@ -1403,7 +1415,7 @@ def compare_single_output_with_expect(self, name, expect): # NOTE(zhiqiu): np.allclose([], [1.]) returns True # see details: https://stackoverflow.com/questions/38331703/why-does-numpys-broadcasting-sometimes-allow-comparing-arrays-of-different-leng if expect_np.size == 0: - self.op_test.assertTrue(actual_np.size == 0) + self.op_test.assertTrue(actual_np.size == 0) # }}} self._compare_numpy(name, actual_np, expect_np) if isinstance(expect, tuple): self._compare_list(name, actual, expect) @@ -1431,10 +1443,14 @@ def check(self): the main enter point of Checker class """ + self.init() self.calculate_output() self.compare_outputs_with_expects() class StaticChecker(Checker): + def init(self): + self.checker_name = "static checker" + def calculate_output(self): outs, fetch_list = self.op_test._calc_output( place, no_check_set=no_check_set) @@ -1474,6 +1490,9 @@ def _compare_list(self, name, actual, expect): "Output (" + name + ") has different lod at " + str(place)) class DygraphChecker(Checker): + def init(self): + self.checker_name = "dygraph checker" + def calculate_output(self): self.outputs = self.op_test._calc_dygraph_output( place, no_check_set=no_check_set) @@ -1519,18 +1538,21 @@ def _compare_numpy(self, name, actual_np, expect_np): rtol=self.rtol if hasattr(self, 'rtol') else 1e-5, equal_nan=equal_nan), "Output (" + name + ") has diff at " + str(place) + - " in " + self.checker_name + " checker") + " in " + self.checker_name) class EagerChecker(DygraphChecker): + def init(self): + self.checker_name = "eager checker" + def calculate_output(self): # we only check end2end api when check_eager=True - self.is_python_api_test = True with _test_eager_guard(): + self.is_python_api_test = True eager_dygraph_outs = self.op_test._calc_python_api_output( place) if eager_dygraph_outs is None: - # missing KernelSignature, fall back to eager middle output. self.is_python_api_test = False + # missing KernelSignature, fall back to eager middle output. 
eager_dygraph_outs = self.op_test._calc_dygraph_output( place, no_check_set=no_check_set) self.outputs = eager_dygraph_outs @@ -1915,6 +1937,9 @@ def check_grad_with_place(self, "Gradient Check On %s" % str(place)) if check_dygraph: + # ensure switch into legacy dygraph + g_enable_legacy_dygraph() + dygraph_grad = self._get_dygraph_grad( inputs_to_check, place, output_names, user_defined_grad_outputs, no_grad_set, False) @@ -1928,6 +1953,8 @@ def check_grad_with_place(self, self._assert_is_close(numeric_grads, dygraph_grad, inputs_to_check, max_relative_error, "Gradient Check On %s" % str(place)) + # ensure switch back eager dygraph + g_disable_legacy_dygraph() if check_eager: with fluid.dygraph.base.guard(place): @@ -2065,7 +2092,6 @@ def _get_dygraph_grad(self, inputs={"X": loss_sum}, outputs={"Out": loss}, attrs={'scale': 1.0 / float(len(avg_sum))}) - loss.backward() fetch_list_grad = [] @@ -2080,7 +2106,7 @@ def _get_dygraph_grad(self, grad_outputs = [] for grad_out_value in user_defined_grad_outputs: grad_outputs.append(paddle.to_tensor(grad_out_value)) - # delete the inputs which no need to calculate grad + # delete the inputs which no need to calculate grad for no_grad_val in no_grad_set: del (inputs[no_grad_val]) diff --git a/python/paddle/fluid/tests/unittests/op_test_xpu.py b/python/paddle/fluid/tests/unittests/op_test_xpu.py index 107f340d3a847..4a67af02bcff3 100644 --- a/python/paddle/fluid/tests/unittests/op_test_xpu.py +++ b/python/paddle/fluid/tests/unittests/op_test_xpu.py @@ -54,13 +54,11 @@ def tearDownClass(cls): """Restore random seeds""" def is_empty_grad_op(op_type): - all_op_kernels = core._get_all_register_op_kernels() grad_op = op_type + '_grad' - if grad_op in all_op_kernels.keys(): - grad_op_kernels = all_op_kernels[grad_op] - for grad_op_kernel in grad_op_kernels: - if 'XPU' in grad_op_kernel: - return False + xpu_version = core.get_xpu_device_version(0) + xpu_op_list = core.get_xpu_device_op_list(xpu_version) + if grad_op in xpu_op_list.keys(): + return False return True if cls.dtype == np.float16: @@ -70,9 +68,20 @@ def is_empty_grad_op(op_type): super().tearDownClass() def _get_places(self): - places = [fluid.XPUPlace(0)] + places = [paddle.XPUPlace(0)] return places + def check_output(self, + atol=0.001, + no_check_set=None, + equal_nan=False, + check_dygraph=True, + inplace_atol=None, + check_eager=False): + place = paddle.XPUPlace(0) + self.check_output_with_place(place, atol, no_check_set, equal_nan, + check_dygraph, inplace_atol, check_eager) + def check_output_with_place(self, place, atol=0.001, @@ -82,20 +91,37 @@ def check_output_with_place(self, inplace_atol=None, check_eager=False): self.infer_dtype_from_inputs_outputs(self.inputs, self.outputs) - #xpu not support float64 if self.dtype == np.float64: return - if place == None: - place = paddle.XPUPlace(0) if self.dtype == np.float16: if core.is_float16_supported(place) == False: return + if self.dtype == np.float16: atol = 0.1 return super().check_output_with_place( place, atol, no_check_set, equal_nan, check_dygraph, inplace_atol) + def check_grad(self, + inputs_to_check, + output_names, + no_grad_set=None, + numeric_grad_delta=0.005, + in_place=False, + max_relative_error=0.005, + user_defined_grads=None, + user_defined_grad_outputs=None, + check_dygraph=True, + numeric_place=None, + check_eager=False): + place = paddle.XPUPlace(0) + self.check_grad_with_place(place, inputs_to_check, output_names, + no_grad_set, numeric_grad_delta, in_place, + max_relative_error, user_defined_grads, + 
user_defined_grad_outputs, check_dygraph, + numeric_place, check_eager) + def check_grad_with_place(self, place, inputs_to_check, @@ -116,9 +142,6 @@ def check_grad_with_place(self, self._check_grad_helper() return - if place == None: - place = paddle.XPUPlace(0) - if self.dtype == np.float64: return diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_in_eager_mode.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_in_eager_mode.py deleted file mode 100644 index d48a7f09ce769..0000000000000 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_in_eager_mode.py +++ /dev/null @@ -1,137 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -from __future__ import print_function - -import unittest -import os -import copy -import numpy as np -import random -import socket - -import paddle -import paddle.nn as nn -from paddle.fluid.dygraph.nn import Linear -import paddle.fluid.core as core -from paddle.fluid.framework import _test_eager_guard -import paddle.distributed as dist -from paddle.fluid.dygraph.parallel import ParallelEnv -from paddle.optimizer import SGD -from paddle.fluid.initializer import NumpyArrayInitializer -from test_parallel_dygraph_dataparallel import get_dist_port_from_flags - - -def init_process_group(strategy=None): - nranks = ParallelEnv().nranks - rank = ParallelEnv().local_rank - is_master = True if rank == 0 else False - envs = copy.copy(os.environ.copy()) - port = get_dist_port_from_flags() - store = paddle.fluid.core.TCPStore("127.0.0.1", port, is_master, nranks) - if 'PADDLE_DISTRI_BACKEND' in envs.keys() and envs[ - 'PADDLE_DISTRI_BACKEND'] == 'gloo': - group = core.ProcessGroupGloo(store, rank, nranks) - else: - group = core.ProcessGroupNCCL(store, rank, nranks) - return group - - -class LinearModel(nn.Layer): - def __init__(self, attr_list): - super(LinearModel, self).__init__() - self._linear1 = paddle.nn.Linear( - 50, 30, weight_attr=attr_list[0], bias_attr=False) - self._linear2 = paddle.nn.Linear( - 30, 10, weight_attr=attr_list[1], bias_attr=False) - self._linear3 = paddle.nn.Linear( - 10, 10, weight_attr=attr_list[2], bias_attr=False) - - def forward(self, x): - output = self._linear1(x) - output = self._linear2(output) - output = self._linear3(output) - return output - - -class TestDistTraning(unittest.TestCase): - def test_multiple_gpus(self): - process_group = init_process_group() - self.generate_reducer("float32", process_group) - if paddle.get_device() != "cpu": - self.generate_reducer("float16", process_group) - - def generate_reducer(self, dtype, process_group): - local_rank = ParallelEnv().local_rank - np.random.seed(2022 + local_rank) - paddle.set_default_dtype(dtype) - - w_1 = paddle.ParamAttr(initializer=NumpyArrayInitializer( - np.random.rand(50, 30).astype(dtype))) - w_2 = paddle.ParamAttr(initializer=NumpyArrayInitializer( - np.random.rand(30, 10).astype(dtype))) - w_3 = 
paddle.ParamAttr(initializer=NumpyArrayInitializer( - np.random.rand(10, 10).astype(dtype))) - - attr_list = [w_1, w_2, w_3] - inp = np.random.rand(10, 50).astype(dtype) - - # original reducer - params_a = self.model_train(attr_list, inp) - - # refactored reducer in eager mode - with _test_eager_guard(): - params_b = self.model_train( - attr_list, inp, process_group=process_group) - - for i in range(len(params_a)): - np.testing.assert_allclose(params_a[i].numpy(), params_b[i].numpy()) - - def model_train(self, attr_list, inp, process_group=None): - model = LinearModel(attr_list) - model = paddle.DataParallel(model, process_group=process_group) - optimizer = SGD(learning_rate=0.0003, parameters=model.parameters()) - - x = paddle.to_tensor(inp) - x.stop_gradient = False - - for step in range(10): - y = model(x) - loss = y.mean() - - loss.backward() - optimizer.step() - optimizer.clear_grad() - - return model.parameters() - - -class TestCatchErrors1(unittest.TestCase): - def test_multiple_gpus(self): - linear = paddle.nn.Linear(2, 4) - with _test_eager_guard(): - self.assertRaises(RuntimeError, paddle.DataParallel, linear) - - -class TestCatchErrors2(unittest.TestCase): - def test_multiple_gpus(self): - with _test_eager_guard(): - linear = paddle.nn.Linear(2, 4) - self.assertRaises(RuntimeError, paddle.DataParallel, linear) - - -if __name__ == '__main__': - dist.init_parallel_env() - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_with_pylayer.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_with_pylayer.py index f623ba36dcab5..8ce2275868b39 100644 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_with_pylayer.py +++ b/python/paddle/fluid/tests/unittests/parallel_dygraph_dataparallel_with_pylayer.py @@ -21,7 +21,8 @@ import numpy as np import paddle.distributed as dist from paddle.fluid.dygraph.nn import Linear -from paddle.autograd import PyLayer +from paddle.autograd import PyLayer, EagerPyLayer +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph from paddle.distributed.fleet.utils.hybrid_parallel_util import fused_allreduce_gradients batch = 5 @@ -43,6 +44,20 @@ def backward(ctx, dy): return grad +class cus_tanh_eager(EagerPyLayer): + @staticmethod + def forward(ctx, x): + y = paddle.tanh(x) + ctx.save_for_backward(y) + return y + + @staticmethod + def backward(ctx, dy): + y, = ctx.saved_tensor() + grad = dy * (1 - paddle.square(y)) + return grad + + class SimpleNet(paddle.nn.Layer): def __init__(self, train_id, model_id): super(SimpleNet, self).__init__() @@ -55,7 +70,10 @@ def __init__(self, train_id, model_id): def forward(self, inputs): if self.model_id == 0: - inputs = cus_tanh.apply(inputs) + if in_dygraph_mode(): + inputs = cus_tanh_eager.apply(inputs) + elif _in_legacy_dygraph(): + inputs = cus_tanh.apply(inputs) else: inputs = self.tanh(inputs) diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_gradient_check_in_eager_mode.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_gradient_check_in_eager_mode.py index bf337d486435a..db41236dd5c1d 100644 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_gradient_check_in_eager_mode.py +++ b/python/paddle/fluid/tests/unittests/parallel_dygraph_gradient_check_in_eager_mode.py @@ -36,19 +36,6 @@ out_dim = 20 -def init_process_group(strategy=None): - nranks = ParallelEnv().nranks - rank = ParallelEnv().local_rank - is_master = True if rank == 0 else False - current_env = copy.copy(os.environ.copy()) - 
port = 6175 - if 'PADDLE_DIST_UT_PORT' in current_env.keys(): - port = int(current_env['PADDLE_DIST_UT_PORT']) - store = paddle.fluid.core.TCPStore("127.0.0.1", port, is_master, nranks) - group = core.ProcessGroupNCCL(store, rank, nranks) - return group - - class SimpleNet(fluid.Layer): def __init__(self, train_id): super(SimpleNet, self).__init__() @@ -83,12 +70,9 @@ def forward(self, x): class TestDistTraning(unittest.TestCase): def test_multiple_gpus(self): - dist.init_parallel_env() self.trainer_id = dist.get_rank() - - process_group = init_process_group() - self.pg = process_group with _test_eager_guard(): + self.pg = dist.init_parallel_env() model_a = SimpleNet(self.trainer_id) model_b = SimpleNet(self.trainer_id) @@ -97,13 +81,9 @@ def test_multiple_gpus(self): model_b.set_state_dict(state_dict) model_a = paddle.DataParallel( - model_a, - find_unused_parameters=True, - process_group=process_group) + model_a, find_unused_parameters=True, group=self.pg) model_b = paddle.DataParallel( - model_b, - find_unused_parameters=True, - process_group=process_group) + model_b, find_unused_parameters=True, group=self.pg) ones_input = paddle.ones(shape=(batch, in_dim)) ones_input.stop_gradient = True @@ -150,7 +130,7 @@ def print_trainer_0(self, *args): print(*args) def broadcast_param(self, param, root): - self.pg.broadcast(param, root) + self.pg.process_group.broadcast(param, root) return param def check_gradient(self, params): diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_no_sync.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_no_sync.py index f5af896f73e26..9a3b5ee2f0f3e 100644 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_no_sync.py +++ b/python/paddle/fluid/tests/unittests/parallel_dygraph_no_sync.py @@ -69,18 +69,6 @@ def run_one_loop(self, model, optimizer, batch): loss = out.sum() / len(batch) return loss - def run_trainer(self, args): - if args.eager_mode: - self.run_trainer_in_eager_mode(args) - else: - self.run_trainer_func(args) - - def run_trainer_with_spawn(self, args): - if args.eager_mode: - return self.run_trainer_with_spawn_in_eager_mode(args) - else: - return self.run_trainer_with_spawn_func(args) - def run_trainer_func(self, args): if fluid.core.is_compiled_with_cuda(): device_id = int(os.getenv("FLAGS_selected_gpus", "0")) @@ -103,41 +91,36 @@ def run_trainer_func(self, args): model = paddle.DataParallel( model, find_unused_parameters=args.find_unused_parameters) print_to_err(type(self).__name__, "model built in dygraph") - return self.model_train(args, model, opt, train_reader) - - def run_trainer_in_eager_mode(self, args): - if fluid.core.is_compiled_with_cuda(): - device_id = int(os.getenv("FLAGS_selected_gpus", "0")) - place = fluid.CUDAPlace(device_id) - else: - assert ("Only support CUDAPlace for now.") - - with fluid.dygraph.guard(place): - fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed - np.random.seed(seed) - random.seed(seed) - - with _test_eager_guard(): - model, train_reader, opt = self.get_model() - if args.update_method == "nccl2": - dist.init_parallel_env() - print_to_err( - type(self).__name__, - "begin to prepare context in dygraph with nccl2") - - nranks = ParallelEnv().nranks - rank = ParallelEnv().local_rank - is_master = True if rank == 0 else False - store = paddle.fluid.core.TCPStore( - "127.0.0.1", args.dist_port, is_master, nranks) - group = core.ProcessGroupNCCL(store, rank, nranks) - model = paddle.DataParallel( - model, - process_group=group, - 
find_unused_parameters=args.find_unused_parameters) - print_to_err(type(self).__name__, "model built in dygraph") - return self.model_train(args, model, opt, train_reader) + out_losses = self.model_train(args, model, opt, train_reader) + print_to_out(out_losses) + return out_losses + + def run_trainer_with_spawn_func(self, args): + # 1. enable dygraph + paddle.disable_static() + + # 2. init seed + seed = 90 + paddle.static.default_startup_program().random_seed = seed + paddle.static.default_main_program().random_seed = seed + np.random.seed(seed) + random.seed(seed) + # get trainer id + args.trainer_id = paddle.distributed.get_rank() + + # 3. init parallel env + if args.update_method in ["nccl2", "gloo"]: + paddle.distributed.init_parallel_env() + + # 4. train model + model, train_reader, opt = self.get_model() + if args.update_method in ["nccl2", "gloo"]: + model = paddle.DataParallel( + model, find_unused_parameters=args.find_unused_parameters) + + out_losses = self.model_train(args, model, opt, train_reader) + print_to_out(out_losses) + return out_losses def model_train(self, args, model, opt, train_reader): out_losses = [] @@ -157,12 +140,8 @@ def model_train(self, args, model, opt, train_reader): loss = self.run_one_loop(model, opt, data) loss.backward() opt.minimize(loss) - print_to_err( - type(self).__name__, - "loss at step %d: %f" % (step_id, loss.numpy())) out_losses.append(loss.numpy()) model.clear_gradients() - print_to_out(out_losses) return out_losses diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding.py index 226f1293ef688..33ae0acf43d12 100644 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding.py +++ b/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding.py @@ -42,7 +42,6 @@ def __init__(self, dtype=dtype, is_sparse=is_sparse, param_attr=fluid.ParamAttr( - name='embedding_param', initializer=fluid.initializer.UniformInitializer( low=-init_scale, high=init_scale))) self.softmax_weight = self.create_parameter( @@ -103,8 +102,8 @@ def get_model(self): train_reader = paddle.batch( fake_sample_reader(), batch_size=batch_size, drop_last=True) - optimizer = fluid.optimizer.SGD(learning_rate=0.001, - parameter_list=model.parameters()) + optimizer = paddle.optimizer.SGD(learning_rate=0.001, + parameters=model.parameters()) return model, train_reader, optimizer diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding_fp64.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding_fp64.py index a15b263a29508..b341a227285b1 100644 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding_fp64.py +++ b/python/paddle/fluid/tests/unittests/parallel_dygraph_sparse_embedding_fp64.py @@ -40,7 +40,6 @@ def __init__(self, self.hidden_size, sparse=True, weight_attr=paddle.ParamAttr( - name='embedding_param', initializer=paddle.nn.initializer.Uniform( low=-init_scale, high=init_scale))) self.softmax_weight = self.create_parameter( diff --git a/python/paddle/fluid/tests/unittests/parallel_dygraph_unused_variables.py b/python/paddle/fluid/tests/unittests/parallel_dygraph_unused_variables.py index 9f877381101e9..b4dd03aecfaf3 100644 --- a/python/paddle/fluid/tests/unittests/parallel_dygraph_unused_variables.py +++ b/python/paddle/fluid/tests/unittests/parallel_dygraph_unused_variables.py @@ -39,7 +39,6 @@ def __init__(self, self.hidden_size, sparse=is_sparse, weight_attr=paddle.ParamAttr( - 
name='embedding_param', initializer=paddle.nn.initializer.Uniform( low=-init_scale, high=init_scale))) self.softmax_weight = self.create_parameter( diff --git a/python/paddle/fluid/tests/unittests/process_group_gloo.py b/python/paddle/fluid/tests/unittests/process_group_gloo.py index b1f3a71ab3e94..03886ab8a147f 100644 --- a/python/paddle/fluid/tests/unittests/process_group_gloo.py +++ b/python/paddle/fluid/tests/unittests/process_group_gloo.py @@ -45,7 +45,7 @@ def test_create_process_group_gloo(self): nranks = ParallelEnv().nranks rank = ParallelEnv().local_rank is_master = True if rank == 0 else False - store = paddle.fluid.core.TCPStore("127.0.0.1", 6172, is_master, + store = paddle.fluid.core.TCPStore("127.0.0.1", 6272, is_master, nranks, datetime.timedelta(0)) pg = paddle.fluid.core.ProcessGroupGloo(store, rank, nranks) diff --git a/python/paddle/fluid/tests/unittests/process_group_nccl.py b/python/paddle/fluid/tests/unittests/process_group_nccl.py index b1da0777feb3d..7ae38b3bbc4d2 100644 --- a/python/paddle/fluid/tests/unittests/process_group_nccl.py +++ b/python/paddle/fluid/tests/unittests/process_group_nccl.py @@ -26,16 +26,16 @@ import paddle.fluid.core as core from paddle.fluid.framework import _test_eager_guard from paddle.fluid.dygraph.parallel import ParallelEnv +import paddle.distributed as dist def init_process_group(strategy=None): nranks = ParallelEnv().nranks rank = ParallelEnv().local_rank is_master = True if rank == 0 else False - store = paddle.fluid.core.TCPStore("127.0.0.1", 6173, is_master, nranks) - pg_group = core.ProcessGroupNCCL(store, rank, nranks) + pg_group = dist.init_parallel_env() - return pg_group + return pg_group.process_group class TestProcessGroupFp32(unittest.TestCase): @@ -68,12 +68,10 @@ def test_create_process_group_nccl(self): sum_result = tensor_x + tensor_y if pg.rank() == 0: - task = pg.allreduce(tensor_x) - task.wait() + task = dist.all_reduce(tensor_x) assert np.array_equal(tensor_x, sum_result) else: - task = pg.allreduce(tensor_y) - task.wait() + task = dist.all_reduce(tensor_y) assert np.array_equal(tensor_y, sum_result) print("test allreduce sum api ok") @@ -89,16 +87,41 @@ def test_create_process_group_nccl(self): max_result = paddle.maximum(tensor_x, tensor_y) if pg.rank() == 0: - task = pg.allreduce(tensor_x, core.ReduceOp.MAX) + task = dist.all_reduce( + tensor_x, dist.ReduceOp.MAX, use_calc_stream=False) task.wait() assert np.array_equal(tensor_x, max_result) else: - task = pg.allreduce(tensor_y, core.ReduceOp.MAX) + task = dist.all_reduce( + tensor_y, dist.ReduceOp.MAX, use_calc_stream=False) task.wait() assert np.array_equal(tensor_y, max_result) print("test allreduce max api ok") + # test allreduce min + # rank 0 + x = np.random.random(self.shape).astype(self.dtype) + tensor_x = paddle.to_tensor(x) + # rank 1 + y = np.random.random(self.shape).astype(self.dtype) + tensor_y = paddle.to_tensor(y) + + min_result = paddle.minimum(tensor_x, tensor_y) + + if pg.rank() == 0: + task = dist.all_reduce( + tensor_x, dist.ReduceOp.MIN, use_calc_stream=False) + task.wait() + assert np.array_equal(tensor_x, min_result) + else: + task = dist.all_reduce( + tensor_y, dist.ReduceOp.MIN, use_calc_stream=False) + task.wait() + assert np.array_equal(tensor_y, min_result) + + print("test allreduce min api ok") + # test broadcast # rank 0 x = np.random.random(self.shape).astype(self.dtype) @@ -109,16 +132,14 @@ def test_create_process_group_nccl(self): broadcast_result = paddle.assign(tensor_x) if pg.rank() == 0: - task = pg.broadcast(tensor_x, 0) + 
task = dist.broadcast(tensor_x, 0, use_calc_stream=False) task.synchronize() paddle.device.cuda.synchronize() assert task.is_completed() assert np.array_equal(broadcast_result, tensor_x) else: - task = pg.broadcast(tensor_y, 0) - task.synchronize() + task = dist.broadcast(tensor_y, 0) paddle.device.cuda.synchronize() - assert task.is_completed() assert np.array_equal(broadcast_result, tensor_y) print("test broadcast api ok") @@ -126,8 +147,7 @@ def test_create_process_group_nccl(self): # test barrier # rank 0 if pg.rank() == 0: - task = pg.barrier() - task.wait() + dist.barrier() # rank 1 else: task = pg.barrier() @@ -151,9 +171,13 @@ def test_create_process_group_nccl(self): paddle.device.cuda.synchronize() # rank 1 else: - task = pg.all_gather(tensor_y, tensor_out) - task.wait() + tensor_out_list = [ + paddle.empty_like(tensor_x), paddle.empty_like(tensor_x) + ] + task = dist.all_gather( + tensor_out_list, tensor_y, use_calc_stream=False) paddle.device.cuda.synchronize() + tensor_out = paddle.concat(tensor_out_list) out_1 = paddle.slice(tensor_out, [0], [0], [out_shape[0] // 2]) out_2 = paddle.slice(tensor_out, [0], [out_shape[0] // 2], [out_shape[0]]) @@ -178,12 +202,14 @@ def test_create_process_group_nccl(self): if pg.rank() == 0: task = pg.alltoall(tensor_x, tensor_out1) task.wait() - paddle.device.cuda.synchronize() # rank 1 else: - task = pg.alltoall(tensor_y, tensor_out2) - task.wait() + in_1, in_2 = paddle.split(tensor_y, 2) + out_1, out_2 = paddle.split(tensor_out2, 2) + out_tensor_list = [out_1, out_2] + task = dist.alltoall([in_1, in_2], out_tensor_list) paddle.device.cuda.synchronize() + tensor_out2 = paddle.concat(out_tensor_list) out1_2 = paddle.slice(tensor_out1, [0], [self.shape[0] // 2], [self.shape[0]]) out2_1 = paddle.slice(tensor_out2, [0], [0], [self.shape[0] // 2]) @@ -201,18 +227,61 @@ def test_create_process_group_nccl(self): tensor_y = paddle.to_tensor(y) sum_result = tensor_x + tensor_y if pg.rank() == 0: - task = pg.reduce(tensor_x, 0) - task.wait() + task = dist.reduce(tensor_x, 0, use_calc_stream=True) paddle.device.cuda.synchronize() # rank 1 else: - task = pg.reduce(tensor_y, 0) + task = dist.reduce(tensor_y, 0, use_calc_stream=False) task.wait() paddle.device.cuda.synchronize() if pg.rank() == 0: assert np.array_equal(tensor_x, sum_result) print("test reduce sum api ok\n") + # test reduce max + # rank 0 + x = np.random.random(self.shape).astype(self.dtype) + tensor_x = paddle.to_tensor(x) + # rank 1 + y = np.random.random(self.shape).astype(self.dtype) + tensor_y = paddle.to_tensor(y) + + max_result = paddle.maximum(tensor_x, tensor_y) + + if pg.rank() == 0: + task = dist.reduce( + tensor_x, 0, dist.ReduceOp.MAX, use_calc_stream=False) + task.wait() + assert np.array_equal(tensor_x, max_result) + else: + task = dist.reduce( + tensor_y, 0, dist.ReduceOp.MAX, use_calc_stream=False) + task.wait() + + print("test reduce max api ok") + + # test reduce min + # rank 0 + x = np.random.random(self.shape).astype(self.dtype) + tensor_x = paddle.to_tensor(x) + # rank 1 + y = np.random.random(self.shape).astype(self.dtype) + tensor_y = paddle.to_tensor(y) + + min_result = paddle.minimum(tensor_x, tensor_y) + + if pg.rank() == 0: + task = dist.reduce( + tensor_x, 0, dist.ReduceOp.MIN, use_calc_stream=False) + task.wait() + assert np.array_equal(tensor_x, min_result) + else: + task = dist.reduce( + tensor_y, 0, dist.ReduceOp.MIN, use_calc_stream=False) + task.wait() + + print("test reduce min api ok") + # test Scatter # rank 0 in_shape = list(self.shape) @@ -222,12 
+291,14 @@ def test_create_process_group_nccl(self): tensor_x = paddle.to_tensor(x) tensor_y = paddle.to_tensor(y) if pg.rank() == 0: - task = pg.scatter(tensor_x, tensor_y, 0) - task.wait() + in_1, in_2 = paddle.split(tensor_x, 2) + task = dist.scatter( + tensor_y, [in_1, in_2], 0, use_calc_stream=True) + #task.wait() paddle.device.cuda.synchronize() # rank 1 else: - task = pg.scatter(tensor_x, tensor_y, 0) + task = dist.scatter(tensor_y, [], 0, use_calc_stream=False) task.wait() paddle.device.cuda.synchronize() out1 = paddle.slice(tensor_x, [0], [0], [self.shape[0]]) @@ -239,6 +310,40 @@ def test_create_process_group_nccl(self): assert np.array_equal(tensor_y, out2) print("test scatter api ok\n") + # test send min + # rank 0 + x = np.random.random(self.shape).astype(self.dtype) + tensor_x = paddle.to_tensor(x) + # rank 1 + y = np.random.random(self.shape).astype(self.dtype) + tensor_y = paddle.to_tensor(y) + + if pg.rank() == 0: + task = dist.send(tensor_x, 1, use_calc_stream=False) + task.wait() + else: + task = dist.recv(tensor_y, 0, use_calc_stream=False) + task.wait() + assert np.array_equal(tensor_y, tensor_x) + + print("test send api ok") + + # test send min + # rank 0 + x = np.random.random(self.shape).astype(self.dtype) + tensor_x = paddle.to_tensor(x) + # rank 1 + y = np.random.random(self.shape).astype(self.dtype) + tensor_y = paddle.to_tensor(y) + + if pg.rank() == 0: + task = dist.send(tensor_x, 1, use_calc_stream=True) + else: + task = dist.recv(tensor_y, 0, use_calc_stream=True) + assert np.array_equal(tensor_y, tensor_x) + + print("test send api ok") + class TestProcessGroupFp16(TestProcessGroupFp32): def setUp(self): diff --git a/python/paddle/fluid/tests/unittests/spawn_runner_base.py b/python/paddle/fluid/tests/unittests/spawn_runner_base.py index e7057f95d28de..11f8cd559d1a6 100644 --- a/python/paddle/fluid/tests/unittests/spawn_runner_base.py +++ b/python/paddle/fluid/tests/unittests/spawn_runner_base.py @@ -21,7 +21,7 @@ # used by model.run_trainer in test_dist_base from test_dist_base import RUN_STEP -from test_parallel_dygraph_dataparallel import get_dist_port_from_flags +from paddle.fluid.framework import _test_eager_guard # NOTE: compatible TestParallelDyGraphRunnerBase args @@ -29,8 +29,6 @@ class SpawnAssistTestArgs(object): update_method = "local" trainer_id = 0 find_unused_parameters = False - eager_mode = False - dist_port = get_dist_port_from_flags() class TestDistSpawnRunner(unittest.TestCase): @@ -55,14 +53,17 @@ def _run_parallel(self, model, args): result_list.append(res_queue.get()) return result_list - def _args_config(self, args): - return - def check_dist_result_with_spawn(self, test_class, delta=1e-3): + with _test_eager_guard(): + self.check_dist_result_with_spawn_func( + test_class=test_class, delta=delta) + self.check_dist_result_with_spawn_func( + test_class=test_class, delta=delta) + + def check_dist_result_with_spawn_func(self, test_class, delta=1e-3): # 0. prepare model and args model = test_class() args = SpawnAssistTestArgs() - self._args_config(args) # 1. 
calc signal card loss losses = self._run(model, args) diff --git a/python/paddle/fluid/tests/unittests/test_Tensor_type.py b/python/paddle/fluid/tests/unittests/test_Tensor_type.py index f1427d29782b9..c40981c073724 100644 --- a/python/paddle/fluid/tests/unittests/test_Tensor_type.py +++ b/python/paddle/fluid/tests/unittests/test_Tensor_type.py @@ -18,10 +18,11 @@ import numpy as np import paddle import paddle.fluid.core as core +from paddle.fluid.framework import _test_eager_guard class TensorTypeTest(unittest.TestCase): - def test_type_totensor(self): + def func_type_totensor(self): paddle.disable_static() inx = np.array([1, 2]) tensorx = paddle.to_tensor(inx) @@ -29,7 +30,12 @@ def test_type_totensor(self): expectx = "" self.assertEqual((typex_str == expectx), True) - def test_type_Tensor(self): + def test_type_totensor(self): + with _test_eager_guard(): + self.func_type_totensor() + self.func_type_totensor() + + def func_type_Tensor(self): paddle.disable_static() inx = np.array([1, 2]) tensorx = paddle.Tensor(inx) @@ -43,7 +49,12 @@ def test_type_Tensor(self): expectx = "" self.assertEqual((typex_str == expectx), True) - def test_type_core(self): + def test_type_Tensor(self): + with _test_eager_guard(): + self.func_type_Tensor() + self.func_type_Tensor() + + def func_type_core(self): paddle.disable_static() inx = np.array([1, 2]) tensorx = core.VarBase(inx) @@ -56,6 +67,11 @@ def test_type_core(self): expectx = "" self.assertEqual((typex_str == expectx), True) + def test_type_core(self): + with _test_eager_guard(): + pass + self.func_type_core() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_activation_op.py b/python/paddle/fluid/tests/unittests/test_activation_op.py index 471d0245aa83c..58d8610ee352d 100755 --- a/python/paddle/fluid/tests/unittests/test_activation_op.py +++ b/python/paddle/fluid/tests/unittests/test_activation_op.py @@ -25,6 +25,7 @@ import paddle.fluid as fluid import paddle.fluid.core as core from paddle.fluid import compiler, Program, program_guard +from paddle.fluid.framework import _test_eager_guard paddle.enable_static() @@ -50,6 +51,8 @@ def setUp(self): self.op_type = "exp" self.init_dtype() self.init_kernel_type() + self.check_eager = True + self.python_api = paddle.exp np.random.seed(2049) x = np.random.uniform(0.1, 1, [11, 17]).astype(self.dtype) @@ -59,12 +62,18 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + check_eager = False + if hasattr(self, 'check_eager'): + check_eager = self.check_eager + self.check_output(check_eager=check_eager) def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + check_eager = False + if hasattr(self, 'check_eager'): + check_eager = self.check_eager + self.check_grad(['X'], 'Out', check_eager=check_eager) def init_dtype(self): self.dtype = np.float64 @@ -76,6 +85,7 @@ def init_kernel_type(self): class TestExpm1(TestActivation): def setUp(self): self.op_type = "expm1" + self.python_api = paddle.expm1 self.init_dtype() np.random.seed(2049) @@ -86,7 +96,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) + + def test_check_output(self): + self.check_output(check_eager=True) class TestExpm1API(unittest.TestCase): @@ -876,6 +889,8 @@ def ref_softshrink(x, threshold=0.5): class TestSoftshrink(TestActivation): def setUp(self): self.op_type = "softshrink" + self.check_eager = True + 
self.python_api = paddle.nn.functional.softshrink self.init_dtype() threshold = 0.8 @@ -890,7 +905,7 @@ def setUp(self): def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSoftshrinkAPI(unittest.TestCase): @@ -1009,6 +1024,7 @@ def test_check_grad(self): class TestRsqrt(TestActivation): def setUp(self): self.op_type = "rsqrt" + self.python_api = paddle.rsqrt self.init_dtype() np.random.seed(1024) @@ -1021,7 +1037,8 @@ def setUp(self): def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out', max_relative_error=0.0005) + self.check_grad( + ['X'], 'Out', max_relative_error=0.0005, check_eager=True) class TestAbs(TestActivation): @@ -1050,6 +1067,8 @@ def test_check_grad(self): class TestCeil(TestActivation): def setUp(self): self.op_type = "ceil" + self.check_eager = True + self.python_api = paddle.ceil self.init_dtype() np.random.seed(1024) @@ -1067,6 +1086,8 @@ def test_check_grad(self): class TestFloor(TestActivation): def setUp(self): self.op_type = "floor" + self.check_eager = True + self.python_api = paddle.floor self.init_dtype() np.random.seed(1024) @@ -1263,6 +1284,8 @@ def test_check_grad(self): class TestRound(TestActivation): def setUp(self): self.op_type = "round" + self.check_eager = True + self.python_api = paddle.round self.init_dtype() np.random.seed(1024) @@ -1733,7 +1756,7 @@ class TestHardSwish(TestActivation): def setUp(self): self.op_type = 'hard_swish' self.init_dtype() - + self.python_api = paddle.nn.functional.hardswish skip_check_grad_ci(reason="not implemented yet") np.random.seed(1024) @@ -1755,7 +1778,10 @@ def test_check_grad(self): return return # not implemented yet - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) + + def test_check_output(self): + self.check_output(check_eager=True) class TestHardswishAPI(unittest.TestCase): @@ -1816,6 +1842,11 @@ def test_errors(self): name='x_fp16', shape=[12, 10], dtype='float16') F.hardswish(x_fp16) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_dygraph_api() + self.test_errors() + class TestSoftRelu(TestActivation): def setUp(self): @@ -2075,6 +2106,8 @@ def test_check_output(self): class TestLog(TestActivation): def setUp(self): self.op_type = "log" + self.check_eager = True + self.python_api = paddle.log self.init_dtype() np.random.seed(1024) @@ -2087,7 +2120,7 @@ def setUp(self): def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def test_error(self): in1 = fluid.layers.data( @@ -2102,6 +2135,8 @@ def test_error(self): class TestLog2(TestActivation): def setUp(self): self.op_type = "log2" + self.check_eager = True + self.python_api = paddle.log2 self.init_dtype() x = np.random.uniform(0.1, 1, [11, 17]).astype(self.dtype) @@ -2113,7 +2148,7 @@ def setUp(self): def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def test_error(self): in1 = paddle.static.data(name="in1", shape=[11, 17], dtype="int32") @@ -2151,6 +2186,8 @@ def test_api(self): class TestLog10(TestActivation): def setUp(self): self.op_type = "log10" + self.check_eager = True + self.python_api = paddle.log10 self.init_dtype() x = np.random.uniform(0.1, 1, [11, 17]).astype(self.dtype) @@ -2162,7 +2199,7 @@ def setUp(self): def test_check_grad(self): if self.dtype == 
np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def test_error(self): in1 = paddle.static.data(name="in1", shape=[11, 17], dtype="int32") @@ -2200,6 +2237,8 @@ def test_api(self): class TestLog1p(TestActivation): def setUp(self): self.op_type = "log1p" + self.check_eager = True + self.python_api = paddle.log1p self.init_dtype() np.random.seed(1024) @@ -2212,7 +2251,7 @@ def setUp(self): def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def test_api(self): with fluid.program_guard(fluid.Program(), fluid.Program()): @@ -2298,6 +2337,8 @@ def test_check_grad(self): class TestPow(TestActivation): def setUp(self): self.op_type = "pow" + self.python_api = paddle.pow + self.check_eager = True self.init_dtype() np.random.seed(1024) @@ -2308,15 +2349,20 @@ def setUp(self): self.attrs = {'factor': 3.0} self.outputs = {'Out': out} + def test_check_output(self): + self.check_output(check_eager=self.check_eager) + def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=self.check_eager) class TestPow_factor_tensor(TestActivation): def setUp(self): self.op_type = "pow" + self.check_eager = False + self.python_api = paddle.pow self.init_dtype() np.random.seed(1024) @@ -2332,12 +2378,12 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=self.check_eager) def test_api(self): input = np.random.uniform(1, 2, [11, 17]).astype("float32") @@ -2894,7 +2940,9 @@ def ref_swish(x): class TestSwish(TestActivation): def setUp(self): self.op_type = "swish" + self.python_api = paddle.nn.functional.swish self.init_dtype() + self.check_eager = True np.random.seed(1024) x = np.random.uniform(-1, 1, [10, 12]).astype(self.dtype) @@ -2906,7 +2954,10 @@ def setUp(self): def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + check_eager = False + if hasattr(self, 'check_eager'): + check_eager = self.check_eager + self.check_grad(['X'], 'Out', check_eager=check_eager) class TestSwishAPI(unittest.TestCase): @@ -2941,6 +2992,10 @@ def test_dygraph_api(self): self.assertEqual(np.allclose(out_ref, r.numpy()), True) paddle.enable_static() + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_dygraph_api() + def test_fluid_api(self): paddle.enable_static() with fluid.program_guard(fluid.Program()): @@ -2975,6 +3030,7 @@ def ref_mish(x, threshold=20.): class TestMish(TestActivation): def setUp(self): self.op_type = "mish" + self.python_api = paddle.fluid.layers.nn.mish self.init_dtype() np.random.seed(1024) @@ -2983,10 +3039,13 @@ def setUp(self): self.inputs = {'X': x} self.outputs = {'Out': out} + def test_check_output(self): + self.check_output(check_eager=True) + def test_check_grad(self): if self.dtype == np.float16: return - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestMishAPI(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_adamax_api.py b/python/paddle/fluid/tests/unittests/test_adamax_api.py index 57cb9d3cb5f7d..1698ac90a9f2d 100644 --- a/python/paddle/fluid/tests/unittests/test_adamax_api.py +++ 
b/python/paddle/fluid/tests/unittests/test_adamax_api.py @@ -19,10 +19,11 @@ from op_test import OpTest import paddle import paddle.fluid as fluid +from paddle.fluid.framework import _test_eager_guard class TestAdamaxAPI(unittest.TestCase): - def test_adamax_api_dygraph(self): + def func_adamax_api_dygraph(self): paddle.disable_static() value = np.arange(26).reshape(2, 13).astype("float32") a = paddle.to_tensor(value) @@ -36,7 +37,12 @@ def test_adamax_api_dygraph(self): adam.step() adam.clear_gradients() - def test_adamax_api(self): + def test_adamax_api_dygraph(self): + with _test_eager_guard(): + self.func_adamax_api_dygraph() + self.func_adamax_api_dygraph() + + def func_adamax_api(self): paddle.enable_static() place = fluid.CPUPlace() shape = [2, 3, 8, 8] @@ -63,9 +69,14 @@ def test_adamax_api(self): rets = exe.run(train_prog, feed={"data": data_np}, fetch_list=[loss]) assert rets[0] is not None + def test_adamax_api(self): + with _test_eager_guard(): + self.func_adamax_api() + self.func_adamax_api() + class TestAdamaxAPIGroup(TestAdamaxAPI): - def test_adamax_api_dygraph(self): + def func_adamax_api_dygraph(self): paddle.disable_static() value = np.arange(26).reshape(2, 13).astype("float32") a = paddle.to_tensor(value) @@ -89,6 +100,11 @@ def test_adamax_api_dygraph(self): adam.step() adam.clear_gradients() + def test_adamax_api_dygraph(self): + with _test_eager_guard(): + self.func_adamax_api_dygraph() + self.func_adamax_api_dygraph() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_allclose_op.py b/python/paddle/fluid/tests/unittests/test_allclose_op.py index e96bf951240e7..ec1c5363fcde1 100644 --- a/python/paddle/fluid/tests/unittests/test_allclose_op.py +++ b/python/paddle/fluid/tests/unittests/test_allclose_op.py @@ -29,6 +29,7 @@ def set_args(self): def setUp(self): self.set_args() self.op_type = "allclose" + self.python_api = paddle.allclose self.inputs = { 'Input': self.input, 'Other': self.other, @@ -48,7 +49,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAllcloseOpException(TestAllcloseOp): @@ -56,28 +57,28 @@ def test_check_output(self): def test_rtol_num(): self.inputs['Rtol'] = np.array([1e-05, 1e-05]).astype("float64") self.inputs['Atol'] = np.array([1e-08]).astype("float64") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_rtol_num) def test_rtol_type(): self.inputs['Rtol'] = np.array([5]).astype("int32") self.inputs['Atol'] = np.array([1e-08]).astype("float64") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_rtol_type) def test_atol_num(): self.inputs['Rtol'] = np.array([1e-05]).astype("float64") self.inputs['Atol'] = np.array([1e-08, 1e-08]).astype("float64") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_atol_num) def test_atol_type(): self.inputs['Rtol'] = np.array([1e-05]).astype("float64") self.inputs['Atol'] = np.array([8]).astype("int32") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_atol_type) diff --git a/python/paddle/fluid/tests/unittests/test_batch_norm_op_v2.py b/python/paddle/fluid/tests/unittests/test_batch_norm_op_v2.py index c9abac8fb7946..ac09d9f5fdfd0 100644 --- a/python/paddle/fluid/tests/unittests/test_batch_norm_op_v2.py +++ b/python/paddle/fluid/tests/unittests/test_batch_norm_op_v2.py @@ -19,7 +19,7 @@ from paddle.fluid.op import 
Operator import paddle.fluid as fluid from op_test import OpTest, _set_use_system_allocator -from paddle.fluid.framework import grad_var_name +from paddle.fluid.framework import grad_var_name, _test_eager_guard import paddle.fluid as fluid from paddle.fluid import Program, program_guard import paddle @@ -46,32 +46,32 @@ def test_error(self): def error1d_dataformat(): x_data_4 = np.random.random(size=(2, 1, 3, 3)).astype('float32') batch_norm1d = paddle.nn.BatchNorm1D(1, data_format='NCDHW') - batch_norm1d(fluid.dygraph.to_variable(x_data_4)) + batch_norm1d(paddle.to_tensor(x_data_4)) def error2d_dataformat(): x_data_3 = np.random.random(size=(2, 1, 3)).astype('float32') batch_norm2d = paddle.nn.BatchNorm2D(1, data_format='NCDHW') - batch_norm2d(fluid.dygraph.to_variable(x_data_3)) + batch_norm2d(paddle.to_tensor(x_data_3)) def error3d_dataformat(): x_data_4 = np.random.random(size=(2, 1, 3, 3)).astype('float32') batch_norm3d = paddle.nn.BatchNorm3D(1, data_format='NCL') - batch_norm3d(fluid.dygraph.to_variable(x_data_4)) + batch_norm3d(paddle.to_tensor(x_data_4)) def error1d(): x_data_4 = np.random.random(size=(2, 1, 3, 3)).astype('float32') batch_norm1d = paddle.nn.BatchNorm1D(1) - batch_norm1d(fluid.dygraph.to_variable(x_data_4)) + batch_norm1d(paddle.to_tensor(x_data_4)) def error2d(): x_data_3 = np.random.random(size=(2, 1, 3)).astype('float32') batch_norm2d = paddle.nn.BatchNorm2D(1) - batch_norm2d(fluid.dygraph.to_variable(x_data_3)) + batch_norm2d(paddle.to_tensor(x_data_3)) def error3d(): x_data_4 = np.random.random(size=(2, 1, 3, 3)).astype('float32') batch_norm3d = paddle.nn.BatchNorm3D(1) - batch_norm3d(fluid.dygraph.to_variable(x_data_4)) + batch_norm3d(paddle.to_tensor(x_data_4)) with fluid.dygraph.guard(p): self.assertRaises(ValueError, error1d) @@ -81,6 +81,40 @@ def error3d(): self.assertRaises(ValueError, error2d_dataformat) self.assertRaises(ValueError, error3d_dataformat) + def test_eager_api(self): + places = [fluid.CPUPlace()] + if core.is_compiled_with_cuda(): + places.append(fluid.CUDAPlace(0)) + for p in places: + shape = [4, 10, 4, 4] + + def compute_v1(x): + with fluid.dygraph.guard(p): + bn = fluid.dygraph.BatchNorm(shape[1]) + #bn = paddle.nn.BatchNorm2D(shape[1]) + x1 = paddle.to_tensor(x) + x1.stop_gradient = False + y = bn(x1) + y.backward() + return y.numpy(), x1.gradient() + + def compute_v2(x): + with fluid.dygraph.guard(p): + with _test_eager_guard(): + print("v2") + bn = paddle.nn.BatchNorm2D(shape[1]) + x1 = paddle.to_tensor(x) + x1.stop_gradient = False + y = bn(x1) + y.backward() + return y.numpy(), x1.gradient() + + x = np.random.randn(*shape).astype("float32") + y1, g1 = compute_v1(x) + y2, g2 = compute_v2(x) + self.assertTrue(np.allclose(g1, g2)) + self.assertTrue(np.allclose(y1, y2)) + def test_dygraph(self): places = [fluid.CPUPlace()] if core.is_compiled_with_cuda(): @@ -94,13 +128,18 @@ def compute_v1(x, is_test, trainable_statistics): shape[1], is_test=is_test, trainable_statistics=trainable_statistics) - y = bn(fluid.dygraph.to_variable(x)) + y = bn(paddle.to_tensor(x)) return y.numpy() def compute_v2(x): with fluid.dygraph.guard(p): bn = paddle.nn.BatchNorm2D(shape[1]) - y = bn(fluid.dygraph.to_variable(x)) + y = bn(paddle.to_tensor(x)) + + with _test_eager_guard(): + bn = paddle.nn.BatchNorm2D(shape[1]) + eag_y = bn(paddle.to_tensor(x)) + assert np.allclose(eag_y.numpy(), y.numpy()) return y.numpy() def compute_v3(x, is_test, trainable_statistics): @@ -115,14 +154,14 @@ def compute_v3(x, is_test, trainable_statistics): 
initializer=fluid.initializer.Constant(0.0), trainable=False), trainable_statistics=trainable_statistics) - y = bn(fluid.dygraph.to_variable(x)) + y = bn(paddle.to_tensor(x)) return y.numpy() def compute_v4(x): with fluid.dygraph.guard(p): bn = paddle.nn.BatchNorm2D( shape[1], weight_attr=False, bias_attr=False) - y = bn(fluid.dygraph.to_variable(x)) + y = bn(paddle.to_tensor(x)) return y.numpy() x = np.random.randn(*shape).astype("float32") diff --git a/python/paddle/fluid/tests/unittests/test_bce_with_logits_loss.py b/python/paddle/fluid/tests/unittests/test_bce_with_logits_loss.py index 153b8fd3e7f6b..ea6d82d15ce0c 100644 --- a/python/paddle/fluid/tests/unittests/test_bce_with_logits_loss.py +++ b/python/paddle/fluid/tests/unittests/test_bce_with_logits_loss.py @@ -17,6 +17,7 @@ import numpy as np import unittest from op_test import OpTest +from paddle.fluid.framework import _test_eager_guard def call_bce_layer(logit, label, weight=None, reduction='mean', @@ -81,23 +82,22 @@ def test_dygraph(place, reduction='mean', pos_weight_np=None, functional=False): - paddle.disable_static() - logit = paddle.to_tensor(logit_np) - label = paddle.to_tensor(label_np) - weight = None - pos_weight = None - if weight_np is not None: - weight = paddle.to_tensor(weight_np) - if pos_weight_np is not None: - pos_weight = paddle.to_tensor(pos_weight_np) - if functional: - dy_res = call_bce_functional(logit, label, weight, reduction, - pos_weight) - else: - dy_res = call_bce_layer(logit, label, weight, reduction, pos_weight) - dy_result = dy_res.numpy() - paddle.enable_static() - return dy_result + with paddle.fluid.dygraph.base.guard(): + logit = paddle.to_tensor(logit_np) + label = paddle.to_tensor(label_np) + weight = None + pos_weight = None + if weight_np is not None: + weight = paddle.to_tensor(weight_np) + if pos_weight_np is not None: + pos_weight = paddle.to_tensor(pos_weight_np) + if functional: + dy_res = call_bce_functional(logit, label, weight, reduction, + pos_weight) + else: + dy_res = call_bce_layer(logit, label, weight, reduction, pos_weight) + dy_result = dy_res.numpy() + return dy_result def calc_bce_with_logits_loss(logit_np, @@ -154,9 +154,19 @@ def test_BCEWithLogitsLoss(self): label_np, reduction=reduction, functional=True) + + with _test_eager_guard(): + eager_functional = test_dygraph( + place, + logit_np, + label_np, + reduction=reduction, + functional=True) + self.assertTrue(np.allclose(static_functional, expected)) self.assertTrue(np.allclose(static_functional, dy_functional)) self.assertTrue(np.allclose(dy_functional, expected)) + self.assertTrue(np.allclose(eager_functional, expected)) def test_BCEWithLogitsLoss_weight(self): logit_np = np.random.uniform( diff --git a/python/paddle/fluid/tests/unittests/test_bernoulli_op.py b/python/paddle/fluid/tests/unittests/test_bernoulli_op.py index 426d5d463f453..fc4ee13384b2d 100644 --- a/python/paddle/fluid/tests/unittests/test_bernoulli_op.py +++ b/python/paddle/fluid/tests/unittests/test_bernoulli_op.py @@ -75,9 +75,6 @@ def test_fixed_random_number(self): if not paddle.is_compiled_with_cuda(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on GPU------>") paddle.disable_static() paddle.set_device('gpu') diff --git a/python/paddle/fluid/tests/unittests/test_bicubic_interp_op.py b/python/paddle/fluid/tests/unittests/test_bicubic_interp_op.py index f3f3431c9fb3e..8d7dd0d81180e 100644 --- a/python/paddle/fluid/tests/unittests/test_bicubic_interp_op.py +++ 
b/python/paddle/fluid/tests/unittests/test_bicubic_interp_op.py @@ -127,6 +127,9 @@ def setUp(self): self.data_layout = 'NCHW' self.init_test_case() self.op_type = "bicubic_interp" + # NOTE(dev): some AsDispensable inputs are not used under imperative mode. + # Skip check_eager when they are found in Inputs. + self.check_eager = True input_np = np.random.random(self.input_shape).astype("float64") if self.data_layout == "NCHW": @@ -149,8 +152,10 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False if self.actual_shape is not None: self.inputs['OutSize'] = self.actual_shape + self.check_eager = False self.attrs = { 'out_h': self.out_h, @@ -163,10 +168,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'bicubic' @@ -442,4 +448,5 @@ def test_outshape_and_scale(): if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_bicubic_interp_v2_op.py b/python/paddle/fluid/tests/unittests/test_bicubic_interp_v2_op.py index 43e418addf2bf..d5c3aee2f4372 100644 --- a/python/paddle/fluid/tests/unittests/test_bicubic_interp_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_bicubic_interp_v2_op.py @@ -16,7 +16,7 @@ import unittest import numpy as np -from op_test import OpTest +from op_test import OpTest, _in_eager_without_dygraph_check import paddle.fluid.core as core import paddle.fluid as fluid import paddle @@ -135,6 +135,10 @@ def setUp(self): self.data_layout = 'NCHW' self.init_test_case() self.op_type = "bicubic_interp_v2" + # NOTE(dev): some AsDispensable inputs are not used under imperative mode. + # Skip check_eager when they are found in Inputs. + # TODO(dev): add self.python_api + self.check_eager = False input_np = np.random.random(self.input_shape).astype("float64") scale_h = 0 scale_w = 0 @@ -166,8 +170,10 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False if self.actual_shape is not None: self.inputs['OutSize'] = self.actual_shape + self.check_eager = False self.attrs = { 'out_h': self.out_h, @@ -186,10 +192,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'bicubic' @@ -543,4 +550,5 @@ def test_input_shape_1(): if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_bilinear_interp_op.py b/python/paddle/fluid/tests/unittests/test_bilinear_interp_op.py index 083b671c283a0..1817ef160c70a 100755 --- a/python/paddle/fluid/tests/unittests/test_bilinear_interp_op.py +++ b/python/paddle/fluid/tests/unittests/test_bilinear_interp_op.py @@ -102,6 +102,9 @@ def setUp(self): self.data_layout = 'NCHW' self.init_test_case() self.op_type = "bilinear_interp" + # NOTE(dev): some AsDispensable inputs are not used under imperative mode. + # Skip check_eager when they are found in Inputs.
+ self.check_eager = True input_np = np.random.random(self.input_shape).astype("float64") if self.data_layout == "NCHW": @@ -124,8 +127,10 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False if self.actual_shape is not None: self.inputs['OutSize'] = self.actual_shape + self.check_eager = False self.attrs = { 'out_h': self.out_h, @@ -139,10 +144,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'bilinear' @@ -266,6 +272,7 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "bilinear_interp" + self.check_eager = True input_np = np.random.randint( low=0, high=256, size=self.input_shape).astype("uint8") @@ -282,6 +289,7 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False self.attrs = { 'out_h': self.out_h, @@ -294,7 +302,8 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output_with_place(place=core.CPUPlace(), atol=1) + self.check_output_with_place( + place=core.CPUPlace(), atol=1, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'bilinear' @@ -397,6 +406,7 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "bilinear_interp" + self.check_eager = True self.shape_by_1Dtensor = False self.scale_by_1Dtensor = False self.attrs = { @@ -419,12 +429,14 @@ def setUp(self): if self.shape_by_1Dtensor: self.inputs['OutSize'] = self.out_size + self.check_eager = False elif self.out_size is not None: size_tensor = [] for index, ele in enumerate(self.out_size): size_tensor.append(("x" + str(index), np.ones( (1)).astype('int32') * ele)) self.inputs['SizeTensor'] = size_tensor + self.check_eager = False self.attrs['out_h'] = self.out_h self.attrs['out_w'] = self.out_w @@ -433,10 +445,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'bilinear' diff --git a/python/paddle/fluid/tests/unittests/test_bincount_op.py b/python/paddle/fluid/tests/unittests/test_bincount_op.py index 851bf7b01125a..17b04b954afe8 100644 --- a/python/paddle/fluid/tests/unittests/test_bincount_op.py +++ b/python/paddle/fluid/tests/unittests/test_bincount_op.py @@ -126,6 +126,7 @@ class TestBincountOp(OpTest): # without weights def setUp(self): self.op_type = "bincount" + self.python_api = paddle.bincount self.init_test_case() self.inputs = {"X": self.np_input} self.attrs = {"minlength": self.minlength} @@ -137,13 +138,14 @@ def init_test_case(self): self.Out = np.bincount(self.np_input, minlength=self.minlength) def test_check_output(self): - self.check_output() + self.check_output(check_eager=False) class TestCase1(TestBincountOp): # with weights(FLOAT32) def setUp(self): self.op_type = "bincount" + self.python_api = paddle.bincount self.init_test_case() self.inputs = {"X": self.np_input, "Weights": 
self.np_weights} self.attrs = {"minlength": self.minlength} @@ -163,6 +165,7 @@ class TestCase2(TestBincountOp): # with weights(other) def setUp(self): self.op_type = "bincount" + self.python_api = paddle.bincount self.init_test_case() self.inputs = {"X": self.np_input, "Weights": self.np_weights} self.attrs = {"minlength": self.minlength} diff --git a/python/paddle/fluid/tests/unittests/test_cast_op.py b/python/paddle/fluid/tests/unittests/test_cast_op.py index d80a9dc920076..a828eca4f4ba7 100644 --- a/python/paddle/fluid/tests/unittests/test_cast_op.py +++ b/python/paddle/fluid/tests/unittests/test_cast_op.py @@ -22,6 +22,7 @@ import paddle.fluid as fluid from paddle.fluid import compiler, Program, program_guard from op_test import OpTest, convert_uint16_to_float, convert_float_to_uint16 +from paddle.fluid.framework import _test_eager_guard class TestCastOpFp32ToFp64(OpTest): @@ -115,6 +116,20 @@ def test_errors(self): self.assertRaises(TypeError, fluid.layers.cast, x1, 'int32') +class TestCastOpEager(unittest.TestCase): + def test_eager(self): + with paddle.fluid.dygraph.base.guard(): + with _test_eager_guard(): + x = paddle.ones([2, 2], dtype="float16") + x.stop_gradient = False + out = paddle.cast(x, "float32") + self.assertTrue( + np.array_equal(out, np.ones([2, 2]).astype("float32"))) + out.backward() + self.assertTrue(np.array_equal(x.gradient(), x.numpy())) + self.assertTrue(x.gradient().dtype == np.float16) + + if __name__ == '__main__': paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_clip_op.py b/python/paddle/fluid/tests/unittests/test_clip_op.py index 74c5f693a37f1..f4423ccd0294c 100644 --- a/python/paddle/fluid/tests/unittests/test_clip_op.py +++ b/python/paddle/fluid/tests/unittests/test_clip_op.py @@ -20,11 +20,13 @@ import paddle.fluid as fluid from paddle.fluid import Program, program_guard from op_test import OpTest +from paddle.fluid.framework import _test_eager_guard class TestClipOp(OpTest): def setUp(self): self.max_relative_error = 0.006 + self.python_api = paddle.clip self.inputs = {} self.initTestCase() @@ -51,12 +53,12 @@ def setUp(self): def test_check_output(self): paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) paddle.disable_static() def test_check_grad_normal(self): paddle.enable_static() - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) paddle.disable_static() def initTestCase(self): @@ -228,6 +230,10 @@ def test_clip_dygraph(self): self.assertTrue( np.allclose(out_5.numpy(), (data * 10).astype(np.int64).clip(2, 8))) + def test_eager(self): + with _test_eager_guard(): + self.test_clip_dygraph() + def test_errors(self): paddle.enable_static() x1 = fluid.data(name='x1', shape=[1], dtype="int16") diff --git a/python/paddle/fluid/tests/unittests/test_complex_abs.py b/python/paddle/fluid/tests/unittests/test_complex_abs.py index 4bc6beacb689f..a29d9baadead0 100644 --- a/python/paddle/fluid/tests/unittests/test_complex_abs.py +++ b/python/paddle/fluid/tests/unittests/test_complex_abs.py @@ -46,7 +46,7 @@ def init_grad_input_output(self): self.grad_x = self.grad_out * (self.x / np.abs(self.x)) def test_check_output(self): - self.check_output(check_eager=True) + self.check_output(check_eager=False) def test_check_grad(self): self.check_grad( @@ -54,7 +54,7 @@ def test_check_grad(self): 'Out', user_defined_grads=[self.grad_x], user_defined_grad_outputs=[self.grad_out], - check_eager=True) + check_eager=False) class TestComplexAbsOpZeroValues(OpTest): @@ 
-80,7 +80,7 @@ def init_grad_input_output(self): self.grad_x = np.zeros(self.shape, self.dtype) def test_check_output(self): - self.check_output(check_eager=True) + self.check_output(check_eager=False) def test_check_grad(self): self.check_grad( @@ -88,7 +88,7 @@ def test_check_grad(self): 'Out', user_defined_grads=[self.grad_x], user_defined_grad_outputs=[self.grad_out], - check_eager=True) + check_eager=False) class TestAbs(unittest.TestCase): @@ -133,7 +133,7 @@ def init_grad_input_output(self): self.grad_x = self.grad_out * (self.x / np.abs(self.x)) def test_check_output(self): - self.check_output(check_eager=True) + self.check_output(check_eager=False) def test_check_grad(self): self.check_grad( @@ -141,7 +141,7 @@ def test_check_grad(self): 'Out', user_defined_grads=[self.grad_x], user_defined_grad_outputs=[self.grad_out], - check_eager=True) + check_eager=False) if __name__ == '__main__': diff --git a/python/paddle/fluid/tests/unittests/test_concat_op.py b/python/paddle/fluid/tests/unittests/test_concat_op.py index 4feca1b92505b..629ddb31d7b62 100644 --- a/python/paddle/fluid/tests/unittests/test_concat_op.py +++ b/python/paddle/fluid/tests/unittests/test_concat_op.py @@ -19,6 +19,7 @@ from paddle.fluid.tests.unittests.op_test import OpTest, skip_check_grad_ci, convert_float_to_uint16 import paddle.fluid as fluid from paddle.fluid import compiler, Program, program_guard, core +from paddle.fluid.framework import _test_eager_guard import paddle @@ -49,7 +50,7 @@ def test_check_output(self): place = core.CUDAPlace(0) self.check_output_with_place(place) else: - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): if self.dtype == np.uint16: @@ -58,9 +59,9 @@ def test_check_grad(self): self.check_grad_with_place(place, ['x1'], 'Out') self.check_grad_with_place(place, ['x2'], 'Out') else: - self.check_grad(['x0'], 'Out') - self.check_grad(['x1'], 'Out') - self.check_grad(['x2'], 'Out') + self.check_grad(['x0'], 'Out', check_eager=True) + self.check_grad(['x1'], 'Out', check_eager=True) + self.check_grad(['x2'], 'Out', check_eager=True) def init_test_data(self): if self.dtype == np.uint16: @@ -124,6 +125,7 @@ class TestConcatOp6(TestConcatOp): def setUp(self): self.op_type = "concat" self.dtype = self.get_dtype() + self.python_api = paddle.concat self.init_test_data() self.lod = [[20, 80]] self.out_lod = [[20, 80, 20, 80, 20, 80]] @@ -141,12 +143,12 @@ def setUp(self): self.outputs = {'Out': (out, self.out_lod)} def test_check_output(self): - self.check_output(check_dygraph=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['x0'], 'Out', check_dygraph=False) - self.check_grad(['x1'], 'Out', check_dygraph=False) - self.check_grad(['x2'], 'Out', check_dygraph=False) + self.check_grad(['x0'], 'Out', check_eager=True) + self.check_grad(['x1'], 'Out', check_eager=True) + self.check_grad(['x2'], 'Out', check_eager=True) def init_test_data(self): self.x0 = np.random.random([100]).astype(self.dtype) @@ -159,6 +161,7 @@ def create_test_AxisTensor(parent): class TestConcatAxisTensor(parent): def setUp(self): self.op_type = "concat" + self.python_api = paddle.concat self.dtype = self.get_dtype() self.init_test_data() @@ -334,6 +337,12 @@ def test_imperative(self): self.assertEqual((out1.numpy() == np_out1).all(), True) self.assertEqual((out2.numpy() == np_out2).all(), True) + def test_eager(self): + with _test_eager_guard(): + self.test_api() + self.test_fluid_api() + self.test_imperative() + def test_errors(self): with 
program_guard(Program(), Program()): # The item in input must be Variable. @@ -370,6 +379,7 @@ class TestConcatAPIWithLoDTensorArray(unittest.TestCase): def setUp(self): self.axis = 1 + self.python = paddle.concat self.iter_num = 3 self.input_shape = [2, 3] self.x = np.random.random(self.input_shape).astype("float32") diff --git a/python/paddle/fluid/tests/unittests/test_conj_op.py b/python/paddle/fluid/tests/unittests/test_conj_op.py index 774a29ada4a84..fe9efc301fea7 100644 --- a/python/paddle/fluid/tests/unittests/test_conj_op.py +++ b/python/paddle/fluid/tests/unittests/test_conj_op.py @@ -32,6 +32,7 @@ class TestConjOp(OpTest): def setUp(self): self.op_type = "conj" + self.python_api = paddle.tensor.conj self.init_dtype_type() self.init_input_output() self.init_grad_input_output() @@ -53,14 +54,15 @@ def init_grad_input_output(self): self.grad_in = np.conj(self.grad_out) def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): self.check_grad( ['X'], 'Out', user_defined_grads=[self.grad_in], - user_defined_grad_outputs=[self.grad_out]) + user_defined_grad_outputs=[self.grad_out], + check_eager=True) class TestComplexConjOp(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_conv2d_layer.py b/python/paddle/fluid/tests/unittests/test_conv2d_layer.py index 892fa649a6c5b..508bd7b1e64d8 100644 --- a/python/paddle/fluid/tests/unittests/test_conv2d_layer.py +++ b/python/paddle/fluid/tests/unittests/test_conv2d_layer.py @@ -19,6 +19,7 @@ import paddle.fluid.initializer as I import unittest import paddle +from paddle.fluid.framework import _test_eager_guard def _reverse_repeat_list(t, n): @@ -166,7 +167,8 @@ def functional(self, place): return y_np def paddle_nn_layer(self): - x_var = dg.to_variable(self.input) + x_var = paddle.to_tensor(self.input) + x_var.stop_gradient = False conv = nn.Conv2D( self.num_channels, self.num_filters, @@ -181,17 +183,23 @@ def paddle_nn_layer(self): if not self.no_bias: conv.bias.set_value(self.bias) y_var = conv(x_var) + y_var.backward() y_np = y_var.numpy() - return y_np + t1 = x_var.gradient() + return y_np, t1 def _test_equivalence(self, place): place = fluid.CPUPlace() result1 = self.fluid_layer(place) result2 = self.functional(place) with dg.guard(place): - result3 = self.paddle_nn_layer() + result3, g1 = self.paddle_nn_layer() + with _test_eager_guard(): + res_eager, g2 = self.paddle_nn_layer() np.testing.assert_array_almost_equal(result1, result2) np.testing.assert_array_almost_equal(result2, result3) + self.assertTrue(np.allclose(result3, res_eager)) + self.assertTrue(np.allclose(g1, g2)) def runTest(self): place = fluid.CPUPlace() diff --git a/python/paddle/fluid/tests/unittests/test_create_parameter.py b/python/paddle/fluid/tests/unittests/test_create_parameter.py index 199558acd4ef6..fb4b5e4b6fa88 100644 --- a/python/paddle/fluid/tests/unittests/test_create_parameter.py +++ b/python/paddle/fluid/tests/unittests/test_create_parameter.py @@ -22,7 +22,8 @@ class TestCreateParameterError(unittest.TestCase): - def test_errors(self): + def func_errors(self): + paddle.enable_static() with program_guard(Program(), Program()): def test_shape(): @@ -49,6 +50,11 @@ def test_default_initializer(): self.assertRaises(TypeError, test_default_initializer) + def test_errors(self): + with fluid.framework._test_eager_guard(): + self.func_errors() + self.func_errors() + if __name__ == '__main__': paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/test_crop_op.py 
b/python/paddle/fluid/tests/unittests/test_crop_op.py index b08648b99f123..acb652ad6f9e8 100644 --- a/python/paddle/fluid/tests/unittests/test_crop_op.py +++ b/python/paddle/fluid/tests/unittests/test_crop_op.py @@ -71,10 +71,10 @@ def initTestCase(self): self.offsets = [1, 2] def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestCase1(TestCropOp): @@ -125,4 +125,6 @@ def initTestCase(self): if __name__ == '__main__': + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_crop_tensor_op.py b/python/paddle/fluid/tests/unittests/test_crop_tensor_op.py index 0808f99ff1a94..a4552c8f5ddbb 100644 --- a/python/paddle/fluid/tests/unittests/test_crop_tensor_op.py +++ b/python/paddle/fluid/tests/unittests/test_crop_tensor_op.py @@ -77,10 +77,10 @@ def initTestCase(self): self.offsets = [1, 2] def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestCase1(TestCropTensorOp): @@ -175,10 +175,10 @@ def initTestCase(self): self.shape_attr = [0, 0] def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) class TestCropTensorOpTensorAttrCase1(TestCropTensorOpTensorAttr): @@ -262,4 +262,6 @@ def input_dtype(): if __name__ == '__main__': + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_cross_entropy_loss.py b/python/paddle/fluid/tests/unittests/test_cross_entropy_loss.py index d3ed76e34a614..4402d875a41f6 100644 --- a/python/paddle/fluid/tests/unittests/test_cross_entropy_loss.py +++ b/python/paddle/fluid/tests/unittests/test_cross_entropy_loss.py @@ -21,6 +21,7 @@ from test_softmax_op import stable_softmax from test_softmax_with_cross_entropy_op import cross_entropy from paddle.fluid import Program, program_guard +from paddle.fluid.framework import _test_eager_guard def log_softmax(x, axis=-1): @@ -1447,6 +1448,43 @@ def test_cross_entropy_loss_2d_sum(self): self.assertTrue(np.allclose(static_ret, expected)) self.assertTrue(np.allclose(dy_ret_value, expected)) + def test_soft_1d_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_cross_entropy_loss_soft_1d() + self.test_cross_entropy_loss_soft_1d_weight() + self.test_cross_entropy_loss_soft_1d_mean() + self.test_cross_entropy_loss_soft_1d_weight_mean() + + # putting all the test cases into one test would fail + def test_soft_2d_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_cross_entropy_loss_soft_2d() + self.test_cross_entropy_loss_soft_2d_weight_mean() + + def test_other_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_cross_entropy_loss_1d_with_mean_ignore() + self.test_cross_entropy_loss_1d_with_mean_ignore_negative() + self.test_cross_entropy_loss_1d_with_weight_mean_ignore() + self.test_cross_entropy_loss_1d_with_weight_mean_ignore_exceedlabel( ) + self.test_cross_entropy_loss_1d_with_weight_mean() + self.test_cross_entropy_loss_1d_with_weight_sum() + self.test_cross_entropy_loss_1d_with_weight_none() + self.test_cross_entropy_loss_1d_with_weight_none_func() + self.test_cross_entropy_loss_1d_mean() +
self.test_cross_entropy_loss_1d_sum() + self.test_cross_entropy_loss_1d_none() + self.test_cross_entropy_loss_2d_with_weight_none() + self.test_cross_entropy_loss_2d_with_weight_axis_change_mean() + self.test_cross_entropy_loss_2d_with_weight_mean_ignore_exceedlabel( + ) + self.test_cross_entropy_loss_2d_with_weight_mean() + self.test_cross_entropy_loss_2d_with_weight_sum() + self.test_cross_entropy_loss_2d_none() + self.test_cross_entropy_loss_2d_mean() + self.test_cross_entropy_loss_2d_sum() + class TestCrossEntropyFAPIError(unittest.TestCase): def test_errors(self): diff --git a/python/paddle/fluid/tests/unittests/test_cross_op.py b/python/paddle/fluid/tests/unittests/test_cross_op.py index 6cba72213ff97..8b884583646a7 100644 --- a/python/paddle/fluid/tests/unittests/test_cross_op.py +++ b/python/paddle/fluid/tests/unittests/test_cross_op.py @@ -48,10 +48,10 @@ def init_output(self): self.outputs = {'Out': np.array(z_list).reshape(self.shape)} def test_check_output(self): - self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out', check_eager=False) + self.check_grad(['X', 'Y'], 'Out', check_eager=True) class TestCrossOpCase1(TestCrossOp): diff --git a/python/paddle/fluid/tests/unittests/test_cuda_max_memory_allocated.py b/python/paddle/fluid/tests/unittests/test_cuda_max_memory_allocated.py index 51c9ba182ab72..ae8bdeed1ef7a 100644 --- a/python/paddle/fluid/tests/unittests/test_cuda_max_memory_allocated.py +++ b/python/paddle/fluid/tests/unittests/test_cuda_max_memory_allocated.py @@ -16,10 +16,11 @@ import unittest from paddle.fluid import core from paddle.device.cuda import device_count, memory_allocated, max_memory_allocated +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestMaxMemoryAllocated(unittest.TestCase): - def test_max_memory_allocated(self, device=None): + def func_test_max_memory_allocated(self, device=None): if core.is_compiled_with_cuda(): alloc_time = 100 max_alloc_size = 10000 @@ -35,16 +36,26 @@ def test_max_memory_allocated(self, device=None): self.assertEqual(peak_memory_allocated_size, max_memory_allocated(device)) - def test_max_memory_allocated_for_all_places(self): + def test_max_memory_allocated(self): + with _test_eager_guard(): + self.func_test_max_memory_allocated() + self.func_test_max_memory_allocated() + + def func_test_max_memory_allocated_for_all_places(self): if core.is_compiled_with_cuda(): gpu_num = device_count() for i in range(gpu_num): paddle.device.set_device("gpu:" + str(i)) - self.test_max_memory_allocated(core.CUDAPlace(i)) - self.test_max_memory_allocated(i) - self.test_max_memory_allocated("gpu:" + str(i)) + self.func_test_max_memory_allocated(core.CUDAPlace(i)) + self.func_test_max_memory_allocated(i) + self.func_test_max_memory_allocated("gpu:" + str(i)) - def test_max_memory_allocated_exception(self): + def test_max_memory_allocated_for_all_places(self): + with _test_eager_guard(): + self.func_test_max_memory_allocated_for_all_places() + self.func_test_max_memory_allocated_for_all_places() + + def func_test_max_memory_allocated_exception(self): if core.is_compiled_with_cuda(): wrong_device = [ core.CPUPlace(), device_count() + 1, -2, 0.5, "gpu1", "npu" @@ -56,6 +67,11 @@ def test_max_memory_allocated_exception(self): with self.assertRaises(BaseException): max_memory_allocated() + def test_max_memory_allocated_exception(self): + with _test_eager_guard(): + self.func_test_max_memory_allocated_exception() + 
self.func_test_max_memory_allocated_exception() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_cuda_memory_reserved.py b/python/paddle/fluid/tests/unittests/test_cuda_memory_reserved.py index 149760de8b231..ca551ab4a3f28 100644 --- a/python/paddle/fluid/tests/unittests/test_cuda_memory_reserved.py +++ b/python/paddle/fluid/tests/unittests/test_cuda_memory_reserved.py @@ -17,26 +17,37 @@ import numpy as np from paddle.fluid import core from paddle.device.cuda import device_count, memory_reserved +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestMemoryreserved(unittest.TestCase): - def test_memory_reserved(self, device=None): + def func_test_memory_reserved(self, device=None): if core.is_compiled_with_cuda(): tensor = paddle.zeros(shape=[256]) alloc_size = 4 * 256 # 256 float32 data, with 4 bytes for each one memory_reserved_size = memory_reserved(device) self.assertEqual(memory_reserved_size, alloc_size) - def test_memory_reserved_for_all_places(self): + def test_memory_reserved(self): + with _test_eager_guard(): + self.func_test_memory_reserved() + self.func_test_memory_reserved() + + def func_test_memory_reserved_for_all_places(self): if core.is_compiled_with_cuda(): gpu_num = device_count() for i in range(gpu_num): paddle.device.set_device("gpu:" + str(i)) - self.test_memory_reserved(core.CUDAPlace(i)) - self.test_memory_reserved(i) - self.test_memory_reserved("gpu:" + str(i)) + self.func_test_memory_reserved(core.CUDAPlace(i)) + self.func_test_memory_reserved(i) + self.func_test_memory_reserved("gpu:" + str(i)) - def test_memory_reserved_exception(self): + def test_memory_reserved_for_all_places(self): + with _test_eager_guard(): + self.func_test_memory_reserved_for_all_places() + self.func_test_memory_reserved_for_all_places() + + def func_test_memory_reserved_exception(self): if core.is_compiled_with_cuda(): wrong_device = [ core.CPUPlace(), device_count() + 1, -2, 0.5, "gpu1", "npu" @@ -48,6 +59,11 @@ def test_memory_reserved_exception(self): with self.assertRaises(BaseException): memory_reserved() + def test_memory_reserved_exception(self): + with _test_eager_guard(): + self.func_test_memory_reserved_exception() + self.func_test_memory_reserved_exception() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_cumprod_op.py b/python/paddle/fluid/tests/unittests/test_cumprod_op.py index 31e7ee287f0ea..681b8d6cc0bdf 100644 --- a/python/paddle/fluid/tests/unittests/test_cumprod_op.py +++ b/python/paddle/fluid/tests/unittests/test_cumprod_op.py @@ -73,6 +73,7 @@ def setUp(self): self.init_params() self.init_dtype() self.op_type = "cumprod" + self.python_api = paddle.cumprod self.inputs = {'X': None} self.outputs = {'Out': None} self.attrs = {'dim': None} @@ -110,7 +111,7 @@ def test_check_output(self): for dim in range(-len(self.shape), len(self.shape)): for zero_num in self.zero_nums: self.prepare_inputs_outputs_attrs(dim, zero_num) - self.check_output() + self.check_output(check_eager=True) # test backward. 
def test_check_grad(self): @@ -119,13 +120,14 @@ def test_check_grad(self): self.prepare_inputs_outputs_attrs(dim, zero_num) self.init_grad_input_output(dim) if self.dtype == np.float64: - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) else: self.check_grad( ['X'], 'Out', user_defined_grads=[self.grad_x], - user_defined_grad_outputs=[self.grad_out]) + user_defined_grad_outputs=[self.grad_out], + check_eager=True) # test float32 case. diff --git a/python/paddle/fluid/tests/unittests/test_dataloader_dataset.py b/python/paddle/fluid/tests/unittests/test_dataloader_dataset.py index c54a1406e39bf..786d04272e3eb 100644 --- a/python/paddle/fluid/tests/unittests/test_dataloader_dataset.py +++ b/python/paddle/fluid/tests/unittests/test_dataloader_dataset.py @@ -22,10 +22,11 @@ import paddle.vision.transforms as transforms import paddle.fluid as fluid from paddle.io import * +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestDatasetAbstract(unittest.TestCase): - def test_main(self): + def func_test_main(self): dataset = Dataset() try: d = dataset[0] @@ -39,6 +40,11 @@ def test_main(self): except NotImplementedError: pass + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestDatasetWithDiffOutputPlace(unittest.TestCase): def get_dataloader(self, num_workers): @@ -60,7 +66,7 @@ def run_check_on_cpu(self): self.assertTrue(label.place.is_cpu_place()) break - def test_single_process(self): + def func_test_single_process(self): self.run_check_on_cpu() if paddle.is_compiled_with_cuda(): # Get (image, label) tuple from MNIST dataset @@ -72,7 +78,12 @@ def test_single_process(self): self.assertTrue(label.place.is_cuda_pinned_place()) break - def test_multi_process(self): + def test_single_process(self): + with _test_eager_guard(): + self.func_test_single_process() + self.func_test_single_process() + + def func_test_multi_process(self): # DataLoader with multi-process mode is not supported on MacOs and Windows currently if sys.platform != 'darwin' and sys.platform != 'win32': self.run_check_on_cpu() @@ -86,6 +97,11 @@ def test_multi_process(self): self.assertTrue(label.place.is_cuda_pinned_place()) break + def test_multi_process(self): + with _test_eager_guard(): + self.func_test_multi_process() + self.func_test_multi_process() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_decayed_adagrad_op.py b/python/paddle/fluid/tests/unittests/test_decayed_adagrad_op.py index a664a1529f4de..e2f6d17cc96a8 100644 --- a/python/paddle/fluid/tests/unittests/test_decayed_adagrad_op.py +++ b/python/paddle/fluid/tests/unittests/test_decayed_adagrad_op.py @@ -48,7 +48,7 @@ def setUp(self): self.outputs = {'ParamOut': param_out, 'MomentOut': moment_out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestDecayedAdagradOp2(OpTest): @@ -80,8 +80,10 @@ def setUp(self): self.outputs = {'ParamOut': param_out, 'MomentOut': moment_out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) if __name__ == "__main__": + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_determinant_op.py b/python/paddle/fluid/tests/unittests/test_determinant_op.py index f8110bffa2f71..d447d213f3c81 100644 --- a/python/paddle/fluid/tests/unittests/test_determinant_op.py +++ b/python/paddle/fluid/tests/unittests/test_determinant_op.py @@ -22,21 
+22,23 @@ import paddle.fluid as fluid import paddle.fluid.core as core import paddle.tensor as tensor +from paddle.fluid.framework import _test_eager_guard paddle.enable_static() class TestDeterminantOp(OpTest): def setUp(self): + self.python_api = paddle.linalg.det self.init_data() self.op_type = "determinant" self.outputs = {'Out': self.target} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['Input'], ['Out']) + self.check_grad(['Input'], ['Out'], check_eager=True) def init_data(self): np.random.seed(0) @@ -89,6 +91,10 @@ def test_api_dygraph(self): self.assertEqual(np.allclose(out.numpy(), out_ref, rtol=1e-03), True) paddle.enable_static() + def test_eager(self): + with _test_eager_guard(): + self.test_api_dygraph() + class TestSlogDeterminantOp(OpTest): def setUp(self): diff --git a/python/paddle/fluid/tests/unittests/test_dist_base.py b/python/paddle/fluid/tests/unittests/test_dist_base.py index a2faf1e395d60..11972059c832c 100755 --- a/python/paddle/fluid/tests/unittests/test_dist_base.py +++ b/python/paddle/fluid/tests/unittests/test_dist_base.py @@ -36,7 +36,6 @@ from paddle.fluid.dygraph.base import to_variable from paddle.fluid.dygraph.parallel import DataParallel, ParallelEnv from paddle.fluid.framework import _test_eager_guard - from paddle.fluid.incubate.fleet.collective import fleet, DistributedStrategy import paddle.fluid.incubate.fleet.base.role_maker as role_maker @@ -543,12 +542,6 @@ def _get_data(self, batch, args): return batch def run_trainer(self, args): - if args.eager_mode: - self.run_trainer_in_eager_mode(args) - else: - self.run_trainer_func(args) - - def run_trainer_func(self, args): seed = 90 if args.update_method == 'gloo': place = fluid.CPUPlace() @@ -580,6 +573,7 @@ def run_trainer_func(self, args): strategy.local_rank = args.trainer_id strategy.trainer_endpoints = args.endpoints.split(",") strategy.current_endpoint = args.current_endpoint + paddle.distributed.init_parallel_env() print_to_err( type(self).__name__, "begin to prepare context in dygraph with nccl2") @@ -621,82 +615,7 @@ def run_trainer_func(self, args): model.clear_gradients() print_to_out(out_losses) - def run_trainer_in_eager_mode(self, args): - seed = 90 - if args.update_method == 'gloo': - place = fluid.CPUPlace() - elif fluid.core.is_compiled_with_cuda(): - device_id = int(os.getenv("FLAGS_selected_gpus", "0")) - place = fluid.CUDAPlace(device_id) - elif fluid.core.is_compiled_with_xpu(): - device_id = int(os.getenv("FLAGS_selected_xpus", "0")) - place = fluid.XPUPlace(device_id) - elif fluid.core.is_compiled_with_npu(): - device_id = int(os.getenv("FLAGS_selected_npus", "0")) - place = fluid.NPUPlace(device_id) - else: - assert ("Only support CUDAPlace or XPUPlace or CPU(Gloo) for now.") - - with _test_eager_guard(): - with fluid.dygraph.guard(place): - fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed - np.random.seed(seed) - import random - random.seed(seed) - - model, train_reader, opt = self.get_model() - - #if args.update_method == "nccl2": - if args.update_method in ["nccl2", "gloo"]: - paddle.distributed.init_parallel_env() - nranks = ParallelEnv().nranks - rank = ParallelEnv().local_rank - is_master = True if rank == 0 else False - store = paddle.fluid.core.TCPStore( - "127.0.0.1", args.dist_port, is_master, nranks) - if args.update_method == "nccl2": - group = core.ProcessGroupNCCL(store, rank, nranks) - elif args.update_method == "gloo": - 
group = core.ProcessGroupGloo(store, rank, nranks) - - print_to_err( - type(self).__name__, - "begin to prepare context in dygraph with nccl2") - model = dygraph.parallel.DataParallel( - model, - process_group=group, - find_unused_parameters=args.find_unused_parameters) - print_to_err(type(self).__name__, "model built in dygraph") - - out_losses = [] - print_to_err( - type(self).__name__, "begin to run dygraph training") - for step_id, data in enumerate(train_reader()): - data = self._get_data(data, args) - if step_id == RUN_STEP: - break - loss = self.run_one_loop(model, opt, data) - if step_id % 10 == 0: - print_to_err( - type(self).__name__, - "loss at step %d: %f" % (step_id, loss.numpy())) - out_losses.append(loss.numpy()) - - loss.backward() - - opt.minimize(loss) - if not args.accumulate_gradient: - model.clear_gradients() - print_to_out(out_losses) - def run_trainer_with_spawn(self, args): - if args.eager_mode: - return self.run_trainer_with_spawn_in_eager_mode(args) - else: - return self.run_trainer_with_spawn_func(args) - - def run_trainer_with_spawn_func(self, args): # 1. enable dygraph paddle.disable_static() @@ -733,64 +652,7 @@ def run_trainer_with_spawn_func(self, args): model.clear_gradients() return out_losses - def run_trainer_with_spawn_in_eager_mode(self, args): - # 1. enable dygraph - paddle.disable_static() - - # 2. init seed - seed = 90 - paddle.static.default_startup_program().random_seed = seed - paddle.static.default_main_program().random_seed = seed - np.random.seed(seed) - random.seed(seed) - # get trainer id - args.trainer_id = paddle.distributed.get_rank() - - # 3. init parallel env - if args.update_method in ["nccl2", "gloo"]: - paddle.distributed.init_parallel_env() - - # 4. build process group - nranks = ParallelEnv().nranks - rank = ParallelEnv().local_rank - is_master = True if rank == 0 else False - store = paddle.fluid.core.TCPStore("127.0.0.1", args.dist_port, - is_master, nranks) - if args.update_method == "nccl2": - group = core.ProcessGroupNCCL(store, rank, nranks) - elif args.update_method == "gloo": - group = core.ProcessGroupGloo(store, rank, nranks) - - # 5. train model - with _test_eager_guard(): - model, train_reader, opt = self.get_model() - if args.update_method in ["nccl2", "gloo"]: - model = paddle.DataParallel( - model, - process_group=group, - find_unused_parameters=args.find_unused_parameters) - - out_losses = [] - for step_id, data in enumerate(train_reader()): - data = self._get_data(data, args) - if step_id == RUN_STEP: - break - loss = self.run_one_loop(model, opt, data) - out_losses.append(loss.numpy()) - - loss.backward() - - opt.minimize(loss) - model.clear_gradients() - return out_losses - def run_use_fleet_api_trainer(self, args): - if args.eager_mode: - self.run_use_fleet_api_trainer_in_eager_mode(args) - else: - self.run_use_fleet_api_trainer_func(args) - - def run_use_fleet_api_trainer_func(self, args): import paddle.distributed.fleet as fleet import paddle.distributed.fleet.base.role_maker as role_maker # 1. enable dygraph @@ -835,52 +697,6 @@ def run_use_fleet_api_trainer_func(self, args): opt.clear_grad() print_to_out(out_losses) - def run_use_fleet_api_trainer_in_eager_mode(self, args): - import paddle.distributed.fleet as fleet - import paddle.distributed.fleet.base.role_maker as role_maker - # 1. enable dygraph - paddle.disable_static() - - # 2. 
init seed - seed = 90 - paddle.static.default_startup_program().random_seed = seed - paddle.static.default_main_program().random_seed = seed - np.random.seed(seed) - random.seed(seed) - # get trainer id - args.trainer_id = paddle.distributed.get_rank() - - # set strategy - strategy = fleet.DistributedStrategy() - if args.find_unused_parameters: - strategy.find_unused_parameters = True - - # 3. init parallel env - if args.update_method == "nccl2" or "bkcl" or "hccl": - fleet.init(is_collective=True, strategy=strategy) - - # 4. train model - with _test_eager_guard(): - model, train_reader, opt = self.get_model() - if args.update_method == "nccl2" or "bkcl" or "hccl": - opt = fleet.distributed_optimizer(opt) - model = fleet.distributed_model(model) - - out_losses = [] - for step_id, data in enumerate(train_reader()): - data = self._get_data(data, args) - if step_id == RUN_STEP: - break - loss = self.run_one_loop(model, opt, data) - out_losses.append(loss.numpy()) - - loss.backward() - - opt.step() - if not args.accumulate_gradient: - opt.clear_grad() - print_to_out(out_losses) - def runtime_main(test_class): parser = argparse.ArgumentParser(description='Run dist test.') @@ -911,8 +727,6 @@ def runtime_main(test_class): parser.add_argument( '--current_endpoint', type=str, required=False, default="") parser.add_argument('--sync_mode', action='store_true') - parser.add_argument('--eager_mode', action='store_true') - parser.add_argument('--dist_port', type=int, required=False, default=6175) parser.add_argument('--use_cuda', action='store_true') parser.add_argument('--use_cpu', action='store_true') parser.add_argument('--use_xpu', action='store_true') @@ -1005,8 +819,6 @@ def setUp(self): self._port_set = set() self._python_interp = sys.executable self._sync_mode = True - self._dist_port = 6175 - self._eager_mode = False self._hogwild_mode = False self._enforce_place = None self._use_reduce = False @@ -1168,10 +980,6 @@ def _run_local(self, if len(devices) > 1 and self._use_dgc: cmd += " --use_dgc" - if self._eager_mode: - cmd += " --eager_mode" - cmd += " --dist_port {}".format(self._dist_port) - if self._accumulate_gradient: cmd += " --accumulate_gradient" @@ -1245,11 +1053,6 @@ def _run_cluster(self, model, envs, check_error_log, log_name): if self._sync_mode: tr0_cmd += " --sync_mode" tr1_cmd += " --sync_mode" - if self._eager_mode: - tr0_cmd += " --eager_mode" - tr1_cmd += " --eager_mode" - tr0_cmd += " --dist_port {}".format(self._dist_port) - tr1_cmd += " --dist_port {}".format(self._dist_port) if self._hogwild_mode: tr0_cmd += " --hogwild" tr1_cmd += " --hogwild" @@ -1356,10 +1159,6 @@ def _get_gloo_trainer_cmd(self, model, ep, update_method, trainer_id, assert self._use_dgc == False, "gloo not support use dgc" - if self._eager_mode: - tr_cmd += " --eager_mode" - tr_cmd += " --dist_port {}".format(self._dist_port) - if self._accumulate_gradient: tr_cmd += " --accumulate_gradient" @@ -1437,10 +1236,6 @@ def _get_nccl2_trainer_cmd(self, model, ep, update_method, trainer_id, if self._use_dgc: tr_cmd += " --use_dgc" - if self._eager_mode: - tr_cmd += " --eager_mode" - tr_cmd += " --dist_port {}".format(self._dist_port) - if self._accumulate_gradient: tr_cmd += " --accumulate_gradient" @@ -1665,7 +1460,34 @@ def check_with_place(self, check_error_log=False, need_envs={}, log_name=""): + if self._dygraph and (self._gloo_mode or self._nccl2_mode): + with _test_eager_guard(): + self.check_with_place_func( + model_file=model_file, + delta=delta, + check_error_log=check_error_log, + 
need_envs=need_envs, + log_name=log_name) + self.check_with_place_func( + model_file=model_file, + delta=delta, + check_error_log=check_error_log, + need_envs=need_envs, + log_name=log_name) + else: + self.check_with_place_func( + model_file=model_file, + delta=delta, + check_error_log=check_error_log, + need_envs=need_envs, + log_name=log_name) + def check_with_place_func(self, + model_file, + delta=1e-3, + check_error_log=False, + need_envs={}, + log_name=""): required_envs = self._get_required_envs(check_error_log, need_envs) if self._gloo_mode: diff --git a/python/paddle/fluid/tests/unittests/test_dist_op.py b/python/paddle/fluid/tests/unittests/test_dist_op.py index b9b8ea92cb3a8..ad999c3feae42 100644 --- a/python/paddle/fluid/tests/unittests/test_dist_op.py +++ b/python/paddle/fluid/tests/unittests/test_dist_op.py @@ -37,6 +37,7 @@ def dist(x, y, p): class TestDistOp(OpTest): def setUp(self): self.op_type = 'dist' + self.python_api = paddle.dist self.attrs = {} self.init_case() self.init_data_type() @@ -106,10 +107,14 @@ def get_reduce_dims(x, y): return x_grad, y_grad def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(["X", "Y"], "Out", user_defined_grads=self.gradient) + self.check_grad( + ["X", "Y"], + "Out", + user_defined_grads=self.gradient, + check_eager=True) class TestDistOpCase1(TestDistOp): @@ -174,4 +179,5 @@ def test_api(self): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_dpsgd_op.py b/python/paddle/fluid/tests/unittests/test_dpsgd_op.py index 48bf786e139dd..35a922b78205f 100644 --- a/python/paddle/fluid/tests/unittests/test_dpsgd_op.py +++ b/python/paddle/fluid/tests/unittests/test_dpsgd_op.py @@ -45,7 +45,7 @@ def setUp(self): self.outputs = {'ParamOut': param_out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def dpsgd_step(inputs, attributes): @@ -70,4 +70,6 @@ def dpsgd_step(inputs, attributes): if __name__ == "__main__": + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_dropout_op.py b/python/paddle/fluid/tests/unittests/test_dropout_op.py index 09712005d4125..3aca428ac77af 100644 --- a/python/paddle/fluid/tests/unittests/test_dropout_op.py +++ b/python/paddle/fluid/tests/unittests/test_dropout_op.py @@ -22,8 +22,11 @@ import paddle.static as static import paddle.fluid as fluid from paddle.fluid import Program, program_guard +from paddle.fluid.framework import _test_eager_guard import os +from paddle import _C_ops + class TestDropoutOp(OpTest): def setUp(self): @@ -960,6 +963,19 @@ def test_backward_downscale_in_infer(self): np.array_equal(input.gradient( ), self.cal_grad_downscale_in_infer(mask.numpy()))) + def test_backward_downscale_in_infer_eager(self): + for place in self.places: + with fluid.dygraph.guard(place): + with _test_eager_guard(): + input = paddle.uniform([40, 40], dtype="float32") + input.stop_gradient = False + out, mask = _C_ops.final_state_dropout( + input, None, 0.5, False, "downgrade_in_infer", 0, False) + out.backward() + self.assertTrue( + np.array_equal(input.gradient( + ), self.cal_grad_downscale_in_infer(mask.numpy()))) + def test_backward_upscale_train(self): for place in self.places: with fluid.dygraph.guard(place): @@ -976,6 +992,21 @@ def test_backward_upscale_train(self): np.allclose(input.gradient( ), self.cal_grad_upscale_train(mask.numpy(), prob))) + def 
test_backward_upscale_train_eager(self): + for place in self.places: + with fluid.dygraph.guard(place): + with _test_eager_guard(): + prob = 0.5 + input = paddle.uniform([40, 40], dtype="float32") + input.stop_gradient = False + out, mask = _C_ops.final_state_dropout( + input, None, 0.5, False, "upscale_in_train", 0, False) + out.backward() + + self.assertTrue( + np.allclose(input.gradient( + ), self.cal_grad_upscale_train(mask.numpy(), prob))) + def test_backward_upscale_train_2(self): for place in self.places: with fluid.dygraph.guard(place): @@ -1003,9 +1034,6 @@ def test_fixed_random_number(self): if not "V100" in paddle.device.cuda.get_device_name(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on V100 GPU------>") paddle.disable_static() paddle.set_device('gpu') diff --git a/python/paddle/fluid/tests/unittests/test_dygraph_recompute.py b/python/paddle/fluid/tests/unittests/test_dygraph_recompute.py index 4a4bcd2b8163c..fa9ea5d086c03 100755 --- a/python/paddle/fluid/tests/unittests/test_dygraph_recompute.py +++ b/python/paddle/fluid/tests/unittests/test_dygraph_recompute.py @@ -23,6 +23,7 @@ import random import paddle.fluid.layers as layers +from paddle.fluid.framework import _test_eager_guard def get_fc_block(block_idx, input_size, is_last=False): @@ -141,96 +142,75 @@ def run_model(recompute_block=[], class TestPyLayer(unittest.TestCase): - def test_fc_net_with_dropout(self): + def test_base_case(self, enable_autocast=False, pure_fp16=False): def check_identical(loss_ref, param_ref, grad_ref, loss, param, grad): self.assertEqual(loss_ref, loss) self.assertEqual(param_ref, param) self.assertEqual(grad_ref, grad) # without recompute - loss_ref, param_ref, grad_ref = run_model(recompute_block=[]) - - # recompute second block - loss, param, grad = run_model(recompute_block=[1]) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) - - # recompute fourth block - loss, param, grad = run_model(recompute_block=[3]) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) - - # recompute second to fourth block - loss, param, grad = run_model(recompute_block=[1, 2, 3]) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) - - # recompute second & fourth block - loss, param, grad = run_model(recompute_block=[1, 3]) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) - - def test_fc_net_without_restore_rng(self): loss_ref, param_ref, grad_ref = run_model( - recompute_block=[2], - recompute_kwargs={"preserve_rng_state": False}, - enable_autocast=True) - - def test_fc_net_with_amp(self): - def check_identical(loss_ref, param_ref, grad_ref, loss, param, grad): - self.assertEqual(loss_ref, loss) - self.assertEqual(param_ref, param) - self.assertEqual(grad_ref, grad) - - # without recompute - loss_ref, param_ref, grad_ref = run_model( - recompute_block=[], enable_autocast=True) + recompute_block=[], + enable_autocast=enable_autocast, + pure_fp16=pure_fp16) # recompute second block - loss, param, grad = run_model(recompute_block=[1], enable_autocast=True) + loss, param, grad = run_model( + recompute_block=[1], + enable_autocast=enable_autocast, + pure_fp16=pure_fp16) check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) # recompute fourth block - loss, param, grad = run_model(recompute_block=[3], enable_autocast=True) + loss, param, grad = run_model( + recompute_block=[3], + enable_autocast=enable_autocast, + pure_fp16=pure_fp16) check_identical(loss_ref, 
param_ref, grad_ref, loss, param, grad) # recompute second to fourth block loss, param, grad = run_model( - recompute_block=[1, 2, 3], enable_autocast=True) + recompute_block=[1, 2, 3], + enable_autocast=enable_autocast, + pure_fp16=pure_fp16) check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) # recompute second & fourth block loss, param, grad = run_model( - recompute_block=[1, 3], enable_autocast=True) + recompute_block=[1, 3], + enable_autocast=enable_autocast, + pure_fp16=pure_fp16) check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) - def test_fc_net_with_fp16(self): - def check_identical(loss_ref, param_ref, grad_ref, loss, param, grad): - self.assertEqual(loss_ref, loss) - self.assertEqual(param_ref, param) - self.assertEqual(grad_ref, grad) - - # without recompute - loss_ref, param_ref, grad_ref = run_model( - recompute_block=[], enable_autocast=True, pure_fp16=True) - - # recompute second block - loss, param, grad = run_model( - recompute_block=[1], enable_autocast=True, pure_fp16=True) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) + def test_fc_net_with_dropout(self): + with _test_eager_guard(): + self.test_base_case() + self.test_base_case() - # recompute fourth block - loss, param, grad = run_model( - recompute_block=[3], enable_autocast=True, pure_fp16=True) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) + def test_fc_net_without_restore_rng(self): + with _test_eager_guard(): + loss_ref, param_ref, grad_ref = run_model( + recompute_block=[2], + recompute_kwargs={"preserve_rng_state": False}, + enable_autocast=True) - # recompute second to fourth block - loss, param, grad = run_model( - recompute_block=[1, 2, 3], enable_autocast=True, pure_fp16=True) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) + def test_fc_net_with_amp(self): + with _test_eager_guard(): + self.test_base_case(enable_autocast=True) + self.test_base_case(enable_autocast=True) - # recompute second & fourth block - loss, param, grad = run_model( - recompute_block=[1, 3], enable_autocast=True, pure_fp16=True) - check_identical(loss_ref, param_ref, grad_ref, loss, param, grad) + def test_fc_net_with_fp16(self): + with _test_eager_guard(): + self.test_base_case(enable_autocast=True, pure_fp16=True) + self.test_base_case(enable_autocast=True, pure_fp16=True) def test_recompute_kwargs(self): + with _test_eager_guard(): + paddle.set_device("gpu") + kwargs = {"is_test": False} + with self.assertRaises(ValueError): + loss_ref, param_ref, grad_ref = run_model( + recompute_block=[2], recompute_kwargs=kwargs) paddle.set_device("gpu") kwargs = {"is_test": False} with self.assertRaises(ValueError): @@ -238,6 +218,11 @@ def test_recompute_kwargs(self): recompute_block=[2], recompute_kwargs=kwargs) def test_recompute_cpu_rng(self): + with _test_eager_guard(): + paddle.set_device("cpu") + with self.assertRaises(RuntimeError): + loss_ref, param_ref, grad_ref = run_model(recompute_block=[2]) + paddle.set_device("cpu") with self.assertRaises(RuntimeError): loss_ref, param_ref, grad_ref = run_model(recompute_block=[2]) diff --git a/python/paddle/fluid/tests/unittests/test_eager_dist_api.py b/python/paddle/fluid/tests/unittests/test_eager_dist_api.py new file mode 100644 index 0000000000000..e00f90f4b0d5f --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_eager_dist_api.py @@ -0,0 +1,33 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
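The recompute tests above now funnel every configuration through a single test_base_case helper and run it twice, once under _test_eager_guard() and once in legacy dygraph, comparing each recomputed variant against a no-recompute baseline. A minimal sketch of that comparison pattern (illustrative only, not part of this patch; run_model stands for the helper defined earlier in test_dygraph_recompute.py):

import numpy as np

def compare_against_baseline(run_model, enable_autocast=False, pure_fp16=False):
    # Baseline: nothing is recomputed.
    loss_ref, param_ref, grad_ref = run_model(
        recompute_block=[], enable_autocast=enable_autocast, pure_fp16=pure_fp16)
    # Recomputing any subset of blocks must reproduce the same loss, params and grads.
    for blocks in ([1], [3], [1, 2, 3], [1, 3]):
        loss, param, grad = run_model(
            recompute_block=blocks, enable_autocast=enable_autocast, pure_fp16=pure_fp16)
        assert np.array_equal(loss_ref, loss)
        assert np.array_equal(param_ref, param)
        assert np.array_equal(grad_ref, grad)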
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import unittest +from test_parallel_dygraph_dataparallel import TestMultipleGpus + + +class TestProcessGroup(TestMultipleGpus): + def test_process_group_nccl(self): + self.run_mnist_2gpu('process_group_nccl.py') + + def test_process_group_gloo(self): + self.run_mnist_2gpu('process_group_gloo.py') + + def test_init_process_group(self): + self.run_mnist_2gpu('init_process_group.py') + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_add_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_add_op.py index 4ddfe9d1559de..22787a23feadf 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_add_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_add_op.py @@ -28,6 +28,7 @@ def init_kernel_type(self): def setUp(self): self.op_type = "elementwise_add" + self.python_api = paddle.add self.init_dtype() self.init_input_output() self.init_kernel_type() @@ -41,8 +42,7 @@ def setUp(self): self.outputs = {'Out': self.out} def check_eager(self): - return False - #return (self.use_mkldnn == False and self.axis == -1) + return (self.use_mkldnn == False and self.axis == -1) def test_check_output(self): # TODO(wangzhongpu): support mkldnn op in dygraph mode diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_div_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_div_op.py index a86758a9cb92b..d50241e58dea3 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_div_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_div_op.py @@ -32,6 +32,7 @@ def setUp(self): 'X': np.random.random((32,84)).astype("float32"), 'Y': np.random.random((32,84)).astype("float32") """ + self.inputs = { 'X': np.random.uniform(0.1, 1, [13, 17]).astype(self.dtype), 'Y': np.random.uniform(0.1, 1, [13, 17]).astype(self.dtype) @@ -39,7 +40,7 @@ def setUp(self): self.outputs = {'Out': np.divide(self.inputs['X'], self.inputs['Y'])} def check_eager(self): - return (self.use_mkldnn == False and self.axis == -1) + return (not hasattr(self, "attrs") or (self.attrs["axis"] != -1)) def test_check_output(self): self.check_output(check_eager=False) @@ -65,6 +66,7 @@ def init_dtype(self): class TestElementwiseDivOpBF16(OpTest): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.dtype = np.uint16 x = np.random.uniform(0.1, 1, [12, 13]).astype(np.float32) @@ -100,6 +102,7 @@ def test_check_grad_ingore_y(self): class TestElementwiseDivOp_scalar(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [20, 3, 4]).astype(np.float64), 'Y': np.random.uniform(0.1, 1, [1]).astype(np.float64) @@ -110,6 +113,7 @@ def setUp(self): class TestElementwiseDivOp_Vector(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': 
np.random.uniform(0.1, 1, [100]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [100]).astype("float64") @@ -120,6 +124,7 @@ def setUp(self): class TestElementwiseDivOp_broadcast_0(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [100, 3, 4]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [100]).astype("float64") @@ -135,6 +140,7 @@ def setUp(self): class TestElementwiseDivOp_broadcast_1(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 100, 4]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [100]).astype("float64") @@ -150,6 +156,7 @@ def setUp(self): class TestElementwiseDivOp_broadcast_2(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 3, 100]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [100]).astype("float64") @@ -164,6 +171,7 @@ def setUp(self): class TestElementwiseDivOp_broadcast_3(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 10, 12, 5]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [10, 12]).astype("float64") @@ -179,6 +187,7 @@ def setUp(self): class TestElementwiseDivOp_broadcast_4(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 3, 50]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [2, 1, 50]).astype("float64") @@ -189,6 +198,7 @@ def setUp(self): class TestElementwiseDivOp_broadcast_5(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 3, 4, 20]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [2, 3, 1, 20]).astype("float64") @@ -199,6 +209,7 @@ def setUp(self): class TestElementwiseDivOp_commonuse_1(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 3, 100]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [1, 1, 100]).astype("float64"), @@ -209,6 +220,7 @@ def setUp(self): class TestElementwiseDivOp_commonuse_2(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [30, 3, 1, 5]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [30, 1, 4, 1]).astype("float64"), @@ -219,6 +231,7 @@ def setUp(self): class TestElementwiseDivOp_xsize_lessthan_ysize(ElementwiseDivOp): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.inputs = { 'X': np.random.uniform(0.1, 1, [10, 12]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [2, 3, 10, 12]).astype("float64"), @@ -232,6 +245,7 @@ def setUp(self): class TestElementwiseDivOp_INT(OpTest): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.dtype = np.int32 self.init_dtype() self.inputs = { @@ -304,6 +318,7 @@ def test_dygraph(self): class TestComplexElementwiseDivOp(OpTest): def setUp(self): self.op_type = "elementwise_div" + self.python_api = paddle.divide self.init_base_dtype() self.init_input_output() self.init_grad_input_output() @@ -334,7 +349,7 @@ def init_grad_input_output(self): self.grad_y = -self.grad_out * np.conj(self.x 
/ self.y / self.y) def test_check_output(self): - self.check_output() + self.check_output(check_eager=False) def test_check_grad_normal(self): self.check_grad( diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_floordiv_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_floordiv_op.py index 007affc140849..6ea24b4543f3f 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_floordiv_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_floordiv_op.py @@ -29,6 +29,7 @@ def init_kernel_type(self): def setUp(self): self.op_type = "elementwise_floordiv" + self.python_api = paddle.floor_divide self.dtype = np.int32 self.axis = -1 self.init_dtype() @@ -44,7 +45,7 @@ def setUp(self): self.outputs = {'Out': self.out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def init_input_output(self): self.x = np.random.uniform(0, 10000, [10, 10]).astype(self.dtype) diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_max_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_max_op.py index 719ee5df6dbbf..21b0595b6dc86 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_max_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_max_op.py @@ -20,11 +20,13 @@ import os import re import paddle.fluid.core as core +import paddle class TestElementwiseOp(OpTest): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum # If x and y have the same value, the max() is not differentiable. # So we generate test data by the following method # to avoid them being too close to each other. @@ -35,10 +37,16 @@ def setUp(self): self.outputs = {'Out': np.maximum(self.inputs['X'], self.inputs['Y'])} def test_check_output(self): - self.check_output() + if hasattr(self, 'attrs'): + self.check_output(check_eager=False) + else: + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out') + if hasattr(self, 'attrs'): + self.check_grad(['X', 'Y'], 'Out', check_eager=False) + else: + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): self.check_grad( @@ -55,6 +63,7 @@ def test_check_grad_ingore_y(self): class TestElementwiseBF16Op(OpTest): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum self.dtype = np.uint16 # If x and y have the same value, the max() is not differentiable. 
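        # The tie-free data used by these max tests is built as
        # y = x + sgn * uniform(0.1, 1) with sgn drawn from {-1, +1}, so every
        # element satisfies |x - y| >= 0.1 and np.maximum(x, y) has a
        # well-defined gradient at each sample point.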
# So we generate test data by the following method @@ -69,10 +78,16 @@ def setUp(self): self.outputs = {'Out': convert_float_to_uint16(np.maximum(x, y))} def test_check_output(self): - self.check_output() + if hasattr(self, 'attrs'): + self.check_output(check_eager=False) + else: + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out') + if hasattr(self, 'attrs'): + self.check_grad(['X', 'Y'], 'Out', check_eager=False) + else: + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): self.check_grad(['Y'], 'Out', no_grad_set=set("X")) @@ -86,6 +101,7 @@ def test_check_grad_ingore_y(self): class TestElementwiseMaxOp_scalar(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.random_integers(-5, 5, [2, 3, 20]).astype("float64") y = np.array([0.5]).astype("float64") self.inputs = {'X': x, 'Y': y} @@ -95,6 +111,7 @@ def setUp(self): class TestElementwiseMaxOp_Vector(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.random((100, )).astype("float64") sgn = np.random.choice([-1, 1], (100, )).astype("float64") y = x + sgn * np.random.uniform(0.1, 1, (100, )).astype("float64") @@ -105,6 +122,7 @@ def setUp(self): class TestElementwiseMaxOp_broadcast_0(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.uniform(0.5, 1, (100, 5, 2)).astype(np.float64) sgn = np.random.choice([-1, 1], (100, )).astype(np.float64) y = x[:, 0, 0] + sgn * \ @@ -121,6 +139,7 @@ def setUp(self): class TestElementwiseMaxOp_broadcast_1(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.uniform(0.5, 1, (2, 100, 3)).astype(np.float64) sgn = np.random.choice([-1, 1], (100, )).astype(np.float64) y = x[0, :, 0] + sgn * \ @@ -137,6 +156,7 @@ def setUp(self): class TestElementwiseMaxOp_broadcast_2(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.uniform(0.5, 1, (1, 3, 100)).astype(np.float64) sgn = np.random.choice([-1, 1], (100, )).astype(np.float64) y = x[0, 0, :] + sgn * \ @@ -152,6 +172,7 @@ def setUp(self): class TestElementwiseMaxOp_broadcast_3(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.uniform(0.5, 1, (2, 50, 2, 1)).astype(np.float64) sgn = np.random.choice([-1, 1], (50, 2)).astype(np.float64) y = x[0, :, :, 0] + sgn * \ @@ -168,6 +189,7 @@ def setUp(self): class TestElementwiseMaxOp_broadcast_4(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_max" + self.python_api = paddle.maximum x = np.random.uniform(0.5, 1, (2, 3, 4, 5)).astype(np.float64) sgn = np.random.choice([-1, 1], (2, 3, 1, 5)).astype(np.float64) y = x + sgn * \ diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_min_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_min_op.py index 0999acc75acff..f8dc9602c35a5 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_min_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_min_op.py @@ -27,6 +27,7 @@ class TestElementwiseOp(OpTest): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum # If x and y have the same value, the min() is not differentiable. 
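        # Note: the hasattr(self, 'attrs') branches added in this file skip the
        # eager-mode check whenever a broadcast axis attribute is set, presumably
        # because the final-state API bound via python_api (paddle.minimum here)
        # takes no axis argument; the default axis == -1 cases are verified with
        # check_eager=True.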
# So we generate test data by the following method # to avoid them being too close to each other. @@ -37,10 +38,16 @@ def setUp(self): self.outputs = {'Out': np.minimum(self.inputs['X'], self.inputs['Y'])} def test_check_output(self): - self.check_output() + if hasattr(self, 'attrs'): + self.check_output(check_eager=False) + else: + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out') + if hasattr(self, 'attrs'): + self.check_grad(['X', 'Y'], 'Out', check_eager=False) + else: + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): self.check_grad( @@ -56,6 +63,7 @@ def test_check_grad_ingore_y(self): class TestElementwiseMinOp_scalar(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.random_integers(-5, 5, [10, 3, 4]).astype("float64") y = np.array([0.5]).astype("float64") self.inputs = {'X': x, 'Y': y} @@ -65,6 +73,7 @@ def setUp(self): class TestElementwiseMinOp_Vector(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.random((100, )).astype("float64") sgn = np.random.choice([-1, 1], (100, )).astype("float64") y = x + sgn * np.random.uniform(0.1, 1, (100, )).astype("float64") @@ -75,6 +84,7 @@ def setUp(self): class TestElementwiseMinOp_broadcast_0(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.uniform(0.5, 1, (100, 3, 2)).astype(np.float64) sgn = np.random.choice([-1, 1], (100, )).astype(np.float64) y = x[:, 0, 0] + sgn * \ @@ -91,6 +101,7 @@ def setUp(self): class TestElementwiseMinOp_broadcast_1(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.uniform(0.5, 1, (2, 100, 3)).astype(np.float64) sgn = np.random.choice([-1, 1], (100, )).astype(np.float64) y = x[0, :, 0] + sgn * \ @@ -107,6 +118,7 @@ def setUp(self): class TestElementwiseMinOp_broadcast_2(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.uniform(0.5, 1, (2, 3, 100)).astype(np.float64) sgn = np.random.choice([-1, 1], (100, )).astype(np.float64) y = x[0, 0, :] + sgn * \ @@ -122,6 +134,7 @@ def setUp(self): class TestElementwiseMinOp_broadcast_3(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.uniform(0.5, 1, (2, 25, 4, 1)).astype(np.float64) sgn = np.random.choice([-1, 1], (25, 4)).astype(np.float64) y = x[0, :, :, 0] + sgn * \ @@ -138,6 +151,7 @@ def setUp(self): class TestElementwiseMinOp_broadcast_4(TestElementwiseOp): def setUp(self): self.op_type = "elementwise_min" + self.python_api = paddle.minimum x = np.random.uniform(0.5, 1, (2, 10, 2, 5)).astype(np.float64) sgn = np.random.choice([-1, 1], (2, 10, 1, 5)).astype(np.float64) y = x + sgn * \ diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_mod_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_mod_op.py index 2a8ca51693ecf..c6973255f2644 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_mod_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_mod_op.py @@ -29,6 +29,7 @@ def init_kernel_type(self): def setUp(self): self.op_type = "elementwise_mod" + self.python_api = paddle.remainder self.axis = -1 self.init_dtype() self.init_input_output() @@ -43,7 +44,10 @@ def setUp(self): self.outputs = {'Out': self.out} def 
test_check_output(self): - self.check_output() + if self.attrs['axis'] == -1: + self.check_output(check_eager=True) + else: + self.check_output(check_eager=False) def init_input_output(self): self.x = np.random.uniform(0, 10000, [10, 10]).astype(self.dtype) @@ -76,7 +80,10 @@ def init_input_output(self): self.out = np.fmod(self.y + np.fmod(self.x, self.y), self.y) def test_check_output(self): - self.check_output() + if self.attrs['axis'] == -1: + self.check_output(check_eager=True) + else: + self.check_output(check_eager=False) class TestElementwiseModOpDouble(TestElementwiseModOpFloat): diff --git a/python/paddle/fluid/tests/unittests/test_elementwise_pow_op.py b/python/paddle/fluid/tests/unittests/test_elementwise_pow_op.py index 08ffb564484b3..3c9e350360dd1 100644 --- a/python/paddle/fluid/tests/unittests/test_elementwise_pow_op.py +++ b/python/paddle/fluid/tests/unittests/test_elementwise_pow_op.py @@ -17,11 +17,13 @@ import numpy as np from op_test import OpTest, skip_check_grad_ci import paddle.fluid as fluid +import paddle class TestElementwisePowOp(OpTest): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(1, 2, [20, 5]).astype("float64"), 'Y': np.random.uniform(1, 2, [20, 5]).astype("float64") @@ -29,15 +31,22 @@ def setUp(self): self.outputs = {'Out': np.power(self.inputs['X'], self.inputs['Y'])} def test_check_output(self): - self.check_output() + if hasattr(self, 'attrs'): + self.check_output(check_eager=False) + else: + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out') + if hasattr(self, 'attrs'): + self.check_grad(['X', 'Y'], 'Out', check_eager=False) + else: + self.check_grad(['X', 'Y'], 'Out', check_eager=True) class TestElementwisePowOp_big_shape_1(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(1, 2, [10, 10]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [10, 10]).astype("float64") @@ -48,6 +57,7 @@ def setUp(self): class TestElementwisePowOp_big_shape_2(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(1, 2, [10, 10]).astype("float64"), 'Y': np.random.uniform(0.2, 2, [10, 10]).astype("float64") @@ -60,6 +70,7 @@ def setUp(self): class TestElementwisePowOp_scalar(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [3, 3, 4]).astype(np.float64), 'Y': np.random.uniform(0.1, 1, [1]).astype(np.float64) @@ -70,6 +81,7 @@ def setUp(self): class TestElementwisePowOp_tensor(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [100]).astype("float64"), 'Y': np.random.uniform(1, 3, [100]).astype("float64") @@ -80,6 +92,7 @@ def setUp(self): class TestElementwisePowOp_broadcast_0(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 1, 100]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [100]).astype("float64") @@ -90,6 +103,7 @@ def setUp(self): class TestElementwisePowOp_broadcast_1(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 100, 1]).astype("float64"), 'Y': 
np.random.uniform(0.1, 1, [100]).astype("float64") @@ -103,6 +117,7 @@ def setUp(self): class TestElementwisePowOp_broadcast_2(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [100, 3, 1]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [100]).astype("float64") @@ -117,6 +132,7 @@ def setUp(self): class TestElementwisePowOp_broadcast_3(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 20, 5, 1]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [20, 5]).astype("float64") @@ -131,6 +147,7 @@ def setUp(self): class TestElementwisePowOp_broadcast_4(TestElementwisePowOp): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = { 'X': np.random.uniform(0.1, 1, [2, 10, 3, 5]).astype("float64"), 'Y': np.random.uniform(0.1, 1, [2, 10, 1, 5]).astype("float64") @@ -141,11 +158,15 @@ def setUp(self): class TestElementwisePowOpInt(OpTest): def setUp(self): self.op_type = "elementwise_pow" + self.python_api = paddle.pow self.inputs = {'X': np.asarray([1, 3, 6]), 'Y': np.asarray([1, 1, 1])} self.outputs = {'Out': np.power(self.inputs['X'], self.inputs['Y'])} def test_check_output(self): - self.check_output() + if hasattr(self, 'attrs'): + self.check_output(check_eager=False) + else: + self.check_output(check_eager=True) class TestElementwisePowGradOpInt(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_erfinv_op.py b/python/paddle/fluid/tests/unittests/test_erfinv_op.py index 847a868dd6ca0..5b5a7c0384316 100644 --- a/python/paddle/fluid/tests/unittests/test_erfinv_op.py +++ b/python/paddle/fluid/tests/unittests/test_erfinv_op.py @@ -28,6 +28,7 @@ class TestErfinv(OpTest): def setUp(self): self.op_type = "erfinv" + self.python_api = paddle.erfinv self.init_dtype() self.shape = [11, 17] self.x = np.random.uniform(-1, 1, size=self.shape).astype(self.dtype) @@ -42,7 +43,7 @@ def init_dtype(self): self.dtype = np.float64 def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.check_grad( diff --git a/python/paddle/fluid/tests/unittests/test_expand_as_v2_op.py b/python/paddle/fluid/tests/unittests/test_expand_as_v2_op.py index 62cd465a176d5..3bf6868fed9c9 100755 --- a/python/paddle/fluid/tests/unittests/test_expand_as_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_expand_as_v2_op.py @@ -21,77 +21,63 @@ import paddle.fluid as fluid -class TestExpandAsOpRank1(OpTest): +class TestExpandAsBasic(OpTest): def setUp(self): self.op_type = "expand_as_v2" + self.python_api = paddle.expand_as x = np.random.rand(100).astype("float64") target_tensor = np.random.rand(2, 100).astype("float64") - self.inputs = {'X': x} + self.inputs = {'X': x, "Y": target_tensor} self.attrs = {'target_shape': target_tensor.shape} bcast_dims = [2, 1] output = np.tile(self.inputs['X'], bcast_dims) self.outputs = {'Out': output} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) -class TestExpandAsOpRank2(OpTest): +class TestExpandAsOpRank2(TestExpandAsBasic): def setUp(self): self.op_type = "expand_as_v2" + self.python_api = paddle.expand_as x = np.random.rand(10, 12).astype("float64") target_tensor = np.random.rand(10, 12).astype("float64") - self.inputs = 
{'X': x} + self.inputs = {'X': x, "Y": target_tensor} self.attrs = {'target_shape': target_tensor.shape} bcast_dims = [1, 1] output = np.tile(self.inputs['X'], bcast_dims) self.outputs = {'Out': output} - def test_check_output(self): - self.check_output() - - def test_check_grad(self): - self.check_grad(['X'], 'Out') - -class TestExpandAsOpRank3(OpTest): +class TestExpandAsOpRank3(TestExpandAsBasic): def setUp(self): self.op_type = "expand_as_v2" + self.python_api = paddle.expand_as x = np.random.rand(2, 3, 20).astype("float64") target_tensor = np.random.rand(2, 3, 20).astype("float64") - self.inputs = {'X': x} + self.inputs = {'X': x, "Y": target_tensor} self.attrs = {'target_shape': target_tensor.shape} bcast_dims = [1, 1, 1] output = np.tile(self.inputs['X'], bcast_dims) self.outputs = {'Out': output} - def test_check_output(self): - self.check_output() - - def test_check_grad(self): - self.check_grad(['X'], 'Out') - -class TestExpandAsOpRank4(OpTest): +class TestExpandAsOpRank4(TestExpandAsBasic): def setUp(self): self.op_type = "expand_as_v2" + self.python_api = paddle.expand_as x = np.random.rand(1, 1, 7, 16).astype("float64") target_tensor = np.random.rand(4, 6, 7, 16).astype("float64") - self.inputs = {'X': x} + self.inputs = {'X': x, "Y": target_tensor} self.attrs = {'target_shape': target_tensor.shape} bcast_dims = [4, 6, 1, 1] output = np.tile(self.inputs['X'], bcast_dims) self.outputs = {'Out': output} - def test_check_output(self): - self.check_output() - - def test_check_grad(self): - self.check_grad(['X'], 'Out') - class TestExpandAsV2Error(unittest.TestCase): def test_errors(self): @@ -129,4 +115,5 @@ def test_api(self): if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_expand_v2_op.py b/python/paddle/fluid/tests/unittests/test_expand_v2_op.py index a204c26c1b823..70b3fda79b50f 100644 --- a/python/paddle/fluid/tests/unittests/test_expand_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_expand_v2_op.py @@ -27,6 +27,7 @@ class TestExpandV2OpRank1(OpTest): def setUp(self): self.op_type = "expand_v2" self.init_data() + self.python_api = paddle.expand self.inputs = {'X': np.random.random(self.ori_shape).astype("float64")} self.attrs = {'shape': self.shape} diff --git a/python/paddle/fluid/tests/unittests/test_exponential_op.py b/python/paddle/fluid/tests/unittests/test_exponential_op.py index 7a3ae203be62d..c8f4101ea5d6b 100644 --- a/python/paddle/fluid/tests/unittests/test_exponential_op.py +++ b/python/paddle/fluid/tests/unittests/test_exponential_op.py @@ -100,9 +100,6 @@ def test_fixed_random_number(self): if not "V100" in paddle.device.cuda.get_device_name(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on V100 GPU------>") paddle.disable_static() paddle.set_device('gpu') diff --git a/python/paddle/fluid/tests/unittests/test_eye_op.py b/python/paddle/fluid/tests/unittests/test_eye_op.py index cb757cffc4425..704762d809414 100644 --- a/python/paddle/fluid/tests/unittests/test_eye_op.py +++ b/python/paddle/fluid/tests/unittests/test_eye_op.py @@ -28,6 +28,7 @@ def setUp(self): ''' Test eye op with specified shape ''' + self.python_api = paddle.eye self.op_type = "eye" self.inputs = {} @@ -39,7 +40,7 @@ def setUp(self): self.outputs = {'Out': np.eye(219, 319, dtype=np.int32)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestEyeOp1(OpTest): @@ -47,6 +48,7 @@ def setUp(self): ''' 
Test eye op with default parameters ''' + self.python_api = paddle.eye self.op_type = "eye" self.inputs = {} @@ -54,7 +56,7 @@ def setUp(self): self.outputs = {'Out': np.eye(50, dtype=float)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestEyeOp2(OpTest): @@ -62,6 +64,7 @@ def setUp(self): ''' Test eye op with specified shape ''' + self.python_api = paddle.eye self.op_type = "eye" self.inputs = {} @@ -69,7 +72,7 @@ def setUp(self): self.outputs = {'Out': np.eye(99, 1, dtype=float)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class API_TestTensorEye(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_fake_dequantize_op.py b/python/paddle/fluid/tests/unittests/test_fake_dequantize_op.py index b30e0a6775ea9..728e178845c9b 100644 --- a/python/paddle/fluid/tests/unittests/test_fake_dequantize_op.py +++ b/python/paddle/fluid/tests/unittests/test_fake_dequantize_op.py @@ -172,5 +172,83 @@ def set_args(self): self.data_type = "float32" +class TestChannelWiseDequantizeOp(OpTest): + def set_args(self): + self.bit_length = 8 + self.data_type = "float32" + self.quant_axis = 0 + + def setUp(self): + self.set_args() + self.op_type = "dequantize_linear" + x = np.random.randn(4, 3, 64, 64).astype(self.data_type) + yq, scale = channel_wise_quantize_max_abs(x, self.bit_length, + self.quant_axis) + ydq = channel_wise_dequantize_max_abs(yq, scale, self.bit_length, + self.quant_axis) + scale = np.array(scale).astype(self.data_type) + zero_point = np.zeros(scale.shape, dtype="int32") + print('TestChannelWiseDequantizeOp:') + self.inputs = {'X': yq, 'Scale': scale, 'ZeroPoint': zero_point} + self.attrs = { + 'bit_length': self.bit_length, + 'quant_axis': self.quant_axis + } + self.outputs = {'Y': ydq} + + def test_check_output(self): + self.check_output() + + +class TestChannelWiseDequantizeOp1(TestChannelWiseDequantizeOp): + def set_args(self): + self.bit_length = 8 + self.data_type = "float32" + self.quant_axis = 1 + + +class TestDequantizeOp(OpTest): + def set_args(self): + self.bit_length = 8 + self.quant_axis = -1 + self.max_range = math.pow(2, self.bit_length - 1) - 1 + self.data_type = "float32" + + def setUp(self): + self.set_args() + self.op_type = "dequantize_linear" + x = np.random.randn(31, 65).astype(self.data_type) + yq, scale = quantize_max_abs(x, self.max_range) + ydq = dequantize_max_abs(yq, scale, self.max_range) + scale = np.array(scale).astype(self.data_type) + zero_point = np.zeros(scale.shape, dtype="int32") + + self.inputs = {'X': yq, 'Scale': scale, 'ZeroPoint': zero_point} + self.attrs = { + 'bit_length': self.bit_length, + 'quant_axis': self.quant_axis + } + self.outputs = {'Y': ydq} + + def test_check_output(self): + self.check_output() + + +class TestDequantizeOpDouble(TestDequantizeOp): + def set_args(self): + self.bit_length = 8 + self.max_range = math.pow(2, self.bit_length - 1) - 1 + self.data_type = "float64" + self.quant_axis = -1 + + +class TestDequantizeOp5Bits(TestDequantizeOp): + def set_args(self): + self.bit_length = 5 + self.max_range = math.pow(2, self.bit_length - 1) - 1 + self.data_type = "float32" + self.quant_axis = -1 + + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_fake_quantize_op.py b/python/paddle/fluid/tests/unittests/test_fake_quantize_op.py index 1d7bfc9f6963c..2be61d1218560 100644 --- a/python/paddle/fluid/tests/unittests/test_fake_quantize_op.py +++ 
b/python/paddle/fluid/tests/unittests/test_fake_quantize_op.py @@ -16,6 +16,7 @@ import unittest import numpy as np +import math from op_test import OpTest import paddle.fluid.core as core @@ -374,5 +375,144 @@ def set_arg(self): self.inputs = {'X': np.random.random((30, 15)).astype("float32"), } +def quantize_max_abs(x, max_range): + scale = np.max(np.abs(x).flatten()) + y = np.round(x / scale * max_range) + return y, scale + + +def channel_wise_quantize_max_abs(x, quant_bit=8, quant_axis=0): + assert quant_axis in [0, 1], "The quant_axis should be 0 or 1." + scales = [] + y = x.copy() + max_range = math.pow(2, quant_bit - 1) - 1 + if quant_axis == 0: + for i in range(x.shape[0]): + scale = np.max(np.abs(x[i])).astype("float32") + scales.append(scale) + y[i] = np.round(x[i] * max_range / scale) + elif quant_axis == 1: + for i in range(x.shape[1]): + scale = np.max(np.abs(x[:, i])).astype("float32") + scales.append(scale) + y[:, i] = np.round(x[:, i] * max_range / scale) + return y, scales + + +class TestChannelWiseQuantizeOp(OpTest): + def set_args(self): + self.bit_length = 8 + self.data_type = "float32" + self.quant_axis = 0 + + def setUp(self): + self.set_args() + self.op_type = "quantize_linear" + x = np.random.randn(4, 3, 64, 64).astype(self.data_type) + yq, scale = channel_wise_quantize_max_abs(x, self.bit_length, + self.quant_axis) + scale = np.array(scale).astype(self.data_type) + zero_point = np.zeros(scale.shape, dtype="int32") + + self.inputs = {'X': x, 'Scale': scale, 'ZeroPoint': zero_point} + self.attrs = { + 'bit_length': self.bit_length, + 'quant_axis': self.quant_axis + } + self.outputs = {'Y': yq} + + def test_check_output(self): + self.check_output() + + +class TestChannelWiseQuantizeOp1(TestChannelWiseQuantizeOp): + def set_args(self): + self.bit_length = 8 + self.data_type = "float32" + self.quant_axis = 1 + + +class TestChannelWiseQuantizeOpTrain(OpTest): + def set_args(self): + self.bit_length = 8 + self.data_type = "float32" + self.quant_axis = 0 + self.is_test = False + + def setUp(self): + self.set_args() + self.op_type = "quantize_linear" + x = np.random.randn(4, 3, 64, 64).astype(self.data_type) + yq, scale = channel_wise_quantize_max_abs(x, self.bit_length, + self.quant_axis) + scale = np.array(scale).astype(self.data_type) + zero_point = np.zeros(scale.shape, dtype="int32") + + self.inputs = {'X': x, 'Scale': scale, 'ZeroPoint': zero_point} + self.attrs = { + 'bit_length': self.bit_length, + 'quant_axis': self.quant_axis, + 'is_test': self.is_test + } + self.outputs = {'Y': yq, 'OutScale': scale} + + def test_check_output(self): + self.check_output() + + +class TestquantizeOp(OpTest): + def set_args(self): + self.bit_length = 8 + self.quant_axis = -1 + self.max_range = math.pow(2, self.bit_length - 1) - 1 + self.data_type = "float32" + + def setUp(self): + self.set_args() + self.op_type = "quantize_linear" + x = np.random.randn(31, 65).astype(self.data_type) + yq, scale = quantize_max_abs(x, self.max_range) + scale = np.array(scale).astype(self.data_type) + zero_point = np.zeros(scale.shape, dtype="int32") + + self.inputs = {'X': x, 'Scale': scale, 'ZeroPoint': zero_point} + self.attrs = { + 'bit_length': self.bit_length, + 'quant_axis': self.quant_axis, + } + self.outputs = {'Y': yq} + + def test_check_output(self): + self.check_output() + + +class TestquantizeOpTrain(TestquantizeOp): + def set_args(self): + self.bit_length = 8 + self.quant_axis = -1 + self.max_range = math.pow(2, self.bit_length - 1) - 1 + self.data_type = "float32" + self.is_test = False + 
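The quantize_linear tests above derive their expected outputs from quantize_max_abs: the tensor is scaled by its absolute maximum and rounded into the signed range of bit_length bits (max_range = 2**(bit_length - 1) - 1). A small worked example, illustrative only, reusing the helper defined earlier in this file:

import math
import numpy as np

def quantize_max_abs(x, max_range):
    scale = np.max(np.abs(x).flatten())
    y = np.round(x / scale * max_range)
    return y, scale

bit_length = 8
max_range = math.pow(2, bit_length - 1) - 1      # 127 for 8-bit quantization
x = np.array([-0.5, 0.25, 1.0], dtype="float32")
yq, scale = quantize_max_abs(x, max_range)
# scale == 1.0, so yq == round(x * 127) == [-64., 32., 127.]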
+ def setUp(self): + self.set_args() + self.op_type = "quantize_linear" + x = np.random.randn(31, 65).astype(self.data_type) + yq, scale = quantize_max_abs(x, self.max_range) + scale = np.array(scale).astype(self.data_type) + zero_point = np.zeros(scale.shape, dtype="int32") + + self.inputs = {'X': x, 'Scale': scale, 'ZeroPoint': zero_point} + self.attrs = { + 'bit_length': self.bit_length, + 'quant_axis': self.quant_axis, + 'is_test': self.is_test + } + self.outputs = {'Y': yq, 'OutScale': scale} + + def test_check_output(self): + self.check_output() + + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_fill_constant_batch_size_like.py b/python/paddle/fluid/tests/unittests/test_fill_constant_batch_size_like.py new file mode 100644 index 0000000000000..774134f7a9960 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_fill_constant_batch_size_like.py @@ -0,0 +1,75 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import paddle +import paddle.fluid.core as core +from paddle.static import program_guard, Program +import paddle.compat as cpt +import unittest +import numpy as np +from op_test import OpTest +from paddle.fluid.framework import convert_np_dtype_to_dtype_ + +paddle.enable_static() + + +def fill_constant_batch_size_like(input, + shape, + value, + data_type, + input_dim_idx=0, + output_dim_idx=0, + force_cpu=False): + return paddle.fluid.layers.fill_constant_batch_size_like( + input, shape, data_type, value, input_dim_idx, output_dim_idx, + force_cpu) + + +class TestFillConstatnBatchSizeLike1(OpTest): + # test basic + def setUp(self): + self.op_type = "fill_constant_batch_size_like" + self.python_api = fill_constant_batch_size_like + self.init_data() + + input = np.zeros(self.shape) + out = np.full_like(input, self.value, self.dtype) + + self.inputs = {'Input': input} + self.outputs = {'Out': out} + self.attrs = { + 'shape': self.shape, + 'dtype': convert_np_dtype_to_dtype_(self.dtype), + 'value': self.value, + 'input_dim_idx': self.input_dim_idx, + 'output_dim_idx': self.output_dim_idx, + 'force_cpu': self.force_cpu + } + + def init_data(self): + self.shape = [10, 10] + self.dtype = np.float32 + self.value = 100 + self.input_dim_idx = 0 + self.output_dim_idx = 0 + self.force_cpu = False + + def test_check_output(self): + self.check_output(check_eager=True) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_flatten_contiguous_range_op.py b/python/paddle/fluid/tests/unittests/test_flatten_contiguous_range_op.py index 9093050d6d5c6..ac352fcdf87ea 100644 --- a/python/paddle/fluid/tests/unittests/test_flatten_contiguous_range_op.py +++ b/python/paddle/fluid/tests/unittests/test_flatten_contiguous_range_op.py @@ -23,6 +23,8 @@ class TestFlattenOp(OpTest): def setUp(self): + self.python_api = paddle.flatten + self.python_out_sig = ["Out"] self.op_type = "flatten_contiguous_range" 
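        # python_out_sig = ["Out"] tells OpTest that the bound Python API returns
        # only Out, so the internal XShape output can be left out of the eager
        # comparison (it is also excluded below via no_check_set=["XShape"]).
        # For reference, flatten_contiguous_range collapses axes start_axis..stop_axis:
        # a [3, 2, 5, 4] input with start_axis=0, stop_axis=-1 becomes [120].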
self.start_axis = 0 self.stop_axis = -1 @@ -35,10 +37,10 @@ def setUp(self): } def test_check_output(self): - self.check_output(no_check_set=["XShape"]) + self.check_output(no_check_set=["XShape"], check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) def init_test_case(self): self.in_shape = (3, 2, 5, 4) diff --git a/python/paddle/fluid/tests/unittests/test_fleet_base_single.py b/python/paddle/fluid/tests/unittests/test_fleet_base_single.py index 589d6adb0f52d..ff54035045b2e 100644 --- a/python/paddle/fluid/tests/unittests/test_fleet_base_single.py +++ b/python/paddle/fluid/tests/unittests/test_fleet_base_single.py @@ -46,7 +46,7 @@ def setUp(self): def test_dygraph_single(self): paddle.disable_static() - fleet.init(is_collective=True) + paddle.distributed.init_parallel_env() layer = LinearNet() loss_fn = nn.MSELoss() diff --git a/python/paddle/fluid/tests/unittests/test_flip.py b/python/paddle/fluid/tests/unittests/test_flip.py index 5e2aacf9cefed..010d23bca51d7 100644 --- a/python/paddle/fluid/tests/unittests/test_flip.py +++ b/python/paddle/fluid/tests/unittests/test_flip.py @@ -67,6 +67,7 @@ def test_dygraph(self): class TestFlipOp(OpTest): def setUp(self): self.op_type = 'flip' + self.python_api = paddle.tensor.flip self.init_test_case() self.inputs = {'X': np.random.random(self.in_shape).astype('float64')} self.init_attrs() @@ -76,10 +77,10 @@ def init_attrs(self): self.attrs = {"axis": self.axis} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) def init_test_case(self): self.in_shape = (6, 4, 2, 3) @@ -131,4 +132,5 @@ def init_test_case(self): if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_fmax_op.py b/python/paddle/fluid/tests/unittests/test_fmax_op.py index 3981d63c00582..608d97b68ac22 100644 --- a/python/paddle/fluid/tests/unittests/test_fmax_op.py +++ b/python/paddle/fluid/tests/unittests/test_fmax_op.py @@ -125,6 +125,7 @@ class TestElementwiseFmaxOp(OpTest): def setUp(self): """setUp""" self.op_type = "elementwise_fmax" + self.python_api = paddle.fmax # If x and y have the same value, the max() is not differentiable. # So we generate test data by the following method # to avoid them being too close to each other. @@ -136,21 +137,29 @@ def setUp(self): def test_check_output(self): """test_check_output""" - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): """test_check_grad_normal""" - self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): """test_check_grad_ingore_x""" self.check_grad( - ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X")) + ['Y'], + 'Out', + max_relative_error=0.005, + no_grad_set=set("X"), + check_eager=True) def test_check_grad_ingore_y(self): """test_check_grad_ingore_y""" self.check_grad( - ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y')) + ['X'], + 'Out', + max_relative_error=0.005, + no_grad_set=set('Y'), + check_eager=True) class TestElementwiseFmax2Op(OpTest): @@ -159,6 +168,7 @@ class TestElementwiseFmax2Op(OpTest): def setUp(self): """setUp""" self.op_type = "elementwise_fmax" + self.python_api = paddle.fmax # If x and y have the same value, the max() is not differentiable. 
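        # (paddle.fmax is the NaN-aware variant of maximum: like np.fmax it is
        #  expected to return the non-NaN operand when exactly one input is NaN;
        #  for ordinary values it behaves just like maximum, so the same
        #  tie-avoidance described below applies.)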
# So we generate test data by the following method # to avoid them being too close to each other. @@ -172,18 +182,26 @@ def setUp(self): def test_check_output(self): """test_check_output""" - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): """test_check_grad_normal""" - self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): """test_check_grad_ingore_x""" self.check_grad( - ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X")) + ['Y'], + 'Out', + max_relative_error=0.005, + no_grad_set=set("X"), + check_eager=True) def test_check_grad_ingore_y(self): """test_check_grad_ingore_y""" self.check_grad( - ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y')) + ['X'], + 'Out', + max_relative_error=0.005, + no_grad_set=set('Y'), + check_eager=True) diff --git a/python/paddle/fluid/tests/unittests/test_fmin_op.py b/python/paddle/fluid/tests/unittests/test_fmin_op.py index 7231823c37532..b9d26827988cd 100644 --- a/python/paddle/fluid/tests/unittests/test_fmin_op.py +++ b/python/paddle/fluid/tests/unittests/test_fmin_op.py @@ -127,6 +127,7 @@ class TestElementwiseFminOp(OpTest): def setUp(self): """setUp""" self.op_type = "elementwise_fmin" + self.python_api = paddle.fmin # If x and y have the same value, the min() is not differentiable. # So we generate test data by the following method # to avoid them being too close to each other. @@ -138,21 +139,29 @@ def setUp(self): def test_check_output(self): """test_check_output""" - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): """test_check_grad_normal""" - self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): """test_check_grad_ingore_x""" self.check_grad( - ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X")) + ['Y'], + 'Out', + max_relative_error=0.005, + no_grad_set=set("X"), + check_eager=True) def test_check_grad_ingore_y(self): """test_check_grad_ingore_y""" self.check_grad( - ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y')) + ['X'], + 'Out', + max_relative_error=0.005, + no_grad_set=set('Y'), + check_eager=True) class TestElementwiseFmin2Op(OpTest): @@ -161,6 +170,7 @@ class TestElementwiseFmin2Op(OpTest): def setUp(self): """setUp""" self.op_type = "elementwise_fmin" + self.python_api = paddle.fmin # If x and y have the same value, the min() is not differentiable. # So we generate test data by the following method # to avoid them being too close to each other. 
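Because the generated data keeps x and y well apart, the analytic gradients that these fmax/fmin checks compare against are unambiguous. A minimal sketch of the rule being exercised (illustrative only, ignoring NaNs):

import numpy as np

x = np.array([0.2, 0.9, 0.4])
y = np.array([0.5, 0.1, 0.8])
grad_out = np.ones_like(x)

# Away from ties, the upstream gradient flows entirely to whichever input wins.
grad_x_max = grad_out * (x > y)   # gradient of max/fmax w.r.t. x -> [0., 1., 0.]
grad_y_max = grad_out * (y > x)   # gradient of max/fmax w.r.t. y -> [1., 0., 1.]
grad_x_min = grad_out * (x < y)   # gradient of min/fmin w.r.t. x -> [1., 0., 1.]
grad_y_min = grad_out * (y < x)   # gradient of min/fmin w.r.t. y -> [0., 1., 0.]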
@@ -174,21 +184,29 @@ def setUp(self): def test_check_output(self): """test_check_output""" - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): """test_check_grad_normal""" - self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): """test_check_grad_ingore_x""" self.check_grad( - ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X")) + ['Y'], + 'Out', + max_relative_error=0.005, + no_grad_set=set("X"), + check_eager=True) def test_check_grad_ingore_y(self): """test_check_grad_ingore_y""" self.check_grad( - ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y')) + ['X'], + 'Out', + max_relative_error=0.005, + no_grad_set=set('Y'), + check_eager=True) if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/test_ftrl_op.py b/python/paddle/fluid/tests/unittests/test_ftrl_op.py index f58672a7a1e89..1826fdc3c0604 100644 --- a/python/paddle/fluid/tests/unittests/test_ftrl_op.py +++ b/python/paddle/fluid/tests/unittests/test_ftrl_op.py @@ -101,7 +101,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestSparseFTRLOp(unittest.TestCase): @@ -201,4 +201,6 @@ def init_kernel(self): if __name__ == "__main__": + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_full_like_op.py b/python/paddle/fluid/tests/unittests/test_full_like_op.py index 3ae2e9ff6bdaf..05a310a9c5033 100644 --- a/python/paddle/fluid/tests/unittests/test_full_like_op.py +++ b/python/paddle/fluid/tests/unittests/test_full_like_op.py @@ -21,6 +21,7 @@ import unittest import numpy as np from op_test import OpTest +from paddle.fluid.framework import convert_np_dtype_to_dtype_ class TestFullOp(unittest.TestCase): @@ -92,5 +93,45 @@ def test_input_dtype(): dtype='uint4') +class TestFullLikeOp1(OpTest): + # test basic + def setUp(self): + self.op_type = "fill_any_like" + self.python_api = paddle.full_like + self.init_data() + + x = np.zeros(self.shape) + out = np.full_like(x, self.fill_value, self.dtype) + + self.inputs = {'X': x} + self.outputs = {'Out': out} + self.attrs = { + 'value': self.fill_value, + 'dtype': convert_np_dtype_to_dtype_(self.dtype) + } + + def init_data(self): + self.fill_value = 5 + self.shape = [10, 10] + self.dtype = np.float32 + + def test_check_output(self): + self.check_output(check_eager=True) + + +class TestFullLikeOp2(TestFullLikeOp1): + def init_data(self): + self.fill_value = 1000 + self.shape = [1024, 1024] + self.dtype = np.float64 + + +class TestFullLikeOp3(TestFullLikeOp1): + def init_data(self): + self.fill_value = 8888 + self.shape = [5000, 5000] + self.dtype = np.int64 + + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_functional_conv2d_transpose.py b/python/paddle/fluid/tests/unittests/test_functional_conv2d_transpose.py index f25a15106c491..781169d70c17c 100644 --- a/python/paddle/fluid/tests/unittests/test_functional_conv2d_transpose.py +++ b/python/paddle/fluid/tests/unittests/test_functional_conv2d_transpose.py @@ -13,12 +13,13 @@ # limitations under the License. 
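The new TestFullLikeOp* cases above check fill_any_like against paddle.full_like, with np.full_like as the reference. A short usage sketch (illustrative only):

import numpy as np
import paddle

x = paddle.zeros([2, 3], dtype="float32")
out = paddle.full_like(x, fill_value=5)
# out keeps x's shape and dtype, with every element set to 5.0,
# matching np.full_like(np.zeros([2, 3], dtype="float32"), 5).
print(out.numpy())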
import paddle -import paddle.nn.functional as F -from paddle import fluid +import unittest +import numpy as np import paddle.fluid.dygraph as dg import paddle.fluid.initializer as I -import numpy as np -import unittest +import paddle.nn.functional as F +from paddle import fluid +from paddle.fluid.framework import _test_eager_guard from unittest import TestCase @@ -159,12 +160,22 @@ def test_identity_cpu(self): self.place = fluid.CPUPlace() self._test_identity() + def test_identity_cpu_check_eager(self): + with _test_eager_guard(): + self.test_identity_cpu() + @unittest.skipIf(not fluid.core.is_compiled_with_cuda(), "core is not compiled with CUDA") def test_identity_gpu(self): self.place = fluid.CUDAPlace(0) self._test_identity() + @unittest.skipIf(not fluid.core.is_compiled_with_cuda(), + "core is not compiled with CUDA") + def test_identity_gpu_check_eager(self): + with _test_eager_guard(): + self.test_identity_gpu() + class TestFunctionalConv2DError(TestCase): batch_size = 4 @@ -520,6 +531,10 @@ def test_dygraph_exception(self): with self.assertRaises(ValueError): self.dygraph_case() + def test_dygraph_exception_check_eager(self): + with _test_eager_guard(): + self.test_dygraph_exception() + def test_static_exception(self): with self.assertRaises(ValueError): self.static_graph_case() diff --git a/python/paddle/fluid/tests/unittests/test_functional_conv3d_transpose.py b/python/paddle/fluid/tests/unittests/test_functional_conv3d_transpose.py index a003de6596822..6f25d65aac227 100644 --- a/python/paddle/fluid/tests/unittests/test_functional_conv3d_transpose.py +++ b/python/paddle/fluid/tests/unittests/test_functional_conv3d_transpose.py @@ -13,12 +13,13 @@ # limitations under the License. import paddle -import paddle.nn.functional as F -from paddle import fluid +import numpy as np import paddle.fluid.dygraph as dg import paddle.fluid.initializer as I -import numpy as np +import paddle.nn.functional as F import unittest +from paddle import fluid +from paddle.fluid.framework import _test_eager_guard from unittest import TestCase @@ -165,12 +166,22 @@ def test_identity_cpu(self): self.place = fluid.CPUPlace() self._test_identity() + def test_identity_cpu_check_eager(self): + with _test_eager_guard(): + self.test_identity_cpu() + @unittest.skipIf(not fluid.core.is_compiled_with_cuda(), "core is not compiled with CUDA") def test_identity_gpu(self): self.place = fluid.CUDAPlace(0) self._test_identity() + @unittest.skipIf(not fluid.core.is_compiled_with_cuda(), + "core is not compiled with CUDA") + def test_identity_gpu_check_eager(self): + with _test_eager_guard(): + self.test_identity_gpu() + class TestFunctionalConv3DTransposeError(TestCase): batch_size = 4 @@ -540,6 +551,10 @@ def test_dygraph_exception(self): with self.assertRaises(ValueError): self.dygraph_case() + def test_dygraph_exception_check_eager(self): + with _test_eager_guard(): + self.test_dygraph_exception() + def test_static_exception(self): with self.assertRaises(ValueError): self.static_graph_case() diff --git a/python/paddle/fluid/tests/unittests/test_fused_transformer_encoder_layer.py b/python/paddle/fluid/tests/unittests/test_fused_transformer_encoder_layer.py index e0281d6e21e5a..7dc86d0dea382 100644 --- a/python/paddle/fluid/tests/unittests/test_fused_transformer_encoder_layer.py +++ b/python/paddle/fluid/tests/unittests/test_fused_transformer_encoder_layer.py @@ -16,7 +16,7 @@ import paddle from paddle.incubate.nn import FusedTransformerEncoderLayer from paddle.nn import TransformerEncoderLayer -from 
paddle.fluid.framework import default_main_program +from paddle.fluid.framework import default_main_program, in_dygraph_mode import unittest @@ -61,6 +61,8 @@ def fused_qkv(self, q, k, v, num_head): return paddle.concat(x=[fq, fk, fv], axis=0) def test_out(self): + if in_dygraph_mode(): + return default_main_program().random_seed = 42 base_encoder = TransformerEncoderLayer( self.d_model, self.nhead, self.dim_feedforward, self.dropout_rate, diff --git a/python/paddle/fluid/tests/unittests/test_gather_op.py b/python/paddle/fluid/tests/unittests/test_gather_op.py index 978a3d86d882a..3d7dc2da052f3 100644 --- a/python/paddle/fluid/tests/unittests/test_gather_op.py +++ b/python/paddle/fluid/tests/unittests/test_gather_op.py @@ -33,6 +33,7 @@ def gather_numpy(x, index, axis): class TestGatherOp(OpTest): def setUp(self): self.op_type = "gather" + self.python_api = paddle.gather self.config() xnp = np.random.random(self.x_shape).astype(self.x_type) self.inputs = { @@ -42,10 +43,10 @@ def setUp(self): self.outputs = {'Out': self.inputs["X"][self.inputs["Index"]]} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def config(self): """ @@ -120,6 +121,7 @@ def config(self): class TestGatherBF16Op(OpTest): def setUp(self): self.op_type = "gather" + self.python_api = paddle.gather self.dtype = np.uint16 self.config() xnp = np.random.random(self.x_shape).astype(np.float32) @@ -134,10 +136,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out', numeric_grad_delta=0.5) + self.check_grad(['X'], 'Out', numeric_grad_delta=0.5, check_eager=True) def config(self): """ @@ -153,6 +155,7 @@ def config(self): class TestGatherOp1(OpTest): def setUp(self): self.op_type = "gather" + self.python_api = paddle.gather self.config() xnp = np.random.random(self.x_shape).astype(self.x_type) axis_np = np.array(self.axis).astype(self.index_type) @@ -162,10 +165,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def config(self): """ diff --git a/python/paddle/fluid/tests/unittests/test_gaussian_random_op.py b/python/paddle/fluid/tests/unittests/test_gaussian_random_op.py index 738441a46d377..4fca8b9f2a118 100644 --- a/python/paddle/fluid/tests/unittests/test_gaussian_random_op.py +++ b/python/paddle/fluid/tests/unittests/test_gaussian_random_op.py @@ -342,9 +342,6 @@ def test_fixed_random_number(self): if not "V100" in paddle.device.cuda.get_device_name(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - def _check_random_value(dtype, expect, expect_mean, expect_std): x = paddle.randn([32, 3, 1024, 1024], dtype=dtype) actual = x.numpy() diff --git a/python/paddle/fluid/tests/unittests/test_gelu_op.py b/python/paddle/fluid/tests/unittests/test_gelu_op.py index de34b63c9398e..abfb65c27a951 100644 --- a/python/paddle/fluid/tests/unittests/test_gelu_op.py +++ b/python/paddle/fluid/tests/unittests/test_gelu_op.py @@ -21,6 +21,7 @@ import paddle.fluid.dygraph as dg import paddle import paddle.nn.functional as F +from paddle.fluid.framework import _test_eager_guard def gelu(x, approximate): @@ -91,6 +92,10 @@ def 
run_gelu_op(approximate): np.allclose( x_g_ref, x_g_fast_math, rtol=1e-5, atol=5e-4)) + def test_fast_math_eager(self): + with _test_eager_guard(): + self.test_fast_math() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_graph_reindex.py b/python/paddle/fluid/tests/unittests/test_graph_reindex.py new file mode 100644 index 0000000000000..52abbbe81aef9 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_graph_reindex.py @@ -0,0 +1,113 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +import numpy as np +import paddle +import paddle.fluid as fluid + + +class TestGraphReindex(unittest.TestCase): + def setUp(self): + self.x = np.arange(5).astype("int64") + self.neighbors = np.random.randint(100, size=20).astype("int64") + self.count = np.array([2, 8, 4, 3, 3], dtype="int32") + + # Get numpy result. + out_nodes = list(self.x) + for neighbor in self.neighbors: + if neighbor not in out_nodes: + out_nodes.append(neighbor) + self.out_nodes = np.array(out_nodes, dtype="int64") + reindex_dict = {node: ind for ind, node in enumerate(self.out_nodes)} + self.reindex_src = np.array( + [reindex_dict[node] for node in self.neighbors]) + reindex_dst = [] + for node, c in zip(self.x, self.count): + for i in range(c): + reindex_dst.append(reindex_dict[node]) + self.reindex_dst = np.array(reindex_dst, dtype="int64") + self.num_nodes = np.max(np.concatenate([self.x, self.neighbors])) + 1 + + def test_reindex_result(self): + paddle.disable_static() + x = paddle.to_tensor(self.x) + neighbors = paddle.to_tensor(self.neighbors) + count = paddle.to_tensor(self.count) + value_buffer = paddle.full([self.num_nodes], -1, dtype="int32") + index_buffer = paddle.full([self.num_nodes], -1, dtype="int32") + + reindex_src, reindex_dst, out_nodes = \ + paddle.incubate.graph_reindex(x, neighbors, count) + self.assertTrue(np.allclose(self.reindex_src, reindex_src)) + self.assertTrue(np.allclose(self.reindex_dst, reindex_dst)) + self.assertTrue(np.allclose(self.out_nodes, out_nodes)) + + reindex_src, reindex_dst, out_nodes = \ + paddle.incubate.graph_reindex(x, neighbors, count, + value_buffer, index_buffer, + flag_buffer_hashtable=True) + self.assertTrue(np.allclose(self.reindex_src, reindex_src)) + self.assertTrue(np.allclose(self.reindex_dst, reindex_dst)) + self.assertTrue(np.allclose(self.out_nodes, out_nodes)) + + def test_reindex_result_static(self): + paddle.enable_static() + with paddle.static.program_guard(paddle.static.Program()): + x = paddle.static.data( + name="x", shape=self.x.shape, dtype=self.x.dtype) + neighbors = paddle.static.data( + name="neighbors", + shape=self.neighbors.shape, + dtype=self.neighbors.dtype) + count = paddle.static.data( + name="count", shape=self.count.shape, dtype=self.count.dtype) + value_buffer = paddle.static.data( + name="value_buffer", shape=[self.num_nodes], dtype="int32") + index_buffer = paddle.static.data( + name="index_buffer", 
shape=[self.num_nodes], dtype="int32") + + reindex_src_1, reindex_dst_1, out_nodes_1 = \ + paddle.incubate.graph_reindex(x, neighbors, count) + reindex_src_2, reindex_dst_2, out_nodes_2 = \ + paddle.incubate.graph_reindex(x, neighbors, count, + value_buffer, index_buffer, + flag_buffer_hashtable=True) + + exe = paddle.static.Executor(paddle.CPUPlace()) + ret = exe.run(feed={ + 'x': self.x, + 'neighbors': self.neighbors, + 'count': self.count, + 'value_buffer': np.full( + [self.num_nodes], -1, dtype="int32"), + 'index_buffer': np.full( + [self.num_nodes], -1, dtype="int32") + }, + fetch_list=[ + reindex_src_1, reindex_dst_1, out_nodes_1, + reindex_src_2, reindex_dst_2, out_nodes_2 + ]) + reindex_src_1, reindex_dst_1, out_nodes_1, reindex_src_2, \ + reindex_dst_2, out_nodes_2 = ret + self.assertTrue(np.allclose(self.reindex_src, reindex_src_1)) + self.assertTrue(np.allclose(self.reindex_dst, reindex_dst_1)) + self.assertTrue(np.allclose(self.out_nodes, out_nodes_1)) + self.assertTrue(np.allclose(self.reindex_src, reindex_src_2)) + self.assertTrue(np.allclose(self.reindex_dst, reindex_dst_2)) + self.assertTrue(np.allclose(self.out_nodes, out_nodes_2)) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_graph_sample_neighbors.py b/python/paddle/fluid/tests/unittests/test_graph_sample_neighbors.py new file mode 100644 index 0000000000000..d2fbeab3fd42c --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_graph_sample_neighbors.py @@ -0,0 +1,209 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
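The new test_graph_sample_neighbors.py that starts here stores the graph in a compressed-sparse-column style layout before sampling: `row` holds source nodes grouped by destination, and `colptr` holds cumulative in-degree offsets. A minimal numpy sketch of that construction, with a toy edge list used only for illustration:

import numpy as np

# Toy (src, dst) edge list; illustrative only.
edges = np.array([[1, 0], [2, 0], [0, 1], [2, 2]], dtype="int64")
sorted_edges = edges[np.argsort(edges[:, 1])]           # group edges by destination
row = sorted_edges[:, 0]                                # -> [1, 2, 0, 2]
counts = np.bincount(sorted_edges[:, 1], minlength=3)   # in-degree per node -> [2, 1, 1]
colptr = np.insert(np.cumsum(counts), 0, 0)             # -> [0, 2, 3, 4]
# Neighbors (source nodes) of destination d are row[colptr[d]:colptr[d + 1]].
print(row[colptr[0]:colptr[1]])                         # neighbors of node 0 -> [1 2]

The setUp below computes the same per-node counts with a boolean mask for each destination; np.bincount is simply a shorter equivalent in this sketch.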
+ +import unittest +import numpy as np +import paddle +import paddle.fluid as fluid + + +class TestGraphSampleNeighbors(unittest.TestCase): + def setUp(self): + num_nodes = 20 + edges = np.random.randint(num_nodes, size=(100, 2)) + edges = np.unique(edges, axis=0) + self.edges_id = np.arange(0, len(edges)).astype("int64") + sorted_edges = edges[np.argsort(edges[:, 1])] + + # Calculate dst index cumsum counts, also means colptr + dst_count = np.zeros(num_nodes) + dst_src_dict = {} + for dst in range(0, num_nodes): + true_index = sorted_edges[:, 1] == dst + dst_count[dst] = np.sum(true_index) + dst_src_dict[dst] = sorted_edges[:, 0][true_index] + dst_count = dst_count.astype("int64") + colptr = np.cumsum(dst_count) + colptr = np.insert(colptr, 0, 0) + + self.row = sorted_edges[:, 0].astype("int64") + self.colptr = colptr.astype("int64") + self.nodes = np.unique(np.random.randint( + num_nodes, size=5)).astype("int64") + self.sample_size = 5 + self.dst_src_dict = dst_src_dict + + def test_sample_result(self): + paddle.disable_static() + row = paddle.to_tensor(self.row) + colptr = paddle.to_tensor(self.colptr) + nodes = paddle.to_tensor(self.nodes) + + out_neighbors, out_count = paddle.incubate.graph_sample_neighbors( + row, colptr, nodes, sample_size=self.sample_size) + out_count_cumsum = paddle.cumsum(out_count) + for i in range(len(out_count)): + if i == 0: + neighbors = out_neighbors[0:out_count_cumsum[i]] + else: + neighbors = out_neighbors[out_count_cumsum[i - 1]: + out_count_cumsum[i]] + # Ensure the correct sample size. + self.assertTrue( + out_count[i] == self.sample_size or + out_count[i] == len(self.dst_src_dict[self.nodes[i]])) + # Ensure no repetitive sample neighbors. + self.assertTrue( + neighbors.shape[0] == paddle.unique(neighbors).shape[0]) + # Ensure the correct sample neighbors. + in_neighbors = np.isin(neighbors.numpy(), + self.dst_src_dict[self.nodes[i]]) + self.assertTrue(np.sum(in_neighbors) == in_neighbors.shape[0]) + + def test_sample_result_fisher_yates_sampling(self): + paddle.disable_static() + if fluid.core.is_compiled_with_cuda(): + row = paddle.to_tensor(self.row) + colptr = paddle.to_tensor(self.colptr) + nodes = paddle.to_tensor(self.nodes) + perm_buffer = paddle.to_tensor(self.edges_id) + + out_neighbors, out_count = paddle.incubate.graph_sample_neighbors( + row, + colptr, + nodes, + perm_buffer=perm_buffer, + sample_size=self.sample_size, + flag_perm_buffer=True) + out_count_cumsum = paddle.cumsum(out_count) + for i in range(len(out_count)): + if i == 0: + neighbors = out_neighbors[0:out_count_cumsum[i]] + else: + neighbors = out_neighbors[out_count_cumsum[i - 1]: + out_count_cumsum[i]] + # Ensure the correct sample size. + self.assertTrue( + out_count[i] == self.sample_size or + out_count[i] == len(self.dst_src_dict[self.nodes[i]])) + # Ensure no repetitive sample neighbors. + self.assertTrue( + neighbors.shape[0] == paddle.unique(neighbors).shape[0]) + # Ensure the correct sample neighbors. 
+ in_neighbors = np.isin(neighbors.numpy(), + self.dst_src_dict[self.nodes[i]]) + self.assertTrue(np.sum(in_neighbors) == in_neighbors.shape[0]) + + def test_sample_result_static(self): + paddle.enable_static() + with paddle.static.program_guard(paddle.static.Program()): + row = paddle.static.data( + name="row", shape=self.row.shape, dtype=self.row.dtype) + colptr = paddle.static.data( + name="colptr", shape=self.colptr.shape, dtype=self.colptr.dtype) + nodes = paddle.static.data( + name="nodes", shape=self.nodes.shape, dtype=self.nodes.dtype) + + out_neighbors, out_count = paddle.incubate.graph_sample_neighbors( + row, colptr, nodes, sample_size=self.sample_size) + exe = paddle.static.Executor(paddle.CPUPlace()) + ret = exe.run(feed={ + 'row': self.row, + 'colptr': self.colptr, + 'nodes': self.nodes + }, + fetch_list=[out_neighbors, out_count]) + out_neighbors, out_count = ret + out_count_cumsum = np.cumsum(out_count) + out_neighbors = np.split(out_neighbors, out_count_cumsum)[:-1] + for neighbors, node, count in zip(out_neighbors, self.nodes, + out_count): + self.assertTrue(count == self.sample_size or + count == len(self.dst_src_dict[node])) + self.assertTrue( + neighbors.shape[0] == np.unique(neighbors).shape[0]) + in_neighbors = np.isin(neighbors, self.dst_src_dict[node]) + self.assertTrue(np.sum(in_neighbors) == in_neighbors.shape[0]) + + def test_raise_errors(self): + paddle.disable_static() + row = paddle.to_tensor(self.row) + colptr = paddle.to_tensor(self.colptr) + nodes = paddle.to_tensor(self.nodes) + + def check_eid_error(): + paddle.incubate.graph_sample_neighbors( + row, + colptr, + nodes, + sample_size=self.sample_size, + return_eids=True) + + def check_perm_buffer_error(): + paddle.incubate.graph_sample_neighbors( + row, + colptr, + nodes, + sample_size=self.sample_size, + flag_perm_buffer=True) + + self.assertRaises(ValueError, check_eid_error) + self.assertRaises(ValueError, check_perm_buffer_error) + + def test_sample_result_with_eids(self): + # Note: Currently return eid results is not initialized. 
+ paddle.disable_static() + row = paddle.to_tensor(self.row) + colptr = paddle.to_tensor(self.colptr) + nodes = paddle.to_tensor(self.nodes) + eids = paddle.to_tensor(self.edges_id) + + out_neighbors, out_count, _ = paddle.incubate.graph_sample_neighbors( + row, + colptr, + nodes, + eids=eids, + sample_size=self.sample_size, + return_eids=True) + + paddle.enable_static() + with paddle.static.program_guard(paddle.static.Program()): + row = paddle.static.data( + name="row", shape=self.row.shape, dtype=self.row.dtype) + colptr = paddle.static.data( + name="colptr", shape=self.colptr.shape, dtype=self.colptr.dtype) + nodes = paddle.static.data( + name="nodes", shape=self.nodes.shape, dtype=self.nodes.dtype) + eids = paddle.static.data( + name="eids", shape=self.edges_id.shape, dtype=self.nodes.dtype) + + out_neighbors, out_count, _ = paddle.incubate.graph_sample_neighbors( + row, + colptr, + nodes, + eids, + sample_size=self.sample_size, + return_eids=True) + exe = paddle.static.Executor(paddle.CPUPlace()) + ret = exe.run(feed={ + 'row': self.row, + 'colptr': self.colptr, + 'nodes': self.nodes, + 'eids': self.edges_id + }, + fetch_list=[out_neighbors, out_count]) + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_graph_send_recv_op.py b/python/paddle/fluid/tests/unittests/test_graph_send_recv_op.py index 30f943e3248e9..c233606c053d8 100644 --- a/python/paddle/fluid/tests/unittests/test_graph_send_recv_op.py +++ b/python/paddle/fluid/tests/unittests/test_graph_send_recv_op.py @@ -17,13 +17,26 @@ import numpy as np import paddle import paddle.fluid as fluid +from paddle.fluid.framework import _test_eager_guard from op_test import OpTest +def graph_send_recv_wrapper(x, + src_index, + dst_index, + pool_type="sum", + out_size=None, + name=None): + return paddle.incubate.graph_send_recv(x, src_index, dst_index, + pool_type.lower(), out_size, name) + + class TestGraphSendRecvMaxOp(OpTest): def setUp(self): paddle.enable_static() + self.python_api = graph_send_recv_wrapper + self.python_out_sig = ["Out"] self.op_type = "graph_send_recv" x = np.random.random((10, 20)).astype("float64") index = np.random.randint(0, 10, (15, 2)).astype(np.int64) @@ -39,15 +52,18 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out', user_defined_grads=[self.gradient]) + self.check_grad( + ['X'], 'Out', user_defined_grads=[self.gradient], check_eager=True) class TestGraphSendRecvMinOp(OpTest): def setUp(self): paddle.enable_static() + self.python_api = graph_send_recv_wrapper + self.python_out_sig = ["Out"] self.op_type = "graph_send_recv" x = np.random.random((10, 20)).astype("float64") index = np.random.randint(0, 10, (15, 2)).astype(np.int64) @@ -64,15 +80,18 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out', user_defined_grads=[self.gradient]) + self.check_grad( + ['X'], 'Out', user_defined_grads=[self.gradient], check_eager=True) class TestGraphSendRecvSumOp(OpTest): def setUp(self): paddle.enable_static() + self.python_api = graph_send_recv_wrapper + self.python_out_sig = ["Out"] self.op_type = "graph_send_recv" x = np.random.random((10, 20)).astype("float64") index = np.random.randint(0, 10, (15, 2)).astype(np.int64) @@ -88,15 +107,17 @@ def setUp(self): self.outputs = {'Out': out} 
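The graph_send_recv tests above show the OpTest migration pattern that recurs throughout this patch: set `self.python_api` to the corresponding Python call (through a small wrapper such as `graph_send_recv_wrapper` above when the op attributes and the public API arguments differ, here by lowercasing pool_type), optionally set `self.python_out_sig` to list the outputs that call returns, and pass `check_eager=True` to `check_output`/`check_grad`. A condensed sketch of the same wiring, modeled on the kron test further down and not part of this patch:

import unittest

import numpy as np
import paddle
from op_test import OpTest  # test-suite base class used by all files in this directory


class TestKronEagerSketch(OpTest):
    # Condensed from TestKronOp below; shows only the eager-check wiring.
    def setUp(self):
        self.op_type = "kron"
        self.python_api = paddle.kron   # Python API used for the eager-mode check
        x = np.random.uniform(size=(4, 4)).astype("float64")
        y = np.random.uniform(size=(2, 3)).astype("float64")
        self.inputs = {'X': x, 'Y': y}
        self.outputs = {'Out': np.kron(x, y)}

    def test_check_output(self):
        self.check_output(check_eager=True)

    def test_check_grad(self):
        self.check_grad(['X', 'Y'], 'Out', check_eager=True)


if __name__ == "__main__":
    unittest.main()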
def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestGraphSendRecvMeanOp(OpTest): def setUp(self): paddle.enable_static() + self.python_api = graph_send_recv_wrapper + self.python_out_sig = ["Out"] self.op_type = "graph_send_recv" x = np.random.random((10, 20)).astype("float64") index = np.random.randint(0, 10, (15, 2)).astype(np.int64) @@ -113,10 +134,10 @@ def setUp(self): self.outputs = {'Out': out, 'Dst_count': dst_count} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def compute_graph_send_recv_for_sum_mean(inputs, attributes): @@ -333,6 +354,12 @@ def test_set_outsize_gpu(self): {}\n{}, check diff!" .format(np_res_set_outsize, res_set_outsize)) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_dygraph() + self.test_int32_input() + self.test_set_outsize_gpu() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_gumbel_softmax_op.py b/python/paddle/fluid/tests/unittests/test_gumbel_softmax_op.py index e423404d07fb1..7c706eabd1d7a 100644 --- a/python/paddle/fluid/tests/unittests/test_gumbel_softmax_op.py +++ b/python/paddle/fluid/tests/unittests/test_gumbel_softmax_op.py @@ -17,6 +17,7 @@ import paddle import paddle.fluid as fluid from paddle.fluid import Program, program_guard +from paddle.fluid.framework import _test_eager_guard paddle.enable_static() @@ -177,12 +178,17 @@ def test_check_api(self): self.assertEqual(out_np.sum(), self.count_expected) # test dygrapg api - paddle.disable_static() - x = paddle.to_tensor(self.x) - y = paddle.nn.functional.gumbel_softmax(x, hard=True) - out_np = np.array(y) - self.assertEqual(out_np.sum(), self.count_expected) - paddle.enable_static() + with paddle.fluid.dygraph.base.guard(): + x = paddle.to_tensor(self.x) + y = paddle.nn.functional.gumbel_softmax(x, hard=True) + out_np = np.array(y) + self.assertEqual(out_np.sum(), self.count_expected) + + with _test_eager_guard(): + x = paddle.to_tensor(self.x) + y = paddle.nn.functional.gumbel_softmax(x, hard=True) + out_np = np.array(y) + self.assertEqual(out_np.sum(), self.count_expected) class TestGumbelSoftmaxOpError(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_histogram_op.py b/python/paddle/fluid/tests/unittests/test_histogram_op.py index 7da9dbd62e9f9..819029c5fcd9d 100644 --- a/python/paddle/fluid/tests/unittests/test_histogram_op.py +++ b/python/paddle/fluid/tests/unittests/test_histogram_op.py @@ -21,6 +21,7 @@ import paddle.fluid.core as core from paddle.fluid import Program, program_guard from op_test import OpTest +from paddle.fluid.framework import _test_eager_guard class TestHistogramOpAPI(unittest.TestCase): @@ -57,6 +58,15 @@ def test_dygraph(self): (actual.numpy() == expected).all(), msg='histogram output is wrong, out =' + str(actual.numpy())) + with _test_eager_guard(): + inputs_np = np.array([[2, 4, 2], [2, 5, 4]]).astype(np.int64) + inputs = paddle.to_tensor(inputs_np) + actual = paddle.histogram(inputs, bins=5, min=1, max=5) + self.assertTrue( + (actual.numpy() == expected).all(), + msg='histogram output is wrong, out =' + + str(actual.numpy())) + class TestHistogramOpError(unittest.TestCase): """Test histogram op error.""" @@ -118,6 +128,7 @@ def setUp(self): self.op_type = 
"histogram" self.init_test_case() np_input = np.random.uniform(low=0.0, high=20.0, size=self.in_shape) + self.python_api = paddle.histogram self.inputs = {"X": np_input} self.init_attrs() Out, _ = np.histogram( @@ -134,7 +145,7 @@ def init_attrs(self): self.attrs = {"bins": self.bins, "min": self.min, "max": self.max} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/test_huber_loss_op.py b/python/paddle/fluid/tests/unittests/test_huber_loss_op.py index 9354569e34754..9a0437ad2f556 100644 --- a/python/paddle/fluid/tests/unittests/test_huber_loss_op.py +++ b/python/paddle/fluid/tests/unittests/test_huber_loss_op.py @@ -18,6 +18,7 @@ import numpy as np from op_test import OpTest import paddle.fluid as fluid +import paddle from paddle.fluid import compiler, Program, program_guard @@ -32,6 +33,8 @@ def huber_loss_forward(val, delta): class TestHuberLossOp(OpTest): def setUp(self): self.op_type = 'huber_loss' + self.python_api = paddle.fluid.layers.huber_loss + self.python_out_sig = ["Out"] self.delta = 1.0 self.init_input() shape = self.set_shape() @@ -52,10 +55,10 @@ def set_shape(self): return (100, 1) def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ingore_x(self): self.check_grad( @@ -103,4 +106,5 @@ def test_errors(self): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_imperative_double_grad.py b/python/paddle/fluid/tests/unittests/test_imperative_double_grad.py index 9977756f406d5..c9e41fe93ebe1 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_double_grad.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_double_grad.py @@ -639,5 +639,40 @@ def test_resnet_resnet101(self): self.assertTrue(np.array_equal(egr_g_numpy, g_numpy)) +class TestDoubleGradBasics(TestCase): + def test_matmul(self): + input_numpy = np.ones([3, 3]) * 2 + with _test_eager_guard(): + x = paddle.to_tensor( + input_numpy, stop_gradient=False, dtype='float32') + y = paddle.to_tensor( + input_numpy, stop_gradient=False, dtype='float32') + grad_out = paddle.to_tensor( + np.ones([3, 3]), stop_gradient=False, dtype='float32') + + out = paddle.matmul(x, y, False, False) + new_x_g, new_y_g = paddle.grad( + [out], [x, y], [grad_out], retain_graph=True, create_graph=True) + new_x_g.backward() + + out_ref = np.ones([3, 3]) * 12.0 + self.assertTrue(np.array_equal(out.numpy(), out_ref)) + + new_x_g_ref = np.ones([3, 3]) * 6.0 + new_y_g_ref = np.ones([3, 3]) * 6.0 + self.assertTrue(np.array_equal(new_x_g.numpy(), new_x_g_ref)) + self.assertTrue(np.array_equal(new_y_g.numpy(), new_y_g_ref)) + + x_grad_ref = np.ones([3, 3]) * 0.0 + self.assertTrue(np.array_equal(x.grad.numpy(), x_grad_ref)) + + y_grad_ref = np.ones([3, 3]) * 3.0 + self.assertTrue(np.array_equal(y.grad.numpy(), y_grad_ref)) + + grad_out_grad_ref = np.ones([3, 3]) * 6.0 + self.assertTrue( + np.array_equal(grad_out.grad.numpy(), grad_out_grad_ref)) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_imperative_group.py b/python/paddle/fluid/tests/unittests/test_imperative_group.py index 89535797ed098..994ae27a290a3 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_group.py +++ 
b/python/paddle/fluid/tests/unittests/test_imperative_group.py @@ -26,7 +26,7 @@ from paddle.fluid.dygraph.nn import Linear import paddle.fluid.core as core from paddle.fluid.optimizer import SGDOptimizer -from paddle.fluid.framework import _test_eager_guard +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph, in_dygraph_mode class TestDataParallelGroup(unittest.TestCase): @@ -34,7 +34,10 @@ def create_varbase(self, dtype, shape): return paddle.rand(shape=shape, dtype=dtype) def assign_group_by_size(self, *args): - return core.assign_group_by_size(*args) + if in_dygraph_mode(): + return core.eager_assign_group_by_size(*args) + elif _in_legacy_dygraph(): + return core.assign_group_by_size(*args) def test_construct_group0(self): # one dtype & one limit capability @@ -160,14 +163,19 @@ def test_construct_group9(self): [300], [1, 0, 2, 3]) self.assertEqual([[1, 0], [3], [2]], res) - -class TestDataParallelGroupEager(TestDataParallelGroup): - def create_varbase(self, dtype, shape): + def test_construct_group_in_legacy_mode(self): with _test_eager_guard(): - return paddle.rand(shape=shape, dtype=dtype) - - def assign_group_by_size(self, *args): - return core.eager_assign_group_by_size(*args) + pass + self.test_construct_group0() + self.test_construct_group1() + self.test_construct_group2() + self.test_construct_group3() + self.test_construct_group4() + self.test_construct_group5() + self.test_construct_group6() + self.test_construct_group7() + self.test_construct_group8() + self.test_construct_group9() if __name__ == '__main__': diff --git a/python/paddle/fluid/tests/unittests/test_imperative_mnist.py b/python/paddle/fluid/tests/unittests/test_imperative_mnist.py index 06836ed85a321..f9bd5e4597121 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_mnist.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_mnist.py @@ -265,4 +265,5 @@ def test_mnist_float32(self): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_imperative_resnet.py b/python/paddle/fluid/tests/unittests/test_imperative_resnet.py index 3a643c5316cdc..e48e75c661fd1 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_resnet.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_resnet.py @@ -164,7 +164,7 @@ def forward(self, inputs): class ResNet(fluid.Layer): - def __init__(self, layers=50, class_dim=102, use_cudnn=False): + def __init__(self, layers=50, class_dim=102, use_cudnn=True): super(ResNet, self).__init__() self.layers = layers @@ -438,4 +438,5 @@ def test_resnet_float32(self): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_imperative_star_gan_with_gradient_penalty.py b/python/paddle/fluid/tests/unittests/test_imperative_star_gan_with_gradient_penalty.py index 2b8e10d779256..be81c15677a3a 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_star_gan_with_gradient_penalty.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_star_gan_with_gradient_penalty.py @@ -17,6 +17,7 @@ import numpy as np import unittest from paddle import _C_ops +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph, _in_eager_without_dygraph_check if fluid.is_compiled_with_cuda(): fluid.core.globals()['FLAGS_cudnn_deterministic'] = True @@ -583,7 +584,7 @@ def run(self, image_real, label_org, label_trg): class TestStarGANWithGradientPenalty(unittest.TestCase): - def test_main(self): + def func_main(self): 
self.place_test(fluid.CPUPlace()) if fluid.is_compiled_with_cuda(): @@ -615,6 +616,10 @@ def place_test(self, place): self.assertEqual(g_loss_s, g_loss_d) self.assertEqual(d_loss_s, d_loss_d) + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_main() + if __name__ == '__main__': paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/test_imperative_trace_non_persistable_inputs.py b/python/paddle/fluid/tests/unittests/test_imperative_trace_non_persistable_inputs.py index 645a05e75f6fb..a621105f5084c 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_trace_non_persistable_inputs.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_trace_non_persistable_inputs.py @@ -33,6 +33,8 @@ def forward(self, x): class TestTracedLayerRecordNonPersistableInput(unittest.TestCase): def test_main(self): + if fluid.framework.in_dygraph_mode(): + return traced_layer = None with fluid.dygraph.guard(): feature_size = 3 diff --git a/python/paddle/fluid/tests/unittests/test_imperative_transformer_sorted_gradient.py b/python/paddle/fluid/tests/unittests/test_imperative_transformer_sorted_gradient.py index 010c8aeccacd6..531c89fb19ec6 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_transformer_sorted_gradient.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_transformer_sorted_gradient.py @@ -21,7 +21,7 @@ from paddle.fluid.dygraph import to_variable, guard from paddle.fluid.dygraph import TracedLayer from test_imperative_base import new_program_scope -from paddle.fluid.framework import _test_eager_guard +from paddle.fluid.framework import _test_eager_guard, in_dygraph_mode, _in_legacy_dygraph from paddle.fluid import core import numpy as np import six @@ -1041,8 +1041,9 @@ def run_dygraph(): with guard(): fluid.set_flags({'FLAGS_sort_sum_gradient': True}) - dy_avg_cost_value, dy_sum_cost_value, dy_predict_value, dy_token_num_value, \ - dy_param_init, dy_param_updated = run_dygraph() + if _in_legacy_dygraph(): + dy_avg_cost_value, dy_sum_cost_value, dy_predict_value, dy_token_num_value, \ + dy_param_init, dy_param_updated = run_dygraph() with new_program_scope(): paddle.seed(seed) @@ -1116,21 +1117,22 @@ def run_dygraph(): for k in range(4, len(out)): static_param_updated[static_param_name_list[k - 4]] = out[k] - - self.assertTrue( - np.array_equal(static_avg_cost_value, dy_avg_cost_value)) - self.assertTrue( - np.array_equal(static_sum_cost_value, dy_sum_cost_value)) - self.assertTrue(np.array_equal(static_predict_value, dy_predict_value)) - self.assertTrue( - np.array_equal(static_token_num_value, dy_token_num_value)) - - for key, value in six.iteritems(static_param_init): - self.assertTrue(np.array_equal(value, dy_param_init[key])) - for key, value in six.iteritems(static_param_updated): - self.assertTrue(np.array_equal(value, dy_param_updated[key])) - - # check eager result + if _in_legacy_dygraph(): + self.assertTrue( + np.array_equal(static_avg_cost_value, dy_avg_cost_value)) + self.assertTrue( + np.array_equal(static_sum_cost_value, dy_sum_cost_value)) + self.assertTrue( + np.array_equal(static_predict_value, dy_predict_value)) + self.assertTrue( + np.array_equal(static_token_num_value, dy_token_num_value)) + + for key, value in six.iteritems(static_param_init): + self.assertTrue(np.array_equal(value, dy_param_init[key])) + for key, value in six.iteritems(static_param_updated): + self.assertTrue(np.array_equal(value, dy_param_updated[key])) + + # compare eager result with imperative result with guard(): 
fluid.set_flags({'FLAGS_sort_sum_gradient': False}) dy_avg_cost_value, dy_sum_cost_value, dy_predict_value, dy_token_num_value, \ diff --git a/python/paddle/fluid/tests/unittests/test_imperative_triple_grad.py b/python/paddle/fluid/tests/unittests/test_imperative_triple_grad.py index 3644eead6bc65..f0c5316412f1e 100644 --- a/python/paddle/fluid/tests/unittests/test_imperative_triple_grad.py +++ b/python/paddle/fluid/tests/unittests/test_imperative_triple_grad.py @@ -19,6 +19,7 @@ import unittest from unittest import TestCase import numpy as np +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph, _in_eager_without_dygraph_check def _dygraph_guard_(func): @@ -41,6 +42,68 @@ def random_var(size, low=-1, high=1, dtype='float32'): return fluid.dygraph.to_variable(x_np) +class TestDygraphTripleGradMatmul(TestCase): + def test_matmul_triple_grad(self): + input_numpy = np.ones([3, 3]) * 2 + with _test_eager_guard(): + x = paddle.to_tensor( + input_numpy, stop_gradient=False, dtype='float32') + y = paddle.to_tensor( + input_numpy, stop_gradient=False, dtype='float32') + out = paddle.matmul(x, y, False, False) + + new_out_g = paddle.to_tensor( + np.ones([3, 3]), stop_gradient=False, dtype='float32') + new_x_g, new_y_g = paddle.grad( + [out], [x, y], [new_out_g], + retain_graph=True, + create_graph=True) + + new_x_g_g = paddle.to_tensor( + np.ones([3, 3]), stop_gradient=False, dtype='float32') + new_y_g_g = paddle.to_tensor( + np.ones([3, 3]), stop_gradient=False, dtype='float32') + new_a, new_b, new_c = paddle.grad( + [new_x_g, new_y_g], [x, y, new_out_g], [new_x_g_g, new_y_g_g], + retain_graph=True, + create_graph=True) + + new_a.backward() + + out_ref = np.ones([3, 3]) * 12.0 + self.assertTrue(np.array_equal(out.numpy(), out_ref)) + + new_x_g_ref = np.ones([3, 3]) * 6.0 + new_y_g_ref = np.ones([3, 3]) * 6.0 + self.assertTrue(np.array_equal(new_x_g.numpy(), new_x_g_ref)) + self.assertTrue(np.array_equal(new_y_g.numpy(), new_y_g_ref)) + + new_a_ref = np.ones([3, 3]) * 3.0 + new_b_ref = np.ones([3, 3]) * 3.0 + new_c_ref = np.ones([3, 3]) * 12.0 + + self.assertTrue(np.array_equal(new_a.numpy(), new_a_ref)) + self.assertTrue(np.array_equal(new_b.numpy(), new_b_ref)) + self.assertTrue(np.array_equal(new_c.numpy(), new_c_ref)) + + x_grad_ref = np.ones([3, 3]) * 0.0 + self.assertTrue(np.array_equal(x.grad.numpy(), x_grad_ref)) + + y_grad_ref = np.ones([3, 3]) * 0.0 + self.assertTrue(np.array_equal(y.grad.numpy(), y_grad_ref)) + + new_out_g_ref = np.ones([3, 3]) * 3.0 + self.assertTrue( + np.array_equal(new_out_g.grad.numpy(), new_out_g_ref)) + + new_x_g_g_ref = np.ones([3, 3]) * 0.0 + new_y_g_g_ref = np.ones([3, 3]) * 3.0 + self.assertTrue( + np.array_equal(new_x_g_g.grad.numpy(), new_x_g_g_ref)) + self.assertTrue( + np.array_equal(new_y_g_g.grad.numpy(), new_y_g_g_ref)) + + class TestDygraphTripleGrad(TestCase): def setUp(self): self.sort_sum_gradient = False @@ -65,7 +128,7 @@ def grad(self, allow_unused=allow_unused) @dygraph_guard - def test_exception(self): + def func_exception(self): with self.assertRaises(AssertionError): self.grad(None, None) @@ -95,7 +158,7 @@ def test_exception(self): self.grad([random_var(shape)], [random_var(shape)], no_grad_vars=1) @dygraph_guard - def test_example_with_gradient_and_create_graph(self): + def func_example_with_gradient_and_create_graph(self): x = random_var(self.shape) x_np = x.numpy() x.stop_gradient = False @@ -145,6 +208,11 @@ def test_example_with_gradient_and_create_graph(self): dddx_grad_actual = x.gradient() 
self.assertTrue(np.allclose(dddx_grad_actual, dddx_expected)) + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_exception() + self.func_example_with_gradient_and_create_graph() + class TestDygraphTripleGradBradcastCase(TestCase): def setUp(self): @@ -172,7 +240,7 @@ def grad(self, allow_unused=allow_unused) @dygraph_guard - def test_example_with_gradient_and_create_graph(self): + def func_example_with_gradient_and_create_graph(self): x = random_var(self.x_shape) x_np = x.numpy() x.stop_gradient = False @@ -227,6 +295,10 @@ def test_example_with_gradient_and_create_graph(self): dddx_grad_actual = x.gradient() self.assertTrue(np.allclose(dddx_grad_actual, dddx_expected)) + def test_all_cases(self): + if _in_legacy_dygraph(): + self.func_example_with_gradient_and_create_graph() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_index_sample_op.py b/python/paddle/fluid/tests/unittests/test_index_sample_op.py index e2ccb153f4063..4da03c9643fa9 100644 --- a/python/paddle/fluid/tests/unittests/test_index_sample_op.py +++ b/python/paddle/fluid/tests/unittests/test_index_sample_op.py @@ -40,10 +40,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out', check_eager=False) + self.check_grad(['X'], 'Out', check_eager=True) def config(self): """ diff --git a/python/paddle/fluid/tests/unittests/test_index_select_op.py b/python/paddle/fluid/tests/unittests/test_index_select_op.py index f4545d406901c..0c0e946fddede 100644 --- a/python/paddle/fluid/tests/unittests/test_index_select_op.py +++ b/python/paddle/fluid/tests/unittests/test_index_select_op.py @@ -25,6 +25,7 @@ class TestIndexSelectOp(OpTest): def setUp(self): + self.python_api = paddle.index_select self.op_type = "index_select" self.init_dtype_type() index_np = np.random.randint( @@ -54,10 +55,10 @@ def init_dtype_type(self): self.index_size = 100 def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestIndexSelectOpCase2(TestIndexSelectOp): diff --git a/python/paddle/fluid/tests/unittests/test_initializer.py b/python/paddle/fluid/tests/unittests/test_initializer.py index 8dc822c69b2c5..91c2800836c9d 100644 --- a/python/paddle/fluid/tests/unittests/test_initializer.py +++ b/python/paddle/fluid/tests/unittests/test_initializer.py @@ -655,7 +655,7 @@ def test_set_global_bias_initilizer(self): class TestUniformInitializerDygraph(unittest.TestCase): - def test_uniform_initializer(self, dtype="float32"): + def func_uniform_initializer(self, dtype="float32"): """ In dygraph mode, we can use initializer directly to initialize a tensor. 
""" @@ -679,9 +679,14 @@ def test_uniform_initializer(self, dtype="float32"): paddle.enable_static() + def test_uniform_initializer(self, dtype="float32"): + with framework._test_eager_guard(): + self.func_uniform_initializer() + self.func_uniform_initializer() + class TesetconsistencyOfDynamicAndStaticGraph(unittest.TestCase): - def test_order(self): + def func_order(self): paddle.set_device('cpu') SEED = 123 weight_attr = paddle.framework.ParamAttr( @@ -723,6 +728,11 @@ def run_static_graph(): self.assertTrue(np.array_equal(dynamic_res[0], static_res[0])) self.assertTrue(np.array_equal(dynamic_res[1], static_res[1])) + def test_order(self): + with framework._test_eager_guard(): + self.func_order() + self.func_order() + # 2-D Parameter with shape: [10, 15] class TestOrthogonalInitializer1(unittest.TestCase): @@ -742,7 +752,7 @@ def check_result(self, a, b): self.assertTrue(np.array_equal(a, b)) self.assertTrue(np.allclose(np.matmul(a, a.T), 9 * np.eye(10))) - def test_orthogonal(self): + def func_orthogonal(self): self.config() paddle.set_default_dtype(self.dtype) @@ -777,6 +787,11 @@ def test_orthogonal(self): self.check_result(res_dygraph, res_static) + def test_orthogonal(self): + with framework._test_eager_guard(): + self.func_orthogonal() + self.func_orthogonal() + # 2-D Parameter with shape: [15, 10] class TestOrthogonalInitializer2(TestOrthogonalInitializer1): @@ -841,7 +856,7 @@ def check_result(self, a, b): a = a.reshape(6, -1) self.assertTrue(np.allclose(np.matmul(a, a.T), 9 * np.eye(6))) - def test_orthogonal(self): + def func_orthogonal(self): self.config() paddle.set_default_dtype(self.dtype) @@ -869,6 +884,11 @@ def test_orthogonal(self): fetch_list=[conv2d.weight])[0] self.check_result(res_dygraph, res_static) + def test_orthogonal(self): + with framework._test_eager_guard(): + self.func_orthogonal() + self.func_orthogonal() + # 4-D Parameter with shape: [50, 4, 3, 3] class TestOrthogonalInitializer5(TestOrthogonalInitializer4): @@ -928,7 +948,7 @@ def check_result(self, w_dygraph, w_static, conv_in, conv_out): self.assertTrue(np.array_equal(w_dygraph, w_static)) self.assertTrue(np.array_equal(conv_out, conv_in[:, 0:2, 1:9])) - def test_dirac(self): + def func_dirac(self): self.config() paddle.set_default_dtype(self.dtype) @@ -971,6 +991,11 @@ def test_dirac(self): self.check_result(weight_dygraph, weight_static, conv_input, conv_output) + def test_dirac(self): + with framework._test_eager_guard(): + self.func_dirac() + self.func_dirac() + # initialize Conv2D weight class TestDiracInitializer2(TestDiracInitializer1): diff --git a/python/paddle/fluid/tests/unittests/test_inplace.py b/python/paddle/fluid/tests/unittests/test_inplace.py index 617e9811d630f..c54d3f02d43f0 100644 --- a/python/paddle/fluid/tests/unittests/test_inplace.py +++ b/python/paddle/fluid/tests/unittests/test_inplace.py @@ -31,11 +31,7 @@ def func_test_forward_version(self): var[0] = 1.1 self.assertEqual(var.inplace_version, 1) - # TODO1: assign don't support inplace in temporary - if in_dygraph_mode(): - var[0] = 2 - else: - paddle.assign(paddle.ones(shape=[3]), var) + paddle.assign(paddle.ones(shape=[3]), var) # NOTE(liym27): assign(input, output) is an inplace operation for output. # There is inplace-related processing for api assign, var.inplace_version should be 2 not 1. 
@@ -65,18 +61,11 @@ def func_test_backward_error(self): var_d = var_b**2 loss = paddle.nn.functional.relu(var_c + var_d) - if in_dygraph_mode(): - with self.assertRaisesRegexp( - RuntimeError, - "received current_inplace_version:{} != inplace_version_snapshot_:{}". - format(1, 0)): - loss.backward() - else: - with self.assertRaisesRegexp( - RuntimeError, - "received tensor_version:{} != wrapper_version_snapshot:{}". - format(1, 0)): - loss.backward() + with self.assertRaisesRegexp( + RuntimeError, + "received tensor_version:{} != wrapper_version_snapshot:{}". + format(1, 0)): + loss.backward() def test_backward_error(self): with _test_eager_guard(): @@ -122,9 +111,8 @@ def func_test_backward_success_2(self): loss.backward() def test_backward_success_2(self): - # TODO2: need to process no_need_buffer in eager mode - # with _test_eager_guard(): - # self.func_test_backward_success_2() + with _test_eager_guard(): + self.func_test_backward_success_2() self.func_test_backward_success_2() @@ -207,18 +195,11 @@ def func_test_backward_error(self): self.inplace_api_processing(var_b) loss = paddle.nn.functional.relu(var_c) - if in_dygraph_mode(): - with self.assertRaisesRegexp( - RuntimeError, - "received current_inplace_version:{} != inplace_version_snapshot_:{}". - format(1, 0)): - loss.backward() - else: - with self.assertRaisesRegexp( - RuntimeError, - "received tensor_version:{} != wrapper_version_snapshot:{}". - format(1, 0)): - loss.backward() + with self.assertRaisesRegexp( + RuntimeError, + "received tensor_version:{} != wrapper_version_snapshot:{}". + format(1, 0)): + loss.backward() def test_backward_error(self): with _test_eager_guard(): diff --git a/python/paddle/fluid/tests/unittests/test_isclose_op.py b/python/paddle/fluid/tests/unittests/test_isclose_op.py index 2bb58d7c5741f..245520e5ab666 100644 --- a/python/paddle/fluid/tests/unittests/test_isclose_op.py +++ b/python/paddle/fluid/tests/unittests/test_isclose_op.py @@ -30,6 +30,7 @@ def setUp(self): paddle.enable_static() self.set_args() self.op_type = "isclose" + self.python_api = paddle.isclose self.inputs = { 'Input': self.input, 'Other': self.other, @@ -49,7 +50,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestIscloseOpException(TestIscloseOp): @@ -57,28 +58,28 @@ def test_check_output(self): def test_rtol_num(): self.inputs['Rtol'] = np.array([1e-05, 1e-05]).astype("float64") self.inputs['Atol'] = np.array([1e-08]).astype("float64") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_rtol_num) def test_rtol_type(): self.inputs['Rtol'] = np.array([5]).astype("int32") self.inputs['Atol'] = np.array([1e-08]).astype("float64") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_rtol_type) def test_atol_num(): self.inputs['Rtol'] = np.array([1e-05]).astype("float64") self.inputs['Atol'] = np.array([1e-08, 1e-08]).astype("float64") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_atol_num) def test_atol_type(): self.inputs['Rtol'] = np.array([1e-05]).astype("float64") self.inputs['Atol'] = np.array([8]).astype("int32") - self.check_output() + self.check_output(check_eager=True) self.assertRaises(ValueError, test_atol_type) @@ -211,7 +212,7 @@ def set_args(self): self.equal_nan = False def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestIscloseOpLargeDimInput(TestIscloseOp): diff --git 
a/python/paddle/fluid/tests/unittests/test_isfinite_v2_op.py b/python/paddle/fluid/tests/unittests/test_isfinite_v2_op.py index 0d4d3b58e862c..c861f912803f9 100644 --- a/python/paddle/fluid/tests/unittests/test_isfinite_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_isfinite_v2_op.py @@ -16,6 +16,7 @@ import paddle.fluid as fluid import unittest import numpy as np +from paddle.fluid.framework import _test_eager_guard def run_static(x_np, dtype, op_str, use_gpu=False): @@ -46,6 +47,18 @@ def run_dygraph(x_np, op_str, use_gpu=True): return dygraph_result +def run_eager(x_np, op_str, use_gpu=True): + with paddle.fluid.dygraph.guard(): + with _test_eager_guard(): + place = paddle.CPUPlace() + if use_gpu and fluid.core.is_compiled_with_cuda(): + place = paddle.CUDAPlace(0) + + x = paddle.to_tensor(x_np) + dygraph_result = getattr(paddle.tensor, op_str)(x) + return dygraph_result + + def np_data_generator(low, high, np_shape, type, sv_list, op_str, *args, **kwargs): x_np = np.random.uniform(low, high, np_shape).astype(getattr(np, type)) @@ -107,8 +120,10 @@ def test(test_case, op_str, use_gpu=False): x_np, result_np = np_data_generator(**meta_data) static_result = run_static(x_np, meta_data['type'], op_str, use_gpu) dygraph_result = run_dygraph(x_np, op_str, use_gpu) + eager_result = run_eager(x_np, op_str, use_gpu) test_case.assertTrue((static_result == result_np).all()) test_case.assertTrue((dygraph_result.numpy() == result_np).all()) + test_case.assertTrue((eager_result.numpy() == result_np).all()) class TestCPUNormal(unittest.TestCase): @@ -158,4 +173,5 @@ def test_isfinite_bad_x(): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_jit_save_load.py b/python/paddle/fluid/tests/unittests/test_jit_save_load.py index fc58f979b4dc2..5dabf854734dd 100644 --- a/python/paddle/fluid/tests/unittests/test_jit_save_load.py +++ b/python/paddle/fluid/tests/unittests/test_jit_save_load.py @@ -1454,4 +1454,5 @@ def test_jit_save_incompatible_input_sepc(self): if __name__ == '__main__': - unittest.main() + with fluid.framework._test_eager_guard(): + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_kldiv_loss_op.py b/python/paddle/fluid/tests/unittests/test_kldiv_loss_op.py index a301748ed7bbb..aa94cf2d35cc7 100644 --- a/python/paddle/fluid/tests/unittests/test_kldiv_loss_op.py +++ b/python/paddle/fluid/tests/unittests/test_kldiv_loss_op.py @@ -17,6 +17,7 @@ import unittest import numpy as np from op_test import OpTest +from paddle.nn.functional import kl_div def kldiv_loss(x, target, reduction): @@ -40,6 +41,7 @@ class TestKLDivLossOp(OpTest): def setUp(self): self.initTestCase() self.op_type = 'kldiv_loss' + self.python_api = kl_div x = np.random.uniform(-10, 10, self.x_shape).astype('float64') target = np.random.uniform(-10, 10, self.x_shape).astype('float64') @@ -53,10 +55,11 @@ def setUp(self): self.outputs = {'Loss': loss.astype('float64')} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Loss', no_grad_set=set(["Target"])) + self.check_grad( + ['X'], 'Loss', no_grad_set=set(["Target"]), check_eager=True) def initTestCase(self): self.x_shape = (4, 5, 5) diff --git a/python/paddle/fluid/tests/unittests/test_kron_op.py b/python/paddle/fluid/tests/unittests/test_kron_op.py index d6db4c2f074a9..f4d013b7c6a3e 100644 --- a/python/paddle/fluid/tests/unittests/test_kron_op.py +++ 
b/python/paddle/fluid/tests/unittests/test_kron_op.py @@ -21,11 +21,13 @@ import paddle import paddle.fluid as fluid import paddle.fluid.dygraph as dg +from paddle.fluid.framework import _test_eager_guard class TestKronOp(OpTest): def setUp(self): self.op_type = "kron" + self.python_api = paddle.kron self.dtype = self._init_dtype() x = np.random.uniform(size=(10, 10)).astype(self.dtype) y = np.random.uniform(size=(10, 10)).astype(self.dtype) @@ -37,21 +39,22 @@ def _init_dtype(self): return "float64" def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) def test_check_grad_ignore_x(self): - self.check_grad(['Y'], 'Out', no_grad_set=set('X')) + self.check_grad(['Y'], 'Out', no_grad_set=set('X'), check_eager=True) def test_check_grad_ignore_y(self): - self.check_grad(['X'], 'Out', no_grad_set=set('Y')) + self.check_grad(['X'], 'Out', no_grad_set=set('Y'), check_eager=True) class TestKronOp2(TestKronOp): def setUp(self): self.op_type = "kron" + self.python_api = paddle.kron self.dtype = self._init_dtype() x = np.random.uniform(size=(5, 5, 4)).astype(self.dtype) y = np.random.uniform(size=(10, 10)).astype(self.dtype) @@ -63,6 +66,7 @@ def setUp(self): class TestKronOp3(TestKronOp): def setUp(self): self.op_type = "kron" + self.python_api = paddle.kron self.dtype = self._init_dtype() x = np.random.uniform(size=(10, 10)).astype(self.dtype) y = np.random.uniform(size=(5, 5, 4)).astype(self.dtype) @@ -101,10 +105,16 @@ def test_case_with_output(self): c, = exe.run(main, feed={'a': a, 'b': b}, fetch_list=[out_var]) np.testing.assert_allclose(c, np.kron(a, b)) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_case() + self.test_case_with_output() + class TestComplexKronOp(OpTest): def setUp(self): self.op_type = "kron" + self.python_api = paddle.kron self.x_shape = np.array([10, 10]) self.y_shape = np.array([3, 35]) self.out_shape = self.x_shape * self.y_shape @@ -160,14 +170,15 @@ def get_grad_y_by_numpy(self): return grad_y def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): self.check_grad( ['X', 'Y'], 'Out', user_defined_grads=[self.grad_x, self.grad_y], - user_defined_grad_outputs=[self.grad_out]) + user_defined_grad_outputs=[self.grad_out], + check_eager=True) def test_check_grad_ingore_x(self): self.check_grad( @@ -175,7 +186,8 @@ def test_check_grad_ingore_x(self): 'Out', no_grad_set=set("X"), user_defined_grads=[self.grad_y], - user_defined_grad_outputs=[self.grad_out]) + user_defined_grad_outputs=[self.grad_out], + check_eager=True) def test_check_grad_ingore_y(self): self.check_grad( @@ -183,7 +195,8 @@ def test_check_grad_ingore_y(self): 'Out', no_grad_set=set('Y'), user_defined_grads=[self.grad_x], - user_defined_grad_outputs=[self.grad_out]) + user_defined_grad_outputs=[self.grad_out], + check_eager=True) class TestKronOpTypePromotion(TestComplexKronOp): diff --git a/python/paddle/fluid/tests/unittests/test_kthvalue_op.py b/python/paddle/fluid/tests/unittests/test_kthvalue_op.py index 68dd58835c56c..e1b1422580983 100644 --- a/python/paddle/fluid/tests/unittests/test_kthvalue_op.py +++ b/python/paddle/fluid/tests/unittests/test_kthvalue_op.py @@ -41,6 +41,7 @@ def init_args(self): def setUp(self): self.op_type = "kthvalue" + self.python_api = paddle.kthvalue self.dtype = np.float64 self.input_data = np.random.random((2, 1, 2, 4, 10)) 
self.init_args() @@ -52,11 +53,11 @@ def setUp(self): def test_check_output(self): paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): paddle.enable_static() - self.check_grad(set(['X']), 'Out') + self.check_grad(set(['X']), 'Out', check_eager=True) class TestKthvalueOpWithKeepdim(OpTest): @@ -67,6 +68,7 @@ def init_args(self): def setUp(self): self.init_args() self.op_type = "kthvalue" + self.python_api = paddle.kthvalue self.dtype = np.float64 self.input_data = np.random.random((1, 3, 2, 4, 10)) self.inputs = {'X': self.input_data} @@ -77,11 +79,11 @@ def setUp(self): def test_check_output(self): paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): paddle.enable_static() - self.check_grad(set(['X']), 'Out') + self.check_grad(set(['X']), 'Out', check_eager=True) class TestKthvalueOpKernels(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_label_smooth_op.py b/python/paddle/fluid/tests/unittests/test_label_smooth_op.py index 8ff6bb49675be..b1d49f8604ec7 100644 --- a/python/paddle/fluid/tests/unittests/test_label_smooth_op.py +++ b/python/paddle/fluid/tests/unittests/test_label_smooth_op.py @@ -39,10 +39,10 @@ def setUp(self): self.outputs = {'Out': smoothed_label} def test_check_output(self): - self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out", check_eager=False) + self.check_grad(["X"], "Out", check_eager=True) class TestLabelSmoothOpWithPriorDist(TestLabelSmoothOp): diff --git a/python/paddle/fluid/tests/unittests/test_layers.py b/python/paddle/fluid/tests/unittests/test_layers.py index bb244a20bd873..6c5864cfebc93 100644 --- a/python/paddle/fluid/tests/unittests/test_layers.py +++ b/python/paddle/fluid/tests/unittests/test_layers.py @@ -2819,7 +2819,7 @@ def setUp(self): }) self.all_close_compare = set({"make_spectral_norm"}) - def test_all_layers(self): + def func_all_layers(self): attrs = (getattr(self, name) for name in dir(self)) methods = filter(inspect.ismethod, attrs) for method in methods: @@ -2867,6 +2867,11 @@ def test_all_layers(self): np.array_equal(static_result[0], dy_result_value), "Result of function [{}] not equal".format(method.__name__)) + def test_all_layers(self): + with _test_eager_guard(): + self.func_all_layers() + self.func_all_layers() + def _get_np_data(self, shape, dtype, append_batch_size=True): np.random.seed(self.seed) if append_batch_size: @@ -3656,8 +3661,9 @@ def make_scale_variable(self): shape=[1], dtype='float32', append_batch_size=False) - - out = layers.scale(input, scale=scale_var) + _scale = scale_var.numpy().item(0) if isinstance( + scale_var, core.eager.Tensor) else scale_var + out = layers.scale(input, scale=_scale) return out def make_softshrink(self): diff --git a/python/paddle/fluid/tests/unittests/test_lerp_op.py b/python/paddle/fluid/tests/unittests/test_lerp_op.py index 0f740444123cb..10ab2610a26e4 100644 --- a/python/paddle/fluid/tests/unittests/test_lerp_op.py +++ b/python/paddle/fluid/tests/unittests/test_lerp_op.py @@ -27,6 +27,7 @@ class TestLerp(OpTest): def setUp(self): self.op_type = "lerp" + self.python_api = paddle.lerp self.init_dtype() self.init_shape() x = np.arange(1., 101.).astype(self.dtype).reshape(self.shape) @@ -42,10 +43,10 @@ def init_shape(self): self.shape = [100] def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - 
self.check_grad(['X', 'Y'], 'Out') + self.check_grad(['X', 'Y'], 'Out', check_eager=True) class TestLerpWithDim2(TestLerp): diff --git a/python/paddle/fluid/tests/unittests/test_lgamma_op.py b/python/paddle/fluid/tests/unittests/test_lgamma_op.py index 686d5b1eb6dfe..8e9edab55baf8 100644 --- a/python/paddle/fluid/tests/unittests/test_lgamma_op.py +++ b/python/paddle/fluid/tests/unittests/test_lgamma_op.py @@ -24,6 +24,7 @@ class TestLgammaOp(OpTest): def setUp(self): self.op_type = 'lgamma' + self.python_api = paddle.lgamma self.init_dtype_type() shape = (5, 20) data = np.random.random(shape).astype(self.dtype) + 1 @@ -38,10 +39,10 @@ def init_dtype_type(self): self.dtype = np.float64 def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out', numeric_grad_delta=1e-7) + self.check_grad(['X'], 'Out', numeric_grad_delta=1e-7, check_eager=True) class TestLgammaOpFp32(TestLgammaOp): @@ -49,7 +50,8 @@ def init_dtype_type(self): self.dtype = np.float32 def test_check_grad_normal(self): - self.check_grad(['X'], 'Out', numeric_grad_delta=0.005) + self.check_grad( + ['X'], 'Out', numeric_grad_delta=0.005, check_eager=True) if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/test_linear.py b/python/paddle/fluid/tests/unittests/test_linear.py index 9d07a80da15db..6b00a86e3e900 100644 --- a/python/paddle/fluid/tests/unittests/test_linear.py +++ b/python/paddle/fluid/tests/unittests/test_linear.py @@ -73,6 +73,22 @@ def test_error(self, place=paddle.CPUPlace()): np.testing.assert_array_almost_equal(res_f, res_nn) np.testing.assert_array_almost_equal(res_nn, res_np) + def test_weight_init(self): + if not paddle.is_compiled_with_cuda(): + return + paddle.seed(100) + linear = paddle.nn.Linear( + 2, 3, weight_attr=paddle.nn.initializer.Normal(0, 1.)) + paddle.nn.utils._stride_column(linear.weight) + expect = [[1.4349908, -0.8099171, -2.64788], + [-1.4981681, -1.1784115, -0.023253186]] + self.assertTrue(np.allclose(linear.weight.numpy(), expect)) + + linear = paddle.nn.Linear(2, 3) + expect = [[0.73261100, 0.43836895, 0.07908206], + [0.85075015, -1.04724526, 0.64371765]] + self.assertTrue(np.allclose(linear.weight.numpy(), expect)) + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_linspace.py b/python/paddle/fluid/tests/unittests/test_linspace.py index 54846e6a14bd2..65a6c21fb0720 100644 --- a/python/paddle/fluid/tests/unittests/test_linspace.py +++ b/python/paddle/fluid/tests/unittests/test_linspace.py @@ -21,11 +21,13 @@ import paddle.fluid as fluid from paddle.fluid import compiler, Program, program_guard from paddle.fluid import core +from paddle.fluid.framework import _test_eager_guard class TestLinspaceOpCommonCase(OpTest): def setUp(self): self.op_type = "linspace" + self.python_api = paddle.linspace dtype = 'float32' self.inputs = { 'Start': np.array([0]).astype(dtype), @@ -37,12 +39,13 @@ def setUp(self): self.outputs = {'Out': np.arange(0, 11).astype(dtype)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestLinspaceOpReverseCase(OpTest): def setUp(self): self.op_type = "linspace" + self.python_api = paddle.linspace dtype = 'float32' self.inputs = { 'Start': np.array([10]).astype(dtype), @@ -54,12 +57,13 @@ def setUp(self): self.outputs = {'Out': np.arange(10, -1, -1).astype(dtype)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class 
TestLinspaceOpNumOneCase(OpTest): def setUp(self): self.op_type = "linspace" + self.python_api = paddle.linspace dtype = 'float32' self.inputs = { 'Start': np.array([10]).astype(dtype), @@ -71,7 +75,7 @@ def setUp(self): self.outputs = {'Out': np.array(10, dtype=dtype)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestLinspaceAPI(unittest.TestCase): @@ -123,6 +127,11 @@ def test_imperative(self): self.assertEqual((out2.numpy() == np_out2).all(), True) self.assertEqual((out3.numpy() == np_out3).all(), True) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_variable_input2() + self.test_imperative() + class TestLinspaceOpError(unittest.TestCase): def test_errors(self): diff --git a/python/paddle/fluid/tests/unittests/test_log_softmax.py b/python/paddle/fluid/tests/unittests/test_log_softmax.py index 423eeaf3ada45..b3b164725fc34 100644 --- a/python/paddle/fluid/tests/unittests/test_log_softmax.py +++ b/python/paddle/fluid/tests/unittests/test_log_softmax.py @@ -42,6 +42,7 @@ def ref_log_softmax_grad(x, axis): class TestLogSoftmaxOp(OpTest): def setUp(self): self.op_type = 'log_softmax' + self.python_api = F.log_softmax self.dtype = 'float64' self.shape = [2, 3, 4, 5] self.axis = -1 @@ -59,10 +60,11 @@ def set_attrs(self): pass def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], ['Out'], user_defined_grads=[self.x_grad]) + self.check_grad( + ['X'], ['Out'], user_defined_grads=[self.x_grad], check_eager=True) class TestLogSoftmaxShape(TestLogSoftmaxOp): @@ -80,6 +82,7 @@ def set_attrs(self): class TestLogSoftmaxBF16Op(OpTest): def setUp(self): self.op_type = 'log_softmax' + self.python_api = F.log_softmax self.dtype = np.uint16 self.shape = [2, 3, 4, 5] self.axis = -1 @@ -94,12 +97,14 @@ def setUp(self): def test_check_output(self): place = core.CUDAPlace(0) - self.check_output_with_place(place) + self.check_output_with_place(place, check_eager=True) def test_check_grad(self): place = core.CUDAPlace(0) self.check_grad_with_place( - place, ['X'], ['Out'], user_defined_grads=[self.x_grad]) + place, ['X'], ['Out'], + user_defined_grads=[self.x_grad], + check_eager=True) class TestNNLogSoftmaxAPI(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_logical_op.py b/python/paddle/fluid/tests/unittests/test_logical_op.py index 174f3bc665ea1..91d339940d114 100755 --- a/python/paddle/fluid/tests/unittests/test_logical_op.py +++ b/python/paddle/fluid/tests/unittests/test_logical_op.py @@ -20,6 +20,7 @@ import paddle import paddle.fluid as fluid from paddle.static import Program, program_guard +from paddle.fluid.framework import _test_eager_guard SUPPORTED_DTYPES = [ bool, np.int8, np.int16, np.int32, np.int64, np.float32, np.float64 @@ -144,6 +145,22 @@ def run_dygraph(x_np, y_np, op_str, use_gpu=False, binary_op=True): return dygraph_result +def run_eager(x_np, y_np, op_str, use_gpu=False, binary_op=True): + place = paddle.CPUPlace() + if use_gpu and fluid.core.is_compiled_with_cuda(): + place = paddle.CUDAPlace(0) + paddle.disable_static(place) + with _test_eager_guard(): + op = getattr(paddle, op_str) + x = paddle.to_tensor(x_np, dtype=x_np.dtype) + if not binary_op: + dygraph_result = op(x) + else: + y = paddle.to_tensor(y_np, dtype=y_np.dtype) + dygraph_result = op(x, y) + return dygraph_result + + def np_data_generator(np_shape, dtype, *args, **kwargs): if dtype == bool: return np.random.choice(a=[True, False], 
size=np_shape).astype(bool) @@ -174,6 +191,7 @@ def test(unit_test, use_gpu=False, test_error=False): continue static_result = run_static(**meta_data) dygraph_result = run_dygraph(**meta_data) + eager_result = run_eager(**meta_data) if meta_data['binary_op']: np_result = np_op(meta_data['x_np'], meta_data['y_np']) else: @@ -181,6 +199,7 @@ def test(unit_test, use_gpu=False, test_error=False): unit_test.assertTrue((static_result == np_result).all()) unit_test.assertTrue((dygraph_result.numpy() == np_result).all( )) + unit_test.assertTrue((eager_result.numpy() == np_result).all()) def test_type_error(unit_test, use_gpu, type_str_map): @@ -259,4 +278,5 @@ def test_type_error(self): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_logit_op.py b/python/paddle/fluid/tests/unittests/test_logit_op.py index 9254996eb4463..9b46039da13b1 100644 --- a/python/paddle/fluid/tests/unittests/test_logit_op.py +++ b/python/paddle/fluid/tests/unittests/test_logit_op.py @@ -16,6 +16,7 @@ import numpy as np from op_test import OpTest import paddle +from paddle.fluid.framework import _test_eager_guard np.random.seed(10) @@ -37,6 +38,7 @@ def logit_grad(x, eps=1e-8): class TestLogitOp(OpTest): def setUp(self): self.op_type = 'logit' + self.python_api = paddle.logit self.dtype = np.float64 self.shape = [120] self.eps = 1e-8 @@ -52,10 +54,11 @@ def set_attrs(self): pass def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], ['Out'], user_defined_grads=[self.x_grad]) + self.check_grad( + ['X'], ['Out'], user_defined_grads=[self.x_grad], check_eager=True) class TestLogitShape(TestLogitOp): @@ -106,6 +109,11 @@ def test_errors(self): x = paddle.fluid.data(name='X2', shape=[100], dtype='float32') self.assertRaises(TypeError, paddle.logit, x, dtype='int32') + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_check_api() + self.test_errors() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_logsumexp.py b/python/paddle/fluid/tests/unittests/test_logsumexp.py index 31c68b88b86a7..91eb65ef284a5 100644 --- a/python/paddle/fluid/tests/unittests/test_logsumexp.py +++ b/python/paddle/fluid/tests/unittests/test_logsumexp.py @@ -29,9 +29,16 @@ def ref_logsumexp(x, axis=None, keepdim=False, reduce_all=False): return out +def logsumexp_wrapper(x, axis=None, keepdim=False, allreduce=False): + if allreduce: + return paddle.logsumexp(x, None, keepdim) + return paddle.logsumexp(x, axis, keepdim) + + class TestLogsumexp(OpTest): def setUp(self): self.op_type = 'logsumexp' + self.python_api = logsumexp_wrapper self.shape = [2, 3, 4, 5] self.dtype = 'float64' self.axis = [-1] @@ -61,13 +68,14 @@ def set_attrs_addition(self): pass def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.check_grad( ['X'], ['Out'], user_defined_grads=self.user_defined_grads, - user_defined_grad_outputs=self.user_defined_grad_outputs) + user_defined_grad_outputs=self.user_defined_grad_outputs, + check_eager=True) def calc_grad(self): dy = np.ones(1, dtype=self.dtype) diff --git a/python/paddle/fluid/tests/unittests/test_lookahead.py b/python/paddle/fluid/tests/unittests/test_lookahead.py index a4b5e6d0d9576..263310043a5f7 100644 --- a/python/paddle/fluid/tests/unittests/test_lookahead.py +++ b/python/paddle/fluid/tests/unittests/test_lookahead.py @@ -22,6 +22,7 @@ 
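The hunks above all apply the same three-step recipe to OpTest cases: record the operator's public Python entry point in self.python_api (going through a small shim such as logsumexp_wrapper when the operator attributes do not match the API signature), then pass check_eager=True to check_output/check_grad so the case is replayed through the eager dygraph path in addition to the static one. A condensed sketch of that shape follows; it is modeled on the masked_select hunk later in this section, OpTest and the check_eager flag come from the repo-local op_test helpers these files import, and the exact input/output keys are illustrative.

import numpy as np
import paddle
from op_test import OpTest  # repo-local test helper used by all of these files


class TestMaskedSelectEagerSketch(OpTest):
    # Illustrative only: shows the recipe, not a replacement for the real test.
    def setUp(self):
        self.op_type = "masked_select"          # C++ operator under test
        self.python_api = paddle.masked_select  # Python API replayed in eager mode
        x = np.random.random((50, 3)).astype("float64")
        mask = np.random.randint(2, size=(50, 3)).astype(bool)
        self.inputs = {"X": x, "Mask": mask}
        self.outputs = {"Y": x[mask]}           # NumPy reference result

    def test_check_output(self):
        # check_eager=True additionally exercises the eager dygraph execution path
        self.check_output(check_eager=True)

    def test_check_grad(self):
        self.check_grad(["X"], "Y", check_eager=True)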
import paddle.fluid as fluid import paddle import paddle.nn as nn +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph LOOKAHEAD_K = 5 LOOKAHEAD_ALPHA = 0.2 @@ -68,7 +69,7 @@ def test_lookahead_static(self): slow_param.all(), latest_b.all(), delta=5e-3) fast_param = latest_b - SGD_LR * b_grad - def test_look_ahead_dygraph(self): + def func_test_look_ahead_dygraph(self): BATCH_SIZE = 16 BATCH_NUM = 4 EPOCH_NUM = 4 @@ -142,6 +143,11 @@ def train(layer, loader, loss_fn, opt): train(layer, loader, loss_fn, lookahead) + def test_look_ahead_dygraph(self): + with _test_eager_guard(): + self.func_test_look_ahead_dygraph() + self.func_test_look_ahead_dygraph() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_masked_select_op.py b/python/paddle/fluid/tests/unittests/test_masked_select_op.py index ed1a981d0306b..764f4806ba4ba 100644 --- a/python/paddle/fluid/tests/unittests/test_masked_select_op.py +++ b/python/paddle/fluid/tests/unittests/test_masked_select_op.py @@ -33,6 +33,7 @@ class TestMaskedSelectOp(OpTest): def setUp(self): self.init() self.op_type = "masked_select" + self.python_api = paddle.masked_select x = np.random.random(self.shape).astype("float64") mask = np.array(np.random.randint(2, size=self.shape, dtype=bool)) out = np_masked_select(x, mask) @@ -40,10 +41,10 @@ def setUp(self): self.outputs = {'Y': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Y') + self.check_grad(['X'], 'Y', check_eager=True) def init(self): self.shape = (50, 3) @@ -121,4 +122,5 @@ def test_mask_dtype(): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_math_op_patch_var_base.py b/python/paddle/fluid/tests/unittests/test_math_op_patch_var_base.py index 44876c9bd5773..48aa530ff87f9 100644 --- a/python/paddle/fluid/tests/unittests/test_math_op_patch_var_base.py +++ b/python/paddle/fluid/tests/unittests/test_math_op_patch_var_base.py @@ -19,6 +19,7 @@ import paddle.fluid as fluid import numpy as np import inspect +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestMathOpPatchesVarBase(unittest.TestCase): @@ -26,7 +27,7 @@ def setUp(self): self.shape = [10, 1024] self.dtype = np.float32 - def test_add(self): + def func_test_add(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -35,7 +36,12 @@ def test_add(self): res = a + b self.assertTrue(np.array_equal(res.numpy(), a_np + b_np)) - def test_sub(self): + def test_add(self): + with _test_eager_guard(): + self.func_test_add() + self.func_test_add() + + def func_test_sub(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -44,7 +50,12 @@ def test_sub(self): res = a - b self.assertTrue(np.array_equal(res.numpy(), a_np - b_np)) - def test_mul(self): + def test_sub(self): + with _test_eager_guard(): + self.func_test_sub() + self.func_test_sub() + + def func_test_mul(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -53,7 +64,12 @@ def test_mul(self): res = a * b self.assertTrue(np.array_equal(res.numpy(), a_np * b_np)) - def test_div(self): + def test_mul(self): + with _test_eager_guard(): + self.func_test_mul() + 
self.func_test_mul() + + def func_test_div(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -63,7 +79,12 @@ def test_div(self): #NOTE: Not sure why array_equal fails on windows, allclose is acceptable self.assertTrue(np.allclose(res.numpy(), a_np / b_np)) - def test_add_scalar(self): + def test_div(self): + with _test_eager_guard(): + self.func_test_div() + self.func_test_div() + + def func_test_add_scalar(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -71,7 +92,12 @@ def test_add_scalar(self): res = a + b self.assertTrue(np.array_equal(res.numpy(), a_np + b)) - def test_add_scalar_reverse(self): + def test_add_scalar(self): + with _test_eager_guard(): + self.func_test_add_scalar() + self.func_test_add_scalar() + + def func_test_add_scalar_reverse(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -79,7 +105,12 @@ def test_add_scalar_reverse(self): res = b + a self.assertTrue(np.array_equal(res.numpy(), b + a_np)) - def test_sub_scalar(self): + def test_add_scalar_reverse(self): + with _test_eager_guard(): + self.func_test_add_scalar_reverse() + self.func_test_add_scalar_reverse() + + def func_test_sub_scalar(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -87,7 +118,12 @@ def test_sub_scalar(self): res = a - b self.assertTrue(np.array_equal(res.numpy(), a_np - b)) - def test_sub_scalar_reverse(self): + def test_sub_scalar(self): + with _test_eager_guard(): + self.func_test_sub_scalar() + self.func_test_sub_scalar() + + def func_test_sub_scalar_reverse(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -95,7 +131,12 @@ def test_sub_scalar_reverse(self): res = b - a self.assertTrue(np.array_equal(res.numpy(), b - a_np)) - def test_mul_scalar(self): + def test_sub_scalar_reverse(self): + with _test_eager_guard(): + self.func_test_sub_scalar_reverse() + self.func_test_sub_scalar_reverse() + + def func_test_mul_scalar(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -103,8 +144,13 @@ def test_mul_scalar(self): res = a * b self.assertTrue(np.array_equal(res.numpy(), a_np * b)) + def test_mul_scalar(self): + with _test_eager_guard(): + self.func_test_mul_scalar() + self.func_test_mul_scalar() + # div_scalar, not equal - def test_div_scalar(self): + def func_test_div_scalar(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -112,8 +158,13 @@ def test_div_scalar(self): res = a / b self.assertTrue(np.allclose(res.numpy(), a_np / b)) + def test_div_scalar(self): + with _test_eager_guard(): + self.func_test_div_scalar() + self.func_test_div_scalar() + # pow of float type, not equal - def test_pow(self): + def func_test_pow(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -122,7 +173,12 @@ def test_pow(self): res = a**b self.assertTrue(np.allclose(res.numpy(), a_np**b_np)) - def test_floor_div(self): + def test_pow(self): + with _test_eager_guard(): + self.func_test_pow() + self.func_test_pow() + + def func_test_floor_div(self): a_np = 
np.random.randint(1, 100, size=self.shape) b_np = np.random.randint(1, 100, size=self.shape) with fluid.dygraph.guard(): @@ -131,7 +187,12 @@ def test_floor_div(self): res = a // b self.assertTrue(np.array_equal(res.numpy(), a_np // b_np)) - def test_mod(self): + def test_floor_div(self): + with _test_eager_guard(): + self.func_test_floor_div() + self.func_test_floor_div() + + def func_test_mod(self): a_np = np.random.randint(1, 100, size=self.shape) b_np = np.random.randint(1, 100, size=self.shape) with fluid.dygraph.guard(): @@ -140,8 +201,13 @@ def test_mod(self): res = a % b self.assertTrue(np.array_equal(res.numpy(), a_np % b_np)) + def test_mod(self): + with _test_eager_guard(): + self.func_test_mod() + self.func_test_mod() + # for bitwise and/or/xor/not - def test_bitwise(self): + def func_test_bitwise(self): paddle.disable_static() x_np = np.random.randint(-100, 100, [2, 3, 5]) @@ -165,8 +231,13 @@ def test_bitwise(self): out = ~x self.assertTrue(np.array_equal(out.numpy(), out_np)) + def test_bitwise(self): + with _test_eager_guard(): + self.func_test_bitwise() + self.func_test_bitwise() + # for logical compare - def test_equal(self): + def func_test_equal(self): a_np = np.asarray([1, 2, 3, 4, 5]) b_np = np.asarray([1, 2, 3, 4, 5]) c_np = np.asarray([1, 2, 2, 4, 5]) @@ -179,7 +250,12 @@ def test_equal(self): self.assertTrue(np.array_equal(res1.numpy(), a_np == b_np)) self.assertTrue(np.array_equal(res2.numpy(), a_np == c_np)) - def test_not_equal(self): + def test_equal(self): + with _test_eager_guard(): + self.func_test_equal() + self.func_test_equal() + + def func_test_not_equal(self): a_np = np.asarray([1, 2, 3, 4, 5]) b_np = np.asarray([1, 2, 3, 4, 5]) c_np = np.asarray([1, 2, 2, 4, 5]) @@ -192,7 +268,12 @@ def test_not_equal(self): self.assertTrue(np.array_equal(res1.numpy(), a_np != b_np)) self.assertTrue(np.array_equal(res2.numpy(), a_np != c_np)) - def test_less_than(self): + def test_not_equal(self): + with _test_eager_guard(): + self.func_test_not_equal() + self.func_test_not_equal() + + def func_test_less_than(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -201,7 +282,12 @@ def test_less_than(self): res = (a < b) self.assertTrue(np.array_equal(res.numpy(), a_np < b_np)) - def test_less_equal(self): + def test_less_than(self): + with _test_eager_guard(): + self.func_test_less_than() + self.func_test_less_than() + + def func_test_less_equal(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -210,7 +296,12 @@ def test_less_equal(self): res = (a <= b) self.assertTrue(np.array_equal(res.numpy(), a_np <= b_np)) - def test_greater_than(self): + def test_less_equal(self): + with _test_eager_guard(): + self.func_test_less_equal() + self.func_test_less_equal() + + def func_test_greater_than(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ -219,7 +310,12 @@ def test_greater_than(self): res = (a > b) self.assertTrue(np.array_equal(res.numpy(), a_np > b_np)) - def test_greater_equal(self): + def test_greater_than(self): + with _test_eager_guard(): + self.func_test_greater_than() + self.func_test_greater_than() + + def func_test_greater_equal(self): a_np = np.random.random(self.shape).astype(self.dtype) b_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): @@ 
-228,27 +324,47 @@ def test_greater_equal(self): res = (a >= b) self.assertTrue(np.array_equal(res.numpy(), a_np >= b_np)) - def test_neg(self): + def test_greater_equal(self): + with _test_eager_guard(): + self.func_test_greater_equal() + self.func_test_greater_equal() + + def func_test_neg(self): a_np = np.random.uniform(-1, 1, self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) res = -a self.assertTrue(np.array_equal(res.numpy(), -a_np)) - def test_float_int_long(self): + def test_neg(self): + with _test_eager_guard(): + self.func_test_neg() + self.func_test_neg() + + def func_test_float_int_long(self): with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(np.array([100.1])) self.assertTrue(float(a) == 100.1) self.assertTrue(int(a) == 100) self.assertTrue(int(a) == 100) - def test_len(self): + def test_float_int_long(self): + with _test_eager_guard(): + self.func_test_float_int_long() + self.func_test_float_int_long() + + def func_test_len(self): a_np = np.random.uniform(-1, 1, self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) self.assertTrue(len(a) == 10) - def test_index(self): + def test_len(self): + with _test_eager_guard(): + self.func_test_len() + self.func_test_len() + + def func_test_index(self): with fluid.dygraph.guard(): var1 = fluid.dygraph.to_variable(np.array([2])) i_tmp = 0 @@ -260,7 +376,12 @@ def test_index(self): str1 = "just test" self.assertTrue(str1[var1] == 's') - def test_np_left_mul(self): + def test_index(self): + with _test_eager_guard(): + self.func_test_index() + self.func_test_index() + + def func_test_np_left_mul(self): with fluid.dygraph.guard(): t = np.sqrt(2.0 * np.pi) x = fluid.layers.ones((2, 2), dtype="float32") @@ -274,7 +395,12 @@ def test_np_left_mul(self): rtol=1e-05, atol=0.0)) - def test_add_different_dtype(self): + def test_np_left_mul(self): + with _test_eager_guard(): + self.func_test_np_left_mul() + self.func_test_np_left_mul() + + def func_test_add_different_dtype(self): a_np = np.random.random(self.shape).astype(np.float32) b_np = np.random.random(self.shape).astype(np.float16) with fluid.dygraph.guard(): @@ -283,7 +409,12 @@ def test_add_different_dtype(self): res = a + b self.assertTrue(np.array_equal(res.numpy(), a_np + b_np)) - def test_floordiv_different_dtype(self): + def test_add_different_dtype(self): + with _test_eager_guard(): + self.func_test_add_different_dtype() + self.func_test_add_different_dtype() + + def func_test_floordiv_different_dtype(self): a_np = np.full(self.shape, 10, np.int64) b_np = np.full(self.shape, 2, np.int32) with fluid.dygraph.guard(): @@ -292,7 +423,12 @@ def test_floordiv_different_dtype(self): res = a // b self.assertTrue(np.array_equal(res.numpy(), a_np // b_np)) - def test_astype(self): + def test_floordiv_different_dtype(self): + with _test_eager_guard(): + self.func_test_floordiv_different_dtype() + self.func_test_floordiv_different_dtype() + + def func_test_astype(self): a_np = np.random.uniform(-1, 1, self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) @@ -306,7 +442,12 @@ def test_astype(self): self.assertTrue(np.array_equal(res1.numpy(), res2.numpy())) self.assertTrue(np.array_equal(res1.numpy(), res3.numpy())) - def test_conpare_op_broadcast(self): + def test_astype(self): + with _test_eager_guard(): + self.func_test_astype() + self.func_test_astype() + + def func_test_conpare_op_broadcast(self): a_np = np.random.uniform(-1, 1, [10, 1, 10]).astype(self.dtype) 
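The test_math_op_patch_var_base.py changes running through this stretch are one mechanical refactor repeated per test: the body of each test_xxx is renamed to func_test_xxx, and a fresh test_xxx calls it twice, first inside _test_eager_guard() and then again in legacy dygraph mode. A self-contained sketch of that wrapper, reusing the add case from the hunk and the same _test_eager_guard import these files use:

import unittest

import numpy as np
import paddle.fluid as fluid
from paddle.fluid.framework import _test_eager_guard  # same import as in the hunks


class TestAddWrapSketch(unittest.TestCase):
    def func_test_add(self):
        # Original test body, unchanged apart from the func_ prefix.
        a_np = np.random.random([10, 1024]).astype("float32")
        b_np = np.random.random([10, 1024]).astype("float32")
        with fluid.dygraph.guard():
            a = fluid.dygraph.to_variable(a_np)
            b = fluid.dygraph.to_variable(b_np)
            res = a + b
            self.assertTrue(np.array_equal(res.numpy(), a_np + b_np))

    def test_add(self):
        # One unittest entry point covers both execution modes:
        with _test_eager_guard():
            self.func_test_add()   # eager (final-state) dygraph
        self.func_test_add()       # legacy dygraph


if __name__ == "__main__":
    unittest.main()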
b_np = np.random.uniform(-1, 1, [1, 1, 10]).astype(self.dtype) with fluid.dygraph.guard(): @@ -316,7 +457,12 @@ def test_conpare_op_broadcast(self): self.assertEqual((a != b).dtype, fluid.core.VarDesc.VarType.BOOL) self.assertTrue(np.array_equal((a != b).numpy(), a_np != b_np)) - def test_tensor_patch_method(self): + def test_conpare_op_broadcast(self): + with _test_eager_guard(): + self.func_test_conpare_op_broadcast() + self.func_test_conpare_op_broadcast() + + def func_test_tensor_patch_method(self): paddle.disable_static() x_np = np.random.uniform(-1, 1, [2, 3]).astype(self.dtype) y_np = np.random.uniform(-1, 1, [2, 3]).astype(self.dtype) @@ -590,13 +736,23 @@ def test_tensor_patch_method(self): self.assertTrue(inspect.ismethod(a.std)) self.assertTrue(inspect.ismethod(a.numel)) - def test_complex_scalar(self): + def test_tensor_patch_method(self): + with _test_eager_guard(): + self.func_test_tensor_patch_method() + self.func_test_tensor_patch_method() + + def func_test_complex_scalar(self): a_np = np.random.random(self.shape).astype(self.dtype) with fluid.dygraph.guard(): a = fluid.dygraph.to_variable(a_np) res = 1J * a self.assertTrue(np.array_equal(res.numpy(), 1J * a_np)) + def test_complex_scalar(self): + with _test_eager_guard(): + self.func_test_complex_scalar() + self.func_test_complex_scalar() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_matrix_rank_op.py b/python/paddle/fluid/tests/unittests/test_matrix_rank_op.py index d0b84a0d7e108..b13b346261762 100644 --- a/python/paddle/fluid/tests/unittests/test_matrix_rank_op.py +++ b/python/paddle/fluid/tests/unittests/test_matrix_rank_op.py @@ -30,8 +30,13 @@ np.random.seed(SEED) +def matrix_rank_wraper(x, tol=None, use_default_tol=True, hermitian=False): + return paddle.linalg.matrix_rank(x, tol, hermitian) + + class TestMatrixRankOP(OpTest): def setUp(self): + self.python_api = matrix_rank_wraper self.op_type = "matrix_rank" self.init_data() self.inputs = {'X': self.x} @@ -44,7 +49,7 @@ def setUp(self): self.outputs = {'Out': self.out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def init_data(self): self.x = np.eye(3, dtype=np.float32) @@ -110,6 +115,28 @@ def init_data(self): self.hermitian) +class TestMatrixRankOP6(TestMatrixRankOP): + def init_data(self): + self.x = np.random.rand(3, 4, 5, 6).astype(np.float32) + self.tol_tensor = None + self.tol = None + self.use_default_tol = False + self.hermitian = False + self.out = np.linalg.matrix_rank(self.x, self.tol_tensor, + self.hermitian) + + +class TestMatrixRankOP7(TestMatrixRankOP): + def init_data(self): + self.x = np.eye(200, dtype=np.float64) + self.tol_tensor = np.random.random([200, 200]).astype(self.x.dtype) + self.tol = None + self.use_default_tol = True + self.hermitian = True + self.out = np.linalg.matrix_rank(self.x, self.tol_tensor, + self.hermitian) + + class TestMatrixRankAPI(unittest.TestCase): def test_dygraph(self): paddle.disable_static() diff --git a/python/paddle/fluid/tests/unittests/test_max_op.py b/python/paddle/fluid/tests/unittests/test_max_op.py index 5e413e80d7143..d5b884dfcc93b 100644 --- a/python/paddle/fluid/tests/unittests/test_max_op.py +++ b/python/paddle/fluid/tests/unittests/test_max_op.py @@ -18,6 +18,7 @@ import numpy as np from op_test import OpTest, skip_check_grad_ci, check_out_dtype import paddle +from paddle.fluid.framework import _test_eager_guard import paddle.fluid.core as core @@ -86,6 +87,10 @@ def test_imperative_api(self): z_expected = 
np.array(np.max(np_x, axis=0)) self.assertEqual((np_z == z_expected).all(), True) + def test_eager_api(self): + with _test_eager_guard(): + self.test_imperative_api() + def test_big_dimension(self): paddle.disable_static() x = paddle.rand(shape=[2, 2, 2, 2, 2, 2, 2]) diff --git a/python/paddle/fluid/tests/unittests/test_mean_iou.py b/python/paddle/fluid/tests/unittests/test_mean_iou.py index 4e89a9034a341..b392a328494b3 100644 --- a/python/paddle/fluid/tests/unittests/test_mean_iou.py +++ b/python/paddle/fluid/tests/unittests/test_mean_iou.py @@ -15,6 +15,7 @@ from __future__ import print_function from __future__ import division + import unittest import numpy as np from op_test import OpTest @@ -113,6 +114,11 @@ def config(self): self.in_correct_num = 2 self.in_mean_iou_num = 2 + # NOTE(dev): Skip check_dygraph becuase Python API doesn't expose + # in_wrong_num/in_correct_num/in_mean_iou_num argument + def test_check_output(self): + self.check_output(check_dygraph=False, check_eager=False) + class TestMeanIOUOpError(unittest.TestCase): def test_errors(self): @@ -130,5 +136,4 @@ def test_errors(self): if __name__ == '__main__': - paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_mean_op.py b/python/paddle/fluid/tests/unittests/test_mean_op.py index 7a49770e57985..c5ee5c91e1c75 100644 --- a/python/paddle/fluid/tests/unittests/test_mean_op.py +++ b/python/paddle/fluid/tests/unittests/test_mean_op.py @@ -21,13 +21,26 @@ import paddle.fluid.core as core import paddle.fluid as fluid from paddle.fluid import Program, program_guard - +from paddle.fluid.framework import _test_eager_guard np.random.seed(10) +def mean_wrapper(x, axis=None, keepdim=False, reduce_all=False): + if reduce_all == True: + return paddle.mean(x, range(len(x.shape)), keepdim) + return paddle.mean(x, axis, keepdim) + + +def reduce_mean_wrapper(x, axis=0, keepdim=False, reduce_all=False): + if reduce_all == True: + return paddle.mean(x, range(len(x.shape)), keepdim) + return paddle.mean(x, axis, keepdim) + + class TestMeanOp(OpTest): def setUp(self): self.op_type = "mean" + self.python_api = fluid.layers.mean self.dtype = np.float64 self.init_dtype_type() self.inputs = {'X': np.random.random((10, 10)).astype(self.dtype)} @@ -37,10 +50,10 @@ def init_dtype_type(self): pass def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_checkout_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestMeanOpError(unittest.TestCase): @@ -68,7 +81,7 @@ def init_dtype_type(self): def test_check_output(self): place = core.CUDAPlace(0) if core.is_float16_supported(place): - self.check_output_with_place(place) + self.check_output_with_place(place, check_eager=True) def test_checkout_grad(self): place = core.CUDAPlace(0) @@ -91,11 +104,11 @@ def init_dtype_type(self): def test_check_output(self): paddle.enable_static() - self.check_output_with_place(core.CPUPlace()) + self.check_output_with_place(core.CPUPlace(), check_eager=True) def test_checkout_grad(self): place = core.CPUPlace() - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def ref_reduce_mean(x, axis=None, keepdim=False, reduce_all=False): @@ -117,6 +130,7 @@ def ref_reduce_mean_grad(x, axis, dtype): class TestReduceMeanOp(OpTest): def setUp(self): self.op_type = 'reduce_mean' + self.python_api = reduce_mean_wrapper self.dtype = 'float64' self.shape = [2, 3, 4, 5] self.axis = [0] @@ 
-145,7 +159,7 @@ def set_attrs(self): def test_check_output(self): if self.dtype != 'float16': - self.check_output() + self.check_output(check_eager=True) else: if not core.is_compiled_with_cuda(): return @@ -154,7 +168,7 @@ def test_check_output(self): def test_check_grad(self): if self.dtype != 'float16': - self.check_grad(['X'], ['Out']) + self.check_grad(['X'], ['Out'], check_eager=True) else: return if not core.is_compiled_with_cuda(): @@ -175,6 +189,7 @@ def test_check_grad(self): class TestReduceMeanOpDefaultAttrs(TestReduceMeanOp): def setUp(self): self.op_type = 'reduce_mean' + self.python_api = reduce_mean_wrapper self.dtype = 'float64' self.shape = [2, 3, 4, 5] diff --git a/python/paddle/fluid/tests/unittests/test_meshgrid_op.py b/python/paddle/fluid/tests/unittests/test_meshgrid_op.py index 2cb83eba3767c..95acdbe4a0687 100644 --- a/python/paddle/fluid/tests/unittests/test_meshgrid_op.py +++ b/python/paddle/fluid/tests/unittests/test_meshgrid_op.py @@ -20,6 +20,7 @@ import paddle.fluid as fluid import paddle from paddle.fluid import compiler, Program, program_guard, core +from paddle.fluid.framework import _test_eager_guard class TestMeshgridOp(OpTest): @@ -149,6 +150,10 @@ def test_api_with_dygraph(self): assert np.array_equal(res_3.shape, [100, 200]) assert np.array_equal(res_4.shape, [100, 200]) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_api_with_dygraph() + class TestMeshgridOp7(unittest.TestCase): def test_api_with_dygraph_list_input(self): @@ -163,6 +168,10 @@ def test_api_with_dygraph_list_input(self): assert np.array_equal(res_3.shape, [100, 200]) assert np.array_equal(res_4.shape, [100, 200]) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_api_with_dygraph_list_input() + class TestMeshgridOp8(unittest.TestCase): def test_api_with_dygraph_tuple_input(self): @@ -177,6 +186,40 @@ def test_api_with_dygraph_tuple_input(self): assert np.array_equal(res_3.shape, [100, 200]) assert np.array_equal(res_4.shape, [100, 200]) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_api_with_dygraph_tuple_input() + + +class TestMeshgridEager(unittest.TestCase): + def test_dygraph_final_state_api(self): + input_1 = np.random.randint(0, 100, [100, ]).astype('int32') + input_2 = np.random.randint(0, 100, [200, ]).astype('int32') + + with fluid.dygraph.guard(): + tensor_1 = fluid.dygraph.to_variable(input_1) + tensor_2 = fluid.dygraph.to_variable(input_2) + tensor_1.stop_gradient = False + tensor_2.stop_gradient = False + res_1, res_2 = paddle.tensor.meshgrid((tensor_1, tensor_2)) + sum = paddle.add_n([res_1, res_2]) + sum.backward() + with _test_eager_guard(): + tensor_eager_1 = fluid.dygraph.to_variable(input_1) + tensor_eager_2 = fluid.dygraph.to_variable(input_2) + tensor_eager_1.stop_gradient = False + tensor_eager_2.stop_gradient = False + res_eager_1, res_eager_2 = paddle.tensor.meshgrid( + (tensor_eager_1, tensor_eager_2)) + sum_eager = paddle.add_n([res_eager_1, res_eager_2]) + sum_eager.backward() + self.assertEqual(( + tensor_1.grad.numpy() == tensor_eager_1.grad.numpy()).all(), + True) + self.assertEqual(( + tensor_2.grad.numpy() == tensor_eager_2.grad.numpy()).all(), + True) + if __name__ == '__main__': paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/test_min_op.py b/python/paddle/fluid/tests/unittests/test_min_op.py index f865c234a747c..13f82fb9bd7cb 100644 --- a/python/paddle/fluid/tests/unittests/test_min_op.py +++ b/python/paddle/fluid/tests/unittests/test_min_op.py @@ 
-19,6 +19,7 @@ from op_test import OpTest, skip_check_grad_ci, check_out_dtype import paddle import paddle.fluid.core as core +from paddle.fluid.framework import _test_eager_guard class ApiMinTest(unittest.TestCase): @@ -86,6 +87,10 @@ def test_imperative_api(self): z_expected = np.array(np.min(np_x, axis=0)) self.assertEqual((np_z == z_expected).all(), True) + def test_eager_api(self): + with _test_eager_guard(): + self.test_imperative_api() + class TestOutDtype(unittest.TestCase): def test_min(self): diff --git a/python/paddle/fluid/tests/unittests/test_mode_op.py b/python/paddle/fluid/tests/unittests/test_mode_op.py index 1b0458f2e255f..471904b0c9426 100644 --- a/python/paddle/fluid/tests/unittests/test_mode_op.py +++ b/python/paddle/fluid/tests/unittests/test_mode_op.py @@ -62,6 +62,7 @@ def init_args(self): def setUp(self): self.op_type = "mode" + self.python_api = paddle.mode self.dtype = np.float64 np.random.seed(666) self.input_data = np.random.rand(2, 64, 1) @@ -73,11 +74,11 @@ def setUp(self): def test_check_output(self): paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): paddle.enable_static() - self.check_grad(set(['X']), 'Out') + self.check_grad(set(['X']), 'Out', check_eager=True) class TestModeOpLastdim(OpTest): @@ -86,6 +87,7 @@ def init_args(self): def setUp(self): self.op_type = "mode" + self.python_api = paddle.mode self.dtype = np.float64 np.random.seed(666) self.input_data = np.random.rand(2, 1, 1, 2, 30) @@ -97,11 +99,11 @@ def setUp(self): def test_check_output(self): paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): paddle.enable_static() - self.check_grad(set(['X']), 'Out') + self.check_grad(set(['X']), 'Out', check_eager=True) class TestModeOpKernels(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_momentum_op.py b/python/paddle/fluid/tests/unittests/test_momentum_op.py index 7f3690cff60f5..a4f38e37731e8 100644 --- a/python/paddle/fluid/tests/unittests/test_momentum_op.py +++ b/python/paddle/fluid/tests/unittests/test_momentum_op.py @@ -22,6 +22,7 @@ import paddle import paddle.fluid as fluid import numpy +from paddle.fluid.framework import _test_eager_guard def calculate_momentum_by_numpy(param, @@ -528,6 +529,11 @@ def test_raise_error(self): ValueError, paddle.optimizer.Momentum, learning_rate=None) self.assertRaises(ValueError, paddle.optimizer.Momentum, momentum=None) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_momentum_dygraph() + self.test_raise_error() + class TestMomentumOpWithDecay(OpTest): def setUp(self): @@ -921,6 +927,10 @@ def test_main(self): self._check_with_param_arrt(place, use_amp) self._check_with_param_group(place, use_amp) + def test_api_eager_dygraph(self): + with _test_eager_guard(): + self.test_main() + class TestMultiTensorMomentumStatic(unittest.TestCase): def _momentum_optimize_static(self, diff --git a/python/paddle/fluid/tests/unittests/test_multinomial_op.py b/python/paddle/fluid/tests/unittests/test_multinomial_op.py index a65a1c7e14c2b..ecde527523d3d 100644 --- a/python/paddle/fluid/tests/unittests/test_multinomial_op.py +++ b/python/paddle/fluid/tests/unittests/test_multinomial_op.py @@ -227,9 +227,6 @@ def test_fixed_random_number(self): if not "V100" in paddle.device.cuda.get_device_name(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on V100 GPU------>") paddle.disable_static() 
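Several python_api hooks in the nearby hunks (mean_wrapper, reduce_mean_wrapper, logsumexp_wrapper, matrix_rank_wraper) are small adapters rather than the public API itself, because the C++ operator carries attributes such as reduce_all or use_default_tol that the Python signature does not expose. A minimal sketch of that adapter idea, using paddle.mean as in the reduce_mean_wrapper hunk; the printed shapes are indicative and may differ between Paddle versions for a full reduction:

import paddle


def reduce_mean_wrapper(x, axis=0, keepdim=False, reduce_all=False):
    # The operator attribute reduce_all has no direct counterpart in
    # paddle.mean, so the wrapper translates it into "reduce over every axis".
    if reduce_all:
        return paddle.mean(x, list(range(len(x.shape))), keepdim)
    return paddle.mean(x, axis, keepdim)


x = paddle.rand([2, 3, 4])
print(reduce_mean_wrapper(x, axis=1).shape)           # [2, 4]
print(reduce_mean_wrapper(x, reduce_all=True).shape)  # full reduction to a scalar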
paddle.set_device('gpu') diff --git a/python/paddle/fluid/tests/unittests/test_multiplex_op.py b/python/paddle/fluid/tests/unittests/test_multiplex_op.py index a840586d78db0..a26eed12246e4 100644 --- a/python/paddle/fluid/tests/unittests/test_multiplex_op.py +++ b/python/paddle/fluid/tests/unittests/test_multiplex_op.py @@ -68,26 +68,26 @@ def test_errors(self): def test_list(): # the inputs type must be list - fluid.layers.multiplex(inputs=x1, index=index) + paddle.multiplex(inputs=x1, index=index) self.assertRaises(TypeError, test_list) def test_len(): - fluid.layers.multiplex(inputs=[x1], index=index) + paddle.multiplex(inputs=[x1], index=index) self.assertRaises(ValueError, test_len) def test_type(): y1 = fluid.data(name='y1', shape=[None, 2], dtype='int16') y2 = fluid.data(name='y2', shape=[None, 2], dtype='int16') - fluid.layers.multiplex(inputs=[y1, y2], index=index) + paddle.multiplex(inputs=[y1, y2], index=index) self.assertRaises(TypeError, test_type) def test_type2(): index2 = fluid.data( name='index2', shape=[None, 1], dtype='int16') - fluid.layers.multiplex(inputs=[x1, x2], index=index2) + paddle.multiplex(inputs=[x1, x2], index=index2) self.assertRaises(TypeError, test_type2) diff --git a/python/paddle/fluid/tests/unittests/test_multiprocess_dataloader_dataset.py b/python/paddle/fluid/tests/unittests/test_multiprocess_dataloader_dataset.py index 8f1febcdeddf7..e23905005df56 100755 --- a/python/paddle/fluid/tests/unittests/test_multiprocess_dataloader_dataset.py +++ b/python/paddle/fluid/tests/unittests/test_multiprocess_dataloader_dataset.py @@ -21,6 +21,7 @@ import paddle.fluid as fluid from paddle.io import Dataset, IterableDataset, TensorDataset, \ ComposeDataset, ChainDataset, DataLoader, random_split, Subset +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph IMAGE_SIZE = 32 @@ -76,21 +77,28 @@ def run_main(self, num_workers, places): assert len(label) == 1 assert input.shape == [1, 3, 4] assert label.shape == [1, 1] - assert isinstance(input, paddle.Tensor) - assert isinstance(label, paddle.Tensor) + assert isinstance(input, + (fluid.core.VarBase, fluid.core.eager.Tensor)) + assert isinstance(label, + (fluid.core.VarBase, fluid.core.eager.Tensor)) assert np.allclose(input.numpy(), input_np[i]) assert np.allclose(label.numpy(), label_np[i]) - def test_main(self): + def func_test_main(self): places = [paddle.CPUPlace()] if paddle.is_compiled_with_cuda(): places.append(paddle.CUDAPlace(0)) for p in places: self.run_main(num_workers=0, places=p) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestComposeDataset(unittest.TestCase): - def test_main(self): + def func_test_main(self): paddle.static.default_startup_program().random_seed = 1 paddle.static.default_main_program().random_seed = 1 @@ -108,9 +116,14 @@ def test_main(self): assert np.allclose(input2, input2_t) assert np.allclose(label2, label2_t) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestRandomSplitApi(unittest.TestCase): - def test_main(self): + def func_test_main(self): paddle.static.default_startup_program().random_seed = 1 paddle.static.default_main_program().random_seed = 1 @@ -129,9 +142,14 @@ def test_main(self): self.assertTrue(len(elements_list) == 0) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestRandomSplitError(unittest.TestCase): - def test_errors(self): + def func_test_errors(self): 
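The test_multiplex_op.py error tests below switch from fluid.layers.multiplex to the 2.x entry point paddle.multiplex, which accepts the same inputs/index arguments. As a quick standalone illustration of that API (the values here are chosen for the example, not taken from the test):

import numpy as np
import paddle

# Row i of the output is row i of inputs[index[i][0]].
img1 = paddle.to_tensor(np.array([[1., 2.], [3., 4.]], dtype=np.float32))
img2 = paddle.to_tensor(np.array([[5., 6.], [7., 8.]], dtype=np.float32))
index = paddle.to_tensor(np.array([[1], [0]], dtype=np.int32))

out = paddle.multiplex(inputs=[img1, img2], index=index)
print(out.numpy())  # [[5. 6.]  <- row 0 taken from img2
                    #  [3. 4.]] <- row 1 taken from img1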
paddle.static.default_startup_program().random_seed = 1 paddle.static.default_main_program().random_seed = 1 @@ -139,6 +157,11 @@ def test_errors(self): self.assertRaises(ValueError, paddle.io.random_split, range(5), [8]) self.assertRaises(ValueError, paddle.io.random_split, range(5), []) + def test_errors(self): + with _test_eager_guard(): + self.func_test_errors() + self.func_test_errors() + class TestSubsetDataset(unittest.TestCase): def run_main(self, num_workers, places): @@ -173,8 +196,10 @@ def assert_basic(input, label): assert len(label) == 1 assert input.shape == [1, 3, 4] assert label.shape == [1, 1] - assert isinstance(input, paddle.Tensor) - assert isinstance(label, paddle.Tensor) + assert isinstance(input, + (fluid.core.VarBase, fluid.core.eager.Tensor)) + assert isinstance(label, + (fluid.core.VarBase, fluid.core.eager.Tensor)) elements_list = list() for _, (input, label) in enumerate(dataloader()): @@ -192,7 +217,7 @@ def assert_basic(input, label): self.assertEqual(odd_list, elements_list) - def test_main(self): + def func_test_main(self): paddle.static.default_startup_program().random_seed = 1 paddle.static.default_main_program().random_seed = 1 @@ -202,6 +227,11 @@ def test_main(self): for p in places: self.run_main(num_workers=0, places=p) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestChainDataset(unittest.TestCase): def run_main(self, num_workers, places): @@ -227,13 +257,18 @@ def run_main(self, num_workers, places): assert np.allclose(label, samples[idx][1]) idx += 1 - def test_main(self): + def func_test_main(self): places = [paddle.CPUPlace()] if paddle.is_compiled_with_cuda(): places.append(paddle.CUDAPlace(0)) for p in places: self.run_main(num_workers=0, places=p) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class NumpyMixTensorDataset(Dataset): def __init__(self, sample_num): @@ -269,8 +304,10 @@ def run_main(self, num_workers, places): assert len(label) == 1 assert input.shape == [1, IMAGE_SIZE] assert label.shape == [1, 1] - assert isinstance(input, paddle.Tensor) - assert isinstance(label, paddle.Tensor) + assert isinstance(input, + (fluid.core.VarBase, fluid.core.eager.Tensor)) + assert isinstance(label, + (fluid.core.VarBase, fluid.core.eager.Tensor)) class ComplextDataset(Dataset): @@ -325,10 +362,15 @@ def run_main(self, num_workers): assert data[4]['a'].shape == [2] assert data[4]['b'].shape == [2, 2] - def test_main(self): + def func_test_main(self): for num_workers in [0, 2]: self.run_main(num_workers) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class SingleFieldDataset(Dataset): def __init__(self, sample_num): @@ -360,13 +402,19 @@ def run_main(self, num_workers): drop_last=True) for i, data in enumerate(dataloader()): - assert isinstance(data, paddle.Tensor) + assert isinstance(data, + (fluid.core.VarBase, fluid.core.eager.Tensor)) assert data.shape == [2, 2, 3] - def test_main(self): + def func_test_main(self): for num_workers in [0, 2]: self.run_main(num_workers) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class SingleFieldIterableDataset(IterableDataset): def __init__(self, sample_num): @@ -390,12 +438,17 @@ def setUp(self): [2834126987, 2358157858, 1860244682, 1437227251], [457190280, 2660306227, 859341110, 354512857]] - def test_main(self): + def func_test_main(self): from paddle.fluid.dataloader.worker import 
_generate_states for inp, outp in zip(self.inputs, self.outputs): out = _generate_states(*inp) assert out == outp + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestDatasetWithDropLast(unittest.TestCase): def run_main(self, dataset, num_samples, batch_size): @@ -413,14 +466,24 @@ def run_main(self, dataset, num_samples, batch_size): datas.append(data) assert len(datas) == steps - def test_map_dataset(self): + def func_test_map_dataset(self): dataset = RandomDataset(10) self.run_main(dataset, 10, 3) - def test_iterable_dataset(self): + def test_map_dataset(self): + with _test_eager_guard(): + self.func_test_map_dataset() + self.func_test_map_dataset() + + def func_test_iterable_dataset(self): dataset = RandomIterableDataset(10) self.run_main(dataset, 10, 3) + def test_iterable_dataset(self): + with _test_eager_guard(): + self.func_test_iterable_dataset() + self.func_test_iterable_dataset() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_nan_inf.py b/python/paddle/fluid/tests/unittests/test_nan_inf.py index cb7e673c6ca29..84559048a2b8a 100644 --- a/python/paddle/fluid/tests/unittests/test_nan_inf.py +++ b/python/paddle/fluid/tests/unittests/test_nan_inf.py @@ -47,10 +47,12 @@ def check_nan_inf(self): print(out) print(err) - assert returncode == 0 # in python3, type(out+err) is 'bytes', need use encode - assert (out + err - ).find('There are `nan` or `inf` in tensor'.encode()) != -1 + if paddle.fluid.core.is_compiled_with_cuda(): + assert (out + err).find('find nan or inf==='.encode()) != -1 + else: + assert (out + err + ).find('There are `nan` or `inf` in tensor'.encode()) != -1 def test_nan_inf_in_static_mode(self): self._python_interp += " check_nan_inf_base.py" diff --git a/python/paddle/fluid/tests/unittests/test_nanmean_api.py b/python/paddle/fluid/tests/unittests/test_nanmean_api.py new file mode 100644 index 0000000000000..90a9a130899d3 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_nanmean_api.py @@ -0,0 +1,137 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
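The new test_nanmean_api.py whose license header begins here checks paddle.nanmean against np.nanmean, masking out positions where a slice is entirely NaN before comparing. The NumPy side of that comparison can be reproduced on its own (pure NumPy, using the same x_grad-style matrix as the new test; the all-NaN column triggers a RuntimeWarning and stays NaN):

import numpy as np

x = np.array([[np.nan, np.nan, 3.0],
              [0.0,    np.nan, 2.0]], dtype=np.float32)

print(np.nanmean(x))          # 1.6666667 -> mean over the non-NaN values [3., 0., 2.]
print(np.nanmean(x, axis=1))  # [3. 1.]   -> per-row means, NaNs ignored
print(np.nanmean(x, axis=0))  # [0. nan 2.5] -> an all-NaN column remains NaN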
+ +from __future__ import print_function + +import unittest +import numpy as np +import paddle +import paddle.fluid as fluid +import paddle.fluid.core as core +from paddle.fluid import Program, program_guard + +np.random.seed(10) + + +class TestNanmeanAPI(unittest.TestCase): + # test paddle.tensor.math.nanmean + + def setUp(self): + self.x_shape = [2, 3, 4, 5] + self.x = np.random.uniform(-1, 1, self.x_shape).astype(np.float32) + self.x[0, :, :, :] = np.nan + self.x_grad = np.array([[np.nan, np.nan, 3.], + [0., np.nan, 2.]]).astype(np.float32) + self.place = paddle.CUDAPlace(0) if core.is_compiled_with_cuda() \ + else paddle.CPUPlace() + + def test_api_static(self): + paddle.enable_static() + with paddle.static.program_guard(paddle.static.Program()): + x = paddle.fluid.data('X', self.x_shape) + out1 = paddle.nanmean(x) + out2 = paddle.tensor.nanmean(x) + out3 = paddle.tensor.math.nanmean(x) + axis = np.arange(len(self.x_shape)).tolist() + out4 = paddle.nanmean(x, axis) + out5 = paddle.nanmean(x, tuple(axis)) + exe = paddle.static.Executor(self.place) + res = exe.run(feed={'X': self.x}, + fetch_list=[out1, out2, out3, out4, out5]) + out_ref = np.nanmean(self.x) + for out in res: + self.assertEqual(np.allclose(out, out_ref, rtol=1e-04), True) + + def test_api_dygraph(self): + paddle.disable_static(self.place) + + def test_case(x, axis=None, keepdim=False): + x_tensor = paddle.to_tensor(x) + out = paddle.nanmean(x_tensor, axis, keepdim) + if isinstance(axis, list): + axis = tuple(axis) + if len(axis) == 0: + axis = None + + out_ref = np.nanmean(x, axis, keepdims=keepdim) + if np.isnan(out_ref).sum(): + nan_mask = np.isnan(out_ref) + out_ref[nan_mask] = 0 + out_np = out.numpy() + out_np[nan_mask] = 0 + self.assertEqual(np.allclose(out_np, out_ref, rtol=1e-04), True) + else: + self.assertEqual( + np.allclose( + out.numpy(), out_ref, rtol=1e-04), True) + + test_case(self.x) + test_case(self.x, []) + test_case(self.x, -1) + test_case(self.x, keepdim=True) + test_case(self.x, 2, keepdim=True) + test_case(self.x, [0, 2]) + test_case(self.x, (0, 2)) + test_case(self.x, [0, 1, 2, 3]) + paddle.enable_static() + + def test_errors(self): + paddle.enable_static() + with paddle.static.program_guard(paddle.static.Program()): + x = paddle.fluid.data('X', [10, 12], 'int32') + self.assertRaises(TypeError, paddle.nanmean, x) + + def test_api_dygraph_grad(self): + paddle.disable_static(self.place) + + def test_case(x, axis=None, keepdim=False): + if isinstance(axis, list): + axis = list(axis) + if len(axis) == 0: + axis = None + x_tensor = paddle.to_tensor(x, stop_gradient=False) + y = paddle.nanmean(x_tensor, axis, keepdim) + dx = paddle.grad(y, x_tensor)[0].numpy() + sum_dx_ref = np.prod(y.shape) + if np.isnan(y.numpy()).sum(): + sum_dx_ref -= np.isnan(y.numpy()).sum() + cnt = paddle.sum(~paddle.isnan(x_tensor), + axis=axis, + keepdim=keepdim) + if (cnt == 0).sum(): + dx[np.isnan(dx)] = 0 + sum_dx = dx.sum() + self.assertEqual(np.allclose(sum_dx, sum_dx_ref, rtol=1e-04), True) + + test_case(self.x) + test_case(self.x, []) + test_case(self.x, -1) + test_case(self.x, keepdim=True) + test_case(self.x, 2, keepdim=True) + test_case(self.x, [0, 2]) + test_case(self.x, (0, 2)) + test_case(self.x, [0, 1, 2, 3]) + + test_case(self.x_grad) + test_case(self.x_grad, []) + test_case(self.x_grad, -1) + test_case(self.x_grad, keepdim=True) + test_case(self.x_grad, 0, keepdim=True) + test_case(self.x_grad, 1) + test_case(self.x_grad, (0, 1)) + paddle.enable_static() + + +if __name__ == "__main__": + unittest.main() diff 
--git a/python/paddle/fluid/tests/unittests/test_nearest_interp_op.py b/python/paddle/fluid/tests/unittests/test_nearest_interp_op.py index eda530f30df26..5df085d4febac 100755 --- a/python/paddle/fluid/tests/unittests/test_nearest_interp_op.py +++ b/python/paddle/fluid/tests/unittests/test_nearest_interp_op.py @@ -79,6 +79,7 @@ def setUp(self): self.data_layout = 'NCHW' self.init_test_case() self.op_type = "nearest_interp" + self.check_eager = True input_np = np.random.random(self.input_shape).astype("float64") if self.data_layout == "NCHW": @@ -101,8 +102,10 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False if self.actual_shape is not None: self.inputs['OutSize'] = self.actual_shape + self.check_eager = False self.attrs = { 'out_h': self.out_h, 'out_w': self.out_w, @@ -114,10 +117,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'nearest' @@ -231,6 +235,7 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "nearest_interp" + self.check_eager = True input_np = np.random.randint( low=0, high=256, size=self.input_shape).astype("uint8") @@ -247,6 +252,7 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False self.attrs = { 'out_h': self.out_h, 'out_w': self.out_w, @@ -257,7 +263,8 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output_with_place(place=core.CPUPlace(), atol=1) + self.check_output_with_place( + place=core.CPUPlace(), atol=1, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'nearest' @@ -339,6 +346,9 @@ def setUp(self): 'interp_method': self.interp_method, 'align_corners': self.align_corners, } + # NOTE(dev): some AsDispensible input is not used under imperative mode. + # Skip check_eager while found them in Inputs. 
+ self.check_eager = True input_np = np.random.random(self.input_shape).astype("float64") self.inputs = {'X': input_np} @@ -355,12 +365,14 @@ def setUp(self): if self.shape_by_1Dtensor: self.inputs['OutSize'] = self.out_size + self.check_eager = False elif self.out_size is not None: size_tensor = [] for index, ele in enumerate(self.out_size): size_tensor.append(("x" + str(index), np.ones( (1)).astype('int32') * ele)) self.inputs['SizeTensor'] = size_tensor + self.check_eager = False self.attrs['out_h'] = self.out_h self.attrs['out_w'] = self.out_w @@ -370,10 +382,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'nearest' @@ -495,4 +508,6 @@ def attr_scale_value(): if __name__ == "__main__": + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_nll_loss.py b/python/paddle/fluid/tests/unittests/test_nll_loss.py index a87d9052bd6d3..c53fdffe1cf1b 100644 --- a/python/paddle/fluid/tests/unittests/test_nll_loss.py +++ b/python/paddle/fluid/tests/unittests/test_nll_loss.py @@ -17,6 +17,7 @@ import numpy as np import unittest from op_test import OpTest +from paddle.fluid.framework import _test_eager_guard def nll_loss_1d(logs, targets, weight=None, reduction='mean', @@ -97,14 +98,21 @@ def test_NLLLoss_1D_mean(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss() dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() + with fluid.dygraph.guard(): + with _test_eager_guard(): + nll_loss = paddle.nn.loss.NLLLoss() + eager_res = nll_loss( + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) + eager_result = eager_res.numpy() + expected = nll_loss_1d(input_np, label_np)[0] self.assertTrue(np.allclose(static_result, expected)) self.assertTrue(np.allclose(static_result, dy_result)) self.assertTrue(np.allclose(dy_result, expected)) + self.assertTrue(np.allclose(eager_result, expected)) def test_NLLLoss_1D_sum(self): np.random.seed(200) @@ -132,14 +140,24 @@ def test_NLLLoss_1D_sum(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss(reduction='sum') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() + with _test_eager_guard(): + nll_loss = paddle.nn.loss.NLLLoss(reduction='sum') + in_t = paddle.to_tensor(input_np) + label = paddle.to_tensor(label_np) + in_t.stop_gradient = False + eager_res = nll_loss(in_t, label) + eager_result = eager_res.numpy() + loss = eager_res.sum() + loss.backward() + expected = nll_loss_1d(input_np, label_np, reduction='sum')[0] self.assertTrue(np.allclose(static_result, expected)) self.assertTrue(np.allclose(static_result, dy_result)) self.assertTrue(np.allclose(dy_result, expected)) + self.assertTrue(np.allclose(eager_result, expected)) def test_NLLLoss_1D_with_weight_mean(self): np.random.seed(200) @@ -170,16 +188,26 @@ def test_NLLLoss_1D_with_weight_mean(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np)) + weight=paddle.to_tensor(weight_np)) dy_res = nll_loss( 
- fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() + + with _test_eager_guard(): + nll_loss = paddle.nn.loss.NLLLoss( + weight=paddle.to_tensor(weight_np)) + eager_res = nll_loss( + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) + loss = eager_res.sum() + loss.backward() + eager_result = eager_res.numpy() + expected = nll_loss_1d(input_np, label_np, weight=weight_np)[0] self.assertTrue(np.allclose(static_result, expected)) self.assertTrue(np.allclose(static_result, dy_result)) self.assertTrue(np.allclose(dy_result, expected)) + self.assertTrue(np.allclose(eager_result, expected)) def test_NLLLoss_1D_with_weight_sum(self): np.random.seed(200) @@ -210,10 +238,9 @@ def test_NLLLoss_1D_with_weight_sum(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np), reduction='sum') + weight=paddle.to_tensor(weight_np), reduction='sum') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_1d( input_np, label_np, weight=weight_np, reduction='sum')[0] @@ -249,10 +276,9 @@ def test_NLLLoss_1D_with_weight_mean_cpu(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np)) + weight=paddle.to_tensor(weight_np)) dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_1d(input_np, label_np, weight=weight_np)[0] @@ -287,10 +313,9 @@ def test_NLLLoss_1D_with_weight_no_reduce_cpu(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np), reduction='none') + weight=paddle.to_tensor(weight_np), reduction='none') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_1d( input_np, label_np, weight=weight_np, reduction='none') @@ -326,8 +351,7 @@ def test_NLLLoss_2D_mean(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss() dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_2d(input_np, label_np)[0] @@ -363,8 +387,7 @@ def test_NLLLoss_2D_sum(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss(reduction='sum') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_2d(input_np, label_np, reduction='sum')[0] @@ -404,10 +427,9 @@ def test_NLLLoss_2D_with_weight_mean(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np)) + weight=paddle.to_tensor(weight_np)) dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_2d(input_np, label_np, weight=weight_np)[0] @@ -445,10 +467,9 @@ def test_NLLLoss_2D_with_weight_mean_cpu(self): with fluid.dygraph.guard(): nll_loss = 
paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np)) + weight=paddle.to_tensor(weight_np)) dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_2d(input_np, label_np, weight=weight_np)[0] @@ -487,10 +508,9 @@ def test_NLLLoss_2D_with_weight_sum(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np), reduction='sum') + weight=paddle.to_tensor(weight_np), reduction='sum') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() expected = nll_loss_2d( @@ -527,8 +547,7 @@ def test_NLLLoss_in_dims_not_2or4_mean(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss() dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() input_shape = input_np.shape @@ -572,10 +591,9 @@ def test_NLLLoss_in_dims_not_2or4_with_weight_mean(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np)) + weight=paddle.to_tensor(weight_np)) dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() input_shape = input_np.shape @@ -620,10 +638,9 @@ def test_NLLLoss_in_dims_not_2or4_with_weight_sum(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np), reduction='sum') + weight=paddle.to_tensor(weight_np), reduction='sum') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() input_shape = input_np.shape @@ -671,10 +688,9 @@ def test_NLLLoss_in_dims_not_2or4_with_weight_no_reduce(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np), reduction='none') + weight=paddle.to_tensor(weight_np), reduction='none') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() input_shape = input_np.shape @@ -721,10 +737,9 @@ def test_NLLLoss_in_dims_not_2or4_with_weight_no_reduce_cpu(self): with fluid.dygraph.guard(): nll_loss = paddle.nn.loss.NLLLoss( - weight=fluid.dygraph.to_variable(weight_np), reduction='none') + weight=paddle.to_tensor(weight_np), reduction='none') dy_res = nll_loss( - fluid.dygraph.to_variable(input_np), - fluid.dygraph.to_variable(label_np)) + paddle.to_tensor(input_np), paddle.to_tensor(label_np)) dy_result = dy_res.numpy() input_shape = input_np.shape @@ -748,7 +763,11 @@ class TestNLLLossOp1DWithReduce(OpTest): def setUp(self): self.init_test_case() self.op_type = "nll_loss" + self.python_api = paddle.nn.functional.nll_loss + self.python_out_sig = ["Out"] self.with_weight = False + self.python_api = paddle.nn.functional.nll_loss + self.python_out_sig = ["Out"] np.random.seed(200) input_np = np.random.uniform(0.1, 0.8, self.input_shape).astype("float64") @@ -769,19 +788,19 @@ def setUp(self): self.attrs = {'reduction': 'mean', 'ignore_index': -100} def test_check_output(self): 
- self.check_output() + self.check_output(check_eager=True) def test_check_output_with_weight(self): self.with_weight = True - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.with_weight = True place = fluid.CPUPlace() - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) if fluid.core.is_compiled_with_cuda(): place = fluid.CUDAPlace(0) - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def init_test_case(self): self.input_shape = [10, 10] @@ -792,6 +811,8 @@ class TestNLLLossOp1DNoReduce(OpTest): def setUp(self): self.init_test_case() self.op_type = "nll_loss" + self.python_api = paddle.nn.functional.nll_loss + self.python_out_sig = ["Out"] self.with_weight = False np.random.seed(200) input_np = np.random.uniform(0.1, 0.8, @@ -814,19 +835,19 @@ def setUp(self): self.attrs = {'reduction': 'none', 'ignore_index': -100} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_output_with_weight(self): self.with_weight = True - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.with_weight = True place = fluid.CPUPlace() - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) if fluid.core.is_compiled_with_cuda(): place = fluid.CUDAPlace(0) - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def init_test_case(self): self.input_shape = [10, 10] @@ -837,6 +858,8 @@ class TestNLLLossOp2DWithReduce(OpTest): def setUp(self): self.init_test_case() self.op_type = "nll_loss" + self.python_api = paddle.nn.functional.nll_loss + self.python_out_sig = ["Out"] self.with_weight = False np.random.seed(200) input_np = np.random.uniform(0.1, 0.8, @@ -858,19 +881,19 @@ def setUp(self): self.attrs = {'reduction': 'mean', 'ignore_index': -100} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_output_with_weight(self): self.with_weight = True - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.with_weight = True place = fluid.CPUPlace() - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) if fluid.core.is_compiled_with_cuda(): place = fluid.CUDAPlace(0) - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def init_test_case(self): self.input_shape = [2, 3, 5, 5] @@ -881,6 +904,8 @@ class TestNLLLossOp2DNoReduce(OpTest): def setUp(self): self.init_test_case() self.op_type = "nll_loss" + self.python_api = paddle.nn.functional.nll_loss + self.python_out_sig = ["Out"] self.with_weight = False np.random.seed(200) input_np = np.random.uniform(0.1, 0.8, @@ -903,19 +928,19 @@ def setUp(self): self.attrs = {'reduction': 'none', 'ignore_index': -100} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_output_with_weight(self): self.with_weight = True - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.with_weight = True place = fluid.CPUPlace() - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) if fluid.core.is_compiled_with_cuda(): place = fluid.CUDAPlace(0) - 
self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def init_test_case(self): self.input_shape = [5, 3, 5, 5] @@ -1014,4 +1039,5 @@ def test_nll_loss_function_reduction_imperative_not_sum_mean_none(): if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_nms_op.py b/python/paddle/fluid/tests/unittests/test_nms_op.py new file mode 100644 index 0000000000000..1b5ac1f1337d0 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_nms_op.py @@ -0,0 +1,92 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +import numpy as np +from op_test import OpTest + + +def iou(box_a, box_b): + """Apply intersection-over-union overlap between box_a and box_b + """ + xmin_a = min(box_a[0], box_a[2]) + ymin_a = min(box_a[1], box_a[3]) + xmax_a = max(box_a[0], box_a[2]) + ymax_a = max(box_a[1], box_a[3]) + + xmin_b = min(box_b[0], box_b[2]) + ymin_b = min(box_b[1], box_b[3]) + xmax_b = max(box_b[0], box_b[2]) + ymax_b = max(box_b[1], box_b[3]) + + area_a = (ymax_a - ymin_a) * (xmax_a - xmin_a) + area_b = (ymax_b - ymin_b) * (xmax_b - xmin_b) + if area_a <= 0 and area_b <= 0: + return 0.0 + + xa = max(xmin_a, xmin_b) + ya = max(ymin_a, ymin_b) + xb = min(xmax_a, xmax_b) + yb = min(ymax_a, ymax_b) + + inter_area = max(xb - xa, 0.0) * max(yb - ya, 0.0) + + iou_ratio = inter_area / (area_a + area_b - inter_area) + return iou_ratio + + +def nms(boxes, nms_threshold): + selected_indices = np.zeros(boxes.shape[0], dtype=np.int64) + keep = np.ones(boxes.shape[0], dtype=int) + io_ratio = np.ones((boxes.shape[0], boxes.shape[0]), dtype=np.float64) + cnt = 0 + for i in range(boxes.shape[0]): + if keep[i] == 0: + continue + selected_indices[cnt] = i + cnt += 1 + for j in range(i + 1, boxes.shape[0]): + io_ratio[i][j] = iou(boxes[i], boxes[j]) + if keep[j]: + overlap = iou(boxes[i], boxes[j]) + keep[j] = 1 if overlap <= nms_threshold else 0 + else: + continue + + return selected_indices + + +class TestNMSOp(OpTest): + def setUp(self): + self.op_type = 'nms' + self.dtype = np.float64 + self.init_dtype_type() + boxes = np.random.rand(32, 4).astype(self.dtype) + boxes[:, 2] = boxes[:, 0] + boxes[:, 2] + boxes[:, 3] = boxes[:, 1] + boxes[:, 3] + + self.inputs = {'Boxes': boxes} + self.attrs = {'iou_threshold': 0.5} + out_py = nms(boxes, self.attrs['iou_threshold']) + self.outputs = {'KeepBoxesIdxs': out_py} + + def init_dtype_type(self): + pass + + def test_check_output(self): + self.check_output() + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_nn_grad.py b/python/paddle/fluid/tests/unittests/test_nn_grad.py index 722926b0d77f7..55f87540c1b8a 100644 --- a/python/paddle/fluid/tests/unittests/test_nn_grad.py +++ b/python/paddle/fluid/tests/unittests/test_nn_grad.py @@ -27,6 +27,7 @@ class TestSliceOpDoubleGradCheck(unittest.TestCase): + @prog_scope() def 
func(self, place): self.config() diff --git a/python/paddle/fluid/tests/unittests/test_norm_all.py b/python/paddle/fluid/tests/unittests/test_norm_all.py index ef912699455d1..5b0a9599bf84e 100644 --- a/python/paddle/fluid/tests/unittests/test_norm_all.py +++ b/python/paddle/fluid/tests/unittests/test_norm_all.py @@ -20,6 +20,24 @@ import paddle import paddle.fluid as fluid import paddle.fluid.core as core +from paddle import _C_ops +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph + + +# hack method for test p_norm final state +def p_norm_python_api(x, + p=2.0, + axis=-1, + epsilon=1e-12, + keepdim=False, + as_vector=False): + if in_dygraph_mode(): + return _C_ops.final_state_p_norm(x, p, axis, epsilon, keepdim, + as_vector) + if _in_legacy_dygraph(): + return _C_ops.p_norm(x, 'axis', axis, 'porder', + float(p), 'keepdim', keepdim, 'epsilon', epsilon, + 'as_vector', as_vector) def p_norm(x, axis, porder, keepdims=False, reduce_all=False): @@ -68,8 +86,13 @@ def frobenius_norm(x, axis=None, keepdims=False): return r +def final_state_frobenius_norm(x, dim, keep_dim, reduce_all): + return paddle.linalg.norm(x, p='fro', axis=dim, keepdim=keep_dim) + + class TestFrobeniusNormOp(OpTest): def setUp(self): + self.python_api = final_state_frobenius_norm self.op_type = "frobenius_norm" self.init_test_case() x = (np.random.random(self.shape) + 1.0).astype(self.dtype) @@ -84,10 +107,10 @@ def setUp(self): self.outputs = {'Out': norm} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def init_test_case(self): self.shape = [2, 3, 4, 5] @@ -104,12 +127,13 @@ def init_test_case(self): self.dtype = "float32" def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestPnormOp(OpTest): def setUp(self): self.op_type = "p_norm" + self.python_api = p_norm_python_api self.init_test_case() x = (np.random.random(self.shape) + 0.5).astype(self.dtype) norm = p_norm(x, self.axis, self.porder, self.keepdim, self.asvector) @@ -125,10 +149,10 @@ def setUp(self): self.gradient = self.calc_gradient() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def init_test_case(self): self.shape = [2, 3, 4, 5] @@ -287,6 +311,7 @@ def init_test_case(self): class TestPnormBF16Op(OpTest): def setUp(self): self.op_type = "p_norm" + self.python_api = p_norm_python_api self.init_test_case() self.x = (np.random.random(self.shape) + 0.5).astype(np.float32) self.norm = p_norm(self.x, self.axis, self.porder, self.keepdim, @@ -304,12 +329,15 @@ def setUp(self): def test_check_output(self): place = core.CUDAPlace(0) - self.check_output_with_place(place, atol=1e-3) + self.check_output_with_place(place, atol=1e-3, check_eager=True) def test_check_grad(self): place = core.CUDAPlace(0) self.check_grad_with_place( - place, ['X'], 'Out', user_defined_grads=self.gradient) + place, ['X'], + 'Out', + user_defined_grads=self.gradient, + check_eager=True) def init_test_case(self): self.shape = [2, 3, 4, 5] diff --git a/python/paddle/fluid/tests/unittests/test_norm_op.py b/python/paddle/fluid/tests/unittests/test_norm_op.py index 626de9b12b9c1..49e1f2533491d 100644 --- a/python/paddle/fluid/tests/unittests/test_norm_op.py +++ b/python/paddle/fluid/tests/unittests/test_norm_op.py @@ 
-32,6 +32,7 @@ def l2_norm(x, axis, epsilon): class TestNormOp(OpTest): def setUp(self): self.op_type = "norm" + self.python_api = paddle.fluid.layers.l2_normalize self.init_test_case() self.init_dtype() x = np.random.random(self.shape).astype(self.dtype) diff --git a/python/paddle/fluid/tests/unittests/test_normalize.py b/python/paddle/fluid/tests/unittests/test_normalize.py index 274a4ebee7c3c..2f52ae391c7de 100644 --- a/python/paddle/fluid/tests/unittests/test_normalize.py +++ b/python/paddle/fluid/tests/unittests/test_normalize.py @@ -20,6 +20,7 @@ import paddle.fluid as fluid import paddle.fluid.core as core import numpy as np +from paddle.fluid.framework import _test_eager_guard def p_normalize(x, axis=1, p=2, epsilon=1e-12, keepdims=True): @@ -87,6 +88,12 @@ def test_cpu(self): with fluid.program_guard(fluid.Program()): self.run_static() + def test_cpu_eager(self): + with _test_eager_guard(): + paddle.disable_static(place=paddle.fluid.CPUPlace()) + self.run_imperative() + paddle.enable_static() + def test_gpu(self): if not fluid.core.is_compiled_with_cuda(): return @@ -98,6 +105,15 @@ def test_gpu(self): with fluid.program_guard(fluid.Program()): self.run_static(use_gpu=True) + def test_gpu_eager(self): + with _test_eager_guard(): + if not fluid.core.is_compiled_with_cuda(): + return + + paddle.disable_static(place=paddle.fluid.CUDAPlace(0)) + self.run_imperative() + paddle.enable_static() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_one_hot_v2_op.py b/python/paddle/fluid/tests/unittests/test_one_hot_v2_op.py index f65281713a2db..b16c4b5ce69e1 100644 --- a/python/paddle/fluid/tests/unittests/test_one_hot_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_one_hot_v2_op.py @@ -117,24 +117,6 @@ def test_check_output(self): self.check_output() -class TestOneHotOp_out_of_range(OpTest): - def setUp(self): - self.op_type = 'one_hot_v2' - depth = 10 - x_lod = [[4, 1, 3, 3]] - x = [np.random.choice([-1, depth]) for i in range(sum(x_lod[0]))] - x = np.array(x).astype('int32').reshape([sum(x_lod[0])]) - - out = np.zeros(shape=(np.product(x.shape), depth)).astype('float32') - - self.inputs = {'X': (x, x_lod)} - self.attrs = {'depth': depth, 'allow_out_of_range': True} - self.outputs = {'Out': (out, x_lod)} - - def test_check_output(self): - self.check_output() - - class TestOneHotOp_exception(unittest.TestCase): def setUp(self): self.op_type = 'one_hot_v2' diff --git a/python/paddle/fluid/tests/unittests/test_op_function_generator.py b/python/paddle/fluid/tests/unittests/test_op_function_generator.py index 216deddb9ef98..c712b5db0f31f 100644 --- a/python/paddle/fluid/tests/unittests/test_op_function_generator.py +++ b/python/paddle/fluid/tests/unittests/test_op_function_generator.py @@ -15,7 +15,7 @@ from __future__ import print_function import unittest -from paddle.fluid.framework import default_main_program, Program, convert_np_dtype_to_dtype_, _non_static_mode +from paddle.fluid.framework import default_main_program, Program, convert_np_dtype_to_dtype_, _non_static_mode, in_dygraph_mode import paddle.fluid as fluid import paddle.fluid.layers as layers import paddle.fluid.core as core @@ -92,6 +92,8 @@ def test_trace_backward(self): self.assertTrue(np.array_equal(y_grad, loss.gradient() * a)) def test_traced_layer(self): + if in_dygraph_mode(): + return with fluid.dygraph.guard(): layer = TestTracedLayer("test_traced_layer") a = np.random.uniform(-1, 1, self.shape).astype(self.dtype) diff --git 
a/python/paddle/fluid/tests/unittests/test_ops_nms.py b/python/paddle/fluid/tests/unittests/test_ops_nms.py new file mode 100644 index 0000000000000..c0bbe82d3581a --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_ops_nms.py @@ -0,0 +1,190 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import unittest +import numpy as np +import paddle +from test_nms_op import nms + + +def _find(condition): + """ + Find the indices of elements that satisfy the condition. + + Args: + condition(Tensor[N] or np.ndarray([N,])): Element should be bool type. + + Returns: + Tensor: Indices of True elements. + """ + res = [] + for i in range(condition.shape[0]): + if condition[i]: + res.append(i) + return np.array(res) + + +def multiclass_nms(boxes, scores, category_idxs, iou_threshold, top_k): + mask = np.zeros_like(scores) + + for category_id in np.unique(category_idxs): + cur_category_boxes_idxs = _find(category_idxs == category_id) + cur_category_boxes = boxes[cur_category_boxes_idxs] + cur_category_scores = scores[cur_category_boxes_idxs] + cur_category_sorted_indices = np.argsort(-cur_category_scores) + cur_category_sorted_boxes = cur_category_boxes[ + cur_category_sorted_indices] + + cur_category_keep_boxes_sub_idxs = cur_category_sorted_indices[nms( + cur_category_sorted_boxes, iou_threshold)] + + mask[cur_category_boxes_idxs[cur_category_keep_boxes_sub_idxs]] = True + + keep_boxes_idxs = _find(mask == True) + topK_sub_indices = np.argsort(-scores[keep_boxes_idxs])[:top_k] + return keep_boxes_idxs[topK_sub_indices] + + +def gen_args(num_boxes, dtype): + boxes = np.random.rand(num_boxes, 4).astype(dtype) + boxes[:, 2] = boxes[:, 0] + boxes[:, 2] + boxes[:, 3] = boxes[:, 1] + boxes[:, 3] + + scores = np.random.rand(num_boxes).astype(dtype) + + categories = [0, 1, 2, 3] + category_idxs = np.random.choice(categories, num_boxes) + + return boxes, scores, category_idxs, categories + + +class TestOpsNMS(unittest.TestCase): + def setUp(self): + self.num_boxes = 64 + self.threshold = 0.5 + self.topk = 20 + self.dtypes = ['float32'] + self.devices = ['cpu'] + if paddle.is_compiled_with_cuda(): + self.devices.append('gpu') + + def test_nms(self): + for device in self.devices: + for dtype in self.dtypes: + boxes, scores, category_idxs, categories = gen_args( + self.num_boxes, dtype) + paddle.set_device(device) + out = paddle.vision.ops.nms( + paddle.to_tensor(boxes), self.threshold, + paddle.to_tensor(scores)) + out = paddle.vision.ops.nms( + paddle.to_tensor(boxes), self.threshold) + out_py = nms(boxes, self.threshold) + + self.assertTrue( + np.array_equal(out.numpy(), out_py), + "paddle out: {}\n py out: {}\n".format(out, out_py)) + + def test_multiclass_nms_dynamic(self): + for device in self.devices: + for dtype in self.dtypes: + boxes, scores, category_idxs, categories = gen_args( + self.num_boxes, dtype) + paddle.set_device(device) + out = paddle.vision.ops.nms( + paddle.to_tensor(boxes), self.threshold, + paddle.to_tensor(scores), +
paddle.to_tensor(category_idxs), categories, self.topk) + out_py = multiclass_nms(boxes, scores, category_idxs, + self.threshold, self.topk) + + self.assertTrue( + np.array_equal(out.numpy(), out_py), + "paddle out: {}\n py out: {}\n".format(out, out_py)) + + def test_multiclass_nms_static(self): + for device in self.devices: + for dtype in self.dtypes: + paddle.enable_static() + boxes, scores, category_idxs, categories = gen_args( + self.num_boxes, dtype) + boxes_static = paddle.static.data( + shape=boxes.shape, dtype=boxes.dtype, name="boxes") + scores_static = paddle.static.data( + shape=scores.shape, dtype=scores.dtype, name="scores") + category_idxs_static = paddle.static.data( + shape=category_idxs.shape, + dtype=category_idxs.dtype, + name="category_idxs") + out = paddle.vision.ops.nms(boxes_static, self.threshold, + scores_static, category_idxs_static, + categories, self.topk) + place = paddle.CPUPlace() + if device == 'gpu': + place = paddle.CUDAPlace(0) + exe = paddle.static.Executor(place) + out = exe.run(paddle.static.default_main_program(), + feed={ + 'boxes': boxes, + 'scores': scores, + 'category_idxs': category_idxs + }, + fetch_list=[out]) + paddle.disable_static() + out_py = multiclass_nms(boxes, scores, category_idxs, + self.threshold, self.topk) + out = np.array(out) + out = np.squeeze(out) + self.assertTrue( + np.array_equal(out, out_py), + "paddle out: {}\n py out: {}\n".format(out, out_py)) + + def test_multiclass_nms_dynamic_to_static(self): + for device in self.devices: + for dtype in self.dtypes: + paddle.set_device(device) + + def fun(x): + scores = np.arange(0, 64).astype('float32') + categories = np.array([0, 1, 2, 3]) + category_idxs = categories.repeat(16) + out = paddle.vision.ops.nms(x, 0.1, + paddle.to_tensor(scores), + paddle.to_tensor(category_idxs), + categories, 10) + return out + + path = "./net" + boxes = np.random.rand(64, 4).astype('float32') + boxes[:, 2] = boxes[:, 0] + boxes[:, 2] + boxes[:, 3] = boxes[:, 1] + boxes[:, 3] + + origin = fun(paddle.to_tensor(boxes)) + paddle.jit.save( + fun, + path, + input_spec=[ + paddle.static.InputSpec( + shape=[None, 4], dtype='float32', name='x') + ], ) + load_func = paddle.jit.load(path) + res = load_func(paddle.to_tensor(boxes)) + self.assertTrue( + np.array_equal(origin, res), + "origin out: {}\n inference model out: {}\n".format(origin, + res)) + + +if __name__ == '__main__': + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_pad3d_op.py b/python/paddle/fluid/tests/unittests/test_pad3d_op.py index 7abc314bc1ba0..12f6f7b572108 100644 --- a/python/paddle/fluid/tests/unittests/test_pad3d_op.py +++ b/python/paddle/fluid/tests/unittests/test_pad3d_op.py @@ -30,6 +30,7 @@ def setUp(self): self.variable_paddings = False self.initTestCase() self.op_type = "pad3d" + self.python_api = paddle.nn.functional.pad self.inputs = {'X': np.random.random(self.shape).astype("float64")} self.attrs = {} if self.variable_paddings: @@ -72,10 +73,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def initTestCase(self): self.shape = (2, 3, 4, 5, 6) diff --git a/python/paddle/fluid/tests/unittests/test_paddle_multiprocessing.py b/python/paddle/fluid/tests/unittests/test_paddle_multiprocessing.py index 1e31356a6bc81..7825b13001f28 100644 --- a/python/paddle/fluid/tests/unittests/test_paddle_multiprocessing.py 
+++ b/python/paddle/fluid/tests/unittests/test_paddle_multiprocessing.py @@ -19,6 +19,7 @@ import time import paddle import paddle.incubate.multiprocessing as mp +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph, in_dygraph_mode REPEAT = 20 HAS_SHM_FILES = os.path.isdir('/dev/shm') @@ -174,26 +175,54 @@ def test_receive(): class TestMultiprocessingCpu(TestMultiprocessingBase): - def test_pass_tensor(self): + def func_test_pass_tensor(self): + if in_dygraph_mode(): + return paddle.set_device("cpu") self._test_sharing(repeat=REPEAT) - def test_pass_parambase(self): + def test_pass_tensor(self): + with _test_eager_guard(): + self.func_test_pass_tensor() + self.func_test_pass_tensor() + + def func_test_pass_parambase(self): + if in_dygraph_mode(): + return paddle.set_device("cpu") self._test_sharing(repeat=1, param=True) - def test_pass_empty(self): + def test_pass_parambase(self): + with _test_eager_guard(): + self.func_test_pass_parambase() + self.func_test_pass_parambase() + + def func_test_pass_empty(self): + if in_dygraph_mode(): + return paddle.set_device("cpu") self._test_empty() + def test_pass_empty(self): + with _test_eager_guard(): + self.func_test_pass_empty() + self.func_test_pass_empty() + class TestMultiprocessingGpu(TestMultiprocessingBase): @unittest.skipIf(not paddle.fluid.core.is_compiled_with_cuda(), "core is not compiled with CUDA") - def test_pass_tensor(self): + def func_test_pass_tensor(self): + if in_dygraph_mode(): + return paddle.set_device("gpu") self._test_sharing(mp.get_context("spawn"), "gpu") + def test_pass_tensor(self): + with _test_eager_guard(): + self.func_test_pass_tensor() + self.func_test_pass_tensor() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_control_flow_in_eager_mode.py b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_control_flow_in_eager_mode.py deleted file mode 100644 index dde0c4b260cca..0000000000000 --- a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_control_flow_in_eager_mode.py +++ /dev/null @@ -1,84 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from __future__ import print_function - -import os -import sys -import unittest - -import paddle.fluid as fluid -from test_dist_base import TestDistBase -from spawn_runner_base import TestDistSpawnRunner - -flag_name = os.path.splitext(__file__)[0] - - -class TestDygraphControlFlowSameEager(TestDistBase): - def _setup_config(self): - self._sync_mode = False - self._nccl2_mode = True - self._eager_mode = True - self._dygraph = True - self._find_unused_parameters = True - - def test_net(self): - if fluid.core.is_compiled_with_cuda(): - self.check_with_place( - "parallel_dygraph_control_flow_same.py", - delta=1e-5, - check_error_log=True, - log_name=flag_name) - - -class TestDygraphControlFlowSameAccGradEager(TestDygraphControlFlowSameEager): - def _setup_config(self): - self._sync_mode = False - self._nccl2_mode = True - self._eager_mode = True - self._dygraph = True - self._accumulate_gradient = True - self._find_unused_parameters = True - - -class TestDygraphControlFlowDiffEager(TestDistBase): - def _setup_config(self): - self._sync_mode = False - self._nccl2_mode = True - self._eager_mode = True - self._dygraph = True - self._find_unused_parameters = True - - def test_net(self): - if fluid.core.is_compiled_with_cuda(): - self.check_with_place( - "parallel_dygraph_control_flow_different.py", - delta=1e-5, - check_error_log=True, - log_name=flag_name) - - -class TestFleetDygraphControlFlowDiffAccGradEager( - TestDygraphControlFlowDiffEager): - def _setup_config(self): - self._sync_mode = False - self._nccl2_mode = True - self._eager_mode = True - self._dygraph = True - self._accumulate_gradient = True - self._find_unused_parameters = True - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel.py b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel.py index cbf08856e7eff..59013236967db 100644 --- a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel.py +++ b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel.py @@ -23,6 +23,7 @@ import subprocess from paddle.distributed.utils import find_free_ports, watch_local_trainers, get_cluster, TrainerProc +from paddle.fluid.framework import _test_eager_guard def get_cluster_from_args(selected_gpus): @@ -205,14 +206,11 @@ def test_multiple_gpus_dynamic(self): class TestDataParallelWithPyLayer(TestMultipleGpus): def test_parallel_dygraph_dataparallel_with_pylayer(self): + with _test_eager_guard(): + self.run_mnist_2gpu('parallel_dygraph_dataparallel_with_pylayer.py') self.run_mnist_2gpu('parallel_dygraph_dataparallel_with_pylayer.py') -class TestDataParallelInEagerMode(TestMultipleGpus): - def test_multiple_gpus_dynamic(self): - self.run_mnist_2gpu('parallel_dygraph_dataparallel_in_eager_mode.py') - - class TestGradientCheckInEagerMode(TestMultipleGpus): def test_multiple_gpus_dynamic(self): self.run_mnist_2gpu('parallel_dygraph_gradient_check_in_eager_mode.py') diff --git a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel_cpuonly.py b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel_cpuonly.py index 587824a1dc74c..ce67a2ce4d209 100644 --- a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel_cpuonly.py +++ b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_dataparallel_cpuonly.py @@ -70,6 +70,9 @@ def start_local_trainers(cluster, "PADDLE_CURRENT_ENDPOINT": "%s" % t.endpoint, "PADDLE_TRAINERS_NUM": "%d" % cluster.trainers_nranks(), 
"PADDLE_TRAINER_ENDPOINTS": ",".join(cluster.trainers_endpoints()), + "MASTER_ADDR": "127.0.0.1", + "MASTER_PORT": "6170", + "NCCL_DEBUG": "INFO", "PADDLE_DISTRI_BACKEND": "gloo", # make init_parallel_env get 'gloo' argument. } @@ -133,7 +136,7 @@ def test_multiple_gpus_dynamic(self): class TestDataParallelGradientCheckInEagerMode(TestMultipleGpus): def test_multiple_gpus_dynamic(self): - self.run_mnist_2gpu('parallel_dygraph_dataparallel_in_eager_mode.py') + self.run_mnist_2gpu('parallel_dygraph_gradient_check_in_eager_mode.py') if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_no_sync_in_eager_mode.py b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_no_sync_in_eager_mode.py deleted file mode 100644 index d0e7d413952b7..0000000000000 --- a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_no_sync_in_eager_mode.py +++ /dev/null @@ -1,111 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import print_function - -import os -import sys -import unittest - -import paddle.fluid as fluid -from test_dist_base import TestDistBase -from spawn_runner_base import TestDistSpawnRunner -from parallel_dygraph_no_sync import TestNoSync -from parallel_dygraph_no_sync_unused_params import TestNoSyncUnusedParam -from parallel_dygraph_no_sync_control_flow import TestNoSyncControlFlow - -flag_name = os.path.splitext(__file__)[0] - - -class TestParallelDygraphNoSync(TestDistBase): - def _setup_config(self): - self._sync_mode = False - self._eager_mode = True - self._nccl2_mode = True - self._dygraph = True - self._find_unused_parameters = False - - def test_no_sync(self): - if fluid.core.is_compiled_with_cuda(): - self.check_with_place( - "parallel_dygraph_no_sync.py", - delta=1e-5, - check_error_log=True, - log_name=flag_name) - - -class TestParallelDygraphNoSyncUnusedParam(TestDistBase): - def _setup_config(self): - self._sync_mode = False - self._eager_mode = True - self._nccl2_mode = True - self._dygraph = True - self._find_unused_parameters = True - - def test_no_sync_ununsed_param(self): - if fluid.core.is_compiled_with_cuda(): - self.check_with_place( - "parallel_dygraph_no_sync_unused_params.py", - delta=1e-5, - check_error_log=True, - log_name=flag_name) - - -class TestParallelDygraphNoSyncControlFlow(TestDistBase): - def _setup_config(self): - self._sync_mode = False - self._eager_mode = True - self._nccl2_mode = True - self._dygraph = True - self._find_unused_parameters = True - - def test_no_sync_control_flow(self): - if fluid.core.is_compiled_with_cuda(): - self.check_with_place( - "parallel_dygraph_no_sync_control_flow.py", - delta=1e-5, - check_error_log=True, - log_name=flag_name) - - -class TestParallelDygraphNoSyncSpawn(TestDistSpawnRunner): - def test_no_sync_with_spawn(self): - if fluid.core.is_compiled_with_cuda() and sys.version_info >= (3, 4): - self.check_dist_result_with_spawn(test_class=TestNoSync, delta=1e-5) - - -class 
TestParallelDygraphNoSyncUnusedParamSpawn(TestDistSpawnRunner): - def _args_config(self, args): - args.find_unused_parameters = True - args.eager_mode = True - - def test_no_sync_with_spawn(self): - if fluid.core.is_compiled_with_cuda() and sys.version_info >= (3, 4): - self.check_dist_result_with_spawn( - test_class=TestNoSyncUnusedParam, delta=1e-5) - - -class TestParallelDygraphNoSyncControlFlowSpawn(TestDistSpawnRunner): - def _args_config(self, args): - args.find_unused_parameters = True - args.eager_mode = True - - def test_no_sync_with_spawn(self): - if fluid.core.is_compiled_with_cuda() and sys.version_info >= (3, 4): - self.check_dist_result_with_spawn( - test_class=TestNoSyncControlFlow, delta=1e-5) - - -if __name__ == "__main__": - unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_transformer.py b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_transformer.py index e0aab8541a542..71a8c7347e162 100644 --- a/python/paddle/fluid/tests/unittests/test_parallel_dygraph_transformer.py +++ b/python/paddle/fluid/tests/unittests/test_parallel_dygraph_transformer.py @@ -41,13 +41,6 @@ def test_transformer(self): log_name=flag_name) -class TestParallelDygraphTransformerSpawn(TestDistSpawnRunner): - def test_transformer_with_spawn(self): - if fluid.core.is_compiled_with_cuda() and sys.version_info >= (3, 4): - self.check_dist_result_with_spawn( - test_class=TestTransformer, delta=1e-5) - - class TestParallelDygraphTransformerAccGrad(TestDistBase): def _setup_config(self): self._sync_mode = False diff --git a/python/paddle/fluid/tests/unittests/test_parameter.py b/python/paddle/fluid/tests/unittests/test_parameter.py index 85ba69cd438a7..61d75fca2745e 100644 --- a/python/paddle/fluid/tests/unittests/test_parameter.py +++ b/python/paddle/fluid/tests/unittests/test_parameter.py @@ -18,7 +18,7 @@ import copy import paddle from paddle.fluid.dygraph import guard -from paddle.fluid.framework import default_main_program, Variable +from paddle.fluid.framework import default_main_program, Variable, _test_eager_guard import paddle.fluid.core as core from paddle.fluid.executor import Executor import paddle.fluid.io as io @@ -50,7 +50,7 @@ def test_parameter(self): p = io.get_parameter_value_by_name('fc.w', exe, main_program) self.assertTrue(np.array_equal(p, np.ones(shape) * val)) - def test_parambase(self): + def func_parambase(self): with guard(): linear = paddle.nn.Linear(10, 10) param = linear.weight @@ -72,7 +72,12 @@ def test_parambase(self): pram_copy2 = copy.deepcopy(param, memo) self.assertEqual(id(param_copy), id(pram_copy2)) - def test_exception(self): + def test_parambase(self): + with _test_eager_guard(): + self.func_parambase() + self.func_parambase() + + def func_exception(self): b = main_program.global_block() with self.assertRaises(ValueError): b.create_parameter( @@ -87,7 +92,7 @@ def test_exception(self): b.create_parameter( name='test', shape=[-1], dtype='float32', initializer=None) - def test_parambase_to_vector(self): + def func_parambase_to_vector(self): with guard(): initializer = paddle.ParamAttr( initializer=paddle.nn.initializer.Constant(3.)) @@ -112,6 +117,11 @@ def test_parambase_to_vector(self): self.assertTrue(linear2.weight.is_leaf, True) self.assertTrue(linear2.bias.is_leaf, True) + def test_parambase_to_vector(self): + with _test_eager_guard(): + self.func_parambase_to_vector() + self.func_parambase_to_vector() + if __name__ == '__main__': unittest.main() diff --git 
a/python/paddle/fluid/tests/unittests/test_pixel_shuffle.py b/python/paddle/fluid/tests/unittests/test_pixel_shuffle.py index f1a409c712fc3..06d975fe2b88f 100644 --- a/python/paddle/fluid/tests/unittests/test_pixel_shuffle.py +++ b/python/paddle/fluid/tests/unittests/test_pixel_shuffle.py @@ -52,6 +52,7 @@ def pixel_shuffle_np(x, up_factor, data_format="NCHW"): class TestPixelShuffleOp(OpTest): def setUp(self): self.op_type = "pixel_shuffle" + self.python_api = paddle.nn.functional.pixel_shuffle self.init_data_format() n, c, h, w = 2, 9, 4, 4 @@ -73,10 +74,10 @@ def init_data_format(self): self.format = "NCHW" def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestChannelLast(TestPixelShuffleOp): @@ -220,4 +221,5 @@ def error_data_format_layer(): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_poisson_op.py b/python/paddle/fluid/tests/unittests/test_poisson_op.py index 2123d4e0e7e35..f8183bb5f8db2 100644 --- a/python/paddle/fluid/tests/unittests/test_poisson_op.py +++ b/python/paddle/fluid/tests/unittests/test_poisson_op.py @@ -107,9 +107,6 @@ def test_fixed_random_number(self): if not paddle.is_compiled_with_cuda(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on GPU------>") paddle.disable_static() paddle.set_device('gpu') diff --git a/python/paddle/fluid/tests/unittests/test_pool1d_api.py b/python/paddle/fluid/tests/unittests/test_pool1d_api.py index 9e7b0c8a1efa7..e1cfcc3f06602 100644 --- a/python/paddle/fluid/tests/unittests/test_pool1d_api.py +++ b/python/paddle/fluid/tests/unittests/test_pool1d_api.py @@ -12,16 +12,15 @@ # See the License for the specific language governing permissions and # limitations under the License. -import numpy as np +import paddle import unittest +import paddle.fluid as fluid +import paddle.fluid.core as core +import paddle.nn.functional as F import numpy as np from op_test import OpTest -import paddle.fluid.core as core -import paddle.fluid as fluid from paddle.fluid import compiler, Program, program_guard -import paddle -import paddle.nn.functional as F -import paddle.fluid as fluid +from paddle.fluid.framework import _test_eager_guard def adaptive_start_index(index, input_size, output_size): @@ -244,6 +243,10 @@ def test_pool1d(self): self.check_avg_dygraph_padding_same(place) self.check_max_dygraph_return_index_results(place) + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_pool1d() + class TestPool2DError_API(unittest.TestCase): def test_error_api(self): @@ -370,6 +373,10 @@ def run_stride_out_of_range(): self.assertRaises(ValueError, run_stride_out_of_range) + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_error_api() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_pool2d_api.py b/python/paddle/fluid/tests/unittests/test_pool2d_api.py index 872bec666bf8c..e86fa0ec48330 100644 --- a/python/paddle/fluid/tests/unittests/test_pool2d_api.py +++ b/python/paddle/fluid/tests/unittests/test_pool2d_api.py @@ -12,14 +12,15 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-from test_pool2d_op import adaptive_start_index, adaptive_end_index, pool2D_forward_naive, avg_pool2D_forward_naive, max_pool2D_forward_naive import unittest -from op_test import OpTest +import paddle import numpy as np +import paddle.fluid as fluid import paddle.fluid.core as core +from op_test import OpTest +from paddle.fluid.framework import _test_eager_guard from paddle.nn.functional import avg_pool2d, max_pool2d -import paddle.fluid as fluid -import paddle +from test_pool2d_op import adaptive_start_index, adaptive_end_index, pool2D_forward_naive, avg_pool2D_forward_naive, max_pool2D_forward_naive class TestPool2D_API(unittest.TestCase): @@ -324,6 +325,10 @@ def test_pool2d(self): self.check_max_dygraph_ceilmode_results(place) self.check_max_dygraph_nhwc_results(place) + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_pool2d() + class TestPool2DError_API(unittest.TestCase): def test_error_api(self): @@ -524,6 +529,10 @@ def run_stride_out_of_range(): self.assertRaises(ValueError, run_stride_out_of_range) + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_error_api() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_pool3d_api.py b/python/paddle/fluid/tests/unittests/test_pool3d_api.py index cddb09e5daa41..f20d2aad49f27 100644 --- a/python/paddle/fluid/tests/unittests/test_pool3d_api.py +++ b/python/paddle/fluid/tests/unittests/test_pool3d_api.py @@ -15,13 +15,15 @@ from __future__ import print_function from __future__ import division +import paddle import unittest import numpy as np -import paddle +import paddle.fluid as fluid import paddle.fluid.core as core from op_test import OpTest -import paddle.fluid as fluid +from paddle.fluid.framework import _test_eager_guard from paddle.nn.functional import avg_pool3d, max_pool3d +from paddle.fluid.framework import _test_eager_guard from test_pool3d_op import adaptive_start_index, adaptive_end_index, pool3D_forward_naive, avg_pool3D_forward_naive, max_pool3D_forward_naive @@ -326,6 +328,10 @@ def test_pool3d(self): self.check_max_dygraph_ndhwc_results(place) self.check_max_dygraph_ceilmode_results(place) + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_pool3d() + class TestPool3DError_API(unittest.TestCase): def test_error_api(self): @@ -499,6 +505,10 @@ def run_size_out_of_range(): self.assertRaises(ValueError, run_size_out_of_range) + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_error_api() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_prelu_op.py b/python/paddle/fluid/tests/unittests/test_prelu_op.py index 6afc462322fba..73c423a23e6ba 100644 --- a/python/paddle/fluid/tests/unittests/test_prelu_op.py +++ b/python/paddle/fluid/tests/unittests/test_prelu_op.py @@ -23,6 +23,7 @@ from op_test import OpTest, skip_check_grad_ci import paddle import paddle.nn.functional as F +from paddle.fluid.framework import _test_eager_guard def ref_prelu(x, weight): @@ -76,6 +77,10 @@ def test_dygraph_api(self): self.dygraph_check(self.weight_np_0) self.dygraph_check(self.weight_np_1) + def test_dygraph_api_eager(self): + with _test_eager_guard(): + self.test_dygraph_api() + def test_error(self): with paddle.static.program_guard(paddle.static.Program()): weight_fp32 = paddle.fluid.data( @@ -151,12 +156,19 @@ def test_dygraph_api(self): paddle.enable_static() +def prelu_api_wrapper(x, weight, data_format="NCHW"): + weight = 
weight.reshape([-1]) + return paddle.nn.functional.prelu(x, weight, data_format, name=None) + + class PReluTest(OpTest): def setUp(self): self.init_dtype() self.init_input_shape() + self.eager_mode = True self.init_attr() self.op_type = "prelu" + self.python_api = prelu_api_wrapper x_np = np.random.uniform(-1, 1, self.x_shape).astype(self.dtype) # Since zero point in prelu is not differentiable, avoid randomize @@ -177,6 +189,8 @@ def setUp(self): alpha_np = np.random.uniform(-1, -0.5, [1, 1, 1, self.x_shape[-1]]) else: alpha_np = np.random.uniform(-1, -0.5, [1] + self.x_shape[1:]) + # eager check doesn't support mode = 'all' + self.eager_mode = False alpha_np = alpha_np.astype(self.dtype) self.inputs = {'X': x_np, 'Alpha': alpha_np} @@ -207,10 +221,10 @@ def init_attr(self): self.attrs = {'mode': "channel", "data_format": "NCHW"} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.eager_mode) def test_check_grad(self): - self.check_grad(['X', 'Alpha'], 'Out') + self.check_grad(['X', 'Alpha'], 'Out', check_eager=self.eager_mode) @skip_check_grad_ci( @@ -373,7 +387,8 @@ def test_check_output(self): if core.is_compiled_with_cuda(): place = core.CUDAPlace(0) if core.is_float16_supported(place): - self.check_output_with_place(place, atol=atol) + self.check_output_with_place( + place, atol=atol, check_eager=self.eager_mode) def test_check_grad(self): place = core.CUDAPlace(0) @@ -381,7 +396,8 @@ def test_check_grad(self): self.check_grad_with_place( place, ['X', 'Alpha'], 'Out', - max_relative_error=max_relative_error) + max_relative_error=max_relative_error, + check_eager=self.eager_mode) cls_name = "{0}_{1}".format(parent.__name__, "Fp16Op") TestPReluFp16Case.__name__ = cls_name diff --git a/python/paddle/fluid/tests/unittests/test_profiler_statistic.py b/python/paddle/fluid/tests/unittests/test_profiler_statistic.py index adc42d0447f34..dc944e68c7f55 100644 --- a/python/paddle/fluid/tests/unittests/test_profiler_statistic.py +++ b/python/paddle/fluid/tests/unittests/test_profiler_statistic.py @@ -185,20 +185,22 @@ def test_statistic_case1(self): profiler.TracerEventType.Communication), 5) self.assertEqual(len(event_summary.items), 2) self.assertEqual(len(event_summary.userdefined_items), 1) - self.assertEqual(len(event_summary.model_perspective_items), 3) + self.assertEqual(len(event_summary.model_perspective_items), 4) self.assertEqual(len(event_summary.memory_manipulation_items), 1) self.assertEqual(event_summary.items['conv2d'].cpu_time, 15) - self.assertEqual(event_summary.items['conv2d'].gpu_time, 25) + self.assertEqual(event_summary.items['conv2d'].general_gpu_time, 25) self.assertEqual( event_summary.model_perspective_items['Forward'].cpu_time, 100) self.assertEqual( - event_summary.model_perspective_items['Forward'].gpu_time, 135) + event_summary.model_perspective_items['Forward'].general_gpu_time, + 135) self.assertEqual( - event_summary.model_perspective_items['Backward'].gpu_time, 0) + event_summary.model_perspective_items['Backward'].general_gpu_time, + 0) self.assertEqual( event_summary.memory_manipulation_items['AsyncMemcpy'].cpu_time, 15) - self.assertEqual( - event_summary.memory_manipulation_items['AsyncMemcpy'].gpu_time, 60) + self.assertEqual(event_summary.memory_manipulation_items['AsyncMemcpy'] + .general_gpu_time, 60) print( profiler.profiler_statistic._build_table( statistic_data, @@ -226,31 +228,31 @@ def test_statistic_case2(self): userdefined_node = HostPythonNode('Communication Time', profiler.TracerEventType.UserDefined, 100, 110,
1000, 1001) - reduce_all_launchkernel0 = HostPythonNode( + allreduce_launchkernel0 = HostPythonNode( 'cudalaunchkernel', profiler.TracerEventType.CudaRuntime, 102, 104, 1000, 1001) - nccl_reduce_all_kernel0 = DevicePythonNode( - 'nccl_reduce_all_kernel', profiler.TracerEventType.Kernel, 105, 120, + nccl_allreduce_kernel0 = DevicePythonNode( + 'nccl_allreduce_kernel', profiler.TracerEventType.Kernel, 105, 120, 0, 0, 2) communication_node = HostPythonNode( 'Communication', profiler.TracerEventType.Communication, 105, 110, 1000, 1001) - reduce_all_op1 = HostPythonNode('reduce_all_op1', - profiler.TracerEventType.Operator, 105, - 108, 1000, 1001) - reduce_all_op1_infershape = HostPythonNode( - 'reduce_all_op1::infershape', - profiler.TracerEventType.OperatorInner, 105, 106, 1000, 1001) + allreduce_op1 = HostPythonNode('allreduce_op1', + profiler.TracerEventType.Operator, 105, + 108, 1000, 1001) + allreduce_op1_infershape = HostPythonNode( + 'allreduce_op1::infershape', profiler.TracerEventType.OperatorInner, + 105, 106, 1000, 1001) - reduce_all_launchkernel1 = HostPythonNode( + allreduce_launchkernel1 = HostPythonNode( 'cudalaunchkernel', profiler.TracerEventType.CudaRuntime, 106, 107, 1000, 1001) - nccl_reduce_all_kernel1 = DevicePythonNode( - 'nccl_reduce_all_kernel', profiler.TracerEventType.Kernel, 130, 150, + nccl_allreduce_kernel1 = DevicePythonNode( + 'nccl_allreduce_kernel', profiler.TracerEventType.Kernel, 130, 150, 0, 0, 2) backward_node = HostPythonNode('Gradient Backward', @@ -305,19 +307,19 @@ def test_statistic_case2(self): 'sync_batch_norm_memcpy', profiler.TracerEventType.Memcpy, 150, 200, 0, 0, 1) - reduce_all_node2 = HostPythonNode('reduce_all', - profiler.TracerEventType.Operator, - 230, 250, 1000, 1001) + allreduce_node2 = HostPythonNode('allreduce', + profiler.TracerEventType.Operator, 230, + 250, 1000, 1001) - reduce_all_node2_infershape = HostPythonNode( - 'reduce_all_node2::infershape', + allreduce_node2_infershape = HostPythonNode( + 'allreduce_node2::infershape', profiler.TracerEventType.OperatorInner, 231, 232, 1000, 1001) - reduce_all_launchkernel2 = HostPythonNode( + allreduce_launchkernel2 = HostPythonNode( 'cudalaunchkernel', profiler.TracerEventType.CudaRuntime, 235, 240, 1000, 1001) - nccl_reduce_all_kernel2 = DevicePythonNode( - 'nccl_reduce_all_kernel', profiler.TracerEventType.Kernel, 250, 280, + nccl_allreduce_kernel2 = DevicePythonNode( + 'nccl_allreduce_kernel', profiler.TracerEventType.Kernel, 250, 280, 0, 0, 2) root_node.children_node.append(profilerstep_node) @@ -329,12 +331,12 @@ def test_statistic_case2(self): yolonet_node.children_node.extend( [sync_batch_norm_node, userdefined_node]) userdefined_node.children_node.append(communication_node) - userdefined_node.runtime_node.append(reduce_all_launchkernel0) - reduce_all_launchkernel0.device_node.append(nccl_reduce_all_kernel0) - communication_node.children_node.append(reduce_all_op1) - reduce_all_op1.children_node.append(reduce_all_op1_infershape) - reduce_all_op1.runtime_node.append(reduce_all_launchkernel1) - reduce_all_launchkernel1.device_node.append(nccl_reduce_all_kernel1) + userdefined_node.runtime_node.append(allreduce_launchkernel0) + allreduce_launchkernel0.device_node.append(nccl_allreduce_kernel0) + communication_node.children_node.append(allreduce_op1) + allreduce_op1.children_node.append(allreduce_op1_infershape) + allreduce_op1.runtime_node.append(allreduce_launchkernel1) + allreduce_launchkernel1.device_node.append(nccl_allreduce_kernel1) conv2d_node.children_node.extend( 
[conv2d_infer_shape, conv2d_compute, conv2d_MemCpy]) conv2d_compute.runtime_node.append(conv2d_launchkernel) @@ -350,10 +352,10 @@ def test_statistic_case2(self): sync_batch_norm_MemCpy.runtime_node.append(sync_batch_norm_cudaMemCpy) sync_batch_norm_launchkernel.device_node.append(sync_batch_norm_kernel) sync_batch_norm_cudaMemCpy.device_node.append(sync_batch_norm_memcpy) - optimization_node.children_node.append(reduce_all_node2) - reduce_all_node2.children_node.append(reduce_all_node2_infershape) - reduce_all_node2.runtime_node.append(reduce_all_launchkernel2) - reduce_all_launchkernel2.device_node.append(nccl_reduce_all_kernel2) + optimization_node.children_node.append(allreduce_node2) + allreduce_node2.children_node.append(allreduce_node2_infershape) + allreduce_node2.runtime_node.append(allreduce_launchkernel2) + allreduce_launchkernel2.device_node.append(nccl_allreduce_kernel2) thread_tree = {'thread1001': root_node} extra_info = { 'Process Cpu Utilization': '1.02', @@ -415,20 +417,22 @@ def test_statistic_case2(self): distributed_summary.overlap_range), 85) self.assertEqual(len(event_summary.items), 4) self.assertEqual(len(event_summary.userdefined_items), 1) - self.assertEqual(len(event_summary.model_perspective_items), 3) + self.assertEqual(len(event_summary.model_perspective_items), 4) self.assertEqual(len(event_summary.memory_manipulation_items), 1) self.assertEqual(event_summary.items['conv2d'].cpu_time, 15) - self.assertEqual(event_summary.items['conv2d'].gpu_time, 25) + self.assertEqual(event_summary.items['conv2d'].general_gpu_time, 25) self.assertEqual( event_summary.model_perspective_items['Forward'].cpu_time, 100) self.assertEqual( - event_summary.model_perspective_items['Forward'].gpu_time, 315) + event_summary.model_perspective_items['Forward'].general_gpu_time, + 315) self.assertEqual( - event_summary.model_perspective_items['Backward'].gpu_time, 0) + event_summary.model_perspective_items['Backward'].general_gpu_time, + 0) self.assertEqual( event_summary.memory_manipulation_items['AsyncMemcpy'].cpu_time, 15) - self.assertEqual( - event_summary.memory_manipulation_items['AsyncMemcpy'].gpu_time, 60) + self.assertEqual(event_summary.memory_manipulation_items['AsyncMemcpy'] + .general_gpu_time, 60) print( profiler.profiler_statistic._build_table( statistic_data, diff --git a/python/paddle/fluid/tests/unittests/test_prroi_pool_op.py b/python/paddle/fluid/tests/unittests/test_prroi_pool_op.py index efb5e05bdebca..8e5ba7c3363a1 100644 --- a/python/paddle/fluid/tests/unittests/test_prroi_pool_op.py +++ b/python/paddle/fluid/tests/unittests/test_prroi_pool_op.py @@ -80,14 +80,14 @@ def setUp(self): self.set_data() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_backward(self): places = [fluid.CPUPlace()] if fluid.core.is_compiled_with_cuda(): places.append(fluid.CUDAPlace(0)) for place in places: - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def run_net(self, place): with program_guard(Program(), Program()): @@ -197,14 +197,14 @@ def setUp(self): self.set_data() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_backward(self): places = [fluid.CPUPlace()] if fluid.core.is_compiled_with_cuda(): places.append(fluid.CUDAPlace(0)) for place in places: - self.check_grad_with_place(place, ['X'], 'Out') + self.check_grad_with_place(place, ['X'], 'Out', check_eager=True) def run_net(self, place): with 
program_guard(Program(), Program()): @@ -280,4 +280,6 @@ def test_bad_y(): if __name__ == '__main__': + import paddle + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_psroi_pool_op.py b/python/paddle/fluid/tests/unittests/test_psroi_pool_op.py index 95b8c5c3c0a94..39dec982b6607 100644 --- a/python/paddle/fluid/tests/unittests/test_psroi_pool_op.py +++ b/python/paddle/fluid/tests/unittests/test_psroi_pool_op.py @@ -95,7 +95,8 @@ def set_data(self): self.pooled_width).astype('float64') self.inputs = { 'X': self.x, - 'ROIs': (self.rois_with_batch_id[:, 1:5], self.rois_lod) + 'ROIs': (self.rois_with_batch_id[:, 1:5], self.rois_lod), + 'RoisNum': self.boxes_num } self.attrs = { 'output_channels': self.output_channels, @@ -145,13 +146,14 @@ def make_rois(self): def setUp(self): self.op_type = 'psroi_pool' + self.python_api = lambda x, boxes, boxes_num, pooled_height, pooled_width, output_channels, spatial_scale: paddle.vision.ops.psroi_pool(x, boxes, boxes_num, (pooled_height, pooled_width), spatial_scale) self.set_data() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestPSROIPoolDynamicFunctionAPI(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_pylayer_op.py b/python/paddle/fluid/tests/unittests/test_pylayer_op.py index 91e7b5d00e1a7..aadfb4d39442c 100644 --- a/python/paddle/fluid/tests/unittests/test_pylayer_op.py +++ b/python/paddle/fluid/tests/unittests/test_pylayer_op.py @@ -487,7 +487,7 @@ def forward(self, data): z = layer(data) with self.assertRaisesRegexp( RuntimeError, - "received current_inplace_version:{} != inplace_version_snapshot_:{}". + "received tensor_version:{} != wrapper_version_snapshot:{}". format(1, 0)): z.backward() diff --git a/python/paddle/fluid/tests/unittests/test_quantile.py b/python/paddle/fluid/tests/unittests/test_quantile.py index 0fd3c1de9ca82..936d1d3be3a19 100644 --- a/python/paddle/fluid/tests/unittests/test_quantile.py +++ b/python/paddle/fluid/tests/unittests/test_quantile.py @@ -20,46 +20,59 @@ class TestQuantile(unittest.TestCase): + """ + This class is used for numerical precision testing. If there is + a corresponding numpy API, the precision comparison can be performed directly. + Otherwise, it needs to be verified by a function implemented with numpy. + """ + def setUp(self): np.random.seed(678) self.input_data = np.random.rand(6, 7, 8, 9, 10) + # Test correctness when q and axis are set. def test_quantile_single_q(self): x = paddle.to_tensor(self.input_data) paddle_res = paddle.quantile(x, q=0.5, axis=2) np_res = np.quantile(self.input_data, q=0.5, axis=2) self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + # Test correctness for default axis. def test_quantile_with_no_axis(self): x = paddle.to_tensor(self.input_data) paddle_res = paddle.quantile(x, q=0.35) np_res = np.quantile(self.input_data, q=0.35) self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + # Test correctness for multiple axes. def test_quantile_with_multi_axis(self): x = paddle.to_tensor(self.input_data) paddle_res = paddle.quantile(x, q=0.75, axis=[0, 2, 3]) np_res = np.quantile(self.input_data, q=0.75, axis=[0, 2, 3]) self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + # Test correctness when keepdim is set.
def test_quantile_with_keepdim(self): x = paddle.to_tensor(self.input_data) paddle_res = paddle.quantile(x, q=0.35, axis=4, keepdim=True) np_res = np.quantile(self.input_data, q=0.35, axis=4, keepdims=True) self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + # Test correctness when all parameters are set. def test_quantile_with_keepdim_and_multiple_axis(self): x = paddle.to_tensor(self.input_data) paddle_res = paddle.quantile(x, q=0.1, axis=[1, 4], keepdim=True) np_res = np.quantile(self.input_data, q=0.1, axis=[1, 4], keepdims=True) self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + # Test correctness when q = 0. def test_quantile_with_boundary_q(self): x = paddle.to_tensor(self.input_data) paddle_res = paddle.quantile(x, q=0, axis=3) np_res = np.quantile(self.input_data, q=0, axis=3) self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + # Test correctness when input includes NaN. def test_quantile_include_NaN(self): input_data = np.random.randn(2, 3, 4) input_data[0, 1, 1] = np.nan @@ -69,6 +82,10 @@ def test_quantile_include_NaN(self): class TestQuantileMuitlpleQ(unittest.TestCase): + """ + This class is used to test multiple inputs of q. + """ + def setUp(self): np.random.seed(678) self.input_data = np.random.rand(10, 3, 4, 5, 4) @@ -95,56 +112,125 @@ def test_quantile_multiple_axis_keepdim(self): class TestQuantileError(unittest.TestCase): + """ + This class is used to test that exceptions are thrown correctly. + Validity of all parameter values and types should be considered. + """ + def setUp(self): self.x = paddle.randn((2, 3, 4)) def test_errors(self): + # Test error when q > 1 def test_q_range_error_1(): paddle_res = paddle.quantile(self.x, q=1.5) self.assertRaises(ValueError, test_q_range_error_1) + # Test error when q < 0 def test_q_range_error_2(): paddle_res = paddle.quantile(self.x, q=[0.2, -0.3]) self.assertRaises(ValueError, test_q_range_error_2) + # Test error with no valid q def test_q_range_error_3(): paddle_res = paddle.quantile(self.x, q=[]) self.assertRaises(ValueError, test_q_range_error_3) + # Test error when x is not Tensor def test_x_type_error(): x = [1, 3, 4] paddle_res = paddle.quantile(x, q=0.9) self.assertRaises(TypeError, test_x_type_error) + # Test error when scalar axis is not int def test_axis_type_error_1(): paddle_res = paddle.quantile(self.x, q=0.4, axis=0.4) self.assertRaises(ValueError, test_axis_type_error_1) + # Test error when axis in List is not int def test_axis_type_error_2(): paddle_res = paddle.quantile(self.x, q=0.4, axis=[1, 0.4]) self.assertRaises(ValueError, test_axis_type_error_2) + # Test error when axis not in [-D, D) def test_axis_value_error_1(): paddle_res = paddle.quantile(self.x, q=0.4, axis=10) self.assertRaises(ValueError, test_axis_value_error_1) + # Test error when axis not in [-D, D) def test_axis_value_error_2(): paddle_res = paddle.quantile(self.x, q=0.4, axis=[1, -10]) self.assertRaises(ValueError, test_axis_value_error_2) + # Test error with no valid axis def test_axis_value_error_3(): paddle_res = paddle.quantile(self.x, q=0.4, axis=[]) self.assertRaises(ValueError, test_axis_value_error_3) +class TestQuantileRuntime(unittest.TestCase): + """ + This class is used to test that the API runs correctly with + different devices, different data types, and dygraph/static mode.
+ """ + + def setUp(self): + np.random.seed(678) + self.input_data = np.random.rand(6, 7, 8, 9, 10) + self.dtypes = ['float32', 'float64'] + self.devices = ['cpu'] + if paddle.device.is_compiled_with_cuda(): + self.devices.append('gpu') + + def test_dygraph(self): + paddle.disable_static() + for device in self.devices: + # Check different devices + paddle.set_device(device) + for dtype in self.dtypes: + # Check different dtypes + np_input_data = self.input_data.astype(dtype) + x = paddle.to_tensor(np_input_data, dtype=dtype) + paddle_res = paddle.quantile(x, q=0.5, axis=2) + np_res = np.quantile(np_input_data, q=0.5, axis=2) + self.assertTrue(np.allclose(paddle_res.numpy(), np_res)) + + def test_static(self): + paddle.enable_static() + for device in self.devices: + x = paddle.static.data( + name="x", shape=self.input_data.shape, dtype=paddle.float32) + x_fp64 = paddle.static.data( + name="x_fp64", + shape=self.input_data.shape, + dtype=paddle.float64) + + results = paddle.quantile(x, q=0.5, axis=2) + np_input_data = self.input_data.astype('float32') + results_fp64 = paddle.quantile(x_fp64, q=0.5, axis=2) + np_input_data_fp64 = self.input_data.astype('float64') + + exe = paddle.static.Executor(device) + paddle_res, paddle_res_fp64 = exe.run( + paddle.static.default_main_program(), + feed={"x": np_input_data, + "x_fp64": np_input_data_fp64}, + fetch_list=[results, results_fp64]) + np_res = np.quantile(np_input_data, q=0.5, axis=2) + np_res_fp64 = np.quantile(np_input_data_fp64, q=0.5, axis=2) + self.assertTrue( + np.allclose(paddle_res, np_res) and np.allclose(paddle_res_fp64, + np_res_fp64)) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_randint_op.py b/python/paddle/fluid/tests/unittests/test_randint_op.py index 5f58054d7efc9..361f4d280f70f 100644 --- a/python/paddle/fluid/tests/unittests/test_randint_op.py +++ b/python/paddle/fluid/tests/unittests/test_randint_op.py @@ -14,13 +14,14 @@ from __future__ import print_function +import os +import paddle import unittest import numpy as np from op_test import OpTest -import paddle from paddle.fluid import core +from paddle.fluid.framework import _test_eager_guard from paddle.static import program_guard, Program -import os paddle.enable_static() @@ -53,6 +54,10 @@ def verify_output(self, outs): np.allclose( hist, prob, rtol=0, atol=0.001), "hist: " + str(hist)) + def test_check_output_eager(self): + with _test_eager_guard(): + self.test_check_output() + class TestRandintOpError(unittest.TestCase): def test_errors(self): @@ -67,6 +72,10 @@ def test_errors(self): self.assertRaises( TypeError, paddle.randint, 5, shape=[shape_tensor]) + def test_errors_eager(self): + with _test_eager_guard(): + self.test_errors() + class TestRandintOp_attr_tensorlist(OpTest): def setUp(self): @@ -93,6 +102,10 @@ def verify_output(self, outs): np.allclose( hist, prob, rtol=0, atol=0.001), "hist: " + str(hist)) + def test_check_output_eager(self): + with _test_eager_guard(): + self.test_check_output() + class TestRandint_attr_tensor(OpTest): def setUp(self): @@ -114,6 +127,10 @@ def verify_output(self, outs): np.allclose( hist, prob, rtol=0, atol=0.001), "hist: " + str(hist)) + def test_check_output_eager(self): + with _test_eager_guard(): + self.test_check_output() + # Test python API class TestRandintAPI(unittest.TestCase): @@ -145,18 +162,30 @@ def test_api(self): feed={'var_shape': np.array([100, 100]).astype('int64')}, fetch_list=[out1, out2, out3, out4, out5]) + def test_api_eager(self): + with 
_test_eager_guard(): + self.test_api() + class TestRandintImperative(unittest.TestCase): def test_api(self): - n = 10 paddle.disable_static() + + self.run_test_case() + + with _test_eager_guard(): + self.run_test_case() + + paddle.enable_static() + + def run_test_case(self): + n = 10 x1 = paddle.randint(n, shape=[10], dtype="int32") x2 = paddle.tensor.randint(n) x3 = paddle.tensor.random.randint(n) for i in [x1, x2, x3]: for j in i.numpy().tolist(): self.assertTrue((j >= 0 and j < n)) - paddle.enable_static() class TestRandomValue(unittest.TestCase): @@ -169,11 +198,17 @@ def test_fixed_random_number(self): if not "V100" in paddle.device.cuda.get_device_name(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on GPU------>") paddle.disable_static() + + self.run_test_case() + + with _test_eager_guard(): + self.run_test_case() + + paddle.enable_static() + + def run_test_case(self): paddle.set_device('gpu') paddle.seed(100) @@ -198,7 +233,6 @@ def test_fixed_random_number(self): self.assertTrue(np.array_equal(x[20, 1, 600, 600:605], expect)) expect = [3581, 3420, -8027, -5237, -2436] self.assertTrue(np.array_equal(x[30, 2, 1000, 1000:1005], expect)) - paddle.enable_static() if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/test_randperm_op.py b/python/paddle/fluid/tests/unittests/test_randperm_op.py index 2380ccb14aaee..deb0a9a082140 100644 --- a/python/paddle/fluid/tests/unittests/test_randperm_op.py +++ b/python/paddle/fluid/tests/unittests/test_randperm_op.py @@ -18,6 +18,7 @@ import paddle import paddle.fluid.core as core from paddle.static import program_guard, Program +from paddle.fluid.framework import _test_eager_guard import os @@ -50,6 +51,7 @@ class TestRandpermOp(OpTest): def setUp(self): self.op_type = "randperm" + self.python_api = paddle.randperm self.n = 200 self.dtype = "int64" @@ -72,6 +74,10 @@ def verify_output(self, outs): self.assertTrue( check_randperm_out(self.n, out_np), msg=error_msg(out_np)) + def test_eager(self): + with _test_eager_guard(): + self.test_check_output() + class TestRandpermOpN(TestRandpermOp): def init_attrs(self): @@ -130,15 +136,25 @@ def test_out(self): paddle.enable_static() +class TestRandpermEager(unittest.TestCase): + def test_out(self): + paddle.disable_static() + n = 10 + with _test_eager_guard(): + for dtype in ['int32', np.int64, 'float32', 'float64']: + data_p = paddle.randperm(n, dtype) + data_np = data_p.numpy() + self.assertTrue( + check_randperm_out(n, data_np), msg=error_msg(data_np)) + paddle.enable_static() + + class TestRandomValue(unittest.TestCase): def test_fixed_random_number(self): # Test GPU Fixed random number, which is generated by 'curandStatePhilox4_32_10_t' if not paddle.is_compiled_with_cuda(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - print("Test Fixed Random number on GPU------>") paddle.disable_static() paddle.set_device('gpu') diff --git a/python/paddle/fluid/tests/unittests/test_range.py b/python/paddle/fluid/tests/unittests/test_range.py index f129ae78cbf7e..e19c1b227f531 100644 --- a/python/paddle/fluid/tests/unittests/test_range.py +++ b/python/paddle/fluid/tests/unittests/test_range.py @@ -14,9 +14,15 @@ from __future__ import print_function +import paddle import unittest import numpy as np from op_test import OpTest +from functools import partial + + +def arange_wrapper(start, end, step, dtype=None): + return paddle.arange(start, end, step, dtype) class TestRangeOp(OpTest): @@ 
-36,33 +42,38 @@ def setUp(self): def init_config(self): self.dtype = np.float32 + self.python_api = partial(arange_wrapper, dtype=self.dtype) self.case = (0, 1, 0.2) def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestFloatRangeOpCase0(TestRangeOp): def init_config(self): self.dtype = np.float32 + self.python_api = partial(arange_wrapper, dtype=self.dtype) self.case = (0, 5, 1) class TestInt32RangeOpCase0(TestRangeOp): def init_config(self): self.dtype = np.int32 + self.python_api = partial(arange_wrapper, dtype=self.dtype) self.case = (0, 5, 2) class TestInt32RangeOpCase1(TestRangeOp): def init_config(self): self.dtype = np.int32 + self.python_api = partial(arange_wrapper, dtype=self.dtype) self.case = (10, 1, -2) class TestInt32RangeOpCase2(TestRangeOp): def init_config(self): self.dtype = np.int32 + self.python_api = partial(arange_wrapper, dtype=self.dtype) self.case = (-1, -10, -2) diff --git a/python/paddle/fluid/tests/unittests/test_real_imag_op.py b/python/paddle/fluid/tests/unittests/test_real_imag_op.py index ab24506f80101..523f48374eab9 100644 --- a/python/paddle/fluid/tests/unittests/test_real_imag_op.py +++ b/python/paddle/fluid/tests/unittests/test_real_imag_op.py @@ -39,6 +39,7 @@ def setUp(self): paddle.enable_static() # op test attrs self.op_type = "real" + self.python_api = paddle.real self.dtype = np.float64 self.init_input_output() # backward attrs @@ -58,14 +59,15 @@ def init_grad_input_output(self): self.grad_out.shape) def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): self.check_grad( ['X'], 'Out', user_defined_grads=[self.grad_x], - user_defined_grad_outputs=[self.grad_out]) + user_defined_grad_outputs=[self.grad_out], + check_eager=True) class TestImagOp(TestRealOp): @@ -74,6 +76,7 @@ def setUp(self): paddle.enable_static() # op test attrs self.op_type = "imag" + self.python_api = paddle.imag self.dtype = np.float64 self.init_input_output() # backward attrs diff --git a/python/paddle/fluid/tests/unittests/test_reduce_op.py b/python/paddle/fluid/tests/unittests/test_reduce_op.py index d246356b4ec75..01d386724d161 100644 --- a/python/paddle/fluid/tests/unittests/test_reduce_op.py +++ b/python/paddle/fluid/tests/unittests/test_reduce_op.py @@ -26,19 +26,22 @@ class TestSumOp(OpTest): def setUp(self): + self.python_api = paddle.sum self.op_type = "reduce_sum" self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")} self.outputs = {'Out': self.inputs['X'].sum(axis=0)} + self.attrs = {'dim': [0]} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSumOp_fp16(OpTest): def setUp(self): + self.python_api = paddle.sum self.op_type = "reduce_sum" self.inputs = { 'X': np.random.uniform(0, 0.1, (5, 6, 10)).astype("float16") @@ -50,7 +53,7 @@ def setUp(self): self.gradient = self.calc_gradient() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def calc_gradient(self): x = self.inputs["X"] @@ -58,7 +61,8 @@ def calc_gradient(self): return grad, def test_check_grad(self): - self.check_grad(['X'], 'Out', user_defined_grads=self.gradient) + self.check_grad( + ['X'], 'Out', user_defined_grads=self.gradient, check_eager=True) @unittest.skipIf(not core.is_compiled_with_cuda(), @@ -66,6 +70,7 @@ def test_check_grad(self): class TestSumOp_bf16(OpTest): def setUp(self): 
np.random.seed(100) + self.python_api = paddle.sum self.op_type = "reduce_sum" self.dtype = np.uint16 self.x = np.random.uniform(0, 0.1, (2, 5, 10)).astype(np.float32) @@ -79,12 +84,15 @@ def setUp(self): def test_check_output(self): place = core.CUDAPlace(0) - self.check_output_with_place(place) + self.check_output_with_place(place, check_eager=True) def test_check_grad(self): place = core.CUDAPlace(0) self.check_grad_with_place( - place, ['X'], 'Out', user_defined_grads=self.gradient) + place, ['X'], + 'Out', + user_defined_grads=self.gradient, + check_eager=True) def calc_gradient(self): x = self.x @@ -94,6 +102,7 @@ def calc_gradient(self): class TestSumOp_fp16_withInt(OpTest): def setUp(self): + self.python_api = paddle.sum self.op_type = "reduce_sum" self.inputs = { # ref to https://en.wikipedia.org/wiki/Half-precision_floating-point_format @@ -107,7 +116,7 @@ def setUp(self): self.gradient = self.calc_gradient() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def calc_gradient(self): x = self.inputs["X"] @@ -115,41 +124,47 @@ def calc_gradient(self): return grad, def test_check_grad(self): - self.check_grad(['X'], 'Out', user_defined_grads=self.gradient) + self.check_grad( + ['X'], 'Out', user_defined_grads=self.gradient, check_eager=True) class TestSumOp5D(OpTest): def setUp(self): + self.python_api = paddle.sum self.op_type = "reduce_sum" self.inputs = { 'X': np.random.random((1, 2, 5, 6, 10)).astype("float64") } + self.attrs = {'dim': [0]} self.outputs = {'Out': self.inputs['X'].sum(axis=0)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSumOp6D(OpTest): def setUp(self): + self.python_api = paddle.sum self.op_type = "reduce_sum" self.inputs = { 'X': np.random.random((1, 1, 2, 5, 6, 10)).astype("float64") } + self.attrs = {'dim': [0]} self.outputs = {'Out': self.inputs['X'].sum(axis=0)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSumOp8D(OpTest): def setUp(self): + self.python_api = paddle.sum self.op_type = "reduce_sum" self.inputs = { 'X': np.random.random((1, 3, 1, 2, 1, 4, 3, 10)).astype("float64") @@ -158,10 +173,10 @@ def setUp(self): self.outputs = {'Out': self.inputs['X'].sum(axis=(0, 3))} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) @skip_check_grad_ci( @@ -172,6 +187,7 @@ class TestMaxOp(OpTest): def setUp(self): self.op_type = "reduce_max" + self.python_api = paddle.max self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")} self.attrs = {'dim': [-1]} self.outputs = { @@ -179,7 +195,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) @skip_check_grad_ci( @@ -190,6 +206,7 @@ class TestMinOp(OpTest): def setUp(self): self.op_type = "reduce_min" + self.python_api = paddle.min self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")} self.attrs = {'dim': [2]} self.outputs = { @@ -197,7 +214,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestMin6DOp(OpTest): @@ -205,6 +222,7 @@ class TestMin6DOp(OpTest): def 
setUp(self): self.op_type = "reduce_min" + self.python_api = paddle.min self.inputs = { 'X': np.random.random((2, 4, 3, 5, 6, 10)).astype("float64") } @@ -214,7 +232,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestMin8DOp(OpTest): @@ -222,6 +240,7 @@ class TestMin8DOp(OpTest): def setUp(self): self.op_type = "reduce_min" + self.python_api = paddle.min self.inputs = { 'X': np.random.random((2, 4, 3, 5, 6, 3, 2, 4)).astype("float64") } @@ -231,12 +250,17 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) + + +def raw_reduce_prod(x, dim=[0], keep_dim=False): + return paddle.prod(x, dim, keep_dim) class TestProdOp(OpTest): def setUp(self): self.op_type = "reduce_prod" + self.python_api = raw_reduce_prod self.init_data_type() self.inputs = {'X': np.random.random((5, 6, 10)).astype(self.data_type)} self.outputs = {'Out': self.inputs['X'].prod(axis=0)} @@ -246,15 +270,16 @@ def init_data_type(self): ) else "float64" def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestProd6DOp(OpTest): def setUp(self): self.op_type = "reduce_prod" + self.python_api = raw_reduce_prod self.init_data_type() self.inputs = { 'X': np.random.random((5, 6, 2, 3, 4, 2)).astype(self.data_type) @@ -269,15 +294,16 @@ def init_data_type(self): ) else "float64" def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestProd8DOp(OpTest): def setUp(self): self.op_type = "reduce_prod" + self.python_api = raw_reduce_prod self.init_data_type() self.inputs = { 'X': np.random.random( @@ -293,26 +319,28 @@ def init_data_type(self): ) else "float64" def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestAllOp(OpTest): def setUp(self): self.op_type = "reduce_all" + self.python_api = paddle.all self.inputs = {'X': np.random.randint(0, 2, (5, 6, 10)).astype("bool")} self.outputs = {'Out': self.inputs['X'].all()} self.attrs = {'reduce_all': True} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAll8DOp(OpTest): def setUp(self): self.op_type = "reduce_all" + self.python_api = paddle.all self.inputs = { 'X': np.random.randint(0, 2, (2, 5, 3, 2, 2, 3, 4, 2)).astype("bool") @@ -321,23 +349,25 @@ def setUp(self): self.outputs = {'Out': self.inputs['X'].all(axis=self.attrs['dim'])} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAllOpWithDim(OpTest): def setUp(self): self.op_type = "reduce_all" + self.python_api = paddle.all self.inputs = {'X': np.random.randint(0, 2, (5, 6, 10)).astype("bool")} self.attrs = {'dim': (1, )} self.outputs = {'Out': self.inputs['X'].all(axis=self.attrs['dim'])} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAll8DOpWithDim(OpTest): def setUp(self): self.op_type = "reduce_all" + self.python_api = paddle.all self.inputs = { 'X': np.random.randint(0, 2, (2, 5, 3, 2, 2, 3, 4, 2)).astype("bool") @@ -346,12 +376,13 @@ def setUp(self): self.outputs = {'Out': 
self.inputs['X'].all(axis=self.attrs['dim'])} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAllOpWithKeepDim(OpTest): def setUp(self): self.op_type = "reduce_all" + self.python_api = paddle.all self.inputs = {'X': np.random.randint(0, 2, (5, 6, 10)).astype("bool")} self.attrs = {'dim': [1], 'keep_dim': True} self.outputs = { @@ -360,12 +391,13 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAll8DOpWithKeepDim(OpTest): def setUp(self): self.op_type = "reduce_all" + self.python_api = paddle.all self.inputs = { 'X': np.random.randint(0, 2, (2, 5, 3, 2, 2, 3, 4, 2)).astype("bool") @@ -377,7 +409,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAllOpError(unittest.TestCase): @@ -395,17 +427,19 @@ def test_errors(self): class TestAnyOp(OpTest): def setUp(self): self.op_type = "reduce_any" + self.python_api = paddle.any self.inputs = {'X': np.random.randint(0, 2, (5, 6, 10)).astype("bool")} self.outputs = {'Out': self.inputs['X'].any()} self.attrs = {'reduce_all': True} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAny8DOp(OpTest): def setUp(self): self.op_type = "reduce_any" + self.python_api = paddle.any self.inputs = { 'X': np.random.randint(0, 2, (2, 5, 3, 2, 2, 3, 4, 2)).astype("bool") @@ -414,23 +448,25 @@ def setUp(self): self.outputs = {'Out': self.inputs['X'].any(axis=self.attrs['dim'])} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAnyOpWithDim(OpTest): def setUp(self): self.op_type = "reduce_any" + self.python_api = paddle.any self.inputs = {'X': np.random.randint(0, 2, (5, 6, 10)).astype("bool")} self.attrs = {'dim': [1]} self.outputs = {'Out': self.inputs['X'].any(axis=1)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAny8DOpWithDim(OpTest): def setUp(self): self.op_type = "reduce_any" + self.python_api = paddle.any self.inputs = { 'X': np.random.randint(0, 2, (2, 5, 3, 2, 2, 3, 4, 2)).astype("bool") @@ -439,12 +475,13 @@ def setUp(self): self.outputs = {'Out': self.inputs['X'].any(axis=self.attrs['dim'])} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAnyOpWithKeepDim(OpTest): def setUp(self): self.op_type = "reduce_any" + self.python_api = paddle.any self.inputs = {'X': np.random.randint(0, 2, (5, 6, 10)).astype("bool")} self.attrs = {'dim': (1, ), 'keep_dim': True} self.outputs = { @@ -453,12 +490,13 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAny8DOpWithKeepDim(OpTest): def setUp(self): self.op_type = "reduce_any" + self.python_api = paddle.any self.inputs = { 'X': np.random.randint(0, 2, (2, 5, 3, 2, 2, 3, 4, 2)).astype("bool") @@ -470,7 +508,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestAnyOpError(unittest.TestCase): @@ -600,6 +638,7 @@ class TestReduceMaxOpMultiAxises(OpTest): def setUp(self): self.op_type = "reduce_max" + self.python_api = paddle.max self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")} self.attrs = {'dim': [-2, -1]} self.outputs = { @@ -607,7 +646,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) @skip_check_grad_ci( @@ -618,6 
+657,7 @@ class TestReduceMinOpMultiAxises(OpTest): def setUp(self): self.op_type = "reduce_min" + self.python_api = paddle.min self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")} self.attrs = {'dim': [1, 2]} self.outputs = { @@ -625,7 +665,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class TestKeepDimReduceSumMultiAxises(OpTest): diff --git a/python/paddle/fluid/tests/unittests/test_retain_graph.py b/python/paddle/fluid/tests/unittests/test_retain_graph.py index 79664fe4b12fb..0259b898a488e 100644 --- a/python/paddle/fluid/tests/unittests/test_retain_graph.py +++ b/python/paddle/fluid/tests/unittests/test_retain_graph.py @@ -134,9 +134,15 @@ def run_retain(self, need_retain): loss_g.backward() optim_g.minimize(loss_g) - def test_retain(self): + def func_retain(self): self.run_retain(need_retain=True) - self.assertRaises(RuntimeError, self.run_retain, need_retain=False) + if not fluid.framework.in_dygraph_mode(): + self.assertRaises(RuntimeError, self.run_retain, need_retain=False) + + def test_retain(self): + with fluid.framework._test_eager_guard(): + self.func_retain() + self.func_retain() if __name__ == '__main__': diff --git a/python/paddle/fluid/tests/unittests/test_rnn_decode_api.py b/python/paddle/fluid/tests/unittests/test_rnn_decode_api.py index a0009a71b3ef7..dacb7a5b59957 100644 --- a/python/paddle/fluid/tests/unittests/test_rnn_decode_api.py +++ b/python/paddle/fluid/tests/unittests/test_rnn_decode_api.py @@ -31,7 +31,7 @@ from paddle.fluid.executor import Executor from paddle.fluid import framework - +from paddle.fluid.framework import _test_eager_guard paddle.enable_static() @@ -554,7 +554,7 @@ def test_beam_search_infer(self): }, fetch_list=[output])[0] - def test_dynamic_basic_decoder(self): + def func_dynamic_basic_decoder(self): paddle.disable_static() src = paddle.to_tensor(np.random.randint(8, size=(8, 4))) src_length = paddle.to_tensor(np.random.randint(8, size=(8))) @@ -562,6 +562,11 @@ def test_dynamic_basic_decoder(self): probs, samples, sample_length = model(src, src_length) paddle.enable_static() + def test_dynamic_basic_decoder(self): + with _test_eager_guard(): + self.func_dynamic_basic_decoder() + self.func_dynamic_basic_decoder() + class ModuleApiTest(unittest.TestCase): @classmethod @@ -708,9 +713,16 @@ def make_inputs(self): ] return inputs - def test_check_output(self): + def func_check_output(self): + self.setUp() + self.make_inputs() self.check_output() + def test_check_output(self): + with _test_eager_guard(): + self.func_check_output() + self.func_check_output() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_roi_align_op.py b/python/paddle/fluid/tests/unittests/test_roi_align_op.py index 7fab4017ab0ba..a22b331b03241 100644 --- a/python/paddle/fluid/tests/unittests/test_roi_align_op.py +++ b/python/paddle/fluid/tests/unittests/test_roi_align_op.py @@ -14,6 +14,7 @@ from __future__ import print_function +import paddle import unittest import numpy as np import math @@ -30,6 +31,7 @@ def set_data(self): self.inputs = { 'X': self.x, 'ROIs': (self.rois[:, 1:5], self.rois_lod), + 'RoisNum': self.boxes_num } self.attrs = { 'spatial_scale': self.spatial_scale, @@ -170,16 +172,19 @@ def make_rois(self): rois.append(roi) self.rois_num = len(rois) self.rois = np.array(rois).astype("float64") + self.boxes_num = np.array( + [bno + 1 for bno in range(self.batch_size)]).astype('int32') def setUp(self): self.op_type = "roi_align" + 
self.python_api = lambda x, boxes, boxes_num, pooled_height, pooled_width, spatial_scale, sampling_ratio, aligned: paddle.vision.ops.roi_align(x, boxes, boxes_num, (pooled_height, pooled_width), spatial_scale, sampling_ratio, aligned) self.set_data() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestROIAlignInLodOp(TestROIAlignOp): diff --git a/python/paddle/fluid/tests/unittests/test_roi_pool_op.py b/python/paddle/fluid/tests/unittests/test_roi_pool_op.py index c6622cf8d9ce8..f0afcff63c6c4 100644 --- a/python/paddle/fluid/tests/unittests/test_roi_pool_op.py +++ b/python/paddle/fluid/tests/unittests/test_roi_pool_op.py @@ -14,6 +14,7 @@ from __future__ import print_function +import paddle import unittest import numpy as np import math @@ -32,6 +33,7 @@ def set_data(self): self.inputs = { 'X': self.x, 'ROIs': (self.rois[:, 1:5], self.rois_lod), + 'RoisNum': self.boxes_num } self.attrs = { @@ -130,16 +132,20 @@ def make_rois(self): rois.append(roi) self.rois_num = len(rois) self.rois = np.array(rois).astype("float64") + self.boxes_num = np.array( + [bno + 1 for bno in range(self.batch_size)]).astype('int32') def setUp(self): self.op_type = "roi_pool" + self.python_api = lambda x, boxes, boxes_num, pooled_height, pooled_width, spatial_scale: paddle.vision.ops.roi_pool(x, boxes, boxes_num, (pooled_height, pooled_width), spatial_scale) + self.python_out_sig = ["Out"] self.set_data() def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class BadInputTestRoiPool(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_roll_op.py b/python/paddle/fluid/tests/unittests/test_roll_op.py index bca7665b814db..c315aa9b74618 100644 --- a/python/paddle/fluid/tests/unittests/test_roll_op.py +++ b/python/paddle/fluid/tests/unittests/test_roll_op.py @@ -25,6 +25,7 @@ class TestRollOp(OpTest): def setUp(self): + self.python_api = paddle.roll self.op_type = "roll" self.init_dtype_type() self.inputs = {'X': np.random.random(self.x_shape).astype(self.dtype)} @@ -41,10 +42,10 @@ def init_dtype_type(self): self.axis = [0, -2] def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestRollOpCase2(TestRollOp): diff --git a/python/paddle/fluid/tests/unittests/test_run_program_op.py b/python/paddle/fluid/tests/unittests/test_run_program_op.py index 57281eafff7ee..68f24bf257008 100644 --- a/python/paddle/fluid/tests/unittests/test_run_program_op.py +++ b/python/paddle/fluid/tests/unittests/test_run_program_op.py @@ -99,7 +99,7 @@ def get_program_desc(self): def prepare_attrs(self): return ('global_block', self.program_desc.block(0), 'start_op_index', 0, 'end_op_index', self.fwd_op_num, 'program_id', - _hash_with_id(self.program_desc)) + _hash_with_id(self.program_desc, self)) def get_param_grad_names(self): grad_names = [] diff --git a/python/paddle/fluid/tests/unittests/test_scale_op.py b/python/paddle/fluid/tests/unittests/test_scale_op.py index d432b8057f624..04ddb5a788d6f 100644 --- a/python/paddle/fluid/tests/unittests/test_scale_op.py +++ b/python/paddle/fluid/tests/unittests/test_scale_op.py @@ -27,6 +27,7 @@ class TestScaleOp(OpTest): def 
setUp(self): self.op_type = "scale" + self.python_api = paddle.scale self.dtype = np.float64 self.init_dtype_type() self.inputs = {'X': np.random.random((10, 10)).astype(self.dtype)} @@ -39,15 +40,16 @@ def init_dtype_type(self): pass def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestScaleOpScaleVariable(OpTest): def setUp(self): self.op_type = "scale" + self.python_api = paddle.scale self.dtype = np.float64 self.init_dtype_type() self.scale = -2.3 @@ -62,10 +64,10 @@ def init_dtype_type(self): pass def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestScaleOpSelectedRows(unittest.TestCase): @@ -144,18 +146,19 @@ def init_dtype_type(self): def test_check_output(self): place = core.CUDAPlace(0) if core.is_float16_supported(place): - self.check_output_with_place(place, atol=0.002) + self.check_output_with_place(place, atol=0.002, check_eager=True) def test_check_grad(self): place = core.CUDAPlace(0) if core.is_float16_supported(place): self.check_grad_with_place( - place, ["X"], "Out", max_relative_error=0.05) + place, ["X"], "Out", max_relative_error=0.05, check_eager=True) class TestScaleBF16Op(OpTest): def setUp(self): self.op_type = "scale" + self.python_api = paddle.scale self.dtype = np.uint16 self.attrs = {'scale': -2.3} x = np.random.random((10, 10)).astype(np.float32) @@ -164,10 +167,10 @@ def setUp(self): self.outputs = {'Out': convert_float_to_uint16(out)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out', numeric_grad_delta=0.8) + self.check_grad(['X'], 'Out', numeric_grad_delta=0.8, check_eager=True) @unittest.skipIf(not core.is_compiled_with_cuda(), diff --git a/python/paddle/fluid/tests/unittests/test_scatter_nd_op.py b/python/paddle/fluid/tests/unittests/test_scatter_nd_op.py index d7a27bbddebba..ddbee33c35bb1 100644 --- a/python/paddle/fluid/tests/unittests/test_scatter_nd_op.py +++ b/python/paddle/fluid/tests/unittests/test_scatter_nd_op.py @@ -77,10 +77,10 @@ def setUp(self): self.outputs = {'Out': expect_np} def test_check_output(self): - self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X', 'Updates'], 'Out', check_eager=False) + self.check_grad(['X', 'Updates'], 'Out', check_eager=True) class TestScatterNdAddWithEmptyIndex(OpTest): @@ -101,10 +101,10 @@ def setUp(self): self.outputs = {'Out': expect_np} def test_check_output(self): - self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X', 'Updates'], 'Out', check_eager=False) + self.check_grad(['X', 'Updates'], 'Out', check_eager=True) class TestScatterNdAddWithHighRankSame(OpTest): @@ -128,10 +128,10 @@ def setUp(self): self.outputs = {'Out': expect_np} def test_check_output(self): - self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X', 'Updates'], 'Out', check_eager=False) + self.check_grad(['X', 'Updates'], 'Out', check_eager=True) class TestScatterNdAddWithHighRankDiff(OpTest): @@ -154,10 +154,10 @@ def setUp(self): self.outputs = {'Out': expect_np} def test_check_output(self): - 
self.check_output(check_eager=False) + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X', 'Updates'], 'Out', check_eager=False) + self.check_grad(['X', 'Updates'], 'Out', check_eager=True) #Test Python API diff --git a/python/paddle/fluid/tests/unittests/test_searchsorted_op.py b/python/paddle/fluid/tests/unittests/test_searchsorted_op.py index f595d06d5bce7..f802b0adfcb2a 100644 --- a/python/paddle/fluid/tests/unittests/test_searchsorted_op.py +++ b/python/paddle/fluid/tests/unittests/test_searchsorted_op.py @@ -25,7 +25,7 @@ class TestSearchSorted(OpTest): def setUp(self): - + self.python_api = paddle.searchsorted self.op_type = "searchsorted" self.init_test_case() @@ -41,7 +41,7 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def init_test_case(self): self.sorted_sequence = np.array([1, 3, 5, 7, 9]).astype("float32") diff --git a/python/paddle/fluid/tests/unittests/test_segment_ops.py b/python/paddle/fluid/tests/unittests/test_segment_ops.py index e2aadbedbd07f..90d597837a8e1 100644 --- a/python/paddle/fluid/tests/unittests/test_segment_ops.py +++ b/python/paddle/fluid/tests/unittests/test_segment_ops.py @@ -73,6 +73,17 @@ def compute_segment_min_max(x, segment_ids, pooltype="MAX"): return results, gradient / results.size +def segment_pool_split(X, SegmentIds, pooltype): + if pooltype == "SUM": + return paddle.incubate.tensor.segment_sum(X, SegmentIds) + elif pooltype == "MEAN": + return paddle.incubate.tensor.segment_mean(X, SegmentIds) + elif pooltype == "MIN": + return paddle.incubate.tensor.segment_min(X, SegmentIds) + elif pooltype == "MAX": + return paddle.incubate.tensor.segment_max(X, SegmentIds) + + class TestSegmentOps(OpTest): def set_data(self): x = np.random.uniform(-1, 1, self.shape).astype(self.dtype) @@ -90,6 +101,8 @@ def compute(self, x, segment_ids): def prepare(self): self.op_type = "segment_pool" + self.python_api = segment_pool_split + self.python_out_sig = ["Out"] self.dtype = np.float64 self.shape = [30, 15] self.attrs = {"pooltype": "SUM"} @@ -105,10 +118,10 @@ def setUp(self): self.outputs = {'Out': result.astype(self.dtype)} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) class TestSegmentSum2(TestSegmentOps): @@ -259,4 +272,5 @@ def test_dygraph(self): if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_shape_op.py b/python/paddle/fluid/tests/unittests/test_shape_op.py index bada62e3239ea..3d961a7413ca0 100644 --- a/python/paddle/fluid/tests/unittests/test_shape_op.py +++ b/python/paddle/fluid/tests/unittests/test_shape_op.py @@ -17,6 +17,7 @@ import unittest import numpy as np from op_test import OpTest +import paddle from paddle.fluid import core from paddle.fluid.op import Operator @@ -24,6 +25,7 @@ class TestShapeOp(OpTest): def setUp(self): self.op_type = "shape" + self.python_api = paddle.shape self.config() self.shape = [2, 3] input = np.zeros(self.shape) @@ -34,7 +36,7 @@ def config(self): self.shape = [2, 3] def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) class case1(TestShapeOp): diff --git a/python/paddle/fluid/tests/unittests/test_sigmoid_cross_entropy_with_logits_op.py b/python/paddle/fluid/tests/unittests/test_sigmoid_cross_entropy_with_logits_op.py index 51751588f7b94..e5406f4d0c224 100644 
--- a/python/paddle/fluid/tests/unittests/test_sigmoid_cross_entropy_with_logits_op.py +++ b/python/paddle/fluid/tests/unittests/test_sigmoid_cross_entropy_with_logits_op.py @@ -22,6 +22,12 @@ import unittest from paddle.fluid import compiler, Program, program_guard import paddle.fluid as fluid +import paddle + + +def test_fluid_sigmoid(x, label, normalize=False, ignore_index=-100): + return paddle.fluid.layers.sigmoid_cross_entropy_with_logits( + x, label, int(ignore_index), normalize=normalize) class TestSigmoidCrossEntropyWithLogitsOp1(OpTest): @@ -30,6 +36,7 @@ class TestSigmoidCrossEntropyWithLogitsOp1(OpTest): def setUp(self): self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid batch_size = 64 num_classes = 20 self.inputs = { @@ -49,10 +56,10 @@ def setUp(self): self.outputs = {'Out': -term1 - term2} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSigmoidCrossEntropyWithLogitsOp2(OpTest): @@ -61,6 +68,7 @@ class TestSigmoidCrossEntropyWithLogitsOp2(OpTest): def setUp(self): self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid batch_size = 64 num_classes = 20 ignore_index = -1 @@ -83,10 +91,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSigmoidCrossEntropyWithLogitsOp3(OpTest): @@ -95,6 +103,7 @@ class TestSigmoidCrossEntropyWithLogitsOp3(OpTest): def setUp(self): self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid batch_size = 64 num_classes = 20 self.inputs = { @@ -114,15 +123,16 @@ def setUp(self): self.outputs = {'Out': -term1 - term2} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSigmoidCrossEntropyWithNorm(OpTest): def setUp(self): self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid batch_size = 64 num_classes = 20 ignore_index = -1 @@ -145,10 +155,10 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSigmoidCrossEntropyWithLogitsOp5(OpTest): @@ -157,6 +167,7 @@ class TestSigmoidCrossEntropyWithLogitsOp5(OpTest): def setUp(self): self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid batch_size = [10, 10] num_classes = 20 self.inputs = { @@ -176,15 +187,16 @@ def setUp(self): self.outputs = {'Out': -term1 - term2} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) class TestSigmoidCrossEntropyWithNorm2(OpTest): def setUp(self): self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid batch_size = [10, 10] num_classes = 20 ignore_index = -1 @@ -207,68 +219,71 @@ def setUp(self): self.outputs = {'Out': out} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) 
def test_check_grad(self): - self.check_grad(['X'], 'Out') - - -class TestSigmoidCrossEntropyWithLogitsOp6(OpTest): - """Test sigmoid_cross_entropy_with_logit_op with binary label - """ - - def setUp(self): - self.op_type = "sigmoid_cross_entropy_with_logits" - batch_size = [10, 10] - num_classes = 20 - self.inputs = { - 'X': logit( - np.random.uniform(0, 1, tuple(batch_size + [num_classes])) - .astype("float64")), - 'Label': np.random.randint(0, 2, tuple(batch_size + [num_classes])) - .astype("float64") - } - - # Fw Pass is implemented as elementwise sigmoid followed by - # elementwise logistic loss - # Label * -log(sigmoid(X)) + (1 - label) * -log(1 - sigmoid(X)) - sigmoid_X = expit(self.inputs['X']) - term1 = self.inputs['Label'] * np.log(sigmoid_X) - term2 = (1 - self.inputs['Label']) * np.log(1 - sigmoid_X) - self.outputs = {'Out': -term1 - term2} - - def test_check_output(self): - self.check_output() - - def test_check_grad(self): - self.check_grad(['X'], 'Out') - - -class TestSigmoidCrossEntropyWithLogitsOpError(unittest.TestCase): - def test_errors(self): - with program_guard(Program(), Program()): - - def test_Variable(): - # the input of sigmoid_cross_entropy_with_logits must be Variable. - x1 = fluid.create_lod_tensor( - np.array([-1, 3, 5, 5]), [[1, 1, 1, 1]], fluid.CPUPlace()) - lab1 = fluid.create_lod_tensor( - np.array([-1, 3, 5, 5]), [[1, 1, 1, 1]], fluid.CPUPlace()) - fluid.layers.sigmoid_cross_entropy_with_logits(x1, lab1) - - self.assertRaises(TypeError, test_Variable) - - def test_dtype(): - # the input dtype of sigmoid_cross_entropy_with_logits must be float16 or float32 or float64 - # float16 only can be set on GPU place - x2 = fluid.layers.data( - name='x2', shape=[3, 4, 5, 6], dtype="int32") - lab2 = fluid.layers.data( - name='lab2', shape=[3, 4, 5, 6], dtype="int32") - fluid.layers.sigmoid_cross_entropy_with_logits(x2, lab2) - - self.assertRaises(TypeError, test_dtype) + self.check_grad(['X'], 'Out', check_eager=True) + + class TestSigmoidCrossEntropyWithLogitsOp6(OpTest): + """Test sigmoid_cross_entropy_with_logit_op with binary label + """ + + def setUp(self): + self.op_type = "sigmoid_cross_entropy_with_logits" + self.python_api = test_fluid_sigmoid + batch_size = [10, 10] + num_classes = 20 + self.inputs = { + 'X': logit( + np.random.uniform(0, 1, tuple(batch_size + [num_classes])) + .astype("float64")), + 'Label': + np.random.randint(0, 2, tuple(batch_size + [num_classes])) + .astype("float64") + } + + # Fw Pass is implemented as elementwise sigmoid followed by + # elementwise logistic loss + # Label * -log(sigmoid(X)) + (1 - label) * -log(1 - sigmoid(X)) + sigmoid_X = expit(self.inputs['X']) + term1 = self.inputs['Label'] * np.log(sigmoid_X) + term2 = (1 - self.inputs['Label']) * np.log(1 - sigmoid_X) + self.outputs = {'Out': -term1 - term2} + + def test_check_output(self): + self.check_output(check_eager=True) + + def test_check_grad(self): + self.check_grad(['X'], 'Out', check_eager=True) + + class TestSigmoidCrossEntropyWithLogitsOpError(unittest.TestCase): + def test_errors(self): + with program_guard(Program(), Program()): + + def test_Variable(): + # the input of sigmoid_cross_entropy_with_logits must be Variable. 
+ x1 = fluid.create_lod_tensor( + np.array([-1, 3, 5, 5]), [[1, 1, 1, 1]], + fluid.CPUPlace()) + lab1 = fluid.create_lod_tensor( + np.array([-1, 3, 5, 5]), [[1, 1, 1, 1]], + fluid.CPUPlace()) + fluid.layers.sigmoid_cross_entropy_with_logits(x1, lab1) + + self.assertRaises(TypeError, test_Variable) + + def test_dtype(): + # the input dtype of sigmoid_cross_entropy_with_logits must be float16 or float32 or float64 + # float16 only can be set on GPU place + x2 = fluid.layers.data( + name='x2', shape=[3, 4, 5, 6], dtype="int32") + lab2 = fluid.layers.data( + name='lab2', shape=[3, 4, 5, 6], dtype="int32") + fluid.layers.sigmoid_cross_entropy_with_logits(x2, lab2) + + self.assertRaises(TypeError, test_dtype) if __name__ == '__main__': + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_sigmoid_focal_loss.py b/python/paddle/fluid/tests/unittests/test_sigmoid_focal_loss.py index 2ef04d9cbfa73..15a4827cecba3 100644 --- a/python/paddle/fluid/tests/unittests/test_sigmoid_focal_loss.py +++ b/python/paddle/fluid/tests/unittests/test_sigmoid_focal_loss.py @@ -18,6 +18,7 @@ import unittest from op_test import OpTest from test_sigmoid_focal_loss_op import sigmoid_focal_loss_forward +from paddle.fluid.framework import _test_eager_guard def call_sfl_functional(logit, @@ -140,6 +141,10 @@ def test_SigmoidFocalLoss(self): dy_result = test_dygraph(place, logit_np, label_np, normalizer_np, alpha, gamma, reduction) + with _test_eager_guard(): + eager_result = test_dygraph( + place, logit_np, label_np, normalizer_np, + alpha, gamma, reduction) expected = calc_sigmoid_focal_loss( logit_np, label_np, normalizer_np, alpha, gamma, reduction) @@ -148,6 +153,7 @@ def test_SigmoidFocalLoss(self): self.assertTrue( np.allclose(static_result, dy_result)) self.assertTrue(np.allclose(dy_result, expected)) + self.assertTrue(np.allclose(eager_result, expected)) def test_SigmoidFocalLoss_error(self): paddle.disable_static() diff --git a/python/paddle/fluid/tests/unittests/test_slice_op.py b/python/paddle/fluid/tests/unittests/test_slice_op.py index 71869b96aedf0..a565bba304184 100644 --- a/python/paddle/fluid/tests/unittests/test_slice_op.py +++ b/python/paddle/fluid/tests/unittests/test_slice_op.py @@ -21,6 +21,7 @@ import paddle.fluid as fluid import paddle.fluid.layers as layers import paddle +from paddle.fluid.framework import _test_eager_guard paddle.enable_static() @@ -599,6 +600,31 @@ def test_bool_tensor(self): self.assertTrue(np.array_equal(y_paddle.numpy(), y_np)) +class TestSliceApiEager(unittest.TestCase): + def test_slice_api(self): + with paddle.fluid.dygraph.guard(): + with _test_eager_guard(): + a = paddle.rand(shape=[4, 5, 6], dtype='float32') + a.stop_gradient = False + axes = [0, 1, 2] + starts = [-3, 0, 2] + ends = [3, 2, 4] + a_1 = paddle.slice(a, axes=axes, starts=starts, ends=ends) + + a_2 = paddle.slice( + a, + axes=axes, + starts=paddle.to_tensor(starts), + ends=paddle.to_tensor(ends)) + + a_1.backward() + grad_truth = paddle.zeros_like(a) + grad_truth[-3:3, 0:2, 2:4] = 1 + self.assertTrue(np.array_equal(grad_truth, a.gradient())) + + self.assertTrue(np.allclose(a_1.numpy(), a[-3:3, 0:2, 2:4])) + + class TestSliceApiWithLoDTensorArray(unittest.TestCase): def setUp(self): self.shape = (3, 4) diff --git a/python/paddle/fluid/tests/unittests/test_smooth_l1_loss_op.py b/python/paddle/fluid/tests/unittests/test_smooth_l1_loss_op.py index 3c825c08e8c3f..63e8568048d13 100644 --- a/python/paddle/fluid/tests/unittests/test_smooth_l1_loss_op.py +++ 
b/python/paddle/fluid/tests/unittests/test_smooth_l1_loss_op.py @@ -48,18 +48,27 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.02) + self.check_grad( + ['X', 'Y'], 'Out', max_relative_error=0.02, check_eager=True) def test_check_grad_ingore_x(self): self.check_grad( - ['Y'], 'Out', max_relative_error=0.03, no_grad_set=set("X")) + ['Y'], + 'Out', + max_relative_error=0.03, + no_grad_set=set("X"), + check_eager=True) def test_check_grad_ingore_y(self): self.check_grad( - ['X'], 'Out', max_relative_error=0.03, no_grad_set=set('Y')) + ['X'], + 'Out', + max_relative_error=0.03, + no_grad_set=set('Y'), + check_eager=True) class TestSmoothL1LossOp2(OpTest): @@ -86,24 +95,27 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.03) + self.check_grad( + ['X', 'Y'], 'Out', max_relative_error=0.03, check_eager=True) def test_check_grad_ingore_x(self): self.check_grad( ['Y'], 'Out', max_relative_error=0.03, - no_grad_set=set(['X', 'InsideWeight', 'OutsideWeight'])) + no_grad_set=set(['X', 'InsideWeight', 'OutsideWeight']), + check_eager=True) def test_check_grad_ingore_y(self): self.check_grad( ['X'], 'Out', max_relative_error=0.03, - no_grad_set=set(['Y', 'InsideWeight', 'OutsideWeight'])) + no_grad_set=set(['Y', 'InsideWeight', 'OutsideWeight']), + check_eager=True) class TestSmoothL1LossOpError(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py b/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py index 69f6a87dd9ed1..75d09e3df0c30 100644 --- a/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py +++ b/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py @@ -26,7 +26,6 @@ def cross_entropy(softmax, label, soft_label, axis, ignore_index=-1): if soft_label: return (-label * np.log(softmax)).sum(axis=axis, keepdims=True) - shape = softmax.shape axis %= len(shape) n = int(np.prod(shape[:axis])) @@ -43,6 +42,41 @@ def cross_entropy(softmax, label, soft_label, axis, ignore_index=-1): return result.reshape(label.shape) +def python_api(logits, + label, + soft_label=False, + use_softmax=True, + numeric_stable_mode=True, + ignore_index=-100, + axis=-1): + # here only can test paddle.nn.functional.softmax_with_cross_entropy, + # the paddle.nn.functional.cross_entropy contains other math ops + return paddle.nn.functional.softmax_with_cross_entropy( + logits, + label, + soft_label=soft_label, + ignore_index=ignore_index, + numeric_stable_mode=numeric_stable_mode, + return_softmax=use_softmax, + axis=axis) + + +def python_core_api_without_softmax(logits, + label, + soft_label=False, + use_softmax=False, + numeric_stable_mode=True, + ignore_index=-100, + axis=-1): + # the API paddle.nn.functional.softmax_with_cross_entropy cannot + # set use_softmax=False, so add a core api manually + assert use_softmax is False + _, loss = paddle._C_ops.final_state_cross_entropy_with_softmax( + logits, label, soft_label, use_softmax, numeric_stable_mode, + ignore_index, axis) + return loss + + class TestSoftmaxWithCrossEntropyOp(OpTest): """ Test softmax with cross entropy operator with discreate one-hot labels. 
@@ -50,6 +84,8 @@ class TestSoftmaxWithCrossEntropyOp(OpTest): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = False self.soft_label = False # explicilty use float32 for ROCm, as MIOpen does not yet support float64 @@ -102,13 +138,27 @@ def setUp(self): self.attrs['axis'] = self.axis def test_check_output(self): + if self.python_api is not None: + self.check_output(check_eager=True) self.check_output() def test_check_grad(self): if core.is_compiled_with_rocm(): + if self.python_api is not None: + self.check_grad( + ["Logits"], + "Loss", + max_relative_error=5e-1, + check_eager=True) # HIP will have accuracy fail when using float32 in CPU place self.check_grad(["Logits"], "Loss", max_relative_error=5e-1) else: + if self.python_api is not None: + self.check_grad( + ["Logits"], + "Loss", + numeric_grad_delta=0.001, + check_eager=True) self.check_grad(["Logits"], "Loss", numeric_grad_delta=0.001) @@ -136,6 +186,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_SoftLabel_1D( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = True self.shape = [13, 8] @@ -149,6 +201,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_1D( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = False self.shape = [13, 8] @@ -165,6 +219,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_SoftLabel_2D( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = True self.shape = [3, 5, 7, 11] @@ -178,6 +234,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_SoftLabel_2D_Axis2( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = True self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -191,6 +249,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_SoftLabel_2D_Axis3( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = True self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -204,6 +264,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_SoftLabel_2D_Axis4( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = True self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -226,6 +288,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_2D( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] 
self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -239,6 +303,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_2D_Axis2( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -252,6 +318,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_2D_Axis3( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -265,6 +333,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_2D_Axis4( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -287,6 +357,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_Ignore( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = False self.soft_label = False self.shape = [13, 8] @@ -300,6 +372,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_Ignore_Axis( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = False self.soft_label = False self.shape = [13, 8] @@ -313,6 +387,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_2D_Ignore( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -326,6 +402,8 @@ class TestSoftmaxWithCrossEntropyOp_NotWithSoftmax_HardLabel_2D_Ignore_Axis3( TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_core_api_without_softmax + self.python_out_sig = ["Loss"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -343,6 +421,8 @@ def initParams(self): class TestSoftmaxWithCrossEntropyOpNoCudnn(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -357,6 +437,8 @@ def initParams(self): class TestSoftmaxWithCrossEntropyOpFp16(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = False self.soft_label = False self.shape = [3, 5, 7, 11] @@ -394,9 +476,14 @@ def setUp(self): self.attrs['axis'] = self.axis def test_check_output(self): + if 
self.python_api is not None: + self.check_output(atol=1e-2, check_eager=True) self.check_output(atol=1e-2) def test_check_grad(self): + if self.python_api is not None: + self.check_grad( + ["Logits"], "Loss", max_relative_error=0.1, check_eager=True) self.check_grad(["Logits"], "Loss", max_relative_error=0.1) @@ -404,6 +491,8 @@ class TestSoftmaxWithCrossEntropyOpNoCudnnFp16( TestSoftmaxWithCrossEntropyOpFp16): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -412,6 +501,9 @@ def initParams(self): self.dtype = np.float16 def test_check_grad(self): + if self.python_api is not None: + self.check_grad( + ["Logits"], "Loss", max_relative_error=0.1, check_eager=True) self.check_grad(["Logits"], "Loss", max_relative_error=0.1) @@ -422,6 +514,8 @@ class TestSoftmaxWithCrossEntropyOp2(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = True self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -431,13 +525,23 @@ def initParams(self): self.use_softmax = True def test_check_output(self): + if self.python_api is not None: + self.check_output(check_eager=True) self.check_output() def test_check_grad(self): if core.is_compiled_with_rocm(): # HIP will have accuracy fail when using float32 in CPU place + if self.python_api is not None: + self.check_grad( + ["Logits"], + "Loss", + max_relative_error=0.1, + check_eager=True) self.check_grad(["Logits"], "Loss", max_relative_error=0.1) else: + if self.python_api is not None: + self.check_grad(["Logits"], "Loss", check_eager=True) self.check_grad(["Logits"], "Loss") @@ -448,6 +552,8 @@ class TestSoftmaxWithCrossEntropyOp3(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = False self.soft_label = False self.shape = [41, 37] @@ -460,6 +566,8 @@ def initParams(self): class TestSoftmaxWithCrossEntropyOp3NoCudnn(TestSoftmaxWithCrossEntropyOp3): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -477,6 +585,8 @@ class TestSoftmaxWithCrossEntropyOpAxis1(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -494,6 +604,8 @@ class TestSoftmaxWithCrossEntropyOpAxis2(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -511,6 +623,8 @@ class TestSoftmaxWithCrossEntropyOpAxis3(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = 
False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -528,6 +642,8 @@ class TestSoftmaxWithCrossEntropyOpAxis4(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -546,6 +662,8 @@ class TestSoftmaxWithCrossEntropyOpAxisDimEqualOne( def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.dtype = np.float32 if core.is_compiled_with_rocm() else np.float64 @@ -559,6 +677,8 @@ class TestSoftmaxWithCrossEntropyOpNoCudnnFp16Axis1( TestSoftmaxWithCrossEntropyOpNoCudnnFp16): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -572,6 +692,8 @@ class TestSoftmaxWithCrossEntropyOpNoCudnnFp16Axis2( TestSoftmaxWithCrossEntropyOpNoCudnnFp16): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -585,6 +707,8 @@ class TestSoftmaxWithCrossEntropyOpNoCudnnFp16Axis3( TestSoftmaxWithCrossEntropyOpNoCudnnFp16): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -598,6 +722,8 @@ class TestSoftmaxWithCrossEntropyOpSoftLabelAxis1( TestSoftmaxWithCrossEntropyOp2): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = True self.shape = [3, 5, 7, 11] @@ -611,6 +737,8 @@ class TestSoftmaxWithCrossEntropyOpSoftLabelAxis2( TestSoftmaxWithCrossEntropyOp2): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = True self.shape = [3, 5, 7, 11] @@ -624,6 +752,8 @@ class TestSoftmaxWithCrossEntropyOpSoftLabelAxis3( TestSoftmaxWithCrossEntropyOp2): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = True self.shape = [3, 5, 7, 11] @@ -637,6 +767,8 @@ class TestSoftmaxWithCrossEntropyOpSoftLabelAxis4( TestSoftmaxWithCrossEntropyOp2): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = True self.shape = [3, 5, 7, 11] @@ -650,6 +782,8 @@ class TestSoftmaxWithCrossEntropyOpIgnoreIndexNoCudnnAxis1( TestSoftmaxWithCrossEntropyOp3): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -663,6 +797,8 @@ class TestSoftmaxWithCrossEntropyOpIgnoreIndexNoCudnnAxis2( 
TestSoftmaxWithCrossEntropyOp3): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -676,6 +812,8 @@ class TestSoftmaxWithCrossEntropyOpIgnoreIndexNoCudnnAxis3( TestSoftmaxWithCrossEntropyOp3): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -689,6 +827,8 @@ class TestSoftmaxWithCrossEntropyOpIgnoreIndexNoCudnnAxis4( TestSoftmaxWithCrossEntropyOp3): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -706,6 +846,8 @@ class TestSoftmaxWithCrossEntropyOpBoundary0(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] @@ -724,6 +866,8 @@ class TestSoftmaxWithCrossEntropyOpBoundary1(TestSoftmaxWithCrossEntropyOp): def initParams(self): self.op_type = "softmax_with_cross_entropy" + self.python_api = python_api + self.python_out_sig = ["Loss", "Softmax"] self.numeric_stable_mode = True self.soft_label = False self.shape = [3, 5, 7, 11] diff --git a/python/paddle/fluid/tests/unittests/test_sparse_activation_op.py b/python/paddle/fluid/tests/unittests/test_sparse_activation_op.py index df13ae4e4b7ff..b4abbd56303ff 100644 --- a/python/paddle/fluid/tests/unittests/test_sparse_activation_op.py +++ b/python/paddle/fluid/tests/unittests/test_sparse_activation_op.py @@ -16,7 +16,6 @@ import unittest import numpy as np import paddle -from paddle import _C_ops from paddle.fluid.framework import _test_eager_guard @@ -24,16 +23,27 @@ class TestSparseActivation(unittest.TestCase): def test_sparse_relu(self): with _test_eager_guard(): x = [[0, -1, 0, 2], [0, 0, -3, 0], [4, 5, 0, 0]] - dense_x = paddle.to_tensor(x, dtype='float32') - dense_shape = [3, 4] - stop_gradient = True + + def dense_relu(x): + dense_x = paddle.to_tensor( + x, dtype='float32', stop_gradient=False) + dense_relu = paddle.nn.ReLU() + dense_out = dense_relu(dense_x) + dense_out.backward(dense_out) + return dense_out, dense_x.grad + + dense_x = paddle.to_tensor(x, dtype='float32', stop_gradient=False) sparse_dim = 2 - sparse_coo_x = dense_x.to_sparse_coo(sparse_dim) - #TODO(zhangkaihuo): change to test the corresponding API: paddle.sparse.relu(sparse_coo_x) - sparse_act_out = _C_ops.final_state_sparse_relu(sparse_coo_x) - correct_result = [0, 2, 0, 4, 5] - actual_result = sparse_act_out.non_zero_elements().numpy() - assert np.array_equal(correct_result, actual_result) + sparse_x = dense_x.to_sparse_coo(sparse_dim) + sparse_relu = paddle.sparse.ReLU() + sparse_out = sparse_relu(sparse_x) + sparse_out.backward(sparse_out) + + dense_out, dense_x_grad = dense_relu(x) + assert np.array_equal(dense_out.numpy(), + sparse_out.to_dense().numpy()) + assert np.array_equal(dense_x_grad.numpy(), + sparse_x.grad.to_dense().numpy()) if __name__ == "__main__": diff --git a/python/paddle/fluid/tests/unittests/test_sparse_conv_op.py b/python/paddle/fluid/tests/unittests/test_sparse_conv_op.py new file mode 100644 index 
0000000000000..d5a61423e9c44 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_sparse_conv_op.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import unittest +import numpy as np +import paddle +from paddle import _C_ops +from paddle.fluid import core +from paddle.fluid.framework import _test_eager_guard + + +class TestSparseConv(unittest.TestCase): + def test_conv3d(self): + with _test_eager_guard(): + kernel = [[[[[1], [1], [1]], [[1], [1], [1]], [[1], [1], [1]]]]] + dense_kernel = paddle.to_tensor( + kernel, dtype='float32', stop_gradient=False) + dense_kernel = paddle.reshape(dense_kernel, [1, 3, 3, 1, 1]) + paddings = [0, 0, 0] + strides = [1, 1, 1] + dilations = [1, 1, 1] + + indices = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 2], [1, 3, 2, 3]] + values = [1, 2, 3, 4] + indices = paddle.to_tensor(indices, dtype='int32') + values = paddle.to_tensor(values, dtype='float32') + dense_shape = [1, 1, 3, 4, 1] + correct_out_values = [[4], [10]] + sparse_input = core.eager.sparse_coo_tensor(indices, values, + dense_shape, False) + out = _C_ops.final_state_sparse_conv3d(sparse_input, dense_kernel, + paddings, dilations, strides, + 1, False) + out.backward(out) + # At present, only the backward pass can be verified to work correctly + #TODO(zhangkaihuo): compare the result with dense conv + print(sparse_input.grad.values()) + assert np.array_equal(correct_out_values, out.values().numpy()) + + +#TODO: Add more test cases diff --git a/python/paddle/fluid/tests/unittests/test_sparse_copy_op.py b/python/paddle/fluid/tests/unittests/test_sparse_copy_op.py new file mode 100644 index 0000000000000..9cf5eace71bb1 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_sparse_copy_op.py @@ -0,0 +1,49 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
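The expected output `correct_out_values = [[4], [10]]` in the new conv3d test can be reproduced by hand: the COO input describes a 1x1x3x4x1 (NDHWC) volume whose nonzeros sit at (h=0, w=1)=1, (0, 3)=2, (1, 2)=3 and (2, 3)=4, and the reshaped all-ones 1x3x3x1x1 kernel with zero padding and unit stride visits exactly two positions. A small NumPy cross-check of that arithmetic (illustrative only; it does not exercise the sparse kernel itself):

    import numpy as np

    # dense view of the COO input from the test: batch 1, depth 1, a 3 x 4 grid, 1 channel
    dense = np.zeros((3, 4), dtype=np.float32)
    dense[0, 1] = 1.0
    dense[0, 3] = 2.0
    dense[1, 2] = 3.0
    dense[2, 3] = 4.0

    kernel = np.ones((3, 3), dtype=np.float32)   # all-ones 3x3 kernel, stride 1, no padding
    out = [float((dense[0:3, w:w + 3] * kernel).sum()) for w in range(4 - 3 + 1)]
    print(out)   # [4.0, 10.0] -- the two nonzero outputs checked as correct_out_values = [[4], [10]]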
+ +from __future__ import print_function +import unittest +import numpy as np +import paddle +from paddle import _C_ops +from paddle.fluid import core +from paddle.fluid.framework import _test_eager_guard + + +class TestSparseCopy(unittest.TestCase): + def test_copy_sparse_coo(self): + with _test_eager_guard(): + np_x = [[0, 1.0, 0], [2.0, 0, 0], [0, 3.0, 0]] + np_values = [1.0, 2.0, 3.0] + dense_x = paddle.to_tensor(np_x, dtype='float32') + coo_x = dense_x.to_sparse_coo(2) + + np_x_2 = [[0, 3.0, 0], [2.0, 0, 0], [0, 3.0, 0]] + dense_x_2 = paddle.to_tensor(np_x_2, dtype='float32') + coo_x_2 = dense_x_2.to_sparse_coo(2) + coo_x_2.copy_(coo_x, True) + assert np.array_equal(np_values, coo_x_2.values().numpy()) + + def test_copy_sparse_csr(self): + with _test_eager_guard(): + np_x = [[0, 1.0, 0], [2.0, 0, 0], [0, 3.0, 0]] + np_values = [1.0, 2.0, 3.0] + dense_x = paddle.to_tensor(np_x, dtype='float32') + csr_x = dense_x.to_sparse_csr() + + np_x_2 = [[0, 3.0, 0], [2.0, 0, 0], [0, 3.0, 0]] + dense_x_2 = paddle.to_tensor(np_x_2, dtype='float32') + csr_x_2 = dense_x_2.to_sparse_csr() + csr_x_2.copy_(csr_x, True) + assert np.array_equal(np_values, csr_x_2.values().numpy()) diff --git a/python/paddle/fluid/tests/unittests/test_sparse_momentum_op.py b/python/paddle/fluid/tests/unittests/test_sparse_momentum_op.py index e36cb72efc725..033dbd250ed61 100644 --- a/python/paddle/fluid/tests/unittests/test_sparse_momentum_op.py +++ b/python/paddle/fluid/tests/unittests/test_sparse_momentum_op.py @@ -163,7 +163,8 @@ def init_use_nesterov(self): pass def test_check_output(self): - self.check_output(atol=5e-3 if self.multi_precision else 1e-5) + self.check_output( + atol=5e-3 if self.multi_precision else 1e-5, check_eager=True) class TestSparseMomentumOpDtype1(TestSparseMomentumOp): @@ -240,3 +241,7 @@ def init_multi_precision(self): def init_use_nesterov(self): self.use_nesterov = False + + +if __name__ == "__main__": + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_sparse_utils_op.py b/python/paddle/fluid/tests/unittests/test_sparse_utils_op.py index 80a7db6516641..04488ac58c5fb 100644 --- a/python/paddle/fluid/tests/unittests/test_sparse_utils_op.py +++ b/python/paddle/fluid/tests/unittests/test_sparse_utils_op.py @@ -16,77 +16,176 @@ import unittest import numpy as np import paddle -from paddle import _C_ops -from paddle.fluid import core +import paddle.fluid.core as core from paddle.fluid.framework import _test_eager_guard -class TestSparseUtils(unittest.TestCase): - def test_create_sparse_coo_tensor(self): +class TestSparseCreate(unittest.TestCase): + def test_create_coo_by_tensor(self): with _test_eager_guard(): - non_zero_indices = [[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]] - non_zero_elements = [1, 2, 3, 4, 5] + indices = [[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]] + values = [1, 2, 3, 4, 5] dense_shape = [3, 4] - dense_indices = paddle.to_tensor(non_zero_indices) - dense_elements = paddle.to_tensor( - non_zero_elements, dtype='float32') - stop_gradient = False - coo = core.eager.sparse_coo_tensor(dense_indices, dense_elements, - dense_shape, stop_gradient) + dense_indices = paddle.to_tensor(indices) + dense_elements = paddle.to_tensor(values, dtype='float32') + coo = paddle.sparse.sparse_coo_tensor( + dense_indices, dense_elements, dense_shape, stop_gradient=False) + assert np.array_equal(indices, coo.indices().numpy()) + assert np.array_equal(values, coo.values().numpy()) - print(coo) + def test_create_coo_by_np(self): + with _test_eager_guard(): + indices = [[0, 1, 2], [1, 2, 0]] + values = 
[1.0, 2.0, 3.0] + dense_shape = [2, 3] + coo = paddle.sparse.sparse_coo_tensor(indices, values, dense_shape) + assert np.array_equal(indices, coo.indices().numpy()) + assert np.array_equal(values, coo.values().numpy()) - def test_create_sparse_csr_tensor(self): + def test_create_csr_by_tensor(self): with _test_eager_guard(): - non_zero_crows = [0, 2, 3, 5] - non_zero_cols = [1, 3, 2, 0, 1] - non_zero_elements = [1, 2, 3, 4, 5] + crows = [0, 2, 3, 5] + cols = [1, 3, 2, 0, 1] + values = [1, 2, 3, 4, 5] dense_shape = [3, 4] - dense_crows = paddle.to_tensor(non_zero_crows) - dense_cols = paddle.to_tensor(non_zero_cols) - dense_elements = paddle.to_tensor( - non_zero_elements, dtype='float32') + dense_crows = paddle.to_tensor(crows) + dense_cols = paddle.to_tensor(cols) + dense_elements = paddle.to_tensor(values, dtype='float32') stop_gradient = False - csr = core.eager.sparse_csr_tensor(dense_crows, dense_cols, - dense_elements, dense_shape, - stop_gradient) + csr = paddle.sparse.sparse_csr_tensor( + dense_crows, + dense_cols, + dense_elements, + dense_shape, + stop_gradient=stop_gradient) + + def test_create_csr_by_np(self): + with _test_eager_guard(): + crows = [0, 2, 3, 5] + cols = [1, 3, 2, 0, 1] + values = [1, 2, 3, 4, 5] + dense_shape = [3, 4] + csr = paddle.sparse.sparse_csr_tensor(crows, cols, values, + dense_shape) + assert np.array_equal(crows, csr.crows().numpy()) + assert np.array_equal(cols, csr.cols().numpy()) + assert np.array_equal(values, csr.values().numpy()) + + def test_place(self): + with _test_eager_guard(): + place = core.CPUPlace() + indices = [[0, 1], [0, 1]] + values = [1.0, 2.0] + dense_shape = [2, 2] + coo = paddle.sparse.sparse_coo_tensor( + indices, values, dense_shape, place=place) + assert coo.place.is_cpu_place() + assert coo.values().place.is_cpu_place() + assert coo.indices().place.is_cpu_place() + + crows = [0, 2, 3, 5] + cols = [1, 3, 2, 0, 1] + values = [1.0, 2.0, 3.0, 4.0, 5.0] + csr = paddle.sparse.sparse_csr_tensor( + crows, cols, values, [3, 5], place=place) + assert csr.place.is_cpu_place() + assert csr.crows().place.is_cpu_place() + assert csr.cols().place.is_cpu_place() + assert csr.values().place.is_cpu_place() + + def test_dtype(self): + with _test_eager_guard(): + indices = [[0, 1], [0, 1]] + values = [1.0, 2.0] + dense_shape = [2, 2] + indices = paddle.to_tensor(indices, dtype='int32') + values = paddle.to_tensor(values, dtype='float32') + coo = paddle.sparse.sparse_coo_tensor( + indices, values, dense_shape, dtype='float64') + assert coo.dtype == paddle.float64 - print(csr) + crows = [0, 2, 3, 5] + cols = [1, 3, 2, 0, 1] + values = [1.0, 2.0, 3.0, 4.0, 5.0] + csr = paddle.sparse.sparse_csr_tensor( + crows, cols, values, [3, 5], dtype='float16') + assert csr.dtype == paddle.float16 + def test_create_coo_no_shape(self): + with _test_eager_guard(): + indices = [[0, 1], [0, 1]] + values = [1.0, 2.0] + indices = paddle.to_tensor(indices, dtype='int32') + values = paddle.to_tensor(values, dtype='float32') + coo = paddle.sparse.sparse_coo_tensor(indices, values) + assert [2, 2] == coo.shape + + +class TestSparseConvert(unittest.TestCase): def test_to_sparse_coo(self): with _test_eager_guard(): x = [[0, 1, 0, 2], [0, 0, 3, 0], [4, 5, 0, 0]] - non_zero_indices = [[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]] - non_zero_elements = [1, 2, 3, 4, 5] - dense_x = paddle.to_tensor(x) + indices = [[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]] + values = [1.0, 2.0, 3.0, 4.0, 5.0] + dense_x = paddle.to_tensor(x, dtype='float32', stop_gradient=False) out = dense_x.to_sparse_coo(2) - 
print(out) - assert np.array_equal(out.non_zero_indices().numpy(), - non_zero_indices) - assert np.array_equal(out.non_zero_elements().numpy(), - non_zero_elements) + assert np.array_equal(out.indices().numpy(), indices) + assert np.array_equal(out.values().numpy(), values) + #test to_sparse_coo_grad backward + out_grad_indices = [[0, 1], [0, 1]] + out_grad_values = [2.0, 3.0] + out_grad = core.eager.sparse_coo_tensor( + paddle.to_tensor(out_grad_indices), + paddle.to_tensor(out_grad_values), out.shape, True) + out.backward(out_grad) + assert np.array_equal(dense_x.grad.numpy(), + out_grad.to_dense().numpy()) - dense_tensor = out.to_dense() - assert np.array_equal(dense_tensor.numpy(), x) + def test_coo_to_dense(self): + with _test_eager_guard(): + indices = [[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]] + values = [1.0, 2.0, 3.0, 4.0, 5.0] + sparse_x = core.eager.sparse_coo_tensor( + paddle.to_tensor(indices), + paddle.to_tensor(values), [3, 4], False) + dense_tensor = sparse_x.to_dense() + #test to_dense_grad backward + out_grad = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], + [9.0, 10.0, 11.0, 12.0]] + dense_tensor.backward(paddle.to_tensor(out_grad)) + #mask the out_grad by sparse_x.indices() + correct_x_grad = [2.0, 4.0, 7.0, 9.0, 10.0] + assert np.array_equal(correct_x_grad, + sparse_x.grad.values().numpy()) def test_to_sparse_csr(self): with _test_eager_guard(): x = [[0, 1, 0, 2], [0, 0, 3, 0], [4, 5, 0, 0]] - non_zero_crows = [0, 2, 3, 5] - non_zero_cols = [1, 3, 2, 0, 1] - non_zero_elements = [1, 2, 3, 4, 5] + crows = [0, 2, 3, 5] + cols = [1, 3, 2, 0, 1] + values = [1, 2, 3, 4, 5] dense_x = paddle.to_tensor(x) out = dense_x.to_sparse_csr() - print(out) - assert np.array_equal(out.non_zero_crows().numpy(), non_zero_crows) - assert np.array_equal(out.non_zero_cols().numpy(), non_zero_cols) - assert np.array_equal(out.non_zero_elements().numpy(), - non_zero_elements) + assert np.array_equal(out.crows().numpy(), crows) + assert np.array_equal(out.cols().numpy(), cols) + assert np.array_equal(out.values().numpy(), values) dense_tensor = out.to_dense() - print(dense_tensor) assert np.array_equal(dense_tensor.numpy(), x) + def test_coo_values_grad(self): + with _test_eager_guard(): + indices = [[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]] + values = [1.0, 2.0, 3.0, 4.0, 5.0] + sparse_x = core.eager.sparse_coo_tensor( + paddle.to_tensor(indices), + paddle.to_tensor(values), [3, 4], False) + values_tensor = sparse_x.values() + out_grad = [2.0, 3.0, 5.0, 8.0, 9.0] + # test coo_values_grad + values_tensor.backward(paddle.to_tensor(out_grad)) + assert np.array_equal(out_grad, sparse_x.grad.values().numpy()) + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_squeeze2_op.py b/python/paddle/fluid/tests/unittests/test_squeeze2_op.py index fc43a8e782382..7d7893cfda0b1 100755 --- a/python/paddle/fluid/tests/unittests/test_squeeze2_op.py +++ b/python/paddle/fluid/tests/unittests/test_squeeze2_op.py @@ -27,6 +27,10 @@ class TestSqueezeOp(OpTest): def setUp(self): self.op_type = "squeeze2" + self.python_api = paddle.squeeze + self.python_out_sig = [ + "Out" + ] # python out sig is customized output signature. 
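In the `test_coo_to_dense` case above, `correct_x_grad = [2.0, 4.0, 7.0, 9.0, 10.0]` is simply the dense `out_grad` gathered at the stored COO coordinates of `sparse_x`, i.e. the gradient of `to_dense()` with respect to the sparse values. A one-line NumPy check of that masking:

    import numpy as np

    out_grad = np.array([[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]])
    indices = np.array([[0, 0, 1, 2, 2], [1, 3, 2, 0, 1]])   # rows / cols of the stored values
    print(out_grad[indices[0], indices[1]])                  # [ 2.  4.  7.  9. 10.]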
self.init_test_case() self.inputs = {"X": np.random.random(self.ori_shape).astype("float64")} self.init_attrs() @@ -36,10 +40,10 @@ def setUp(self): } def test_check_output(self): - self.check_output(no_check_set=['XShape']) + self.check_output(no_check_set=['XShape'], check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) def init_test_case(self): self.ori_shape = (1, 3, 1, 40) diff --git a/python/paddle/fluid/tests/unittests/test_stack_op.py b/python/paddle/fluid/tests/unittests/test_stack_op.py index 76f9cf1128ac4..faabcea13aec7 100644 --- a/python/paddle/fluid/tests/unittests/test_stack_op.py +++ b/python/paddle/fluid/tests/unittests/test_stack_op.py @@ -40,6 +40,7 @@ def setUp(self): self.initDefaultParameters() self.initParameters() self.op_type = 'stack' + self.python_api = paddle.stack self.x = [] for i in range(self.num_inputs): self.x.append( @@ -55,20 +56,20 @@ def setUp(self): self.attrs = {'axis': self.axis} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(self.get_x_names(), 'Y') + self.check_grad(self.get_x_names(), 'Y', check_eager=True) class TestStackOp1(TestStackOpBase): def initParameters(self): - self.num_inputs = 16 + self.num_inputs = 8 class TestStackOp2(TestStackOpBase): def initParameters(self): - self.num_inputs = 20 + self.num_inputs = 10 class TestStackOp3(TestStackOpBase): @@ -111,6 +112,7 @@ def setUp(self): self.initDefaultParameters() self.initParameters() self.op_type = 'stack' + self.python_api = paddle.stack self.x = [] for i in range(self.num_inputs): self.x.append( @@ -128,10 +130,10 @@ def setUp(self): self.attrs = {'axis': self.axis} def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(self.get_x_names(), 'Y') + self.check_grad(self.get_x_names(), 'Y', check_eager=True) class TestStackAPIWithLoDTensorArray(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/test_stft_op.py b/python/paddle/fluid/tests/unittests/test_stft_op.py index f228c148d6e17..41e950606b3db 100644 --- a/python/paddle/fluid/tests/unittests/test_stft_op.py +++ b/python/paddle/fluid/tests/unittests/test_stft_op.py @@ -77,12 +77,12 @@ def initTestCase(self): def test_check_output(self): paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) paddle.disable_static() def test_check_grad_normal(self): paddle.enable_static() - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) paddle.disable_static() diff --git a/python/paddle/fluid/tests/unittests/test_strided_slice_op.py b/python/paddle/fluid/tests/unittests/test_strided_slice_op.py index e9be6b338fb86..ae17cb9b1b57c 100644 --- a/python/paddle/fluid/tests/unittests/test_strided_slice_op.py +++ b/python/paddle/fluid/tests/unittests/test_strided_slice_op.py @@ -58,6 +58,7 @@ class TestStrideSliceOp(OpTest): def setUp(self): self.initTestCase() self.op_type = 'strided_slice' + self.python_api = paddle.strided_slice self.output = strided_slice_native_forward( self.input, self.axes, self.starts, self.ends, self.strides) @@ -72,10 +73,10 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - self.check_grad(set(['Input']), 'Out') + self.check_grad(set(['Input']), 'Out', check_eager=True) def initTestCase(self): self.input = np.random.rand(100) @@ -704,7 
+705,7 @@ def create_case(self, net): l2.sum().backward() grads_static = net.get_all_grads() net.clear_all_grad() - # compare result of dygraph and static + # compare result of dygraph and static self.is_grads_equal(grads_static, grads_dy) self.assertTrue( np.array_equal(s1, s2), diff --git a/python/paddle/fluid/tests/unittests/test_sum_op.py b/python/paddle/fluid/tests/unittests/test_sum_op.py index 7040145a76833..6f625c097979b 100644 --- a/python/paddle/fluid/tests/unittests/test_sum_op.py +++ b/python/paddle/fluid/tests/unittests/test_sum_op.py @@ -25,6 +25,7 @@ from paddle.fluid.tests.unittests.op_test import ( OpTest, convert_float_to_uint16, convert_uint16_to_float) from paddle import _C_ops +from paddle.fluid.framework import _test_eager_guard class TestSumOp(OpTest): @@ -347,6 +348,27 @@ def test_api(self): self.assertEqual((sum_value.numpy() == expected_result).all(), True) + def test_dygraph_final_state_api(self): + with fluid.dygraph.guard(): + with _test_eager_guard(): + input0 = paddle.ones(shape=[2, 3], dtype='float32') + input1 = paddle.ones(shape=[2, 3], dtype='float32') + input0.stop_gradient = False + input1.stop_gradient = False + expected_result = np.empty((2, 3)) + expected_result.fill(2) + sum_value = paddle.add_n([input0, input1]) + self.assertEqual((sum_value.numpy() == expected_result).all(), + True) + + expected_grad_result = np.empty((2, 3)) + expected_grad_result.fill(1) + sum_value.backward() + self.assertEqual( + (input0.grad.numpy() == expected_grad_result).all(), True) + self.assertEqual( + (input1.grad.numpy() == expected_grad_result).all(), True) + class TestRaiseSumError(unittest.TestCase): def test_errors(self): diff --git a/python/paddle/fluid/tests/unittests/test_switch_autotune.py b/python/paddle/fluid/tests/unittests/test_switch_autotune.py new file mode 100644 index 0000000000000..08cf120a0366e --- /dev/null +++ b/python/paddle/fluid/tests/unittests/test_switch_autotune.py @@ -0,0 +1,147 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
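In the `test_dygraph_final_state_api` case added to test_sum_op.py above, both inputs are expected to end up with gradients of all ones: `add_n` passes its incoming gradient through to every input, and `backward()` on the non-scalar result starts from a ones gradient. The same check, written as a standalone snippet (assumes an eager-mode Paddle build):

    import numpy as np
    import paddle

    x = paddle.ones([2, 3], dtype='float32')
    y = paddle.ones([2, 3], dtype='float32')
    x.stop_gradient = False
    y.stop_gradient = False

    out = paddle.add_n([x, y])   # elementwise sum over the list -> all 2.0
    out.backward()               # non-scalar backward seeds the gradient with ones

    assert np.array_equal(x.grad.numpy(), np.ones([2, 3], dtype='float32'))
    assert np.array_equal(y.grad.numpy(), np.ones([2, 3], dtype='float32'))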
+ +import paddle +import unittest +import numpy + + +class SimpleNet(paddle.nn.Layer): + def __init__(self): + super(SimpleNet, self).__init__() + self.conv = paddle.nn.Conv2D(1, 2, (3, 3)) + + def forward(self, image, label=None): + return self.conv(image) + + +def train_dygraph(net, data): + out = net(data) + loss = paddle.mean(out) + adam = paddle.optimizer.Adam(parameters=net.parameters()) + out.backward() + adam.step() + adam.clear_grad() + + +def static_program(net, data): + out = net(data) + loss = paddle.mean(out) + adam = paddle.optimizer.Adam() + adam.minimize(loss) + return loss + + +class TestAutoTune(unittest.TestCase): + def test_autotune(self): + paddle.fluid.core.disable_autotune() + status = paddle.fluid.core.autotune_status() + self.assertEqual(status["use_autotune"], False) + + paddle.fluid.core.enable_autotune() + status = paddle.fluid.core.autotune_status() + self.assertEqual(status["use_autotune"], True) + + def check_status(self, expected_res): + status = paddle.fluid.core.autotune_status() + for key in status.keys(): + self.assertEqual(status[key], expected_res[key]) + + +class TestDygraphAutoTuneStatus(TestAutoTune): + def run_program(self, enable_autotune): + if enable_autotune: + paddle.fluid.core.enable_autotune() + else: + paddle.fluid.core.disable_autotune() + paddle.fluid.core.autotune_range(1, 2) + x_var = paddle.uniform((1, 1, 8, 8), dtype='float32', min=-1., max=1.) + net = SimpleNet() + for i in range(3): + train_dygraph(net, x_var) + if i >= 1 and i < 2: + expected_res = { + "step_id": i, + "use_autotune": enable_autotune, + "cache_size": 0, + "cache_hit_rate": 0 + } + self.check_status(expected_res) + else: + expected_res = { + "step_id": i, + "use_autotune": False, + "cache_size": 0, + "cache_hit_rate": 0 + } + self.check_status(expected_res) + + def test_enable_autotune(self): + self.run_program(enable_autotune=True) + + def test_disable_autotune(self): + self.run_program(enable_autotune=False) + + +class TestStaticAutoTuneStatus(TestAutoTune): + def run_program(self, enable_autotune): + paddle.enable_static() + if enable_autotune: + paddle.fluid.core.enable_autotune() + else: + paddle.fluid.core.disable_autotune() + paddle.fluid.core.autotune_range(1, 2) + + data_shape = [1, 1, 8, 8] + data = paddle.static.data(name='X', shape=data_shape, dtype='float32') + net = SimpleNet() + loss = static_program(net, data) + place = paddle.CUDAPlace(0) if paddle.fluid.core.is_compiled_with_cuda( + ) else paddle.CPUPlace() + exe = paddle.static.Executor(place) + exe.run(paddle.static.default_startup_program()) + x = numpy.random.random(size=data_shape).astype('float32') + + for i in range(3): + exe.run(feed={'X': x}, fetch_list=[loss]) + status = paddle.fluid.core.autotune_status() + # In static mode, the startup_program will run at first. + # The expected step_id will be increased by 1. 
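Both autotune test classes drive the same switch: `enable_autotune()`/`disable_autotune()` toggle it, `autotune_range(start, stop)` restricts tuning to a window of steps, and `autotune_status()` returns the dict whose keys (`step_id`, `use_autotune`, `cache_size`, `cache_hit_rate`) the tests compare against. A usage sketch built only from the calls these tests make; reading the range as half-open (tuning active for start <= step_id < stop) is inferred from the expected results and should be treated as an assumption:

    import paddle

    core = paddle.fluid.core
    core.enable_autotune()        # turn kernel auto-tuning on
    core.autotune_range(1, 2)     # tune only inside the step window the tests check

    # after running some training steps, inspect the tuner state
    status = core.autotune_status()
    print(status["step_id"], status["use_autotune"],
          status["cache_size"], status["cache_hit_rate"])

    core.disable_autotune()       # switch tuning back off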
+ if i >= 0 and i < 1: + expected_res = { + "step_id": i + 1, + "use_autotune": enable_autotune, + "cache_size": 0, + "cache_hit_rate": 0 + } + self.check_status(expected_res) + else: + expected_res = { + "step_id": i + 1, + "use_autotune": False, + "cache_size": 0, + "cache_hit_rate": 0 + } + self.check_status(expected_res) + paddle.disable_static() + + def test_enable_autotune(self): + self.run_program(enable_autotune=True) + + def test_disable_autotune(self): + self.run_program(enable_autotune=False) + + +if __name__ == '__main__': + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_sync_batch_norm_op.py b/python/paddle/fluid/tests/unittests/test_sync_batch_norm_op.py index 47a6d2b811552..6bf811be2ad0d 100644 --- a/python/paddle/fluid/tests/unittests/test_sync_batch_norm_op.py +++ b/python/paddle/fluid/tests/unittests/test_sync_batch_norm_op.py @@ -30,6 +30,7 @@ from paddle.fluid import Program, program_guard from op_test import OpTest, _set_use_system_allocator +from decorator_helper import prog_scope _set_use_system_allocator(True) @@ -105,6 +106,7 @@ def _build_program(self, sgd_opt.backward(out) return main, startup, [out, conv, bn] + @prog_scope() def _compare(self, place, layout, only_forward): """Compare results.""" seed = 10 diff --git a/python/paddle/fluid/tests/unittests/test_tensor_fill_.py b/python/paddle/fluid/tests/unittests/test_tensor_fill_.py index 5891aee5bd32e..2f43f129978cd 100644 --- a/python/paddle/fluid/tests/unittests/test_tensor_fill_.py +++ b/python/paddle/fluid/tests/unittests/test_tensor_fill_.py @@ -17,13 +17,14 @@ import numpy as np import six import paddle +from paddle.fluid.framework import _test_eager_guard class TensorFill_Test(unittest.TestCase): def setUp(self): self.shape = [32, 32] - def test_tensor_fill_true(self): + def func_test_tensor_fill_true(self): typelist = ['float32', 'float64', 'int32', 'int64', 'float16'] places = [fluid.CPUPlace()] if fluid.core.is_compiled_with_cuda(): @@ -46,7 +47,12 @@ def test_tensor_fill_true(self): tensor.fill_(var) #var type is basic type in typelist self.assertEqual((tensor.numpy() == target).all(), True) - def test_tensor_fill_backward(self): + def test_tensor_fill_true(self): + with _test_eager_guard(): + self.func_test_tensor_fill_true() + self.func_test_tensor_fill_true() + + def func_test_tensor_fill_backward(self): typelist = ['float32'] places = [fluid.CPUPlace()] if fluid.core.is_compiled_with_cuda(): @@ -71,13 +77,23 @@ def test_tensor_fill_backward(self): self.assertEqual((y.grad.numpy() == 0).all().item(), True) - def test_errors(self): + def test_tensor_fill_backward(self): + with _test_eager_guard(): + self.func_test_tensor_fill_backward() + self.func_test_tensor_fill_backward() + + def func_test_errors(self): def test_list(): x = paddle.to_tensor([2, 3, 4]) x.fill_([1]) self.assertRaises(TypeError, test_list) + def test_errors(self): + with _test_eager_guard(): + self.func_test_errors() + self.func_test_errors() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_.py b/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_.py index 3beb6a537eca0..ca0c97adedb94 100644 --- a/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_.py +++ b/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_.py @@ -17,10 +17,11 @@ import numpy as np import six import paddle +from paddle.fluid.framework import _test_eager_guard class TensorFillDiagonal_Test(unittest.TestCase): - def test_dim2_normal(self): + 
def func_dim2_normal(self): expected_np = np.array( [[1, 2, 2], [2, 1, 2], [2, 2, 1]]).astype('float32') expected_grad = np.array( @@ -50,7 +51,12 @@ def test_dim2_normal(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_offset(self): + def test_dim2_normal(self): + with _test_eager_guard(): + self.func_dim2_normal() + self.func_dim2_normal() + + def func_offset(self): expected_np = np.array( [[2, 2, 1], [2, 2, 2], [2, 2, 2]]).astype('float32') expected_grad = np.array( @@ -80,7 +86,12 @@ def test_offset(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_bool(self): + def test_offset(self): + with _test_eager_guard(): + self.func_offset() + self.func_offset() + + def func_bool(self): expected_np = np.array( [[False, True, True], [True, False, True], [True, True, False]]) @@ -101,7 +112,12 @@ def test_bool(self): self.assertEqual((x.numpy() == expected_np).all(), True) - def test_dim2_unnormal_wrap(self): + def test_bool(self): + with _test_eager_guard(): + self.func_bool() + self.func_bool() + + def func_dim2_unnormal_wrap(self): expected_np = np.array([[1, 2, 2], [2, 1, 2], [2, 2, 1], [2, 2, 2], [1, 2, 2], [2, 1, 2], [2, 2, 1]]).astype('float32') @@ -133,7 +149,12 @@ def test_dim2_unnormal_wrap(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_dim2_unnormal_unwrap(self): + def test_dim2_unnormal_wrap(self): + with _test_eager_guard(): + self.func_dim2_unnormal_wrap() + self.func_dim2_unnormal_wrap() + + def func_dim2_unnormal_unwrap(self): expected_np = np.array([[1, 2, 2], [2, 1, 2], [2, 2, 1], [2, 2, 2], [2, 2, 2], [2, 2, 2], [2, 2, 2]]).astype('float32') @@ -165,7 +186,12 @@ def test_dim2_unnormal_unwrap(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_dim_larger2_normal(self): + def test_dim2_unnormal_unwrap(self): + with _test_eager_guard(): + self.func_dim2_unnormal_unwrap() + self.func_dim2_unnormal_unwrap() + + def func_dim_larger2_normal(self): expected_np = np.array([[[1, 2, 2], [2, 2, 2], [2, 2, 2]], [[2, 2, 2], [ 2, 1, 2 ], [2, 2, 2]], [[2, 2, 2], [2, 2, 2], [2, 2, 1]]]).astype('float32') @@ -198,6 +224,11 @@ def test_dim_larger2_normal(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) + def test_dim_larger2_normal(self): + with _test_eager_guard(): + self.func_dim_larger2_normal() + self.func_dim_larger2_normal() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_tensor_.py b/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_tensor_.py index 2f37ccf219eb0..81ec1daa6691d 100644 --- a/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_tensor_.py +++ b/python/paddle/fluid/tests/unittests/test_tensor_fill_diagonal_tensor_.py @@ -18,6 +18,7 @@ import numpy as np import six import paddle +from paddle.fluid.framework import _test_eager_guard class TensorFillDiagTensor_Test(unittest.TestCase): @@ -27,7 +28,7 @@ def setUp(self): if fluid.core.is_compiled_with_cuda(): self.places.append(fluid.CUDAPlace(0)) - def test_dim2(self): + def func_dim2(self): expected_np = np.array( [[1, 2, 2], [2, 1, 2], [2, 2, 1], [2, 2, 2]]).astype('float32') expected_grad = np.array( @@ -54,7 +55,12 @@ def test_dim2(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_dim2_offset_1(self): + def test_dim2(self): + with _test_eager_guard(): + self.func_dim2() + self.func_dim2() + + def func_dim2_offset_1(self): expected_np = np.array( 
[[2, 2, 2], [1, 2, 2], [2, 1, 2], [2, 2, 1]]).astype('float32') expected_grad = np.array( @@ -81,7 +87,12 @@ def test_dim2_offset_1(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_dim2_offset1(self): + def test_dim2_offset_1(self): + with _test_eager_guard(): + self.func_dim2_offset_1() + self.func_dim2_offset_1() + + def func_dim2_offset1(self): expected_np = np.array( [[2, 1, 2], [2, 2, 1], [2, 2, 2], [2, 2, 2]]).astype('float32') expected_grad = np.array( @@ -108,7 +119,12 @@ def test_dim2_offset1(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_dim4(self): + def test_dim2_offset1(self): + with _test_eager_guard(): + self.func_dim2_offset1() + self.func_dim2_offset1() + + def func_dim4(self): expected_np = np.array( [[[[0, 3], [2, 2], [2, 2]], [[2, 2], [1, 4], [2, 2]], [[2, 2], [2, 2], [2, 5]], [[2, 2], [2, 2], [2, 2]]], @@ -144,7 +160,12 @@ def test_dim4(self): (y.grad.numpy().astype('float32') == expected_grad).all(), True) - def test_largedim(self): + def test_func_dim4(self): + with _test_eager_guard(): + self.func_dim4() + self.func_dim4() + + def func_largedim(self): #large dim only test on gpu because the cpu version is too slow for ci test, and the memory is limited if len(self.places) > 1: bsdim = 1024 @@ -168,6 +189,11 @@ def test_largedim(self): self.assertEqual((y == expected_pred).all(), True) self.assertEqual((y.grad == expected_grad).all(), True) + def test_largedim(self): + with _test_eager_guard(): + self.func_largedim() + self.func_largedim() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_tensor_to_list.py b/python/paddle/fluid/tests/unittests/test_tensor_to_list.py index 73b91297e6fd6..a78113030ed53 100644 --- a/python/paddle/fluid/tests/unittests/test_tensor_to_list.py +++ b/python/paddle/fluid/tests/unittests/test_tensor_to_list.py @@ -17,13 +17,14 @@ import numpy as np import six import paddle +from paddle.fluid.framework import _test_eager_guard class TensorToListTest(unittest.TestCase): def setUp(self): self.shape = [11, 25, 32, 43] - def test_tensor_tolist(self): + def func_tensor_tolist(self): places = [fluid.CPUPlace()] if fluid.core.is_compiled_with_cuda(): places.append(fluid.CUDAPlace(0)) @@ -39,6 +40,11 @@ def test_tensor_tolist(self): self.assertEqual(tensorlist, expectlist) + def test_tensor_tolist(self): + with _test_eager_guard(): + self.func_tensor_tolist() + self.func_tensor_tolist() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_tensor_uva.py b/python/paddle/fluid/tests/unittests/test_tensor_uva.py index c60d4d98d7154..4af04b8f6d41e 100644 --- a/python/paddle/fluid/tests/unittests/test_tensor_uva.py +++ b/python/paddle/fluid/tests/unittests/test_tensor_uva.py @@ -15,10 +15,12 @@ import paddle import unittest import numpy as np +from paddle.fluid import core +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestTensorCopyFrom(unittest.TestCase): - def test_main(self): + def func_main(self): if paddle.fluid.core.is_compiled_with_cuda(): place = paddle.CPUPlace() np_value = np.random.random(size=[10, 30]).astype('float32') @@ -26,9 +28,14 @@ def test_main(self): tensor._uva() self.assertTrue(tensor.place.is_gpu_place()) + def test_main(self): + with _test_eager_guard(): + self.func_main() + self.func_main() + class TestUVATensorFromNumpy(unittest.TestCase): - def test_uva_tensor_creation(self): + def func_uva_tensor_creation(self): if 
paddle.fluid.core.is_compiled_with_cuda(): dtype_list = [ "int32", "int64", "float32", "float64", "float16", "int8", @@ -36,10 +43,18 @@ def test_uva_tensor_creation(self): ] for dtype in dtype_list: data = np.random.randint(10, size=[4, 5]).astype(dtype) - tensor = paddle.fluid.core.to_uva_tensor(data, 0) + if _in_legacy_dygraph(): + tensor = paddle.fluid.core.to_uva_tensor(data, 0) + else: + tensor = core.eager.to_uva_tensor(data, 0) self.assertTrue(tensor.place.is_gpu_place()) self.assertTrue(np.allclose(tensor.numpy(), data)) + def test_uva_tensor_creation(self): + with _test_eager_guard(): + self.func_uva_tensor_creation() + self.func_uva_tensor_creation() + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_tensor_zero_.py b/python/paddle/fluid/tests/unittests/test_tensor_zero_.py index 65620038fc497..d47585f78bb7b 100644 --- a/python/paddle/fluid/tests/unittests/test_tensor_zero_.py +++ b/python/paddle/fluid/tests/unittests/test_tensor_zero_.py @@ -17,13 +17,14 @@ import numpy as np import six import paddle +from paddle.fluid.framework import _test_eager_guard class TensorFill_Test(unittest.TestCase): def setUp(self): self.shape = [32, 32] - def test_tensor_fill_true(self): + def func_test_tensor_fill_true(self): typelist = ['float32', 'float64', 'int32', 'int64', 'float16'] places = [fluid.CPUPlace()] if fluid.core.is_compiled_with_cuda(): @@ -41,6 +42,11 @@ def test_tensor_fill_true(self): tensor.zero_() self.assertEqual((tensor.numpy() == target).all().item(), True) + def test_tensor_fill_true(self): + with _test_eager_guard(): + self.func_test_tensor_fill_true() + self.func_test_tensor_fill_true() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_top_k_op.py b/python/paddle/fluid/tests/unittests/test_top_k_op.py index 52d1fda0ae299..83a940d064e76 100644 --- a/python/paddle/fluid/tests/unittests/test_top_k_op.py +++ b/python/paddle/fluid/tests/unittests/test_top_k_op.py @@ -18,6 +18,7 @@ import numpy as np from op_test import OpTest import paddle.fluid.core as core +import paddle class TestTopkOp(OpTest): @@ -61,4 +62,5 @@ def test_check_grad(self): if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_top_k_v2_op.py b/python/paddle/fluid/tests/unittests/test_top_k_v2_op.py index 4be53304733cb..c4f50414f954e 100644 --- a/python/paddle/fluid/tests/unittests/test_top_k_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_top_k_v2_op.py @@ -19,6 +19,7 @@ from op_test import OpTest import paddle import paddle.fluid.core as core +from paddle.fluid.framework import _test_eager_guard def numpy_topk(x, k=1, axis=-1, largest=True): @@ -45,6 +46,7 @@ def init_args(self): def setUp(self): self.op_type = "top_k_v2" + self.python_api = paddle.topk self.dtype = np.float64 self.input_data = np.random.rand(10, 20) self.init_args() @@ -55,12 +57,10 @@ def setUp(self): self.outputs = {'Out': output, 'Indices': indices} def test_check_output(self): - paddle.enable_static() - self.check_output() + self.check_output(check_eager=True) def test_check_grad(self): - paddle.enable_static() - self.check_grad(set(['X']), 'Out') + self.check_grad(set(['X']), 'Out', check_eager=True) class TestTopkOp1(TestTopkOp): @@ -85,6 +85,7 @@ def init_args(self): def setUp(self): self.op_type = "top_k_v2" + self.python_api = paddle.topk self.dtype = np.float64 self.input_data = np.random.rand(16, 100) self.init_args() @@ -103,6 +104,7 @@ def 
init_args(self): def setUp(self): self.op_type = "top_k_v2" + self.python_api = paddle.topk self.dtype = np.float64 self.input_data = np.random.rand(10, 10, 5) self.init_args() @@ -121,6 +123,7 @@ def init_args(self): def setUp(self): self.op_type = "top_k_v2" + self.python_api = paddle.topk self.dtype = np.float64 self.input_data = np.random.rand(10, 10, 5) self.init_args() @@ -139,6 +142,7 @@ def init_args(self): def setUp(self): self.op_type = "top_k_v2" + self.python_api = paddle.topk self.dtype = np.float64 self.input_data = np.random.rand(80, 16384) self.init_args() @@ -156,48 +160,64 @@ def setUp(self): self.large_input_data = np.random.rand(2, 1030) def run_dygraph(self, place): - paddle.disable_static(place) - input_tensor = paddle.to_tensor(self.input_data) - large_input_tensor = paddle.to_tensor(self.large_input_data) - # test case for basic test case 1 - paddle_result = paddle.topk(input_tensor, k=2) - numpy_result = numpy_topk(self.input_data, k=2) - self.assertTrue(np.allclose(paddle_result[0].numpy(), numpy_result[0])) - self.assertTrue(np.allclose(paddle_result[1].numpy(), numpy_result[1])) - # test case for basic test case 2 with axis - paddle_result = paddle.topk(input_tensor, k=2, axis=1) - numpy_result = numpy_topk(self.input_data, k=2, axis=1) - self.assertTrue(np.allclose(paddle_result[0].numpy(), numpy_result[0])) - self.assertTrue(np.allclose(paddle_result[1].numpy(), numpy_result[1])) - # test case for basic test case 3 with tensor K - k_tensor = paddle.to_tensor(np.array([2])) - paddle_result = paddle.topk(input_tensor, k=k_tensor, axis=1) - numpy_result = numpy_topk(self.input_data, k=2, axis=1) - self.assertTrue(np.allclose(paddle_result[0].numpy(), numpy_result[0])) - self.assertTrue(np.allclose(paddle_result[1].numpy(), numpy_result[1])) - # test case for basic test case 4 with tensor largest - k_tensor = paddle.to_tensor(np.array([2])) - paddle_result = paddle.topk(input_tensor, k=2, axis=1, largest=False) - numpy_result = numpy_topk(self.input_data, k=2, axis=1, largest=False) - self.assertTrue(np.allclose(paddle_result[0].numpy(), numpy_result[0])) - self.assertTrue(np.allclose(paddle_result[1].numpy(), numpy_result[1])) - # test case for basic test case 5 with axis -1 - k_tensor = paddle.to_tensor(np.array([2])) - paddle_result = paddle.topk(input_tensor, k=2, axis=-1, largest=False) - numpy_result = numpy_topk(self.input_data, k=2, axis=-1, largest=False) - self.assertTrue(np.allclose(paddle_result[0].numpy(), numpy_result[0])) - self.assertTrue(np.allclose(paddle_result[1].numpy(), numpy_result[1])) - # test case for basic test case 6 for the partial sort - paddle_result = paddle.topk(large_input_tensor, k=1, axis=-1) - numpy_result = numpy_topk(self.large_input_data, k=1, axis=-1) - self.assertTrue(np.allclose(paddle_result[0].numpy(), numpy_result[0])) - self.assertTrue(np.allclose(paddle_result[1].numpy(), numpy_result[1])) - # test case for basic test case 7 for the unsorted - paddle_result = paddle.topk(input_tensor, k=2, axis=1, sorted=False) - sort_paddle = numpy_topk( - np.array(paddle_result[0].numpy()), axis=1, k=2) - numpy_result = numpy_topk(self.input_data, k=2, axis=1) - self.assertTrue(np.allclose(sort_paddle[0], numpy_result[0])) + with paddle.fluid.dygraph.guard(place): + input_tensor = paddle.to_tensor(self.input_data) + large_input_tensor = paddle.to_tensor(self.large_input_data) + # test case for basic test case 1 + paddle_result = paddle.topk(input_tensor, k=2) + numpy_result = numpy_topk(self.input_data, k=2) + self.assertTrue( + 
np.allclose(paddle_result[0].numpy(), numpy_result[0])) + self.assertTrue( + np.allclose(paddle_result[1].numpy(), numpy_result[1])) + # test case for basic test case 2 with axis + paddle_result = paddle.topk(input_tensor, k=2, axis=1) + numpy_result = numpy_topk(self.input_data, k=2, axis=1) + self.assertTrue( + np.allclose(paddle_result[0].numpy(), numpy_result[0])) + self.assertTrue( + np.allclose(paddle_result[1].numpy(), numpy_result[1])) + # test case for basic test case 3 with tensor K + k_tensor = paddle.to_tensor(np.array([2])) + paddle_result = paddle.topk(input_tensor, k=k_tensor, axis=1) + numpy_result = numpy_topk(self.input_data, k=2, axis=1) + self.assertTrue( + np.allclose(paddle_result[0].numpy(), numpy_result[0])) + self.assertTrue( + np.allclose(paddle_result[1].numpy(), numpy_result[1])) + # test case for basic test case 4 with tensor largest + k_tensor = paddle.to_tensor(np.array([2])) + paddle_result = paddle.topk( + input_tensor, k=2, axis=1, largest=False) + numpy_result = numpy_topk( + self.input_data, k=2, axis=1, largest=False) + self.assertTrue( + np.allclose(paddle_result[0].numpy(), numpy_result[0])) + self.assertTrue( + np.allclose(paddle_result[1].numpy(), numpy_result[1])) + # test case for basic test case 5 with axis -1 + k_tensor = paddle.to_tensor(np.array([2])) + paddle_result = paddle.topk( + input_tensor, k=2, axis=-1, largest=False) + numpy_result = numpy_topk( + self.input_data, k=2, axis=-1, largest=False) + self.assertTrue( + np.allclose(paddle_result[0].numpy(), numpy_result[0])) + self.assertTrue( + np.allclose(paddle_result[1].numpy(), numpy_result[1])) + # test case for basic test case 6 for the partial sort + paddle_result = paddle.topk(large_input_tensor, k=1, axis=-1) + numpy_result = numpy_topk(self.large_input_data, k=1, axis=-1) + self.assertTrue( + np.allclose(paddle_result[0].numpy(), numpy_result[0])) + self.assertTrue( + np.allclose(paddle_result[1].numpy(), numpy_result[1])) + # test case for basic test case 7 for the unsorted + paddle_result = paddle.topk(input_tensor, k=2, axis=1, sorted=False) + sort_paddle = numpy_topk( + np.array(paddle_result[0].numpy()), axis=1, k=2) + numpy_result = numpy_topk(self.input_data, k=2, axis=1) + self.assertTrue(np.allclose(sort_paddle[0], numpy_result[0])) def run_static(self, place): paddle.enable_static() @@ -264,14 +284,15 @@ def test_cases(self): self.run_static(place) def test_errors(self): - paddle.disable_static() - x = paddle.to_tensor([1, 2, 3]) - with self.assertRaises(BaseException): - paddle.topk(x, k=-1) + with paddle.fluid.dygraph.guard(): + x = paddle.to_tensor([1, 2, 3]) + with self.assertRaises(BaseException): + paddle.topk(x, k=-1) - with self.assertRaises(BaseException): - paddle.topk(x, k=0) + with self.assertRaises(BaseException): + paddle.topk(x, k=0) if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_traced_layer_err_msg.py b/python/paddle/fluid/tests/unittests/test_traced_layer_err_msg.py index 3b9fbd69e9d0a..5703ce1313176 100644 --- a/python/paddle/fluid/tests/unittests/test_traced_layer_err_msg.py +++ b/python/paddle/fluid/tests/unittests/test_traced_layer_err_msg.py @@ -1,4 +1,5 @@ # Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
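The changes to the test_tensor_* files above all follow the same refactor: the original test body is renamed to a `func_*` method, and the public `test_*` method now runs it twice, once inside `_test_eager_guard()` (the eager, final-state dygraph mode) and once in legacy dygraph. Reduced to a minimal, self-contained example (the class and method names here are placeholders):

    import unittest
    from paddle.fluid.framework import _test_eager_guard

    class ExampleEagerAndLegacyTest(unittest.TestCase):   # placeholder name
        def func_check_something(self):
            pass  # the original test body is moved here unchanged

        def test_check_something(self):
            with _test_eager_guard():
                self.func_check_something()   # eager (final-state) pass
            self.func_check_something()       # legacy dygraph pass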
@@ -52,6 +53,8 @@ def setUp(self): self.type_str = 'class' def test_trace_err(self): + if fluid.framework.in_dygraph_mode(): + return with fluid.dygraph.guard(): in_x = fluid.dygraph.to_variable( np.random.random((self.batch_size, self.feature_size)).astype( @@ -80,6 +83,8 @@ def test_trace_err(self): self.layer, [in_x]) def test_set_strategy_err(self): + if fluid.framework.in_dygraph_mode(): + return with fluid.dygraph.guard(): in_x = fluid.dygraph.to_variable( np.random.random((self.batch_size, self.feature_size)).astype( @@ -105,6 +110,8 @@ def test_set_strategy_err(self): fluid.ExecutionStrategy()) def test_save_inference_model_err(self): + if fluid.framework.in_dygraph_mode(): + return with fluid.dygraph.guard(): in_x = fluid.dygraph.to_variable( np.random.random((self.batch_size, self.feature_size)).astype( @@ -169,6 +176,8 @@ def _train_simple_net(self): class TestOutVarWithNoneErrMsg(unittest.TestCase): def test_linear_net_with_none(self): + if fluid.framework.in_dygraph_mode(): + return model = LinearNetWithNone(100, 16) in_x = paddle.to_tensor(np.random.random((4, 100)).astype('float32')) with self.assertRaises(TypeError): @@ -186,6 +195,8 @@ def setUp(self): shutil.rmtree(os.path.dirname(self.save_path)) def test_mkdir_when_input_path_non_exist(self): + if fluid.framework.in_dygraph_mode(): + return fc_layer = SimpleFCLayer(3, 4, 2) input_var = paddle.to_tensor(np.random.random([4, 3]).astype('float32')) with fluid.dygraph.guard(): diff --git a/python/paddle/fluid/tests/unittests/test_tril_triu_op.py b/python/paddle/fluid/tests/unittests/test_tril_triu_op.py index cdb5f66f57892..00f6169fa3103 100644 --- a/python/paddle/fluid/tests/unittests/test_tril_triu_op.py +++ b/python/paddle/fluid/tests/unittests/test_tril_triu_op.py @@ -28,6 +28,7 @@ class TrilTriuOpDefaultTest(OpTest): def setUp(self): self.initTestCase() + self.python_api = paddle.tril if self.real_op_type == 'tril' else paddle.triu self.real_np_op = getattr(np, self.real_op_type) self.op_type = "tril_triu" @@ -42,10 +43,10 @@ def setUp(self): } def test_check_output(self): - self.check_output() + self.check_output(check_eager=True) def test_check_grad_normal(self): - self.check_grad(['X'], 'Out') + self.check_grad(['X'], 'Out', check_eager=True) def initTestCase(self): self.real_op_type = np.random.choice(['triu', 'tril']) diff --git a/python/paddle/fluid/tests/unittests/test_trilinear_interp_op.py b/python/paddle/fluid/tests/unittests/test_trilinear_interp_op.py index 2778fa0c6ace4..49699b8fafd03 100755 --- a/python/paddle/fluid/tests/unittests/test_trilinear_interp_op.py +++ b/python/paddle/fluid/tests/unittests/test_trilinear_interp_op.py @@ -131,6 +131,9 @@ def setUp(self): self.data_layout = 'NCDHW' self.init_test_case() self.op_type = "trilinear_interp" + # NOTE(dev): some AsDispensible input is not used under imperative mode. + # Skip check_eager while found them in Inputs. 
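The trilinear_interp hunks around this point follow the note above: the eager comparison defaults to on (`self.check_eager = True`) and is switched off whenever a dispensable input such as `OutSize` or `SizeTensor` is fed, and the flag is then threaded into `check_output`/`check_grad`. A condensed structural sketch of that gating (the `init_inputs` helper is hypothetical; the field names come from the hunks):

    from op_test import OpTest

    class TrilinearInterpEagerGate(OpTest):    # hypothetical condensed subclass
        def setUp(self):
            self.op_type = "trilinear_interp"
            self.check_eager = True            # assume the eager path can be compared
            self.inputs = self.init_inputs()   # hypothetical helper building the inputs dict
            if 'OutSize' in self.inputs or 'SizeTensor' in self.inputs:
                # dispensable inputs are not consumed under imperative mode,
                # so the eager comparison is skipped for these cases
                self.check_eager = False

        def test_check_output(self):
            self.check_output(check_eager=self.check_eager)

        def test_check_grad(self):
            self.check_grad(['X'], 'Out', in_place=True, check_eager=self.check_eager)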
+ self.check_eager = True input_np = np.random.random(self.input_shape).astype("float32") if self.data_layout == "NCDHW": @@ -157,8 +160,10 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False if self.actual_shape is not None: self.inputs['OutSize'] = self.actual_shape + self.check_eager = False # c++ end treat NCDHW the same way as NCHW if self.data_layout == 'NCDHW': data_layout = 'NCHW' @@ -177,10 +182,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'trilinear' @@ -326,6 +332,7 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "trilinear_interp" + self.check_eager = True input_np = np.random.randint( low=0, high=256, size=self.input_shape).astype("uint8") @@ -344,6 +351,7 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False self.attrs = { 'out_d': self.out_d, @@ -357,7 +365,8 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output_with_place(place=core.CPUPlace(), atol=1) + self.check_output_with_place( + place=core.CPUPlace(), atol=1, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'trilinear' @@ -467,6 +476,7 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "trilinear_interp" + self.check_eager = True self.shape_by_1Dtensor = False self.scale_by_1Dtensor = False self.attrs = { @@ -492,12 +502,14 @@ def setUp(self): if self.shape_by_1Dtensor: self.inputs['OutSize'] = self.out_size + self.check_eager = False elif self.out_size is not None: size_tensor = [] for index, ele in enumerate(self.out_size): size_tensor.append(("x" + str(index), np.ones( (1)).astype('int32') * ele)) self.inputs['SizeTensor'] = size_tensor + self.check_eager = False self.attrs['out_d'] = self.out_d self.attrs['out_h'] = self.out_h @@ -508,10 +520,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'trilinear' diff --git a/python/paddle/fluid/tests/unittests/test_trilinear_interp_v2_op.py b/python/paddle/fluid/tests/unittests/test_trilinear_interp_v2_op.py index 9f46b539a04b6..6d072e3c377fe 100755 --- a/python/paddle/fluid/tests/unittests/test_trilinear_interp_v2_op.py +++ b/python/paddle/fluid/tests/unittests/test_trilinear_interp_v2_op.py @@ -145,6 +145,10 @@ def setUp(self): self.data_layout = 'NCDHW' self.init_test_case() self.op_type = "trilinear_interp_v2" + # NOTE(dev): some AsDispensible input is not used under imperative mode. + # Skip check_eager while found them in Inputs. 
+ # TODO(dev): add self.python_api + self.check_eager = False input_np = np.random.random(self.input_shape).astype("float32") scale_w = 0 @@ -183,8 +187,10 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False if self.actual_shape is not None: self.inputs['OutSize'] = self.actual_shape + self.check_eager = False # c++ end treat NCDHW the same way as NCHW if self.data_layout == 'NCDHW': data_layout = 'NCHW' @@ -208,10 +214,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'trilinear' @@ -357,6 +364,8 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "trilinear_interp_v2" + # TODO(dev): add self.python_api + self.check_eager = False input_np = np.random.randint( low=0, high=256, size=self.input_shape).astype("uint8") @@ -383,6 +392,7 @@ def setUp(self): self.inputs = {'X': input_np} if self.out_size is not None: self.inputs['OutSize'] = self.out_size + self.check_eager = False self.attrs = { 'out_d': self.out_d, @@ -401,7 +411,8 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output_with_place(place=core.CPUPlace(), atol=1) + self.check_output_with_place( + place=core.CPUPlace(), atol=1, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'trilinear' @@ -511,6 +522,8 @@ def setUp(self): self.actual_shape = None self.init_test_case() self.op_type = "trilinear_interp_v2" + # TODO(dev): add self.python_api + self.check_eager = False self.shape_by_1Dtensor = False self.scale_by_1Dtensor = False self.attrs = { @@ -543,12 +556,14 @@ def setUp(self): if self.shape_by_1Dtensor: self.inputs['OutSize'] = self.out_size + self.check_eager = False elif self.out_size is not None: size_tensor = [] for index, ele in enumerate(self.out_size): size_tensor.append(("x" + str(index), np.ones( (1)).astype('int32') * ele)) self.inputs['SizeTensor'] = size_tensor + self.check_eager = False self.attrs['out_d'] = self.out_d self.attrs['out_h'] = self.out_h @@ -565,10 +580,11 @@ def setUp(self): self.outputs = {'Out': output_np} def test_check_output(self): - self.check_output() + self.check_output(check_eager=self.check_eager) def test_check_grad(self): - self.check_grad(['X'], 'Out', in_place=True) + self.check_grad( + ['X'], 'Out', in_place=True, check_eager=self.check_eager) def init_test_case(self): self.interp_method = 'trilinear' diff --git a/python/paddle/fluid/tests/unittests/test_truncated_gaussian_random_op.py b/python/paddle/fluid/tests/unittests/test_truncated_gaussian_random_op.py index 4abeae77d26e8..fe28e0c9638b4 100644 --- a/python/paddle/fluid/tests/unittests/test_truncated_gaussian_random_op.py +++ b/python/paddle/fluid/tests/unittests/test_truncated_gaussian_random_op.py @@ -17,10 +17,13 @@ import unittest import numpy +import paddle import paddle.fluid as fluid import paddle.fluid.core as core +from op_test import OpTest from paddle.fluid.op import Operator from paddle.fluid.executor import Executor +from paddle.fluid.framework import _test_eager_guard class TestTrunctedGaussianRandomOp(unittest.TestCase): @@ -33,15 +36,16 @@ def setUp(self): "std": 1., "seed": 10, } - self.outputs = ["Out"] def 
test_cpu(self): self.gaussian_random_test(place=fluid.CPUPlace()) + self.gaussian_random_test_eager(place=fluid.CPUPlace()) def test_gpu(self): if core.is_compiled_with_cuda(): self.gaussian_random_test(place=fluid.CUDAPlace(0)) + self.gaussian_random_test_eager(place=fluid.CUDAPlace(0)) def gaussian_random_test(self, place): @@ -64,6 +68,17 @@ def gaussian_random_test(self, place): self.assertAlmostEqual(numpy.mean(tensor), .0, delta=0.1) self.assertAlmostEqual(numpy.var(tensor), 0.773, delta=0.1) + # TruncatedNormal.__call__ has no return value, so here call _C_ops api + # directly + def gaussian_random_test_eager(self, place): + with fluid.dygraph.guard(place): + with _test_eager_guard(): + out = paddle._C_ops.final_state_truncated_gaussian_random( + self.attrs["shape"], self.attrs["mean"], self.attrs["std"], + self.attrs["seed"], core.VarDesc.VarType.FP32, place) + self.assertAlmostEqual(numpy.mean(out.numpy()), .0, delta=0.1) + self.assertAlmostEqual(numpy.var(out.numpy()), 0.773, delta=0.1) + if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_unbind_op.py b/python/paddle/fluid/tests/unittests/test_unbind_op.py index e16fb6ddaacd7..43f2f3526ac0f 100644 --- a/python/paddle/fluid/tests/unittests/test_unbind_op.py +++ b/python/paddle/fluid/tests/unittests/test_unbind_op.py @@ -17,9 +17,11 @@ import unittest import numpy as np from op_test import OpTest, convert_float_to_uint16 +import paddle import paddle.fluid as fluid import paddle.tensor as tensor from paddle.fluid import compiler, Program, program_guard, core +from paddle.fluid.framework import _test_eager_guard class TestUnbind(unittest.TestCase): @@ -39,6 +41,25 @@ def test_unbind(self): assert np.array_equal(res_1, input_1[0, 0:100]) assert np.array_equal(res_2, input_1[1, 0:100]) + def test_unbind_dygraph(self): + with fluid.dygraph.guard(): + np_x = np.random.random([2, 3]).astype("float32") + x = paddle.to_tensor(np_x) + x.stop_gradient = False + [res_1, res_2] = paddle.unbind(x, 0) + self.assertTrue(np.array_equal(res_1, np_x[0, 0:100])) + self.assertTrue(np.array_equal(res_2, np_x[1, 0:100])) + + out = paddle.add_n([res_1, res_2]) + + np_grad = np.ones(x.shape, np.float32) + out.backward() + self.assertTrue(np.array_equal(x.grad.numpy(), np_grad)) + + def test_unbind_dygraph_final_state(self): + with _test_eager_guard(): + self.test_unbind_dygraph() + class TestLayersUnbind(unittest.TestCase): def test_layers_unbind(self): @@ -157,6 +178,7 @@ def outReshape(self): class TestUnbindBF16Op(OpTest): def setUp(self): self._set_op_type() + self.python_api = paddle.unbind self.dtype = self.get_dtype() self.axis = 0 self.num = 3 diff --git a/python/paddle/fluid/tests/unittests/test_uniform_random_op.py b/python/paddle/fluid/tests/unittests/test_uniform_random_op.py index 41b6ed36d65cc..683cc2fdf867e 100644 --- a/python/paddle/fluid/tests/unittests/test_uniform_random_op.py +++ b/python/paddle/fluid/tests/unittests/test_uniform_random_op.py @@ -573,37 +573,46 @@ def test_fixed_random_number(self): if not "V100" in paddle.device.cuda.get_device_name(): return - if os.getenv("FLAGS_use_curand", None) in ('0', 'False', None): - return - - def _check_random_value(dtype, expect, expect_mean, expect_std): - x = paddle.rand([32, 3, 1024, 1024], dtype=dtype) - actual = x.numpy() - self.assertTrue(np.allclose(actual[2, 1, 512, 1000:1010], expect)) - self.assertEqual(np.mean(actual), expect_mean) - self.assertEqual(np.std(actual), expect_std) - print("Test Fixed Random number on V100 GPU------>") 
paddle.disable_static() + paddle.set_device('gpu') paddle.seed(2021) + + expect_mean = 0.50000454338820143895816272561205551028251647949218750 + expect_std = 0.28867379167297479991560749112977646291255950927734375 expect = [ 0.55298901, 0.65184678, 0.49375412, 0.57943639, 0.16459608, 0.67181056, 0.03021481, 0.0238559, 0.07742096, 0.55972187 ] - expect_mean = 0.50000454338820143895816272561205551028251647949218750 - expect_std = 0.28867379167297479991560749112977646291255950927734375 - _check_random_value(core.VarDesc.VarType.FP64, expect, expect_mean, - expect_std) + out = paddle.rand([32, 3, 1024, 1024], dtype='float64').numpy() + self.assertEqual(np.mean(out), expect_mean) + self.assertEqual(np.std(out), expect_std) + self.assertTrue(np.allclose(out[2, 1, 512, 1000:1010], expect)) + expect_mean = 0.50002604722976684570312500 + expect_std = 0.2886914908885955810546875 expect = [ 0.45320973, 0.17582087, 0.725341, 0.30849215, 0.622257, 0.46352342, 0.97228295, 0.12771158, 0.286525, 0.9810645 ] - expect_mean = 0.50002604722976684570312500 - expect_std = 0.2886914908885955810546875 - _check_random_value(core.VarDesc.VarType.FP32, expect, expect_mean, - expect_std) + out = paddle.rand([32, 3, 1024, 1024], dtype='float32').numpy() + self.assertEqual(np.mean(out), expect_mean) + self.assertEqual(np.std(out), expect_std) + self.assertTrue(np.allclose(out[2, 1, 512, 1000:1010], expect)) + + expect_mean = 25.11843109130859375 + expect_std = 43.370647430419921875 + expect = [ + 30.089634, 77.05225, 3.1201615, 68.34072, 59.266724, -25.33281, + 12.973292, 27.41127, -17.412298, 27.931019 + ] + out = paddle.empty( + [16, 16, 16, 16], dtype='float32').uniform_(-50, 100).numpy() + self.assertEqual(np.mean(out), expect_mean) + self.assertEqual(np.std(out), expect_std) + self.assertTrue(np.allclose(out[10, 10, 10, 0:10], expect)) + paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/test_unique.py b/python/paddle/fluid/tests/unittests/test_unique.py index a4bef436e1375..71dce5cc463cf 100644 --- a/python/paddle/fluid/tests/unittests/test_unique.py +++ b/python/paddle/fluid/tests/unittests/test_unique.py @@ -21,6 +21,7 @@ import paddle.fluid as fluid import paddle.fluid.core as core from paddle.fluid.op import Operator +from paddle.fluid.framework import _test_eager_guard class TestUniqueOp(OpTest): @@ -251,6 +252,12 @@ def test_dygraph_attr_dtype(self): self.assertTrue((counts.numpy() == np_counts).all(), True) paddle.enable_static() + def test_dygraph_final_state_api(self): + with _test_eager_guard(): + self.test_dygraph_api_out() + self.test_dygraph_api_attr() + self.test_dygraph_attr_dtype() + def test_static_graph(self): with paddle.static.program_guard(paddle.static.Program(), paddle.static.Program()): diff --git a/python/paddle/fluid/tests/unittests/test_unsqueeze2_op.py b/python/paddle/fluid/tests/unittests/test_unsqueeze2_op.py index b75e32f2bad14..af9d3db629581 100755 --- a/python/paddle/fluid/tests/unittests/test_unsqueeze2_op.py +++ b/python/paddle/fluid/tests/unittests/test_unsqueeze2_op.py @@ -29,6 +29,8 @@ class TestUnsqueezeOp(OpTest): def setUp(self): self.init_test_case() self.op_type = "unsqueeze2" + self.python_api = paddle.unsqueeze + self.python_out_sig = ["Out"] self.inputs = {"X": np.random.random(self.ori_shape).astype("float64")} self.init_attrs() self.outputs = { @@ -37,10 +39,10 @@ def setUp(self): } def test_check_output(self): - self.check_output(no_check_set=["XShape"]) + self.check_output(no_check_set=["XShape"], check_eager=True) def test_check_grad(self): - 
self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) def init_test_case(self): self.ori_shape = (3, 40) @@ -88,6 +90,8 @@ class TestUnsqueezeOp_AxesTensorList(OpTest): def setUp(self): self.init_test_case() self.op_type = "unsqueeze2" + self.python_out_sig = ["Out"] + self.python_api = paddle.unsqueeze axes_tensor_list = [] for index, ele in enumerate(self.axes): @@ -105,10 +109,10 @@ def setUp(self): } def test_check_output(self): - self.check_output(no_check_set=["XShape"]) + self.check_output(no_check_set=["XShape"], check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) def init_test_case(self): self.ori_shape = (20, 5) @@ -152,6 +156,8 @@ class TestUnsqueezeOp_AxesTensor(OpTest): def setUp(self): self.init_test_case() self.op_type = "unsqueeze2" + self.python_out_sig = ["Out"] + self.python_api = paddle.unsqueeze self.inputs = { "X": np.random.random(self.ori_shape).astype("float64"), @@ -164,10 +170,10 @@ def setUp(self): } def test_check_output(self): - self.check_output(no_check_set=["XShape"]) + self.check_output(no_check_set=["XShape"], check_eager=True) def test_check_grad(self): - self.check_grad(["X"], "Out") + self.check_grad(["X"], "Out", check_eager=True) def init_test_case(self): self.ori_shape = (20, 5) diff --git a/python/paddle/fluid/tests/unittests/test_var_base.py b/python/paddle/fluid/tests/unittests/test_var_base.py index b648caf750e96..724a71ebe3dda 100644 --- a/python/paddle/fluid/tests/unittests/test_var_base.py +++ b/python/paddle/fluid/tests/unittests/test_var_base.py @@ -31,7 +31,7 @@ def setUp(self): self.dtype = np.float32 self.array = np.random.uniform(0.1, 1, self.shape).astype(self.dtype) - def test_to_tensor(self): + def func_test_to_tensor(self): def _test_place(place): with fluid.dygraph.guard(): paddle.set_default_dtype('float32') @@ -262,7 +262,12 @@ def _test_place(place): _test_place(core.NPUPlace(0)) _test_place("npu:0") - def test_to_tensor_not_change_input_stop_gradient(self): + def test_to_tensor(self): + with _test_eager_guard(): + self.func_test_to_tensor() + self.func_test_to_tensor() + + def func_test_to_tensor_not_change_input_stop_gradient(self): with paddle.fluid.dygraph.guard(core.CPUPlace()): a = paddle.zeros([1024]) a.stop_gradient = False @@ -270,7 +275,12 @@ def test_to_tensor_not_change_input_stop_gradient(self): self.assertEqual(a.stop_gradient, False) self.assertEqual(b.stop_gradient, True) - def test_to_tensor_change_place(self): + def test_to_tensor_not_change_input_stop_gradient(self): + with _test_eager_guard(): + self.func_test_to_tensor_not_change_input_stop_gradient() + self.func_test_to_tensor_not_change_input_stop_gradient() + + def func_test_to_tensor_change_place(self): if core.is_compiled_with_cuda(): a_np = np.random.rand(1024, 1024) with paddle.fluid.dygraph.guard(core.CPUPlace()): @@ -288,7 +298,12 @@ def test_to_tensor_change_place(self): a = paddle.to_tensor(a, place=paddle.CUDAPinnedPlace()) self.assertEqual(a.place.__repr__(), "Place(gpu_pinned)") - def test_to_tensor_with_lodtensor(self): + def test_to_tensor_change_place(self): + with _test_eager_guard(): + self.func_test_to_tensor_change_place() + self.func_test_to_tensor_change_place() + + def func_test_to_tensor_with_lodtensor(self): if core.is_compiled_with_cuda(): a_np = np.random.rand(1024, 1024) with paddle.fluid.dygraph.guard(core.CPUPlace()): @@ -304,7 +319,12 @@ def test_to_tensor_with_lodtensor(self): self.assertTrue(np.array_equal(a_np, a.numpy())) 
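The test_var_base.py hunks above and below all apply the same migration pattern: each original `test_xxx` body is renamed to `func_test_xxx`, and the public `test_xxx` wrapper then runs it twice, once under `_test_eager_guard()` and once in legacy dygraph mode. A minimal, self-contained sketch of that pattern (the test name and body here are illustrative, not part of the PR):

```python
import unittest

import numpy as np
import paddle
from paddle.fluid.framework import _test_eager_guard


class ExampleEagerMigration(unittest.TestCase):
    def func_test_scale(self):
        # original test body, unchanged by the migration
        x = paddle.to_tensor(np.array([1.0, 2.0], dtype="float32"))
        self.assertTrue(np.allclose((x * 2.0).numpy(), [2.0, 4.0]))

    def test_scale(self):
        # run once under the eager guard, then once in legacy dygraph mode
        with _test_eager_guard():
            self.func_test_scale()
        self.func_test_scale()


if __name__ == "__main__":
    unittest.main()
```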
self.assertTrue(a.place.__repr__(), "Place(cpu)") - def test_to_variable(self): + def test_to_tensor_with_lodtensor(self): + with _test_eager_guard(): + self.func_test_to_tensor_with_lodtensor() + self.func_test_to_tensor_with_lodtensor() + + def func_test_to_variable(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array, name="abc") self.assertTrue(np.array_equal(var.numpy(), self.array)) @@ -323,7 +343,12 @@ def test_to_variable(self): linear = fluid.dygraph.Linear(32, 64) var = linear._helper.to_variable("test", name="abc") - def test_list_to_variable(self): + def test_to_variable(self): + with _test_eager_guard(): + self.func_test_to_variable() + self.func_test_to_variable() + + def func_test_list_to_variable(self): with fluid.dygraph.guard(): array = [[[1, 2], [1, 2], [1.0, 2]], [[1, 2], [1, 2], [1, 2]]] var = fluid.dygraph.to_variable(array, dtype='int32') @@ -332,7 +357,12 @@ def test_list_to_variable(self): self.assertEqual(var.dtype, core.VarDesc.VarType.INT32) self.assertEqual(var.type, core.VarDesc.VarType.LOD_TENSOR) - def test_tuple_to_variable(self): + def test_list_to_variable(self): + with _test_eager_guard(): + self.func_test_list_to_variable() + self.func_test_list_to_variable() + + def func_test_tuple_to_variable(self): with fluid.dygraph.guard(): array = (((1, 2), (1, 2), (1, 2)), ((1, 2), (1, 2), (1, 2))) var = fluid.dygraph.to_variable(array, dtype='float32') @@ -341,14 +371,24 @@ def test_tuple_to_variable(self): self.assertEqual(var.dtype, core.VarDesc.VarType.FP32) self.assertEqual(var.type, core.VarDesc.VarType.LOD_TENSOR) - def test_tensor_to_variable(self): + def test_tuple_to_variable(self): + with _test_eager_guard(): + self.func_test_tuple_to_variable() + self.func_test_tuple_to_variable() + + def func_test_tensor_to_variable(self): with fluid.dygraph.guard(): t = fluid.Tensor() t.set(np.random.random((1024, 1024)), fluid.CPUPlace()) var = fluid.dygraph.to_variable(t) self.assertTrue(np.array_equal(t, var.numpy())) - def test_leaf_tensor(self): + def test_tensor_to_variable(self): + with _test_eager_guard(): + self.func_test_tensor_to_variable() + self.func_test_tensor_to_variable() + + def func_test_leaf_tensor(self): with fluid.dygraph.guard(): x = paddle.to_tensor(np.random.uniform(-1, 1, size=[10, 10])) self.assertTrue(x.is_leaf) @@ -374,7 +414,12 @@ def test_leaf_tensor(self): self.assertTrue(linear.bias.is_leaf) self.assertFalse(out.is_leaf) - def test_detach(self): + def test_leaf_tensor(self): + with _test_eager_guard(): + self.func_test_leaf_tensor() + self.func_test_leaf_tensor() + + def func_test_detach(self): with fluid.dygraph.guard(): x = paddle.to_tensor(1.0, dtype="float64", stop_gradient=False) detach_x = x.detach() @@ -407,7 +452,12 @@ def test_detach(self): detach_x[:] = 5.0 y.backward() - def test_write_property(self): + def test_detach(self): + with _test_eager_guard(): + self.func_test_detach() + self.func_test_detach() + + def func_test_write_property(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) @@ -423,9 +473,17 @@ def test_write_property(self): var.stop_gradient = False self.assertEqual(var.stop_gradient, False) - def test_deep_copy(self): + def test_write_property(self): + with _test_eager_guard(): + self.func_test_write_property() + self.func_test_write_property() + + def func_test_deep_copy(self): with fluid.dygraph.guard(): - empty_var = core.VarBase() + if _in_legacy_dygraph(): + empty_var = core.VarBase() + else: + empty_var = core.eager.Tensor() empty_var_copy = 
copy.deepcopy(empty_var) self.assertEqual(empty_var.stop_gradient, empty_var_copy.stop_gradient) @@ -462,9 +520,15 @@ def test_deep_copy(self): self.assertEqual(id(y_copy), id(y_copy2)) # test copy selected rows - x = core.VarBase(core.VarDesc.VarType.FP32, [3, 100], - "selected_rows", - core.VarDesc.VarType.SELECTED_ROWS, True) + if _in_legacy_dygraph(): + x = core.VarBase(core.VarDesc.VarType.FP32, [3, 100], + "selected_rows", + core.VarDesc.VarType.SELECTED_ROWS, True) + else: + x = core.eager.Tensor(core.VarDesc.VarType.FP32, [3, 100], + "selected_rows", + core.VarDesc.VarType.SELECTED_ROWS, True) + selected_rows = x.value().get_selected_rows() selected_rows.get_tensor().set( np.random.rand(3, 100), core.CPUPlace()) @@ -486,8 +550,13 @@ def test_deep_copy(self): np.array(copy_selected_rows.get_tensor()), np.array(selected_rows.get_tensor()))) + def test_deep_copy(self): + with _test_eager_guard(): + self.func_test_deep_copy() + self.func_test_deep_copy() + # test some patched methods - def test_set_value(self): + def func_test_set_value(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) tmp1 = np.random.uniform(0.1, 1, [2, 2, 3]).astype(self.dtype) @@ -497,12 +566,22 @@ def test_set_value(self): var.set_value(tmp2) self.assertTrue(np.array_equal(var.numpy(), tmp2)) - def test_to_string(self): + def test_set_value(self): + with _test_eager_guard(): + self.func_test_set_value() + self.func_test_set_value() + + def func_test_to_string(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) self.assertTrue(isinstance(str(var), str)) - def test_element_size(self): + def test_to_string(self): + with _test_eager_guard(): + self.func_test_to_string() + self.func_test_to_string() + + def func_test_element_size(self): with fluid.dygraph.guard(): x = paddle.to_tensor(1, dtype='bool') self.assertEqual(x.element_size(), 1) @@ -537,7 +616,12 @@ def test_element_size(self): x = paddle.to_tensor(1, dtype='complex128') self.assertEqual(x.element_size(), 16) - def test_backward(self): + def test_element_size(self): + with _test_eager_guard(): + self.func_test_element_size() + self.func_test_element_size() + + def func_test_backward(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) var.stop_gradient = False @@ -546,7 +630,12 @@ def test_backward(self): grad_var = var._grad_ivar() self.assertEqual(grad_var.shape, self.shape) - def test_gradient(self): + def test_backward(self): + with _test_eager_guard(): + self.func_test_backward() + self.func_test_backward() + + def func_test_gradient(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) var.stop_gradient = False @@ -555,12 +644,22 @@ def test_gradient(self): grad_var = var.gradient() self.assertEqual(grad_var.shape, self.array.shape) - def test_block(self): + def test_gradient(self): + with _test_eager_guard(): + self.func_test_gradient() + self.func_test_gradient() + + def func_test_block(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) self.assertEqual(var.block, fluid.default_main_program().global_block()) + def test_block(self): + with _test_eager_guard(): + self.func_test_block() + self.func_test_block() + def _test_slice(self): w = fluid.dygraph.to_variable( np.random.random((784, 100, 100)).astype('float64')) @@ -795,17 +894,17 @@ def _test_bool_index(self): np_value = np.random.random(shape).astype('float32') var_tensor = paddle.to_tensor(np_value) index = [[True, True, True, True], [True, False, True, True], - 
[True, False, False, True], [False, 0, 1, True, True]] + [True, False, False, True], [False, 0, 1, True, True], + [False, False, False, False]] index2d = np.array([[True, True], [False, False], [True, False], [True, True]]) tensor_index = paddle.to_tensor(index2d) var = [ - var_tensor[index[0]].numpy(), - var_tensor[index[1]].numpy(), - var_tensor[index[2]].numpy(), - var_tensor[index[3]].numpy(), + var_tensor[index[0]].numpy(), var_tensor[index[1]].numpy(), + var_tensor[index[2]].numpy(), var_tensor[index[3]].numpy(), var_tensor[paddle.to_tensor(index[0])].numpy(), var_tensor[tensor_index].numpy(), + var_tensor[paddle.to_tensor(index[4])].numpy() ] self.assertTrue(np.array_equal(var[0], np_value[index[0]])) self.assertTrue(np.array_equal(var[1], np_value[index[1]])) @@ -813,6 +912,7 @@ def _test_bool_index(self): self.assertTrue(np.array_equal(var[3], np_value[index[3]])) self.assertTrue(np.array_equal(var[4], np_value[index[0]])) self.assertTrue(np.array_equal(var[5], np_value[index2d])) + self.assertTrue(np.array_equal(var[6], np_value[index[4]])) self.assertTrue( np.array_equal(var_tensor[var_tensor > 0.67], np_value[np_value > 0.67])) @@ -915,14 +1015,19 @@ def test_slice(self): self.func_test_slice() self.func_test_slice() - def test_var_base_to_np(self): + def func_test_var_base_to_np(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) self.assertTrue( np.array_equal(var.numpy(), fluid.framework._var_base_to_np(var))) - def test_var_base_as_np(self): + def test_var_base_to_np(self): + with _test_eager_guard(): + self.func_test_var_base_to_np() + self.func_test_var_base_to_np() + + def func_test_var_base_as_np(self): with fluid.dygraph.guard(): var = fluid.dygraph.to_variable(self.array) self.assertTrue(np.array_equal(var.numpy(), np.array(var))) @@ -931,7 +1036,12 @@ def test_var_base_as_np(self): var.numpy(), np.array( var, dtype=np.float32))) - def test_if(self): + def test_var_base_as_np(self): + with _test_eager_guard(): + self.func_test_var_base_as_np() + self.func_test_var_base_as_np() + + def func_test_if(self): with fluid.dygraph.guard(): var1 = fluid.dygraph.to_variable(np.array([[[0]]])) var2 = fluid.dygraph.to_variable(np.array([[[1]]])) @@ -950,7 +1060,12 @@ def test_if(self): assert bool(var1) == False, "bool(var1) is False" assert bool(var2) == True, "bool(var2) is True" - def test_to_static_var(self): + def test_if(self): + with _test_eager_guard(): + self.func_test_if() + self.func_test_if() + + def func_test_to_static_var(self): with fluid.dygraph.guard(): # Convert VarBase into Variable or Parameter var_base = fluid.dygraph.to_variable(self.array, name="var_base_1") @@ -973,6 +1088,11 @@ def test_to_static_var(self): static_param = weight._to_static_var() self._assert_to_static(weight, static_param, True) + def test_to_static_var(self): + with _test_eager_guard(): + self.func_test_to_static_var() + self.func_test_to_static_var() + def _assert_to_static(self, var_base, static_var, is_param=False): if is_param: self.assertTrue(isinstance(static_var, fluid.framework.Parameter)) @@ -996,7 +1116,7 @@ def _assert_to_static(self, var_base, static_var, is_param=False): self.assertListEqual(list(var_base.shape), list(static_var.shape)) - def test_tensor_str(self): + def func_test_tensor_str(self): paddle.enable_static() paddle.disable_static(paddle.CPUPlace()) paddle.seed(10) @@ -1014,9 +1134,13 @@ def test_tensor_str(self): [0.2665, 0.8483, 0.5389, ..., 0.4956, 0.6862, 0.9178]])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def 
test_tensor_str2(self): + def test_tensor_str(self): + with _test_eager_guard(): + self.func_test_tensor_str() + self.func_test_tensor_str() + + def func_test_tensor_str2(self): paddle.disable_static(paddle.CPUPlace()) a = paddle.to_tensor([[1.5111111, 1.0], [0, 0]]) a_str = str(a) @@ -1026,9 +1150,13 @@ def test_tensor_str2(self): [0. , 0. ]])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_tensor_str3(self): + def test_tensor_str2(self): + with _test_eager_guard(): + self.func_test_tensor_str2() + self.func_test_tensor_str2() + + def func_test_tensor_str3(self): paddle.disable_static(paddle.CPUPlace()) a = paddle.to_tensor([[-1.5111111, 1.0], [0, -0.5]]) a_str = str(a) @@ -1038,9 +1166,13 @@ def test_tensor_str3(self): [ 0. , -0.5000]])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_tensor_str_scaler(self): + def test_tensor_str3(self): + with _test_eager_guard(): + self.func_test_tensor_str3() + self.func_test_tensor_str3() + + def func_test_tensor_str_scaler(self): paddle.disable_static(paddle.CPUPlace()) a = paddle.to_tensor(np.array(False)) a_str = str(a) @@ -1049,9 +1181,13 @@ def test_tensor_str_scaler(self): False)''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_tensor_str_shape_with_zero(self): + def test_tensor_str_scaler(self): + with _test_eager_guard(): + self.func_test_tensor_str_scaler() + self.func_test_tensor_str_scaler() + + def func_test_tensor_str_shape_with_zero(self): paddle.disable_static(paddle.CPUPlace()) x = paddle.ones((10, 10)) y = paddle.fluid.layers.where(x == 0) @@ -1061,9 +1197,13 @@ def test_tensor_str_shape_with_zero(self): [])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_tensor_str_linewidth(self): + def test_tensor_str_shape_with_zero(self): + with _test_eager_guard(): + self.func_test_tensor_str_shape_with_zero() + self.func_test_tensor_str_shape_with_zero() + + def func_test_tensor_str_linewidth(self): paddle.disable_static(paddle.CPUPlace()) paddle.seed(2021) x = paddle.rand([128]) @@ -1089,9 +1229,13 @@ def test_tensor_str_linewidth(self): 0.4678, 0.5047])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_tensor_str_linewidth2(self): + def test_tensor_str_linewidth(self): + with _test_eager_guard(): + self.func_test_tensor_str_linewidth() + self.func_test_tensor_str_linewidth() + + def func_test_tensor_str_linewidth2(self): paddle.disable_static(paddle.CPUPlace()) paddle.seed(2021) x = paddle.rand([128]) @@ -1112,9 +1256,13 @@ def test_tensor_str_linewidth2(self): 8.9448e-01, 7.0981e-01, 8.0783e-01, 4.7065e-01, 5.7154e-01, 7.2319e-01, 4.6777e-01, 5.0465e-01])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_tensor_str_bf16(self): + def test_tensor_str_linewidth2(self): + with _test_eager_guard(): + self.func_test_tensor_str_linewidth2() + self.func_test_tensor_str_linewidth2() + + def func_tensor_str_bf16(self): paddle.disable_static(paddle.CPUPlace()) a = paddle.to_tensor([[1.5, 1.0], [0, 0]]) a = paddle.cast(a, dtype=core.VarDesc.VarType.BF16) @@ -1126,9 +1274,18 @@ def test_tensor_str_bf16(self): [0. , 0. 
]])''' self.assertEqual(a_str, expected) - paddle.enable_static() - def test_print_tensor_dtype(self): + def test_tensor_str_bf16(self): + with _test_eager_guard(): + self.func_tensor_str_bf16() + self.func_tensor_str_bf16() + + def test_tensor_str_bf16(self): + with _test_eager_guard(): + self.func_tensor_str_bf16() + self.func_tensor_str_bf16() + + def func_test_print_tensor_dtype(self): paddle.disable_static(paddle.CPUPlace()) a = paddle.rand([1]) a_str = str(a.dtype) @@ -1136,11 +1293,15 @@ def test_print_tensor_dtype(self): expected = 'paddle.float32' self.assertEqual(a_str, expected) - paddle.enable_static() + + def test_print_tensor_dtype(self): + with _test_eager_guard(): + self.func_test_print_tensor_dtype() + self.func_test_print_tensor_dtype() class TestVarBaseSetitem(unittest.TestCase): - def setUp(self): + def func_setUp(self): self.set_dtype() self.tensor_x = paddle.to_tensor(np.ones((4, 2, 3)).astype(self.dtype)) self.np_value = np.random.random((2, 3)).astype(self.dtype) @@ -1184,9 +1345,9 @@ def func_test_value_tensor(self): def test_value_tensor(self): with _test_eager_guard(): - self.setUp() + self.func_setUp() self.func_test_value_tensor() - self.setUp() + self.func_setUp() self.func_test_value_tensor() def func_test_value_numpy(self): @@ -1194,9 +1355,9 @@ def func_test_value_numpy(self): def test_value_numpy(self): with _test_eager_guard(): - self.setUp() + self.func_setUp() self.func_test_value_numpy() - self.setUp() + self.func_setUp() self.func_test_value_numpy() def func_test_value_int(self): @@ -1204,9 +1365,9 @@ def func_test_value_int(self): def test_value_int(self): with _test_eager_guard(): - self.setUp() + self.func_setUp() self.func_test_value_int() - self.setUp() + self.func_setUp() self.func_test_value_int() @@ -1219,10 +1380,17 @@ class TestVarBaseSetitemFp32(TestVarBaseSetitem): def set_dtype(self): self.dtype = "float32" - def test_value_float(self): + def func_test_value_float(self): paddle.disable_static() self._test(3.3) + def test_value_float(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_value_float() + self.func_setUp() + self.func_test_value_float() + class TestVarBaseSetitemFp64(TestVarBaseSetitem): def set_dtype(self): @@ -1230,7 +1398,7 @@ def set_dtype(self): class TestVarBaseSetitemBoolIndex(unittest.TestCase): - def setUp(self): + def func_setUp(self): paddle.disable_static() self.set_dtype() self.set_input() @@ -1273,18 +1441,39 @@ def _test(self, value): self.assertTrue(np.array_equal(self.tensor_x[3].numpy(), result)) self.assertEqual(id_origin, id(self.tensor_x)) - def test_value_tensor(self): + def func_test_value_tensor(self): paddle.disable_static() self._test(self.tensor_value) - def test_value_numpy(self): + def test_value_tensor(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_value_tensor() + self.func_setUp() + self.func_test_value_tensor() + + def func_test_value_numpy(self): paddle.disable_static() self._test(self.np_value) - def test_value_int(self): + def test_value_numpy(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_value_numpy() + self.func_setUp() + self.func_test_value_numpy() + + def func_test_value_int(self): paddle.disable_static() self._test(10) + def test_value_int(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_value_int() + self.func_setUp() + self.func_test_value_int() + class TestVarBaseSetitemBoolScalarIndex(unittest.TestCase): def set_input(self): @@ -1312,7 +1501,7 @@ def _test(self, value): class 
TestVarBaseInplaceVersion(unittest.TestCase): - def test_setitem(self): + def func_test_setitem(self): paddle.disable_static() var = paddle.ones(shape=[4, 2, 3], dtype="float32") @@ -1324,7 +1513,12 @@ def test_setitem(self): var[1:2] = 1 self.assertEqual(var.inplace_version, 2) - def test_bump_inplace_version(self): + def test_setitem(self): + with _test_eager_guard(): + self.func_test_setitem() + self.func_test_setitem() + + def func_test_bump_inplace_version(self): paddle.disable_static() var = paddle.ones(shape=[4, 2, 3], dtype="float32") self.assertEqual(var.inplace_version, 0) @@ -1335,9 +1529,14 @@ def test_bump_inplace_version(self): var._bump_inplace_version() self.assertEqual(var.inplace_version, 2) + def test_bump_inplace_version(self): + with _test_eager_guard(): + self.func_test_bump_inplace_version() + self.func_test_bump_inplace_version() + class TestVarBaseSlice(unittest.TestCase): - def test_slice(self): + def func_test_slice(self): paddle.disable_static() np_x = np.random.random((3, 8, 8)) x = paddle.to_tensor(np_x, dtype="float64") @@ -1345,18 +1544,28 @@ def test_slice(self): actual_x = paddle.to_tensor(actual_x) self.assertEqual(actual_x.numpy().all(), np_x[0:1].all()) + def test_slice(self): + with _test_eager_guard(): + self.func_test_slice() + self.func_test_slice() + class TestVarBaseClear(unittest.TestCase): - def test_clear(self): + def func_test_clear(self): paddle.disable_static() np_x = np.random.random((3, 8, 8)) x = paddle.to_tensor(np_x, dtype="float64") x._clear() self.assertEqual(str(x), "Tensor(Not initialized)") + def test_clear(self): + with _test_eager_guard(): + self.func_test_clear() + self.func_test_clear() + class TestVarBaseOffset(unittest.TestCase): - def test_offset(self): + def func_offset(self): paddle.disable_static() np_x = np.random.random((3, 8, 8)) x = paddle.to_tensor(np_x, dtype="float64") @@ -1365,25 +1574,38 @@ def test_offset(self): actual_x = paddle.to_tensor(actual_x) self.assertEqual(actual_x._offset(), expected_offset) + def test_offset(self): + with _test_eager_guard(): + self.func_offset() + self.func_offset() + class TestVarBaseShareBufferTo(unittest.TestCase): - def test_share_buffer_To(self): + def func_test_share_buffer_To(self): paddle.disable_static() np_src = np.random.random((3, 8, 8)) src = paddle.to_tensor(np_src, dtype="float64") # empty_var - dst = core.VarBase() + if _in_legacy_dygraph(): + dst = core.VarBase() + else: + dst = core.eager.Tensor() src._share_buffer_to(dst) self.assertEqual(src._is_shared_buffer_with(dst), True) + def test_share_buffer_To(self): + with _test_eager_guard(): + self.func_test_share_buffer_To() + self.func_test_share_buffer_To() + class TestVarBaseTo(unittest.TestCase): - def setUp(self): + def func_setUp(self): paddle.disable_static() self.np_x = np.random.random((3, 8, 8)) self.x = paddle.to_tensor(self.np_x, dtype="float32") - def test_to_api(self): + def func_test_to_api(self): x_double = self.x._to(dtype='double') self.assertEqual(x_double.dtype, paddle.fluid.core.VarDesc.VarType.FP64) self.assertTrue(np.allclose(self.np_x, x_double)) @@ -1430,9 +1652,16 @@ def test_to_api(self): self.assertRaises(ValueError, self.x._to, device=1) self.assertRaises(AssertionError, self.x._to, blocking=1) + def test_to_api(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_to_api() + self.func_setUp() + self.func_test_to_api() + class TestVarBaseInitVarBaseFromTensorWithDevice(unittest.TestCase): - def test_varbase_init(self): + def func_test_varbase_init(self): 
paddle.disable_static() t = fluid.Tensor() np_x = np.random.random((3, 8, 8)) @@ -1440,17 +1669,28 @@ def test_varbase_init(self): if paddle.fluid.is_compiled_with_cuda(): device = paddle.CUDAPlace(0) - tmp = fluid.core.VarBase(t, device) + if _in_legacy_dygraph(): + tmp = fluid.core.VarBase(t, device) + else: + tmp = fluid.core.eager.Tensor(t, device) self.assertTrue(tmp.place.is_gpu_place()) self.assertEqual(tmp.numpy().all(), np_x.all()) device = paddle.CPUPlace() - tmp = fluid.core.VarBase(t, device) + if _in_legacy_dygraph(): + tmp = fluid.core.VarBase(t, device) + else: + tmp = fluid.core.eager.Tensor(t, device) self.assertEqual(tmp.numpy().all(), np_x.all()) + def test_varbase_init(self): + with _test_eager_guard(): + self.func_test_varbase_init() + self.func_test_varbase_init() + class TestVarBaseNumel(unittest.TestCase): - def test_numel_normal(self): + def func_test_numel_normal(self): paddle.disable_static() np_x = np.random.random((3, 8, 8)) x = paddle.to_tensor(np_x, dtype="float64") @@ -1458,15 +1698,28 @@ def test_numel_normal(self): x_expected_numel = np.product((3, 8, 8)) self.assertEqual(x_actual_numel, x_expected_numel) - def test_numel_without_holder(self): + def test_numel_normal(self): + with _test_eager_guard(): + self.func_test_numel_normal() + self.func_test_numel_normal() + + def func_test_numel_without_holder(self): paddle.disable_static() - x_without_holder = core.VarBase() + if _in_legacy_dygraph(): + x_without_holder = core.VarBase() + else: + x_without_holder = core.eager.Tensor() x_actual_numel = x_without_holder._numel() self.assertEqual(x_actual_numel, 0) + def ttest_numel_without_holder(self): + with _test_eager_guard(): + self.func_test_numel_without_holder() + self.func_test_numel_without_holder() + class TestVarBaseCopyGradientFrom(unittest.TestCase): - def test_copy_gradient_from(self): + def func_test_copy_gradient_from(self): paddle.disable_static() np_x = np.random.random((2, 2)) np_y = np.random.random((2, 2)) @@ -1477,7 +1730,11 @@ def test_copy_gradient_from(self): x._copy_gradient_from(y) self.assertEqual(x.grad.numpy().all(), np_y.all()) + def test_copy_gradient_from(self): + with _test_eager_guard(): + self.func_test_copy_gradient_from() + self.func_test_copy_gradient_from() + if __name__ == '__main__': - paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_variable.py b/python/paddle/fluid/tests/unittests/test_variable.py index b218739ff9527..3a924669b0020 100644 --- a/python/paddle/fluid/tests/unittests/test_variable.py +++ b/python/paddle/fluid/tests/unittests/test_variable.py @@ -690,6 +690,61 @@ def test_dygraph_list_index_muti_dim(self): y = x[index_t1, index_t2] self.assertTrue(np.array_equal(y.numpy(), y_np)) + def run_getitem_list_index(self, array, index): + x = paddle.static.data(name='x', shape=array.shape, dtype='float32') + + y = x[index] + place = paddle.fluid.CPUPlace() + + prog = paddle.static.default_main_program() + exe = paddle.static.Executor(place) + + exe.run(paddle.static.default_startup_program()) + fetch_list = [y.name] + array2 = array.copy() + + try: + value_np = array2[index] + except: + with self.assertRaises(ValueError): + getitem_pp = exe.run(prog, + feed={x.name: array}, + fetch_list=fetch_list) + return + getitem_pp = exe.run(prog, feed={x.name: array}, fetch_list=fetch_list) + + print(getitem_pp) + self.assertTrue( + np.array_equal(value_np, getitem_pp[0]), + msg='\n numpy:{},\n paddle:{}'.format(value_np, getitem_pp[0])) + + def 
test_static_graph_getitem_bool_index(self): + paddle.enable_static() + + # case 1: + array = np.ones((4, 2, 3), dtype='float32') + value_np = np.random.random((2, 3)).astype('float32') + index = np.array([True, False, False, False]) + program = paddle.static.Program() + with paddle.static.program_guard(program): + self.run_getitem_list_index(array, index) + + # case 2: + array = np.ones((4, 2, 3), dtype='float32') + value_np = np.random.random((2, 3)).astype('float32') + index = np.array([False, True, False, False]) + program = paddle.static.Program() + with paddle.static.program_guard(program): + self.run_getitem_list_index(array, index) + + # case 3: + array = np.ones((4, 2, 3), dtype='float32') + value_np = np.random.random((2, 3)).astype('float32') + index = np.array([True, True, True, True]) + program = paddle.static.Program() + with paddle.static.program_guard(program): + self.run_getitem_list_index(array, index) + def run_setitem_list_index(self, array, index, value_np): x = paddle.static.data(name='x', shape=array.shape, dtype='float32') diff --git a/python/paddle/fluid/tests/unittests/test_view_op_reuse_allocation.py b/python/paddle/fluid/tests/unittests/test_view_op_reuse_allocation.py index 92078a69b53a5..0d4e379660b75 100644 --- a/python/paddle/fluid/tests/unittests/test_view_op_reuse_allocation.py +++ b/python/paddle/fluid/tests/unittests/test_view_op_reuse_allocation.py @@ -91,18 +91,11 @@ def func_test_backward_error(self): view_var_b[0] = 2. # var_b is modified inplace loss = paddle.nn.functional.relu(var_c) - if in_dygraph_mode(): - with self.assertRaisesRegexp( - RuntimeError, - "received current_inplace_version:{} != inplace_version_snapshot_:{}". - format(1, 0)): - loss.backward() - else: - with self.assertRaisesRegexp( - RuntimeError, - "received tensor_version:{} != wrapper_version_snapshot:{}". - format(1, 0)): - loss.backward() + with self.assertRaisesRegexp( + RuntimeError, + "received tensor_version:{} != wrapper_version_snapshot:{}". 
+ format(1, 0)): + loss.backward() def test_backward_error(self): with _test_eager_guard(): diff --git a/python/paddle/fluid/tests/unittests/test_viterbi_decode_op.py b/python/paddle/fluid/tests/unittests/test_viterbi_decode_op.py index 6f64322e97545..163e246b71560 100644 --- a/python/paddle/fluid/tests/unittests/test_viterbi_decode_op.py +++ b/python/paddle/fluid/tests/unittests/test_viterbi_decode_op.py @@ -74,6 +74,7 @@ def set_attr(self): def setUp(self): self.op_type = "viterbi_decode" + self.python_api = paddle.text.viterbi_decode self.set_attr() bz, length, ntags = self.bz, self.len, self.ntags self.input = np.random.randn(bz, length, ntags).astype(self.dtype) @@ -90,7 +91,7 @@ def setUp(self): self.outputs = {'Scores': scores, 'Path': path} def test_output(self): - self.check_output() + self.check_output(check_eager=True) class TestViterbiAPI(unittest.TestCase): @@ -132,3 +133,8 @@ def check_static_result(self, place): def test_static_net(self): for place in self.places: self.check_static_result(place) + + +if __name__ == "__main__": + paddle.enable_static() + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_while_op.py b/python/paddle/fluid/tests/unittests/test_while_op.py index d6d52b7d604aa..8af9a39634fdb 100644 --- a/python/paddle/fluid/tests/unittests/test_while_op.py +++ b/python/paddle/fluid/tests/unittests/test_while_op.py @@ -137,5 +137,44 @@ def test_bad_x(): self.assertRaises(TypeError, test_bad_x) +class TestIgnoreVarNameInWhile(unittest.TestCase): + def test_ignore_var(self): + def cond(i, ten, temp, y): + return i < ten + + def body_func(i, ten, batch_info, origin_seq): + print(batch_info) + batch_info = fluid.contrib.layers.shuffle_batch(batch_info) + print(batch_info) + i = i + 1 + return [i, ten, batch_info, origin_seq] + + x = fluid.layers.data(name='x', shape=[-1, 1, 4]) + y = fluid.layers.data(name='y', shape=[-1, 1, 1]) + temp = layers.concat(input=[x, y], axis=-1) + i = layers.fill_constant(shape=[1], value=0, dtype='int32') + num = layers.fill_constant(shape=[1], value=5, dtype='int32') + + i, ten, shuffle_temp, y = layers.while_loop(cond, body_func, + [i, num, temp, y]) + + output = shuffle_temp + + exe = fluid.Executor(fluid.CPUPlace()) + exe.run(fluid.default_startup_program()) + + input_x = numpy.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]) + input_x = input_x.reshape(3, 1, 4) + input_y = numpy.array([[10], [12], [33]]) + input_y = input_y.reshape(3, 1, 1) + + res, = exe.run(fluid.default_main_program(), + feed={'x': input_x, + 'y': input_y}, + fetch_list=[output]) + + self.assertListEqual(list(res.shape), [3, 1, 5]) + + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/test_yolo_box_op.py b/python/paddle/fluid/tests/unittests/test_yolo_box_op.py index 05a4dfe3c06b6..19dcb49cd957c 100644 --- a/python/paddle/fluid/tests/unittests/test_yolo_box_op.py +++ b/python/paddle/fluid/tests/unittests/test_yolo_box_op.py @@ -31,7 +31,7 @@ def YoloBox(x, img_size, attrs): an_num = int((len(anchors) // 2)) class_num = attrs['class_num'] conf_thresh = attrs['conf_thresh'] - downsample = attrs['downsample'] + downsample = attrs['downsample_ratio'] clip_bbox = attrs['clip_bbox'] scale_x_y = attrs['scale_x_y'] iou_aware = attrs['iou_aware'] @@ -92,13 +92,14 @@ class TestYoloBoxOp(OpTest): def setUp(self): self.initTestCase() self.op_type = 'yolo_box' + self.python_api = paddle.vision.ops.yolo_box x = np.random.random(self.x_shape).astype('float32') img_size = np.random.randint(10, 20, 
self.imgsize_shape).astype('int32') self.attrs = { 'anchors': self.anchors, 'class_num': self.class_num, 'conf_thresh': self.conf_thresh, - 'downsample': self.downsample, + 'downsample_ratio': self.downsample, 'clip_bbox': self.clip_bbox, 'scale_x_y': self.scale_x_y, 'iou_aware': self.iou_aware, diff --git a/python/paddle/fluid/tests/unittests/xpu/test_adamw_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_adamw_op_xpu.py new file mode 100644 index 0000000000000..99e9fdd123eb1 --- /dev/null +++ b/python/paddle/fluid/tests/unittests/xpu/test_adamw_op_xpu.py @@ -0,0 +1,295 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +sys.path.append("..") + +import unittest +import paddle +import random +import numpy as np +import paddle.fluid as fluid +from functools import partial +from paddle.framework import core + +from op_test_xpu import XPUOpTest + +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper + + +def adamw_step(inputs, attributes): + param = inputs['Param'] + grad = inputs['Grad'] + moment1 = inputs['Moment1'] + moment2 = inputs['Moment2'] + lr = inputs['LearningRate'] + beta1_pow = inputs['Beta1Pow'] + beta2_pow = inputs['Beta2Pow'] + + epsilon = attributes['epsilon'] + + if 'lr_ratio' in attributes: + lr = lr * attributes['lr_ratio'] + + if attributes["with_decay"]: + coeff = attributes["coeff"] + decay = 1.0 - lr * coeff + param2 = param * decay + param = param2.copy() + + if 'beta1' in attributes: + beta1 = attributes['beta1'] + else: + beta1 = inputs['Beta1Tensor'][0] + if 'beta2' in attributes: + beta2 = attributes['beta2'] + else: + beta2 = inputs['Beta2Tensor'][0] + + moment1_out = beta1 * moment1 + (1 - beta1) * grad + moment2_out = beta2 * moment2 + (1 - beta2) * np.square(grad) + lr_t = lr * np.sqrt(1 - beta2_pow) / (1 - beta1_pow) + param_out = param - lr_t * (moment1_out / (np.sqrt(moment2_out) + epsilon)) + return param_out, moment1_out, moment2_out + + +def simple_lr_setting(param, decay_rate, n_layers): + if "fc_0" in param.name or "linear_1" in param.name: + depth = int(param.name.split("_")[2]) + 1 + elif "fc_1" in param.name or "linear_2" in param.name: + depth = int(param.name.split("_")[2]) + 2 + else: + depth = 0 + + return decay_rate**(n_layers + 2 - depth) + + +class XPUTestAdamwOp1(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'adamw' + self.use_dynamic_create_class = False + + class TestAdamW(XPUOpTest): + def setUp(self): + #Test AdamW Op with supplied attributes + self.op_type = "adamw" + self.init_shape() + self.dtype = self.in_type_str + param = np.random.uniform(-1, 1, self.shape).astype(self.dtype) + grad = np.random.uniform(-1, 1, self.shape).astype(self.dtype) + moment1 = np.random.uniform(-1, 1, self.shape).astype(self.dtype) + # The second moment is positive + moment2 = np.random.random(self.shape).astype(self.dtype) + + learning_rate = 0.004 + beta1 = 0.78 + beta2 = 0.836 + epsilon = 1e-4 + beta1_pow = 
beta1**10 + beta2_pow = beta2**10 + + self.inputs = { + 'Param': param, + 'Grad': grad, + 'Moment1': moment1, + 'Moment2': moment2, + 'LearningRate': np.array([learning_rate]).astype(self.dtype), + 'Beta1Pow': np.array([beta1_pow]).astype(self.dtype), + 'Beta2Pow': np.array([beta2_pow]).astype(self.dtype) + } + + self.attrs = { + 'epsilon': epsilon, + 'beta1': beta1, + 'beta2': beta2, + "coeff": 0.5, + "with_decay": True + } + + param_out, moment1_out, \ + moment2_out = adamw_step(self.inputs, self.attrs) + + self.outputs = { + 'Moment1Out': moment1_out, + 'Moment2Out': moment2_out, + 'ParamOut': param_out, + 'Beta1PowOut': np.array([beta1_pow]).astype(self.dtype) * beta1, + 'Beta2PowOut': np.array([beta2_pow]).astype(self.dtype) * beta2 + } + + def init_shape(self): + self.shape = [102, 105] + + def test_check_output(self): + paddle.enable_static() + self.check_output_with_place(place=paddle.XPUPlace(0)) + + class TestAdamW2(TestAdamW): + def init_shape(self): + self.shape = [1000, ] + + class TestAdamW3(TestAdamW): + def init_shape(self): + self.shape = [200, 3000] + + +class XPUTestAdamwOp2(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'adamw' + self.use_dynamic_create_class = False + + class TestAdamWOp(unittest.TestCase): + def test_adamw_op_dygraph(self): + paddle.disable_static() + value = np.arange(26).reshape(2, 13).astype(self.in_type_str) + a = paddle.to_tensor(value) + linear = paddle.nn.Linear(13, 5) + adam = paddle.optimizer.AdamW( + learning_rate=0.01, + parameters=linear.parameters(), + apply_decay_param_fun=lambda name: True, + weight_decay=0.01) + + for _ in range(2): + out = linear(a) + out.backward() + adam.step() + adam.clear_gradients() + + def test_adamw_op_coverage(self): + paddle.disable_static() + value = np.arange(26).reshape(2, 13).astype(self.in_type_str) + a = paddle.to_tensor(value) + linear = paddle.nn.Linear(13, 5) + adam = paddle.optimizer.AdamW( + learning_rate=0.0, + parameters=linear.parameters(), + apply_decay_param_fun=lambda name: True, + weight_decay=0.01) + assert (adam.__str__() is not None) + + def test_adamw_op(self): + paddle.enable_static() + place = fluid.XPUPlace(0) + shape = [2, 3, 8, 8] + exe = fluid.Executor(place) + train_prog = fluid.Program() + startup = fluid.Program() + with fluid.program_guard(train_prog, startup): + with fluid.unique_name.guard(): + data = fluid.data(name="data", shape=shape) + conv = fluid.layers.conv2d(data, 8, 3) + loss = paddle.mean(conv) + + beta1 = fluid.layers.create_global_var( + shape=[1], + value=0.85, + dtype=self.in_type_str, + persistable=True) + beta2 = fluid.layers.create_global_var( + shape=[1], + value=0.95, + dtype=self.in_type_str, + persistable=True) + betas = [beta1, beta2] + opt = paddle.optimizer.AdamW( + learning_rate=1e-5, + beta1=beta1, + beta2=beta2, + weight_decay=0.01, + epsilon=1e-8) + opt.minimize(loss) + + exe.run(startup) + data_np = np.random.random(shape).astype(self.in_type_str) + rets = exe.run(train_prog, + feed={"data": data_np}, + fetch_list=[loss]) + assert rets[0] is not None + paddle.disable_static() + + def test_adamw_op_invalid_input(self): + paddle.disable_static() + linear = paddle.nn.Linear(10, 10) + with self.assertRaises(ValueError): + adam = paddle.optimizer.AdamW( + 0.1, beta1=-1, parameters=linear.parameters()) + with self.assertRaises(ValueError): + adam = paddle.optimizer.AdamW( + 0.1, beta2=-1, parameters=linear.parameters()) + with self.assertRaises(ValueError): + adam = paddle.optimizer.AdamW( + 0.1, epsilon=-1, parameters=linear.parameters()) + 
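The `adamw_step` helper defined at the top of this new XPU test file is the NumPy reference for AdamW's decoupled weight decay: the parameter is first scaled by `1 - lr * coeff`, then the ordinary Adam moment update is applied. A small standalone check of that helper, assuming it is run below the `adamw_step` definition; the concrete numbers are illustrative only, not values used by the tests:

```python
import numpy as np

# Illustrative inputs for the adamw_step reference defined above.
inputs = {
    'Param': np.ones([2, 2], dtype=np.float32),
    'Grad': np.full([2, 2], 0.1, dtype=np.float32),
    'Moment1': np.zeros([2, 2], dtype=np.float32),
    'Moment2': np.zeros([2, 2], dtype=np.float32),
    'LearningRate': np.array([0.01], dtype=np.float32),
    'Beta1Pow': np.array([0.9 ** 10], dtype=np.float32),
    'Beta2Pow': np.array([0.999 ** 10], dtype=np.float32),
}
attrs = {
    'epsilon': 1e-8,
    'beta1': 0.9,
    'beta2': 0.999,
    'coeff': 0.01,
    'with_decay': True,
}

param_out, moment1_out, moment2_out = adamw_step(inputs, attrs)

# With a positive gradient and with_decay=True, both the weight decay and the
# Adam step push the parameter down, so the output must be strictly smaller.
assert np.all(param_out < inputs['Param'])
```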
+ class TestAdamWOpGroup(TestAdamWOp): + def test_adamw_op_dygraph(self): + paddle.disable_static() + value = np.arange(26).reshape(2, 13).astype(self.in_type_str) + a = paddle.to_tensor(value) + linear_1 = paddle.nn.Linear(13, 5) + linear_2 = paddle.nn.Linear(5, 3) + adam = paddle.optimizer.AdamW( + learning_rate=0.01, + parameters=[{ + 'params': linear_1.parameters() + }, { + 'params': linear_2.parameters(), + 'weight_decay': 0.001 + }], + apply_decay_param_fun=lambda name: True, + weight_decay=0.01) + + for _ in range(2): + out = linear_1(a) + out = linear_2(out) + out.backward() + adam.step() + adam.clear_gradients() + + class TestAdamWOpGroupWithLR(TestAdamWOp): + def test_adamw_op_dygraph(self): + paddle.disable_static() + value = np.arange(26).reshape(2, 13).astype(self.in_type_str) + a = paddle.to_tensor(value) + linear_1 = paddle.nn.Linear(13, 5) + linear_2 = paddle.nn.Linear(5, 3) + adam = paddle.optimizer.AdamW( + learning_rate=paddle.optimizer.lr.PiecewiseDecay( + boundaries=[3, 6], values=[0.1, 0.2, 0.3]), + parameters=[{ + 'params': linear_1.parameters(), + 'learning_rate': 0.1, + }, { + 'params': linear_2.parameters(), + 'weight_decay': 0.001, + }], + apply_decay_param_fun=lambda name: True, + weight_decay=0.01) + + for _ in range(2): + out = linear_1(a) + out = linear_2(out) + out.backward() + adam.step() + adam.clear_gradients() + + +support_types = get_xpu_op_support_types('adamw') +for stype in support_types: + create_test_class(globals(), XPUTestAdamwOp1, stype) + create_test_class(globals(), XPUTestAdamwOp2, stype) + +if __name__ == "__main__": + paddle.enable_static() + unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_assign_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_assign_op_xpu.py index 7b74a8bb38365..b79bbafb37554 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_assign_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_assign_op_xpu.py @@ -25,8 +25,7 @@ import paddle.fluid as fluid from paddle.fluid import compiler, Program, program_guard from paddle.fluid.backward import append_backward - - +''' class TestAssignOp(op_test.OpTest): def setUp(self): self.op_type = "assign" @@ -84,7 +83,7 @@ def test_errors(self): self.assertRaises(TypeError, fluid.layers.assign, x1) x2 = np.array([[2.5, 2.5]], dtype='uint8') self.assertRaises(TypeError, fluid.layers.assign, x2) - +''' if __name__ == '__main__': paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_bilinear_interp_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_bilinear_interp_op_xpu.py index f8ae945b6ebe5..ddc2b49ebe08e 100755 --- a/python/paddle/fluid/tests/unittests/xpu/test_bilinear_interp_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_bilinear_interp_op_xpu.py @@ -27,8 +27,7 @@ import time paddle.enable_static() - - +''' def bilinear_interp_np(input, out_h, out_w, @@ -513,7 +512,7 @@ def test_case(self): x_data, out_h=12, out_w=12, align_corners=True) for res in results: self.assertTrue(np.allclose(res, expect_res)) - +''' if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_cast_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_cast_op_xpu.py index 08d4810a6530b..201e758c0acea 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_cast_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_cast_op_xpu.py @@ -23,6 +23,9 @@ import paddle.fluid.core as core import paddle.fluid as fluid from paddle.fluid import compiler, Program, program_guard 
+from op_test_xpu import XPUOpTest + +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper typeid_dict = { 'int32': int(core.VarDesc.VarType.INT32), @@ -33,10 +36,27 @@ } -def create_test_class(in_typename, out_typename): - class Cls(op_test.OpTest): +class XPUTestCastOp(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'cast' + self.use_dynamic_create_class = True + + def dynamic_create_class(self): + base_class = self.TestCastOp + classes = [] + for out_type in {'float16', 'float32', 'int32', 'int64'}: + class_name = 'XPUTestCastOp_outtype_' + out_type + attr_dict = {'out_typename': out_type} + classes.append([class_name, attr_dict]) + return base_class, classes + + class TestCastOp(XPUOpTest): def setUp(self): ipt = np.random.random(size=[10, 10]) + in_typename = self.in_type_str + out_typename = 'float32' if not hasattr( + self, 'out_typename') else self.out_typename + self.inputs = {'X': ipt.astype(in_typename)} self.outputs = {'Out': ipt.astype(in_typename).astype(out_typename)} self.attrs = { @@ -47,18 +67,12 @@ def setUp(self): self.__class__.no_need_check_grad = True def test_check_output(self): - if paddle.is_compiled_with_xpu(): - place = paddle.XPUPlace(0) - self.check_output_with_place(place) - - cls_name = "cast_{0}_{1}".format(in_typename, out_typename) - Cls.__name__ = cls_name - globals()[cls_name] = Cls + self.check_output() -for in_type in {'float16', 'float32', 'int32', 'int64', 'bool'}: - for out_type in {'float16', 'float32', 'int32', 'int64'}: - create_test_class(in_type, out_type) +support_types = get_xpu_op_support_types('cast') +for stype in support_types: + create_test_class(globals(), XPUTestCastOp, stype) class TestCastOpError(unittest.TestCase): diff --git a/python/paddle/fluid/tests/unittests/xpu/test_dropout_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_dropout_op_xpu.py index ca3b3a418abf6..2baa837b23a07 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_dropout_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_dropout_op_xpu.py @@ -25,90 +25,196 @@ from op_test_xpu import XPUOpTest paddle.enable_static() - -class TestDropoutOp(XPUOpTest): - def setUp(self): - self.op_type = "dropout" - self.inputs = {'X': np.random.random((32, 64)).astype("float32")} - self.attrs = {'dropout_prob': 0.0, 'fix_seed': True, 'is_test': False} - self.outputs = { - 'Out': self.inputs['X'], - 'Mask': np.ones((32, 64)).astype('uint8') - } - - def test_check_output(self): - if paddle.is_compiled_with_xpu(): - paddle.enable_static() - place = paddle.XPUPlace(0) - self.check_output_with_place(place) - - def test_check_grad_normal(self): - if paddle.is_compiled_with_xpu(): - paddle.enable_static() - place = paddle.XPUPlace(0) - self.check_grad_with_place(place, ['X'], 'Out') - - -class TestDropoutOpInput1d(XPUOpTest): - def setUp(self): - self.op_type = "dropout" - self.inputs = {'X': np.random.random((2000, )).astype("float32")} - self.attrs = {'dropout_prob': 0.0, 'fix_seed': True, 'is_test': False} - self.outputs = { - 'Out': self.inputs['X'], - 'Mask': np.ones((2000)).astype('uint8') - } - - def test_check_output(self): - if paddle.is_compiled_with_xpu(): - paddle.enable_static() - place = paddle.XPUPlace(0) - self.check_output_with_place(place) - - def test_check_grad_normal(self): - if paddle.is_compiled_with_xpu(): - paddle.enable_static() - place = paddle.XPUPlace(0) - self.check_grad_with_place(place, ['X'], 'Out') - - -class TestDropoutOp2(TestDropoutOp): - def setUp(self): - self.op_type = 
"dropout" - self.inputs = {'X': np.random.random((32, 64)).astype("float32")} - self.attrs = {'dropout_prob': 1.0, 'fix_seed': True, 'is_test': False} - self.outputs = { - 'Out': np.zeros((32, 64)).astype('float32'), - 'Mask': np.zeros((32, 64)).astype('uint8') - } - - -class TestDropoutOp3(TestDropoutOp): - def setUp(self): - self.op_type = "dropout" - self.inputs = {'X': np.random.random((32, 64, 2)).astype("float32")} - self.attrs = {'dropout_prob': 0.0, 'fix_seed': True, 'is_test': False} - self.outputs = { - 'Out': self.inputs['X'], - 'Mask': np.ones((32, 64, 2)).astype('uint8') - } - - -class TestDropoutOp6(TestDropoutOp): - def setUp(self): - self.op_type = "dropout" - self.inputs = {'X': np.random.random((32, 64, 2)).astype("float32")} - self.attrs = { - 'dropout_prob': 0.0, - 'fix_seed': True, - 'is_test': False, - 'dropout_implementation': 'upscale_in_train' - } - self.outputs = { - 'Out': self.inputs['X'], - 'Mask': np.ones((32, 64, 2)).astype('uint8') - } - +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper + + +class XPUTestDropoutOp(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'dropout' + self.use_dynamic_create_class = False + + class TestDropoutOp(XPUOpTest): + def setUp(self): + self.init_inputs_shape() + self.init_attrs() + self.dtype = self.in_type + self.op_type = 'dropout' + self.inputs = {'X': np.random.random(self.shape).astype(self.dtype)} + self.attrs = { + 'dropout_prob': self.dropout_prob, + 'fix_seed': self.fix_seed, + 'is_test': self.is_test, + 'dropout_implementation': self.dropout_implementation + } + + out = self.inputs['X'] * (1.0 - self.dropout_prob) + if self.is_test == False: + mask = None + if self.dropout_prob == 0.0: + mask = np.ones(self.shape).astype(self.dtype) + elif self.dropout_prob == 1.0: + mask = np.zeros(self.shape).astype(self.dtype) + self.outputs = {'Out': out, 'Mask': mask} + else: + self.outputs = {'Out': out} + + def init_inputs_shape(self): + self.shape = [32, 64] + + def init_attrs(self): + self.__class__.no_need_check_grad = False + self.dropout_prob = 0.0 + self.fix_seed = True + self.is_test = False + self.dropout_implementation = "upscale_in_train" + + def test_check_output(self): + self.check_output() + + def test_check_grad_normal(self): + if hasattr(self.__class__, "no_need_check_grad" + ) and self.__class__.no_need_check_grad == True: + return + + self.check_grad(['X'], 'Out') + + class TestDropoutOpInput1d(TestDropoutOp): + def init_inputs_shape(self): + self.shape = [2000] + + class TestDropoutOp2(TestDropoutOp): + def init_inputs_shape(self): + self.shape = [32, 64] + + def init_attrs(self): + self.dropout_prob = 1.0 + self.fix_seed = True + self.is_test = False + self.dropout_implementation = "upscale_in_train" + + class TestDropoutOp3(TestDropoutOp): + def init_inputs_shape(self): + self.shape = [32, 64, 2] + + class TestDropoutOp4(TestDropoutOp): + def init_attrs(self): + self.__class__.no_need_check_grad = True + self.dropout_prob = 0.35 + self.fix_seed = True + self.is_test = True + self.dropout_implementation = "downgrade_in_infer" + + class TestDropoutOp5(TestDropoutOp): + def init_inputs_shape(self): + self.shape = [32, 64, 3] + + def init_attrs(self): + self.__class__.no_need_check_grad = True + self.dropout_prob = 0.75 + self.fix_seed = True + self.is_test = True + self.dropout_implementation = "downgrade_in_infer" + + class TestDropoutOpError(unittest.TestCase): + def test_errors(self): + with program_guard(Program(), Program()): + + def 
test_Variable(): + # the input of dropout must be Variable. + x1 = fluid.create_lod_tensor( + np.array([-1, 3, 5, 5]), [[1, 1, 1, 1]], + fluid.CPUPlace()) + fluid.layers.dropout(x1, dropout_prob=0.5) + + self.assertRaises(TypeError, test_Variable) + + def test_dtype(): + # the input dtype of dropout must be float16 or float32 or float64 + # float16 only can be set on GPU place + x2 = fluid.layers.data( + name='x2', shape=[3, 4, 5, 6], dtype="int32") + fluid.layers.dropout(x2, dropout_prob=0.5) + + self.assertRaises(TypeError, test_dtype) + + class TestDropoutCAPI(unittest.TestCase): + def setUp(self): + np.random.seed(123) + self.places = [fluid.CPUPlace()] + self.places.append(fluid.XPUPlace(0)) + + def test_dygraph(self): + for place in self.places: + with fluid.dygraph.guard(place): + input_np = np.random.random([40, 40]).astype(self.in_type) + result_np = input_np + input = fluid.dygraph.to_variable(input_np) + m = paddle.nn.Dropout(p=0.) + m.eval() + result = m(input) + self.assertTrue(np.allclose(result.numpy(), result_np)) + + class TestDropoutBackward(unittest.TestCase): + def setUp(self): + np.random.seed(123) + self.places = [fluid.CPUPlace()] + self.places.append(fluid.XPUPlace(0)) + + def cal_grad_upscale_train(self, mask, prob): + return mask.astype(self.in_type) / (1 - prob) + + def cal_grad_downscale_in_infer(self, mask): + return mask.astype(self.in_type) + + def test_backward_downscale_in_infer(self): + for place in self.places: + with fluid.dygraph.guard(place): + + input = paddle.uniform([40, 40], dtype=self.in_type) + input.stop_gradient = False + out, mask = core.ops.dropout(input, 'dropout_prob', 0.5) + out.backward() + + self.assertTrue( + np.array_equal(input.gradient( + ), self.cal_grad_downscale_in_infer(mask.numpy()))) + + def test_backward_upscale_train(self): + for place in self.places: + with fluid.dygraph.guard(place): + + prob = 0.5 + input = paddle.uniform([40, 40], dtype=self.in_type) + input.stop_gradient = False + out, mask = core.ops.dropout(input, 'dropout_prob', prob, + "dropout_implementation", + "upscale_in_train") + out.backward() + + self.assertTrue( + np.allclose(input.gradient( + ), self.cal_grad_upscale_train(mask.numpy(), prob))) + + def test_backward_upscale_train_2(self): + for place in self.places: + with fluid.dygraph.guard(place): + + prob = 0.3 + input = paddle.uniform([40, 40], dtype=self.in_type) + input.stop_gradient = False + out, mask = core.ops.dropout(input, 'dropout_prob', prob, + "dropout_implementation", + "upscale_in_train") + out.backward() + + self.assertTrue( + np.allclose(input.gradient( + ), self.cal_grad_upscale_train(mask.numpy(), prob))) + + +support_types = get_xpu_op_support_types('dropout') +for stype in support_types: + create_test_class(globals(), XPUTestDropoutOp, stype) if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_lamb_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_lamb_op_xpu.py index 0e1714f1922de..f6aa82d596be7 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_lamb_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_lamb_op_xpu.py @@ -22,8 +22,7 @@ from paddle.fluid.op import Operator import paddle.fluid as fluid import paddle - - +""" class TestLambOp1(XPUOpTest): def set_attrs(self): self.attrs = { @@ -36,11 +35,11 @@ def set_attrs(self): def setUp(self): '''Test Lamb Op with supplied attributes ''' - self.op_type = "lamb" - param = np.random.uniform(-1, 1, 5000).astype("float32") - grad = np.random.uniform(-1, 1, 
5000).astype("float32") - moment1 = np.random.uniform(-1, 1, 5000).astype("float32") - moment2 = np.random.random(5000).astype("float32") + self.op_type = 'lamb' + param = np.random.uniform(-1, 1, 5000).astype('float32') + grad = np.random.uniform(-1, 1, 5000).astype('float32') + moment1 = np.random.uniform(-1, 1, 5000).astype('float32') + moment2 = np.random.random(5000).astype('float32') self.set_attrs() learning_rate = 0.001 @@ -52,9 +51,9 @@ def setUp(self): 'Grad': grad, 'Moment1': moment1, 'Moment2': moment2, - 'LearningRate': np.array([learning_rate]).astype("float32"), - 'Beta1Pow': np.array([beta1_pow]).astype("float32"), - 'Beta2Pow': np.array([beta2_pow]).astype("float32") + 'LearningRate': np.array([learning_rate]).astype('float32'), + 'Beta1Pow': np.array([beta1_pow]).astype('float32'), + 'Beta2Pow': np.array([beta2_pow]).astype('float32') } param_out, moment1_out, moment2_out, \ @@ -114,7 +113,7 @@ def lamb_step(inputs, attributes): beta2_pow_out = beta2_pow * beta2 return param_out, moment1_out, moment2_out, beta1_pow_out, beta2_pow_out - +""" if __name__ == "__main__": paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_matmul_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_matmul_op_xpu.py index 59646f2db413e..3120f1973f4f8 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_matmul_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_matmul_op_xpu.py @@ -24,7 +24,46 @@ import paddle.fluid as fluid from paddle.fluid import Program, program_guard -paddle.enable_static() +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper + + +def reference_matmul(X, Y, transpose_X=False, transpose_Y=False): + """Reference forward implementation using np.matmul.""" + # np.matmul does not support the transpose flags, so we manually + # transpose X and Y appropriately. + if transpose_X: + if X.ndim == 1: + X = X.reshape((X.size, 1)) + elif X.ndim == 2: + X = X.T + else: + dim = [i for i in range(len(X.shape))] + dim[-1], dim[len(X.shape) - 2] = dim[len(X.shape) - 2], dim[-1] + X = np.transpose(X, tuple(dim)) + if transpose_Y: + if Y.ndim == 1: + Y = Y.reshape((1, Y.size)) + elif Y.ndim == 2: + Y = Y.T + else: + dim = [i for i in range(len(Y.shape))] + dim[-1], dim[len(Y.shape) - 2] = dim[len(Y.shape) - 2], dim[-1] + Y = np.transpose(Y, tuple(dim)) + + if X.ndim == 3 and Y.ndim == 2: + x_dims = X.shape + X = X.reshape((x_dims[0] * x_dims[1], x_dims[2])) + if Y.ndim == 3 and X.ndim == 2: + y_dims = Y.shape + Y = Y.reshape((y_dims[0] * y_dims[1], y_dims[2])) + Out = np.matmul(X, Y) + if not Out.shape: + # We do not support 0-dimensional Tensors (scalars). So where + # np.matmul outputs a scalar, we must convert to a Tensor of + # shape (1, ) instead. + # Everywhere else, we are compatible with np.matmul. + Out = np.array([Out], dtype="float32") + return Out def generate_compatible_shapes(dim_X, dim_Y, transpose_X, transpose_Y, @@ -72,96 +111,26 @@ def generate_compatible_shapes(dim_X, dim_Y, transpose_X, transpose_Y, return shape_X, shape_Y -def reference_matmul(X, Y, transpose_X=False, transpose_Y=False): - """Reference forward implementation using np.matmul.""" - # np.matmul does not support the transpose flags, so we manually - # transpose X and Y appropriately. 
- if transpose_X: - if X.ndim == 1: - X = X.reshape((X.size, 1)) - elif X.ndim == 2: - X = X.T - else: - dim = [i for i in range(len(X.shape))] - dim[-1], dim[len(X.shape) - 2] = dim[len(X.shape) - 2], dim[-1] - X = np.transpose(X, tuple(dim)) - if transpose_Y: - if Y.ndim == 1: - Y = Y.reshape((1, Y.size)) - elif Y.ndim == 2: - Y = Y.T - else: - dim = [i for i in range(len(Y.shape))] - dim[-1], dim[len(Y.shape) - 2] = dim[len(Y.shape) - 2], dim[-1] - Y = np.transpose(Y, tuple(dim)) - - if X.ndim == 3 and Y.ndim == 2: - x_dims = X.shape - X = X.reshape((x_dims[0] * x_dims[1], x_dims[2])) - if Y.ndim == 3 and X.ndim == 2: - y_dims = Y.shape - Y = Y.reshape((y_dims[0] * y_dims[1], y_dims[2])) - Out = np.matmul(X, Y) - if not Out.shape: - # We do not support 0-dimensional Tensors (scalars). So where - # np.matmul outputs a scalar, we must convert to a Tensor of - # shape (1, ) instead. - # Everywhere else, we are compatible with np.matmul. - Out = np.array([Out], dtype="float32") - return Out - - -class Generator(object): - def setUp(self): - self.use_xpu = True - self.op_type = "matmul" - # self.init_test_case() - X = np.random.random(self.shape_X).astype("float32") - Y = np.random.random(self.shape_Y).astype("float32") - Out = reference_matmul(X, Y, self.transpose_X, self.transpose_Y) - self.inputs = {'X': X, 'Y': Y} - self.attrs = { - 'transpose_X': self.transpose_X, - 'transpose_Y': self.transpose_Y - } - self.outputs = {'Out': Out} - - def test_check_output(self): - place = paddle.XPUPlace(0) - self.check_output_with_place(place, atol=1e-3) - - def test_check_grad_normal(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['X', 'Y'], 'Out', max_relative_error=5e-2) - - def test_check_grad_ignore_x(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['Y'], 'Out', max_relative_error=5e-2, no_grad_set=set("X")) +def generate_compatible_shapes_2(dim, transpose_X, transpose_Y): + M = 2 + N = 4 + K = 3 + shape_X = [2 for _ in range(dim - 2)] + shape_Y = [2 for _ in range(dim - 2)] - def test_check_grad_ignore_y(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['X'], 'Out', max_relative_error=5e-2, no_grad_set=set('Y')) + if transpose_X: + shape_X += [K, M] + else: + shape_X += [M, K] + if transpose_Y: + shape_Y += [N, K] + else: + shape_Y += [K, N] -class TestMatmulOpError(unittest.TestCase): - def test_errors(self): - with program_guard(Program(), Program()): - # The inputs type of matmul_op must be Variable. - input1 = 12 - self.assertRaises(TypeError, fluid.layers.matmul, input1, input1) - # The inputs dtype of matmul_op must be float32, float64. 
- input2 = fluid.layers.data( - name='input2', shape=[10, 10], dtype="int32") - self.assertRaises(TypeError, fluid.layers.matmul, input2, input2) - input3 = fluid.layers.data( - name='input3', shape=[2, 2], dtype="float16") - fluid.layers.matmul(input3, input3) + return shape_X, shape_Y -# Negative dimension generation def generate_negative_dims(in_shape): from itertools import combinations size = len(in_shape) @@ -175,16 +144,15 @@ def generate_negative_dims(in_shape): return shapes -# Build program with inputs sizes that contain negative numbers def test_negative_dims_program(obj): for shape_x in generate_negative_dims(obj.shape_X): for shape_y in generate_negative_dims(obj.shape_Y): - X = np.random.random(obj.shape_X).astype("float32") - Y = np.random.random(obj.shape_Y).astype("float32") + X = np.random.random(obj.shape_X).astype(obj.in_type) + Y = np.random.random(obj.shape_Y).astype(obj.in_type) Ref = reference_matmul(X, Y, obj.transpose_X, obj.transpose_Y) with program_guard(Program(), Program()): - x = fluid.data(name='x', shape=shape_x, dtype='float32') - y = fluid.data(name='y', shape=shape_y, dtype='float32') + x = fluid.data(name='x', shape=shape_x, dtype=obj.in_type_str) + y = fluid.data(name='y', shape=shape_y, dtype=obj.in_type_str) output = fluid.layers.matmul(x, y, obj.transpose_X, obj.transpose_Y) obj.assertEqual(len(Ref.shape), len(output.shape)) @@ -196,167 +164,252 @@ def test_negative_dims_program(obj): feed={'x': X, 'y': Y}, fetch_list=[output]) - np.allclose(res, Ref, atol=1e-5) - - -# Generate program api cases for all negative possibilities -def api_test(dim_x, dim_y, trans_x, trans_y, batch_size): - test_name = ('TestMatMulAPI_dimX_{}_dim_Y_{}_transX_{}_transY_{}'.format( - dim_x, dim_y, trans_x, trans_y)) - shape_x, shape_y = generate_compatible_shapes(dim_x, dim_y, trans_x, - trans_y, batch_size) - globals()[test_name] = type(test_name, (unittest.TestCase, ), { - 'shape_X': shape_x, - 'shape_Y': shape_y, - 'transpose_X': trans_x, - 'transpose_Y': trans_y, - 'test_propram': test_negative_dims_program, - }) - - -# Generate operators cases for all possibilities -def inject_test(dim_x, dim_y, trans_x, trans_y, batch_size): - test_name = ( - 'TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}_batch_{}'.format( - dim_x, dim_y, trans_x, trans_y, batch)) - shape_x, shape_y = generate_compatible_shapes(dim_x, dim_y, trans_x, - trans_y, batch_size) - globals()[test_name] = type(test_name, (Generator, XPUOpTest), { - 'shape_X': shape_x, - 'shape_Y': shape_y, - 'transpose_X': trans_x, - 'transpose_Y': trans_y, - 'op_type': "matmul" - }) - - -xpu_support_dims_list = [[1, 1], [2, 2], [3, 3]] -batch_size = [2, 4, 5, 10, 50, 100, 300] -for dims in xpu_support_dims_list: - dim_X = dims[0] - dim_Y = dims[1] - for transose_x in (False, True): - for transose_y in (False, True): - for batch in batch_size: - inject_test(dim_X, dim_Y, transose_x, transose_y, batch) - # xpu not support all negative possibilities - # api_test(dim_X, dim_Y, False, False, 10) - - - # Test case n-dim -def generate_compatible_shapes_(dim, transpose_X, transpose_Y): - M = 2 - N = 4 - K = 3 - shape_X = [2 for _ in range(dim - 2)] - shape_Y = [2 for _ in range(dim - 2)] + np.allclose(res, Ref, atol=1e-3) - if transpose_X: - shape_X += [K, M] - else: - shape_X += [M, K] - if transpose_Y: - shape_Y += [N, K] - else: - shape_Y += [K, N] +class XPUTestMatmulOpErr(XPUOpTestWrapper): + def __init__(self): + self.op_name = "matmul" + self.use_dynamic_create_class = False - return shape_X, shape_Y + class 
TestMatmulOpError(unittest.TestCase): + def test_errors(self): + with program_guard(Program(), Program()): + # The inputs type of matmul_op must be Variable. + input1 = 12 + self.assertRaises(TypeError, fluid.layers.matmul, input1, + input1) + # The inputs dtype of matmul_op must be float32, float16 + input2 = fluid.layers.data( + name='input2', shape=[10, 10], dtype="int32") + self.assertRaises(TypeError, fluid.layers.matmul, input2, + input2) + input3 = fluid.layers.data( + name='input3', shape=[2, 2], dtype="float16") + fluid.layers.matmul(input3, input3) + + class API_TestMm(unittest.TestCase): + def test_out(self): + with fluid.program_guard(fluid.Program()): + x = fluid.data(name="x", shape=[2], dtype=self.in_type) + y = fluid.data(name='y', shape=[2], dtype=self.in_type) + res = fluid.data(name="output", shape=[1], dtype=self.in_type) + result = paddle.mm(x, y) + exe = fluid.Executor(fluid.XPUPlace(0)) + data1 = np.random.rand(2).astype(self.in_type) + data2 = np.random.rand(2).astype(self.in_type) + np_res = exe.run(feed={'x': data1, + 'y': data2}, + fetch_list=[result]) + expected_result = np.matmul( + data1.reshape(1, 2), data2.reshape(2, 1)) + + self.assertTrue( + np.allclose( + np_res, expected_result, atol=1e-3), + "two value is\ + {}\n{}, check diff!".format(np_res, expected_result)) + + def test_dygraph_without_out(self): + device = fluid.XPUPlace(0) + with fluid.dygraph.guard(device): + input_array1 = np.random.rand(3, 4).astype(self.in_type) + input_array2 = np.random.rand(4, 3).astype(self.in_type) + data1 = fluid.dygraph.to_variable(input_array1) + data2 = fluid.dygraph.to_variable(input_array2) + out = paddle.mm(data1, data2) + expected_result = np.matmul(input_array1, input_array2) + self.assertTrue( + np.allclose( + expected_result, out.numpy(), atol=1e-3)) + + class Test_API_Matmul(unittest.TestCase): + def test_dygraph_without_out(self): + device = fluid.XPUPlace(0) + with fluid.dygraph.guard(device): + input_array1 = np.random.rand(3, 4).astype(self.in_type) + input_array2 = np.random.rand(4, 3).astype(self.in_type) + data1 = fluid.dygraph.to_variable(input_array1).astype( + self.in_type) + data2 = fluid.dygraph.to_variable(input_array2).astype( + self.in_type) + out = paddle.matmul(data1, data2) + expected_result = np.matmul(input_array1, input_array2) + self.assertTrue( + np.allclose( + expected_result, out.numpy(), atol=1e-3)) + + class API_TestMmError(unittest.TestCase): + def test_errors(self): + def test_error1(): + with fluid.program_guard(fluid.Program(), fluid.Program()): + data1 = fluid.data( + name="data1", shape=[10, 2], dtype="float32") + data2 = fluid.data( + name="data2", shape=[3, 10], dtype="float32") + paddle.mm(data1, data2) + + self.assertRaises(ValueError, test_error1) + + def test_error2(): + with fluid.program_guard(fluid.Program(), fluid.Program()): + data1 = fluid.data( + name="data1", shape=[-1, 10, 2], dtype="float32") + data2 = fluid.data( + name="data2", shape=[-1, 2, 10], dtype="float32") + paddle.mm(data1, data2) + + test_error2() + + def test_error3(): + with fluid.program_guard(fluid.Program(), fluid.Program()): + data1 = fluid.data( + name="data1", shape=[10, 10, 2], dtype="float32") + data2 = fluid.data( + name="data2", shape=[3, 2, 10], dtype="float32") + paddle.mm(data1, data2) + + self.assertRaises(ValueError, test_error3) + + +class TestMatmulBaseGenerator(XPUOpTest): + def setUp(self): + self.op_type = "matmul" + self.dtype = np.float32 if not hasattr(self, + 'in_type') else self.in_type + shape_X = [4, 5] if not 
hasattr(self, 'shape_X') else self.shape_X + shape_Y = [5, 6] if not hasattr(self, 'shape_Y') else self.shape_Y + transpose_X = False if not hasattr(self, + 'transpose_X') else self.transpose_X + transpose_Y = False if not hasattr(self, + 'transpose_Y') else self.transpose_Y + + X = np.random.random(shape_X).astype(self.dtype) + Y = np.random.random(shape_Y).astype(self.dtype) + Out = reference_matmul(X, Y, transpose_X, transpose_Y) + self.inputs = {'X': X, 'Y': Y} + self.attrs = {'transpose_X': transpose_X, 'transpose_Y': transpose_Y} + self.outputs = {'Out': Out} + + def test_check_output(self): + place = paddle.XPUPlace(0) + self.check_output_with_place(place, atol=1e-3) + + def test_check_grad_normal(self): + place = paddle.XPUPlace(0) + self.check_grad_with_place( + place, ['X', 'Y'], 'Out', max_relative_error=5e-2) + def test_check_grad_ignore_x(self): + place = paddle.XPUPlace(0) + self.check_grad_with_place( + place, ['Y'], 'Out', max_relative_error=5e-2, no_grad_set=set("X")) + + def test_check_grad_ignore_y(self): + place = paddle.XPUPlace(0) + self.check_grad_with_place( + place, ['X'], 'Out', max_relative_error=5e-2, no_grad_set=set('Y')) -# Test case n-dim -for dim in [4]: - for transpose_X in [False, True]: - for transpose_Y in [False, True]: - test_name = ( - 'TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}'.format( - dim, dim, transpose_X, transpose_Y)) - shape_X, shape_Y = generate_compatible_shapes_(dim, transpose_X, - transpose_Y) - globals()[test_name] = type(test_name, (Generator, XPUOpTest), { - 'shape_X': shape_X, - 'shape_Y': shape_Y, - 'transpose_X': transpose_X, - 'transpose_Y': transpose_Y, - 'op_type': "matmul" - }) - - -class API_TestMm(unittest.TestCase): - def test_out(self): - with fluid.program_guard(fluid.Program()): - x = fluid.data(name="x", shape=[2], dtype="float64") - y = fluid.data(name='y', shape=[2], dtype='float64') - res = fluid.data(name="output", shape=[1], dtype="float64") - result = paddle.mm(x, y) - exe = fluid.Executor(fluid.XPUPlace(0)) - data1 = np.random.rand(2) - data2 = np.random.rand(2) - np_res = exe.run(feed={'x': data1, 'y': data2}, fetch_list=[result]) - expected_result = np.matmul( - data1.reshape(1, 2), data2.reshape(2, 1)) - - self.assertTrue( - np.allclose( - np_res, expected_result, atol=1e-5), - "two value is\ - {}\n{}, check diff!".format(np_res, expected_result)) - - def test_dygraph_without_out(self): - device = fluid.XPUPlace(0) - with fluid.dygraph.guard(device): - input_array1 = np.random.rand(3, 4).astype("float64") - input_array2 = np.random.rand(4, 3).astype("float64") - data1 = fluid.dygraph.to_variable(input_array1) - data2 = fluid.dygraph.to_variable(input_array2) - out = paddle.mm(data1, data2) - expected_result = np.matmul(input_array1, input_array2) - self.assertTrue(np.allclose(expected_result, out.numpy())) - - -class Test_API_Matmul(unittest.TestCase): - def test_dygraph_without_out(self): - device = fluid.XPUPlace(0) - with fluid.dygraph.guard(device): - input_array1 = np.random.rand(3, 4).astype("float64") - input_array2 = np.random.rand(4, 3).astype("float64") - data1 = fluid.dygraph.to_variable(input_array1) - data2 = fluid.dygraph.to_variable(input_array2) - out = paddle.matmul(data1, data2) - expected_result = np.matmul(input_array1, input_array2) - self.assertTrue(np.allclose(expected_result, out.numpy())) - - -class API_TestMmError(unittest.TestCase): - def test_errors(self): - def test_error1(): - with fluid.program_guard(fluid.Program(), fluid.Program()): - data1 = fluid.data(name="data1", 
shape=[10, 2], dtype="float32") - data2 = fluid.data(name="data2", shape=[3, 10], dtype="float32") - paddle.mm(data1, data2) - - self.assertRaises(ValueError, test_error1) - - def test_error2(): - with fluid.program_guard(fluid.Program(), fluid.Program()): - data1 = fluid.data( - name="data1", shape=[-1, 10, 2], dtype="float32") - data2 = fluid.data( - name="data2", shape=[-1, 2, 10], dtype="float32") - paddle.mm(data1, data2) - - test_error2() - - def test_error3(): - with fluid.program_guard(fluid.Program(), fluid.Program()): - data1 = fluid.data( - name="data1", shape=[10, 10, 2], dtype="float32") - data2 = fluid.data( - name="data2", shape=[3, 2, 10], dtype="float32") - paddle.mm(data1, data2) - - self.assertRaises(ValueError, test_error3) +class XPUTestMatmulOp1(XPUOpTestWrapper): + def __init__(self): + self.op_name = "matmul" + self.use_dynamic_create_class = True + + def dynamic_create_class(self): + base_class = TestMatmulBaseGenerator + classes = [] + xpu_support_dims_list = [[1, 1], [2, 2], [3, 3]] + batch_size = [2, 4, 5, 10, 50, 100, 300] + for dims in xpu_support_dims_list: + dim_X = dims[0] + dim_Y = dims[1] + for transose_x in [True, False]: + for transose_y in [True, False]: + for batch in batch_size: + class_name = ( + 'TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}_batch_{}'. + format(dim_X, dim_Y, transose_x, transose_y, batch)) + shape_x, shape_y = generate_compatible_shapes( + dim_X, dim_Y, transose_x, transose_y, batch) + attr_dict = { + 'shape_X': shape_x, + 'shape_Y': shape_y, + 'transpose_X': transose_x, + 'transpose_Y': transose_y, + 'op_type': "matmul" + } + classes.append([class_name, attr_dict]) + + return base_class, classes + + +class XPUTestMatmulOp2(XPUOpTestWrapper): + def __init__(self): + self.op_name = "matmul" + self.use_dynamic_create_class = True + + def dynamic_create_class(self): + base_class = unittest.TestCase + classes = [] + xpu_support_dims_list = [[1, 1], [2, 2], [3, 3]] + batch_size = [2, 4, 5, 10, 50, 100, 300] + for dims in xpu_support_dims_list: + dim_X = dims[0] + dim_Y = dims[1] + for transose_x in [True, False]: + for transose_y in [True, False]: + for batch in batch_size: + class_name = ( + 'TestMatMulAPI_dimX_{}_dim_Y_{}_transX_{}_transY_{}_batch_{}'. + format(dim_X, dim_Y, transose_x, transose_y, batch)) + shape_x, shape_y = generate_compatible_shapes( + dim_X, dim_Y, transose_x, transose_y, batch) + attr_dict = { + 'shape_X': shape_x, + 'shape_Y': shape_y, + 'transpose_X': transose_x, + 'transpose_Y': transose_y, + 'test_propram': test_negative_dims_program, + } + classes.append([class_name, attr_dict]) + return base_class, classes + + +class XPUTestMatmulOp3(XPUOpTestWrapper): + def __init__(self): + self.op_name = "matmul" + self.use_dynamic_create_class = True + + def dynamic_create_class(self): + base_class = TestMatmulBaseGenerator + classes = [] + for dim in [4]: + for transpose_X in [False, True]: + for transpose_Y in [False, True]: + class_name = ( + 'TestMatMulOp2_dimX_{}_dim_Y_{}_transX_{}_transY_{}'. 
+ format(dim, dim, transpose_X, transpose_Y)) + shape_X, shape_Y = generate_compatible_shapes_2( + dim, transpose_X, transpose_Y) + attr_dict = { + 'shape_X': shape_X, + 'shape_Y': shape_Y, + 'transpose_X': transpose_X, + 'transpose_Y': transpose_Y, + 'op_type': "matmul" + } + classes.append([class_name, attr_dict]) + return base_class, classes + + +support_types = get_xpu_op_support_types('matmul') +for stype in support_types: + create_test_class(globals(), XPUTestMatmulOpErr, stype) + create_test_class(globals(), XPUTestMatmulOp1, stype) + create_test_class(globals(), XPUTestMatmulOp2, stype) + create_test_class(globals(), XPUTestMatmulOp3, stype) if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_matmul_v2_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_matmul_v2_op_xpu.py index 9891da6ea21d9..3db3031f44c80 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_matmul_v2_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_matmul_v2_op_xpu.py @@ -23,6 +23,9 @@ import paddle import paddle.fluid as fluid import paddle.fluid.framework as framework +from paddle.fluid.framework import _test_eager_guard + +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper def reference_matmul(X, Y, transpose_X=False, transpose_Y=False): @@ -55,273 +58,239 @@ def reference_matmul(X, Y, transpose_X=False, transpose_Y=False): return Out -class TestMatMulV2Op(XPUOpTest): - """ - case 1 - """ - - def config(self): - self.x_shape = (100, ) - self.y_shape = (100, ) - self.trans_x = False - self.trans_y = False - - def init_kernel_type(self): - self.dtype = "float32" - - def setUp(self): - self.use_xpu = True - self.init_kernel_type() - self.config() - self.op_type = "matmul_v2" - x = np.random.random(self.x_shape).astype(self.dtype) - y = np.random.random(self.y_shape).astype(self.dtype) - # -0.1 ~ 0.1 - x = -0.1 + 0.2 * x - y = -0.1 + 0.2 * y - result = reference_matmul(x, y, self.trans_x, self.trans_y) - result = result.astype(self.dtype) - self.inputs = { - 'X': x, - 'Y': y, - } - self.attrs = {'trans_x': self.trans_x, 'trans_y': self.trans_y} - self.outputs = {'Out': result} - - def test_check_output(self): - place = paddle.XPUPlace(0) - self.check_output_with_place(place) - - def test_check_grad(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place(place, ['X', 'Y'], 'Out') - - -class TestMatMulOp2(TestMatMulV2Op): - """ - case 2 - """ - - def config(self): - self.x_shape = (100) - self.y_shape = (100, 3) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp3(TestMatMulV2Op): - """ - case 3 - """ - - def config(self): - self.x_shape = (100, ) - self.y_shape = (1, 1, 100, 2) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp4(TestMatMulV2Op): - """ - case 4 - """ - - def config(self): - self.x_shape = (1, 1, 100, 1) - self.y_shape = (1, 100) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp5(TestMatMulV2Op): - """ - case 5 - """ - - def config(self): - self.x_shape = (1, 1, 100, 1) - self.y_shape = (100, ) - self.trans_x = True - self.trans_y = False - - -class TestMatMulOp6(TestMatMulV2Op): - """ - case 6 - """ - - def config(self): - self.x_shape = (1, 2, 102, 10) - self.y_shape = (2, 10, 111) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp7(TestMatMulV2Op): - """ - case 7 - """ - - def config(self): - self.x_shape = (1, 2, 100, 1) - self.y_shape = (2, 100, 12) - self.trans_x = True 
- self.trans_y = False - - -class TestMatMulOp8(TestMatMulV2Op): - """ - case 8 - """ - - def config(self): - self.x_shape = (1, 1, 2, 100) - self.y_shape = (1, 1, 100, 2) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp9(TestMatMulV2Op): - """ - case 9 - """ - - def config(self): - self.x_shape = (100, 20, 100) - self.y_shape = (100, 100, 100) - self.trans_x = False - self.trans_y = True - - -class TestMatMulOp10(TestMatMulV2Op): - """ - case 10 - """ - - def config(self): - self.x_shape = (100, 20, 100) - self.y_shape = (100, 20, 100) - self.trans_x = True - self.trans_y = False - - -class TestMatMulOp11(TestMatMulV2Op): - """ - case 11 - """ - - def config(self): - self.x_shape = (2, 20, 100) - self.y_shape = (100, 30) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp12(TestMatMulV2Op): - """ - case 12 - """ - - def config(self): - self.x_shape = (1, 20, 100) - self.y_shape = (100, ) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp13(TestMatMulV2Op): - """ - case 13 - """ - - def config(self): - self.x_shape = (2, 2, 10, 10) - self.y_shape = (2, 2, 10, 10) - self.trans_x = True - self.trans_y = False - - -class TestMatMulOp14(TestMatMulV2Op): - """ - case 14_1 - """ - - def config(self): - self.x_shape = (100, 2, 100, 10) - self.y_shape = (100, 2, 10, 90) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp15(TestMatMulV2Op): - """ - case 14_2 - """ - - def config(self): - self.x_shape = (100, 2, 100, 10) - self.y_shape = (100, 2, 100, 10) - self.trans_x = False - self.trans_y = True - - -class TestMatMulOp16(TestMatMulV2Op): - """ - case 16 : to check the big data - """ - - def config(self): - self.x_shape = (1000, 2, 100, 100) - self.y_shape = (1000, 2, 100, 900) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp17(TestMatMulV2Op): - """ - case 17 : to check the gradient for special case - """ - - def config(self): - self.x_shape = (2, 1, 100) - self.y_shape = (100) - self.trans_x = False - self.trans_y = False - - -class TestMatMulOp18(TestMatMulV2Op): - """ - case 18 : for ppyoloe model - """ - - def config(self): - self.x_shape = (8, 111, 4, 17) - self.y_shape = (17) - self.trans_x = False - self.trans_y = False - - -# class TestMatMulOpBroadcast1(TestMatMulV2Op): -# """ -# case 14_3 -# """ - -# def config(self): -# self.x_shape = (3, 1, 10, 10) -# self.y_shape = (1, 2, 10, 10) -# self.trans_x = True -# self.trans_y = True - -# class TestMatMulOpBroadcast2(TestMatMulV2Op): -# """ -# case 14_4 -# """ - -# def config(self): -# self.x_shape = (3, 1, 10, 10) -# self.y_shape = (1, 2, 10, 10) -# self.trans_x = False -# self.trans_y = True +class XPUTestMatmulV2Op(XPUOpTestWrapper): + def __init__(self): + self.op_name = "matmul_v2" + self.use_dynamic_create_class = False + + class TestMatMulV2Op(XPUOpTest): + """ + case 1 + """ + + def config(self): + self.x_shape = (100, ) + self.y_shape = (100, ) + self.trans_x = False + self.trans_y = False + + def setUp(self): + self.dtype = self.in_type + self.config() + self.op_type = "matmul_v2" + x = np.random.random(self.x_shape).astype(self.dtype) + y = np.random.random(self.y_shape).astype(self.dtype) + # -0.1 ~ 0.1 + x = -0.1 + 0.2 * x + y = -0.1 + 0.2 * y + result = reference_matmul(x, y, self.trans_x, self.trans_y) + result = result.astype(self.dtype) + self.inputs = { + 'X': x, + 'Y': y, + } + self.attrs = {'trans_x': self.trans_x, 'trans_y': self.trans_y} + self.outputs = {'Out': result} + + def test_check_output(self): + place = 
paddle.XPUPlace(0) + self.check_output_with_place(place) + + def test_check_grad(self): + place = paddle.XPUPlace(0) + self.check_grad_with_place(place, ['X', 'Y'], 'Out') + + class TestMatMulOp2(TestMatMulV2Op): + """ + case 2 + """ + + def config(self): + self.x_shape = (100) + self.y_shape = (100, 3) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp3(TestMatMulV2Op): + """ + case 3 + """ + + def config(self): + self.x_shape = (100, ) + self.y_shape = (1, 1, 100, 2) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp4(TestMatMulV2Op): + """ + case 4 + """ + + def config(self): + self.x_shape = (1, 1, 100, 1) + self.y_shape = (1, 100) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp5(TestMatMulV2Op): + """ + case 5 + """ + + def config(self): + self.x_shape = (1, 1, 100, 1) + self.y_shape = (100, ) + self.trans_x = True + self.trans_y = False + + class TestMatMulOp6(TestMatMulV2Op): + """ + case 6 + """ + + def config(self): + self.x_shape = (1, 2, 102, 10) + self.y_shape = (2, 10, 111) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp7(TestMatMulV2Op): + """ + case 7 + """ + + def config(self): + self.x_shape = (1, 2, 100, 1) + self.y_shape = (2, 100, 12) + self.trans_x = True + self.trans_y = False + + class TestMatMulOp8(TestMatMulV2Op): + """ + case 8 + """ + + def config(self): + self.x_shape = (1, 1, 2, 100) + self.y_shape = (1, 1, 100, 2) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp9(TestMatMulV2Op): + """ + case 9 + """ + + def config(self): + self.x_shape = (100, 20, 100) + self.y_shape = (100, 100, 100) + self.trans_x = False + self.trans_y = True + + class TestMatMulOp10(TestMatMulV2Op): + """ + case 10 + """ + + def config(self): + self.x_shape = (100, 20, 100) + self.y_shape = (100, 20, 100) + self.trans_x = True + self.trans_y = False + + class TestMatMulOp11(TestMatMulV2Op): + """ + case 11 + """ + + def config(self): + self.x_shape = (2, 20, 100) + self.y_shape = (100, 30) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp12(TestMatMulV2Op): + """ + case 12 + """ + + def config(self): + self.x_shape = (1, 20, 100) + self.y_shape = (100, ) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp13(TestMatMulV2Op): + """ + case 13 + """ + + def config(self): + self.x_shape = (2, 2, 10, 10) + self.y_shape = (2, 2, 10, 10) + self.trans_x = True + self.trans_y = False + + class TestMatMulOp14(TestMatMulV2Op): + """ + case 14_1 + """ + + def config(self): + self.x_shape = (100, 2, 100, 10) + self.y_shape = (100, 2, 10, 90) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp15(TestMatMulV2Op): + """ + case 14_2 + """ + + def config(self): + self.x_shape = (100, 2, 100, 10) + self.y_shape = (100, 2, 100, 10) + self.trans_x = False + self.trans_y = True + + class TestMatMulOp16(TestMatMulV2Op): + """ + case 16 : to check the big data + """ + + def config(self): + self.x_shape = (1000, 2, 100, 100) + self.y_shape = (1000, 2, 100, 900) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp17(TestMatMulV2Op): + """ + case 17 : to check the gradient for special case + """ + + def config(self): + self.x_shape = (2, 1, 100) + self.y_shape = (100) + self.trans_x = False + self.trans_y = False + + class TestMatMulOp18(TestMatMulV2Op): + """ + case 18 : for ppyoloe model + """ + + def config(self): + self.x_shape = (8, 111, 4, 17) + self.y_shape = (17) + self.trans_x = False + self.trans_y = False + + +support_types = 
get_xpu_op_support_types('matmul_v2') +for stype in support_types: + create_test_class(globals(), XPUTestMatmulV2Op, stype) if __name__ == "__main__": paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_momentum_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_momentum_op_xpu.py index ccee79e8cd77a..f7c1f0041e805 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_momentum_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_momentum_op_xpu.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -17,52 +17,150 @@ import unittest import numpy as np import sys -import os sys.path.append("..") -from op_test import OpTest -import paddle -from paddle.fluid import core -from paddle.fluid.op import Operator - -class TestMomentumOp1(OpTest): - def setUp(self): - self.op_type = "momentum" - self.dtype = np.float32 - self.init_dtype() +import paddle +import paddle.fluid.core as core - param = np.random.random((123, 321)).astype(self.dtype) - grad = np.random.random((123, 321)).astype(self.dtype) - velocity = np.zeros((123, 321)).astype(self.dtype) - learning_rate = np.array([0.001]).astype(self.dtype) - mu = 0.0001 - use_nesterov = False +from op_test import OpTest +from op_test_xpu import XPUOpTest +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper - self.inputs = { - 'Param': param, - 'Grad': grad, - 'Velocity': velocity, - 'LearningRate': learning_rate - } +paddle.enable_static() - self.attrs = {'mu': mu} +def calculate_momentum_by_numpy(param, grad, mu, velocity, use_nesterov, + learning_rate, regularization_method, + regularization_coeff): + if regularization_method == "l2_decay": + grad = grad + regularization_coeff * param + velocity_out = mu * velocity + grad + if use_nesterov: + param_out = param - (grad + velocity_out * mu) * learning_rate + else: + param_out = param - learning_rate * velocity_out + else: velocity_out = mu * velocity + grad if use_nesterov: param_out = param - grad * learning_rate - \ velocity_out * mu * learning_rate else: param_out = param - learning_rate * velocity_out + return param_out, velocity_out + + +class XPUTestMomentumOP(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'momentum' + self.use_dynamic_create_class = False + + class TestMomentumOPBase(XPUOpTest): + def setUp(self): + self.place = paddle.XPUPlace(0) + self.xpu_version = core.get_xpu_device_version(0) + self.init_dtype() + self.set_case() + + def set_case(self): + self.op_type = 'momentum' + self.dtype = self.in_type + self.init_config() + + self.param = np.random.uniform(-1, 1, + self.input_shape).astype(self.dtype) + self.grad = np.random.uniform(-1, 1, + self.input_shape).astype(self.dtype) + self.velocity = np.random.uniform( + -1, 1, self.input_shape).astype(self.dtype) + + param_out, velocity_out = calculate_momentum_by_numpy( + param=self.param, + grad=self.grad, + mu=self.mu, + velocity=self.velocity, + use_nesterov=self.use_nesterov, + learning_rate=self.learning_rate, + regularization_method=self.regularization_method, + regularization_coeff=self.regularization_coeff) + self.inputs = { + 'Param': self.param, + 'Grad': self.grad, + 'Velocity': self.velocity, + 'LearningRate': self.learning_rate, + } + self.attrs = { + 'use_xpu': True, + 'mu': self.mu, + 
'use_nesterov': self.use_nesterov, + 'regularization_method': self.regularization_method, + 'regularization_coeff': self.regularization_coeff + } + self.outputs = {'ParamOut': param_out, 'VelocityOut': velocity_out} + + def init_dtype(self): + self.dtype = np.float32 + + def test_check_output(self): + self.check_output_with_place(self.place) + + def init_config(self): + self.input_shape = [864] + self.learning_rate = np.array([0.001]).astype(self.dtype) + self.mu = 0.0001 + self.use_nesterov = False + self.regularization_method = None + self.regularization_coeff = 0 + + class XPUTestMomentum1(TestMomentumOPBase): + def init_config(self): + self.input_shape = [2, 768] + self.learning_rate = np.array([0.002]).astype(self.dtype) + self.mu = 0.001 + self.use_nesterov = False + self.regularization_method = None + self.regularization_coeff = 0 + + class XPUTestMomentum2(TestMomentumOPBase): + def init_config(self): + self.input_shape = [3, 8, 4096] + self.learning_rate = np.array([0.005]).astype(self.dtype) + self.mu = 0.002 + self.use_nesterov = True + self.regularization_method = None + self.regularization_coeff = 0 - self.outputs = {'ParamOut': param_out, 'VelocityOut': velocity_out} + class XPUTestMomentum3(TestMomentumOPBase): + def init_config(self): + self.input_shape = [1024] + self.learning_rate = np.array([0.01]).astype(self.dtype) + self.mu = 0.0001 + self.use_nesterov = False + if self.xpu_version != core.XPUVersion.XPU1: + self.regularization_method = "l2_decay" + self.regularization_coeff = 0.005 + else: + # regularization not supported on XPU1 + self.regularization_method = None + self.regularization_coeff = 0 - def init_dtype(self): - pass + class XPUTestMomentum4(TestMomentumOPBase): + def init_config(self): + self.input_shape = [2, 2, 255] + self.learning_rate = np.array([0.0005]).astype(self.dtype) + self.mu = 0.005 + self.use_nesterov = True + if self.xpu_version != core.XPUVersion.XPU1: + self.regularization_method = "l2_decay" + self.regularization_coeff = 0.005 + else: + # regularization not supported on XPU1 + self.regularization_method = None + self.regularization_coeff = 0 - def test_check_output_with_place(self): - self.check_output_with_place(paddle.XPUPlace(0)) +support_types = get_xpu_op_support_types('momentum') +for stype in support_types: + create_test_class(globals(), XPUTestMomentumOP, stype) if __name__ == "__main__": - paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_mul_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_mul_op_xpu.py index 58a8fa3083055..9d98ab70041e9 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_mul_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_mul_op_xpu.py @@ -27,104 +27,120 @@ paddle.enable_static() +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper + -@unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") class TestMulOpError(unittest.TestCase): def test_errors(self): with program_guard(Program(), Program()): # The input type of mul_op must be Variable. x1 = fluid.create_lod_tensor( - np.array([[-1]]), [[1]], fluid.CPUPlace()) + np.array([[-1]]), [[1]], fluid.XPUPlace(0)) x2 = fluid.create_lod_tensor( - np.array([[-1]]), [[1]], fluid.CPUPlace()) + np.array([[-1]]), [[1]], fluid.XPUPlace(0)) self.assertRaises(TypeError, fluid.layers.mul, x1, x2) - # The input dtype of mul_op must be float32 or float64. + # The input dtype of mul_op must be float32. 
x3 = fluid.layers.data(name='x3', shape=[4], dtype="int32") x4 = fluid.layers.data(name='x4', shape=[4], dtype="int32") self.assertRaises(TypeError, fluid.layers.mul, x3, x4) -@unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") -class TestXPUMulOp1(XPUOpTest): - def setUp(self): - self.op_type = "mul" - self.dtype = np.float32 - self.use_xpu = True - self.init_dtype_type() - self.inputs = { - 'X': np.random.random((3, 4, 2, 9)).astype(self.dtype), - 'Y': np.random.random((3, 6, 1, 2, 3)).astype(self.dtype) - } - self.attrs = { - 'x_num_col_dims': 2, - 'y_num_col_dims': 2, - } - result = np.dot(self.inputs['X'].reshape(3 * 4, 2 * 9), - self.inputs['Y'].reshape(3 * 6, 1 * 2 * 3)) - result = result.reshape(3, 4, 1, 2, 3) - self.outputs = {'Out': result} - - def init_dtype_type(self): - pass - - def test_check_output(self): - place = paddle.XPUPlace(0) - self.check_output_with_place(place, atol=0.01) - - def test_check_grad_normal(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['X', 'Y'], 'Out', max_relative_error=0.1) - - def test_check_grad_ingore_x(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['Y'], 'Out', max_relative_error=0.1, no_grad_set=set("X")) - - def test_check_grad_ignore_y(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['X'], 'Out', max_relative_error=0.1, no_grad_set=set('Y')) - - -@unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") -class TestXPUMulOp2(XPUOpTest): - def setUp(self): - self.op_type = "mul" - self.use_xpu = True - self.dtype = np.float32 - self.init_dtype_type() - self.inputs = { - 'X': np.random.random((20, 5)).astype(self.dtype), - 'Y': np.random.random((5, 21)).astype(self.dtype) - } - self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])} - - def init_dtype_type(self): - self.dtype = np.float32 - - def test_check_output(self): - place = paddle.XPUPlace(0) - self.check_output_with_place(place, atol=0.01) - - def test_check_grad_normal(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['X', 'Y'], 'Out', max_relative_error=0.1) - - def test_check_grad_ingore_x(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['Y'], 'Out', max_relative_error=0.1, no_grad_set=set("X")) - - def test_check_grad_ingore_y(self): - place = paddle.XPUPlace(0) - self.check_grad_with_place( - place, ['X'], 'Out', max_relative_error=0.1, no_grad_set=set('Y')) - +class XPUTestMulOp(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'mul' + self.use_dynamic_create_class = False + + class TestXPUMulOp1(XPUOpTest): + def setUp(self): + self.op_type = "mul" + self.dtype = self.in_type + self.inputs = { + 'X': np.random.random((3, 4, 2, 9)).astype(self.in_type_str), + 'Y': np.random.random((3, 6, 1, 2, 3)).astype(self.in_type_str) + } + self.attrs = { + 'x_num_col_dims': 2, + 'y_num_col_dims': 2, + } + result = np.dot(self.inputs['X'].reshape(3 * 4, 2 * 9), + self.inputs['Y'].reshape(3 * 6, 1 * 2 * 3)) + result = result.reshape(3, 4, 1, 2, 3) + self.outputs = {'Out': result} + + def test_check_output(self): + paddle.enable_static() + place = paddle.XPUPlace(0) + self.check_output_with_place(place, atol=0.01) + + def test_check_grad_normal(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + self.check_grad_with_place( + place, ['X', 'Y'], 'Out', max_relative_error=0.1) + + def test_check_grad_ingore_x(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + 
self.check_grad_with_place( + place, ['Y'], + 'Out', + max_relative_error=0.1, + no_grad_set=set("X")) + + def test_check_grad_ignore_y(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + self.check_grad_with_place( + place, ['X'], + 'Out', + max_relative_error=0.1, + no_grad_set=set('Y')) + + class TestXPUMulOp2(XPUOpTest): + def setUp(self): + self.op_type = "mul" + self.use_xpu = True + self.dtype = self.in_type + self.inputs = { + 'X': np.random.random((20, 5)).astype(self.in_type_str), + 'Y': np.random.random((5, 21)).astype(self.in_type_str) + } + self.outputs = {'Out': np.dot(self.inputs['X'], self.inputs['Y'])} + + def test_check_output(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + self.check_output_with_place(place, atol=0.01) + + def test_check_grad_normal(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + self.check_grad_with_place( + place, ['X', 'Y'], 'Out', max_relative_error=0.1) + + def test_check_grad_ingore_x(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + self.check_grad_with_place( + place, ['Y'], + 'Out', + max_relative_error=0.1, + no_grad_set=set("X")) + + def test_check_grad_ingore_y(self): + place = paddle.XPUPlace(0) + paddle.enable_static() + self.check_grad_with_place( + place, ['X'], + 'Out', + max_relative_error=0.1, + no_grad_set=set('Y')) + + +support_types = get_xpu_op_support_types('mul') +for stype in support_types: + create_test_class(globals(), XPUTestMulOp, stype) if __name__ == "__main__": + paddle.enable_static() unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_nearest_interp_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_nearest_interp_op_xpu.py index 35dadb59bf202..731358d5304b4 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_nearest_interp_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_nearest_interp_op_xpu.py @@ -25,15 +25,14 @@ from paddle.fluid import Program, program_guard paddle.enable_static() - - +''' def nearest_neighbor_interp_np(X, out_h, out_w, out_size=None, actual_shape=None, align_corners=True, - data_layout='NCHW'): + data_layout="NCHW"): """nearest neighbor interpolation implement in shape [N, C, H, W]""" if data_layout == "NHWC": X = np.transpose(X, (0, 3, 1, 2)) # NHWC => NCHW @@ -85,7 +84,7 @@ def setUp(self): self.use_xpu = True self.out_size = None self.actual_shape = None - self.data_layout = 'NCHW' + self.data_layout = "NCHW" self.init_test_case() self.op_type = "nearest_interp" input_np = np.random.random(self.input_shape).astype("float32") @@ -107,20 +106,20 @@ def setUp(self): output_np = nearest_neighbor_interp_np( input_np, out_h, out_w, self.out_size, self.actual_shape, self.align_corners, self.data_layout) - self.inputs = {'X': input_np} + self.inputs = {"X": input_np} if self.out_size is not None: - self.inputs['OutSize'] = self.out_size + self.inputs["OutSize"] = self.out_size if self.actual_shape is not None: - self.inputs['OutSize'] = self.actual_shape + self.inputs["OutSize"] = self.actual_shape self.attrs = { - 'out_h': self.out_h, - 'out_w': self.out_w, - 'scale': self.scale, - 'interp_method': self.interp_method, - 'align_corners': self.align_corners, - 'data_layout': self.data_layout + "out_h": self.out_h, + "out_w": self.out_w, + "scale": self.scale, + "interp_method": self.interp_method, + "align_corners": self.align_corners, + "data_layout": self.data_layout } - self.outputs = {'Out': output_np} + self.outputs = {"Out": output_np} def test_check_output(self): place = paddle.XPUPlace(0) @@ 
-128,10 +127,10 @@ def test_check_output(self): def test_check_grad(self): place = paddle.XPUPlace(0) - self.check_grad_with_place(place, ['X'], 'Out', in_place=True) + self.check_grad_with_place(place, ["X"], "Out", in_place=True) def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [2, 3, 4, 5] self.out_h = 2 self.out_w = 2 @@ -144,7 +143,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpCase1(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [4, 1, 7, 8] self.out_h = 1 self.out_w = 1 @@ -156,7 +155,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpCase2(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 3, 9, 6] self.out_h = 12 self.out_w = 12 @@ -168,7 +167,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpCase3(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [1, 1, 32, 64] self.out_h = 64 self.out_w = 32 @@ -180,7 +179,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpCase4(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [4, 1, 7, 8] self.out_h = 1 self.out_w = 1 @@ -193,7 +192,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpCase5(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 3, 9, 6] self.out_h = 12 self.out_w = 12 @@ -206,7 +205,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpCase6(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [1, 1, 32, 64] self.out_h = 64 self.out_w = 32 @@ -219,7 +218,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpSame(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [2, 3, 32, 64] self.out_h = 32 self.out_w = 64 @@ -231,7 +230,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpActualShape(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 2, 32, 16] self.out_h = 64 self.out_w = 32 @@ -244,7 +243,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpDataLayout(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [2, 4, 4, 5] self.out_h = 2 self.out_w = 2 @@ -265,7 +264,7 @@ def set_align_corners(self): "core is not compiled with XPU") class TestNearestNeighborInterpScale1(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 2, 7, 5] self.out_h = 64 self.out_w = 32 @@ -278,7 +277,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpScale2(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + 
self.interp_method = "nearest" self.input_shape = [3, 2, 5, 7] self.out_h = 64 self.out_w = 32 @@ -291,7 +290,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestNeighborInterpScale3(TestNearestInterpOp): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 2, 7, 5] self.out_h = 64 self.out_w = 32 @@ -311,38 +310,38 @@ def setUp(self): self.shape_by_1Dtensor = False self.scale_by_1Dtensor = False self.attrs = { - 'interp_method': self.interp_method, - 'align_corners': self.align_corners, + "interp_method": self.interp_method, + "align_corners": self.align_corners, } input_np = np.random.random(self.input_shape).astype("float32") - self.inputs = {'X': input_np} + self.inputs = {"X": input_np} if self.scale_by_1Dtensor: - self.inputs['Scale'] = np.array([self.scale]).astype("float32") + self.inputs["Scale"] = np.array([self.scale]).astype("float32") elif self.scale > 0: out_h = int(self.input_shape[2] * self.scale) out_w = int(self.input_shape[3] * self.scale) - self.attrs['scale'] = self.scale + self.attrs["scale"] = self.scale else: out_h = self.out_h out_w = self.out_w if self.shape_by_1Dtensor: - self.inputs['OutSize'] = self.out_size + self.inputs["OutSize"] = self.out_size elif self.out_size is not None: size_tensor = [] for index, ele in enumerate(self.out_size): size_tensor.append(("x" + str(index), np.ones( - (1)).astype('int32') * ele)) - self.inputs['SizeTensor'] = size_tensor + (1)).astype("int32") * ele)) + self.inputs["SizeTensor"] = size_tensor - self.attrs['out_h'] = self.out_h - self.attrs['out_w'] = self.out_w + self.attrs["out_h"] = self.out_h + self.attrs["out_w"] = self.out_w output_np = nearest_neighbor_interp_np(input_np, out_h, out_w, self.out_size, self.actual_shape, self.align_corners) - self.outputs = {'Out': output_np} + self.outputs = {"Out": output_np} def test_check_output(self): place = paddle.XPUPlace(0) @@ -350,10 +349,10 @@ def test_check_output(self): def test_check_grad(self): place = paddle.XPUPlace(0) - self.check_grad_with_place(place, ['X'], 'Out', in_place=True) + self.check_grad_with_place(place, ["X"], "Out", in_place=True) def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [2, 5, 4, 4] self.out_h = 3 self.out_w = 3 @@ -367,7 +366,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestInterp_attr_tensor_Case1(TestNearestInterpOp_attr_tensor): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 3, 9, 6] self.out_h = 12 self.out_w = 12 @@ -381,7 +380,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestInterp_attr_tensor_Case2(TestNearestInterpOp_attr_tensor): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 2, 32, 16] self.out_h = 64 self.out_w = 32 @@ -396,7 +395,7 @@ def init_test_case(self): "core is not compiled with XPU") class TestNearestInterp_attr_tensor_Case3(TestNearestInterpOp_attr_tensor): def init_test_case(self): - self.interp_method = 'nearest' + self.interp_method = "nearest" self.input_shape = [3, 2, 32, 16] self.out_h = 64 self.out_w = 32 @@ -415,10 +414,10 @@ def test_exception(self): def attr_data_format(): # for 4-D input, data_format can only be NCHW or NHWC out = fluid.layers.resize_nearest( - input, out_shape=[4, 8], data_format='NDHWC') + input, out_shape=[4, 8], data_format="NDHWC") def 
attr_scale_type(): - out = fluid.layers.resize_nearest(input, scale='scale') + out = fluid.layers.resize_nearest(input, scale="scale") def attr_scale_value(): out = fluid.layers.resize_nearest(input, scale=-0.3) @@ -426,7 +425,7 @@ def attr_scale_value(): self.assertRaises(ValueError, attr_data_format) self.assertRaises(TypeError, attr_scale_type) self.assertRaises(ValueError, attr_scale_value) - +''' if __name__ == "__main__": unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_one_hot_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_one_hot_op_xpu.py index 7898b5f6892f9..8c8406ba433de 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_one_hot_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_one_hot_op_xpu.py @@ -26,10 +26,9 @@ import time paddle.enable_static() - - +""" @unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") + 'core is not compiled with XPU') class TestOneHotOp(XPUOpTest): def setUp(self): self.use_xpu = True @@ -56,7 +55,7 @@ def test_check_output(self): @unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") + 'core is not compiled with XPU') class TestOneHotOp_attr(XPUOpTest): def setUp(self): self.op_type = 'one_hot' @@ -81,7 +80,7 @@ def test_check_output(self): @unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") + 'core is not compiled with XPU') class TestOneHotOp_default_dtype(XPUOpTest): def setUp(self): self.op_type = 'one_hot' @@ -107,7 +106,7 @@ def test_check_output(self): @unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") + 'core is not compiled with XPU') class TestOneHotOp_default_dtype_attr(XPUOpTest): def setUp(self): self.op_type = 'one_hot' @@ -132,7 +131,7 @@ def test_check_output(self): @unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") + 'core is not compiled with XPU') class TestOneHotOp_out_of_range(XPUOpTest): def setUp(self): self.op_type = 'one_hot' @@ -154,30 +153,30 @@ def test_check_output(self): @unittest.skipIf(not paddle.is_compiled_with_xpu(), - "core is not compiled with XPU") + 'core is not compiled with XPU') class TestOneHotOpError(unittest.TestCase): def test_errors(self): with program_guard(Program(), Program()): # the input must be Variable - in_w = np.random.random((4, 1)).astype("int32") + in_w = np.random.random((4, 1)).astype('int32') self.assertRaises(TypeError, fluid.layers.one_hot, in_w) # the input must be int32 or int 64 in_w2 = fluid.layers.data( - name="in_w2", + name='in_w2', shape=[4, 1], append_batch_size=False, - dtype="float32") + dtype='float32') self.assertRaises(TypeError, fluid.layers.one_hot, in_w2) # the depth must be int, long or Variable in_r = fluid.layers.data( - name="in_r", + name='in_r', shape=[4, 1], append_batch_size=False, - dtype="int32") + dtype='int32') depth_w = np.array([4]) self.assertRaises(TypeError, fluid.layers.one_hot, in_r, 4.1) self.assertRaises(TypeError, fluid.layers.one_hot, in_r, depth_w) - +""" if __name__ == '__main__': paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_reduce_max_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_reduce_max_op_xpu.py index 55ed5442cf1f3..6ea55f5ba9368 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_reduce_max_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_reduce_max_op_xpu.py @@ -25,8 +25,7 @@ import paddle.fluid as fluid from paddle.fluid import compiler, Program, 
program_guard from paddle.fluid.framework import convert_np_dtype_to_dtype_ - - +""" class TestXPUReduceMaxOp(XPUOpTest): def setUp(self): self.init_op_type() @@ -38,7 +37,7 @@ def setUp(self): 'keep_dim': self.keep_dim, 'reduce_all': self.reduce_all } - self.inputs = {'X': np.random.random(self.shape).astype("float32")} + self.inputs = {'X': np.random.random(self.shape).astype('float32')} if self.attrs['reduce_all']: self.outputs = {'Out': self.inputs['X'].max()} else: @@ -60,7 +59,7 @@ def test_check_grad(self): self.check_grad_with_place(place, ['X'], 'Out') def init_op_type(self): - self.op_type = "reduce_max" + self.op_type = 'reduce_max' self.use_mkldnn = False self.keep_dim = False self.reduce_all = False @@ -68,7 +67,7 @@ def init_op_type(self): def initTestCase(self): self.shape = (5, 6, 10) self.axis = (-1, ) - +""" if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_rmsprop_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_rmsprop_op_xpu.py index 8fd6b1ff4050e..a94a9d5541f61 100644 --- a/python/paddle/fluid/tests/unittests/xpu/test_rmsprop_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_rmsprop_op_xpu.py @@ -45,10 +45,9 @@ def create_selected_rows_and_tensor(scope, place, height, row_num, tensor.set(tensor_val, place) return tensor_val, sr_val ''' - - +""" class TestBase(XPUOpTest): - op_type = "rmsprop" + op_type = 'rmsprop' def setup(self, place, @@ -63,29 +62,29 @@ def setup(self, self.scope = fluid.global_scope() self.place = place - self.param_name = "param" - self.param = np.random.random(size).astype("float32") + self.param_name = 'param' + self.param = np.random.random(size).astype('float32') - self.mean_square_name = "mean_square" + self.mean_square_name = 'mean_square' self.mean_square = np.random.uniform( - low=1, high=2, size=size).astype("float32") + low=1, high=2, size=size).astype('float32') - self.mean_grad_name = "mean_grad" - self.mean_grad = np.random.random(size).astype("float32") + self.mean_grad_name = 'mean_grad' + self.mean_grad = np.random.random(size).astype('float32') - self.lr_name = "lr" - self.learning_rate = np.array([0.01]).astype("float32") + self.lr_name = 'lr' + self.learning_rate = np.array([0.01]).astype('float32') - self.grad_name = "grad" + self.grad_name = 'grad' self.is_sparse = is_sparse - self.grad = np.random.random(size).astype("float32") + self.grad = np.random.random(size).astype('float32') grad_tensor = self.scope.var(self.grad_name).get_tensor() grad_tensor.set(self.grad, place) - self.moment_name = "moment" + self.moment_name = 'moment' self.moment = np.random.uniform( - low=0, high=1, size=size).astype("float32") + low=0, high=1, size=size).astype('float32') self.epsilon = epsilon self.decay = 0.9 @@ -128,8 +127,8 @@ def check(self, actual_t, expect_t, place, out_name, atol=1e-5): self.assertTrue( np.allclose( actual_t, expect_t, atol=atol), - "Output (" + out_name + ") has diff at " + str(place) + "\nExpect " - + str(expect_t) + "\n" + "But Got" + str(actual_t)) + 'Output (' + out_name + ') has diff at ' + str(place) + '\nExpect ' + + str(expect_t) + '\n' + 'But Got' + str(actual_t)) class TestRmspropOp(TestBase): @@ -223,11 +222,11 @@ def test_rmsprop(self): class TestRMSPropV2(XPUOpTest): - op_type = "rmsprop" + op_type = 'rmsprop' def test_rmsprop_dygraph(self): paddle.disable_static() - value = np.arange(26).reshape(2, 13).astype("float32") + value = np.arange(26).reshape(2, 13).astype('float32') a = paddle.to_tensor(value) linear = paddle.nn.Linear(13, 5) 
# This can be any optimizer supported by dygraph. @@ -293,7 +292,7 @@ def test_rmsprop_op_invalid_input(self): with self.assertRaises(ValueError): adam = paddle.optimizer.RMSProp( 0.1, rho=-1, parameters=linear.parameters()) - +""" if __name__ == "__main__": paddle.enable_static() diff --git a/python/paddle/fluid/tests/unittests/xpu/test_rnn_op_xpu.py b/python/paddle/fluid/tests/unittests/xpu/test_rnn_op_xpu.py index a27d806319cb2..e0d208644e79e 100755 --- a/python/paddle/fluid/tests/unittests/xpu/test_rnn_op_xpu.py +++ b/python/paddle/fluid/tests/unittests/xpu/test_rnn_op_xpu.py @@ -1,9 +1,7 @@ -# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# +# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software @@ -14,6 +12,8 @@ from __future__ import print_function +import sys +sys.path.append("..") import unittest import numpy as np import math @@ -22,152 +22,180 @@ import paddle.fluid as fluid import paddle.fluid.layers as layers import random -import sys -sys.path.append("..") +from op_test import OpTest from op_test_xpu import XPUOpTest sys.path.append("../rnn") from rnn_numpy import SimpleRNN, LSTM, GRU from convert import get_params_for_net +from xpu.get_test_cover_info import create_test_class, get_xpu_op_support_types, XPUOpTestWrapper random.seed(2) np.set_printoptions(threshold=np.inf) paddle.enable_static() -class TestRNNOp(XPUOpTest): - def init_size(self): - self.seq_length = 1 - self.batch_size = 1 - self.input_size = 5 - self.hidden_size = 16 - - def get_weight_names(self): - weight_names = [] - for i in range(self.num_layers): - for j in range(0, 2 * self.direction_num): - weight_names.append("{}.weight_{}".format(i, j)) - for i in range(self.num_layers): - for j in range(0, 2 * self.direction_num): - weight_names.append("{}.bias_{}".format(i, j)) - return weight_names - - def setUp(self): - self.init_size() - self.op_type = "rnn" - self.dtype = np.float32 - self.sequence_length = np.ones( - (self.batch_size, ), dtype=np.int32) * self.seq_length - self.num_layers = 1 - self.is_bidirec = False - self.mode = "LSTM" - self.is_test = False - self.dropout = 0.0 - self.set_attrs() - - self.direction_num = 2 if self.is_bidirec else 1 - direction = "bidirectional" if self.is_bidirec else "forward" - - input = np.random.uniform( - low=-0.1, - high=0.1, - size=(self.seq_length, self.batch_size, - self.input_size)).astype(self.dtype) - - rnn1 = LSTM( - self.input_size, - self.hidden_size, - num_layers=self.num_layers, - time_major=True, - direction=direction, - dropout=self.dropout, - dtype="float32") - - flat_w = get_params_for_net(rnn1) - output, (last_hidden, last_cell) = rnn1( - input, sequence_length=self.sequence_length) - - init_h = np.zeros( - (self.num_layers * self.direction_num, self.batch_size, - self.hidden_size)).astype(self.dtype) - init_c = np.zeros( - (self.num_layers * self.direction_num, self.batch_size, - self.hidden_size)).astype(self.dtype) - state_out = np.ndarray((300)).astype("uint8") - - self.inputs = { - 'Input': input, - 'WeightList': flat_w, - 'PreState': [('init_h', init_h), ('init_c', init_c)], - 'SequenceLength': self.sequence_length 
- } - if self.sequence_length is None: +class XPUTestRNNOp(XPUOpTestWrapper): + def __init__(self): + self.op_name = 'rnn' + self.use_dynamic_create_class = False + + class TestRNNOp(XPUOpTest): + def setUp(self): + self.init_size() + self.init_dtype() + self.op_type = "rnn" + self.place = paddle.XPUPlace(0) + self.sequence_length = np.ones( + (self.batch_size, ), dtype=np.int32) * self.seq_length + self.set_attrs() + self.mode = "LSTM" + self.is_test = False + self.dropout = 0.0 + + self.direction_num = 2 if self.is_bidirec else 1 + direction = "bidirectional" if self.is_bidirec else "forward" + + input = np.random.uniform( + low=-0.1, + high=0.1, + size=(self.seq_length, self.batch_size, + self.input_size)).astype(self.dtype) + + rnn1 = LSTM( + self.input_size, + self.hidden_size, + num_layers=self.num_layers, + time_major=True, + direction=direction, + dropout=self.dropout, + dtype=self.dtype) + + flat_w = get_params_for_net(rnn1) + output, (last_hidden, last_cell) = rnn1( + input, sequence_length=self.sequence_length) + + init_h = np.zeros( + (self.num_layers * self.direction_num, self.batch_size, + self.hidden_size)).astype(self.dtype) + init_c = np.zeros( + (self.num_layers * self.direction_num, self.batch_size, + self.hidden_size)).astype(self.dtype) + state_out = np.ndarray((300)).astype("uint8") + self.inputs = { 'Input': input, 'WeightList': flat_w, 'PreState': [('init_h', init_h), ('init_c', init_c)], + 'SequenceLength': self.sequence_length + } + if self.sequence_length is None: + self.inputs = { + 'Input': input, + 'WeightList': flat_w, + 'PreState': [('init_h', init_h), ('init_c', init_c)], + } + self.attrs = { + 'dropout_prob': self.dropout, + 'is_bidirec': self.is_bidirec, + 'input_size': self.input_size, + 'hidden_size': self.hidden_size, + 'num_layers': self.num_layers, + 'mode': self.mode, + 'is_test': self.is_test + } + self.outputs = { + 'Out': output, + "State": + [('last_hidden', last_hidden), ('last_cell', last_cell)], + 'Reserve': np.ndarray((400)).astype("uint8"), + 'DropoutState': state_out } - self.attrs = { - 'dropout_prob': self.dropout, - 'is_bidirec': self.is_bidirec, - 'input_size': self.input_size, - 'hidden_size': self.hidden_size, - 'num_layers': self.num_layers, - 'mode': self.mode, - 'is_test': self.is_test - } - self.outputs = { - 'Out': output, - "State": [('last_hidden', last_hidden), ('last_cell', last_cell)], - 'Reserve': np.ndarray((400)).astype("uint8"), - 'DropoutState': state_out - } - - def test_output(self): - if paddle.is_compiled_with_xpu(): - place = paddle.XPUPlace(0) - self.check_output_with_place( - place, atol=0.01, no_check_set=['Reserve', 'DropoutState']) - - def set_attrs(self): - pass - - def test_grad(self): - if paddle.is_compiled_with_xpu(): - place = paddle.XPUPlace(0) - if not self.is_test: - var_name_list = self.get_weight_names() - grad_check_list = ['Input', 'init_h', 'init_c'] - grad_check_list.extend(var_name_list) - self.check_grad_with_place( - place, - set(grad_check_list), ['Out', 'last_hidden', 'last_cell'], - max_relative_error=0.1) - - -class TestRNNOpCase0(TestRNNOp): - def init_size(self): - self.seq_length = 2 - self.batch_size = 4 - self.input_size = 10 - self.hidden_size = 32 - - -class TestRNNOpCase1(TestRNNOp): - def init_size(self): - self.seq_length = 5 - self.batch_size = 16 - self.input_size = 30 - self.hidden_size = 64 - - -class TestRNNOpCase2(TestRNNOp): - def init_size(self): - self.seq_length = 10 - self.batch_size = 64 - self.input_size = 50 - self.hidden_size = 64 + def init_dtype(self): + 
self.dtype = self.in_type + + def set_xpu(self): + self.__class__.use_xpu = True + self.__class__.no_need_check_grad = True + self.__class__.op_type = self.in_type + + def test_check_output(self): + self.check_output_with_place( + self.place, atol=0.01, + no_check_set=['Reserve', 'DropoutState']) + + def init_size(self): + self.seq_length = 1 + self.batch_size = 1 + self.input_size = 5 + self.hidden_size = 16 + + def get_weight_names(self): + weight_names = [] + for i in range(self.num_layers): + for j in range(0, 2 * self.direction_num): + weight_names.append("{}.weight_{}".format(i, j)) + for i in range(self.num_layers): + for j in range(0, 2 * self.direction_num): + weight_names.append("{}.bias_{}".format(i, j)) + return weight_names + + def set_attrs(self): + self.num_layers = 1 + self.is_bidirec = False + + class TestRNNOp1(TestRNNOp): + def init_size(self): + self.seq_length = 2 + self.batch_size = 4 + self.input_size = 10 + self.hidden_size = 32 + + def set_attrs(self): + self.num_layers = 1 + self.is_bidirec = False + + class TestRNNOp2(TestRNNOp): + def init_size(self): + self.seq_length = 5 + self.batch_size = 16 + self.input_size = 30 + self.hidden_size = 64 + + def set_attrs(self): + self.num_layers = 1 + self.is_bidirec = True + + class TestRNNOp3(TestRNNOp): + def init_size(self): + self.seq_length = 10 + self.batch_size = 64 + self.input_size = 50 + self.hidden_size = 64 + + def set_attrs(self): + self.num_layers = 2 + self.is_bidirec = False + + class TestRNNOp4(TestRNNOp): + def set_attrs(self): + self.num_layers = 3 + self.is_bidirec = False + + class TestRNNOp5(TestRNNOp): + def set_attrs(self): + self.num_layers = 2 + self.is_bidirec = True + + +support_types = get_xpu_op_support_types('rnn') +for stype in support_types: + create_test_class( + globals(), + XPUTestRNNOp, + stype, + ignore_deivce_version=[core.XPUVersion.XPU1]) if __name__ == '__main__': unittest.main() diff --git a/python/paddle/fluid/variable_index.py b/python/paddle/fluid/variable_index.py index e6990e25a08af..257ddc96d9c87 100644 --- a/python/paddle/fluid/variable_index.py +++ b/python/paddle/fluid/variable_index.py @@ -279,6 +279,37 @@ def deal_attrs(attrs, attr, attr_name, tensor_attr_name, inputs, infer_flags): attrs[attr_name] = attr +# the item is a tensor of bool +def get_value_for_bool_tensor(var, item): + if len(item.shape) > len(var.shape): + raise IndexError("The dims of bool index doesn't match indexed array, " + "the dims of bool index except to be equal or less " + "than {}, but received {}.".format( + len(var.shape), len(item.shape))) + for i, dim_len in enumerate(item.shape): + if dim_len != var.shape[i]: + raise IndexError( + "The dimension of bool index doesn't match indexed array along "\ + "dimension {}, the target dimension is {}, but received {}.". + format(i, var.shape[i], dim_len)) + + def idx_not_empty(var, item): + from .layers.nn import where + from ..tensor import gather_nd + + bool_2_idx = where(item == True) + return gather_nd(var, bool_2_idx) + + def idx_empty(var): + var_shape = list(var.shape) + var_shape[0] = 0 + return paddle.empty(var_shape, dtype=var.dtype) + + from .layers.control_flow import cond + return cond(item.any(), lambda: idx_not_empty(var, item), + lambda: idx_empty(var)) + + def _getitem_impl_(var, item): """ Slice the variable. 
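
The `get_value_for_bool_tensor` helper added above implements boolean-mask indexing as "find the True positions, then gather those rows" (falling back to an empty tensor when the mask is all False). A minimal dygraph sketch of the same construction, using the public `paddle.nonzero` / `paddle.gather_nd` counterparts of the internal `where` / `gather_nd` calls; illustrative only, not part of this patch:

    import paddle

    x = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    mask = paddle.to_tensor([True, False, True])

    # Same idea as the helper: turn the boolean mask into integer indices,
    # then gather the selected rows.
    idx = paddle.nonzero(mask)        # [[0], [2]]
    out = paddle.gather_nd(x, idx)    # [[1., 2.], [5., 6.]]
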
@@ -393,24 +424,10 @@ def _getitem_impl_(var, item): elif isinstance(slice_item, (Variable, core.eager.Tensor)): if len(item) == 1: - from ..tensor import index_select, gather_nd - from .layers.nn import where + from ..tensor import index_select if slice_item.dtype == paddle.bool: - if len(slice_item.shape) > len(var.shape): - raise IndexError( - "The dims of bool index doesn't match indexed array, " - "the dims of bool index except to be equal or less " - "than {}, but received {}.".format( - len(var.shape), len(slice_item.shape))) - for i, dim_len in enumerate(slice_item.shape): - if dim_len != var.shape[i]: - raise IndexError( - "The dimension of bool index doesn't match indexed array along "\ - "dimension {}, the target dimension is {}, but received {}.". - format(i, var.shape[i], dim_len)) - bool_2_idx = where(slice_item == True) - return gather_nd(var, bool_2_idx) + return get_value_for_bool_tensor(var, slice_item) else: if len(slice_item.shape) == 1: return index_select(var, index=slice_item, axis=0) diff --git a/python/paddle/framework/__init__.py b/python/paddle/framework/__init__.py index f4a4052ee5e15..ffd1607fe87b4 100644 --- a/python/paddle/framework/__init__.py +++ b/python/paddle/framework/__init__.py @@ -47,9 +47,13 @@ from ..fluid.dygraph.base import enable_dygraph as disable_static # noqa: F401 from ..fluid.dygraph.base import disable_dygraph as enable_static # noqa: F401 from ..fluid.framework import _non_static_mode as in_dynamic_mode # noqa: F401 +from ..fluid.framework import _non_static_mode # noqa: F401; temporary used for hackson from ..fluid.framework import _current_expected_place, _get_paddle_place # noqa: F401 from ..fluid.framework import dygraph_only # noqa: F401 from ..fluid.framework import convert_np_dtype_to_dtype_, _varbase_creator, OpProtoHolder # noqa: F401 from ..fluid.framework import _dygraph_tracer # noqa: F401 +from ..fluid.layer_helper import LayerHelper # noqa: F401 +from ..fluid.framework import in_dygraph_mode # noqa: F401 + __all__ = [] diff --git a/python/paddle/incubate/__init__.py b/python/paddle/incubate/__init__.py index 83dad710bad7d..d8cc322a66e27 100644 --- a/python/paddle/incubate/__init__.py +++ b/python/paddle/incubate/__init__.py @@ -21,11 +21,14 @@ from .operators import softmax_mask_fuse # noqa: F401 from .operators import graph_send_recv from .operators import graph_khop_sampler +from .operators import graph_sample_neighbors +from .operators import graph_reindex from .tensor import segment_sum from .tensor import segment_mean from .tensor import segment_max from .tensor import segment_min from .passes import fuse_resnet_unit_pass +import paddle.incubate.autograd from . import nn #noqa: F401 @@ -36,6 +39,8 @@ 'softmax_mask_fuse', 'graph_send_recv', 'graph_khop_sampler', + 'graph_sample_neighbors', + 'graph_reindex', 'segment_sum', 'segment_mean', 'segment_max', diff --git a/python/paddle/incubate/autograd/__init__.py b/python/paddle/incubate/autograd/__init__.py new file mode 100644 index 0000000000000..5528bb4d06c6f --- /dev/null +++ b/python/paddle/incubate/autograd/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from paddle.autograd.functional import Hessian, Jacobian, jvp, vjp + +__all__ = [ # noqa + 'vjp', 'jvp', 'Jacobian', 'Hessian' +] diff --git a/python/paddle/incubate/operators/__init__.py b/python/paddle/incubate/operators/__init__.py index 073c3afcbcbfc..bc4ba8c3890fd 100644 --- a/python/paddle/incubate/operators/__init__.py +++ b/python/paddle/incubate/operators/__init__.py @@ -17,3 +17,5 @@ from .resnet_unit import ResNetUnit #noqa: F401 from .graph_send_recv import graph_send_recv #noqa: F401 from .graph_khop_sampler import graph_khop_sampler #noqa: F401 +from .graph_sample_neighbors import graph_sample_neighbors #noqa: F401 +from .graph_reindex import graph_reindex #noqa: F401 diff --git a/python/paddle/incubate/operators/graph_reindex.py b/python/paddle/incubate/operators/graph_reindex.py new file mode 100644 index 0000000000000..328b87a699750 --- /dev/null +++ b/python/paddle/incubate/operators/graph_reindex.py @@ -0,0 +1,127 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.framework import _non_static_mode +from paddle.fluid.data_feeder import check_variable_and_dtype +from paddle.fluid import core +from paddle import _C_ops + + +def graph_reindex(x, + neighbors, + count, + value_buffer=None, + index_buffer=None, + flag_buffer_hashtable=False, + name=None): + """ + Graph Reindex API. + + This API is mainly used in Graph Learning domain, which should be used + in conjunction with `graph_sample_neighbors` API. And the main purpose + is to reindex the ids information of the input nodes, and return the + corresponding graph edges after reindex. + + Take input nodes x = [0, 1, 2] as an example. + If we have neighbors = [8, 9, 0, 4, 7, 6, 7], and count = [2, 3, 2], + then we know that the neighbors of 0 is [8, 9], the neighbors of 1 + is [0, 4, 7], and the neighbors of 2 is [6, 7]. + + Args: + x (Tensor): The input nodes which we sample neighbors for. The available + data type is int32, int64. + neighbors (Tensor): The neighbors of the input nodes `x`. The data type + should be the same with `x`. + count (Tensor): The neighbor count of the input nodes `x`. And the + data type should be int32. + value_buffer (Tensor|None): Value buffer for hashtable. The data type should + be int32, and should be filled with -1. + index_buffer (Tensor|None): Index buffer for hashtable. The data type should + be int32, and should be filled with -1. + flag_buffer_hashtable (bool): Whether to use buffer for hashtable to speed up. + Default is False. 
Only useful for gpu version currently. + name (str, optional): Name for the operation (optional, default is None). + For more information, please refer to :ref:`api_guide_Name`. + + Returns: + reindex_src (Tensor): The source node index of graph edges after reindex. + reindex_dst (Tensor): The destination node index of graph edges after reindex. + out_nodes (Tensor): The index of unique input nodes and neighbors before reindex, + where we put the input nodes `x` in the front, and put neighbor + nodes in the back. + + Examples: + + .. code-block:: python + + import paddle + + x = [0, 1, 2] + neighbors = [8, 9, 0, 4, 7, 6, 7] + count = [2, 3, 2] + x = paddle.to_tensor(x, dtype="int64") + neighbors = paddle.to_tensor(neighbors, dtype="int64") + count = paddle.to_tensor(count, dtype="int32") + + reindex_src, reindex_dst, out_nodes = \ + paddle.incubate.graph_reindex(x, neighbors, count) + # reindex_src: [3, 4, 0, 5, 6, 7, 6] + # reindex_dst: [0, 0, 1, 1, 1, 2, 2] + # out_nodes: [0, 1, 2, 8, 9, 4, 7, 6] + + """ + if flag_buffer_hashtable: + if value_buffer is None or index_buffer is None: + raise ValueError(f"`value_buffer` and `index_buffer` should not" + "be None if `flag_buffer_hashtable` is True.") + + if _non_static_mode(): + reindex_src, reindex_dst, out_nodes = \ + _C_ops.graph_reindex(x, neighbors, count, value_buffer, index_buffer, + "flag_buffer_hashtable", flag_buffer_hashtable) + return reindex_src, reindex_dst, out_nodes + + check_variable_and_dtype(x, "X", ("int32", "int64"), "graph_reindex") + check_variable_and_dtype(neighbors, "Neighbors", ("int32", "int64"), + "graph_reindex") + check_variable_and_dtype(count, "Count", ("int32"), "graph_reindex") + + if flag_buffer_hashtable: + check_variable_and_dtype(value_buffer, "HashTable_Value", ("int32"), + "graph_reindex") + check_variable_and_dtype(index_buffer, "HashTable_Value", ("int32"), + "graph_reindex") + + helper = LayerHelper("graph_reindex", **locals()) + reindex_src = helper.create_variable_for_type_inference(dtype=x.dtype) + reindex_dst = helper.create_variable_for_type_inference(dtype=x.dtype) + out_nodes = helper.create_variable_for_type_inference(dtype=x.dtype) + helper.append_op( + type="graph_reindex", + inputs={ + "X": x, + "Neighbors": neighbors, + "Count": count, + "HashTable_Value": value_buffer if flag_buffer_hashtable else None, + "HashTable_Index": index_buffer if flag_buffer_hashtable else None, + }, + outputs={ + "Reindex_Src": reindex_src, + "Reindex_Dst": reindex_dst, + "Out_Nodes": out_nodes + }, + attrs={"flag_buffer_hashtable": flag_buffer_hashtable}) + return reindex_src, reindex_dst, out_nodes diff --git a/python/paddle/incubate/operators/graph_sample_neighbors.py b/python/paddle/incubate/operators/graph_sample_neighbors.py new file mode 100644 index 0000000000000..d5a85af7272e7 --- /dev/null +++ b/python/paddle/incubate/operators/graph_sample_neighbors.py @@ -0,0 +1,150 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
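
A compact dygraph replay of the `graph_reindex` example documented above; illustrative only and not part of the diff, with the commented outputs being those stated in the docstring:

    import paddle

    # node 0 -> [8, 9], node 1 -> [0, 4, 7], node 2 -> [6, 7]
    x = paddle.to_tensor([0, 1, 2], dtype="int64")
    neighbors = paddle.to_tensor([8, 9, 0, 4, 7, 6, 7], dtype="int64")
    count = paddle.to_tensor([2, 3, 2], dtype="int32")

    reindex_src, reindex_dst, out_nodes = paddle.incubate.graph_reindex(
        x, neighbors, count)
    # reindex_src: [3, 4, 0, 5, 6, 7, 6]
    # reindex_dst: [0, 0, 1, 1, 1, 2, 2]
    # out_nodes:   [0, 1, 2, 8, 9, 4, 7, 6]
    # With flag_buffer_hashtable=True (GPU only), also pass int32
    # value_buffer / index_buffer tensors pre-filled with -1.
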
+ +import paddle +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.framework import _non_static_mode +from paddle.fluid.data_feeder import check_variable_and_dtype +from paddle.fluid import core +from paddle import _C_ops + + +def graph_sample_neighbors(row, + colptr, + input_nodes, + eids=None, + perm_buffer=None, + sample_size=-1, + return_eids=False, + flag_perm_buffer=False, + name=None): + """ + Graph Sample Neighbors API. + + This API is mainly used in Graph Learning domain, and the main purpose is to + provide high performance of graph sampling method. For example, we get the + CSC(Compressed Sparse Column) format of the input graph edges as `row` and + `colptr`, so as to convert graph data into a suitable format for sampling. + `input_nodes` means the nodes we need to sample neighbors, and `sample_sizes` + means the number of neighbors and number of layers we want to sample. + + Besides, we support fisher-yates sampling in GPU version. + + Args: + row (Tensor): One of the components of the CSC format of the input graph, and + the shape should be [num_edges, 1] or [num_edges]. The available + data type is int32, int64. + colptr (Tensor): One of the components of the CSC format of the input graph, + and the shape should be [num_nodes + 1, 1] or [num_nodes + 1]. + The data type should be the same with `row`. + input_nodes (Tensor): The input nodes we need to sample neighbors for, and the + data type should be the same with `row`. + eids (Tensor): The eid information of the input graph. If return_eids is True, + then `eids` should not be None. The data type should be the + same with `row`. Default is None. + perm_buffer (Tensor): Permutation buffer for fisher-yates sampling. If `flag_perm_buffer` + is True, then `perm_buffer` should not be None. The data type should + be the same with `row`. Default is None. + sample_size (int): The number of neighbors we need to sample. Default value is + -1, which means returning all the neighbors of the input nodes. + return_eids (bool): Whether to return eid information of sample edges. Default is False. + flag_perm_buffer (bool): Using the permutation for fisher-yates sampling in GPU. Default + value is false, means not using it. + name (str, optional): Name for the operation (optional, default is None). + For more information, please refer to :ref:`api_guide_Name`. + + Returns: + out_neighbors (Tensor): The sample neighbors of the input nodes. + out_count (Tensor): The number of sampling neighbors of each input node, and the shape + should be the same with `input_nodes`. + out_eids (Tensor): If `return_eids` is True, we will return the eid information of the + sample edges. + + Examples: + .. 
code-block:: python + import paddle + # edges: (3, 0), (7, 0), (0, 1), (9, 1), (1, 2), (4, 3), (2, 4), + # (9, 5), (3, 5), (9, 6), (1, 6), (9, 8), (7, 8) + row = [3, 7, 0, 9, 1, 4, 2, 9, 3, 9, 1, 9, 7] + colptr = [0, 2, 4, 5, 6, 7, 9, 11, 11, 13, 13] + nodes = [0, 8, 1, 2] + sample_size = 2 + row = paddle.to_tensor(row, dtype="int64") + colptr = paddle.to_tensor(colptr, dtype="int64") + nodes = paddle.to_tensor(nodes, dtype="int64") + out_neighbors, out_count = \ + paddle.incubate.graph_sample_neighbors(row, colptr, nodes, + sample_size=sample_size) + + """ + + if return_eids: + if eids is None: + raise ValueError( + f"`eids` should not be None if `return_eids` is True.") + + if flag_perm_buffer: + if perm_buffer is None: + raise ValueError( + f"`perm_buffer` should not be None if `flag_perm_buffer`" + "is True.") + + if _non_static_mode(): + out_neighbors, out_count, out_eids = _C_ops.graph_sample_neighbors( + row, colptr, input_nodes, eids, perm_buffer, "sample_size", + sample_size, "return_eids", return_eids, "flag_perm_buffer", + flag_perm_buffer) + if return_eids: + return out_neighbors, out_count, out_eids + return out_neighbors, out_count + + check_variable_and_dtype(row, "Row", ("int32", "int64"), + "graph_sample_neighbors") + check_variable_and_dtype(colptr, "Col_Ptr", ("int32", "int64"), + "graph_sample_neighbors") + check_variable_and_dtype(input_nodes, "X", ("int32", "int64"), + "graph_sample_neighbors") + if return_eids: + check_variable_and_dtype(eids, "Eids", ("int32", "int64"), + "graph_sample_neighbors") + if flag_perm_buffer: + check_variable_and_dtype(perm_buffer, "Perm_Buffer", ("int32", "int64"), + "graph_sample_neighbors") + + helper = LayerHelper("graph_sample_neighbors", **locals()) + out_neighbors = helper.create_variable_for_type_inference(dtype=row.dtype) + out_count = helper.create_variable_for_type_inference(dtype=row.dtype) + out_eids = helper.create_variable_for_type_inference(dtype=row.dtype) + helper.append_op( + type="graph_sample_neighbors", + inputs={ + "Row": row, + "Col_Ptr": colptr, + "X": input_nodes, + "Eids": eids if return_eids else None, + "Perm_Buffer": perm_buffer if flag_perm_buffer else None + }, + outputs={ + "Out": out_neighbors, + "Out_Count": out_count, + "Out_Eids": out_eids + }, + attrs={ + "sample_size": sample_size, + "return_eids": return_eids, + "flag_perm_buffer": flag_perm_buffer + }) + if return_eids: + return out_neighbors, out_count, out_eids + return out_neighbors, out_count diff --git a/python/paddle/incubate/operators/graph_send_recv.py b/python/paddle/incubate/operators/graph_send_recv.py index 3972180d86564..80a21aec6cf5a 100644 --- a/python/paddle/incubate/operators/graph_send_recv.py +++ b/python/paddle/incubate/operators/graph_send_recv.py @@ -13,7 +13,7 @@ # limitations under the License. from paddle.fluid.layer_helper import LayerHelper -from paddle.fluid.framework import _non_static_mode +from paddle.fluid.framework import _non_static_mode, _in_legacy_dygraph, in_dygraph_mode from paddle.fluid.data_feeder import check_variable_and_dtype from paddle.fluid import core from paddle import _C_ops @@ -109,15 +109,30 @@ def graph_send_recv(x, # TODO(daisiming): Should we add judgement for out_size: max(dst_index) + 1. 
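
Since `graph_reindex` is documented as the companion of `graph_sample_neighbors`, a short end-to-end sketch of the two together may be useful. It reuses the CSC example from the sampler's docstring; it is illustrative only and not part of the patch, sampling is random so concrete outputs are not shown, and the count tensor is cast to int32 on the assumption that `graph_reindex` expects that dtype (as its docstring states):

    import paddle

    # CSC graph from the graph_sample_neighbors docstring.
    row = paddle.to_tensor([3, 7, 0, 9, 1, 4, 2, 9, 3, 9, 1, 9, 7], dtype="int64")
    colptr = paddle.to_tensor([0, 2, 4, 5, 6, 7, 9, 11, 11, 13, 13], dtype="int64")
    nodes = paddle.to_tensor([0, 8, 1, 2], dtype="int64")

    # Sample up to two neighbors per input node, then relabel the sampled
    # sub-graph so that edge endpoints use local ids.
    out_neighbors, out_count = paddle.incubate.graph_sample_neighbors(
        row, colptr, nodes, sample_size=2)
    reindex_src, reindex_dst, out_nodes = paddle.incubate.graph_reindex(
        nodes, out_neighbors, paddle.cast(out_count, "int32"))
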
- if _non_static_mode(): - if out_size is None or out_size <= 0: + if out_size is None or out_size <= 0: + if _in_legacy_dygraph(): out, tmp = _C_ops.graph_send_recv(x, src_index, dst_index, 'pool_type', pool_type.upper()) - else: + return out + if in_dygraph_mode(): + return _C_ops.final_state_graph_send_recv(x, src_index, dst_index, + pool_type.upper(), 0) + else: + if _in_legacy_dygraph(): out, tmp = _C_ops.graph_send_recv( x, src_index, dst_index, 'pool_type', pool_type.upper(), 'out_size', out_size) - return out + return out + if in_dygraph_mode(): + if isinstance(out_size, core.eager.Tensor): + if (out_size.size < 1): + raise ValueError( + "out_size should be long type, but received Tensor type." + ) + out_size = out_size.numpy()[0] + return _C_ops.final_state_graph_send_recv(x, src_index, dst_index, + pool_type.upper(), + out_size) check_variable_and_dtype(x, "X", ("float32", "float64", "int32", "int64"), "graph_send_recv") diff --git a/python/paddle/incubate/optimizer/distributed_fused_lamb.py b/python/paddle/incubate/optimizer/distributed_fused_lamb.py index 00a39dfba0f18..74b5398230dee 100644 --- a/python/paddle/incubate/optimizer/distributed_fused_lamb.py +++ b/python/paddle/incubate/optimizer/distributed_fused_lamb.py @@ -17,7 +17,7 @@ from paddle.fluid.clip import ClipGradByGlobalNorm from paddle.fluid.initializer import Constant from paddle.fluid.layer_helper import LayerHelper -from paddle.optimizer import Optimizer +from paddle.fluid.optimizer import Optimizer from paddle.distributed import get_rank, get_world_size from paddle.fluid.executor import global_scope from paddle.fluid.framework import name_scope @@ -42,11 +42,7 @@ def __init__(self, assert not framework._non_static_mode( ), "DistributedFusedLamb does not support dygraph mode" super(DistributedFusedLamb, self).__init__( - learning_rate=learning_rate, - parameters=parameters, - weight_decay=None, - grad_clip=None, - name=name) + learning_rate=learning_rate, grad_clip=None, name=name) self._beta1 = beta1 self._beta2 = beta2 @@ -75,9 +71,18 @@ def __init__(self, name=unique_name.generate('found_inf'), shape=[1], dtype=core.VarDesc.VarType.BOOL) + self._step = None self._param_to_master_param = {} + def _set_step(self, step): + self._step = step + + def _get_or_create_step(self): + if self._step is None: + self._step = self._create_persistable_var('step', dtype='int64') + return self._step + def _set_scale(self, scale): assert scale is not None if not isinstance(scale, Variable): @@ -189,6 +194,8 @@ def _apply_gradients_impl(self, params_grads): param_order = self._create_persistable_var('param_order', dtype='int32') param_order.is_distributed = True + step = self._get_or_create_step() + rank = get_rank() nranks = get_world_size() scale = self._get_or_create_scale() @@ -234,6 +241,7 @@ def _apply_gradients_impl(self, params_grads): 'FP16ShardFusedParamOffsets': [fp16_partial_fused_offsets], 'FusedParamOffsets': [fused_offsets], 'ParamOrder': [param_order], + 'Step': [step], }, attrs={ 'alignment': self._alignment, @@ -290,6 +298,7 @@ def _apply_gradients_impl(self, params_grads): 'ParamOut': params, 'GradOut': grads, 'FoundInf': [self._found_inf], + 'Step': [step], }, attrs={ 'weight_decay': self._weight_decay, diff --git a/python/paddle/incubate/optimizer/functional/bfgs.py b/python/paddle/incubate/optimizer/functional/bfgs.py index 9afcc2240aeb5..abdab457fda00 100644 --- a/python/paddle/incubate/optimizer/functional/bfgs.py +++ b/python/paddle/incubate/optimizer/functional/bfgs.py @@ -100,7 +100,7 @@ def func(x): 
return paddle.dot(x, x) x0 = paddle.to_tensor([1.3, 2.7]) - results = paddle.optimizer.functional.minimize_bfgs(func, x0) + results = paddle.incubate.optimizer.functional.minimize_bfgs(func, x0) print("is_converge: ", results[0]) print("the minimum of func is: ", results[2]) # is_converge: is_converge: Tensor(shape=[1], dtype=bool, place=Place(gpu:0), stop_gradient=True, @@ -126,7 +126,8 @@ def func(x): check_initial_inverse_hessian_estimate(initial_inverse_hessian_estimate) Hk = paddle.assign(initial_inverse_hessian_estimate) - xk = initial_position + # use detach and assign to create new tensor rather than =, or xk will share memory and grad with initial_position + xk = paddle.assign(initial_position.detach()) value, g1 = _value_and_gradient(objective_func, xk) num_func_calls = paddle.full(shape=[1], fill_value=1, dtype='int64') diff --git a/python/paddle/incubate/optimizer/functional/lbfgs.py b/python/paddle/incubate/optimizer/functional/lbfgs.py index 90ae452653a5c..d4bf511f85a99 100644 --- a/python/paddle/incubate/optimizer/functional/lbfgs.py +++ b/python/paddle/incubate/optimizer/functional/lbfgs.py @@ -89,7 +89,7 @@ def func(x): return paddle.dot(x, x) x0 = paddle.to_tensor([1.3, 2.7]) - results = paddle.optimizer.functional.minimize_lbfgs(func, x0) + results = paddle.incubate.optimizer.functional.minimize_lbfgs(func, x0) print("is_converge: ", results[0]) print("the minimum of func is: ", results[2]) # is_converge: is_converge: Tensor(shape=[1], dtype=bool, place=Place(gpu:0), stop_gradient=True, @@ -113,7 +113,8 @@ def func(x): check_initial_inverse_hessian_estimate(initial_inverse_hessian_estimate) H0 = initial_inverse_hessian_estimate - xk = initial_position + # use detach and assign to create new tensor rather than =, or xk will share memory and grad with initial_position + xk = paddle.assign(initial_position.detach()) value, g1 = _value_and_gradient(objective_func, xk) k = paddle.full(shape=[1], fill_value=0, dtype='int64') diff --git a/python/paddle/incubate/optimizer/functional/utils.py b/python/paddle/incubate/optimizer/functional/utils.py index c197f8a1acb5e..3000c82a71e87 100644 --- a/python/paddle/incubate/optimizer/functional/utils.py +++ b/python/paddle/incubate/optimizer/functional/utils.py @@ -13,7 +13,6 @@ # limitations under the License. import paddle -from paddle.autograd.functional import vjp, Jacobian from paddle.fluid.framework import Variable from paddle.fluid.data_feeder import check_type, check_dtype @@ -86,11 +85,14 @@ def _value_and_gradient(f, x, v=None): value: a tensor that holds the function value. gradient: a tensor that holds the function gradients. """ + # use detach to cut off relation between x and original graph + x = x.detach() + x.stop_gradient = False + value = f(x) if paddle.in_dynamic_mode(): - value, gradient = vjp(f, x, v=v) - gradient = gradient[0] + # only need to compute first order derivative, and some op dont support high order derivative. 
+ gradient = paddle.grad([value], [x], create_graph=False)[0] else: - JJ = Jacobian(f, x) - gradient = JJ[:][0] - value = f(x) - return value, gradient + gradient = paddle.static.gradients([value], [x])[0] + # use detach to make results real number without grad to avoid assign error + return value.detach(), gradient.detach() diff --git a/python/paddle/incubate/tensor/math.py b/python/paddle/incubate/tensor/math.py index 3e5a7e0ff58ae..07dc7c1581fc4 100644 --- a/python/paddle/incubate/tensor/math.py +++ b/python/paddle/incubate/tensor/math.py @@ -15,6 +15,7 @@ from paddle.fluid.layer_helper import LayerHelper, _non_static_mode from paddle.fluid.data_feeder import check_variable_and_dtype from paddle import _C_ops +from paddle.fluid.framework import _in_legacy_dygraph, in_dygraph_mode __all__ = [] @@ -50,7 +51,9 @@ def segment_sum(data, segment_ids, name=None): #Outputs: [[4., 4., 4.], [4., 5., 6.]] """ - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_segment_pool(data, segment_ids, "SUM")[0] + if _in_legacy_dygraph(): out, tmp = _C_ops.segment_pool(data, segment_ids, 'pooltype', "SUM") return out @@ -104,6 +107,9 @@ def segment_mean(data, segment_ids, name=None): #Outputs: [[2., 2., 2.], [4., 5., 6.]] """ + + if in_dygraph_mode(): + return _C_ops.final_state_segment_pool(data, segment_ids, "MEAN")[0] if _non_static_mode(): out, tmp = _C_ops.segment_pool(data, segment_ids, 'pooltype', "MEAN") return out @@ -157,6 +163,10 @@ def segment_min(data, segment_ids, name=None): #Outputs: [[1., 2., 1.], [4., 5., 6.]] """ + + if in_dygraph_mode(): + return _C_ops.final_state_segment_pool(data, segment_ids, "MIN")[0] + if _non_static_mode(): out, tmp = _C_ops.segment_pool(data, segment_ids, 'pooltype', "MIN") return out @@ -210,6 +220,11 @@ def segment_max(data, segment_ids, name=None): #Outputs: [[3., 2., 3.], [4., 5., 6.]] """ + + if in_dygraph_mode(): + out, tmp = _C_ops.final_state_segment_pool(data, segment_ids, "MAX") + return out + if _non_static_mode(): out, tmp = _C_ops.segment_pool(data, segment_ids, 'pooltype', "MAX") return out diff --git a/python/paddle/metric/metrics.py b/python/paddle/metric/metrics.py index 84b2fcff94ba6..118004088da16 100644 --- a/python/paddle/metric/metrics.py +++ b/python/paddle/metric/metrics.py @@ -22,7 +22,7 @@ from ..fluid.data_feeder import check_variable_and_dtype from ..fluid.layer_helper import LayerHelper -from ..fluid.framework import core, _varbase_creator, _non_static_mode +from ..fluid.framework import core, _varbase_creator, _non_static_mode, _in_legacy_dygraph import paddle from paddle import _C_ops @@ -800,6 +800,7 @@ def accuracy(input, label, k=1, correct=None, total=None, name=None): topk_out, topk_indices = paddle.topk(input, k=k) _acc, _, _ = _C_ops.accuracy(topk_out, topk_indices, label, correct, total) + return _acc helper = LayerHelper("accuracy", **locals()) diff --git a/python/paddle/nn/functional/activation.py b/python/paddle/nn/functional/activation.py index 794e265930247..90283b632ef2b 100644 --- a/python/paddle/nn/functional/activation.py +++ b/python/paddle/nn/functional/activation.py @@ -23,10 +23,12 @@ import warnings from ...fluid.layer_helper import LayerHelper from ...fluid.framework import convert_np_dtype_to_dtype_ +from ...fluid.framework import _in_legacy_dygraph, in_dygraph_mode, _non_static_mode from ...fluid.data_feeder import check_variable_and_dtype, check_dtype import paddle from paddle import _C_ops, in_dynamic_mode from paddle.framework import core +from paddle.fluid.framework import 
_in_legacy_dygraph, in_dygraph_mode __all__ = [] @@ -174,7 +176,10 @@ def gelu(x, approximate=False, name=None): # [ 0.84119201, 1.39957154]] """ - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_gelu(x, approximate) + + if _in_legacy_dygraph(): return _C_ops.gelu(x, 'approximate', approximate) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'gelu') @@ -382,8 +387,10 @@ def hardswish(x, name=None): out = F.hardswish(x) # [0., 5., 0.666667] """ - if in_dynamic_mode(): + if _in_legacy_dygraph(): return _C_ops.hard_swish(x) + if in_dygraph_mode(): + return _C_ops.final_state_hard_swish(x, 6, 6, 3) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'hardswish') @@ -518,7 +525,9 @@ def prelu(x, weight, data_format="NCHW", name=None): 1], "The weight size should be equal to x input channel in prelu() when weight shape is not [1]." mode = 'channel' - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_prelu(x, weight, data_format, mode) + if _in_legacy_dygraph(): return _C_ops.prelu(x, weight, 'mode', mode, 'data_format', data_format) helper = LayerHelper('prelu', **locals()) @@ -560,9 +569,10 @@ def relu(x, name=None): out = F.relu(x) # [0., 0., 1.] """ - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_relu(x) + if _in_legacy_dygraph(): return _C_ops.relu(x) - check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'relu') helper = LayerHelper('relu', **locals()) out = helper.create_variable_for_type_inference(x.dtype) @@ -576,9 +586,10 @@ def relu_(x, name=None): Inplace version of ``relu`` API, the output Tensor will be inplaced with input ``x``. Please refer to :ref:`api_nn_cn_relu`. """ - if paddle.fluid.framework._in_eager_mode_: + if in_dygraph_mode(): return _C_ops.final_state_relu_(x) - return _C_ops.relu_(x) + if _in_legacy_dygraph(): + return _C_ops.relu_(x) def log_sigmoid(x, name=None): @@ -782,7 +793,9 @@ def selu(x, raise ValueError( "The alpha must be no less than zero. Received: {}.".format(alpha)) - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_selu(x, scale, alpha) + if _in_legacy_dygraph(): return _C_ops.selu(x, 'scale', scale, 'alpha', alpha) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'selu') @@ -953,7 +966,12 @@ def softmax(x, axis=-1, dtype=None, name=None): dtype = convert_np_dtype_to_dtype_(dtype) use_cudnn = True - if in_dynamic_mode(): + if in_dygraph_mode(): + outs_cast = x if dtype is None \ + else _C_ops.cast(x, 'in_dtype', x.dtype, 'out_dtype', dtype) + return _C_ops.final_state_softmax(outs_cast, axis) + + if _in_legacy_dygraph(): outs_cast = x if dtype is None \ else _C_ops.cast(x, 'in_dtype', x.dtype, 'out_dtype', dtype) return _C_ops.softmax(outs_cast, 'axis', axis, 'use_cudnn', use_cudnn) @@ -1083,7 +1101,9 @@ def softshrink(x, threshold=0.5, name=None): "The threshold must be no less than zero. 
Received: {}.".format( threshold)) - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_soft_shrink(x, threshold) + if _in_legacy_dygraph(): return _C_ops.softshrink(x, 'lambda', threshold) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], @@ -1161,8 +1181,9 @@ def swish(x, name=None): x = paddle.to_tensor(np.array([-2., 0., 1.])) out = F.swish(x) # [-0.238406, 0., 0.731059] """ - - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_swish(x, 1.0) + if _in_legacy_dygraph(): return _C_ops.swish(x, 'beta', 1.0) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'swish') @@ -1206,7 +1227,9 @@ def mish(x, name=None): x = paddle.to_tensor([-5., 0., 5.]) out = F.mish(x) # [-0.03357624, 0., 4.99955208] """ - if in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_mish(x, 20) + if _in_legacy_dygraph(): return _C_ops.mish(x) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'mish') @@ -1362,10 +1385,12 @@ def log_softmax(x, axis=-1, dtype=None, name=None): if (dtype is not None) and (not isinstance(dtype, core.VarDesc.VarType)): dtype = convert_np_dtype_to_dtype_(dtype) - if in_dynamic_mode(): + if _non_static_mode(): if dtype is not None: x = _C_ops.cast(x, 'in_dtype', x.dtype, 'out_dtype', dtype) - return _C_ops.log_softmax(x, 'axis', axis) + if _in_legacy_dygraph(): + return _C_ops.log_softmax(x, 'axis', axis) + return _C_ops.final_state_log_softmax(x, axis) if dtype is None: check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], @@ -1500,6 +1525,9 @@ def gumbel_softmax(x, temperature=1.0, hard=False, axis=-1, name=None): # [0.00000000, 0.00000000, 0.00000000, 0.00001258, 0.99998736, 0.00000000]] """ + if in_dygraph_mode(): + return _C_ops.final_state_gumbel_softmax(x, temperature, hard, axis) + if in_dynamic_mode(): return _C_ops.gumbel_softmax(x, 'temperature', temperature, 'hard', hard, 'axis', axis) diff --git a/python/paddle/nn/functional/common.py b/python/paddle/nn/functional/common.py index e757fbf53487e..5cbd66b7832d8 100644 --- a/python/paddle/nn/functional/common.py +++ b/python/paddle/nn/functional/common.py @@ -28,7 +28,7 @@ from ...tensor import sum from ...tensor import sqrt from ...fluid.data_feeder import check_variable_and_dtype, check_dtype -from ...fluid.framework import _varbase_creator +from ...fluid.framework import _varbase_creator, _in_legacy_dygraph, in_dygraph_mode, _non_static_mode from ...fluid import dygraph_utils from ...fluid import layers @@ -38,6 +38,7 @@ from paddle.framework import in_dynamic_mode from paddle.tensor.creation import full from paddle.framework import core +from paddle.fluid.framework import _in_legacy_dygraph from paddle.static import default_main_program __all__ = [] @@ -894,9 +895,15 @@ def dropout(x, seed = None mode = 'downgrade_in_infer' if mode == 'downscale_in_infer' else mode #semantic transfer - if in_dynamic_mode(): + if _non_static_mode(): if default_main_program().random_seed != 0: seed = default_main_program().random_seed + + if in_dygraph_mode(): + out, mask = _C_ops.final_state_dropout( x, None, p, not training, mode, \ + seed if seed is not None else 0, seed is not None) + + return out out, mask = _C_ops.dropout( x, 'dropout_prob', p, 'is_test', not training, 'fix_seed', seed is not None, 'seed', seed @@ -1352,8 +1359,11 @@ def pad(x, pad, mode='constant', value=0, data_format="NCHW", name=None): if in_dynamic_mode(): if isinstance(pad, Variable): pad = pad.numpy() - out = _C_ops.pad3d(x, "paddings", 
pad, "mode", mode, "value", value, - "data_format", data_format, "name", name) + if _in_legacy_dygraph(): + out = _C_ops.pad3d(x, "paddings", pad, "mode", mode, "value", value, + "data_format", data_format, "name", name) + else: + out = _C_ops.final_state_pad3d(x, pad, mode, value, data_format) else: attrs = {'mode': mode, 'value': value, 'data_format': data_format} inputs = {'X': [x]} @@ -1613,10 +1623,14 @@ def label_smooth(label, prior_dist=None, epsilon=0.1, name=None): #[[[0.03333334 0.93333334 0.03333334] # [0.93333334 0.03333334 0.93333334]]] """ + if in_dygraph_mode(): + return _C_ops.final_state_label_smooth(label, prior_dist, + float(epsilon)) + if epsilon > 1. or epsilon < 0.: raise ValueError("The value of epsilon must be between 0 and 1.") - if in_dynamic_mode(): + if paddle.in_dynamic_mode(): return _C_ops.label_smooth(label, prior_dist, 'epsilon', float(epsilon)) check_variable_and_dtype(label, 'label', ['float32', 'float64'], diff --git a/python/paddle/nn/functional/conv.py b/python/paddle/nn/functional/conv.py index f7d765d854116..086ae78919454 100644 --- a/python/paddle/nn/functional/conv.py +++ b/python/paddle/nn/functional/conv.py @@ -12,7 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. from __future__ import print_function -from paddle.fluid.framework import _global_flags import numpy as np from ...device import get_cudnn_version @@ -22,15 +21,20 @@ from ...fluid.data_feeder import check_variable_and_dtype from ...framework import ParamAttr from ...fluid.layer_helper import LayerHelper -from paddle import _C_ops from ...tensor.manipulation import unsqueeze, squeeze from ...tensor.math import add from ...fluid.layers import nn +from paddle import _C_ops +from paddle import get_flags +from paddle import in_dynamic_mode from paddle.device import is_compiled_with_cuda -from paddle.device import is_compiled_with_rocm from paddle.device import is_compiled_with_npu from paddle import in_dynamic_mode from paddle import get_flags +from paddle.device import is_compiled_with_rocm +from paddle.fluid.framework import _global_flags +from paddle.fluid.framework import _in_legacy_dygraph +from paddle.fluid.framework import in_dygraph_mode __all__ = [] @@ -118,6 +122,15 @@ def _conv_nd(x, name=None): # Due to the poor performance of NHWC, we transpose the input to NCHW. 
+ if in_dygraph_mode() and op_type == "conv2d": + pre_bias = _C_ops.final_state_conv2d( + x, weight, stride, padding, padding_algorithm, groups, dilation, + data_format, False, -1, False) + if bias is not None: + out = nn.elementwise_add(pre_bias, bias, axis=channel_dim) + return out + else: + return pre_bias if in_dynamic_mode(): attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation, 'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', @@ -560,8 +573,6 @@ def conv2d(x, use_cudnn = True if (is_compiled_with_cuda() and cudnn_version is not None) else False - use_mkldnn = _global_flags()["FLAGS_use_mkldnn"] - # update attrs padding, padding_algorithm = _update_padding_nd(padding, channel_last, 2) stride = convert_to_list(stride, 2, 'stride') @@ -575,6 +586,18 @@ def conv2d(x, use_cudnn = True else: use_cudnn = False + else: + if in_dygraph_mode(): + pre_bias = _C_ops.final_state_conv2d( + x, weight, stride, padding, padding_algorithm, groups, dilation, + data_format, False, -1, False) + if bias is not None: + out = nn.elementwise_add(pre_bias, bias, axis=channel_dim) + return out + else: + return pre_bias + + use_mkldnn = _global_flags()["FLAGS_use_mkldnn"] # NPU only supports depthwise_conv2d when "input_channel = output_channel = groups" if is_compiled_with_npu(): @@ -1061,7 +1084,17 @@ def conv2d_transpose(x, op_type = 'depthwise_conv2d_transpose' use_cudnn = False - if in_dynamic_mode(): + if in_dygraph_mode(): + final_state_op = _C_ops.final_state_conv2d_transpose if op_type == 'conv2d_transpose' else _C_ops.final_state_depthwise_conv2d_transpose + pre_bias = final_state_op(x, weight, stride, padding, output_padding, + output_size, padding_algorithm, groups, + dilation, data_format) + if bias is not None: + return nn.elementwise_add(pre_bias, bias, axis=channel_dim) + else: + return pre_bias + + if _in_legacy_dygraph(): attrs = ('output_padding', output_padding, 'output_size', output_size, 'strides', stride, 'paddings', padding, 'padding_algorithm', padding_algorithm, 'dilations', dilation, 'groups', groups, @@ -1468,7 +1501,16 @@ def conv3d_transpose(x, op_type = 'conv3d_transpose' data_format_ = "NHWC" if channel_last else "NCHW" - if in_dynamic_mode(): + if in_dygraph_mode(): + pre_bias = _C_ops.final_state_conv3d_transpose( + x, weight, stride, padding, output_padding, output_size, + padding_algorithm, groups, dilation, data_format_) + if bias is not None: + return nn.elementwise_add(pre_bias, bias, axis=channel_dim) + else: + return pre_bias + + if _in_legacy_dygraph(): attrs = ('output_padding', output_padding, 'output_size', output_size, 'paddings', padding, "padding_algorithm", padding_algorithm, 'strides', stride, 'dilations', dilation, 'groups', groups, diff --git a/python/paddle/nn/functional/extension.py b/python/paddle/nn/functional/extension.py index 6a8686b612e7f..2483eab6c053a 100644 --- a/python/paddle/nn/functional/extension.py +++ b/python/paddle/nn/functional/extension.py @@ -20,7 +20,7 @@ from ...static import Variable from ...tensor.creation import assign from ...fluid import dygraph_utils -from ...fluid.layers.layer_function_generator import templatedoc +from ...tensor.layer_function_generator import templatedoc from ...fluid.layers.sequence_lod import sequence_mask #noqa: F401 from paddle import in_dynamic_mode diff --git a/python/paddle/nn/functional/loss.py b/python/paddle/nn/functional/loss.py index e7763853bf7c2..593cea2d2cf64 100755 --- a/python/paddle/nn/functional/loss.py +++ b/python/paddle/nn/functional/loss.py @@ -37,7 +37,7 @@ from 
paddle import _C_ops from paddle import in_dynamic_mode from paddle.framework import core -from ...fluid.framework import _in_legacy_dygraph, in_dygraph_mode +from ...fluid.framework import _in_legacy_dygraph, in_dygraph_mode, _non_static_mode __all__ = [] @@ -259,12 +259,16 @@ def binary_cross_entropy_with_logits(logit, "should be 'sum', 'mean' or 'none', but received %s, which is not allowed." % reduction) - if in_dynamic_mode(): + if _non_static_mode(): one = _varbase_creator(dtype=logit.dtype) _C_ops.fill_constant(one, 'value', float(1.0), 'force_cpu', False, 'dtype', one.dtype, 'str_value', '1.0', 'shape', [1]) - out = _C_ops.sigmoid_cross_entropy_with_logits(logit, label) + if in_dygraph_mode(): + out = _C_ops.final_state_sigmoid_cross_entropy_with_logits( + logit, label, False, -100) + else: + out = _C_ops.sigmoid_cross_entropy_with_logits(logit, label) if pos_weight is not None: log_weight = _C_ops.elementwise_add( _C_ops.elementwise_mul(label, @@ -784,11 +788,22 @@ def nll_loss(input, input_dims)) n = input_shape[0] c = input_shape[1] - if in_dynamic_mode(): + if in_dygraph_mode(): + if input_dims != 2 and input_dims != 4: + input, _ = _C_ops.reshape2(input, None, 'shape', [n, c, 1, -1]) + label, _ = _C_ops.reshape2(label, None, 'shape', [n, 1, -1]) + out_shape = [n] + input_shape[2:] + out, total_weight = _C_ops.final_state_nll_loss(input, label, weight, + ignore_index, reduction) + if input_dims != 2 and input_dims != 4 and reduction == 'none': + out, _ = _C_ops.reshape2(out, None, 'shape', out_shape) + return out + if _in_legacy_dygraph(): if input_dims != 2 and input_dims != 4: input, _ = _C_ops.reshape2(input, None, 'shape', [n, c, 1, -1]) label, _ = _C_ops.reshape2(label, None, 'shape', [n, 1, -1]) out_shape = [n] + input_shape[2:] + out, total_weight = _C_ops.nll_loss(input, label, weight, 'ignore_index', ignore_index, 'reduction', reduction) @@ -910,8 +925,11 @@ def kl_div(input, label, reduction='mean', name=None): label.dtype) == 'float32': label = paddle.cast(label, 'float64') - if paddle.in_dynamic_mode(): - out = _C_ops.kldiv_loss(input, label, 'reduction', 'none') + if _non_static_mode(): + if _in_legacy_dygraph(): + out = _C_ops.kldiv_loss(input, label, 'reduction', 'none') + else: + out = _C_ops.final_state_kldiv_loss(input, label, 'none') if reduction == 'mean': out = paddle.mean(out) elif reduction == 'sum': @@ -1686,7 +1704,8 @@ def cross_entropy(input, (got nput_dims{}, label_dims{})'.format(input_dims, label_dims)) if input_dims - 1 == label_dims: label = paddle.unsqueeze(label, axis=axis) - if in_dynamic_mode(): + + if _non_static_mode(): if soft_label == False: valid_label = paddle.cast( label != ignore_index, dtype=label.dtype) * label @@ -1704,10 +1723,15 @@ def cross_entropy(input, ignore_index, 'numeric_stable_mode', True, 'axis', axis, 'use_softmax', use_softmax) else: - _, out = _C_ops.softmax_with_cross_entropy( - input, label, 'soft_label', soft_label, 'ignore_index', - ignore_index, 'numeric_stable_mode', True, 'axis', axis, - 'use_softmax', use_softmax) + if in_dygraph_mode(): + _, out = _C_ops.final_state_cross_entropy_with_softmax( + input, label, soft_label, use_softmax, True, ignore_index, + axis) + if _in_legacy_dygraph(): + _, out = _C_ops.softmax_with_cross_entropy( + input, label, 'soft_label', soft_label, 'ignore_index', + ignore_index, 'numeric_stable_mode', True, 'axis', axis, + 'use_softmax', use_softmax) if weight is not None: @@ -2004,12 +2028,16 @@ def sigmoid_focal_loss(logit, "Expected one dimension of normalizer in 
sigmoid_focal_loss but got {}.". format(normalizer_dims)) - if in_dynamic_mode(): + if _non_static_mode(): one = _varbase_creator(dtype=logit.dtype) _C_ops.fill_constant(one, 'value', float(1.0), 'force_cpu', False, 'dtype', one.dtype, 'str_value', '1.0', 'shape', logit.shape) - loss = _C_ops.sigmoid_cross_entropy_with_logits(logit, label) + if in_dygraph_mode(): + loss = _C_ops.final_state_sigmoid_cross_entropy_with_logits( + logit, label, False, -100) + else: + loss = _C_ops.sigmoid_cross_entropy_with_logits(logit, label) pred = _C_ops.sigmoid(logit) p_t = _C_ops.elementwise_add( _C_ops.elementwise_mul(pred, label), diff --git a/python/paddle/nn/functional/norm.py b/python/paddle/nn/functional/norm.py index c039754af4d12..1a5fc109805e0 100644 --- a/python/paddle/nn/functional/norm.py +++ b/python/paddle/nn/functional/norm.py @@ -24,6 +24,7 @@ import numbers from paddle import _C_ops from paddle import in_dynamic_mode +from paddle.fluid.framework import core, _non_static_mode, in_dygraph_mode, _in_legacy_dygraph __all__ = [] @@ -78,7 +79,12 @@ def normalize(x, p=2, axis=1, epsilon=1e-12, name=None): # [[0. 0.24253564 0.37139067] # [1. 0.97014254 0.9284767 ]] """ - if in_dynamic_mode(): + if in_dygraph_mode(): + eps = fluid.dygraph.base.to_variable([epsilon], dtype=x.dtype) + out = _C_ops.final_state_p_norm(x, float(p), axis, epsilon, True, False) + return x / _C_ops.elementwise_max(out, eps) + + if _in_legacy_dygraph(): eps = fluid.dygraph.base.to_variable([epsilon], dtype=x.dtype) out = _C_ops.p_norm(x, 'axis', axis, 'porder', float(p), 'keepdim', True, 'epsilon', epsilon) @@ -180,15 +186,26 @@ def batch_norm(x, else: trainable_statistics = not use_global_stats - if in_dynamic_mode(): + if in_dygraph_mode(): + batch_norm_out, _, _, _, _, _ = _C_ops.final_state_batch_norm( + x, weight, bias, running_mean, running_var, momentum, epsilon, + data_format, not training, use_global_stats, trainable_statistics, + False) + + return dygraph_utils._append_activation_in_dygraph( + batch_norm_out, act=None) + + elif _in_legacy_dygraph(): # for dygraph need tuple attrs = ("momentum", momentum, "epsilon", epsilon, "is_test", not training, "data_layout", data_format, "use_mkldnn", False, "fuse_with_relu", False, "use_global_stats", use_global_stats, "trainable_statistics", trainable_statistics) + batch_norm_out, _, _, _, _, _ = _C_ops.batch_norm( - x, weight, bias, running_mean, running_var, mean_out, variance_out, - *attrs) + x, weight, bias, running_mean, running_var, None, mean_out, + variance_out, *attrs) + return dygraph_utils._append_activation_in_dygraph( batch_norm_out, act=None) diff --git a/python/paddle/nn/functional/pooling.py b/python/paddle/nn/functional/pooling.py index 34a0159fbb0dc..b9cae4784725d 100755 --- a/python/paddle/nn/functional/pooling.py +++ b/python/paddle/nn/functional/pooling.py @@ -18,6 +18,8 @@ from ...fluid.data_feeder import check_type, check_variable_and_dtype from paddle import _C_ops from paddle import in_dynamic_mode +from paddle.fluid.framework import _in_legacy_dygraph +from paddle.fluid.framework import in_dygraph_mode __all__ = [] @@ -344,13 +346,18 @@ def avg_pool2d(x, padding, padding_algorithm = _update_padding_nd( padding, 2, channel_last, ceil_mode=ceil_mode) - if in_dynamic_mode(): - output = _C_ops.pool2d(x, 'pooling_type', 'avg', 'ksize', kernel_size, - 'global_pooling', False, 'padding_algorithm', - padding_algorithm, 'strides', stride, 'paddings', - padding, 'use_cudnn', True, 'ceil_mode', - ceil_mode, 'use_mkldnn', False, 'exclusive', - exclusive, 
'data_format', data_format) + if in_dygraph_mode() or _in_legacy_dygraph(): + if in_dygraph_mode(): + output = _C_ops.final_state_pool2d( + x, kernel_size, stride, padding, ceil_mode, exclusive, + data_format, 'avg', False, False, padding_algorithm) + else: + output = _C_ops.pool2d( + x, 'pooling_type', 'avg', 'ksize', kernel_size, + 'global_pooling', False, 'padding_algorithm', padding_algorithm, + 'strides', stride, 'paddings', padding, 'use_cudnn', True, + 'ceil_mode', ceil_mode, 'use_mkldnn', False, 'exclusive', + exclusive, 'data_format', data_format) if divisor_override is None: return output else: @@ -466,13 +473,18 @@ def avg_pool3d(x, _check_value_limitation(kernel_size, "kernel_size", min_limit=1e-3) _check_value_limitation(stride, "stride", min_limit=1e-3) - if in_dynamic_mode(): - output = _C_ops.pool3d( - x, 'pooling_type', 'avg', 'ksize', kernel_size, 'strides', stride, - 'paddings', padding, 'global_pooling', False, 'padding_algorithm', - padding_algorithm, 'use_cudnn', True, 'ceil_mode', ceil_mode, - 'use_mkldnn', False, 'exclusive', exclusive, 'data_format', - data_format) + if in_dygraph_mode() or _in_legacy_dygraph(): + if in_dygraph_mode(): + output = _C_ops.final_state_pool3d( + x, kernel_size, stride, padding, ceil_mode, exclusive, + data_format, 'avg', False, False, padding_algorithm) + if _in_legacy_dygraph(): + output = _C_ops.pool3d( + x, 'pooling_type', 'avg', 'ksize', kernel_size, 'strides', + stride, 'paddings', padding, 'global_pooling', False, + 'padding_algorithm', padding_algorithm, 'use_cudnn', True, + 'ceil_mode', ceil_mode, 'use_mkldnn', False, 'exclusive', + exclusive, 'data_format', data_format) if divisor_override is None: return output else: @@ -585,7 +597,20 @@ def max_pool1d(x, # use 2d to implenment 1d should expand padding in advance. 
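# Editor's illustration (not part of the patch): the branch introduced below follows the
# dispatch pattern applied throughout this change set. A minimal sketch, assuming a
# hypothetical operator named "foo" that has both a final-state and a legacy binding:
#
#     if in_dygraph_mode():            # new eager mode -> final-state C++ API
#         return _C_ops.final_state_foo(x, kernel_size, stride, padding, ...)
#     if _in_legacy_dygraph():         # old dygraph -> legacy attr-string call
#         return _C_ops.foo(x, 'ksize', kernel_size, 'strides', stride, ...)
#     # otherwise fall through to the static-graph LayerHelper/append_op path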
padding = _expand_low_nd_padding(padding) - if in_dynamic_mode(): + if in_dygraph_mode(): + if return_mask: + pool_out = _C_ops.final_state_max_pool2d_with_index( + x, kernel_size, stride, padding, False, False) + return (squeeze(pool_out[0], [2]), + squeeze(pool_out[1], + [2])) if return_mask else squeeze(pool_out[0], [2]) + else: + pool_out = _C_ops.final_state_pool2d( + x, kernel_size, stride, padding, ceil_mode, True, data_format, + 'max', False, False, padding_algorithm) + return squeeze(pool_out, [2]) + + if _in_legacy_dygraph(): if return_mask: pool_out = _C_ops.max_pool2d_with_index( x, 'ksize', kernel_size, 'global_pooling', False, 'strides', @@ -1027,7 +1052,17 @@ def max_pool2d(x, "When setting return_mask to true, data_format must be set to NCHW in API:max_pool2d" ) - if in_dynamic_mode(): + if in_dygraph_mode(): + if return_mask: + output = _C_ops.final_state_max_pool2d_with_index( + x, kernel_size, stride, padding, False, False) + return output if return_mask else output[0] + else: + return _C_ops.final_state_pool2d( + x, kernel_size, stride, padding, ceil_mode, True, data_format, + 'max', False, False, padding_algorithm) + + if _in_legacy_dygraph(): if return_mask: output = _C_ops.max_pool2d_with_index( x, 'ksize', kernel_size, 'global_pooling', False, 'strides', @@ -1158,7 +1193,17 @@ def max_pool3d(x, "When setting return_mask to true, data_format must be set to NCDHW in API:max_pool3d" ) - if in_dynamic_mode(): + if in_dygraph_mode(): + if return_mask: + output = _C_ops.final_state_max_pool3d_with_index( + x, kernel_size, stride, padding, False, False) + return output if return_mask else output[0] + else: + return _C_ops.final_state_pool3d( + x, kernel_size, stride, padding, ceil_mode, True, data_format, + 'max', False, False, padding_algorithm) + + if _in_legacy_dygraph(): if return_mask: output = _C_ops.max_pool3d_with_index( x, 'pooling_type', 'max', 'ksize', kernel_size, 'strides', @@ -1355,11 +1400,15 @@ def adaptive_avg_pool2d(x, output_size, data_format='NCHW', name=None): if output_size[1] == None: output_size[1] = in_w - if in_dynamic_mode(): - output = _C_ops.pool2d(x, 'pooling_type', 'avg', 'ksize', output_size, - 'global_pooling', False, 'adaptive', True, - 'data_format', data_format) - return output + if in_dygraph_mode(): + return _C_ops.final_state_pool2d(x, output_size, [1, 1], [0, 0], False, + True, data_format, 'avg', False, True, + "EXPLICIT") + + if _in_legacy_dygraph(): + return _C_ops.pool2d(x, 'pooling_type', 'avg', 'ksize', output_size, + 'global_pooling', False, 'adaptive', True, + 'data_format', data_format) l_type = 'pool2d' @@ -1462,10 +1511,9 @@ def adaptive_avg_pool3d(x, output_size, data_format='NCDHW', name=None): output_size[2] = in_w if in_dynamic_mode(): - output = _C_ops.pool3d(x, 'pooling_type', 'avg', 'ksize', output_size, - 'global_pooling', False, 'adaptive', True, - 'data_format', data_format) - return output + return _C_ops.pool3d(x, 'pooling_type', 'avg', 'ksize', output_size, + 'global_pooling', False, 'adaptive', True, + 'data_format', data_format) l_type = 'pool3d' diff --git a/python/paddle/nn/initializer/dirac.py b/python/paddle/nn/initializer/dirac.py index da3266ab33694..c7cb1052d2f78 100644 --- a/python/paddle/nn/initializer/dirac.py +++ b/python/paddle/nn/initializer/dirac.py @@ -18,7 +18,8 @@ from ...fluid import framework from paddle import in_dynamic_mode from paddle.utils import unique_name - +from paddle import _C_ops +from ... 
import fluid __all__ = [] @@ -123,17 +124,24 @@ def __call__(self, var, block=None): persistable=False) else: out_var = var - - block.append_op( - type='fill_constant', - inputs={}, - outputs={'Out': out_var}, - attrs={ - 'value': float(0), - 'dtype': out_var.dtype, - 'shape': out_var.shape, - }, - stop_gradient=True) + op = None + if framework.in_dygraph_mode(): + with fluid.dygraph.no_grad(): + _C_ops.fill_constant(out_var, 'value', + float(0), 'force_cpu', False, 'dtype', + out_var.dtype, 'str_value', + str(float(0)), 'shape', out_var.shape) + else: + block.append_op( + type='fill_constant', + inputs={}, + outputs={'Out': out_var}, + attrs={ + 'value': float(0), + 'dtype': out_var.dtype, + 'shape': out_var.shape, + }, + stop_gradient=True) origin_shape = var.shape num_per_group = origin_shape[0] // self._groups @@ -158,71 +166,102 @@ def __call__(self, var, block=None): else: offset += origin_shape[k] // 2 * stride idx_list.append(offset) - - block.append_op( - type="reshape", - inputs={"X": out_var}, - attrs={'shape': [-1]}, - outputs={"Out": out_var}, - stop_gradient=True) + if framework.in_dygraph_mode(): + with fluid.dygraph.no_grad(): + tmp_out = _C_ops.reshape(out_var, 'shape', [-1]) + tmp_out._share_underline_tensor_to(out_var) + else: + block.append_op( + type="reshape", + inputs={"X": out_var}, + attrs={'shape': [-1]}, + outputs={"Out": out_var}, + stop_gradient=True) index_tensor = block.create_var( name=unique_name.generate('scatter_index'), persistable=False, stop_gradient=True) - block.append_op( - type='assign_value', - outputs={'Out': index_tensor}, - attrs={ - 'dtype': VarDesc.VarType.INT64, - 'shape': [len(idx_list)], - 'int64_values': idx_list - }, - stop_gradient=True) + if framework.in_dygraph_mode(): + with fluid.dygraph.no_grad(): + tmp_tensor = framework._varbase_creator() + _C_ops.assign_value(tmp_tensor, 'shape', [len(idx_list)], + 'dtype', VarDesc.VarType.INT64, + 'int64_values', idx_list) + tmp_tensor._share_underline_tensor_to(index_tensor) + else: + block.append_op( + type='assign_value', + outputs={'Out': index_tensor}, + attrs={ + 'dtype': VarDesc.VarType.INT64, + 'shape': [len(idx_list)], + 'int64_values': idx_list + }, + stop_gradient=True) value_tensor = block.create_var( name=unique_name.generate('scatter_value'), persistable=False, stop_gradient=True) - block.append_op( - type='assign_value', - outputs={'Out': value_tensor}, - attrs={ - 'dtype': VarDesc.VarType.FP32, - 'shape': [len(value_list)], - 'fp32_values': value_list - }, - stop_gradient=True) - - op = block.append_op( - type="scatter", - inputs={ - "X": out_var, - "Ids": index_tensor, - "Updates": value_tensor - }, - attrs={'overwrite': True}, - outputs={"Out": out_var}, - stop_gradient=True) + if framework.in_dygraph_mode(): + with fluid.dygraph.no_grad(): + tmp_tensor = framework._varbase_creator() + _C_ops.assign_value(tmp_tensor, 'shape', [len(value_list)], + 'dtype', VarDesc.VarType.FP32, + 'fp32_values', value_list) + tmp_tensor._share_underline_tensor_to(value_tensor) + else: + block.append_op( + type='assign_value', + outputs={'Out': value_tensor}, + attrs={ + 'dtype': VarDesc.VarType.FP32, + 'shape': [len(value_list)], + 'fp32_values': value_list + }, + stop_gradient=True) - block.append_op( - type="reshape", - inputs={"X": out_var}, - attrs={'shape': origin_shape}, - outputs={"Out": out_var}, - stop_gradient=True) + if framework.in_dygraph_mode(): + with fluid.dygraph.no_grad(): + tmp_out = _C_ops.final_state_scatter(out_var, index_tensor, + value_tensor, True) + 
tmp_out._share_underline_tensor_to(out_var) + tmp_reshape_out = _C_ops.reshape(out_var, 'shape', origin_shape) + tmp_reshape_out._share_underline_tensor_to(out_var) + if var.dtype != VarDesc.VarType.FP32: + tmp_cast_out = _C_ops.cast(out_var, 'in_dtype', + out_var.dtype, 'out_dtype', + var.dtype) + tmp_cast_out._share_underline_tensor_to(var) - if var.dtype != VarDesc.VarType.FP32: + else: + op = block.append_op( + type="scatter", + inputs={ + "X": out_var, + "Ids": index_tensor, + "Updates": value_tensor + }, + attrs={'overwrite': True}, + outputs={"Out": out_var}, + stop_gradient=True) block.append_op( - type="cast", + type="reshape", inputs={"X": out_var}, - outputs={"Out": var}, - attrs={"in_dtype": out_var.dtype, - "out_dtype": var.dtype}, + attrs={'shape': origin_shape}, + outputs={"Out": out_var}, stop_gradient=True) - + if var.dtype != VarDesc.VarType.FP32: + block.append_op( + type="cast", + inputs={"X": out_var}, + outputs={"Out": var}, + attrs={"in_dtype": out_var.dtype, + "out_dtype": var.dtype}, + stop_gradient=True) if not in_dynamic_mode(): var.op = op return op diff --git a/python/paddle/nn/layer/distance.py b/python/paddle/nn/layer/distance.py index 1fb7e8c4f2148..eb85de5711078 100644 --- a/python/paddle/nn/layer/distance.py +++ b/python/paddle/nn/layer/distance.py @@ -20,6 +20,7 @@ from ...fluid.layer_helper import LayerHelper from paddle import _C_ops from paddle import in_dynamic_mode +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph __all__ = [] @@ -78,7 +79,12 @@ def __init__(self, p=2., epsilon=1e-6, keepdim=False, name=None): check_type(self.keepdim, 'keepdim', (bool), 'PairwiseDistance') def forward(self, x, y): - if in_dynamic_mode(): + if in_dygraph_mode(): + sub = _C_ops.elementwise_sub(x, y) + return _C_ops.final_state_p_norm(sub, self.p, 1, self.epsilon, + self.keepdim, False) + + if _in_legacy_dygraph(): sub = _C_ops.elementwise_sub(x, y) return _C_ops.p_norm(sub, 'axis', 1, 'porder', self.p, 'keepdim', self.keepdim, 'epsilon', self.epsilon) diff --git a/python/paddle/nn/utils/__init__.py b/python/paddle/nn/utils/__init__.py index 8f9b55d15cad0..8ec4e8cfd60b5 100644 --- a/python/paddle/nn/utils/__init__.py +++ b/python/paddle/nn/utils/__init__.py @@ -14,7 +14,7 @@ from .spectral_norm_hook import spectral_norm from .weight_norm_hook import weight_norm, remove_weight_norm # noqa: F401 -from .transform_parameters import parameters_to_vector, vector_to_parameters # noqa: F401 +from .transform_parameters import parameters_to_vector, vector_to_parameters, _stride_column # noqa: F401 __all__ = [ #noqa 'weight_norm', 'remove_weight_norm', 'spectral_norm', 'parameters_to_vector', 'vector_to_parameters' diff --git a/python/paddle/nn/utils/transform_parameters.py b/python/paddle/nn/utils/transform_parameters.py index 03d2fa514869d..feb70e02d5988 100644 --- a/python/paddle/nn/utils/transform_parameters.py +++ b/python/paddle/nn/utils/transform_parameters.py @@ -15,20 +15,58 @@ from functools import reduce import paddle -from paddle.fluid.framework import dygraph_only, _dygraph_tracer, _varbase_creator +from paddle.fluid.framework import dygraph_only, _dygraph_tracer, _varbase_creator, in_dygraph_mode from paddle import _C_ops #input==output, inplace strategy of reshape has no cost almostly def _inplace_reshape_dygraph(x, shape): - x_shape = _varbase_creator(dtype=x.dtype) - _dygraph_tracer().trace_op( - type="reshape2", - inputs={'X': x}, - outputs={'Out': x, - 'XShape': x_shape}, - attrs={'shape': shape}, - stop_gradient=True) + x_shape = 
_varbase_creator(dtype='int64') + if in_dygraph_mode(): + with paddle.fluid.dygraph.no_grad(): + tmp_out, _ = _C_ops.reshape2(x, None, 'shape', shape) + tmp_out._share_underline_tensor_to(x) + else: + _dygraph_tracer().trace_op( + type="reshape2", + inputs={'X': x}, + outputs={'Out': x, + 'XShape': x_shape}, + attrs={'shape': shape}, + stop_gradient=True) + + +@dygraph_only +def _stride_column(param): + """ + A tool function. Permute date of parameter as a 'columns' stride. Now, it only support 2-D parameter. + + Args: + param(Tensor]): The param that will be strided according to 'columns'. + + Examples: + .. code-block:: python + + import paddle + paddle.seed(100) + + linear = paddle.nn.Linear(2, 3) + print(linear.weight) + # [[-0.31485492, -1.02896988, 0.45741916], + # [-0.65525872, -1.04643178, 1.07262802]] + + paddle.nn.utils.stride_column(linear.weight) + print(linear.weight) + # [[-0.31485492, 0.45741916, -1.04643178], + # [-1.02896988, -0.65525872, 1.07262802]] + + """ + assert len(param.shape) == 2 + shape = [param.shape[1], param.shape[0]] + with paddle.fluid.dygraph.no_grad(): + reshape_var = paddle.reshape(param, shape) + transpose_var = paddle.transpose(reshape_var, [1, 0]) + transpose_var._share_underline_tensor_to(param) @dygraph_only @@ -62,12 +100,18 @@ def parameters_to_vector(parameters, name=None): _inplace_reshape_dygraph(param, [-1]) out = _varbase_creator(dtype=dtype) - _dygraph_tracer().trace_op( - type='concat', - inputs={'X': parameters}, - outputs={'Out': [out]}, - attrs={'axis': 0}, - stop_gradient=True) + if in_dygraph_mode(): + with paddle.fluid.dygraph.no_grad(): + tmp = _varbase_creator() + _C_ops.concat(parameters, tmp, 'axis', 0) + tmp._share_underline_tensor_to(out) + else: + _dygraph_tracer().trace_op( + type='concat', + inputs={'X': parameters}, + outputs={'Out': [out]}, + attrs={'axis': 0}, + stop_gradient=True) for i, param in enumerate(parameters): _inplace_reshape_dygraph(param, origin_shapes[i]) return out @@ -109,13 +153,20 @@ def vector_to_parameters(vec, parameters, name=None): numel = reduce(lambda x, y: x * y, shape) sections.append(numel) - _dygraph_tracer().trace_op( - type='split', - inputs={'X': [vec]}, - outputs={'Out': parameters}, - attrs={'axis': 0, - 'sections': sections}, - stop_gradient=True) + if in_dygraph_mode(): + with paddle.fluid.dygraph.no_grad(): + res = [_varbase_creator() for n in range(len(parameters))] + _C_ops.split(vec, res, 'axis', 0, 'sections', sections) + for i in range(0, len(res)): + res[i]._share_underline_tensor_to(parameters[i]) + else: + _dygraph_tracer().trace_op( + type='split', + inputs={'X': [vec]}, + outputs={'Out': parameters}, + attrs={'axis': 0, + 'sections': sections}, + stop_gradient=True) for i, param in enumerate(parameters): _inplace_reshape_dygraph(param, origin_shapes[i]) diff --git a/python/paddle/optimizer/adamax.py b/python/paddle/optimizer/adamax.py index de70e2e72a9c6..4c4a85559c0d9 100644 --- a/python/paddle/optimizer/adamax.py +++ b/python/paddle/optimizer/adamax.py @@ -16,6 +16,7 @@ from ..fluid import core from ..fluid import framework from ..fluid.framework import Variable, name_scope +from paddle import _C_ops __all__ = [] @@ -190,30 +191,38 @@ def _append_optimize_op(self, block, param_and_grad): param_and_grad[0]) beta1_pow_acc = self._get_accumulator(self._beta1_pow_acc_str, param_and_grad[0]) - # create the adamax optimize op - adamax_op = block.append_op( - type=self.type, - inputs={ - "Param": param_and_grad[0], - "Grad": param_and_grad[1], - "LearningRate": 
self._create_param_lr(param_and_grad), - "Moment": moment, - "InfNorm": inf_norm, - "Beta1Pow": beta1_pow_acc - }, - outputs={ - "ParamOut": param_and_grad[0], - "MomentOut": moment, - "InfNormOut": inf_norm - }, - attrs={ - "beta1": self._beta1, - "beta2": self._beta2, - "epsilon": self._epsilon - }, - stop_gradient=True) - return adamax_op + if framework._non_static_mode(): + _C_ops.adamax(param_and_grad[0], param_and_grad[1], + self._create_param_lr(param_and_grad), moment, + inf_norm, beta1_pow_acc, param_and_grad[0], moment, + inf_norm, "beta1", self._beta1, "beta2", self._beta2, + "epsilon", self._epsilon) + else: + # create the adamax optimize op + adamax_op = block.append_op( + type=self.type, + inputs={ + "Param": param_and_grad[0], + "Grad": param_and_grad[1], + "LearningRate": self._create_param_lr(param_and_grad), + "Moment": moment, + "InfNorm": inf_norm, + "Beta1Pow": beta1_pow_acc + }, + outputs={ + "ParamOut": param_and_grad[0], + "MomentOut": moment, + "InfNormOut": inf_norm + }, + attrs={ + "beta1": self._beta1, + "beta2": self._beta2, + "epsilon": self._epsilon + }, + stop_gradient=True) + + return adamax_op def _finish_update(self, block, parameters_and_grads): """Update Beta1 Power accumulator diff --git a/python/paddle/optimizer/momentum.py b/python/paddle/optimizer/momentum.py index f68bbad4ab249..ce112c19250ca 100644 --- a/python/paddle/optimizer/momentum.py +++ b/python/paddle/optimizer/momentum.py @@ -25,6 +25,7 @@ from paddle.fluid.regularizer import L2DecayRegularizer from paddle import _C_ops import paddle +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph __all__ = [] @@ -313,7 +314,7 @@ def _append_optimize_op(self, block, param_and_grad): master_weight = (self._master_weights[param_and_grad[0].name] if find_master else None) - if framework._non_static_mode(): + if _in_legacy_dygraph(): if isinstance(param_and_grad, dict): self._update_regularization(param_and_grad['weight_decay']) _, _, _ = _C_ops.momentum( @@ -323,8 +324,15 @@ def _append_optimize_op(self, block, param_and_grad): 'regularization_method', regularization_method, 'regularization_coeff', regularization_coeff, 'multi_precision', find_master) - return None + if in_dygraph_mode(): + if isinstance(param_and_grad, dict): + self._update_regularization(param_and_grad['weight_decay']) + return _C_ops.final_state_momentum( + param_and_grad[0], param_and_grad[1], velocity_acc, lr, + master_weight, self._momentum, self._use_nesterov, + regularization_method, regularization_coeff, find_master, + self._rescale_grad) attrs = { "mu": self._momentum, diff --git a/python/paddle/profiler/profiler_statistic.py b/python/paddle/profiler/profiler_statistic.py index 3be6088a484b8..5fed51476132e 100755 --- a/python/paddle/profiler/profiler_statistic.py +++ b/python/paddle/profiler/profiler_statistic.py @@ -28,7 +28,7 @@ TracerEventType.PythonOp, TracerEventType.PythonUserDefined ] -_CommunicationOpName = ['reduce', 'broadcast', 'rpc'] +_CommunicationOpName = ['allreduce', 'broadcast', 'rpc'] class SortedKeys(Enum): @@ -74,8 +74,10 @@ def __init__(self, hostnode): self.runtime_node = [] self.cpu_time = 0 self.self_cpu_time = 0 - self.gpu_time = 0 + self.gpu_time = 0 # kernel time self.self_gpu_time = 0 + self.general_gpu_time = 0 # besides kernel, include time of gpu events like memcpy and memset + self.self_general_gpu_time = 0 def cal_statistic(self): for child in self.children_node: @@ -86,14 +88,20 @@ def cal_statistic(self): self.cpu_time = self.hostnode.end_ns - self.hostnode.start_ns for 
child in self.children_node: self.gpu_time += child.gpu_time + self.general_gpu_time += child.general_gpu_time self.self_cpu_time -= (child.end_ns - child.start_ns) for rt in self.runtime_node: self.self_cpu_time -= (rt.end_ns - rt.start_ns) self.gpu_time += rt.gpu_time self.self_gpu_time += rt.gpu_time + self.general_gpu_time += rt.general_gpu_time + self.self_general_gpu_time += rt.general_gpu_time for device in self.hostnode.device_node: - self.gpu_time += (device.end_ns - device.start_ns) - self.self_gpu_time += (device.end_ns - device.start_ns) + if device.type == TracerEventType.Kernel: + self.gpu_time += (device.end_ns - device.start_ns) + self.self_gpu_time += (device.end_ns - device.start_ns) + self.general_gpu_time += (device.end_ns - device.start_ns) + self.self_general_gpu_time += (device.end_ns - device.start_ns) @property def end_ns(self): @@ -258,6 +266,8 @@ def __init__(self): self.communication_range = [] self.computation_range = [] self.overlap_range = [] + self.cpu_calls = 0 + self.gpu_calls = 0 def parse(self, nodetrees): ''' @@ -300,6 +310,8 @@ def parse(self, nodetrees): else: self.computation_range.append(( devicenode.start_ns, devicenode.end_ns)) + self.cpu_calls = len(set(self.cpu_communication_range)) + self.gpu_calls = len(set(self.gpu_communication_range)) self.cpu_communication_range = merge_self_ranges( self.cpu_communication_range, is_sorted=False) self.gpu_communication_range = merge_self_ranges( @@ -354,6 +366,9 @@ def __init__(self, name): self.min_gpu_time = float('inf') self.devices = {} self.operator_inners = {} + self.general_gpu_time = 0 + self.min_general_gpu_time = float('inf') + self.max_general_gpu_time = 0 @property def avg_cpu_time(self): @@ -363,6 +378,10 @@ def avg_cpu_time(self): def avg_gpu_time(self): return self.gpu_time / self.call + @property + def avg_general_gpu_time(self): + return self.general_gpu_time / self.call + def add_cpu_time(self, time): if time > self.max_cpu_time: self.max_cpu_time = time @@ -377,6 +396,13 @@ def add_gpu_time(self, time): self.min_gpu_time = time self.gpu_time += time + def add_general_gpu_time(self, time): + if time > self.max_general_gpu_time: + self.max_general_gpu_time = time + if time < self.min_general_gpu_time: + self.min_general_gpu_time = time + self.general_gpu_time += time + def add_call(self): self.call += 1 @@ -384,6 +410,7 @@ def add_item(self, node): self.add_call() self.add_cpu_time(node.cpu_time) self.add_gpu_time(node.gpu_time) + self.add_general_gpu_time(node.general_gpu_time) for child in node.children_node: if child.name not in self.operator_inners: self.operator_inners[ @@ -407,6 +434,9 @@ def __init__(self, name): self.gpu_time = 0 self.max_gpu_time = 0 self.min_gpu_time = float('inf') + self.general_gpu_time = 0 + self.min_general_gpu_time = float('inf') + self.max_general_gpu_time = 0 @property def avg_cpu_time(self): @@ -416,6 +446,10 @@ def avg_cpu_time(self): def avg_gpu_time(self): return self.gpu_time / self.call + @property + def avg_general_gpu_time(self): + return self.general_gpu_time / self.call + def add_cpu_time(self, time): if time > self.max_cpu_time: self.max_cpu_time = time @@ -430,6 +464,13 @@ def add_gpu_time(self, time): self.min_gpu_time = time self.gpu_time += time + def add_general_gpu_time(self, time): + if time > self.max_general_gpu_time: + self.max_general_gpu_time = time + if time < self.min_general_gpu_time: + self.min_general_gpu_time = time + self.general_gpu_time += time + def add_call(self): self.call += 1 @@ -437,6 +478,7 @@ def add_item(self, node): 
self.add_call() self.add_cpu_time(node.cpu_time) self.add_gpu_time(node.gpu_time) + self.add_general_gpu_time(node.general_gpu_time) def __init__(self): self.items = {} # for operator summary @@ -478,6 +520,8 @@ def parse(self, nodetrees): self.add_model_perspective_item( child) #find first model perspective node else: + if child.type == TracerEventType.ProfileStep: + self.add_model_perspective_item(child) deque.append(child) def add_operator_item(self, operator_node): @@ -533,6 +577,8 @@ def add_model_perspective_item(self, model_perspective_node): name = 'Optimization' elif model_perspective_node.type == TracerEventType.Dataloader: name = 'Dataloader' + elif model_perspective_node.type == TracerEventType.ProfileStep: + name = 'ProfileStep' else: return if name not in self.model_perspective_items: @@ -626,7 +672,6 @@ def format_ratio(ratio, indent=0): # construct table string append(add_title(line_length, "Device Summary")) - append('Time unit: {}'.format(time_unit)) append(header_sep) append(row_format.format(*headers)) append(header_sep) @@ -661,7 +706,7 @@ def format_ratio(ratio, indent=0): return ''.join(result) ###### Print Overview Summary ###### - headers = ['Event Type', 'CPU Time', 'Ratio (%)'] + headers = ['Event Type', 'Calls', 'CPU Time', 'Ratio (%)'] row_format_list = [""] header_sep_list = [""] line_length_list = [-SPACING_SIZE] @@ -680,13 +725,13 @@ def format_ratio(ratio, indent=0): append(header_sep) append(row_format.format(*headers)) append(header_sep) - row_values = [ - 'Total Time', format_time( - total_time, unit=time_unit), format_ratio(1) - ] - append(row_format.format(*row_values)) cpu_type_time = collections.defaultdict(int) gpu_type_time = collections.defaultdict(int) + cpu_call_times = collections.defaultdict(int) + gpu_call_times = collections.defaultdict(int) + cpu_call_times.update(statistic_data.time_range_summary.call_times) + gpu_call_times.update(statistic_data.time_range_summary.call_times) + for event_type, value in statistic_data.time_range_summary.CPUTimeRangeSum.items( ): if event_type != TracerEventType.Communication: @@ -694,6 +739,9 @@ def format_ratio(ratio, indent=0): if statistic_data.distributed_summary.cpu_communication_range: cpu_type_time[TracerEventType.Communication] = sum_ranges( statistic_data.distributed_summary.cpu_communication_range) + cpu_call_times[ + TracerEventType. + Communication] = statistic_data.distributed_summary.cpu_calls gpu_time_range = collections.defaultdict(list) for device_id, device_time_ranges in statistic_data.time_range_summary.GPUTimeRange.items( @@ -706,22 +754,34 @@ def format_ratio(ratio, indent=0): if statistic_data.distributed_summary.gpu_communication_range: gpu_type_time[TracerEventType.Communication] = sum_ranges( statistic_data.distributed_summary.gpu_communication_range) + gpu_call_times[ + TracerEventType. 
+ Communication] = statistic_data.distributed_summary.gpu_calls sorted_items = sorted( cpu_type_time.items(), key=lambda x: x[1], reverse=True) - for event_type, time in sorted_items: + event_type, time = sorted_items[0] + row_values = [ + '{}'.format(str(event_type).split('.')[1]), cpu_call_times[event_type], + format_time( + time, unit=time_unit), format_ratio(float(time) / total_time) + ] + append(row_format.format(*row_values)) + for event_type, time in sorted_items[1:]: row_values = [ - ' {}'.format(str(event_type).split('.')[1]), format_time( + ' {}'.format(str(event_type).split('.')[1]), + cpu_call_times[event_type], format_time( time, unit=time_unit), format_ratio(float(time) / total_time) ] append(row_format.format(*row_values)) append(header_sep) - headers = ['', 'GPU Time', 'Ratio (%)'] + headers = ['', 'Calls', 'GPU Time', 'Ratio (%)'] append(row_format.format(*headers)) append(header_sep) for event_type, time in gpu_type_time.items(): row_values = [ - ' {}'.format(str(event_type).split('.')[1]), format_time( + ' {}'.format(str(event_type).split('.')[1]), + gpu_call_times[event_type], format_time( time, unit=time_unit), format_ratio(float(time) / total_time) ] append(row_format.format(*row_values)) @@ -730,7 +790,7 @@ def format_ratio(ratio, indent=0): append( "Note:\nIn this table, We sum up all collected events in terms of event type.\n" "The time of events collected on host are presented as CPU Time, and as GPU Time if on device.\n" - "Ratio = CPU(GPU) Time / Total Time.\n" + "The time with ratio 100% is the base time for calculating ratio. \n" "Events with different types may overlap or inclusion, e.g. Operator includes OperatorInner, so the sum of ratios is not 100%.\n" "The time of events in the same type with overlap will not calculate twice, and all time is summed after merged.\n" "Example:\n" @@ -746,21 +806,21 @@ def format_ratio(ratio, indent=0): ###### Print Model Summary Report ###### model_perspective_items = statistic_data.event_summary.model_perspective_items - if model_perspective_items: + if len(model_perspective_items) > 1: all_row_values = [] - row_values = [ - 'Total Time', '-', '{} / - / - / - / {}'.format( - format_time( - total_time, unit=time_unit), format_ratio(1)), - '- / - / - / -/ -' - ] - all_row_values.append(row_values) accmulation_time = 0 - for name in ['Dataloader', 'Forward', 'Backward', 'Optimization']: + gpu_accmulation_time = 0 + gpu_total_time = 0 + for name in [ + 'ProfileStep', 'Dataloader', 'Forward', 'Backward', + 'Optimization' + ]: if name in model_perspective_items: item = model_perspective_items[name] + name = '{}'.format( + name) if 'ProfileStep' in name else ' {}'.format(name) row_values = [ - ' {}'.format(name), item.call, + '{}'.format(name), item.call, '{} / {} / {} / {} / {}'.format( format_time( item.cpu_time, unit=time_unit), @@ -783,15 +843,23 @@ def format_ratio(ratio, indent=0): format_ratio(float(item.gpu_time) / total_time)) ] all_row_values.append(row_values) - accmulation_time += item.cpu_time + if 'ProfileStep' not in name: + accmulation_time += item.cpu_time + gpu_accmulation_time += item.gpu_time + else: + gpu_total_time = item.gpu_time other_time = total_time - accmulation_time + other_gpu_time = gpu_total_time - gpu_accmulation_time row_values = [ ' Others', '-', '{} / - / - / - / {}'.format( format_time( other_time, unit=time_unit), format_ratio(float(other_time) / total_time)), - '- / - / - / - / -' + '{} / - / - / - / {}'.format( + format_time( + other_gpu_time, unit=time_unit), + 
format_ratio(float(other_gpu_time) / gpu_total_time)) ] all_row_values.append(row_values) # Calculate the column width @@ -835,6 +903,7 @@ def format_ratio(ratio, indent=0): append( "Note:\nIn this table, GPU time is the sum of all device(GPU) events called in the phase.\n" "Unlike overview summary, if two device(GPU) events execute on different streams with overlap time, we sum them directly here.\n" + "The time with ratio 100% is the base time for calculating ratio. \n" ) append('-' * line_length) append('') @@ -872,21 +941,27 @@ def format_ratio(ratio, indent=0): overlap_time = sum_ranges( statistic_data.distributed_summary.overlap_range) row_values = [ - 'Communication', format_time( + 'ProfileStep', format_time( + total_time, unit=time_unit), + format_ratio(float(total_time) / total_time) + ] + append(row_format.format(*row_values)) + row_values = [ + ' Communication', format_time( communication_time, unit=time_unit), format_ratio(float(communication_time) / total_time) ] append(row_format.format(*row_values)) row_values = [ - 'Computation', format_time( + ' Computation', format_time( computation_time, unit=time_unit), format_ratio(float(computation_time) / total_time) ] append(row_format.format(*row_values)) row_values = [ - 'Overlap', format_time( + ' Overlap', format_time( overlap_time, unit=time_unit), format_ratio(float(overlap_time) / total_time) ] @@ -896,6 +971,7 @@ def format_ratio(ratio, indent=0): "Note:\nCommunication time: Communication Event time, Communication Op time and its kernel time on gpu.\n" "Computation time: Kernel time, except kernels belong to communication(nccl kernels).\n" "Overlap time: Communication time intersects with computation time.\n" + "The time with ratio 100% is the base time for calculating ratio. \n" "Example:\n" "Communication:\n" " CPU: |_________________|\n" @@ -938,20 +1014,22 @@ def format_ratio(ratio, indent=0): items.items(), key=lambda x: x[1].min_cpu_time) elif sorted_by == SortedKeys.GPUTotal: sorted_items = sorted( - items.items(), key=lambda x: x[1].gpu_time, reverse=True) + items.items(), + key=lambda x: x[1].general_gpu_time, + reverse=True) elif sorted_by == SortedKeys.GPUAvg: sorted_items = sorted( items.items(), - key=lambda x: x[1].avg_gpu_time, + key=lambda x: x[1].avg_general_gpu_time, reverse=True) elif sorted_by == SortedKeys.GPUMax: sorted_items = sorted( items.items(), - key=lambda x: x[1].max_gpu_time, + key=lambda x: x[1].max_general_gpu_time, reverse=True) elif sorted_by == SortedKeys.GPUMin: sorted_items = sorted( - items.items(), key=lambda x: x[1].min_gpu_time) + items.items(), key=lambda x: x[1].min_general_gpu_time) for name, item in sorted_items: row_values = [ @@ -967,14 +1045,15 @@ def format_ratio(ratio, indent=0): format_ratio(float(item.cpu_time) / total_time)), '{} / {} / {} / {} / {}'.format( format_time( - item.gpu_time, unit=time_unit), + item.general_gpu_time, unit=time_unit), format_time( - item.avg_gpu_time, unit=time_unit), + item.avg_general_gpu_time, unit=time_unit), format_time( - item.max_gpu_time, unit=time_unit), + item.max_general_gpu_time, unit=time_unit), format_time( - item.min_gpu_time, unit=time_unit), - format_ratio(float(item.gpu_time) / total_time)) + item.min_general_gpu_time, unit=time_unit), + format_ratio( + float(item.general_gpu_time) / total_time)) ] all_row_values.append(row_values) if op_detail: @@ -998,18 +1077,23 @@ def format_ratio(ratio, indent=0): float(innerop_node.cpu_time) / total_time)), '{} / {} / {} / {} / {}'.format( format_time( - innerop_node.gpu_time, 
unit=time_unit), + innerop_node.general_gpu_time, + unit=time_unit), format_time( - innerop_node.avg_gpu_time, unit=time_unit), + innerop_node.avg_general_gpu_time, + unit=time_unit), format_time( - innerop_node.max_gpu_time, unit=time_unit), + innerop_node.max_general_gpu_time, + unit=time_unit), format_time( - innerop_node.min_gpu_time, unit=time_unit), + innerop_node.min_general_gpu_time, + unit=time_unit), format_ratio( - float(innerop_node.gpu_time) / total_time)) + float(innerop_node.general_gpu_time) / + total_time)) ] all_row_values.append(row_values) - for device_node_name, devicenode in innerop_node.devices.items( + for device_node_name, device_node in innerop_node.devices.items( ): if len(device_node_name) + 4 > name_column_width: device_node_name = device_node_name[: @@ -1018,21 +1102,21 @@ def format_ratio(ratio, indent=0): device_node_name += "..." row_values = [ ' {}'.format(device_node_name), - devicenode.call, '- / - / - / - / -', + device_node.call, '- / - / - / - / -', '{} / {} / {} / {} / {}'.format( format_time( - devicenode.gpu_time, unit=time_unit), + device_node.gpu_time, unit=time_unit), format_time( - devicenode.avg_gpu_time, + device_node.avg_gpu_time, unit=time_unit), format_time( - devicenode.max_gpu_time, + device_node.max_gpu_time, unit=time_unit), format_time( - devicenode.min_gpu_time, + device_node.min_gpu_time, unit=time_unit), format_ratio( - float(devicenode.gpu_time) / + float(device_node.gpu_time) / total_time)) ] all_row_values.append(row_values) @@ -1043,19 +1127,19 @@ def format_ratio(ratio, indent=0): - 5] device_node_name += "..." row_values = [ - ' {}'.format(device_node_name), devicenode.call, + ' {}'.format(device_node_name), device_node.call, '- / - / - / - / -', '{} / {} / {} / {} / {}'.format( format_time( - devicenode.gpu_time, unit=time_unit), + device_node.gpu_time, unit=time_unit), format_time( - devicenode.avg_gpu_time, unit=time_unit), + device_node.avg_gpu_time, unit=time_unit), format_time( - devicenode.max_gpu_time, unit=time_unit), + device_node.max_gpu_time, unit=time_unit), format_time( - devicenode.min_gpu_time, unit=time_unit), + device_node.min_gpu_time, unit=time_unit), format_ratio( - float(devicenode.gpu_time) / total_time)) + float(device_node.gpu_time) / total_time)) ] all_row_values.append(row_values) # Calculate the column width @@ -1123,14 +1207,14 @@ def format_ratio(ratio, indent=0): format_ratio(float(item.cpu_time) / total_time)), '{} / {} / {} / {} / {}'.format( format_time( - item.gpu_time, unit=time_unit), + item.general_gpu_time, unit=time_unit), format_time( - item.avg_gpu_time, unit=time_unit), + item.avg_general_gpu_time, unit=time_unit), format_time( - item.max_gpu_time, unit=time_unit), + item.max_general_gpu_time, unit=time_unit), format_time( - item.min_gpu_time, unit=time_unit), - format_ratio(float(item.gpu_time) / total_time)), + item.min_general_gpu_time, unit=time_unit), + format_ratio(float(item.general_gpu_time) / total_time)), ] all_row_values.append(row_values) @@ -1207,20 +1291,22 @@ def format_ratio(ratio, indent=0): items.items(), key=lambda x: x[1].min_cpu_time) elif sorted_by == SortedKeys.GPUTotal: sorted_items = sorted( - items.items(), key=lambda x: x[1].gpu_time, reverse=True) + items.items(), + key=lambda x: x[1].general_gpu_time, + reverse=True) elif sorted_by == SortedKeys.GPUAvg: sorted_items = sorted( items.items(), - key=lambda x: x[1].avg_gpu_time, + key=lambda x: x[1].avg_general_gpu_time, reverse=True) elif sorted_by == SortedKeys.GPUMax: sorted_items = sorted( items.items(), 
- key=lambda x: x[1].max_gpu_time, + key=lambda x: x[1].max_general_gpu_time, reverse=True) elif sorted_by == SortedKeys.GPUMin: sorted_items = sorted( - items.items(), key=lambda x: x[1].min_gpu_time) + items.items(), key=lambda x: x[1].min_general_gpu_time) for name, item in sorted_items: row_values = [ @@ -1238,14 +1324,15 @@ def format_ratio(ratio, indent=0): format_ratio(float(item.cpu_time) / total_time)), '{} / {} / {} / {} / {}'.format( format_time( - item.gpu_time, unit=time_unit), + item.general_gpu_time, unit=time_unit), format_time( - item.avg_gpu_time, unit=time_unit), + item.avg_general_gpu_time, unit=time_unit), format_time( - item.max_gpu_time, unit=time_unit), + item.max_general_gpu_time, unit=time_unit), format_time( - item.min_gpu_time, unit=time_unit), - format_ratio(float(item.gpu_time) / total_time)), + item.min_general_gpu_time, unit=time_unit), + format_ratio( + float(item.general_gpu_time) / total_time)), ] all_row_values.append(row_values) diff --git a/python/paddle/sparse/__init__.py b/python/paddle/sparse/__init__.py new file mode 100644 index 0000000000000..aff9625469ef2 --- /dev/null +++ b/python/paddle/sparse/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .creation import sparse_coo_tensor +from .creation import sparse_csr_tensor +from .layer.activation import ReLU + +__all__ = ['sparse_coo_tensor', 'sparse_csr_tensor', 'ReLU'] diff --git a/python/paddle/sparse/creation.py b/python/paddle/sparse/creation.py new file mode 100644 index 0000000000000..e29351e3d179c --- /dev/null +++ b/python/paddle/sparse/creation.py @@ -0,0 +1,191 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
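# Editor's sketch (illustrative only, mirroring the docstring examples further down): the two
# constructors defined in this new module are exported as paddle.sparse.sparse_coo_tensor and
# paddle.sparse.sparse_csr_tensor, and both require eager (dynamic) mode via @dygraph_only:
#
#     coo = paddle.sparse.sparse_coo_tensor([[0, 1, 2], [1, 2, 0]], [1.0, 2.0, 3.0], [2, 3])
#     csr = paddle.sparse.sparse_csr_tensor([0, 2, 3, 5], [1, 3, 2, 0, 1], [1, 2, 3, 4, 5], [3, 4])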
+ +from paddle import _C_ops +from ..framework import core, dygraph_only +from ..tensor import to_tensor +from ..tensor import max +from ..fluid.data_feeder import check_variable_and_dtype, check_type, check_dtype, convert_dtype + +__all__ = [ + 'sparse_coo_tensor', + 'sparse_csr_tensor', +] + + +def _handle_dtype(data, dtype): + if dtype: + if convert_dtype(dtype) != convert_dtype(data.dtype): + return data.astype(convert_dtype(dtype)) + return data + + +def _infer_dense_shape(indices): + assert len(indices.shape) == 2 + lens = max(indices, axis=1) + lens = lens + 1 + return list(lens.numpy()) + + +@dygraph_only +def sparse_coo_tensor(indices, + values, + shape=None, + dtype=None, + place=None, + stop_gradient=True): + r""" + Constructs a sparse ``paddle.Tensor`` in coordinate format according to the indices + and values of the specified non-zero elements. + + Args: + indices(list|tuple|ndarray|Tensor): the indices of non-zero elements. + Can be a list, tuple, numpy\.ndarray, paddle\.Tensor. The indices must be 2-D. + values(list|tuple|ndarray|Tensor): Initial values for the tensor. + Can be a scalar, list, tuple, numpy\.ndarray, paddle\.Tensor. + shape(list|tuple, optional): The shape of the sparse tensor also represents the shape of + original dense tensor. If not provided the smallest shape will be inferred to + hold all elements. + dtype(str|np.dtype, optional): The desired data type of returned tensor. Can be 'bool' , 'float16' , + 'float32' , 'float64' , 'int8' , 'int16' , 'int32' , 'int64' , 'uint8', + 'complex64' , 'complex128'. Default: None, infers dtype from ``data`` + except for python float number which gets dtype from ``get_default_type`` . + place(CPUPlace|CUDAPinnedPlace|CUDAPlace|str, optional): The place to allocate Tensor. Can be + CPUPlace, CUDAPinnedPlace, CUDAPlace. Default: None, means global place. If ``place`` is + string, It can be ``cpu``, ``gpu:x`` and ``gpu_pinned``, where ``x`` is the index of the GPUs. + stop_gradient(bool, optional): Whether to block the gradient propagation of Autograd. Default: True. + + Returns: + Tensor: A Tensor constructed from ``indices`` and ``values`` . + + Raises: + TypeError: If the data type of ``values`` is not list, tuple, numpy.ndarray, paddle.Tensor + ValueError: If ``values`` is tuple|list, it can't contain nested tuple|list with different lengths , such as: [[1, 2], [3, 4, 5]]. If the ``indices`` is not a 2-D. + TypeError: If ``dtype`` is not bool, float16, float32, float64, int8, int16, int32, int64, uint8, complex64, complex128 + ValueError: If ``place`` is not paddle.CPUPlace, paddle.CUDAPinnedPlace, paddle.CUDAPlace or specified pattern string. + + Examples: + + .. 
code-block:: python + + import paddle + from paddle.fluid.framework import _test_eager_guard + + with _test_eager_guard(): + indices = [[0, 1, 2], [1, 2, 0]] + values = [1.0, 2.0, 3.0] + dense_shape = [2, 3] + coo = paddle.sparse.sparse_coo_tensor(indices, values, dense_shape) + # print(coo) + # Tensor(shape=[2, 3], dtype=paddle.float32, place=Place(gpu:0), stop_gradient=True, + # indices=[[0, 1, 2], + # [1, 2, 0]], + # values=[1., 2., 3.]) + """ + + if not isinstance(indices, core.eager.Tensor): + indices = to_tensor( + indices, dtype=None, place=place, stop_gradient=True) + if not isinstance(values, core.eager.Tensor): + values = to_tensor(values, dtype, place, stop_gradient) + if len(indices.shape) != 2: + raise ValueError("'indices' must be 2-D.") + if place is not None: + indices = indices._copy_to(place, False) + values = values._copy_to(place, False) + values = _handle_dtype(values, dtype) + if shape is None: + shape = _infer_dense_shape(indices) + return core.eager.sparse_coo_tensor(indices, values, shape, stop_gradient) + + +#TODO: need to support shape is None +@dygraph_only +def sparse_csr_tensor(crows, + cols, + values, + shape, + dtype=None, + place=None, + stop_gradient=True): + r""" + Constructs a sparse ``paddle.Tensor`` in CSR(Compressed Sparse Row) format according to the + ``crows``, ``cols`` and ``values``. + + Args: + crows(list|tuple|ndarray|Tensor): 1-D array, each element in the rows represents the + starting position of the first non-zero element of each row in values. + Can be a list, tuple, numpy\.ndarray, paddle\.Tensor. + cols(list|tuple|ndarray|Tensor): 1-D array, the column of non-zero elements. + Can be a list, tuple, numpy\.ndarray, paddle\.Tensor. + values(list|tuple|ndarray|Tensor): 1-D array, the non-zero elements. + Can be a scalar, list, tuple, numpy\.ndarray, paddle\.Tensor. + shape(list|tuple, optional): The shape of the sparse tensor also represents the shape of + original dense tensor. + hold all elements. + dtype(str|np.dtype, optional): The desired data type of returned tensor. Can be 'bool' , 'float16' , + 'float32' , 'float64' , 'int8' , 'int16' , 'int32' , 'int64' , 'uint8', + 'complex64' , 'complex128'. Default: None, infers dtype from ``data`` + except for python float number which gets dtype from ``get_default_type`` . + place(CPUPlace|CUDAPinnedPlace|CUDAPlace|str, optional): The place to allocate Tensor. Can be + CPUPlace, CUDAPinnedPlace, CUDAPlace. Default: None, means global place. If ``place`` is + string, It can be ``cpu``, ``gpu:x`` and ``gpu_pinned``, where ``x`` is the index of the GPUs. + stop_gradient(bool, optional): Whether to block the gradient propagation of Autograd. Default: True. + + Returns: + Tensor: A Tensor constructed from ``crows``, ``cols`` and ``values`` . + + Raises: + TypeError: If the data type of ``values`` is not list, tuple, numpy.ndarray, paddle.Tensor + ValueError: If ``values`` is tuple|list, it can't contain nested tuple|list with different lengths , such as: [[1, 2], [3, 4, 5]]. If the ``crow``, ``cols`` and ``values`` is not a 2-D. + TypeError: If ``dtype`` is not bool, float16, float32, float64, int8, int16, int32, int64, uint8, complex64, complex128 + ValueError: If ``place`` is not paddle.CPUPlace, paddle.CUDAPinnedPlace, paddle.CUDAPlace or specified pattern string. + + Examples: + + .. 
code-block:: python + + import paddle + from paddle.fluid.framework import _test_eager_guard + + with _test_eager_guard(): + crows = [0, 2, 3, 5] + cols = [1, 3, 2, 0, 1] + values = [1, 2, 3, 4, 5] + dense_shape = [3, 4] + csr = paddle.sparse.sparse_csr_tensor(crows, cols, values, dense_shape) + # print(csr) + # Tensor(shape=[3, 4], dtype=paddle.int64, place=Place(gpu:0), stop_gradient=True, + # crows=[0, 2, 3, 5], + # cols=[1, 3, 2, 0, 1], + # values=[1, 2, 3, 4, 5]) + """ + if not isinstance(crows, core.eager.Tensor): + crows = to_tensor(crows, dtype=None, place=place, stop_gradient=True) + if not isinstance(cols, core.eager.Tensor): + cols = to_tensor(cols, dtype=None, place=place, stop_gradient=True) + if not isinstance(values, core.eager.Tensor): + values = to_tensor(values, dtype, place, stop_gradient) + if len(crows.shape) != 1 or len(cols.shape) != 1 or len(values.shape) != 1: + raise ValueError( + "SparseCsrTensor only support 2-D or 3-D matrix. The 'crows', 'cols' and 'values' must be 1-D." + ) + + if place is not None: + crows = crows._copy_to(place, False) + cols = cols._copy_to(place, False) + values = values._copy_to(place, False) + values = _handle_dtype(values, dtype) + return core.eager.sparse_csr_tensor(crows, cols, values, shape, + stop_gradient) diff --git a/python/paddle/sparse/functional/__init__.py b/python/paddle/sparse/functional/__init__.py new file mode 100644 index 0000000000000..f4c5b33a5a7ea --- /dev/null +++ b/python/paddle/sparse/functional/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .activation import relu # noqa: F401 + +__all__ = ['relu'] diff --git a/python/paddle/sparse/functional/activation.py b/python/paddle/sparse/functional/activation.py new file mode 100644 index 0000000000000..c0109bc4e2429 --- /dev/null +++ b/python/paddle/sparse/functional/activation.py @@ -0,0 +1,53 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +__all__ = [] + +from paddle import _C_ops, in_dynamic_mode + + +def relu(x, name=None): + """ + sparse relu activation. + + .. math:: + + out = max(x, 0) + + Parameters: + x (Tensor): The input Sparse Tensor with data type float32, float64. + name (str, optional): Name for the operation (optional, default is None). + For more information, please refer to :ref:`api_guide_Name`. + + Returns: + A Sparse Tensor with the same data type and shape as ``x`` . + + Examples: + .. 
code-block:: python + + import paddle + import numpy as np + from paddle.fluid.framework import _test_eager_guard + + with _test_eager_guard(): + dense_x = paddle.to_tensor(np.array([-2, 0, 1]).astype('float32')) + sparse_x = dense_x.to_sparse_coo(1) + out = paddle.sparse.functional.relu(sparse_x) + """ + + assert in_dynamic_mode(), "Currently, Sparse API only support dynamic mode" + assert x.is_sparse_coo( + ), "Currently, sparse.relu only support the input of SparseCooTensor" + + return _C_ops.final_state_sparse_relu(x) diff --git a/python/paddle/sparse/layer/__init__.py b/python/paddle/sparse/layer/__init__.py new file mode 100644 index 0000000000000..66abce260b6f7 --- /dev/null +++ b/python/paddle/sparse/layer/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .activation import ReLU + +__all__ = [] diff --git a/python/paddle/sparse/layer/activation.py b/python/paddle/sparse/layer/activation.py new file mode 100644 index 0000000000000..ad0dbc1880782 --- /dev/null +++ b/python/paddle/sparse/layer/activation.py @@ -0,0 +1,61 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .. import functional as F +from paddle.nn import Layer + +__all__ = [] + + +class ReLU(Layer): + """ + Sparse ReLU Activation. + + .. math:: + + ReLU(x) = max(x, 0) + + Parameters: + name (str, optional): Name for the operation (optional, default is None). + For more information, please refer to :ref:`api_guide_Name`. + + Shape: + - input: Sparse Tensor with any shape. + - output: Sparse Tensor with the same shape as input. + + Examples: + .. code-block:: python + + import paddle + from paddle.fluid.framework import _test_eager_guard + with _test_eager_guard(): + x = [[0, -1, 0, 2], [0, 0, -3, 0], [4, 5, 0, 0]] + dense_x = paddle.to_tensor(x, dtype='float32') + sparse_dim = 2 + sparse_x = dense_x.to_sparse_coo(sparse_dim) + relu = paddle.sparse.ReLU() + out = relu(sparse_x) + #out.values: [0., 2., 0., 4., 5.] 
+ """ + + def __init__(self, name=None): + super(ReLU, self).__init__() + self._name = name + + def forward(self, x): + return F.relu(x, self._name) + + def extra_repr(self): + name_str = 'name={}'.format(self._name) if self._name else '' + return name_str diff --git a/python/paddle/tensor/__init__.py b/python/paddle/tensor/__init__.py index 86f26d1b21b8a..3c4647d4d6b68 100755 --- a/python/paddle/tensor/__init__.py +++ b/python/paddle/tensor/__init__.py @@ -165,6 +165,7 @@ from .math import stanh # noqa: F401 from .math import sum # noqa: F401 from .math import nansum # noqa: F401 +from .math import nanmean # noqa: F401 from .math import tanh # noqa: F401 from .math import tanh_ # noqa: F401 from .math import add_n # noqa: F401 @@ -334,6 +335,7 @@ 'stanh', 'sum', 'nansum', + 'nanmean', 'tanh', 'tanh_', 'add_n', diff --git a/python/paddle/tensor/attribute.py b/python/paddle/tensor/attribute.py index b851f6db4acab..07db7794b6d98 100644 --- a/python/paddle/tensor/attribute.py +++ b/python/paddle/tensor/attribute.py @@ -18,12 +18,13 @@ from ..fluid.layer_helper import LayerHelper from ..fluid.data_feeder import check_variable_and_dtype -# TODO: define functions to get tensor attributes +# TODO: define functions to get tensor attributes from ..fluid.layers import rank # noqa: F401 from ..fluid.layers import shape # noqa: F401 import paddle from paddle import _C_ops from paddle.static import Variable +from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode __all__ = [] @@ -185,7 +186,9 @@ def real(x, name=None): # [[1., 2., 3.], # [4., 5., 6.]]) """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_real(x) + if _in_legacy_dygraph(): return _C_ops.real(x) check_variable_and_dtype(x, 'x', ['complex64', 'complex128'], 'real') @@ -229,7 +232,9 @@ def imag(x, name=None): # [[6., 5., 4.], # [3., 2., 1.]]) """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_imag(x) + if _in_legacy_dygraph(): return _C_ops.imag(x) check_variable_and_dtype(x, 'x', ['complex64', 'complex128'], 'imag') diff --git a/python/paddle/tensor/creation.py b/python/paddle/tensor/creation.py index 6e7e5678be0b0..95f145cf447b5 100644 --- a/python/paddle/tensor/creation.py +++ b/python/paddle/tensor/creation.py @@ -224,7 +224,10 @@ def full_like(x, fill_value, dtype=None, name=None): if not isinstance(dtype, core.VarDesc.VarType): dtype = convert_np_dtype_to_dtype_(dtype) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_full_like(x, fill_value, dtype, x.place) + + if _in_legacy_dygraph(): return _C_ops.fill_any_like(x, 'value', fill_value, 'dtype', dtype) helper = LayerHelper("full_like", **locals()) @@ -658,7 +661,10 @@ def tril(x, diagonal=0, name=None): # [ 9, 10, 0, 0]]) """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_tril_triu(x, diagonal, True) + + if _in_legacy_dygraph(): op = getattr(_C_ops, 'tril_triu') return op(x, 'diagonal', diagonal, "lower", True) @@ -725,7 +731,10 @@ def triu(x, diagonal=0, name=None): # [ 0, 10, 11, 12]]) """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_tril_triu(x, diagonal, False) + + if _in_legacy_dygraph(): op = getattr(_C_ops, 'tril_triu') return op(x, 'diagonal', diagonal, "lower", False) @@ -767,10 +776,12 @@ def meshgrid(*args, **kwargs): if len(args) == 1 and isinstance(args[0], (list, tuple)): args = args[0] - if paddle.in_dynamic_mode(): + if _in_legacy_dygraph(): num = len(args) out = 
_C_ops.meshgrid(list(args), num) return out + if in_dygraph_mode(): + return _C_ops.final_state_meshgrid(list(args)) name = kwargs.get("name", None) helper = LayerHelper('meshgrid', **locals()) diff --git a/python/paddle/tensor/layer_function_generator.py b/python/paddle/tensor/layer_function_generator.py new file mode 100644 index 0000000000000..ecb13613a125e --- /dev/null +++ b/python/paddle/tensor/layer_function_generator.py @@ -0,0 +1,382 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import re +import functools +import warnings +import string + +from six.moves import cStringIO +from ..static import Variable +from ..fluid.proto import framework_pb2 +from ..framework import OpProtoHolder, core, convert_np_dtype_to_dtype_ +from ..framework import LayerHelper +from ..fluid.data_feeder import check_variable_and_dtype +import paddle +from paddle import _C_ops + +__all__ = [] + + +def _convert_(name): + """ + Formatting. + + Args: + name: The name/alias + + This function takes in a name and converts it to a standard format of + group1_group2. Where as per the regular expression, group1 can have + alphabets and numbers and group2 has capital alphabets. + + """ + s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name) + return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower() + + +def _type_to_str_(tp): + return framework_pb2.AttrType.Name(tp) + + +_two_dollar_pattern_ = re.compile(r"\$\$([^\$]+)\$\$") +_single_dollar_pattern_ = re.compile(r"\$([^\$]+)\$") +_two_bang_pattern_ = re.compile(r"!!([^!]+)!!") + + +def escape_math(text): + #return _two_bang_pattern_.sub( + # r'$$\1$$', + # _single_dollar_pattern_.sub(r':math:\n`\1`', + # _two_dollar_pattern_.sub(r"!!\1!!", text))) + return _two_dollar_pattern_.sub(r':math:`\1`', text) + + +def _generate_doc_string_(op_proto, + additional_args_lines=None, + skip_attrs_set=None): + """ + Generate docstring by OpProto + + Args: + op_proto (framework_pb2.OpProto): a protobuf message typed OpProto + + Returns: + str: the document string + """ + + if not isinstance(op_proto, framework_pb2.OpProto): + raise TypeError("OpProto should be `framework_pb2.OpProto`") + + buf = cStringIO() + buf.write(escape_math(op_proto.comment)) + buf.write('\nArgs:\n') + for each_input in op_proto.inputs: + line_begin = ' {0}'.format(_convert_(each_input.name)) + buf.write(line_begin) + buf.write(" (Tensor): ") + buf.write(escape_math(each_input.comment)) + if each_input.duplicable: + buf.write(" Duplicatable.") + if each_input.dispensable: + buf.write(" Optional.") + buf.write('\n') + + skip_attrs = OpProtoHolder.generated_op_attr_names() + # attr use_mkldnn and is_test also should not be visible to users. 
+ skip_attrs.add("use_mkldnn") + skip_attrs.add("is_test") + skip_attrs.add("use_cudnn") + + if skip_attrs_set: + for t in skip_attrs_set: + skip_attrs.add(t) + + for each_attr in op_proto.attrs: + if each_attr.name in skip_attrs: + continue + buf.write(' ') + buf.write(each_attr.name) + buf.write(' (') + buf.write(_type_to_str_(each_attr.type)) + buf.write('): ') + buf.write(escape_math(each_attr.comment)) + buf.write('\n') + + if additional_args_lines is not None: + for line in additional_args_lines: + line = line.strip() + buf.write(' ') + buf.write(line) + buf.write('\n') + + if len(op_proto.outputs) != 0: + buf.write('\nReturns:\n') + buf.write(' ') + for each_opt in op_proto.outputs: + if not each_opt.intermediate: + break + buf.write(_convert_(each_opt.name)) + buf.write(' (Tensor): ') + buf.write(escape_math(each_opt.comment)) + + return buf.getvalue() + + +def generate_layer_fn(op_type): + """Register the Python layer for an Operator. + + Args: + op_type: The name of the operator to be created. + + This function takes in the operator type (sigmoid, mean , average etc) and + creates the operator functionality. + + """ + op_proto = OpProtoHolder.instance().get_op_proto(op_type) + not_intermediate_outputs = \ + [output for output in op_proto.outputs if not output.intermediate] + intermediate_outputs = \ + [output for output in op_proto.outputs if output.intermediate] + + if len(not_intermediate_outputs) != 1: + raise ValueError("Only one non intermediate output operator can be", + "automatically generated. {0}".format(op_type)) + + if not_intermediate_outputs[0].duplicable: + raise ValueError( + "Only non duplicable op can be automatically generated.") + + for output in intermediate_outputs: + if output.duplicable: + raise ValueError("The op can be automatically generated only when ", + "all intermediate ops are not duplicable.") + + o_name = not_intermediate_outputs[0].name + intermediate_output_names = [output.name for output in intermediate_outputs] + + def infer_and_check_dtype(op_proto, *args, **kwargs): + """ + This function performs the sanity check for dtype and + instance type. + """ + dtype = None + for ipt in op_proto.inputs: + name = _convert_(ipt.name) + val = kwargs.pop(name, []) + if not isinstance(val, list) and not isinstance(val, tuple): + val = [val] + if len(val) == 0: + if len(args) == 0: + continue + val = [args[0]] + args = args[1:] + + for each in val: + if not isinstance(each, Variable): + raise ValueError("input of {0} must be variable".format( + op_type)) + + if dtype is None: + dtype = each.dtype + elif dtype != each.dtype: + raise ValueError( + "operator {0} must input same dtype. 
{1} vs {2}".format( + op_type, dtype, each.dtype)) + + if dtype is None: + arg_dtype = kwargs.get("dtype") + if arg_dtype: + if not isinstance(arg_dtype, core.VarDesc.VarType): + dtype = convert_np_dtype_to_dtype_(arg_dtype) + else: + dtype = arg_dtype + else: + dtype = core.VarDesc.VarType.FP32 + return dtype + + def func(*args, **kwargs): + helper = LayerHelper(op_type, **kwargs) + + dtype = infer_and_check_dtype(op_proto, *args, **kwargs) + + inputs = dict() + for ipt in op_proto.inputs: + name = _convert_(ipt.name) + val = kwargs.pop(name, []) + if not isinstance(val, list) and not isinstance(val, tuple): + val = [val] + if len(val) == 0 and len(args) != 0: + val = args[0] + args = args[1:] + inputs[ipt.name] = val + + outputs = dict() + out = kwargs.pop(_convert_(o_name), []) + if out: + out_var = out[0] if (isinstance(out, list) or + isinstance(out, tuple)) else out + else: + out_var = helper.create_variable_for_type_inference(dtype=dtype) + outputs[o_name] = [out_var] + for name in intermediate_output_names: + outputs[name] = [ + helper.create_variable_for_type_inference(dtype=dtype) + ] + helper.append_op( + type=op_type, inputs=inputs, outputs=outputs, attrs=kwargs) + return helper.append_activation(out_var) + + func.__name__ = op_type + func.__doc__ = _generate_doc_string_(op_proto) + return func + + +def generate_activation_fn(op_type): + """Register the Python layer for an Operator without Attribute. + + Args: + op_type: The name of the operator to be created. + + This function takes in the operator type (sigmoid, exp , tanh etc) and + creates the operator functionality. + + """ + op_proto = OpProtoHolder.instance().get_op_proto(op_type) + + def func(x, name=None): + if paddle.in_dynamic_mode(): + op = getattr(_C_ops, op_type) + return op(x) + + if op_type not in ["abs", "exp", "square"]: + check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], + op_type) + else: + # abs exp square ops support dtype(int32, int64, float16, float32, float64) + check_variable_and_dtype( + x, 'x', ['int32', 'int64', 'float16', 'float32', 'float64'], + op_type) + + helper = LayerHelper(op_type, **locals()) + + output = helper.create_variable_for_type_inference(dtype=x.dtype) + helper.append_op(type=op_type, inputs={"X": x}, outputs={"Out": output}) + return output + + func.__name__ = op_type + func.__doc__ = _generate_doc_string_( + op_proto, + additional_args_lines=[ + "name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`." + ]) + return func + + +def generate_inplace_fn(inplace_op_type): + """Register the Python layer for an Inplace Operator without Attribute. + + Args: + inplace_op_type: The name of the inplace operator to be created. + + This function takes in the inplace operator type (exp_ , ceil_ etc) and + creates the operator functionality. + """ + origin_op_type = inplace_op_type[:-1] + + def func(x, name=None): + if paddle.in_dynamic_mode(): + op = getattr(_C_ops, inplace_op_type) + return op(x) + warnings.warn( + "In static mode, {}() is the same as {}() and does not perform inplace operation.". + format(inplace_op_type, origin_op_type)) + return generate_activation_fn(origin_op_type)(x, name) + + func.__name__ = inplace_op_type + func.__doc__ = """ +Inplace version of ``{0}`` API, the output Tensor will be inplaced with input ``x``. +Please refer to :ref:`api_fluid_layers_{1}`. 
+""".format(origin_op_type, origin_op_type) + + return func + + +def templatedoc(op_type=None): + """ + Decorator of layer function. It will use the docstring from the layer + function as the template. The template arguments are: + + * ${comment}: The operator comment written in CPP. + * ${{name}_comment}: The comment of ${name} written with AddAttr, AddOutput, + and AddInput. The ${name} is Python snake style. i.e., xxx_xxx. + * ${{name}_type}: The type of ${name}. + + Returns: + Decorated function. + """ + + def trim_ending_dot(msg): + return msg.rstrip('.') + + def __impl__(func): + if op_type is None: + op_type_name = func.__name__ + else: + op_type_name = op_type + op_proto = OpProtoHolder.instance().get_op_proto(op_type_name) + tmpl = string.Template(func.__doc__) + + comment_lines = op_proto.comment.split("\n") + comment = "" + for line in comment_lines: + line = line.strip() + if len(line) != 0: + comment += escape_math(line) + comment += " " + elif len(comment) != 0: + comment += "\n \n " + + args = {"comment": trim_ending_dot(comment)} + for each_input in op_proto.inputs: + input_name = _convert_(each_input.name) + args["{0}_comment".format(input_name)] = trim_ending_dot( + each_input.comment) + args["{0}_type".format(input_name)] = "Variable" + for each_attr in op_proto.attrs: + input_name = _convert_(each_attr.name) + args["{0}_comment".format(input_name)] = trim_ending_dot( + each_attr.comment) + args["{0}_type".format(input_name)] = _type_to_str_(each_attr.type) + + for each_opt in op_proto.outputs: + output_name = _convert_(each_opt.name) + args["{0}_comment".format(output_name)] = trim_ending_dot( + each_opt.comment) + args["{0}_type".format(output_name)] = "Variable" + func.__doc__ = tmpl.substitute(args) + return func + + return __impl__ + + +def add_sample_code(func, sample_code): + """ + Append sample code for dynamically generated functions. + + Args: + func: The function of the function to be append sample code to. + sample_code: sample code session in rst format. + """ + func.__doc__ = func.__doc__ + sample_code diff --git a/python/paddle/tensor/linalg.py b/python/paddle/tensor/linalg.py index 7901379d9c793..38616026f128a 100644 --- a/python/paddle/tensor/linalg.py +++ b/python/paddle/tensor/linalg.py @@ -17,7 +17,7 @@ from ..framework import _varbase_creator, _dygraph_tracer from ..fluid.data_feeder import check_variable_and_dtype, check_type, check_dtype from ..static import Variable -from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode +from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode, _non_static_mode from ..fluid.layers import transpose, cast # noqa: F401 from ..fluid import layers import paddle @@ -27,6 +27,9 @@ __all__ = [] +# Consistent with kDefaultDim from C++ Backend +K_DEFAULT_DIM = 9 + def matmul(x, y, transpose_x=False, transpose_y=False, name=None): """ @@ -251,7 +254,12 @@ def frobenius_norm(input, dim=None, keepdim=False, name=None): raise ValueError( "The dim of frobenius norm op should be None or two elements list!" ) - if paddle.in_dynamic_mode(): + + if in_dygraph_mode(): + if dim is None: + return _C_ops.final_state_frobenius_norm(input, keepdim, True) + return _C_ops.final_state_frobenius_norm(input, dim, keepdim, False) + if _in_legacy_dygraph(): if dim is None: return _C_ops.frobenius_norm(input, 'keep_dim', keepdim, 'reduce_all', True) @@ -288,10 +296,16 @@ def vector_norm(input, axis (int, optional): None for last dimension. keepdim (bool, optional): Whether keep the dimensions as the `input`, Default False. 
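A brief editorial illustration (not part of the patch) of what these two helpers compute; ``frobenius_norm`` and ``vector_norm`` are the internal branches that ``paddle.norm`` dispatches to:

        .. code-block:: python

            import paddle

            v = paddle.to_tensor([3.0, -4.0])
            print(paddle.norm(v, p=2))        # 5.0        -> vector p-norm: (sum(|v_i|**p))**(1/p)

            m = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0]])
            print(paddle.norm(m, p='fro'))    # 5.4772258  -> Frobenius norm: sqrt(sum(m_ij**2))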
""" - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if axis is None: axis = -1 + return _C_ops.final_state_p_norm(input, porder, axis, 1e-12, + keepdim, asvector) + + if _in_legacy_dygraph(): if axis is None: axis = -1 return _C_ops.p_norm(input, 'porder', porder, 'axis', axis, 'keepdim', keepdim, 'asvector', asvector) + if porder is not None: check_type(porder, 'porder', (float, int), 'p_norm') if axis is not None: @@ -537,6 +551,9 @@ def dist(x, y, p=2, name=None): out = paddle.dist(x, y, float("-inf")) print(out) # out = [0.] """ + if in_dygraph_mode(): + return _C_ops.final_state_dist(x, y, p) + check_variable_and_dtype(x, 'dtype', ['float32', 'float64'], 'dist') check_variable_and_dtype(y, 'dtype', ['float32', 'float64'], 'dist') check_type(p, 'p', (float, int), 'dist') @@ -1112,7 +1129,7 @@ def t(input, name=None): return out -def cross(x, y, axis=None, name=None): +def cross(x, y, axis=9, name=None): """ Computes the cross product between two tensors along an axis. @@ -1122,7 +1139,7 @@ def cross(x, y, axis=None, name=None): Args: x (Tensor): The first input tensor. y (Tensor): The second input tensor. - axis (int, optional): The axis along which to compute the cross product. It defaults to the first axis found with the length 3. + axis (int, optional): The axis along which to compute the cross product. It defaults to be 9 which indicates using the first axis found with the length 3. name (str, optional): Name for the operation (optional, default is None). For more information, please refer to :ref:`api_guide_Name`. Returns: @@ -1151,6 +1168,7 @@ def cross(x, y, axis=None, name=None): # [0. 0. 0.]] """ if in_dygraph_mode(): + axis = K_DEFAULT_DIM if axis is None else axis return _C_ops.final_state_cross(x, y, axis) else: if _in_legacy_dygraph(): @@ -1212,8 +1230,12 @@ def cholesky(x, upper=False, name=None): # [1.25450498 0.05600871 0.06400121]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_cholesky(x, upper) + + if _in_legacy_dygraph(): return _C_ops.cholesky(x, "upper", upper) + check_variable_and_dtype(x, 'dtype', ['float32', 'float64'], 'cholesky') check_type(upper, 'upper', bool, 'cholesky') helper = LayerHelper('cholesky', **locals()) @@ -1265,8 +1287,26 @@ def matrix_rank(x, tol=None, hermitian=False, name=None): # [1, 1, 1, 1]] """ + if in_dygraph_mode(): + if isinstance(tol, Variable): + if tol.dtype != x.dtype: + tol_tensor = cast(tol, x.dtype) + else: + tol_tensor = tol + use_default_tol = False + return _C_ops.final_state_matrix_rank_tol( + x, tol_tensor, use_default_tol, hermitian) - if paddle.in_dynamic_mode(): + if tol is None: + tol_attr = 0.0 + use_default_tol = True + else: + tol_attr = float(tol) + use_default_tol = False + return _C_ops.final_state_matrix_rank(x, tol_attr, use_default_tol, + hermitian) + + if _in_legacy_dygraph(): if tol is None: tol_tensor = None tol_attr = 0.0 @@ -1397,7 +1437,10 @@ def histogram(input, bins=100, min=0, max=0, name=None): result = paddle.histogram(inputs, bins=4, min=0, max=3) print(result) # [0, 2, 1, 0] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_histogram(input, bins, min, max) + + if _in_legacy_dygraph(): return _C_ops.histogram(input, "bins", bins, "min", min, "max", max) helper = LayerHelper('histogram', **locals()) @@ -1444,7 +1487,7 @@ def bincount(x, weights=None, minlength=0, name=None): if x.dtype not in [paddle.int32, paddle.int64]: raise TypeError("Elements in Input(x) should all be integers") - if paddle.in_dynamic_mode(): 
+ if _non_static_mode(): return _C_ops.bincount(x, weights, "minlength", minlength) helper = LayerHelper('bincount', **locals()) @@ -1555,7 +1598,10 @@ def det(x, name=None): """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_det(x) + + if _in_legacy_dygraph(): return _C_ops.determinant(x) check_dtype(x.dtype, 'Input', ['float32', 'float64'], 'det') @@ -1758,7 +1804,10 @@ def matrix_power(x, n, name=None): # [-7.66666667 , 8. , -1.83333333 ], # [ 1.80555556 , -1.91666667 , 0.44444444 ]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_matrix_power(x, n) + + if _in_legacy_dygraph(): return _C_ops.matrix_power(x, "n", n) check_variable_and_dtype(x, 'dtype', ['float32', 'float64'], 'matrix_power') @@ -2276,7 +2325,10 @@ def eigh(x, UPLO='L', name=None): #[ 0.3826834323650898j , -0.9238795325112867j ]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_eigh(x, UPLO) + + if _in_legacy_dygraph(): return _C_ops.eigh(x, 'UPLO', UPLO) def __check_input(x, UPLO): @@ -2746,7 +2798,10 @@ def cholesky_solve(x, y, upper=False, name=None): print(out) # [-2.5, -7, 9.5] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_cholesky_solve(x, y, upper) + + if _in_legacy_dygraph(): return _C_ops.cholesky_solve(x, y, 'upper', upper) helper = LayerHelper("cholesky_solve", **locals()) diff --git a/python/paddle/tensor/logic.py b/python/paddle/tensor/logic.py index 03b64e2b828df..636b2ef17c6a0 100755 --- a/python/paddle/tensor/logic.py +++ b/python/paddle/tensor/logic.py @@ -14,11 +14,15 @@ from ..fluid.layer_helper import LayerHelper from ..fluid.data_feeder import check_type, check_variable_and_dtype -from ..fluid.layers.layer_function_generator import templatedoc +from .layer_function_generator import templatedoc from ..static import Variable -from ..framework import VarBase as Tensor from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode -# TODO: define logic functions of a tensor +# TODO: define logic functions of a tensor +import paddle.fluid as fluid +if fluid.framework._in_eager_mode_: + Tensor = fluid.framework.core.eager.Tensor +else: + from ..framework import VarBase as Tensor from ..fluid.layers import is_empty # noqa: F401 from ..fluid.layers import logical_and # noqa: F401 from ..fluid.layers import logical_not # noqa: F401 @@ -122,11 +126,17 @@ def allclose(x, y, rtol=1e-05, atol=1e-08, equal_nan=False, name=None): # [True] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + # NOTE(dev): Pass tol as Tensor to fix precision loss problem, because + # C++ backend will cast it into float32 if passing float from python. 
+ as_tensor = lambda x: paddle.to_tensor([x], dtype='float64', place='cpu') + return _C_ops.final_state_allclose(x, y, + as_tensor(rtol), + as_tensor(atol), equal_nan) + if _in_legacy_dygraph(): return _C_ops.allclose(x, y, 'rtol', str(rtol), 'atol', str(atol), 'equal_nan', equal_nan) - check_variable_and_dtype(x, "input", ['float32', 'float64'], 'allclose') check_variable_and_dtype(y, "input", ['float32', 'float64'], 'allclose') check_type(rtol, 'rtol', float, 'allclose') @@ -181,7 +191,8 @@ def equal(x, y, name=None): y = full(shape=[1], dtype=x.dtype, fill_value=y) if in_dygraph_mode(): - return _C_ops.final_state_equal(x, y) + default_axis = -1 + return _C_ops.final_state_equal(x, y, default_axis) else: if _in_legacy_dygraph(): return _C_ops.equal(x, y) @@ -230,7 +241,8 @@ def greater_equal(x, y, name=None): print(result1) # result1 = [True False True] """ if in_dygraph_mode(): - return _C_ops.final_state_greater_equal(x, y) + default_axis = -1 + return _C_ops.final_state_greater_equal(x, y, default_axis) else: if _in_legacy_dygraph(): return _C_ops.greater_equal(x, y) @@ -279,7 +291,7 @@ def greater_than(x, y, name=None): print(result1) # result1 = [False False True] """ if in_dygraph_mode(): - return _C_ops.final_state_greater_than(x, y) + return _C_ops.final_state_greater_than(x, y, -1) else: if _in_legacy_dygraph(): return _C_ops.greater_than(x, y) @@ -329,7 +341,8 @@ def less_equal(x, y, name=None): print(result1) # result1 = [True True False] """ if in_dygraph_mode(): - return _C_ops.final_state_less_equal(x, y) + axis = -1 + return _C_ops.final_state_less_equal(x, y, axis) else: if _in_legacy_dygraph(): return _C_ops.less_equal(x, y) @@ -379,7 +392,8 @@ def less_than(x, y, name=None): print(result1) # result1 = [False True False] """ if in_dygraph_mode(): - return _C_ops.final_state_less_than(x, y) + default_axis = -1 + return _C_ops.final_state_less_than(x, y, default_axis) else: if _in_legacy_dygraph(): return _C_ops.less_than(x, y) @@ -429,7 +443,8 @@ def not_equal(x, y, name=None): print(result1) # result1 = [False True True] """ if in_dygraph_mode(): - return _C_ops.final_state_not_equal(x, y) + axis = -1 + return _C_ops.final_state_not_equal(x, y, axis) else: if _in_legacy_dygraph(): return _C_ops.not_equal(x, y) @@ -536,6 +551,8 @@ def bitwise_and(x, y, out=None, name=None): res = paddle.bitwise_and(x, y) print(res) # [0, 2, 1] """ + if in_dygraph_mode() and out is None: + return _C_ops.final_state_bitwise_and(x, y) return _bitwise_op( op_name="bitwise_and", x=x, y=y, name=name, out=out, binary_op=True) @@ -562,6 +579,9 @@ def bitwise_or(x, y, out=None, name=None): res = paddle.bitwise_or(x, y) print(res) # [-1, -1, -3] """ + if in_dygraph_mode() and out is None: + return _C_ops.final_state_bitwise_or(x, y) + return _bitwise_op( op_name="bitwise_or", x=x, y=y, name=name, out=out, binary_op=True) @@ -588,6 +608,8 @@ def bitwise_xor(x, y, out=None, name=None): res = paddle.bitwise_xor(x, y) print(res) # [-1, -3, -4] """ + if in_dygraph_mode() and out is None: + return _C_ops.final_state_bitwise_xor(x, y) return _bitwise_op( op_name="bitwise_xor", x=x, y=y, name=name, out=out, binary_op=True) @@ -612,6 +634,8 @@ def bitwise_not(x, out=None, name=None): res = paddle.bitwise_not(x) print(res) # [4, 0, -2] """ + if in_dygraph_mode() and out is None: + return _C_ops.final_state_bitwise_not(x) return _bitwise_op( op_name="bitwise_not", x=x, y=None, name=name, out=out, binary_op=False) @@ -669,7 +693,14 @@ def isclose(x, y, rtol=1e-05, atol=1e-08, equal_nan=False, name=None): # 
[True, True] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + # NOTE(dev): Pass tol as Tensor to fix precision loss problem, because + # C++ backend will cast it into float32 if passing float from python. + as_tensor = lambda x: paddle.to_tensor([x], dtype='float64', place='cpu') + return _C_ops.final_state_isclose(x, y, + as_tensor(rtol), + as_tensor(atol), equal_nan) + if _in_legacy_dygraph(): return _C_ops.isclose(x, y, 'rtol', str(rtol), 'atol', str(atol), 'equal_nan', equal_nan) diff --git a/python/paddle/tensor/manipulation.py b/python/paddle/tensor/manipulation.py index 7921c7798be3a..d8021f36c211c 100755 --- a/python/paddle/tensor/manipulation.py +++ b/python/paddle/tensor/manipulation.py @@ -17,7 +17,7 @@ from ..static import Variable, device_guard from ..framework import core -from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode, _in_eager_without_dygraph_check +from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode, _in_eager_without_dygraph_check, _non_static_mode from ..fluid.layer_helper import LayerHelper from ..framework import OpProtoHolder, convert_np_dtype_to_dtype_, dygraph_only from ..fluid.data_feeder import convert_dtype, check_variable_and_dtype, check_type, check_dtype @@ -76,9 +76,6 @@ def fill_(x, value): float(value), "value_int", int(value)) -setattr(core.VarBase, 'fill_', fill_) - - @dygraph_only def zero_(x): """ @@ -107,9 +104,6 @@ def zero_(x): return _C_ops.fill_any_(x, "value_float", 0., "value_int", int(0)) -setattr(core.VarBase, 'zero_', zero_) - - @dygraph_only def fill_diagonal_(x, value, offset=0, wrap=False, name=None): """ @@ -156,9 +150,6 @@ def fill_diagonal_(x, value, offset=0, wrap=False, name=None): True) -setattr(core.VarBase, 'fill_diagonal_', fill_diagonal_) - - def _fill_diagonal_tensor_impl(x, y, offset=0, dim1=0, dim2=1, inplace=False): inshape = x.shape assert dim1 < len(inshape) and dim1 >= -len(inshape), ( @@ -226,9 +217,6 @@ def fill_diagonal_tensor_(x, y, offset=0, dim1=0, dim2=1, name=None): x, y, offset=offset, dim1=dim1, dim2=dim2, inplace=True) -setattr(core.VarBase, 'fill_diagonal_tensor_', fill_diagonal_tensor_) - - def fill_diagonal_tensor(x, y, offset=0, dim1=0, dim2=1, name=None): """ This function fill the source Tensor y into the x Tensor's diagonal. 
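The recurring change throughout this patch is the same three-way dispatch: the new eager mode calls a ``final_state_*`` kernel with positional arguments, legacy dygraph keeps the old attribute-style ``_C_ops`` call, and static graph still builds the operator through ``LayerHelper``. A schematic sketch of that template (editorial; ``some_op`` / ``final_state_some_op`` are placeholders, not real kernels):

.. code-block:: python

    from paddle import _C_ops
    from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph
    from paddle.fluid.layer_helper import LayerHelper

    def some_op(x, name=None):
        if in_dygraph_mode():
            # new eager dygraph: positional-argument "final state" kernel
            return _C_ops.final_state_some_op(x)
        if _in_legacy_dygraph():
            # old dygraph: legacy kernel with string-keyed attributes
            return _C_ops.some_op(x)
        # static graph: append the operator to the program
        helper = LayerHelper('some_op', **locals())
        out = helper.create_variable_for_type_inference(dtype=x.dtype)
        helper.append_op(type='some_op', inputs={'X': x}, outputs={'Out': out})
        return out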
@@ -262,12 +250,6 @@ def fill_diagonal_tensor(x, y, offset=0, dim1=0, dim2=1, name=None): x, y, offset=offset, dim1=dim1, dim2=dim2, inplace=False) -setattr(core.VarBase, 'fill_diagonal_tensor', fill_diagonal_tensor) - -if _in_eager_without_dygraph_check(): - setattr(core.eager.Tensor, 'fill_diagonal_tensor', fill_diagonal_tensor) - - @dygraph_only def tolist(x): """ @@ -301,9 +283,6 @@ def tolist(x): return x.numpy().tolist() -setattr(core.VarBase, 'tolist', tolist) - - def concat(x, axis=0, name=None): """ @@ -479,6 +458,10 @@ def flip(x, axis, name=None): """ if isinstance(axis, int): axis = [axis] + + if in_dygraph_mode(): + return _C_ops.final_state_flip(x, axis) + if paddle.in_dynamic_mode(): return _C_ops.flip(x, "axis", axis) @@ -697,7 +680,11 @@ def flatten(x, start_axis=0, stop_axis=-1, name=None): if start_axis > stop_axis: raise ValueError("The stop_axis should be larger than stat_axis") - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + dy_out, _ = _C_ops.final_state_flatten(x, start_axis, stop_axis) + return dy_out + + if _in_legacy_dygraph(): dy_out, _ = _C_ops.flatten_contiguous_range(x, 'start_axis', start_axis, 'stop_axis', stop_axis) return dy_out @@ -796,7 +783,10 @@ def roll(x, shifts, axis=None, name=None): else: axis = [] - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_roll(x, shifts, axis) + + if _in_legacy_dygraph(): return _C_ops.roll(x, 'axis', axis, 'shifts', shifts) helper = LayerHelper("roll", **locals()) @@ -1225,11 +1215,16 @@ def unique(x, else: axis = [axis] attr_dtype = convert_np_dtype_to_dtype_(dtype) - if paddle.in_dynamic_mode(): - out, inverse, indices, counts = _C_ops.unique( - x, 'dtype', attr_dtype, 'return_index', return_index, - 'return_inverse', return_inverse, 'return_counts', return_counts, - 'axis', axis, "is_sorted", True) + if _non_static_mode(): + if in_dygraph_mode(): + out, indices, inverse, counts = _C_ops.final_state_unique( + x, return_index, return_inverse, return_counts, axis, + attr_dtype) + if _in_legacy_dygraph(): + out, inverse, indices, counts = _C_ops.unique( + x, 'dtype', attr_dtype, 'return_index', return_index, + 'return_inverse', return_inverse, 'return_counts', + return_counts, 'axis', axis, "is_sorted", True) outs = [out] if return_index: outs.append(indices) @@ -1409,7 +1404,9 @@ def gather(x, index, axis=None, name=None): if axis is None: axis = 0 - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_gather(x, index, axis) + if _in_legacy_dygraph(): axis = axis.item() if isinstance(axis, paddle.Tensor) else axis return _C_ops.gather(x, index, None, "axis", axis, "overwrite", False) @@ -1476,6 +1473,9 @@ def unbind(input, axis=0): # x3.shape [3, 5] """ + if in_dygraph_mode(): + return _C_ops.final_state_unbind(input, axis) + if not isinstance(axis, (int)): raise TypeError("The type of 'axis' must be int, but received %s." 
% (type(axis))) @@ -1484,7 +1484,7 @@ def unbind(input, axis=0): input_shape = input.shape axis_ = axis if axis >= 0 else len(input_shape) + axis num = input_shape[axis_] - if paddle.in_dynamic_mode(): + if _in_legacy_dygraph(): return _C_ops.unbind(input, num, 'axis', axis) helper = LayerHelper("unbind", **locals()) @@ -1762,8 +1762,17 @@ def tile(x, repeat_times, name=None): np_out = out.numpy() # [[1, 2, 3, 1, 2, 3]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if isinstance(repeat_times, core.eager.Tensor): + assert (repeat_times.ndim == 1, + "Only support ndim == 1 while repeat_times is a Tensor.") + repeat_times = repeat_times.numpy().tolist() + + return _C_ops.final_state_tile(x, repeat_times) + + if _in_legacy_dygraph(): return _C_ops.tile(x, 'repeat_times', repeat_times) + check_type(repeat_times, 'repeat_times', (list, tuple, Variable), 'tile') if isinstance(repeat_times, Variable): assert len(repeat_times.shape) == 1, ( @@ -1845,7 +1854,10 @@ def expand_as(x, y, name=None): np_out = out.numpy() # [[1, 2, 3], [1, 2, 3]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_expand_as(x, None, y.shape) + + if _non_static_mode(): return _C_ops.expand_as_v2(x, 'target_shape', y.shape) check_variable_and_dtype( @@ -2833,12 +2845,14 @@ def take_along_axis(arr, indices, axis): if not broadcast_shape: # if indices matrix have larger size than arr, arr should broadcast into indices shape. broadcast_shape = indices.shape - if paddle.in_dynamic_mode(): + if _non_static_mode(): indices = paddle.broadcast_to(indices, broadcast_shape) broadcast_shape_list = list(broadcast_shape) broadcast_shape_list[axis] = list(arr.shape)[axis] broadcast_shape = tuple(broadcast_shape_list) arr = paddle.broadcast_to(arr, broadcast_shape) + if not _in_legacy_dygraph(): + return _C_ops.final_state_take_along_axis(arr, indices, axis) return _C_ops.take_along_axis(arr, indices, 'Axis', axis) check_variable_and_dtype( arr, 'x', ['float16', 'float32', 'float64', 'int32', 'int64', 'uint8'], @@ -2898,12 +2912,15 @@ def put_along_axis(arr, indices, values, axis, reduce='assign'): "`indices` and `arr` must have the same number of dimensions!") axis = non_negative_axis(arr, axis) broadcast_shape = infer_broadcast_shape(arr, indices, axis) - if paddle.in_dynamic_mode(): + if _non_static_mode(): values = paddle.to_tensor(values) if not isinstance( values, paddle.Tensor) else values if broadcast_shape: indices = paddle.broadcast_to(indices, broadcast_shape) values = paddle.broadcast_to(values, indices.shape) + if in_dygraph_mode(): + return _C_ops.final_state_put_along_axis(arr, indices, values, axis, + reduce) return _C_ops.put_along_axis(arr, indices, values, "Axis", axis, "Reduce", reduce) @@ -2947,3 +2964,17 @@ def put_along_axis_(arr, indices, values, axis, reduce='assign'): values = paddle.broadcast_to(values, indices.shape) return _C_ops.put_along_axis_(arr, indices, values, "Axis", axis, "Reduce", reduce) + + +# TODO(dev): We need avoid implementing it by this way. 
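# Editorial note (not part of the patch): registering these helpers through one loop
# attaches them to both core.VarBase (legacy dygraph tensors) and core.eager.Tensor
# (eager-mode tensors), replacing the scattered setattr(core.VarBase, ...) calls removed
# above, so e.g. paddle.ones([2, 2]).fill_(0.5) and Tensor.tolist() work in either mode.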
+__METHODS = { + 'fill_': fill_, + 'zero_': zero_, + 'fill_diagonal_': fill_diagonal_, + 'fill_diagonal_tensor_': fill_diagonal_tensor_, + "fill_diagonal_tensor": fill_diagonal_tensor, + 'tolist': tolist +} +for name, func in __METHODS.items(): + setattr(core.VarBase, name, func) + setattr(core.eager.Tensor, name, func) diff --git a/python/paddle/tensor/math.py b/python/paddle/tensor/math.py old mode 100755 new mode 100644 index f7b5e11184778..cfc9abb86984d --- a/python/paddle/tensor/math.py +++ b/python/paddle/tensor/math.py @@ -23,56 +23,52 @@ from paddle.common_ops_import import templatedoc from paddle.common_ops_import import dygraph_utils -from paddle.tensor import cast -from paddle.tensor.attribute import _complex_to_real_dtype +from .manipulation import cast +from .creation import _complex_to_real_dtype +from .layer_function_generator import _generate_doc_string_, generate_activation_fn, generate_layer_fn + import paddle -from paddle.static import Variable -from ..framework import core -from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode +from ..static import Variable +from ..framework import core, in_dygraph_mode, _non_static_mode, LayerHelper +from ..fluid.framework import _in_legacy_dygraph from ..framework import _varbase_creator, convert_np_dtype_to_dtype_ -from ..fluid.layer_helper import LayerHelper from ..fluid.data_feeder import check_variable_and_dtype, check_type, check_dtype, convert_dtype -from ..fluid.layers.layer_function_generator import _generate_doc_string_, generate_activation_fn, generate_layer_fn from ..fluid.dygraph.inplace_utils import inplace_apis_in_dygraph_only # TODO: define math functions # yapf: disable -from ..fluid.layers import abs # noqa: F401 -from ..fluid.layers import acos # noqa: F401 -from ..fluid.layers import asin # noqa: F401 -from ..fluid.layers import ceil # noqa: F401 -from ..fluid.layers import ceil_ # noqa: F401 -from ..fluid.layers import cos # noqa: F401 -from ..fluid.layers import tan # noqa: F401 -from ..fluid.layers import sinh # noqa: F401 -from ..fluid.layers import cosh # noqa: F401 -from ..fluid.layers import exp # noqa: F401 -from ..fluid.layers import exp_ # noqa: F401 -from ..fluid.layers import expm1 # noqa: F401 -from ..fluid.layers import floor # noqa: F401 -from ..fluid.layers import floor_ # noqa: F401 -from ..fluid.layers import log # noqa: F401 -from ..fluid.layers import reciprocal # noqa: F401 -from ..fluid.layers import reciprocal_ # noqa: F401 -from ..fluid.layers import round # noqa: F401 -from ..fluid.layers import round_ # noqa: F401 -from ..fluid.layers import rsqrt # noqa: F401 -from ..fluid.layers import rsqrt_ # noqa: F401 -from ..fluid.layers import scale # noqa: F401 -from ..fluid.layers import square # noqa: F401 -from ..fluid.layers import stanh # noqa: F401 -from ..fluid.layers import atan # noqa: F401 -from ..fluid.layers import erf # noqa: F401 -from ..fluid.layers import sqrt # noqa: F401 -from ..fluid.layers import sqrt_ # noqa: F401 -from ..fluid.layers import sin # noqa: F401 -from ..fluid.layers import lgamma # noqa: F401 -from ..fluid.layers import asinh # noqa: F401 -from ..fluid.layers import acosh # noqa: F401 -from ..fluid.layers import atanh # noqa: F401 - -from ..fluid.layers import multiplex # noqa: F401 -from ..fluid.layers import reduce_prod +from .ops import abs # noqa: F401 +from .ops import acos # noqa: F401 +from .ops import asin # noqa: F401 +from .ops import ceil # noqa: F401 +from .ops import ceil_ # noqa: F401 +from .ops import cos # noqa: F401 +from .ops import 
tan # noqa: F401 +from .ops import sinh # noqa: F401 +from .ops import cosh # noqa: F401 +from .ops import exp # noqa: F401 +from .ops import exp_ # noqa: F401 +from .ops import expm1 # noqa: F401 +from .ops import floor # noqa: F401 +from .ops import floor_ # noqa: F401 +from .ops import reciprocal # noqa: F401 +from .ops import reciprocal_ # noqa: F401 +from .ops import round # noqa: F401 +from .ops import round_ # noqa: F401 +from .ops import rsqrt # noqa: F401 +from .ops import rsqrt_ # noqa: F401 +from .ops import square # noqa: F401 +from .ops import atan # noqa: F401 +from .ops import erf # noqa: F401 +from .ops import sqrt # noqa: F401 +from .ops import sqrt_ # noqa: F401 +from .ops import sin # noqa: F401 +from .ops import lgamma # noqa: F401 +from .ops import asinh # noqa: F401 +from .ops import acosh # noqa: F401 +from .ops import atanh # noqa: F401 + + from ..fluid.layers import elementwise_sub from paddle import _C_ops @@ -92,16 +88,254 @@ ] +def log(x, name=None): + r""" + Calculates the natural log of the given input tensor, element-wise. + + .. math:: + + Out = \\ln(x) + + Args: + x (Tensor): Input Tensor. Must be one of the following types: float32, float64. + name (str|None): The default value is None. Normally there is no need for user to set this property. For more information, please refer to :ref:`api_guide_Name` + + + Returns: + Tensor: The natural log of the input Tensor computed element-wise. + + Examples: + + .. code-block:: python + + import paddle + + x = [[2,3,4], [7,8,9]] + x = paddle.to_tensor(x, dtype='float32') + res = paddle.log(x) + # [[0.693147, 1.09861, 1.38629], [1.94591, 2.07944, 2.19722]] + """ + if in_dygraph_mode(): + return _C_ops.final_state_log(x) + if _in_legacy_dygraph(): + return _C_ops.log(x) + + check_variable_and_dtype(x, 'x', ['float32', 'float64'], "log") + inputs = {'X': [x]} + helper = LayerHelper('log', **locals()) + dtype = helper.input_dtype(input_param_name='x') + out = helper.create_variable_for_type_inference(dtype) + helper.append_op(type="log", inputs={"X": x}, outputs={"Out": out}) + return out + + +def scale(x, scale=1.0, bias=0.0, bias_after_scale=True, act=None, name=None): + """ + Scale operator. + + Putting scale and bias to the input Tensor as following: + + ``bias_after_scale`` is True: + + .. math:: + Out=scale*X+bias + + ``bias_after_scale`` is False: + + .. math:: + Out=scale*(X+bias) + + Args: + x(Tensor): Input N-D Tensor of scale operator. Data type can be float32, float64, int8, int16, int32, int64, uint8. + scale(float|Tensor): The scale factor of the input, it should be a float number or a Tensor with shape [1] and data type as float32. + bias(float): The bias to be put on the input. + bias_after_scale(bool): Apply bias addition after or before scaling. It is useful for numeric stability in some circumstances. + act(str, optional): Activation applied to the output such as tanh, softmax, sigmoid, relu. + name(str, optional): The default value is None. Normally there is no need for user to set this property. For more information, please refer to :ref:`api_guide_Name` + + Returns: + Tensor: Output tensor of scale operator, with shape and data type same as input. + + Examples: + .. code-block:: python + + # scale as a float32 number + import paddle + + data = paddle.randn(shape=[2,3], dtype='float32') + res = paddle.scale(data, scale=2.0, bias=1.0) + + .. 
code-block:: python + + # scale with parameter scale as a Tensor + import paddle + + data = paddle.randn(shape=[2, 3], dtype='float32') + factor = paddle.to_tensor([2], dtype='float32') + res = paddle.scale(data, scale=factor, bias=1.0) + + """ + + if in_dygraph_mode(): + out = _C_ops.final_state_scale(x, scale, float(bias), bias_after_scale) + return dygraph_utils._append_activation_in_dygraph(out) + if _non_static_mode(): + _scale = scale.numpy().item(0) if isinstance(scale, Variable) else scale + out = _C_ops.scale(x, 'scale', + float(_scale), 'bias', + float(bias), 'bias_after_scale', bias_after_scale) + return dygraph_utils._append_activation_in_dygraph(out) + + check_variable_and_dtype(x, "x", [ + 'float16', 'uint16', 'float32', 'float64', 'int8', 'int16', 'int32', + 'int64', 'uint8' + ], "scale") + inputs = {'X': [x]} + attrs = { + 'bias': float(bias), + 'bias_after_scale': bias_after_scale, + } + if isinstance(scale, Variable): + inputs['ScaleTensor'] = [scale] + else: + attrs['scale'] = float(scale) + helper = LayerHelper('scale', **locals()) + out = helper.create_variable_for_type_inference(dtype=x.dtype) + + helper.append_op( + type='scale', inputs=inputs, outputs={'Out': out}, attrs=attrs) + return helper.append_activation(out) + + +def stanh(x, scale_a=0.67, scale_b=1.7159, name=None): + """ + stanh activation. + + .. math:: + + out = b * \\frac{e^{a * x} - e^{-a * x}}{e^{a * x} + e^{-a * x}} + + Parameters: + x (Tensor): The input Tensor with data type float32, float64. + scale_a (float, optional): The scale factor a of the input. Default is 0.67. + scale_b (float, optional): The scale factor b of the output. Default is 1.7159. + name (str, optional): Name for the operation (optional, default is None). + For more information, please refer to :ref:`api_guide_Name`. + + Returns: + A Tensor with the same data type and shape as ``x`` . + + Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([1.0, 2.0, 3.0, 4.0]) + out = paddle.stanh(x, scale_a=0.67, scale_b=1.72) # [1.00616539, 1.49927628, 1.65933108, 1.70390463] + + """ + + if _non_static_mode(): + return _C_ops.stanh(x, 'scale_a', scale_a, 'scale_b', scale_b) + + check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'stanh') + + helper = LayerHelper('stanh', **locals()) + out = helper.create_variable_for_type_inference(dtype=x.dtype) + helper.append_op( + type='stanh', + inputs={'X': x}, + outputs={'Out': out}, + attrs={'scale_a': scale_a, + 'scale_b': scale_b}) + return out + +def multiplex(inputs, index, name=None): + """ + + Based on the given index parameter, the OP selects a specific row from each input Tensor to construct the output Tensor. + + If the input of this OP contains :math:`m` Tensors, where :math:`I_{i}` means the i-th input Tensor, :math:`i` between :math:`[0,m)` . + + And :math:`O` means the output, where :math:`O[i]` means the i-th row of the output, then the output satisfies that :math:`O[i] = I_{index[i]}[i]` . + + For Example: + + .. 
code-block:: text + + Given: + + inputs = [[[0,0,3,4], [0,1,3,4], [0,2,4,4], [0,3,3,4]], + [[1,0,3,4], [1,1,7,8], [1,2,4,2], [1,3,3,4]], + [[2,0,3,4], [2,1,7,8], [2,2,4,2], [2,3,3,4]], + [[3,0,3,4], [3,1,7,8], [3,2,4,2], [3,3,3,4]]] + + index = [[3],[0],[1],[2]] + + out = [[3,0,3,4], # out[0] = inputs[index[0]][0] = inputs[3][0] = [3,0,3,4] + [0,1,3,4], # out[1] = inputs[index[1]][1] = inputs[0][1] = [0,1,3,4] + [1,2,4,2], # out[2] = inputs[index[2]][2] = inputs[1][2] = [1,2,4,2] + [2,3,3,4]] # out[3] = inputs[index[3]][3] = inputs[2][3] = [2,3,3,4] + + + Args: + inputs (list): The input Tensor list. The list elements are N-D Tensors of data types float32, float64, int32, int64. All input Tensor shapes should be the same and rank must be at least 2. + index (Tensor): Used to select some rows in the input Tensor to construct an index of the output Tensor. It is a 2-D Tensor with data type int32 or int64 and shape [M, 1], where M is the number of input Tensors. + name(str, optional): The default value is None. Normally there is no + need for user to set this property. For more information, please + refer to :ref:`api_guide_Name`. + Returns: + Tensor: Output of multiplex OP, with data type being float32, float64, int32, int64. + + Examples: + + .. code-block:: python + + import paddle + import numpy as np + img1 = np.array([[1, 2], [3, 4]]).astype(np.float32) + img2 = np.array([[5, 6], [7, 8]]).astype(np.float32) + inputs = [paddle.to_tensor(img1), paddle.to_tensor(img2)] + index = paddle.to_tensor(np.array([[1], [0]]).astype(np.int32)) + res = paddle.multiplex(inputs, index) + print(res) # [array([[5., 6.], [3., 4.]], dtype=float32)] + + """ + if _non_static_mode(): + return _C_ops.multiplex(index, inputs) + helper = LayerHelper('multiplex', **locals()) + + check_type(inputs, 'inputs', (list), 'multiplex') + if len(inputs) < 2: + raise ValueError( + "inputs should be a list object with at least 2 elements.") + for id, x in enumerate(inputs): + check_variable_and_dtype(x, 'input[' + str(id) + ']', + ['float32', 'float64', 'int32', 'int64'], + 'multiplex') + check_variable_and_dtype(index, "index", ['int32', 'int64'], 'multiplex') + + out = helper.create_variable_for_type_inference(inputs[0].dtype) + helper.append_op( + type='multiplex', + inputs={'X': inputs, + 'Ids': index}, + outputs={'Out': [out]}) + return out + @inplace_apis_in_dygraph_only def scale_(x, scale=1.0, bias=0.0, bias_after_scale=True, act=None, name=None): """ Inplace version of ``scale`` API, the output Tensor will be inplaced with input ``x``. Please refer to :ref:`api_tensor_scale`. 
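A small numeric illustration (editorial, not part of the patch) of the ``bias_after_scale`` switch documented for ``scale`` above:

    .. code-block:: python

        import paddle

        x = paddle.to_tensor([1.0, 2.0])
        paddle.scale(x, scale=2.0, bias=1.0, bias_after_scale=True)   # 2*x + 1   -> [3., 5.]
        paddle.scale(x, scale=2.0, bias=1.0, bias_after_scale=False)  # 2*(x + 1) -> [4., 6.]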
""" - _scale = scale.numpy().item(0) if isinstance(scale, Variable) else scale - return _C_ops.scale_(x, 'scale', - float(_scale), 'bias', - float(bias), 'bias_after_scale', bias_after_scale) + if in_dygraph_mode(): + return _C_ops.final_state_scale_(x, scale, float(bias), bias_after_scale) + if _in_legacy_dygraph(): + _scale = scale.numpy().item(0) if isinstance(scale, Variable) else scale + return _C_ops.scale_(x, 'scale', + float(_scale), 'bias', + float(bias), 'bias_after_scale', bias_after_scale) def pow(x, y, name=None): @@ -150,33 +384,47 @@ def pow(x, y, name=None): """ # in dynamic graph mode - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): if isinstance(y, (int, float)): - return _C_ops.pow(x, 'factor', y) + return _C_ops.final_state_pow(x, y) elif isinstance(y, (paddle.Tensor, Variable)): return _elementwise_op_in_dygraph( x, y, axis=-1, act=None, op_name='elementwise_pow') else: raise TypeError('y must be scalar or tensor type, but received: %s '% (y.dtype)) - # in static graph mode - else: + if _in_legacy_dygraph(): if isinstance(y, (int, float)): - helper = LayerHelper('pow', **locals()) - inputs = {'X': x} - attrs = {'factor': y} - out = helper.create_variable_for_type_inference(dtype=x.dtype) - helper.append_op( - type='pow', inputs=inputs, outputs={'Out': out}, attrs=attrs) - return out + return _C_ops.pow(x, 'factor', y) elif isinstance(y, (paddle.Tensor, Variable)): - # TODO A potential speed improvement is supporting different types in C++ and removing the cast ops here - helper = LayerHelper('elementwise_pow', **locals()) - out = helper.create_variable_for_type_inference(dtype=x.dtype) - return _elementwise_op(LayerHelper('elementwise_pow', **locals())) + return _elementwise_op_in_dygraph( + x, y, axis=-1, act=None, op_name='elementwise_pow') else: - raise TypeError('y must be scalar or tensor type, but received: %s '% (type(y))) + raise TypeError('y must be scalar or tensor type, but received: %s '% (y.dtype)) + # in static graph mode + if isinstance(y, (int, float)): + helper = LayerHelper('pow', **locals()) + inputs = {'X': x} + attrs = {'factor': y} + out = helper.create_variable_for_type_inference(dtype=x.dtype) + helper.append_op( + type='pow', inputs=inputs, outputs={'Out': out}, attrs=attrs) + return out + elif isinstance(y, (paddle.Tensor, Variable)): + # TODO A potential speed improvement is supporting different types in C++ and removing the cast ops here + helper = LayerHelper('elementwise_pow', **locals()) + out = helper.create_variable_for_type_inference(dtype=x.dtype) + return _elementwise_op(LayerHelper('elementwise_pow', **locals())) + else: + raise TypeError('y must be scalar or tensor type, but received: %s '% (type(y))) +OP_NAMEMAPPING = { + 'elementwise_max': 'final_state_maximum', + 'elementwise_min': 'final_state_minimum', + 'elementwise_pow': 'final_state_elementwise_pow', + 'elementwise_floordiv': 'final_state_floor_divide', + 'elementwise_mod': 'final_state_modulo', +} @dygraph_only def _elementwise_op_in_dygraph(x, @@ -185,13 +433,24 @@ def _elementwise_op_in_dygraph(x, act=None, use_mkldnn=False, op_name=None): - op = getattr(_C_ops, op_name) - out = op(x, y, 'axis', axis, 'use_mkldnn', use_mkldnn) + def is_inplace(op_name): + return op_name[-1] == "_" + + if op_name not in OP_NAMEMAPPING.keys(): + op = getattr(_C_ops, op_name) + out = op(x, y, 'axis', axis, 'use_mkldnn', use_mkldnn) + else: + if in_dygraph_mode(): + op = getattr(_C_ops, OP_NAMEMAPPING[op_name] if not is_inplace(op_name) else op_name) + out = op(x, y) + + if 
_in_legacy_dygraph(): + op = getattr(_C_ops, op_name) + out = op(x, y, 'axis', axis, 'use_mkldnn', use_mkldnn) return dygraph_utils._append_activation_in_dygraph( out, act, use_mkldnn=use_mkldnn) - def _elementwise_op(helper): op_type = helper.layer_type original_op_type = helper.kwargs.get('original_op_type', op_type) @@ -706,7 +965,9 @@ def fmax(x, y, name=None): op_type = 'elementwise_fmax' axis = -1 act = None - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_fmax(x, y, axis) + if _in_legacy_dygraph(): return _elementwise_op_in_dygraph( x, y, axis=axis, act=act, op_name=op_type) return _elementwise_op(LayerHelper(op_type, **locals())) @@ -767,7 +1028,9 @@ def fmin(x, y, name=None): op_type = 'elementwise_fmin' axis = -1 act = None - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_fmin(x, y, axis) + if _in_legacy_dygraph(): return _elementwise_op_in_dygraph( x, y, axis=axis, act=act, op_name=op_type) return _elementwise_op(LayerHelper(op_type, **locals())) @@ -876,7 +1139,18 @@ def get_dtype(x, dtype): return (False, src_type) dtype_flag, dtype = get_dtype(x, dtype) - if paddle.in_dynamic_mode(): + + if in_dygraph_mode(): + if reduce_all_flag: + axis = range(len(x.shape)) + else: + axis = axis if axis != None and axis != [] else [0] + + out_dtype = convert_np_dtype_to_dtype_(dtype) + out = _C_ops.final_state_sum(x, axis, out_dtype, keepdim) + return out + + if _in_legacy_dygraph(): axis = axis if axis != None and axis != [] else [0] if dtype_flag: return _C_ops.reduce_sum(x, 'dim', axis, 'keep_dim', keepdim, @@ -981,6 +1255,73 @@ def nansum(x, axis=None, dtype=None, keepdim=False, name=None): return sum(tmp_tensor, axis, dtype, keepdim, name) +def nanmean(x, axis=None, keepdim=False, name=None): + r""" + Compute the arithmetic mean along the specified axis, ignoring NaNs. + + Args: + x (Tensor): The input Tensor with data type uint16, float16, float32, float64. + axis (int|list|tuple, optional):The axis along which to perform nanmean + calculations. ``axis`` should be int, list(int) or tuple(int). If + ``axis`` is a list/tuple of dimension(s), nanmean is calculated along + all element(s) of ``axis`` . ``axis`` or element(s) of ``axis`` + should be in range [-D, D), where D is the dimensions of ``x`` . If + ``axis`` or element(s) of ``axis`` is less than 0, it works the + same way as :math:`axis + D` . If ``axis`` is None, nanmean is + calculated over all elements of ``x``. Default is None. + keepdim (bool, optional): Whether to reserve the reduced dimension(s) + in the output Tensor. If ``keepdim`` is True, the dimensions of + the output Tensor is the same as ``x`` except in the reduced + dimensions(it is of size 1 in this case). Otherwise, the shape of + the output Tensor is squeezed in ``axis`` . Default is False. + name (str, optional): Name for the operation (optional, default is None). + For more information, please refer to :ref:`api_guide_Name`. + + Returns: + Tensor, results of arithmetic mean along ``axis`` of ``x``, with the same data + type as ``x``. + + Examples: + + .. 
code-block:: python + :name: code-example1 + + import paddle + # x is a 2-D Tensor: + x = paddle.to_tensor([[float('nan'), 0.3, 0.5, 0.9], + [0.1, 0.2, float('-nan'), 0.7]]) + out1 = paddle.nanmean(x) + # [0.44999996] + out2 = paddle.nanmean(x, axis=0) + # [0.1, 0.25, 0.5, 0.79999995] + out3 = paddle.nanmean(x, axis=0, keepdim=True) + # [[0.1, 0.25, 0.5, 0.79999995]] + out4 = paddle.nanmean(x, axis=1) + # [0.56666666 0.33333334] + out5 = paddle.nanmean(x, axis=1, keepdim=True) + # [[0.56666666] + # [0.33333334]] + + # y is a 3-D Tensor: + y = paddle.to_tensor([[[1, float('nan')], [3, 4]], + [[5, 6], [float('-nan'), 8]]]) + out6 = paddle.nanmean(y, axis=[1, 2]) + # [2.66666675, 6.33333349] + out7 = paddle.nanmean(y, axis=[0, 1]) + # [3., 6.] + """ + if isinstance(axis, int): + axis = [axis] + check_variable_and_dtype(x, 'x/input', + ['uint16', 'float16', 'float32', 'float64'], + 'nanmean' ) + if axis is not None: + check_type(axis, 'axis/dim', (int, list, tuple), 'nanmean') + + cnt = paddle.sum(~paddle.isnan(x), axis = axis,keepdim=keepdim) + return paddle.divide(paddle.nansum(x, axis=axis, keepdim=keepdim, name=name), cnt.astype(x.dtype)) + + @templatedoc(op_type="sum") def add_n(inputs, name=None): """ @@ -1040,7 +1381,11 @@ def add_n(inputs, name=None): # [[8., 10., 12.], # [14., 16., 18.]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if isinstance(inputs, Variable): + inputs = [inputs] + return _C_ops.final_state_add_n(inputs) + if _in_legacy_dygraph(): if isinstance(inputs, Variable): inputs = [inputs] return _C_ops.sum(inputs, 'use_mkldnn', False) @@ -1538,7 +1883,11 @@ def logsumexp(x, axis=None, keepdim=False, name=None): if axis is None or len(axis) == 0: axis = [0] - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if reduce_all: + axis = range(len(x.shape)) + return _C_ops.final_state_logsumexp(x, axis, keepdim, reduce_all) + if _in_legacy_dygraph(): return _C_ops.logsumexp(x, 'axis', axis, 'keepdim', keepdim, 'reduce_all', reduce_all) check_variable_and_dtype(x, 'x', @@ -1698,7 +2047,11 @@ def max(x, axis=None, keepdim=False, name=None): """ reduce_all, axis = _get_reduce_all_value(axis) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if reduce_all: + axis = range(len(x.shape)) + return _C_ops.final_state_max(x, axis, keepdim) + if _in_legacy_dygraph(): return _C_ops.reduce_max(x, 'dim', axis, 'keep_dim', keepdim, 'reduce_all', reduce_all) @@ -1798,7 +2151,12 @@ def min(x, axis=None, keepdim=False, name=None): """ reduce_all, axis = _get_reduce_all_value(axis) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if reduce_all: + axis = range(len(x.shape)) + return _C_ops.final_state_min(x, axis, keepdim) + + if _in_legacy_dygraph(): return _C_ops.reduce_min(x, 'dim', axis, 'keep_dim', keepdim, 'reduce_all', reduce_all) @@ -2068,7 +2426,9 @@ def log1p(x, name=None): # [[0.], [0.6931472]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_log1p(x) + if _in_legacy_dygraph(): return _C_ops.log1p(x) check_variable_and_dtype(x, 'x', ['float32', 'float64'], "log1p") @@ -2117,7 +2477,9 @@ def log2(x, name=None): res = paddle.log2(x_i) print(res) # [1.0] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_log2(x) + if _in_legacy_dygraph(): return _C_ops.log2(x) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], "log2") @@ -2167,7 +2529,9 @@ def log10(x, name=None): res = paddle.log10(x_i) print(res) # [1.0] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + 
return _C_ops.final_state_log10(x) + if _in_legacy_dygraph(): return _C_ops.log10(x) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], "log10") @@ -2228,7 +2592,16 @@ def clip(x, min=None, max=None, name=None): min_ = float(np.finfo(np.float32).min) max_ = float(np.finfo(np.float32).max) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if isinstance(min, Variable): + min = min.numpy().item(0) + if isinstance(max, Variable): + max = max.numpy().item(0) + min = min_ if min is None else min + max = max_ if max is None else max + return _C_ops.final_state_clip(x, min, max) + + if _in_legacy_dygraph(): if isinstance(min, Variable): min = min.numpy().item(0) if isinstance(max, Variable): @@ -2288,7 +2661,12 @@ def clip_(x, min=None, max=None, name=None): max = max.numpy().item(0) min = fmin if min is None else min max = fmax if max is None else max - return _C_ops.clip_(x, "min", min, "max", max) + + if in_dygraph_mode(): + return _C_ops.final_state_clip_(x, min, max) + + if _in_legacy_dygraph(): + return _C_ops.clip_(x, "min", min, "max", max) @@ -2361,7 +2739,10 @@ def __check_input(input, offset, dim1, dim2): "But received axis1 = %d, axis2 = %d\n"%(axis1, axis2) __check_input(input, offset, axis1, axis2) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_trace( x, offset, axis1, axis2 ) + + if _in_legacy_dygraph(): return _C_ops.trace(x, 'offset', offset, 'axis1', axis1, 'axis2', axis2) inputs = {'Input': [x]} @@ -2524,9 +2905,10 @@ def kron(x, y, name=None): # [12, 15, 18, 16, 20, 24], # [21, 24, 27, 28, 32, 36]]) """ - if paddle.in_dynamic_mode(): + if _in_legacy_dygraph(): return _C_ops.kron(x, y) - + if in_dygraph_mode(): + return _C_ops.final_state_kron(x, y) helper = LayerHelper('kron', **locals()) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64', 'int32', 'int64'], 'kron') check_variable_and_dtype(y, 'y', ['float16', 'float32', 'float64', 'int32', 'int64'], 'kron') @@ -2584,7 +2966,10 @@ def cumsum(x, axis=None, dtype=None, name=None): if dtype is not None and x.dtype != convert_np_dtype_to_dtype_(dtype): x = cast(x, dtype) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if axis is None: axis = -1 + return _C_ops.final_state_cumsum(x, axis, flatten, False, False) + if _in_legacy_dygraph(): if axis is None: return _C_ops.cumsum(x, 'flatten', flatten) else: @@ -2649,7 +3034,9 @@ def cumprod(x, dim=None, dtype=None, name=None): if dtype is not None and x.dtype != convert_np_dtype_to_dtype_(dtype): x = cast(x, dtype) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_cumprod(x, dim) + if _in_legacy_dygraph(): return _C_ops.cumprod(x, 'dim', dim) check_variable_and_dtype(x, "x", ['complex64', 'complex128', 'float32', 'float64', 'int32', 'int64'], 'cumprod') @@ -2681,7 +3068,9 @@ def isfinite(x, name=None): out = paddle.tensor.isfinite(x) print(out) # [False True True False True False False] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_isfinite( x ) + if _in_legacy_dygraph(): return _C_ops.isfinite_v2(x) helper = LayerHelper("isfinite_v2", **locals()) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64', 'int32', 'int64'], 'isfinite') @@ -2709,7 +3098,9 @@ def isinf(x, name=None): out = paddle.tensor.isinf(x) print(out) # [ True False False True False False False] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_isinf( x ) + if _in_legacy_dygraph(): return _C_ops.isinf_v2(x) helper = 
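The hunks above, and most of the ones that follow, apply the same three-way dispatch: the new eager mode calls the generated final-state API, legacy dygraph keeps the old _C_ops call, and static graph falls through to the existing LayerHelper path. A condensed, hypothetical sketch of that skeleton, using log1p as the example (log1p_pattern is an illustrative name, not part of the change):

.. code-block:: python

    from paddle import _C_ops
    from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph

    def log1p_pattern(x, name=None):
        if in_dygraph_mode():          # new eager mode -> final-state kernel
            return _C_ops.final_state_log1p(x)
        if _in_legacy_dygraph():       # old dygraph -> legacy C++ op
            return _C_ops.log1p(x)
        ...                            # static graph: LayerHelper + append_op, unchanged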
LayerHelper("isinf_v2", **locals()) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64', 'int32', 'int64'], 'isinf') @@ -2737,7 +3128,10 @@ def isnan(x, name=None): out = paddle.tensor.isnan(x) print(out) # [False False False False False True True] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_isnan( x ) + + if _in_legacy_dygraph(): return _C_ops.isnan_v2(x) helper = LayerHelper("isnan_v2", **locals()) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64', 'int32', 'int64'], 'isnan') @@ -2810,7 +3204,38 @@ def prod(x, axis=None, keepdim=False, dtype=None, name=None): if x.dtype != convert_np_dtype_to_dtype_(dtype): x = cast(x, dtype) - return reduce_prod(input=x, dim=axis, keep_dim=keepdim, name=name) + input = x + dim = axis + keep_dim = keepdim + if dim is not None and not isinstance(dim, list): + if isinstance(dim, tuple): + dim = list(dim) + elif isinstance(dim, int): + dim = [dim] + else: + raise TypeError( + "The type of axis must be int, list or tuple, but received {}". + format(type(dim))) + if in_dygraph_mode(): + return _C_ops.final_state_reduce_prod( + input, dim if dim != None and dim != [] else [0], keep_dim, True if + dim == None or dim == [] or len(dim) == len(input.shape) else False) + + helper = LayerHelper('reduce_prod', **locals()) + check_variable_and_dtype( + input, 'input', ['float32', 'float64', 'int32', 'int64'], 'reduce_prod') + out = helper.create_variable_for_type_inference(dtype=helper.input_dtype()) + helper.append_op( + type='reduce_prod', + inputs={'X': input}, + outputs={'Out': out}, + attrs={ + 'dim': dim if dim != None and dim != [] else [0], + 'keep_dim': keep_dim, + 'reduce_all': True if dim == None or dim == [] or + len(dim) == len(input.shape) else False + }) + return out def sign(x, name=None): @@ -2834,7 +3259,10 @@ def sign(x, name=None): out = paddle.sign(x=x) print(out) # [1.0, 0.0, -1.0, 1.0] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_sign(x) + + if _in_legacy_dygraph(): return _C_ops.sign(x) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'sign') @@ -2871,7 +3299,10 @@ def tanh(x, name=None): print(out) # [-0.37994896 -0.19737532 0.09966799 0.29131261] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_tanh( x ) + + if _in_legacy_dygraph(): return _C_ops.tanh(x) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'tanh') @@ -2913,7 +3344,10 @@ def increment(x, value=1.0, name=None): # [1.] 
""" - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_increment( x, value) + + if _in_legacy_dygraph(): return _C_ops.increment(x, 'step', value) check_variable_and_dtype(x, 'x', ['float32', 'float64', 'int32', 'int64'], @@ -2994,7 +3428,12 @@ def all(x, axis=None, keepdim=False, name=None): else: reduce_all_flag = False - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if reduce_all_flag: + axis = range(len(x.shape)) + return _C_ops.final_state_all(x, axis, keepdim) + + if _in_legacy_dygraph(): axis = axis if axis != None and axis != [] else [0] return _C_ops.reduce_all(x, 'dim', axis, 'keep_dim', keepdim, 'reduce_all', reduce_all_flag) @@ -3086,7 +3525,12 @@ def any(x, axis=None, keepdim=False, name=None): else: reduce_all_flag = False - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if reduce_all_flag: + axis = range(len(x.shape)) + return _C_ops.final_state_any(x, axis, keepdim) + + if _in_legacy_dygraph(): axis = axis if axis != None and axis != [] else [0] return _C_ops.reduce_any(x, 'dim', axis, 'keep_dim', keepdim, 'reduce_all', reduce_all_flag) @@ -3167,6 +3611,9 @@ def conj(x, name=None): # [(4-4j), (5-5j), (6-6j)]]) """ + if in_dygraph_mode(): + return _C_ops.final_state_conj(x) + if paddle.in_dynamic_mode(): return _C_ops.conj(x) @@ -3344,9 +3791,10 @@ def logit(x, eps=None, name=None): if eps == None: eps = 0.0 - if paddle.in_dynamic_mode(): + if _in_legacy_dygraph(): return _C_ops.logit(x, 'eps', eps) - + if in_dygraph_mode(): + return _C_ops.final_state_logit(x, eps) check_variable_and_dtype(x, 'x', ['float16', 'float32', 'float64'], 'logit') helper = LayerHelper("logit", **locals()) out = helper.create_variable_for_type_inference(x.dtype) @@ -3387,8 +3835,13 @@ def lerp(x, y, weight, name=None): # out: [5.5., 6., 6.5, 7.] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): check_type(weight, 'weight', (float, paddle.Tensor, Variable), 'lerp') + if isinstance(weight, float): + weight = paddle.to_tensor(weight, dtype=x.dtype) + + return _C_ops.final_state_lerp( x, y, weight) + if _in_legacy_dygraph(): if isinstance(weight, float): weight = paddle.to_tensor(weight, dtype=x.dtype) return _C_ops.lerp(x, y, weight) @@ -3448,6 +3901,9 @@ def erfinv(x, name=None): # out: [0, 0.4769, -inf] """ + if in_dygraph_mode(): + return _C_ops.final_state_erfinv( x ) + check_variable_and_dtype(x, 'x', ['float32', 'float64'], 'erfinv') if paddle.in_dynamic_mode(): @@ -3793,7 +4249,8 @@ def diff(x, n=1, axis=-1, prepend=None, append=None, name=None): input_list = [x, append] has_pend = True if has_pend: - new_input = _C_ops.concat(input_list, 'axis', axis) + new_input = _varbase_creator() + _C_ops.concat(input_list, new_input, 'axis', axis) else: new_input = x @@ -3821,6 +4278,7 @@ def diff(x, n=1, axis=-1, prepend=None, append=None, name=None): else: out = elementwise_sub(input_back, input_front, axis=axis) return out + else: check_variable_and_dtype(x, 'x', ['float32', 'float64', 'bool', 'int32', 'int64'], 'diff') check_type(axis, 'axis', (int), 'diff') diff --git a/python/paddle/tensor/ops.py b/python/paddle/tensor/ops.py new file mode 100644 index 0000000000000..9ee59c6cfd843 --- /dev/null +++ b/python/paddle/tensor/ops.py @@ -0,0 +1,532 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import os +from .layer_function_generator import generate_layer_fn, generate_activation_fn, generate_inplace_fn, add_sample_code +from ..framework import core +from ..framework import convert_np_dtype_to_dtype_ +from ..static import Variable +from ..fluid.data_feeder import convert_dtype, check_variable_and_dtype, check_type, check_dtype + +__deprecated_func_name__ = { + 'tanh_shrink': 'tanhshrink', + 'logsigmoid': 'log_sigmoid' +} + +__activations_noattr__ = [ + 'sigmoid', + 'silu', + 'logsigmoid', + 'tanh_shrink', + 'softplus', + 'softsign', + 'tanh', +] + +__unary_func__ = [ + 'exp', + 'expm1', + 'atan', + 'sqrt', + 'rsqrt', + 'abs', + 'ceil', + 'floor', + 'cos', + 'tan', + 'acos', + 'sin', + 'sinh', + 'asin', + 'cosh', + 'round', + 'reciprocal', + 'square', + 'lgamma', + 'acosh', + 'asinh', + 'atanh', +] + +__inplace_unary_func__ = [ + 'exp_', + 'sqrt_', + 'rsqrt_', + 'ceil_', + 'floor_', + 'round_', + 'reciprocal_', +] + +__all__ = [] + +for _OP in set(__all__): + globals()[_OP] = generate_layer_fn(_OP) + +# It is a hot fix in some unittest using: +# fluid.layers.scale(x=x, scale=10.0, out=out_var) +# e.g.: test_program_code.py, test_dist_train.py +globals()['_scale'] = generate_layer_fn('scale') + +globals()['_elementwise_div'] = generate_layer_fn('elementwise_div') + +__all__ += __activations_noattr__ +__all__ += __unary_func__ +__all__ += __inplace_unary_func__ + +for _OP in set(__activations_noattr__): + _new_OP = _OP + if _OP in __deprecated_func_name__: + _new_OP = __deprecated_func_name__[_OP] + _func = generate_activation_fn(_OP) + globals()[_OP] = _func + +for _OP in set(__unary_func__): + _new_OP = _OP + if _OP in __deprecated_func_name__: + _new_OP = __deprecated_func_name__[_OP] + _func = generate_activation_fn(_OP) + globals()[_OP] = _func + +for _OP in set(__inplace_unary_func__): + _new_OP = _OP + if _OP in __deprecated_func_name__: + _new_OP = __deprecated_func_name__[_OP] + _func = generate_inplace_fn(_OP) + globals()[_OP] = _func + +add_sample_code(globals()["sigmoid"], r""" +Examples: + .. code-block:: python + + import paddle + import paddle.nn.functional as F + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = F.sigmoid(x) + print(out) + # [0.40131234 0.450166 0.52497919 0.57444252] + +""") + +add_sample_code(globals()["silu"], r""" +Examples: + .. code-block:: python + + import paddle + import paddle.nn.functional as F + + x = paddle.to_tensor([1.0, 2.0, 3.0, 4.0]) + out = F.silu(x) + print(out) + # [ 0.7310586 1.7615942 2.8577224, 3.9280552 ] + +""") + +add_sample_code(globals()["logsigmoid"], r""" +Examples: + .. code-block:: python + + import paddle + import paddle.nn.functional as F + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = F.log_sigmoid(x) + print(out) + # [-0.91301525 -0.79813887 -0.64439666 -0.55435524] + +""") + +add_sample_code(globals()["exp"], r""" +Examples: + .. 
code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.exp(x) + print(out) + # [0.67032005 0.81873075 1.10517092 1.34985881] + +""") + +add_sample_code(globals()["expm1"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.expm1(x) + print(out) + # [-0.32967997, -0.18126924, 0.10517092, 0.34985882] + +""") + +add_sample_code(globals()["tanh"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.tanh(x) + print(out) + # [-0.37994896 -0.19737532 0.09966799 0.29131261] + +""") + +add_sample_code(globals()["atan"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.atan(x) + print(out) + # [-0.38050638 -0.19739556 0.09966865 0.29145679] + +""") + +add_sample_code(globals()["tanh_shrink"], r""" +Examples: + .. code-block:: python + + import paddle + import paddle.nn.functional as F + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = F.tanhshrink(x) + print(out) + # [-0.020051, -0.00262468, 0.000332005, 0.00868739] + +""") + +add_sample_code(globals()["sqrt"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([0.1, 0.2, 0.3, 0.4]) + out = paddle.sqrt(x) + print(out) + # [0.31622777 0.4472136 0.54772256 0.63245553] + +""") + +add_sample_code(globals()["rsqrt"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([0.1, 0.2, 0.3, 0.4]) + out = paddle.rsqrt(x) + print(out) + # [3.16227766 2.23606798 1.82574186 1.58113883] + +""") + +add_sample_code(globals()["abs"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.abs(x) + print(out) + # [0.4 0.2 0.1 0.3] + +""") + +add_sample_code(globals()["ceil"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.ceil(x) + print(out) + # [-0. -0. 1. 1.] + +""") + +add_sample_code(globals()["floor"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.floor(x) + print(out) + # [-1. -1. 0. 0.] + +""") + +add_sample_code(globals()["cos"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.cos(x) + print(out) + # [0.92106099 0.98006658 0.99500417 0.95533649] + +""") + +add_sample_code(globals()["tan"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.tan(x) + print(out) + # [-0.42279324, -0.20271005, 0.10033467, 0.30933627] + +""") + +add_sample_code(globals()["acos"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.acos(x) + print(out) + # [1.98231317 1.77215425 1.47062891 1.26610367] + +""") + +add_sample_code(globals()["sin"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.sin(x) + print(out) + # [-0.38941834 -0.19866933 0.09983342 0.29552021] + +""") + +add_sample_code(globals()["asin"], r""" +Examples: + .. 
code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.asin(x) + print(out) + # [-0.41151685 -0.20135792 0.10016742 0.30469265] + +""") + +add_sample_code(globals()["cosh"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.cosh(x) + print(out) + # [1.08107237 1.02006676 1.00500417 1.04533851] + +""") + +add_sample_code(globals()["sinh"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.sinh(x) + print(out) + # [-0.41075233 -0.201336 0.10016675 0.30452029] + +""") + +add_sample_code(globals()["asinh"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.asinh(x) + print(out) + # [-0.39003533, -0.19869010, 0.09983408, 0.29567307] + +""") + +add_sample_code(globals()["acosh"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([1., 3., 4., 5.]) + out = paddle.acosh(x) + print(out) + # [0. , 1.76274729, 2.06343699, 2.29243159] + +""") + +add_sample_code(globals()["atanh"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.atanh(x) + print(out) + # [-0.42364895, -0.20273256, 0.10033535, 0.30951962] + +""") + +add_sample_code(globals()["round"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.5, -0.2, 0.6, 1.5]) + out = paddle.round(x) + print(out) + # [-1. -0. 1. 2.] + +""") + +add_sample_code(globals()["reciprocal"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.reciprocal(x) + print(out) + # [-2.5 -5. 10. 3.33333333] + +""") + +add_sample_code(globals()["square"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.square(x) + print(out) + # [0.16 0.04 0.01 0.09] + +""") + +add_sample_code(globals()["lgamma"], r""" +Examples: + .. code-block:: python + + import paddle + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.lgamma(x) + print(out) + # [1.31452441, 1.76149750, 2.25271273, 1.09579802] + +""") + +add_sample_code(globals()["softplus"], r""" +Examples: + .. code-block:: python + + import paddle + import paddle.nn.functional as F + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = F.softplus(x) + print(out) + # [0.513015, 0.598139, 0.744397, 0.854355] + +""") + +add_sample_code(globals()["softsign"], r""" +Examples: + .. code-block:: python + + import paddle + import paddle.nn.functional as F + + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = F.softsign(x) + print(out) + # [-0.285714, -0.166667, 0.0909091, 0.230769] + +""") + +__all__ += ['erf'] + +_erf_ = generate_layer_fn('erf') + + +def erf(x, name=None): + locals_var = locals().copy() + kwargs = dict() + for name, val in locals_var.items(): + if val is not None: + kwargs[name] = val + return _erf_(**kwargs) + + +erf.__doc__ = r""" +:strong:`Erf Operator` +For more details, see [Error function](https://en.wikipedia.org/wiki/Error_function). + +Equation: + .. math:: + out = \\frac{2}{\\sqrt{\\pi}} \\int_{0}^{x}e^{- \\eta^{2}}d\\eta + +Args: + + x (Tensor): The input tensor, it's data type should be float32, float64. + +Returns: + + Tensor: The output of Erf op, dtype: float32 or float64, the same as the input, shape: the same as the input. + +Examples: + + .. 
code-block:: python + + import paddle + x = paddle.to_tensor([-0.4, -0.2, 0.1, 0.3]) + out = paddle.erf(x) + print(out) + # [-0.42839236 -0.22270259 0.11246292 0.32862676] +""" diff --git a/python/paddle/tensor/random.py b/python/paddle/tensor/random.py index 660803f9f7475..d2e4363443720 100644 --- a/python/paddle/tensor/random.py +++ b/python/paddle/tensor/random.py @@ -22,6 +22,7 @@ import paddle from paddle import _C_ops from paddle.static import Variable +from paddle.fluid.framework import in_dygraph_mode, _in_legacy_dygraph, _current_expected_place __all__ = [] @@ -66,7 +67,10 @@ def bernoulli(x, name=None): """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_bernoulli(x) + + if _in_legacy_dygraph(): return _C_ops.bernoulli(x) check_variable_and_dtype(x, "x", ["float32", "float64"], "bernoulli") @@ -174,7 +178,10 @@ def multinomial(x, num_samples=1, replacement=False, name=None): assert core.is_compiled_with_rocm() == False, ( "multinomial op is not supported on ROCM yet.") - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_multinomial(x, num_samples, replacement) + + if _in_legacy_dygraph(): return _C_ops.multinomial(x, 'num_samples', num_samples, 'replacement', replacement) @@ -680,7 +687,11 @@ def randint(low=0, high=None, shape=[1], dtype=None, name=None): if not isinstance(dtype, core.VarDesc.VarType): dtype = convert_np_dtype_to_dtype_(dtype) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + shape = utils.convert_shape_to_list(shape) + place = _current_expected_place() + return _C_ops.final_state_randint(low, high, shape, dtype, place) + if _in_legacy_dygraph(): shape = utils.convert_shape_to_list(shape) return _C_ops.randint('shape', shape, 'low', low, 'high', high, 'seed', 0, 'dtype', dtype) @@ -912,7 +923,9 @@ def randperm(n, dtype="int64", name=None): if not isinstance(dtype, core.VarDesc.VarType): dtype = convert_np_dtype_to_dtype_(dtype) - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_randperm(n, dtype, _current_expected_place()) + if _in_legacy_dygraph(): return _C_ops.randperm('n', n, 'seed', 0, 'dtype', dtype) if n < 1: diff --git a/python/paddle/tensor/search.py b/python/paddle/tensor/search.py index ef10135fb99c1..5c290aa0eb760 100644 --- a/python/paddle/tensor/search.py +++ b/python/paddle/tensor/search.py @@ -18,7 +18,7 @@ from ..fluid.data_feeder import check_variable_and_dtype, check_type, check_dtype from ..fluid import layers from ..framework import core -from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode +from ..fluid.framework import _in_legacy_dygraph, in_dygraph_mode, _non_static_mode from paddle.common_ops_import import convert_np_dtype_to_dtype_ from paddle.common_ops_import import Variable from paddle.common_ops_import import VarDesc @@ -91,7 +91,11 @@ def argsort(x, axis=-1, descending=False, name=None): # [1 1 0 2] # [0 2 1 1]]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + _, ids, = _C_ops.final_state_argsort(x, axis, descending) + return ids + + if _in_legacy_dygraph(): _, ids = _C_ops.argsort(x, 'axis', axis, 'descending', descending) return ids check_variable_and_dtype( @@ -171,7 +175,9 @@ def argmax(x, axis=None, keepdim=False, dtype="int64", name=None): flatten = True axis = 0 - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_argmax(x, axis, keepdim, flatten, var_dtype) + if _in_legacy_dygraph(): out = _C_ops.arg_max(x, 'axis', axis, 'dtype', var_dtype, 'keepdims', keepdim, 
'flatten', flatten) return out @@ -251,7 +257,9 @@ def argmin(x, axis=None, keepdim=False, dtype="int64", name=None): flatten = True axis = 0 - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_argmin(x, axis, keepdim, flatten, var_dtype) + if _in_legacy_dygraph(): out = _C_ops.arg_min(x, 'axis', axis, 'dtype', var_dtype, 'keepdims', keepdim, 'flatten', flatten) return out @@ -311,7 +319,10 @@ def index_select(x, index, axis=0, name=None): # [ 9. 10. 10.]] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_index_select(x, index, axis) + + if _in_legacy_dygraph(): return _C_ops.index_select(x, index, 'dim', axis) helper = LayerHelper("index_select", **locals()) @@ -510,7 +521,9 @@ def mode(x, axis=-1, keepdim=False, name=None): # [1, 0]])) """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_mode(x, axis, keepdim) + if _in_legacy_dygraph(): return _C_ops.mode(x, "axis", axis, "keepdim", keepdim) helper = LayerHelper("mode", **locals()) @@ -774,7 +787,10 @@ def masked_select(x, mask, name=None): #[1.0 5.0 6.0 9.0] """ - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_masked_select(x, mask) + + if _in_legacy_dygraph(): return _C_ops.masked_select(x, mask) helper = LayerHelper("masked_select", **locals()) @@ -844,8 +860,14 @@ def topk(x, k, axis=None, largest=True, sorted=True, name=None): # [[1 1 0 0]] """ - if paddle.in_dynamic_mode(): - k = k.numpy().item(0) if isinstance(k, Variable) else k + + if in_dygraph_mode(): + if axis == None: + axis = -1 + out, indices = _C_ops.final_state_top_k(x, k, axis, largest, sorted) + return out, indices + + if _non_static_mode(): if axis is None: out, indices = _C_ops.top_k_v2(x, 'k', int(k), 'largest', largest, 'sorted', @@ -927,8 +949,11 @@ def searchsorted(sorted_sequence, # [1, 3, 4, 5]]) """ + if in_dygraph_mode(): + return _C_ops.final_state_searchsorted(sorted_sequence, values, + out_int32, right) - if paddle.in_dynamic_mode(): + if _in_legacy_dygraph(): return _C_ops.searchsorted(sorted_sequence, values, "out_int32", out_int32, "right", right) @@ -991,11 +1016,16 @@ def kthvalue(x, k, axis=None, keepdim=False, name=None): # [[0, 2], # [1, 2]])) """ - if paddle.in_dynamic_mode(): + if _non_static_mode(): if axis is not None: - return _C_ops.kthvalue(x, 'k', k, "axis", axis, "keepdim", keepdim) + if _in_legacy_dygraph(): + return _C_ops.kthvalue(x, 'k', k, "axis", axis, "keepdim", + keepdim) + return _C_ops.final_state_kthvalue(x, k, axis, keepdim) else: - return _C_ops.kthvalue(x, 'k', k, "keepdim", keepdim) + if _in_legacy_dygraph(): + return _C_ops.kthvalue(x, 'k', k, "keepdim", keepdim) + return _C_ops.final_state_kthvalue(x, k, -1, keepdim) helper = LayerHelper("kthvalue", **locals()) inputs = {"X": [x]} diff --git a/python/paddle/tensor/stat.py b/python/paddle/tensor/stat.py index dd0da03e4fd28..89462e2a8721f 100644 --- a/python/paddle/tensor/stat.py +++ b/python/paddle/tensor/stat.py @@ -18,6 +18,7 @@ from ..static import Variable from ..fluid.layer_helper import LayerHelper from ..framework import core +from paddle.fluid.framework import _in_legacy_dygraph, in_dygraph_mode from .search import where from ..fluid.data_feeder import convert_dtype, check_variable_and_dtype, check_type, check_dtype import paddle @@ -87,7 +88,11 @@ def mean(x, axis=None, keepdim=False, name=None): if axis is None or len(axis) == 0: axis = [0] - if paddle.in_dynamic_mode(): + if in_dygraph_mode(): + if reduce_all: + axis = 
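The kthvalue change above resolves the axis default inside the dispatch itself: when axis is omitted, the new eager branch forwards -1 instead of relying on the kernel default. A small sketch showing the two calls agree, assuming the usual row-wise, k-th smallest semantics of paddle.kthvalue:

.. code-block:: python

    import paddle

    x = paddle.to_tensor([[0.5, 2.0, 1.0],
                          [3.0, 0.1, 4.0]])

    v1, i1 = paddle.kthvalue(x, k=2)           # axis omitted -> forwarded as -1
    v2, i2 = paddle.kthvalue(x, k=2, axis=-1)  # explicit last axis

    print(v1)  # 2nd smallest per row: [1.0, 3.0]
    print(i1)  # their positions:      [2, 0]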
range(len(x.shape)) + return _C_ops.final_state_mean(x, axis, keepdim) + if _in_legacy_dygraph(): return _C_ops.reduce_mean(x, 'dim', axis, 'keep_dim', keepdim, 'reduce_all', reduce_all) @@ -387,7 +392,7 @@ def quantile(x, q, axis=None, keepdim=False): if not isinstance(x, Variable): raise TypeError("input x should be a Tensor.") dims = len(x.shape) - out_shape = x.shape + out_shape = list(x.shape) if axis is None: x = paddle.flatten(x) axis = 0 @@ -433,16 +438,15 @@ def quantile(x, q, axis=None, keepdim=False): indices.append(q_num * (x.shape[axis] - 1)) else: raise TypeError("Type of q should be int, float, list or tuple.") - indices = paddle.to_tensor(indices).astype(paddle.float32) sorted_tensor = paddle.sort(x, axis) - indices_below = paddle.floor(indices).astype(paddle.int32) - indices_upper = paddle.ceil(indices).astype(paddle.int32) + indices_tensor = paddle.assign(indices).astype(paddle.float32) + indices_below = paddle.floor(indices_tensor).astype(paddle.int32) + indices_upper = paddle.ceil(indices_tensor).astype(paddle.int32) outputs = [] def expand_dim(indices, sorted_tensor_shape, axis): assert axis < len(list(sorted_tensor_shape)) expanded_shape = [1] * len(list(sorted_tensor_shape)) - expanded_shape[axis] = len(indices) expanded_shape = tuple(expanded_shape) indices = indices.reshape(expanded_shape) return indices diff --git a/python/paddle/tensor/to_string.py b/python/paddle/tensor/to_string.py index 6caa792adb159..71c97d4cac986 100644 --- a/python/paddle/tensor/to_string.py +++ b/python/paddle/tensor/to_string.py @@ -264,6 +264,9 @@ def to_string(var, prefix='Tensor'): def _format_dense_tensor(tensor, indent): + if tensor.dtype == core.VarDesc.VarType.BF16: + tensor = tensor.astype('float32') + np_tensor = tensor.numpy() if len(tensor.shape) == 0: @@ -288,11 +291,11 @@ def sparse_tensor_to_string(tensor, prefix='Tensor'): indent = len(prefix) + 1 if tensor.is_sparse_coo(): _template = "{prefix}(shape={shape}, dtype={dtype}, place={place}, stop_gradient={stop_gradient}, \n{indent}{indices}, \n{indent}{values})" - indices_tensor = tensor.non_zero_indices() - elements_tensor = tensor.non_zero_elements() + indices_tensor = tensor.indices() + values_tensor = tensor.values() indices_data = 'indices=' + _format_dense_tensor(indices_tensor, indent + len('indices=')) - values_data = 'values=' + _format_dense_tensor(elements_tensor, indent + + values_data = 'values=' + _format_dense_tensor(values_tensor, indent + len('values=')) return _template.format( prefix=prefix, @@ -305,9 +308,9 @@ def sparse_tensor_to_string(tensor, prefix='Tensor'): values=values_data) else: _template = "{prefix}(shape={shape}, dtype={dtype}, place={place}, stop_gradient={stop_gradient}, \n{indent}{crows}, \n{indent}{cols}, \n{indent}{values})" - crows_tensor = tensor.non_zero_crows() - cols_tensor = tensor.non_zero_cols() - elements_tensor = tensor.non_zero_elements() + crows_tensor = tensor.crows() + cols_tensor = tensor.cols() + elements_tensor = tensor.values() crows_data = 'crows=' + _format_dense_tensor(crows_tensor, indent + len('crows=')) cols_data = 'cols=' + _format_dense_tensor(cols_tensor, indent + @@ -330,6 +333,10 @@ def sparse_tensor_to_string(tensor, prefix='Tensor'): def tensor_to_string(tensor, prefix='Tensor'): indent = len(prefix) + 1 + dtype = convert_dtype(tensor.dtype) + if tensor.dtype == core.VarDesc.VarType.BF16: + dtype = 'bfloat16' + _template = "{prefix}(shape={shape}, dtype={dtype}, place={place}, stop_gradient={stop_gradient},\n{indent}{data})" if tensor.is_sparse(): @@ -342,7 
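The quantile fix above switches the interpolation indices from paddle.to_tensor to paddle.assign (so the code also works while building a static graph) and keeps out_shape as a plain list. The underlying computation is ordinary linear interpolation between the two neighbouring order statistics; a minimal 1-D sketch of that math, assuming the same floor/ceil scheme as the code:

.. code-block:: python

    import math
    import paddle

    x = paddle.to_tensor([1., 2., 3., 4.])
    q = 0.4                                  # target quantile
    idx = q * (x.shape[0] - 1)               # 1.2 -> between positions 1 and 2
    below, upper = math.floor(idx), math.ceil(idx)
    weight = idx - below                     # fractional part used for interpolation

    sorted_x = paddle.sort(x)
    out = sorted_x[below] * (1 - weight) + sorted_x[upper] * weight
    print(out)                               # ~2.2, matching paddle.quantile(x, 0.4)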
+349,7 @@ def tensor_to_string(tensor, prefix='Tensor'): return _template.format( prefix=prefix, shape=tensor.shape, - dtype=tensor.dtype, + dtype=dtype, place=tensor._place_str, stop_gradient=tensor.stop_gradient, indent=' ' * indent, diff --git a/python/paddle/tests/CMakeLists.txt b/python/paddle/tests/CMakeLists.txt index 0babdee3a0884..bc9f402ed9686 100644 --- a/python/paddle/tests/CMakeLists.txt +++ b/python/paddle/tests/CMakeLists.txt @@ -47,7 +47,7 @@ set_tests_properties(test_dataset_cifar PROPERTIES TIMEOUT 120) set_tests_properties(test_pretrained_model PROPERTIES TIMEOUT 120) set_tests_properties(test_model PROPERTIES TIMEOUT 300) set_tests_properties(test_dataset_movielens PROPERTIES TIMEOUT 120) -set_tests_properties(test_datasets PROPERTIES TIMEOUT 150) +set_tests_properties(test_datasets PROPERTIES TIMEOUT 300) set_tests_properties(test_dataset_wmt PROPERTIES TIMEOUT 120) set_tests_properties(test_vision_models PROPERTIES TIMEOUT 120) set_tests_properties(test_dataset_uci_housing PROPERTIES TIMEOUT 120) diff --git a/python/paddle/tests/test_async_read_write.py b/python/paddle/tests/test_async_read_write.py index 91875b446aba4..babdf43199dd6 100644 --- a/python/paddle/tests/test_async_read_write.py +++ b/python/paddle/tests/test_async_read_write.py @@ -18,10 +18,11 @@ import paddle from paddle.fluid import core from paddle.device import cuda +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestAsyncRead(unittest.TestCase): - def setUp(self): + def func_setUp(self): self.empty = paddle.to_tensor( np.array( [], dtype="int64"), place=paddle.CPUPlace()) @@ -35,16 +36,20 @@ def setUp(self): shape=[50, 50, 50], dtype="float32").pin_memory() self.stream = cuda.Stream() - def test_async_read_empty_offset_and_count(self): + def func_test_async_read_empty_offset_and_count(self): with cuda.stream_guard(self.stream): - core.async_read(self.src, self.dst, self.index, self.buffer, - self.empty, self.empty) + if _in_legacy_dygraph(): + core.async_read(self.src, self.dst, self.index, self.buffer, + self.empty, self.empty) + else: + core.eager.async_read(self.src, self.dst, self.index, + self.buffer, self.empty, self.empty) array1 = paddle.gather(self.src, self.index) array2 = self.dst[:len(self.index)] self.assertTrue(np.allclose(array1.numpy(), array2.numpy())) - def test_async_read_success(self): + def func_test_async_read_success(self): offset = paddle.to_tensor( np.array( [10, 20], dtype="int64"), place=paddle.CPUPlace()) @@ -52,9 +57,12 @@ def test_async_read_success(self): np.array( [5, 10], dtype="int64"), place=paddle.CPUPlace()) with cuda.stream_guard(self.stream): - core.async_read(self.src, self.dst, self.index, self.buffer, offset, - count) - + if _in_legacy_dygraph(): + core.async_read(self.src, self.dst, self.index, self.buffer, + offset, count) + else: + core.eager.async_read(self.src, self.dst, self.index, + self.buffer, offset, count) # index data index_array1 = paddle.gather(self.src, self.index) count_numel = paddle.sum(count).numpy()[0] @@ -69,26 +77,43 @@ def test_async_read_success(self): self.assertTrue( np.allclose(offset_array1.numpy(), offset_array2.numpy())) - def test_async_read_only_1dim(self): + def func_test_async_read_only_1dim(self): src = paddle.rand([40], dtype="float32").pin_memory() dst = paddle.empty([40], dtype="float32") buffer_ = paddle.empty([20]).pin_memory() with cuda.stream_guard(self.stream): - core.async_read(src, dst, self.index, buffer_, self.empty, - self.empty) + if _in_legacy_dygraph(): + core.async_read(src, 
dst, self.index, buffer_, self.empty, + self.empty) + else: + core.eager.async_read(src, dst, self.index, buffer_, self.empty, + self.empty) array1 = paddle.gather(src, self.index) array2 = dst[:len(self.index)] self.assertTrue(np.allclose(array1.numpy(), array2.numpy())) + def test_main(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_async_read_empty_offset_and_count() + self.func_test_async_read_success() + self.func_test_async_read_only_1dim() + self.func_setUp() + self.func_test_async_read_empty_offset_and_count() + self.func_setUp() + self.func_test_async_read_success() + self.func_setUp() + self.func_test_async_read_only_1dim() + class TestAsyncWrite(unittest.TestCase): - def setUp(self): + def func_setUp(self): self.src = paddle.rand(shape=[100, 50, 50, 5], dtype="float32") self.dst = paddle.empty( shape=[200, 50, 50, 5], dtype="float32").pin_memory() self.stream = cuda.Stream() - def test_async_write_success(self): + def func_test_async_write_success(self): offset = paddle.to_tensor( np.array( [0, 60], dtype="int64"), place=paddle.CPUPlace()) @@ -96,13 +121,23 @@ def test_async_write_success(self): np.array( [40, 60], dtype="int64"), place=paddle.CPUPlace()) with cuda.stream_guard(self.stream): - core.async_write(self.src, self.dst, offset, count) + if _in_legacy_dygraph(): + core.async_write(self.src, self.dst, offset, count) + else: + core.eager.async_write(self.src, self.dst, offset, count) offset_a = paddle.gather(self.dst, paddle.to_tensor(np.arange(0, 40))) offset_b = paddle.gather(self.dst, paddle.to_tensor(np.arange(60, 120))) offset_array = paddle.concat([offset_a, offset_b], axis=0) self.assertTrue(np.allclose(self.src.numpy(), offset_array.numpy())) + def test_async_write_success(self): + with _test_eager_guard(): + self.func_setUp() + self.func_test_async_write_success() + self.func_setUp() + self.func_test_async_write_success() + if __name__ == "__main__": if core.is_compiled_with_cuda(): diff --git a/python/paddle/tests/test_callback_reduce_lr_on_plateau.py b/python/paddle/tests/test_callback_reduce_lr_on_plateau.py index e950528ee4b65..d7680537f378b 100644 --- a/python/paddle/tests/test_callback_reduce_lr_on_plateau.py +++ b/python/paddle/tests/test_callback_reduce_lr_on_plateau.py @@ -29,6 +29,7 @@ from paddle.vision.datasets import MNIST from paddle.metric import Accuracy from paddle.nn.layer.loss import CrossEntropyLoss +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph # Accelerate unittest @@ -38,7 +39,7 @@ def __len__(self): class TestReduceLROnPlateau(unittest.TestCase): - def test_reduce_lr_on_plateau(self): + def func_reduce_lr_on_plateau(self): transform = T.Compose([T.Transpose(), T.Normalize([127.5], [127.5])]) train_dataset = CustomMnist(mode='train', transform=transform) val_dataset = CustomMnist(mode='test', transform=transform) @@ -59,7 +60,12 @@ def test_reduce_lr_on_plateau(self): epochs=10, callbacks=[callbacks]) - def test_warn_or_error(self): + def test_reduce_lr_on_plateau(self): + with _test_eager_guard(): + self.func_reduce_lr_on_plateau() + self.func_reduce_lr_on_plateau() + + def func_warn_or_error(self): with self.assertRaises(ValueError): paddle.callbacks.ReduceLROnPlateau(factor=2.0) # warning @@ -101,6 +107,11 @@ def test_warn_or_error(self): epochs=3, callbacks=[callbacks]) + def test_warn_or_error(self): + with _test_eager_guard(): + self.func_warn_or_error() + self.func_warn_or_error() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/tests/test_callback_visualdl.py 
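The test changes in this and the following files all follow one mechanical pattern: the original test_* body is renamed to func_*, and a new test_* wrapper runs it twice, once inside _test_eager_guard() (new eager mode) and once outside (legacy dygraph). A condensed, hypothetical sketch of that wrapper (TestSomething and func_test_case are illustrative names):

.. code-block:: python

    import unittest
    from paddle.fluid.framework import _test_eager_guard

    class TestSomething(unittest.TestCase):
        def func_test_case(self):
            ...  # the original test body, unchanged

        def test_case(self):
            with _test_eager_guard():
                self.func_test_case()  # run under the new eager mode
            self.func_test_case()      # run again under legacy dygraph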
b/python/paddle/tests/test_callback_visualdl.py index db3b83f2b1414..355e88edd2bec 100644 --- a/python/paddle/tests/test_callback_visualdl.py +++ b/python/paddle/tests/test_callback_visualdl.py @@ -29,6 +29,7 @@ from paddle.vision.datasets import MNIST from paddle.metric import Accuracy from paddle.nn.layer.loss import CrossEntropyLoss +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class MnistDataset(MNIST): @@ -43,7 +44,7 @@ def setUp(self): def tearDown(self): shutil.rmtree(self.save_dir) - def test_visualdl_callback(self): + def func_visualdl_callback(self): # visualdl not support python2 if sys.version_info < (3, ): return @@ -70,6 +71,11 @@ def test_visualdl_callback(self): batch_size=64, callbacks=callback) + def test_visualdl_callback(self): + with _test_eager_guard(): + self.func_visualdl_callback() + self.func_visualdl_callback() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/tests/test_datasets.py b/python/paddle/tests/test_datasets.py index c93bac3ac27e8..be26dff6c0426 100644 --- a/python/paddle/tests/test_datasets.py +++ b/python/paddle/tests/test_datasets.py @@ -22,6 +22,7 @@ import paddle.vision.transforms as T from paddle.vision.datasets import DatasetFolder, ImageFolder, MNIST, FashionMNIST, Flowers from paddle.dataset.common import _check_exists_and_download +from paddle.fluid.framework import _test_eager_guard, _in_legacy_dygraph class TestFolderDatasets(unittest.TestCase): @@ -39,7 +40,7 @@ def setUp(self): def tearDown(self): shutil.rmtree(self.data_dir) - def test_dataset(self): + def func_test_dataset(self): dataset_folder = DatasetFolder(self.data_dir) for _ in dataset_folder: @@ -52,7 +53,12 @@ def test_dataset(self): for _ in dataset_folder: pass - def test_folder(self): + def test_dataset(self): + with _test_eager_guard(): + self.func_test_dataset() + self.func_test_dataset() + + def func_test_folder(self): loader = ImageFolder(self.data_dir) for _ in loader: @@ -64,7 +70,12 @@ def test_folder(self): assert len(loader) == 4 - def test_transform(self): + def test_folder(self): + with _test_eager_guard(): + self.func_test_folder() + self.func_test_folder() + + def func_test_transform(self): def fake_transform(img): return img @@ -78,7 +89,12 @@ def fake_transform(img): for _ in loader: pass - def test_errors(self): + def test_transform(self): + with _test_eager_guard(): + self.func_test_transform() + self.func_test_transform() + + def func_test_errors(self): with self.assertRaises(RuntimeError): ImageFolder(self.empty_dir) with self.assertRaises(RuntimeError): @@ -87,9 +103,14 @@ def test_errors(self): with self.assertRaises(ValueError): _check_exists_and_download('temp_paddle', None, None, None, False) + def test_errors(self): + with _test_eager_guard(): + self.func_test_errors() + self.func_test_errors() + class TestMNISTTest(unittest.TestCase): - def test_main(self): + def func_test_main(self): transform = T.Transpose() mnist = MNIST(mode='test', transform=transform) self.assertTrue(len(mnist) == 10000) @@ -102,9 +123,14 @@ def test_main(self): self.assertTrue(label.shape[0] == 1) self.assertTrue(0 <= int(label) <= 9) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestMNISTTrain(unittest.TestCase): - def test_main(self): + def func_test_main(self): transform = T.Transpose() mnist = MNIST(mode='train', transform=transform) self.assertTrue(len(mnist) == 60000) @@ -133,9 +159,14 @@ def test_main(self): with self.assertRaises(ValueError): mnist = 
MNIST(mode='train', transform=transform, backend=1) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestFASHIONMNISTTest(unittest.TestCase): - def test_main(self): + def func_test_main(self): transform = T.Transpose() mnist = FashionMNIST(mode='test', transform=transform) self.assertTrue(len(mnist) == 10000) @@ -148,9 +179,14 @@ def test_main(self): self.assertTrue(label.shape[0] == 1) self.assertTrue(0 <= int(label) <= 9) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestFASHIONMNISTTrain(unittest.TestCase): - def test_main(self): + def func_test_main(self): transform = T.Transpose() mnist = FashionMNIST(mode='train', transform=transform) self.assertTrue(len(mnist) == 60000) @@ -179,16 +215,26 @@ def test_main(self): with self.assertRaises(ValueError): mnist = FashionMNIST(mode='train', transform=transform, backend=1) - def test_dataset_value(self): + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + + def func_test_dataset_value(self): fmnist = FashionMNIST(mode='train') value = np.mean([np.array(x[0]) for x in fmnist]) # 72.94035223214286 was getted from competitive products np.testing.assert_allclose(value, 72.94035223214286) + def test_dataset_value(self): + with _test_eager_guard(): + self.func_test_dataset_value() + self.func_test_dataset_value() + class TestFlowersTrain(unittest.TestCase): - def test_main(self): + def func_test_main(self): flowers = Flowers(mode='train') self.assertTrue(len(flowers) == 6149) @@ -201,9 +247,14 @@ def test_main(self): self.assertTrue(image.shape[2] == 3) self.assertTrue(label.shape[0] == 1) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestFlowersValid(unittest.TestCase): - def test_main(self): + def func_test_main(self): flowers = Flowers(mode='valid') self.assertTrue(len(flowers) == 1020) @@ -216,9 +267,14 @@ def test_main(self): self.assertTrue(image.shape[2] == 3) self.assertTrue(label.shape[0] == 1) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + class TestFlowersTest(unittest.TestCase): - def test_main(self): + def func_test_main(self): flowers = Flowers(mode='test') self.assertTrue(len(flowers) == 1020) @@ -247,6 +303,11 @@ def test_main(self): with self.assertRaises(ValueError): flowers = Flowers(mode='test', backend=1) + def test_main(self): + with _test_eager_guard(): + self.func_test_main() + self.func_test_main() + if __name__ == '__main__': unittest.main() diff --git a/python/paddle/tests/test_model.py b/python/paddle/tests/test_model.py index 53dce286b71e9..ce3a3bd4b02fe 100644 --- a/python/paddle/tests/test_model.py +++ b/python/paddle/tests/test_model.py @@ -715,6 +715,8 @@ def test_summary_error(self): paddle.summary(nlp_net, (1, 1, 2)) def test_static_flops(self): + if paddle.fluid.framework._in_eager_without_dygraph_check(): + return paddle.disable_static() net = models.__dict__['mobilenet_v2'](pretrained=False) inputs = paddle.randn([1, 3, 224, 224]) diff --git a/python/paddle/text/viterbi_decode.py b/python/paddle/text/viterbi_decode.py index dbf16bfbc6a97..ce5667b134a03 100644 --- a/python/paddle/text/viterbi_decode.py +++ b/python/paddle/text/viterbi_decode.py @@ -13,7 +13,7 @@ # limitations under the License. 
from ..nn import Layer -from ..fluid.framework import core, _non_static_mode +from ..fluid.framework import core, _non_static_mode, in_dygraph_mode from ..fluid.layer_helper import LayerHelper from ..fluid.data_feeder import check_variable_and_dtype, check_type from paddle import _C_ops @@ -58,6 +58,10 @@ def viterbi_decode(potentials, transition = paddle.rand((num_tags, num_tags), dtype='float32') scores, path = paddle.text.viterbi_decode(emission, transition, length, False) # scores: [3.37089300, 1.56825531], path: [[1, 0, 0], [1, 1, 0]] """ + if in_dygraph_mode(): + return _C_ops.final_state_viterbi_decode(potentials, transition_params, + lengths, include_bos_eos_tag) + if _non_static_mode(): return _C_ops.viterbi_decode(potentials, transition_params, lengths, 'include_bos_eos_tag', include_bos_eos_tag) diff --git a/python/paddle/utils/code_gen/api.yaml b/python/paddle/utils/code_gen/api.yaml index 89e6f9faafacd..3a76e89bbb727 100644 --- a/python/paddle/utils/code_gen/api.yaml +++ b/python/paddle/utils/code_gen/api.yaml @@ -1,258 +1,3 @@ -# - api : norm -# args : (Tensor x, int axis, float epsilon, bool is_test) -# output : Tensor(out), Tensor(norm) -# infer_meta : -# func : NormInferMeta -# kernel : -# func : norm -# intermediate : norm -# backward : norm_grad - -# # maxout -# - api : maxout -# args : (Tensor x, int groups, int axis) -# output : Tensor -# infer_meta : -# func : MaxoutInferMeta -# kernel : -# func : maxout -# backward : maxout_grad - -# # batch_norm -# - api : batch_norm -# args : (Tensor x, Tensor scale, Tensor bias, Tensor mean, Tensor variance, float momentum, float epsilon, str data_layout, bool is_test, bool use_global_stats, bool trainable_statistics, bool fuse_with_relu) -# output : Tensor(out), Tensor(mean_out), Tensor(variance_out), Tensor(saved_mean), Tensor(saved_variance), Tensor(reserve_space) -# infer_meta : -# func : XXXXInferMeta -# kernel : -# func : batch_norm -# backward: batch_norm_grad - -# # bilinear_tensor_product ?? 
optional -# - api : bilinear_tensor_product -# args : (Tensor x, Tensor y, Tensor weight, Tensor bias) -# output : Tensor -# infer_meta : -# func : BilinearTensorProductInferMeta -# kernel : -# func : bilinear_tensor_product -# backward : bilinear_tensor_product_grad -# optional : bias - -# broadcast_tensors -# - api : broadcast_tensors -# args : (Tensor[] x) -# output : Tensor[] -# infer_meta : -# func : BroadcastTensorsInferMeta -# kernel : -# func : broadcast_tensors -# backward : broadcast_tensors_grad - -# # dropout -# - api : dropout -# args : (Tensor x, Tensor seed_tensor, float p, bool is_test, str mode, int seed, bool fix_seed) -# output : Tensor(out), Tensor(mask) -# infer_meta : -# func : DropoutInferMeta -# kernel : -# func : dropout - -# # expand -# - api : expand -# args : (Tensor x, IntArray shape) -# output : Tensor -# infer_meta : -# func : ExpandInferMeta -# kernel : -# func : expand -# backward : expand_grad - -# eye -# - api : eye -# args : (int64_t num_rows, int64_t num_colums, DataType dtype = DataType::FLOAT32) -# output : Tensor -# infer_meta : -# func : EyeInferMeta -# kernel : -# func : eye - -# gaussian_random -# - api : gaussian_random -# args : (IntArray shape, float mean, float std, int seed, DataType dtype=DataType::FLOAT32) -# output : Tensor -# infer_meta : -# func : CreateInferMeta -# param : [shape, dtype] -# kernel : -# func : gaussian_random -# data_type : dtype - -# # graph_send_recv -# - api : graph_send_recv -# args : (Tensor x, Tensor src_index, Tensor dst_index, str pool_type) -# output : Tensor(out), Tensor(dst_count) -# infer_meta : -# func : GraphSendRecvInferMeta -# kernel : -# func : graph_send_recv -# backward : graph_send_recv_grad - -# # label_smooth -# - api : label_smooth -# args : (Tensor label, Tensor prior_dist, float epsilon) -# output : Tensor -# infer_meta : -# func : UnchangedInferMeta -# param : [label] -# kernel : -# func : label_smooth -# data_type : label -# optional : prior_dist -# backward : label_smooth_grad - -# linspace start stop number -# - api : linspace -# args : (Tensor start, Tensor stop, Tensor number, DataType dtype=DataType::FLOAT32) -# output : Tensor -# infer_meta : -# func : LinspaceInferMeta -# kernel : -# func : linspace - -# # multi_dot -# - api : multi_dot -# args : (Tensor[] x) -# output : Tensor -# infer_meta : -# func : MultiDotInferMeta -# kernel : -# func : multi_dot -# backward : multi_dot_grad - -# # nll_loss -# - api : nll_loss -# args : (Tensor x, Tensor label, Tensor weight, int64_t ignore_index, str reduction) -# output : Tensor(out), Tensor(total_weight) -# infer_meta : -# func : NllLossRawInferMeta -# kernel : -# func : nll_loss -# data_type : x -# optional : weight -# backward : nll_loss_grad - -# # psroi_pool -# - api : psroi_pool -# args : (Tensor x, Tensor rois, Tensor rois_num, int pooled_weight, int pooled_width, int output_channels, float spatial_scale ) -# output : Tensor -# infer_meta : -# func : PsroiPoolInferMeta -# kernel : -# func : psroi_pool -# backward : psroi_pool_grad -# optional : rois_num - -# # randint -# - api : randint -# args : (int low, int high, IntArray shape, DataType dtype) -# output : Tensor -# infer_meta : -# func : RandintInferMeta -# kernel : -# func : randint - -# # randperm -# - api : randperm -# args : (int n, DataType dtype) -# output : Tensor -# infer_meta : -# func : RandpermInferMeta -# kernel : -# func : randperm - -# # max -# - api : max -# args : (Tensor x, int64_t[] dims, bool keep_dim) -# output : Tensor -# infer_meta : -# func : MaxInferMeta -# 
kernel : -# func : max - -# # phi_transfer_layout | not have python api - -# # truncated_gaussian_random -# - api : truncated_gaussian_random -# args : (int[] shape, float mean, float std, int seed, DataType dtype) -# output : Tensor -# infer_meta : -# func : TruncatedGaussianRandomInferMeta -# kernel : -# func : truncated_gaussian_random - -# # unbind -# - api : unbind -# args : (Tensor x, int axis) -# output : Tensor[] -# infer_meta : -# func : UnbindInferMeta -# kernel : -# func : unbind - -# # uniform_random_raw selected rows ?? - -# - api : pixel_shuffle -# args : (Tensor x, int upscale_factor, const std::string& data_format) -# output : Tensor -# infer_meta : -# func : PixelShuffleInferMeta -# kernel : -# func : pixel_shuffle - -# BilinearTensorProductInferMeta - -# BroadcastTensorsInferMeta - -# bincount -# - api : bincount -# args : (Tensor x, Tensor weight, int minlength) -# output : Tensor -# infer_meta : -# func : BincountInferMeta -# kernel : -# func : bincount -# optional : weight - -# expand_as -# - api : expand_as -# args : (Tensor x, Tensor y, int[] target_shape) -# output : Tensor -# infer_meta : -# func : ExpandAsInferMeta -# kernel : -# func : expand_as -# optional : y -# # backward : expand_as_grad -# # optional : y - -# - api : equal_all -# args : (Tensor x, Tensor y) -# output : Tensor -# infer_meta : -# func : CompareAllInferMeta -# kernel : -# func : equal_all - -# histogram -# - api : histogram -# args : (Tensor x, int64_t bins, int min, int max) -# output : Tensor -# infer_meta : -# func : HistogramInferMeta -# kernel : -# func : histogram - - api : abs args : (Tensor x) output : Tensor @@ -316,7 +61,15 @@ kernel : func : add backward : add_grad - # no_need_buffer : x, y + +- api : add_n + args : (Tensor[] x) + output : Tensor + infer_meta : + func : AddNInferMeta + kernel : + func : add_n + backward : add_n_grad - api : addmm args : (Tensor input, Tensor x, Tensor y, float alpha, float beta) @@ -327,6 +80,45 @@ func : addmm backward : addmm_grad +- api : all + args : (Tensor x, int64_t[] dims={}, bool keep_dim=false) + output : Tensor(out) + infer_meta : + func : ReduceInferMeta + kernel : + func : all + +- api : allclose + args : (Tensor x, Tensor y, Scalar rtol, Scalar atol, bool equal_nan) + output : Tensor(out) + infer_meta : + func : AllValueCompareInferMeta + param: [x, y] + kernel : + func : allclose + +- api : any + args : (Tensor x, int64_t[] dims={}, bool keep_dim=false) + output : Tensor(out) + infer_meta : + func : ReduceInferMeta + kernel : + func : any + +- api : arange + args : (Tensor start, Tensor end, Tensor step, DataType dtype, Place place={}) + output : Tensor + infer_meta : + func : ArangeInferMeta + param : [start, end, step] + kernel : + func : arange + param : [start, end, step] + data_type : dtype + backend : place + data_transform : + support_trans_dtype : start, end, step + # arg_max - api : argmax args : (Tensor x, int64_t axis, bool keepdims, bool flatten, int dtype) @@ -354,7 +146,6 @@ kernel : func : argsort backward : argsort_grad - # no_need_buffer : x # asin - api : asin @@ -414,6 +205,13 @@ kernel : func : auc +# batch_norm +- api : batch_norm + args : (Tensor x, Tensor scale, Tensor bias, Tensor mean, Tensor variance, float momentum, float epsilon, str data_layout, bool is_test, bool use_global_stats, bool trainable_statistics, bool fuse_with_relu) + output : Tensor(out), Tensor(mean_out), Tensor(variance_out), Tensor(saved_mean), Tensor(saved_variance), Tensor(reserve_space) + invoke : batch_norm_impl(x, scale, bias, mean, 
variance, momentum, epsilon, data_layout, is_test, use_global_stats, trainable_statistics, fuse_with_relu) + backward : batch_norm_grad + - api : bce_loss args : (Tensor input, Tensor label) output : Tensor @@ -490,6 +288,15 @@ data_type : x backward : cast_grad +- api : ceil + args : (Tensor x) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + kernel : + func : ceil + backward : ceil_grad + # cholesky - api : cholesky args : (Tensor x, bool upper) @@ -510,6 +317,17 @@ func : cholesky_solve backward : cholesky_solve_grad +- api : clip + args : (Tensor x, Scalar(float) min, Scalar(float) max) + output : Tensor(out) + inplace : (x -> out) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : clip + backward : clip_grad + - api : concat args : (Tensor[] x, Scalar(int64_t) axis) output : Tensor @@ -518,6 +336,7 @@ param : [x, axis] kernel : func : concat + backward : concat_grad - api : conj args : (Tensor x) @@ -526,6 +345,31 @@ func : UnchangedInferMeta kernel : func : conj + backward : conj_grad + +- api : conv2d + args : (Tensor input, Tensor filter, int[] strides, int[] paddings, str paddding_algorithm, int groups, int[] dilations, str data_format, bool use_addto, int workspace_size_MB, bool exhaustive_search) + output : Tensor + invoke : conv2d_impl(input, filter, strides, paddings, paddding_algorithm, groups, dilations, data_format, use_addto, workspace_size_MB, exhaustive_search) + backward : conv2d_grad + +- api : conv2d_transpose + args : (Tensor x, Tensor filter, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) + output : Tensor(out) + infer_meta : + func : ConvTransposeInferMeta + kernel : + func : conv2d_transpose + backward : conv2d_transpose_grad + +- api : conv3d_transpose + args : (Tensor x, Tensor filter, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) + output : Tensor(out) + infer_meta : + func : ConvTransposeInferMeta + kernel : + func : conv3d_transpose + backward : conv3d_transpose_grad - api : copy_to args : (Tensor x, Place place, bool blocking) @@ -561,14 +405,53 @@ func : cross backward : cross_grad -# cumsum +# Part of python API paddle.nn.functional.cross_entropy +- api : cross_entropy_with_softmax + args : (Tensor input, Tensor label, bool soft_label, bool use_softmax, bool numeric_stable_mode, int ignore_index, int axis) + output : Tensor(softmax), Tensor(loss) + infer_meta : + func : CrossEntropyWithSoftmaxInferMeta + kernel : + func : cross_entropy_with_softmax + data_type : input + backward : cross_entropy_with_softmax_grad + +- api : cumprod + args : (Tensor x, int dim) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : cumprod + backward : cumprod_grad + - api : cumsum args : (Tensor x, int axis, bool flatten, bool exclusive, bool reverse) - output : Tensor + output : Tensor(out) infer_meta : func : CumsumInferMeta kernel : func : cumsum + backward : cumsum_grad + +- api : depthwise_conv2d_transpose + args : (Tensor x, Tensor filter, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) + output : Tensor(out) + infer_meta : + func : ConvTransposeInferMeta + kernel : + func : depthwise_conv2d_transpose + backward : depthwise_conv2d_transpose_grad + +- api : det + args : (Tensor x) + output : Tensor + 
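The clip entry above declares `inplace : (x -> out)`, which is presumably what drives generation of the final_state_clip_ API used by the in-place clip_ wrapper earlier in this diff. A short usage sketch of the two public APIs (input values chosen here for illustration):

.. code-block:: python

    import paddle

    x = paddle.to_tensor([1.2, 3.5, 4.5])
    print(paddle.clip(x, min=2.0, max=4.0))  # [2.0, 3.5, 4.0]

    paddle.clip_(x, min=2.0, max=4.0)        # in-place variant
    print(x)                                 # [2.0, 3.5, 4.0]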
infer_meta : + func : UnchangedInferMeta + kernel : + func : determinant + backward : det_grad - api : diag args : (Tensor x, int offset, float padding_value) @@ -586,7 +469,6 @@ kernel : func : diagonal backward : diagonal_grad - # no_need_buffer : x - api : digamma args : (Tensor x) @@ -623,6 +505,16 @@ kernel : func : dot +- api : dropout + args : (Tensor x, Tensor seed_tensor, float p, bool is_test, str mode, int seed, bool fix_seed) + output : Tensor(out), Tensor(mask) + infer_meta : + func : DropoutInferMeta + kernel : + func : dropout + optional : seed_tensor + backward : dropout_grad + # eigh - api : eigh args : (Tensor x, str uplo) @@ -633,6 +525,15 @@ func : eigh backward : eigh_grad +- api : elementwise_pow + args : (Tensor x, Tensor y) + output : Tensor(out) + infer_meta : + func : ElementwiseInferMeta + kernel : + func : elementwise_pow + backward : elementwise_pow_grad + # elu - api : elu args : (Tensor x, float alpha) @@ -696,13 +597,60 @@ func : erfinv backward : erfinv_grad +# exp +- api : exp + args : (Tensor x) + output : Tensor + infer_meta : + func : UnchangedInferMeta + kernel : + func : exp + backward : exp_grad + +# expand_as +- api : expand_as + args : (Tensor x, Tensor y, int[] target_shape) + output : Tensor + infer_meta : + func : ExpandAsInferMeta + kernel : + func : expand_as + optional : y + backward : expand_as_grad + +- api : expm1 + args : (Tensor x) + output : Tensor + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : expm1 + backward : expm1_grad + +- api : eye + args : (int64_t num_rows, int64_t num_columns, DataType dtype=DataType::FLOAT32, Place place={}) + output : Tensor(out) + infer_meta : + func : EyeInferMeta + param : [num_rows, num_columns, dtype] + kernel : + func : eye + param : [num_rows, num_columns, dtype] + data_type : dtype + backend : place + - api : flatten args : (Tensor x, int start_axis, int stop_axis) - output : Tensor + output : Tensor(out), Tensor(xshape) infer_meta : - func : FlattenInferMeta + func : FlattenWithXShapeInferMeta kernel : - func : flatten + func : flatten_with_xshape + backend : x + inplace : (x -> out) + view : (x -> out) + backward : flatten_grad # flip - api : flip @@ -712,6 +660,53 @@ func : FlipInferMeta kernel : func : flip + backward : flip_grad + +- api : floor + args : (Tensor x) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + kernel : + func : floor + backward : floor_grad + +- api : floor_divide + args : (Tensor x, Tensor y) + output : Tensor(out) + infer_meta : + func : ElementwiseInferMeta + kernel : + func : floor_divide + +- api : fmax + args : (Tensor x, Tensor y, int axis) + output : Tensor(out) + infer_meta : + param: [x, y] + func : ElementwiseInferMeta + kernel : + func : fmax + backward : fmax_grad + +- api : fmin + args : (Tensor x, Tensor y, int axis) + output : Tensor(out) + infer_meta : + param: [x, y] + func : ElementwiseInferMeta + kernel : + func : fmin + backward : fmin_grad + +- api : frobenius_norm + args : (Tensor x, int64_t[] axis, bool keep_dim, bool reduce_all) + output : Tensor(out) + infer_meta : + func : ReduceInferMetaBase + kernel : + func : frobenius_norm + backward : frobenius_norm_grad - api : full args : (IntArray shape, Scalar value, DataType dtype=DataType::FLOAT32, Place place=CPUPlace()) @@ -725,6 +720,18 @@ data_type : dtype backend : place +- api : full_batch_size_like + args : (Tensor input, int[] shape, DataType dtype, Scalar value, int input_dim_idx, int output_dim_idx, Place place=CPUPlace()) + output: Tensor + 
infer_meta : + func : FullBatchSizeLikeInferMeta + param : [input, shape, value, dtype, input_dim_idx, output_dim_idx] + kernel : + func : full_batch_size_like + param : [input, shape, value, dtype, input_dim_idx, output_dim_idx] + data_type : dtype + backend : place + - api : full_like args : (Tensor x, Scalar value, DataType dtype = DataType::UNDEFINED, Place place = {}) output: Tensor @@ -737,6 +744,16 @@ data_type : dtype > x backend : place > x +- api : gather + args : (Tensor x, Tensor index, Scalar axis=0) + output : Tensor(out) + infer_meta : + func : GatherInferMeta + kernel : + func : gather + data_type: x + backward : gather_grad + - api : gather_nd args : (Tensor x, Tensor index) output : Tensor @@ -755,21 +772,42 @@ kernel : func : gather_tree -- api : greater +- api : gelu + args : (Tensor x, bool approximate) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : gelu + backward : gelu_grad + +- api : graph_send_recv + args : (Tensor x, Tensor src_index, Tensor dst_index, str pool_type = "SUM", int64_t out_size = 0) + output : Tensor(out), Tensor(dst_count) + infer_meta : + func : GraphSendRecvInferMeta + kernel : + func : graph_send_recv + data_type : x + intermediate : dst_count + backward : graph_send_recv_grad + +- api : greater_equal args : (Tensor x, Tensor y, int axis = -1) output : Tensor infer_meta : func : CompareInferMeta kernel : - func : greater + func : greater_equal -- api : greater_equal +- api : greater_than args : (Tensor x, Tensor y, int axis = -1) output : Tensor infer_meta : func : CompareInferMeta kernel : - func : greater_equal + func : greater_than - api : gumbel_softmax args : (Tensor x, float temperature, bool hard, int axis) @@ -778,7 +816,7 @@ func : GumbelSoftmaxInferMeta kernel : func : gumbel_softmax - # backward : gumbel_softmax_grad + backward : gumbel_softmax_grad # hard_shrink - api : hard_shrink @@ -802,6 +840,25 @@ func : hard_sigmoid backward : hard_sigmoid_grad +- api : hard_swish + args : (Tensor x, float threshold = 6.0, float scale = 6.0, float offset = 3.0) + output : Tensor + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : hard_swish + backward : hard_swish_grad + +# histogram +- api : histogram + args : (Tensor x, int64_t bins, int min, int max) + output : Tensor + infer_meta : + func : HistogramInferMeta + kernel : + func : histogram + - api : huber_loss args : (Tensor input, Tensor label, float delta) output : Tensor(out), Tensor(residual) @@ -809,7 +866,16 @@ func : HuberLossInferMeta kernel : func : huber_loss - # backward : huber_loss_grad + backward : huber_loss_grad + +- api : imag + args : (Tensor x) + output : Tensor + infer_meta : + func : RealAndImagInferMeta + kernel : + func : imag + backward : imag_grad # increment - api : increment @@ -829,7 +895,16 @@ func : index_sample data_type : x backward : index_sample_grad - # no_need_buffer : x + +- api : index_select + args : (Tensor x, Tensor index, int dim) + output : Tensor(out) + infer_meta : + func : IndexSelectInferMeta + kernel : + func : index_select + data_type : x + backward : index_select_grad # is_empty - api : is_empty @@ -840,6 +915,15 @@ kernel : func : is_empty +- api : isclose + args : (Tensor x, Tensor y, Scalar rtol, Scalar atol, bool equal_nan) + output : Tensor(out) + infer_meta : + func : ValueCompareInferMeta + param: [x, y] + kernel : + func : isclose + # isfinite - api : isfinite args : (Tensor x) @@ -867,6 +951,47 @@ kernel : func : isnan, isnan_sr +- api : kldiv_loss + args : 
(Tensor x, Tensor label, str reduction) + output : Tensor(out) + infer_meta : + func : KLDivInferMeta + kernel : + func : kldiv_loss + data_type : x + backward : kldiv_loss_grad + +- api : kron + args : (Tensor x, Tensor y) + output : Tensor + infer_meta : + func : KronInferMeta + kernel : + func : kron + backward : kron_grad + +- api : kthvalue + args : (Tensor x, int k, int axis, bool keepdim) + output : Tensor(out), Tensor(indices) + infer_meta : + func : KthvalueInferMeta + kernel : + func : kthvalue + backward : kthvalue_grad + +# label_smooth +- api : label_smooth + args : (Tensor label, Tensor prior_dist, float epsilon) + output : Tensor + infer_meta : + func : UnchangedInferMeta + param : [label] + kernel : + func : label_smooth + data_type : label + optional : prior_dist + backward : label_smooth_grad + # leaky_relu - api : leaky_relu args : (Tensor x, float alpha) @@ -903,6 +1028,60 @@ kernel : func : less_than +- api : lgamma + args : (Tensor x) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + kernel : + func : lgamma + backward : lgamma_grad + +- api : linspace + args : (Tensor start, Tensor stop, Tensor number, DataType dtype) + output : Tensor + infer_meta : + func : LinspaceInferMeta + kernel : + func : linspace + data_type : dtype + +- api : log + args : (Tensor x) + output : Tensor + infer_meta : + func : UnchangedInferMeta + kernel : + func : log + backward: log_grad + +- api : log10 + args : (Tensor x) + output : Tensor + infer_meta : + func : UnchangedInferMeta + kernel : + func : log10 + backward: log10_grad + +- api : log1p + args : (Tensor x) + output : Tensor + infer_meta : + func : UnchangedInferMeta + kernel : + func : log1p + backward: log1p_grad + +- api : log2 + args : (Tensor x) + output : Tensor + infer_meta : + func : UnchangedInferMeta + kernel : + func : log2 + backward: log2_grad + # log_loss - api : log_loss args : (Tensor input, Tensor label, float epsilon) @@ -913,6 +1092,15 @@ func : log_loss backward : log_loss_grad +- api : log_softmax + args : (Tensor x, int axis) + output : Tensor(out) + infer_meta : + func : UnchangedInferMetaCheckAxis + kernel : + func : log_softmax + backward : log_softmax_grad + # logical_and - api : logical_and args : (Tensor x, Tensor y) @@ -949,6 +1137,17 @@ kernel : func : logical_xor +# logit +- api : logit + args : (Tensor x, float eps = 1e-6f) + output : Tensor + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : logit + backward : logit_grad + # logsigmoid - api : logsigmoid args : (Tensor x) @@ -959,6 +1158,15 @@ func : logsigmoid backward : logsigmoid_grad +- api : logsumexp + args : (Tensor x, int64_t[] axis, bool keepdim, bool reduce_all) + output : Tensor(out) + infer_meta : + func : LogsumexpInferMeta + kernel : + func : logsumexp + backward : logsumexp_grad + # masked_select - api : masked_select args : (Tensor x, Tensor mask) @@ -990,13 +1198,119 @@ func : matrix_power backward : matrix_power_grad +- api : matrix_rank + args : (Tensor x, float tol, bool use_default_tol=true, bool hermitian=false) + output : Tensor(out) + infer_meta : + func : MatrixRankInferMeta + param : [x, use_default_tol, hermitian] + kernel : + func : matrix_rank + +- api : matrix_rank_tol + args : (Tensor x, Tensor atol_tensor, bool use_default_tol=true, bool hermitian=false) + output : Tensor(out) + infer_meta : + func : MatrixRankTolInferMeta + kernel : + func : matrix_rank_tol + +- api : max + args : (Tensor x, int64_t[] dims={}, bool keep_dim=false) + output : Tensor(out) + infer_meta : + 
func : ReduceInferMeta + kernel : + func : max + backward : max_grad + +- api : max_pool2d_with_index + args : (Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive) + output : Tensor(out), Tensor(mask) + infer_meta : + func : MaxPoolWithIndexInferMeta + kernel : + func : max_pool2d_with_index + backward : max_pool2d_with_index_grad + +- api : max_pool3d_with_index + args : (Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive) + output : Tensor(out), Tensor(mask) + infer_meta : + func : MaxPoolWithIndexInferMeta + kernel : + func : max_pool3d_with_index + backward : max_pool3d_with_index_grad + +- api : maximum + args : (Tensor x, Tensor y) + output : Tensor(out) + infer_meta : + func : ElementwiseInferMeta + kernel : + func : maximum + backward : maximum_grad + - api : mean - args : (Tensor x, int64_t[] axis={}, bool keep_dim=false) - output : Tensor + args : (Tensor x, int64_t[] dims={}, bool keep_dim=false) + output : Tensor(out) infer_meta : func : ReduceInferMeta kernel : func : mean + backward : mean_grad + +- api : mean_all + args : (Tensor x) + output : Tensor + infer_meta : + func : MeanAllInferMeta + kernel : + func : mean_all + backward : mean_all_grad + +- api : meshgrid + args : (Tensor[] inputs) + output : Tensor[] + invoke : meshgrid_impl(inputs) + backward : meshgrid_grad + +- api : min + args : (Tensor x, int64_t[] dims={}, bool keep_dim=false) + output : Tensor(out) + infer_meta : + func : ReduceInferMeta + kernel : + func : min + backward : min_grad + +- api : minimum + args : (Tensor x, Tensor y) + output : Tensor(out) + infer_meta : + func : ElementwiseInferMeta + kernel : + func : minimum + backward : minimum_grad + +- api : mish + args : (Tensor x, float lambda) + output : Tensor + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : mish + backward : mish_grad + +- api : mode + args : (Tensor x, int axis, bool keepdim) + output : Tensor(out), Tensor(indices) + infer_meta : + func : ModeInferMeta + kernel : + func : mode + backward : mode_grad - api : modulo args : (Tensor x, Tensor y) @@ -1007,6 +1321,12 @@ func : modulo backward : modulo_grad +- api : momentum + args : (Tensor param, Tensor grad, Tensor velocity, Tensor learning_rate, Tensor master_param, float mu, bool use_nesterov = false, str regularization_method = "", float regularization_coeff = 0.0, bool multi_precision = false, float rescale_grad = 1.0f) + output : Tensor(param_out), Tensor(velocity_out), Tensor(master_param_out) + invoke : momentum_impl(param, grad, velocity, learning_rate, master_param, mu, use_nesterov, regularization_method, regularization_coeff, multi_precision, rescale_grad) + optional : master_param + # multinomial - api : multinomial args : (Tensor x, int num_samples, bool replacement) @@ -1034,6 +1354,27 @@ func : mv backward : mv_grad +- api : nll_loss + args : (Tensor input, Tensor label, Tensor weight, int64_t ignore_index, str reduction) + output : Tensor(out), Tensor(total_weight) + infer_meta : + func : NllLossRawInferMeta + kernel : + func : nll_loss + data_type : input + optional : weight + backward : nll_loss_grad + +- api : norm + args : (Tensor x, int axis, float epsilon, bool is_test) + output : Tensor(out), Tensor(norm) + infer_meta : + func : NormInferMeta + kernel : + func : norm + intermediate : norm + backward : norm_grad + - api : not_equal args : (Tensor x, Tensor y, int axis = -1) output : Tensor @@ -1055,6 +1396,15 @@ output : Tensor invoke : full_like(x, 
1, dtype, place) +- api : p_norm + args : (Tensor x, float porder, int axis, float epsilon, bool keepdim, bool asvector=false) + output : Tensor(out) + infer_meta : + func : PNormInferMeta + kernel : + func : p_norm + backward : p_norm_grad + # pad - api : pad args : (Tensor x, int[] paddings, float pad_value) @@ -1065,6 +1415,15 @@ func : pad # backward : pad_grad +- api : pad3d + args : (Tensor x, IntArray paddings, str mode, float pad_value, str data_format) + output : Tensor(out) + infer_meta : + func : Pad3dInferMeta + kernel : + func : pad3d + backward : pad3d_grad + # pixel_shuffle - api : pixel_shuffle args : (Tensor x, int upscale_factor, str data_format) @@ -1073,7 +1432,7 @@ func : PixelShuffleInferMeta kernel : func : pixel_shuffle - # backward : pixel_shuffle_grad + backward : pixel_shuffle_grad # poisson // no need grad - api : poisson @@ -1089,8 +1448,47 @@ output : Tensor(out) infer_meta : func : PoolInferMeta - kernel: + kernel : func : pool2d + backward : pool2d_grad + +- api : pool3d + args : (Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool ceil_mode, bool exclusive, str data_format, str pooling_type, bool global_pooling, bool adaptive, str padding_algorithm) + output : Tensor(out) + infer_meta : + func : PoolInferMeta + kernel : + func : pool3d + backward : pool3d_grad + +- api : pow + args : (Tensor x, Scalar s) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : pow + backward : pow_grad + +- api : prelu + args : (Tensor x, Tensor alpha, str data_format, str mode) + output : Tensor(out) + infer_meta : + func : PReluInferMeta + kernel : + func : prelu + backward : prelu_grad + +- api : psroi_pool + args : (Tensor x, Tensor boxes, Tensor boxes_num, int pooled_height, int pooled_width, int output_channels, float spatial_scale) + output : Tensor + infer_meta : + func : PsroiPoolInferMeta + kernel : + func : psroi_pool + optional : boxes_num + backward : psroi_pool_grad # put_along_axis - api : put_along_axis @@ -1104,6 +1502,48 @@ data_type : x backward : put_along_axis_grad +- api : qr + args : (Tensor x, str mode) + output : Tensor(q), Tensor(r) + infer_meta : + func : QrInferMeta + kernel : + func : qr + # backward : qr_grad + +- api : randint + args : (int low, int high, IntArray shape, DataType dtype=DataType::INT64, Place place={}) + output : Tensor(out) + infer_meta : + func : RandintInferMeta + param : [low, high, shape, dtype] + kernel : + func : randint + param : [low, high, shape, dtype] + data_type : dtype + backend : place + +- api : randperm + args : (int n, DataType dtype, Place place={}) + output : Tensor + infer_meta : + func : RandpermInferMeta + param : [n, dtype] + kernel : + func : randperm + param : [n, dtype] + data_type : dtype + backend : place + +- api : real + args : (Tensor x) + output : Tensor + infer_meta : + func : RealAndImagInferMeta + kernel : + func : real + backward : real_grad + - api : reciprocal args : (Tensor x) output : Tensor @@ -1120,7 +1560,8 @@ infer_meta : func : ReduceInferMetaBase kernel : - func : reduce_prod + func : prod_raw + backward : reduce_prod_grad - api : relu args : (Tensor x) @@ -1144,6 +1585,55 @@ intermediate : xshape backward: reshape_grad +- api : roi_align + args : (Tensor x, Tensor boxes, Tensor boxes_num, int pooled_height, int pooled_width, float spatial_scale, int sampling_ratio, bool aligned) + output : Tensor + infer_meta : + func : RoiAlignInferMeta + kernel : + func : roi_align + optional : boxes_num + backward : roi_align_grad + +- api 
: roi_pool + args : (Tensor x, Tensor boxes, Tensor boxes_num, int pooled_height, int pooled_width, float spatial_scale) + output : Tensor(out), Tensor(arg_max) + infer_meta : + func : RoiPoolInferMeta + kernel : + func : roi_pool + optional : boxes_num + intermediate : arg_max + backward : roi_pool_grad + +- api : roll + args : (Tensor x, IntArray shifts, int64_t[] axis) + output : Tensor(out) + infer_meta : + func : RollInferMeta + kernel : + func : roll + backward : roll_grad + +- api : round + args : (Tensor x) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + kernel : + func : round + backward : round_grad + +- api : rsqrt + args : (Tensor x) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + kernel : + func : rsqrt + inplace : (x -> out) + backward : rsqrt_grad + - api : scale args : (Tensor x, Scalar scale, float bias, bool bias_after_scale) output : Tensor @@ -1153,6 +1643,7 @@ kernel : func : scale, scale_sr inplace : (x -> out) + backward : scale_grad - api : scatter args : (Tensor x, Tensor index, Tensor updates, bool overwrite) @@ -1163,7 +1654,6 @@ kernel : func : scatter backward : scatter_grad - # no_need_buffer : updates - api : scatter_nd_add args : (Tensor x, Tensor index, Tensor updates) @@ -1174,7 +1664,14 @@ kernel : func : scatter_nd_add backward : scatter_nd_add_grad - # no_need_buffer : updates + +- api : searchsorted + args : (Tensor sorted_sequence, Tensor value, bool out_int32, bool right) + output : Tensor(out) + infer_meta : + func : SearchsortedInferMeta + kernel : + func : searchsorted # segment_pool - api : segment_pool @@ -1184,6 +1681,7 @@ func : SegmentPoolInferMeta kernel : func : segment_pool + data_type : x backward : segment_pool_grad # selu @@ -1197,6 +1695,14 @@ func : selu backward : selu_grad +- api : shape + args : (Tensor input) + output : Tensor + infer_meta : + func : ShapeInferMeta + kernel : + func : shape, shape_sr + # shard_index - api : shard_index args : (Tensor in, int index_num, int nshards, int shard_id, int ignore_value) @@ -1273,6 +1779,15 @@ kernel : func : size +- api : slice + args : (Tensor input, int64_t[] axes, IntArray starts, IntArray ends, int64_t[] infer_flags, int64_t[] decrease_axis) + output : Tensor + infer_meta : + func : SliceRawInferMeta + kernel : + func : slice + backward : slice_grad + # soft_shrink - api : soft_shrink args : (Tensor x, float lambda) @@ -1316,6 +1831,34 @@ func : square backward : square_grad +- api : squeeze + args : (Tensor x, int[] axes) + output : Tensor(xshape), Tensor(out) + infer_meta : + func : SqueezeInferMeta + kernel : + func : squeeze + view: (x -> out) + backward : squeeze_grad + +- api : stack + args : (Tensor[] x, int axis) + output : Tensor + infer_meta : + func : StackInferMeta + kernel : + func : stack + backward : stack_grad + +- api : strided_slice + args : (Tensor x, int[] axes, IntArray starts, IntArray ends, IntArray strides) + output : Tensor + infer_meta : + func : StridedSliceInferMeta + kernel : + func : strided_slice + backward : strided_slice_grad + - api : subtract args : (Tensor x, Tensor y) output : Tensor @@ -1324,16 +1867,27 @@ kernel : func : subtract backward : subtract_grad - # no_need_buffer : x, y - api : sum - args : (Tensor x, int64_t[] axis={}, DataType dtype=DataType::UNDEFINED, bool keep_dim=false) - output : Tensor + args : (Tensor x, int64_t[] dims={}, DataType out_dtype=paddle::experimental::DataType::UNDEFINED, bool keep_dim=false) + output : Tensor(out) infer_meta : func : SumInferMeta kernel : func : sum 
data_type : x + backward : sum_grad + +# The python API paddle.nn.functional.swish has no `beta` argument, so it may be removed later +- api : swish + args : (Tensor x, float beta=1.0) + output : Tensor(out) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : swish + backward : swish_grad # take_along_axis - api : take_along_axis @@ -1397,7 +1951,6 @@ kernel : func : tile backward : tile_grad - # no_need_buffer : x - api : top_k args : (Tensor x, Scalar k, int axis = -1, bool largest = true, bool sorted = true) @@ -1416,7 +1969,6 @@ kernel : func : trace backward : trace_grad - no_need_buffer : x - api : transpose args : (Tensor x, int[] axis) @@ -1436,6 +1988,15 @@ func : triangular_solve # backward : triangular_solve_grad +- api : tril_triu + args : (Tensor x, int diagonal, bool lower) + output : Tensor(out) + infer_meta : + func : TrilTriuInferMeta + kernel : + func : tril_triu + backward : tril_triu_grad + - api : trunc args : (Tensor x) output : Tensor @@ -1445,6 +2006,25 @@ func : trunc backward : trunc_grad +# python API: paddle.nn.initializer.TruncatedNormal +- api : truncated_gaussian_random + args : (int[] shape, float mean, float std, int seed, DataType dtype=DataType::FLOAT32, Place place={}) + output : Tensor + infer_meta : + func : TruncatedGaussianRandomInferMeta + param : [shape, mean, std, seed, dtype] + kernel : + func : truncated_gaussian_random + param : [shape, mean, std, seed, dtype] + backend : place + data_type : dtype + +- api : unbind + args : (Tensor input, int axis) + output : Tensor[] + invoke : unbind_impl(input, axis) + backward : unbind_grad + # unfold - api : unfold args : (Tensor x, int[] kernel_sizes, int[] strides, int[] paddings, int[] dilations) @@ -1454,7 +2034,26 @@ kernel : func : unfold backward : unfold_grad - # no_need_buffer : x + +# The `axis` argument of the Python API paddle.unique is not a vector +- api : unique + args : (Tensor x, bool return_index, bool return_inverse, bool return_counts, int[] axis, DataType dtype=DataType::INT64) + output : Tensor(out), Tensor(indices), Tensor(inverse), Tensor(counts) + infer_meta : + func : UniqueInferMeta + kernel : + func : unique + data_type : x + +- api : unsqueeze + args : (Tensor x, IntArray axes) + output : Tensor(xshape), Tensor(out) + infer_meta : + func : UnsqueezeInferMeta + kernel : + func : unsqueeze + view: (x -> out) + backward : unsqueeze_grad # viterbi_decode - api : viterbi_decode @@ -1483,7 +2082,6 @@ func : WhereIndexInferMeta kernel : func : where_index - # no_need_buffer : x, y # yolo_box - api : yolo_box diff --git a/python/paddle/utils/code_gen/api_base.py b/python/paddle/utils/code_gen/api_base.py index e281484f69744..38aa3e0cb0b73 100644 --- a/python/paddle/utils/code_gen/api_base.py +++ b/python/paddle/utils/code_gen/api_base.py @@ -193,7 +193,7 @@ def parse_output_item(output_item): f"{api_name} : Output type error: the output type only support Tensor and Tensor[], \ but now is {out_type}."
- return out_type, result.group('name') + return output_type_map[out_type], result.group('name') else: if output_item.strip() in output_type_map: @@ -238,7 +238,8 @@ def parse_kernel(self, kernel_config): 'param': None, 'backend': None, 'layout': None, - 'data_type': None + 'data_type': None, + 'use_cudnn': 'false' } if 'backend' in kernel_config and len(kernel_config['backend']) > 0: kernel['backend'] = kernel_config['backend'] @@ -248,6 +249,10 @@ def parse_kernel(self, kernel_config): kernel['data_type'] = kernel_config['data_type'] if 'param' in kernel_config: kernel['param'] = kernel_config['param'] + if 'use_cudnn' in kernel_config: + kernel['use_cudnn'] = kernel_config['use_cudnn'] + if isinstance(kernel['use_cudnn'], bool): + kernel['use_cudnn'] = str(kernel['use_cudnn']).lower() kernel['func'] = [ kernel_fn.strip() for kernel_fn in kernel_config['func'].split(',') ] @@ -480,11 +485,15 @@ def gene_infer_meta(self, kernel_output_names, code_indent) -> str: param_code = param_code + param + "_metas, " elif param in self.optional_vars: meta_tensor_code = meta_tensor_code + f""" -{code_indent} paddle::optional {PREFIX_TENSOR_NAME}meta_ref_{param}(paddle::none); -{code_indent} auto {PREFIX_TENSOR_NAME}meta_{param} = MakeMetaTensor({PREFIX_TENSOR_NAME}{param}); -{code_indent} if ({PREFIX_TENSOR_NAME}meta_{param}) {{ -{code_indent} {PREFIX_TENSOR_NAME}meta_ref_{param} = paddle::make_optional(*{PREFIX_TENSOR_NAME}meta_{param}); -{code_indent} }}""" +{code_indent} paddle::optional {PREFIX_TENSOR_NAME}meta_ref_{param} = paddle::none; +{code_indent} phi::DenseTensor {param}_dt; +{code_indent} phi::MetaTensor {PREFIX_TENSOR_NAME}meta_tmp_{param}({param}_dt); +{code_indent} if ({PREFIX_TENSOR_NAME}{param}_ptr) {{ +{code_indent} {PREFIX_TENSOR_NAME}meta_tmp_{param}.set_dtype( {PREFIX_TENSOR_NAME}{param}_ptr->dtype() ); +{code_indent} {PREFIX_TENSOR_NAME}meta_tmp_{param}.set_dims( {PREFIX_TENSOR_NAME}{param}_ptr->dims() ); +{code_indent} {PREFIX_TENSOR_NAME}meta_tmp_{param}.set_layout( {PREFIX_TENSOR_NAME}{param}_ptr->layout() ); +{code_indent} {PREFIX_TENSOR_NAME}meta_ref_{param} = {PREFIX_TENSOR_NAME}meta_tmp_{param}; +{code_indent} }}\n""" param_code = param_code + f"{PREFIX_TENSOR_NAME}meta_ref_{param}, " else: @@ -709,10 +718,12 @@ def gen_dense_tensor_kernel_code(self, code_indent, inplace_flag=False): outputs_args, kernel_output_names, output_create = self.gene_output( self.outputs['types'], 'SetKernelOutput', code_indent, inplace_flag) api_func_name = self.get_api_func_name() + ('_' if inplace_flag else '') + cudnn_args = '' if self.kernel[ + 'use_cudnn'] == 'false' else ', ' + self.kernel['use_cudnn'] return f""" -{code_indent} const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( -{code_indent} "{self.kernel['func'][0]}", {{kernel_backend, kernel_layout, kernel_data_type}}); {code_indent} VLOG(6) << "{self.api} API kernel key: [" << kernel_backend << ", " << kernel_layout << ", "<< kernel_data_type << "]"; +{code_indent} const auto& kernel = phi::KernelFactory::Instance().SelectKernelOrThrowError( +{code_indent} "{self.kernel['func'][0]}", {{kernel_backend, kernel_layout, kernel_data_type}}{cudnn_args}); {code_indent} VLOG(6) << "{self.api} API kernel: " << kernel; {code_indent} auto* dev_ctx = GetDeviceContextByBackend(kernel_backend); diff --git a/python/paddle/utils/code_gen/api_gen.py b/python/paddle/utils/code_gen/api_gen.py index f95edf6c591ab..4087b55b51324 100644 --- a/python/paddle/utils/code_gen/api_gen.py +++ b/python/paddle/utils/code_gen/api_gen.py 
@@ -163,6 +163,8 @@ def source_include(header_file_path): #include "paddle/phi/infermeta/ternary.h" #include "paddle/fluid/platform/profiler/event_tracing.h" + +DECLARE_bool(conv2d_disable_cudnn); """ diff --git a/python/paddle/utils/code_gen/backward.yaml b/python/paddle/utils/code_gen/backward.yaml index 2d893dc855fc0..602fecc83b8f7 100644 --- a/python/paddle/utils/code_gen/backward.yaml +++ b/python/paddle/utils/code_gen/backward.yaml @@ -1,183 +1,3 @@ -# - backward_api : norm_grad -# forward : norm (Tensor x, int axis, float epsilon, bool is_test) -> Tensor(out), Tensor(norm) -# args : (Tensor x, Tensor norm, Tensor out_grad, int axis, float epsilon, bool is_test) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedInferMeta -# param : [x] -# kernel : -# func : norm_grad - -# - backward_api : matmul_triple_grad -# forward : matmul_double_grad (Tensor x, Tensor y, Tensor out_grad, Tensor dx_grad, Tensor dy_grad, bool transpose_x, bool transpose_y) -> Tensor(d2x), Tensor(d2y), Tensor(dout_grad) -# args : (Tensor x, Tensor y, Tensor out_grad, Tensor dx_grad, Tensor dy_grad, Tensor d2x_grad, Tensor d2y_grad, Tensor dout_grad_grad, bool transpose_x, bool transpose_y) -# output : Tensor(d3x), Tensor(d3y), Tensor(d2out_grad), Tensor(ddx_grad), Tensor(ddy_grad) -# infer_meta : -# func : MatmulTripleGradInferMeta -# kernel : -# func : matmul_triple_grad - -# - backward_api : maxout_grad -# forward : maxout (Tensor x, int groups, int axis) -> Tensor(out) -# args : (Tensor x, Tensor out, Tensor out_grad, int groups, int axis) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedInferMeta -# param : [x] -# kernel : -# func : maxout_grad - -# - backward_api : batch_norm_grad -# forward : batch_norm (Tensor x, Tensor scale, Tensor bias, Tensor mean, Tensor variance, float momentum, float epsilon, str data_layout, bool is_test, bool use_global_stats, bool trainable_statistics, bool fuse_with_relu) -> Tensor(out), Tensor(mean_out), Tensor(variance_out), Tensor(saved_mean), Tensor(saved_variance), Tensor(reserve_space) -# args : (Tensor indices, Tensor x, Tensor out_grad, int axis, bool descending) -# output : Tensor(x_grad), Tensor(scale_grad), Tensor(bias_grad) -# infer_meta : -# func : GeneralTernaryGradInferMeta -# param : [x, scale, bias] -# kernel : -# func : batch_norm_grad - -# - backward_api : bilinear_tensor_product_grad -# forward : bilinear_tensor_product (Tensor x, Tensor y, Tensor weight, Tensor bias) -> Tensor(out) -# args : (Tensor x, Tensor y, Tensor weight, Tensor out_grad) -# output : Tensor(x_grad), Tensor(y_grad), Tensor(weight_grad), Tensor(bias_grad) -# infer_meta : -# func : FourXXXXInferMeta -# param : [x, y, weight, bias] -# kernel : -# func : bilinear_tensor_product_grad -# optional : bias - -# - backward_api : broadcast_tensor_grad -# forward : broadcast_tensors (Tensor[] x) -> Tensor [] (out) -# args : (Tensor [] out_grad) -# output : Tensor [] (x_grad) -# infer_meta : -# func : XXXXInferMeta -# param : [out_grad] -# kernel : -# func : broadcast_tensor_grad - -# - backward_api : gumbel_softmax_grad -# forward : gumbel_softmax (Tensor x, float temperature, bool hard, int axis) -> Tensor(out) -# args : (Tensor out, Tensor out_grad, int axis) -# output : Tensor(x_grad) -# infer_meta : -# func : GumbelSoftmaxGradInferMeta -# param : [out, out_grad, axis] -# kernel : -# func : gumbel_softmax_grad - -# - backward_api : huber_loss_grad -# forward : huber_loss (Tensor input, Tensor label, float delta) -> Tensor(out), Tensor(residual) -# args : (Tensor residual, 
Tensor out_grad, float delta) -# output : Tensor(input_grad), Tensor(label_grad) -# infer_meta : -# func : GeneralBinaryGradInferMeta -# param : [x, y] -# kernel : -# func : where_grad - -# - backward_api : triangular_solve_grad -# forward : triangular_solve (Tensor x, Tensor y, bool upper, bool tranpose, bool unitriangular) -> Tensor(out) -# args : (Tensor x, Tensor y, Tensor out, Tensor out_grad, bool upper, bool tranpose, bool unitriangular) -# output : Tensor(x_grad), Tensor(y_grad) -# infer_meta : -# func : GeneralBinaryGradInferMeta -# param : [x, y] -# kernel : -# func : triangular_solve_grad - -# - backward_api : dropout_grad -# forward : dropout (Tensor x, Tensor seed_tensor, float p, bool is_test, str mode, int seed, bool fix_seed) -> Tensor(out), Tensor(mask) -# args : (Tensor mask, Tensor out_grad, float p, bool is_test, str mode) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedInferMeta -# param : [out_grad] -# kernel : -# func : dropout_grad - -# - backward_api : expand_as_grad -# forward : expand_as (Tensor x, Tensor y, int[] target_shape) -> Tensor(out) -# args : (Tensor x, Tensor out_grad, int[] target_shape) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedInferMeta -# param : [x] -# kernel : -# func : expand_as_grad - -# - backward_api : expand_grad -# forward : expand (Tensor x, IntArray shape) -> Tensor(out) -# args : (Tensor x, Tensor out_grad, IntArray shape) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedGradInferMeta -# param : [x] -# kernel : -# func : expand_grad - -# - backward_api : graph_send_recv_grad -# forward : graph_send_recv (Tensor x, Tensor src_index, Tensor dst_index, str pool_type) -> Tensor(out), Tensor(dst_count) -# args : (Tensor out_grad, Tensor x, Tensor out, Tensor src_index, Tensor dst_index, Tensor dst_count, str pool_type) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedInferMeta -# param : [x] -# kernel : -# func : graph_send_recv_grad - -# - backward_api : multi_dot_grad -# forward : multi_dot (Tensor[] x) -> Tensor(out) -# args : (Tensor out_grad, Tensor[] x) -# output : Tensor[] (x_grad) -# infer_meta : -# func : XXXXInferMeta -# param : [x] -# kernel : -# func : multi_dot_grad - -# - backward_api : pad_grad -# forward : pad (Tensor x, int[] paddings, float pad_value) -> Tensor(out) -# args : (Tensor out_grad, int[] paddings, float pad_value) -# output : Tensor(x_grad) -# infer_meta : -# func : XXXXXInferMeta -# param : [x] -# kernel : -# func : pad_grad - -# - backward_api : pixel_shuffle_grad -# forward : pixel_shuffle (Tensor x, int upscale_factor, str data_format) -> Tensor(out) -# args : (Tensor out_grad, int upscale_factor, str data_format) -# output : Tensor(x_grad) -# infer_meta : -# func : XXXXXInferMeta -# param : [x] -# kernel : -# func : pixel_shuffle_grad - -# - backward_api : poisson_grad -# forward : poisson (Tensor x) -> Tensor(out) -# args : () -# output : Tensor(x_grad) -# infer_meta : -# func : XXXXXInferMeta -# param : [x] -# kernel : -# func : poisson_grad - -# - backward_api : where_index_grad -# forward : where_index (Tensor condition) -> Tensor(out) -# args : (Tensor out_grad, Tensor x, int offset, int axis1, int axis2) -# output : Tensor(x_grad) -# infer_meta : -# func : UnchangedInferMeta -# param : [x] -# kernel : -# func : where_index_grad - - backward_api : abs_grad forward : abs (Tensor x) -> Tensor(out) args : (Tensor x, Tensor out_grad) @@ -221,8 +41,15 @@ func : add_grad no_need_buffer : x, y +- backward_api : add_n_grad + forward : add_n (Tensor[] 
x) -> Tensor(out) + args : (Tensor[] x, Tensor out_grad) + output : Tensor[](x_grad) + invoke : add_n_grad_impl(x, out_grad) + no_need_buffer : x + - backward_api : addmm_grad - forward : scatter (Tensor input, Tensor x, Tensor y, float alpha, float beta) -> Tensor(out) + forward : addmm (Tensor input, Tensor x, Tensor y, float alpha, float beta) -> Tensor(out) args : (Tensor input, Tensor x, Tensor y, Tensor out_grad, float alpha, float beta) output : Tensor(input_grad), Tensor(x_grad), Tensor(y_grad) infer_meta : @@ -240,6 +67,7 @@ param : [x] kernel : func : argsort_grad + no_need_buffer : x - backward_api : asin_grad forward : asin (Tensor x) -> Tensor(out) @@ -262,7 +90,7 @@ func : asinh_grad - backward_api : atan2_grad - forward : cross (Tensor x, Tensor y) -> Tensor(out) + forward : atan2 (Tensor x, Tensor y) -> Tensor(out) args : (Tensor x, Tensor y, Tensor out_grad) output : Tensor(x_grad), Tensor(y_grad) infer_meta : @@ -291,6 +119,18 @@ kernel : func : atanh_grad +- backward_api : batch_norm_grad + forward : batch_norm (Tensor x, Tensor scale, Tensor bias, Tensor mean, Tensor variance, float momentum, float epsilon, str data_layout, bool is_test, bool use_global_stats, bool trainable_statistics, bool fuse_with_relu) -> Tensor(out), Tensor(mean_out), Tensor(variance_out), Tensor(saved_mean), Tensor(saved_variance), Tensor(reserve_space) + args : (Tensor x, Tensor scale, Tensor bias, Tensor mean_out, Tensor variance_out, Tensor saved_mean, Tensor saved_variance, Tensor reserve_space, Tensor out_grad, float momentum, float epsilon, str data_layout, bool is_test, bool use_global_stats, bool trainable_statistics, bool fuse_with_relu) + output : Tensor(x_grad), Tensor(scale_grad), Tensor(bias_grad) + infer_meta : + func : GeneralTernaryGradInferMeta + param : [x, scale, bias] + kernel : + func : batch_norm_grad + data_type : out_grad + optional : mean_out, variance_out, reserve_space + - backward_api : bce_loss_grad forward : bce_loss (Tensor input, Tensor label) -> Tensor(out) args : (Tensor input, Tensor label, Tensor out_grad) @@ -322,6 +162,16 @@ func : cast_grad data_type : out_grad +- backward_api : ceil_grad + forward : ceil(Tensor x) -> Tensor(out) + args : (Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [out_grad] + kernel : + func : ceil_grad + - backward_api : cholesky_grad forward : cholesky (Tensor x, bool upper) -> Tensor(out) args : (Tensor out, Tensor out_grad, bool upper) @@ -333,7 +183,7 @@ func : cholesky_grad - backward_api : cholesky_solve_grad - forward : cholesky (Tensor x, Tensor y, bool upper) -> Tensor(out) + forward : cholesky_solve (Tensor x, Tensor y, bool upper) -> Tensor(out) args : (Tensor x, Tensor y, Tensor out, Tensor out_grad, bool upper) output : Tensor(x_grad), Tensor(y_grad) infer_meta : @@ -342,6 +192,56 @@ kernel : func : cholesky_solve_grad +- backward_api : clip_grad + forward : clip (Tensor x, Scalar min, Scalar max) -> Tensor(out) + args : (Tensor x, Tensor out_grad, Scalar min = 0., Scalar max = 0.) 
+ output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : clip_grad + +- backward_api : concat_grad + forward : concat (Tensor[] x, Scalar axis) -> Tensor(out) + args : (Tensor[] x, Tensor out_grad, Scalar axis = 0) + output : Tensor[](x_grad) + invoke : concat_grad_impl(x, out_grad, axis) + +- backward_api : conj_grad + forward : conj (Tensor x) -> Tensor(out) + args : (Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [out_grad] + kernel : + func : conj + +- backward_api : conv2d_grad + forward : conv2d (Tensor input, Tensor filter, int[] strides, int[] paddings, str paddding_algorithm, int groups, int[] dilations, str data_format, bool use_addto, int workspace_size_MB, bool exhaustive_search) -> Tensor(out) + args : (Tensor input, Tensor filter, Tensor out_grad, int[] strides, int[] paddings, str paddding_algorithm, int groups, int[] dilations, str data_format, bool use_addto, int workspace_size_MB, bool exhaustive_search) + output : Tensor(input_grad), Tensor(filter_grad) + invoke : conv2d_grad_impl(input, filter, out_grad, strides, paddings, paddding_algorithm, groups, dilations, data_format, use_addto, workspace_size_MB, exhaustive_search) + +- backward_api : conv2d_transpose_grad + forward : conv2d_transpose(Tensor x, Tensor filter, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) -> Tensor(out) + args : (Tensor x, Tensor filter, Tensor out_grad, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) + output : Tensor(x_grad), Tensor(filter_grad) + infer_meta : + func : ConvTransposeGradInferMeta + kernel : + func : conv2d_transpose_grad + +- backward_api : conv3d_transpose_grad + forward : conv3d_transpose(Tensor x, Tensor filter, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) -> Tensor(out) + args : (Tensor x, Tensor filter, Tensor out_grad, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) + output : Tensor(x_grad), Tensor(filter_grad) + infer_meta : + func : ConvTransposeGradInferMeta + kernel : + func : conv3d_transpose_grad + - backward_api : cos_grad forward : cos (Tensor x) -> Tensor(out) args : (Tensor x, Tensor out_grad) @@ -362,6 +262,16 @@ kernel : func : cosh_grad +- backward_api : cross_entropy_with_softmax_grad + forward : cross_entropy_with_softmax (Tensor input, Tensor label, bool soft_label, bool use_softmax, bool numeric_stable_mode, int ignore_index, int axis) -> Tensor(softmax), Tensor(loss) + args : (Tensor label, Tensor softmax, Tensor loss_grad, bool soft_label, bool use_softmax, bool numeric_stable_mode, int ignore_index, int axis) + output : Tensor(input_grad) + infer_meta : + func : CrossEntropyWithSoftmaxGradInferMeta + kernel : + func : cross_entropy_with_softmax_grad + data_type : softmax + - backward_api : cross_grad forward : cross (Tensor x, Tensor y, int axis = 9) -> Tensor(out) args : (Tensor x, Tensor y, Tensor out_grad, int axis) @@ -372,6 +282,44 @@ kernel : func : cross_grad +- backward_api : cumprod_grad + forward : cumprod (Tensor x, int dim) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int dim) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + 
kernel : + func : cumprod_grad + +- backward_api : cumsum_grad + forward : cumsum(Tensor x, int axis, bool flatten, bool exclusive, bool reverse) -> Tensor(out) + infer_meta : + func : UnchangedInferMeta + param : [x] + args : (Tensor out_grad, int axis, bool flatten, bool exclusive, bool reverse) + output : Tensor(x_grad) + invoke : cumsum(out_grad, axis, flatten, exclusive, !reverse) + +- backward_api : depthwise_conv2d_transpose_grad + forward : depthwise_conv2d_transpose(Tensor x, Tensor filter, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) -> Tensor(out) + args : (Tensor x, Tensor filter, Tensor out_grad, int[] strides, int[] paddings, int[] output_padding, int[] output_size, str padding_algorithm, int groups, int[] dilations, str data_format) + output : Tensor(x_grad), Tensor(filter_grad) + infer_meta : + func : ConvTransposeGradInferMeta + kernel : + func : depthwise_conv2d_transpose_grad + +- backward_api : det_grad + forward : det (Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : determinant_grad + - backward_api : diagonal_grad forward : diagonal (Tensor x, int offset, int axis1, int axis2) -> Tensor(out) args : (Tensor x, Tensor out_grad, int offset = 0, int axis1 = 0, int axis2 = 1) @@ -381,6 +329,7 @@ param : [x] kernel : func : diagonal_grad + no_need_buffer : x - backward_api : digamma_grad forward : digamma (Tensor x) -> Tensor(out) @@ -404,7 +353,7 @@ - backward_api : divide_grad forward : divide (Tensor x, Tensor y) -> Tensor(out) - args : (Tensor x, Tensor y, Tensor out_grad, int axis = -1) + args : (Tensor x, Tensor y, Tensor out, Tensor out_grad, int axis = -1) output : Tensor(x_grad), Tensor(y_grad) infer_meta : func : GeneralBinaryGradInferMeta @@ -412,6 +361,17 @@ kernel : func : divide_grad +- backward_api : dropout_grad + forward : dropout (Tensor x, Tensor seed_tensor, float p, bool is_test, str mode, int seed, bool fix_seed) -> Tensor(out), Tensor(mask) + args : (Tensor mask, Tensor out_grad, float p, bool is_test, str mode) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [out_grad] + kernel : + func : dropout_grad + optional : seed_tensor + - backward_api : eigh_grad forward : eigh (Tensor x, str uplo) -> Tensor(out_w), Tensor(out_v) args : (Tensor out_w, Tensor out_v, Tensor out_w_grad, Tensor out_v_grad) @@ -422,6 +382,16 @@ kernel : func : eigh_grad +- backward_api : elementwise_pow_grad + forward : elementwise_pow(Tensor x, Tensor y) -> Tensor(out) + args : (Tensor x, Tensor y, Tensor out_grad, int axis=-1) + output : Tensor(x_grad), Tensor(y_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param: [x, y] + kernel : + func : elementwise_pow_grad + - backward_api : elu_grad forward : elu (Tensor x, float alpha) -> Tensor(out) args : (Tensor x, Tensor out, Tensor out_grad, float alpha) @@ -444,7 +414,7 @@ data_type : out_grad - backward_api : erfinv_grad - forward : erf (Tensor x) -> Tensor(out) + forward : erfinv (Tensor x) -> Tensor(out) args : (Tensor out, Tensor out_grad) output : Tensor(x_grad) infer_meta : @@ -453,6 +423,110 @@ kernel : func : erfinv_grad +- backward_api : exp_grad + forward : exp (Tensor x) -> Tensor(out) + args : (Tensor out, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [out] + kernel : + func : exp_grad + +- backward_api : 
expand_as_grad + forward : expand_as (Tensor x, Tensor y, int[] target_shape) -> Tensor(out) + args : (Tensor x, Tensor out_grad, int[] target_shape) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : expand_as_grad + +- backward_api : expm1_grad + forward : expm1 (Tensor x) -> Tensor(out) + args : (Tensor out, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [out] + kernel : + func : expm1_grad + +- backward_api : flatten_grad + forward : flatten(Tensor x, int start_axis, int stop_axis) -> Tensor(out), Tensor(xshape) + args : (Tensor xshape, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : KernelWithXShapeInferMeta + param : [xshape] + kernel : + func : flatten_grad + data_type: out_grad + backend: out_grad + layout: out_grad + +- backward_api : flip_grad + forward : flip (Tensor x, int[] axis) -> Tensor(out) + args : (Tensor out_grad, int[] axis) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [out_grad] + kernel : + func : flip + +- backward_api : floor_grad + forward : floor(Tensor x) -> Tensor(out) + args : (Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [out_grad] + kernel : + func : floor_grad + +- backward_api : fmax_grad + forward : fmax(Tensor x, Tensor y, int axis) -> Tensor(out) + args : (Tensor x, Tensor y, Tensor out_grad, int axis) + output : Tensor(x_grad), Tensor(y_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param: [x, y] + kernel : + func : fmax_grad + +- backward_api : fmin_grad + forward : fmin(Tensor x, Tensor y, int axis) -> Tensor(out) + args : (Tensor x, Tensor y, Tensor out_grad, int axis) + output : Tensor(x_grad), Tensor(y_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param: [x, y] + kernel : + func : fmin_grad + +- backward_api : frobenius_norm_grad + forward : frobenius_norm(Tensor x, int64_t[] axis, bool keep_dim, bool reduce_all) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int64_t[] axis, bool keep_dim, bool reduce_all) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : frobenius_norm_grad + +- backward_api : gather_grad + forward : gather(Tensor x, Tensor index, Scalar axis=0) -> Tensor(out) + args : (Tensor x, Tensor index, Tensor out_grad, Scalar axis=0, bool overwrite=false) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + data_type: x + func : gather_grad + - backward_api : gather_nd_grad forward : gather_nd (Tensor x, Tensor index) -> Tensor(out) args : (Tensor x, Tensor index, Tensor out_grad) @@ -463,6 +537,37 @@ kernel : func : gather_nd_grad +- backward_api : gelu_grad + forward : gelu(Tensor x, bool approximate) -> Tensor(out) + args : (Tensor x, Tensor out_grad, bool approximate) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : gelu_grad + +- backward_api : graph_send_recv_grad + forward : graph_send_recv (Tensor x, Tensor src_index, Tensor dst_index, str pool_type = "SUM", int64_t out_size = 0) -> Tensor(out), Tensor(dst_count) + args : (Tensor x, Tensor src_index, Tensor dst_index, Tensor out, Tensor dst_count, Tensor out_grad, str pool_type = "SUM") + output : Tensor(x_grad) + infer_meta : + func : GeneralUnaryGradInferMeta + param : [x] + kernel : + func : graph_send_recv_grad + optional: out, dst_count + +- backward_api : gumbel_softmax_grad + forward : 
gumbel_softmax (Tensor x, float temperature, bool hard, int axis) -> Tensor(out) + args : (Tensor out, Tensor out_grad, int axis) + output : Tensor(x_grad) + infer_meta : + func : GumbelSoftmaxGradInferMeta + param : [out, out_grad, axis] + kernel : + func : gumbel_softmax_grad + - backward_api : hard_shrink_grad forward : hard_shrink (Tensor x, float threshold) -> Tensor(out) args : (Tensor x, Tensor out_grad, float threshold) @@ -483,6 +588,32 @@ kernel : func : hard_sigmoid_grad +- backward_api : hard_swish_grad + forward : hard_swish (Tensor x, float threshold = 6.0, float scale = 6.0, float offset = 3.0) -> Tensor(out) + args : (Tensor x, Tensor out_grad, float threshold, float scale, float offset) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : hard_swish_grad + +- backward_api : huber_loss_grad + forward : huber_loss (Tensor input, Tensor label, float delta) -> Tensor(out), Tensor(residual) + args : (Tensor residual, Tensor out_grad, float delta) + output : Tensor(input_grad), Tensor(label_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param : [residual, residual] + kernel : + func : huber_loss_grad + +- backward_api : imag_grad + forward : imag (Tensor x) -> Tensor(out) + args : (Tensor out_grad) + output : Tensor(x_grad) + invoke : imag_grad_impl(out_grad) + - backward_api : index_sample_grad forward : index_sample (Tensor x, Tensor index) -> Tensor(out) args : (Tensor x, Tensor index, Tensor out_grad) @@ -493,6 +624,49 @@ kernel : func : index_sample_grad data_type : out_grad + no_need_buffer : x + +- backward_api : index_select_grad + forward : index_select(Tensor x, Tensor index, int dim) -> Tensor(out) + args : (Tensor x, Tensor index, Tensor out_grad, int dim) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : index_select_grad + data_type : x + +- backward_api : kldiv_loss_grad + forward : kldiv_loss(Tensor x, Tensor label, str reduction) -> Tensor(out) + args : (Tensor x, Tensor label, Tensor out_grad, str reduction) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : kldiv_loss_grad + +- backward_api : kron_grad + forward : kron (Tensor x, Tensor y) -> Tensor(out) + args : (Tensor x, Tensor y, Tensor out_grad) + output : Tensor(x_grad), Tensor(y_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param : [x, y] + kernel : + func : kron_grad + data_type : out_grad + +- backward_api : kthvalue_grad + forward : kthvalue(Tensor x, int k, int axis, bool keepdim) -> Tensor(out), Tensor(indices) + args : (Tensor x, Tensor indices, Tensor out_grad, int k, int axis, bool keepdim) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : kthvalue_grad - backward_api : label_smooth_grad forward : label_smooth (Tensor label, Tensor prior_dist, float epsilon) -> Tensor(out) @@ -516,7 +690,7 @@ func : leaky_relu_grad - backward_api : lerp_grad - forward : transpose (Tensor x, Tensor y, Tensor weight) -> Tensor(out) + forward : lerp (Tensor x, Tensor y, Tensor weight) -> Tensor(out) args : (Tensor x, Tensor y, Tensor weight, Tensor out, Tensor out_grad) output : Tensor(x_grad), Tensor(y_grad) infer_meta : @@ -525,6 +699,56 @@ kernel : func : lerp_grad +- backward_api : lgamma_grad + forward : lgamma(Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : lgamma_grad 
+ +- backward_api : log10_grad + forward : log10 (Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : log10_grad + +- backward_api : log1p_grad + forward : log1p (Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : log1p_grad + +- backward_api : log2_grad + forward : log2 (Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : log2_grad + +- backward_api : log_grad + forward : log (Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : log_grad + - backward_api : log_loss_grad forward : log_loss (Tensor input, Tensor label, float epsilon) -> Tensor(out) args : (Tensor input, Tensor label, Tensor out_grad, float epsilon) @@ -535,6 +759,26 @@ kernel : func : log_loss_grad +- backward_api : log_softmax_grad + forward : log_softmax(Tensor x, int axis) -> Tensor(out) + args : (Tensor out, Tensor out_grad, int axis) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [out] + kernel : + func : log_softmax_grad + +- backward_api : logit_grad + forward : logit (Tensor x, float eps = 1e-6f) -> Tensor(out) + args : (Tensor x, Tensor out_grad, float eps) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : logit_grad + - backward_api : logsigmoid_grad forward : logsigmoid (Tensor x) -> Tensor(out) args : (Tensor x, Tensor out_grad) @@ -545,6 +789,16 @@ kernel : func : logsigmoid_grad +- backward_api : logsumexp_grad + forward : logsumexp(Tensor x, int64_t[] axis, bool keepdim, bool reduce_all) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int64_t[] axis, bool keepdim, bool reduce_all) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : logsumexp_grad + - backward_api : masked_select_grad forward : masked_select (Tensor x, Tensor mask) -> Tensor(out) args : (Tensor x, Tensor mask, Tensor out_grad) @@ -557,15 +811,16 @@ data_type : x - backward_api : matmul_double_grad - forward : matmul_grad (Tensor x, Tensor y, Tensor out_grad, bool transpose_x, bool transpose_y) -> Tensor(dx), Tensor(dy) - args : (Tensor x, Tensor y, Tensor out_grad, Tensor dx_grad, Tensor dy_grad, bool transpose_x, bool transpose_y) - output : Tensor(d2x), Tensor(d2y), Tensor(dout_grad) + forward : matmul_grad (Tensor x, Tensor y, Tensor grad_out, bool transpose_x=false, bool transpose_y=false) -> Tensor(grad_x), Tensor(grad_y) + args : (Tensor x, Tensor y, Tensor grad_out, Tensor grad_x_grad, Tensor grad_y_grad, bool transpose_x=false, bool transpose_y=false) + output : Tensor(x_grad), Tensor(y_grad), Tensor(grad_out_grad) infer_meta : func : GeneralTernaryGradInferMeta - param : [x, y, out_grad] + param : [x, y, grad_out] kernel : func : matmul_double_grad - optional : dx_grad, dy_grad + backward : matmul_triple_grad + optional : grad_x_grad, grad_y_grad - backward_api : matmul_grad forward : matmul (Tensor x, Tensor y, bool transpose_x=false, bool transpose_y=false) -> Tensor(out) @@ -576,6 +831,18 @@ param : [x, y] kernel : func : matmul_grad + backward : matmul_double_grad + +- backward_api : matmul_triple_grad + forward : matmul_double_grad (Tensor 
x, Tensor y, Tensor fwd_grad_out, Tensor fwd_grad_grad_x, Tensor fwd_grad_grad_y, bool transpose_x=false, bool transpose_y=false) -> Tensor(grad_x), Tensor(grad_y), Tensor(grad_grad_out) + args : (Tensor x, Tensor y, Tensor fwd_grad_out, Tensor fwd_grad_grad_x, Tensor fwd_grad_grad_y, Tensor grad_x_grad, Tensor grad_y_grad, Tensor grad_grad_out_grad, bool transpose_x=false, bool transpose_y=false) + output : Tensor(x_grad), Tensor(y_grad), Tensor(fwd_grad_out_grad), Tensor(fwd_grad_grad_x_grad), Tensor(fwd_grad_grad_y_grad) + infer_meta : + func : GeneralQuinaryGradInferMeta + param : [x, y, fwd_grad_out, fwd_grad_grad_x, fwd_grad_grad_y] + kernel : + func : matmul_triple_grad + optional : grad_x_grad, grad_y_grad, grad_grad_out_grad - backward_api : matrix_power_grad forward : matrix_power (Tensor x, int n) -> Tensor(out) @@ -587,8 +854,112 @@ kernel : func : matrix_power_grad +- backward_api : max_grad + forward: max (Tensor x, int64_t[] dims={}, bool keep_dim=false) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int64_t[] dims={}, bool keep_dim=false, bool reduce_all=false) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : max_grad + +- backward_api : max_pool2d_with_index_grad + forward : max_pool2d_with_index(Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive) -> Tensor(out), Tensor(mask) + args : (Tensor x, Tensor mask, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive) + output : Tensor(x_grad) + infer_meta : + func : MaxPoolWithIndexGradInferMeta + kernel : + func : max_pool2d_with_index_grad + +- backward_api : max_pool3d_with_index_grad + forward : max_pool3d_with_index(Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive) -> Tensor(out), Tensor(mask) + args : (Tensor x, Tensor mask, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool global_pooling, bool adaptive) + output : Tensor(x_grad) + infer_meta : + func : MaxPoolWithIndexGradInferMeta + kernel : + func : max_pool3d_with_index_grad + +- backward_api : maximum_grad + forward : maximum(Tensor x, Tensor y) -> Tensor(out) + args : (Tensor x, Tensor y, Tensor out_grad, int axis=-1) + output : Tensor(x_grad), Tensor(y_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param: [x, y] + kernel : + func : maximum_grad + +- backward_api : mean_all_grad + forward : mean_all(Tensor x) -> Tensor(out) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : mean_all_grad + +- backward_api : mean_grad + forward: mean (Tensor x, int64_t[] dims={}, bool keep_dim=false) -> Tensor(out) + args : (Tensor x, Tensor out_grad, int64_t[] dims={}, bool keep_dim=false, bool reduce_all=false) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : mean_grad + +- backward_api : meshgrid_grad + forward : meshgrid (Tensor[] inputs) -> Tensor[](outputs) + args : (Tensor[] inputs, Tensor[] outputs_grad) + output : Tensor[](inputs_grad) + invoke : meshgrid_grad_impl(inputs, outputs_grad) + +- backward_api : min_grad + forward: min (Tensor x, int64_t[] dims={}, bool keep_dim=false) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int64_t[] dims={}, bool keep_dim=false, bool reduce_all=false) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + 
func : min_grad + +- backward_api : minimum_grad + forward : minimum(Tensor x, Tensor y) -> Tensor(out) + args : (Tensor x, Tensor y, Tensor out_grad, int axis=-1) + output : Tensor(x_grad), Tensor(y_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param: [x, y] + kernel : + func : minimum_grad + +- backward_api : mish_grad + forward : mish (Tensor x, float threshold) -> Tensor(out) + args : (Tensor x, Tensor out_grad, float threshold) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : mish_grad + +- backward_api : mode_grad + forward : mode(Tensor x, int axis, bool keepdim) -> Tensor(out), Tensor(indices) + args : (Tensor x, Tensor indices, Tensor out_grad, int axis, bool keepdim) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : mode_grad + - backward_api : modulo_grad - forward : add (Tensor x, Tensor y) -> Tensor(out) + forward : modulo (Tensor x, Tensor y) -> Tensor(out) args : (Tensor x, Tensor y, Tensor out_grad, int axis = -1) output : Tensor(x_grad), Tensor(y_grad) infer_meta : @@ -619,27 +990,103 @@ func : mv_grad - backward_api : nll_loss_grad - forward : nll_loss (Tensor x, Tensor label, Tensor weight, int64_t ignore_index, str reduction) -> Tensor(out), Tensor(total_weight) - args : (Tensor x, Tensor label, Tensor weight, Tensor total_weight, Tensor out_grad, int64_t ignore_index, str reduction) - output : Tensor (x_grad) + forward : nll_loss (Tensor input, Tensor label, Tensor weight, int64_t ignore_index, str reduction) -> Tensor(out), Tensor(total_weight) + args : (Tensor input, Tensor label, Tensor weight, Tensor total_weight, Tensor out_grad, int64_t ignore_index, str reduction) + output : Tensor(input_grad) infer_meta : - func : UnchangedInferMeta - param : [x] + func : NllLossGradInferMeta kernel : func : nll_loss_grad - data_type : out_grad + data_type : input optional : weight -- backward_api : psroi_pool_grad - forward : psroi_pool (Tensor x, Tensor rois, Tensor rois_num, int pooled_weight, int pooled_width, int output_channels, float spatial_scale ) -> Tensor(out) - args : (Tensor x, Tensor rois, Tensor rois_num, Tensor out_grad, int pooled_weight, int pooled_width, int output_channels, float spatial_scale) +- backward_api : norm_grad + forward : norm (Tensor x, int axis, float epsilon, bool is_test) -> Tensor(out), Tensor(norm) + args : (Tensor x, Tensor norm, Tensor out_grad, int axis, float epsilon, bool is_test) output : Tensor(x_grad) infer_meta : func : UnchangedInferMeta param : [x] + kernel : + func : norm_grad + +- backward_api : p_norm_grad + forward : p_norm(Tensor x, float porder, int axis, float epsilon, bool keepdim, bool asvector=false) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, float porder, int axis, float epsilon, bool keepdim, bool asvector) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : p_norm_grad + +- backward_api : pad3d_grad + forward : pad3d(Tensor x, IntArray paddings, str mode, float pad_value, str data_format) -> Tensor(out) + args : (Tensor x, Tensor out_grad, IntArray paddings, str mode, float pad_value, str data_format) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : pad3d_grad + +- backward_api : pixel_shuffle_grad + forward : pixel_shuffle (Tensor x, int upscale_factor, str data_format) -> Tensor(out) + args : (Tensor out_grad, int upscale_factor, str data_format) + output : Tensor(x_grad) + 
infer_meta : + func : PixelShuffleGradInferMeta + kernel : + func : pixel_shuffle_grad + +- backward_api : pool2d_grad + forward : pool2d(Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool ceil_mode, bool exclusive, str data_format, str pooling_type, bool global_pooling, bool adaptive, str padding_algorithm) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool ceil_mode, bool exclusive, str data_format, str pooling_type, bool global_pooling, bool adaptive, str padding_algorithm) + output : Tensor(x_grad) + infer_meta : + func : PoolGradInferMeta + kernel : + func : pool2d_grad + +- backward_api : pool3d_grad + forward : pool3d(Tensor x, int[] kernel_size, int[] strides, int[] paddings, bool ceil_mode, bool exclusive, str data_format, str pooling_type, bool global_pooling, bool adaptive, str padding_algorithm) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int[] kernel_size, int[] strides, int[] paddings, bool ceil_mode, bool exclusive, str data_format, str pooling_type, bool global_pooling, bool adaptive, str padding_algorithm) + output : Tensor(x_grad) + infer_meta : + func : PoolGradInferMeta + kernel : + func : pool3d_grad + +- backward_api : pow_grad + forward : pow(Tensor x, Scalar s) -> Tensor(out) + args : (Tensor x, Tensor out_grad, Scalar s=-1) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [x] + kernel : + func : pow_grad + +- backward_api : prelu_grad + forward : prelu(Tensor x, Tensor alpha, str data_format, str mode) -> Tensor(out) + args : (Tensor x, Tensor alpha, Tensor out_grad, str data_format, str mode) + output : Tensor(x_grad), Tensor(alpha_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param: [x, alpha] + kernel : + func : prelu_grad + +- backward_api : psroi_pool_grad + forward : psroi_pool (Tensor x, Tensor boxes, Tensor boxes_num, int pooled_height, int pooled_width, int output_channels, float spatial_scale) -> Tensor(out) + args : (Tensor x, Tensor boxes, Tensor boxes_num, Tensor out_grad, int pooled_height, int pooled_width, int output_channels, float spatial_scale) + output : Tensor(x_grad) + infer_meta : + func : GeneralUnaryGradInferMeta + param : [x] kernel : func : psroi_pool_grad - optional : rois_num + optional : boxes_num # output is optional - backward_api : put_along_axis_grad @@ -652,6 +1099,12 @@ kernel : func : put_along_axis_grad +- backward_api : real_grad + forward : real (Tensor x) -> Tensor(out) + args : (Tensor out_grad) + output : Tensor(x_grad) + invoke : real_grad_impl(out_grad) + - backward_api : reciprocal_grad forward : reciprocal (Tensor x) -> Tensor(out) args : (Tensor out, Tensor out_grad) @@ -662,6 +1115,16 @@ kernel : func : reciprocal_grad +- backward_api : reduce_prod_grad + forward : reduce_prod (Tensor x, int64_t[] dims, bool keep_dim, bool reduce_all) -> Tensor(out) + args : (Tensor x, Tensor out, Tensor out_grad, int64_t[] dims, bool keep_dim, bool reduce_all) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : prod_grad + - backward_api : relu_double_grad forward : relu_grad (Tensor out, Tensor grad_out) -> Tensor(grad_x) args : (Tensor out, Tensor grad_x_grad) @@ -697,11 +1160,64 @@ backend: out_grad layout: out_grad +- backward_api : roi_align_grad + forward : roi_align (Tensor x, Tensor boxes, Tensor boxes_num, int pooled_height, int pooled_width, float spatial_scale, int sampling_ratio, bool aligned) -> Tensor(out) + args : (Tensor x, 
Tensor boxes, Tensor boxes_num, Tensor out_grad, int pooled_height, int pooled_width, float spatial_scale, int sampling_ratio, bool aligned) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : roi_align_grad + optional : boxes_num + +- backward_api : roi_pool_grad + forward : roi_pool (Tensor x, Tensor boxes, Tensor boxes_num, int pooled_height, int pooled_width, float spatial_scale) -> Tensor(out), Tensor(arg_max) + args : (Tensor x, Tensor boxes, Tensor boxes_num, Tensor arg_max, Tensor out_grad, int pooled_height, int pooled_width, float spatial_scale) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : roi_pool_grad + optional : boxes_num + +- backward_api : roll_grad + forward : roll(Tensor x, IntArray shifts, int64_t[] axis) -> Tensor(out) + args : (Tensor x, Tensor out_grad, IntArray shifts, int64_t[] axis) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : roll_grad + data_type : x + +- backward_api : round_grad + forward : round(Tensor x) -> Tensor(out) + args : (Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param: [out_grad] + kernel : + func : round_grad + +- backward_api : rsqrt_grad + forward : rsqrt (Tensor x) -> Tensor(out) + args : (Tensor out, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [out] + kernel : + func : rsqrt_grad + - backward_api : scale_grad forward : scale (Tensor x, Scalar scale, float bias, bool bias_after_scale) -> Tensor(out) - args : (Tensor out_grad, Scalar scale, float bias=0.0, bool bias_after_scale=true) + args : (Tensor out_grad, Scalar scale=1.0, float bias=0.0, bool bias_after_scale=true) output : Tensor(x_grad) - invoke : scale(out_grad, scale, bias, bias_after_scale) + invoke : scale(out_grad, scale, 0.0, bias_after_scale) - backward_api : scatter_grad forward : scatter (Tensor x, Tensor index, Tensor updates, bool overwrite) -> Tensor(out) @@ -712,16 +1228,18 @@ param : [index, updates, out_grad, overwrite] kernel : func : scatter_grad + no_need_buffer : updates - backward_api : scatter_nd_add_grad - forward : scatter (Tensor x, Tensor index, Tensor updates) -> Tensor(out) + forward : scatter_nd_add (Tensor x, Tensor index, Tensor updates) -> Tensor(out) args : (Tensor index, Tensor updates, Tensor out_grad) output : Tensor(x_grad), Tensor(updates_grad) infer_meta : func : ScatterNdAddGradInferMeta param : [index, updates, out_grad] kernel : - func : scatter_nd_grad + func : scatter_nd_add_grad + no_need_buffer : updates - backward_api : segment_pool_grad forward : segment_pool (Tensor x, Tensor segment_ids, str pooltype) -> Tensor(out), Tensor(summed_ids) @@ -732,6 +1250,8 @@ param : [x] kernel : func : segment_pool_grad + data_type : x + optional : summed_ids - backward_api : selu_grad forward : selu (Tensor x, float scale, float alpha) -> Tensor(out) @@ -753,6 +1273,17 @@ kernel : func : sigmoid_cross_entropy_with_logits_grad +- backward_api : sigmoid_double_grad + forward : sigmoid_grad (Tensor out, Tensor fwd_grad_out) -> Tensor(grad_x) + args : (Tensor out, Tensor fwd_grad_out, Tensor grad_x_grad) + output : Tensor(out_grad), Tensor(fwd_grad_out_grad) + infer_meta : + func : GeneralBinaryGradInferMeta + param : [out, fwd_grad_out] + kernel : + func : sigmoid_double_grad + backward : sigmoid_triple_grad + - backward_api : sigmoid_grad forward : sigmoid (Tensor x) -> Tensor(out) args : (Tensor out, 
Tensor out_grad) @@ -762,6 +1293,17 @@ param : [out] kernel : func : sigmoid_grad + backward : sigmoid_double_grad + +- backward_api : sigmoid_triple_grad + forward : sigmoid_double_grad (Tensor out, Tensor fwd_grad_out, Tensor grad_grad_x) -> Tensor(grad_out), Tensor(grad_grad_out) + args : (Tensor out, Tensor fwd_grad_out, Tensor grad_grad_x, Tensor grad_out_grad, Tensor grad_grad_out_grad) + output : Tensor(out_grad), Tensor(fwd_grad_out_grad), Tensor(grad_grad_x_grad) + infer_meta : + func : GeneralTernaryGradInferMeta + param : [out, fwd_grad_out, grad_grad_x] + kernel : + func : sigmoid_double_grad - backward_api : silu_grad forward : silu (Tensor x) -> Tensor(out) @@ -793,6 +1335,16 @@ kernel : func : sinh_grad +- backward_api : slice_grad + forward : slice (Tensor input, int64_t[] axes, IntArray starts, IntArray ends, int64_t[] infer_flags, int64_t[] decrease_axis) -> Tensor(out) + args : (Tensor input, Tensor out_grad, int64_t[] axes, IntArray starts, IntArray ends, int64_t[] infer_flags, int64_t[] decrease_axis) + output : Tensor(input_grad) + infer_meta : + func : UnchangedInferMeta + param : [input] + kernel : + func : slice_grad + - backward_api : soft_shrink_grad forward : soft_shrink (Tensor x, float lambda) -> Tensor(out) args : (Tensor x, Tensor out_grad, float lambda) @@ -840,6 +1392,33 @@ kernel : func : square_grad +- backward_api : squeeze_grad + forward : squeeze(Tensor x, int[] axes) -> Tensor(xshape), Tensor(out) + args : (Tensor xshape, Tensor out_grad, int[] axes) + output : Tensor(x_grad) + infer_meta : + func : KernelWithXShapeInferMeta + param: [xshape] + kernel : + func : squeeze_grad + +- backward_api : stack_grad + forward : stack (Tensor[] x, int axis) -> Tensor(out) + args : (Tensor[] x, Tensor out_grad, int axis) + output : Tensor[](x_grad) + invoke : stack_grad_impl(x, out_grad, axis) + no_need_buffer : x + +- backward_api : strided_slice_grad + forward : strided_slice (Tensor x, int[] axes, IntArray starts, IntArray ends, IntArray strides) -> Tensor(out) + args : (Tensor x, Tensor out_grad, int[] axes, IntArray starts, IntArray ends, IntArray strides) + output : Tensor(x_grad) + infer_meta : + func : GeneralUnaryGradInferMeta + param : [x] + kernel : + func : strided_slice_grad + - backward_api : subtract_grad forward : subtract (Tensor x, Tensor y) -> Tensor(out) args : (Tensor x, Tensor y, Tensor out_grad, int axis = -1) @@ -849,6 +1428,27 @@ param : [x, y] kernel : func : subtract_grad + no_need_buffer : x, y + +- backward_api : sum_grad + forward : sum (Tensor x, int64_t[] dims={}, DataType out_dtype=paddle::experimental::DataType::UNDEFINED, bool keep_dim=false) -> Tensor(out) + args : (Tensor x, Tensor out_grad, int64_t[] dims, bool keep_dim, bool reduce_all=false) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [x] + kernel : + func : sum_grad + +- backward_api : swish_grad + forward : swish (Tensor x, float beta=1.0) -> Tensor(out) + args : (Tensor x, Tensor out_grad, float beta=1.0) + output : Tensor(x_grad) + infer_meta : + func : GeneralUnaryGradInferMeta + param : [x] + kernel : + func : swish_grad - backward_api : take_along_axis_grad forward : take_along_axis (Tensor x, Tensor index, int axis) -> Tensor(out) @@ -909,6 +1509,7 @@ param : [x] kernel : func : tile_grad + no_need_buffer : x - backward_api : top_k_grad forward : top_k (Tensor x, Scalar k, int axis = -1, bool largest = true, bool sorted = true) -> Tensor(out), Tensor(indices) @@ -929,6 +1530,7 @@ param : [x] kernel : func : trace_grad +
no_need_buffer : x - backward_api : transpose_grad forward : transpose (Tensor x, int[] axis) -> Tensor(out) @@ -940,6 +1542,16 @@ kernel : func : transpose_grad +- backward_api : tril_triu_grad + forward : tril_triu(Tensor x, int diagonal, bool lower) -> Tensor(out) + args : (Tensor out_grad, int diagonal, bool lower) + output : Tensor(x_grad) + infer_meta : + func : UnchangedInferMeta + param : [out_grad] + kernel : + func : tril_triu_grad + - backward_api : trunc_grad forward : trunc (Tensor x) -> Tensor(out) args : (Tensor out_grad) @@ -950,6 +1562,12 @@ kernel : func : trunc_grad +- backward_api : unbind_grad + forward : unbind (Tensor input, int axis) -> Tensor[](out) + args : (Tensor[] out_grad, int axis) + output : Tensor(input_grad) + invoke : stack(out_grad, axis) + - backward_api : unfold_grad forward : unfold (Tensor x, int[] kernel_sizes, int[] strides, int[] paddings, int[] dilations) -> Tensor(out) args : (Tensor x, Tensor out_grad, int[] kernel_sizes, int[] strides, int[] paddings, int[] dilations) @@ -959,6 +1577,17 @@ param : [x] kernel : func : unfold_grad + no_need_buffer : x + +- backward_api : unsqueeze_grad + forward : unsqueeze(Tensor x, IntArray axes) -> Tensor(xshape), Tensor(out) + args : (Tensor xshape, Tensor out_grad) + output : Tensor(x_grad) + infer_meta : + func : KernelWithXShapeInferMeta + param: [xshape] + kernel : + func : unsqueeze_grad - backward_api : where_grad forward : where (Tensor condition, Tensor x, Tensor y) -> Tensor(out) diff --git a/python/paddle/utils/code_gen/backward_api_gen.py b/python/paddle/utils/code_gen/backward_api_gen.py index e26f65387878c..46aa3e7e23d51 100644 --- a/python/paddle/utils/code_gen/backward_api_gen.py +++ b/python/paddle/utils/code_gen/backward_api_gen.py @@ -176,9 +176,12 @@ def source_include(header_file_path): #include "paddle/phi/core/kernel_registry.h" #include "paddle/phi/api/include/api.h" #include "paddle/phi/infermeta/backward.h" +#include "paddle/phi/infermeta/unary.h" #include "paddle/fluid/eager/api/utils/global_utils.h" #include "paddle/fluid/platform/profiler/event_tracing.h" + +DECLARE_bool(conv2d_disable_cudnn); """ diff --git a/python/paddle/utils/code_gen/sparse_api.yaml b/python/paddle/utils/code_gen/sparse_api.yaml index 56b253159fa72..7bdd77e27bcef 100644 --- a/python/paddle/utils/code_gen/sparse_api.yaml +++ b/python/paddle/utils/code_gen/sparse_api.yaml @@ -7,6 +7,33 @@ intermediate : rulebook backward : conv3d_grad +- api : coo_to_dense + args : (Tensor x) + output : Tensor(out@DenseTensor) + invoke : to_dense_impl(x) + backward : coo_to_dense_grad + +- api : coo_values + args : (Tensor x) + output : Tensor(out@DenseTensor) + kernel : + func : coo_values + layout : x + backward : coo_values_grad + +- api : csr_values + args : (Tensor x) + output : Tensor(out@DenseTensor) + kernel : + func : csr_values + layout : x + +- api : dense_to_coo + args : (Tensor x, int64_t sparse_dim) + output : Tensor(out@SparseCooTensor) + invoke : to_sparse_coo_impl(x, sparse_dim) + backward : dense_to_coo_grad + - api : relu args : (Tensor x) output : Tensor(out@SparseCooTensor) diff --git a/python/paddle/utils/code_gen/sparse_bw_api.yaml b/python/paddle/utils/code_gen/sparse_bw_api.yaml index 711b4cedc59a5..800145b06e0b6 100644 --- a/python/paddle/utils/code_gen/sparse_bw_api.yaml +++ b/python/paddle/utils/code_gen/sparse_bw_api.yaml @@ -1,13 +1,33 @@ - backward_api : conv3d_grad forward : conv3d (Tensor x, Tensor kernel, int[] paddings, int[] dilations, int[] strides, int groups, bool subm) -> 
Tensor(out@SparseCooTensor), Tensor(rulebook@DenseTensor) args : (Tensor x, Tensor kernel, Tensor rulebook, Tensor out_grad, int[] paddings, int[] dilations, int[] strides, int groups, bool subm) - output : Tensor(x_grad@DenseTensor), Tensor(kernel_grad@DenseTensor) + output : Tensor(x_grad@SparseCooTensor), Tensor(kernel_grad@DenseTensor) kernel : func : sparse_conv3d_grad +- backward_api : coo_to_dense_grad + forward : coo_to_dense(Tensor x) -> Tensor(out@DenseTensor) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad@SparseCooTensor) + kernel : + func : sparse_coo_to_dense_grad + +- backward_api : coo_values_grad + forward : coo_values(Tensor x) -> Tensor(out@DenseTensor) + args : (Tensor x, Tensor out_grad) + output : Tensor(x_grad@SparseCooTensor) + kernel : + func : coo_values_grad + +- backward_api : dense_to_coo_grad + forward : dense_to_coo(Tensor x, int64_t sparse_dim) -> Tensor(out@SparseCooTensor) + args : (Tensor out_grad) + output : Tensor(x_grad@DenseTensor) + invoke : to_dense_impl(out_grad) + - backward_api : sparse_relu_grad forward : sparse_relu(Tensor x) -> Tensor(out@SparseCooTensor) args : (Tensor x, Tensor out_grad) output : Tensor(x_grad@SparseCooTensor) - kernel : + kernel : func : sparse_relu_grad diff --git a/python/paddle/utils/code_gen/wrapped_infermeta_gen.py b/python/paddle/utils/code_gen/wrapped_infermeta_gen.py index 39b950e15dc93..b50db007d92e9 100644 --- a/python/paddle/utils/code_gen/wrapped_infermeta_gen.py +++ b/python/paddle/utils/code_gen/wrapped_infermeta_gen.py @@ -51,7 +51,6 @@ def gene_wrapped_infermeta_and_register(api): wrapped_infermeta_name = get_wrapped_infermeta_name(api.api) args = [] - print("@@@", api.api) for input_name in api.inputs['names']: if input_name in kernel_params: print("type", api.inputs['input_info']) diff --git a/python/paddle/vision/ops.py b/python/paddle/vision/ops.py index 00bd6ed38a3ad..2ed01d42cfb8c 100644 --- a/python/paddle/vision/ops.py +++ b/python/paddle/vision/ops.py @@ -19,7 +19,7 @@ from ..fluid.layers import nn, utils from ..nn import Layer, Conv2D, Sequential, ReLU, BatchNorm2D from ..fluid.initializer import Normal -from ..fluid.framework import _non_static_mode +from ..fluid.framework import _non_static_mode, in_dygraph_mode, _in_legacy_dygraph from paddle.common_ops_import import * from paddle import _C_ops @@ -36,6 +36,7 @@ 'PSRoIPool', 'roi_align', 'RoIAlign', + 'nms', ] @@ -377,6 +378,12 @@ def yolo_box(x, clip_bbox=True, scale_x_y=1.) 
""" + if in_dygraph_mode(): + boxes, scores = _C_ops.final_state_yolo_box( + x, img_size, anchors, class_num, conf_thresh, downsample_ratio, + clip_bbox, scale_x_y, iou_aware, iou_aware_factor) + return boxes, scores + if _non_static_mode(): boxes, scores = _C_ops.yolo_box( x, img_size, 'anchors', anchors, 'class_num', class_num, @@ -952,7 +959,11 @@ def psroi_pool(x, boxes, boxes_num, output_size, spatial_scale=1.0, name=None): assert len(x.shape) == 4, \ "Input features with shape should be (N, C, H, W)" output_channels = int(x.shape[1] / (pooled_height * pooled_width)) - if _non_static_mode(): + if in_dygraph_mode(): + return _C_ops.final_state_psroi_pool(x, boxes, boxes_num, pooled_height, + pooled_width, output_channels, + spatial_scale) + if _in_legacy_dygraph(): return _C_ops.psroi_pool(x, boxes, boxes_num, "output_channels", output_channels, "spatial_scale", spatial_scale, "pooled_height", pooled_height, @@ -1062,7 +1073,11 @@ def roi_pool(x, boxes, boxes_num, output_size, spatial_scale=1.0, name=None): output_size = (output_size, output_size) pooled_height, pooled_width = output_size - if _non_static_mode(): + if in_dygraph_mode(): + assert boxes_num is not None, "boxes_num should not be None in dygraph mode." + return _C_ops.final_state_roi_pool(x, boxes, boxes_num, pooled_height, + pooled_width, spatial_scale) + if _in_legacy_dygraph(): assert boxes_num is not None, "boxes_num should not be None in dygraph mode." pool_out, argmaxes = _C_ops.roi_pool( x, boxes, boxes_num, "pooled_height", pooled_height, "pooled_width", @@ -1217,7 +1232,12 @@ def roi_align(x, output_size = (output_size, output_size) pooled_height, pooled_width = output_size - if _non_static_mode(): + if in_dygraph_mode(): + assert boxes_num is not None, "boxes_num should not be None in dygraph mode." + return _C_ops.final_state_roi_align(x, boxes, boxes_num, pooled_height, + pooled_width, spatial_scale, + sampling_ratio, aligned) + if _in_legacy_dygraph(): assert boxes_num is not None, "boxes_num should not be None in dygraph mode." align_out = _C_ops.roi_align( x, boxes, boxes_num, "pooled_height", pooled_height, "pooled_width", @@ -1351,3 +1371,151 @@ def __init__(self, if activation_layer is not None: layers.append(activation_layer()) super().__init__(*layers) + + +def nms(boxes, + iou_threshold=0.3, + scores=None, + category_idxs=None, + categories=None, + top_k=None): + r""" + This operator implements non-maximum suppression. Non-maximum suppression (NMS) + is used to select one bounding box out of many overlapping bounding boxes in object detection. + Boxes with IoU > iou_threshold will be considered as overlapping boxes, + just one with highest score can be kept. Here IoU is Intersection Over Union, + which can be computed by: + + .. math:: + + IoU = \frac{intersection\_area(box1, box2)}{union\_area(box1, box2)} + + If scores are provided, input boxes will be sorted by their scores firstly. + If category_idxs and categories are provided, NMS will be performed with a batched style, + which means NMS will be applied to each category respectively and results of each category + will be concated and sorted by scores. + If K is provided, only the first k elements will be returned. Otherwise, all box indices sorted by scores will be returned. + + Args: + boxes(Tensor): The input boxes data to be computed, it's a 2D-Tensor with + the shape of [num_boxes, 4] and boxes should be sorted by their + confidence scores. The data type is float32 or float64. 
+ Given as [[x1, y1, x2, y2], …], (x1, y1) is the top left coordinates, + and (x2, y2) is the bottom right coordinates. + Their relation should be ``0 <= x1 < x2 && 0 <= y1 < y2``. + iou_threshold(float32): IoU threshold for determine overlapping boxes. Default value: 0.3. + scores(Tensor, optional): Scores corresponding to boxes, it's a 1D-Tensor with + shape of [num_boxes]. The data type is float32 or float64. + category_idxs(Tensor, optional): Category indices corresponding to boxes. + it's a 1D-Tensor with shape of [num_boxes]. The data type is int64. + categories(List, optional): A list of unique id of all categories. The data type is int64. + top_k(int64, optional): The top K boxes who has higher score and kept by NMS preds to + consider. top_k should be smaller equal than num_boxes. + + Returns: + Tensor: 1D-Tensor with the shape of [num_boxes]. Indices of boxes kept by NMS. + + Examples: + .. code-block:: python + + import paddle + import numpy as np + + boxes = np.random.rand(4, 4).astype('float32') + boxes[:, 2] = boxes[:, 0] + boxes[:, 2] + boxes[:, 3] = boxes[:, 1] + boxes[:, 3] + # [[0.06287421 0.5809351 0.3443958 0.8713329 ] + # [0.0749094 0.9713205 0.99241287 1.2799143 ] + # [0.46246734 0.6753201 1.346266 1.3821303 ] + # [0.8984796 0.5619834 1.1254641 1.0201943 ]] + + out = paddle.vision.ops.nms(paddle.to_tensor(boxes), 0.1) + # [0, 1, 3, 0] + + scores = np.random.rand(4).astype('float32') + # [0.98015213 0.3156527 0.8199343 0.874901 ] + + categories = [0, 1, 2, 3] + category_idxs = np.random.choice(categories, 4) + # [2 0 0 3] + + out = paddle.vision.ops.nms(paddle.to_tensor(boxes), + 0.1, + paddle.to_tensor(scores), + paddle.to_tensor(category_idxs), + categories, + 4) + # [0, 3, 2] + """ + + def _nms(boxes, iou_threshold): + if _non_static_mode(): + return _C_ops.nms(boxes, 'iou_threshold', iou_threshold) + + helper = LayerHelper('nms', **locals()) + out = helper.create_variable_for_type_inference('int64') + helper.append_op( + type='nms', + inputs={'Boxes': boxes}, + outputs={'KeepBoxesIdxs': out}, + attrs={'iou_threshold': iou_threshold}) + return out + + if scores is None: + return _nms(boxes, iou_threshold) + + import paddle + if category_idxs is None: + sorted_global_indices = paddle.argsort(scores, descending=True) + return _nms(boxes[sorted_global_indices], iou_threshold) + + if top_k is not None: + assert top_k <= scores.shape[ + 0], "top_k should be smaller equal than the number of boxes" + assert categories is not None, "if category_idxs is given, categories which is a list of unique id of all categories is necessary" + + mask = paddle.zeros_like(scores, dtype=paddle.int32) + + for category_id in categories: + cur_category_boxes_idxs = paddle.where(category_idxs == category_id)[0] + shape = cur_category_boxes_idxs.shape[0] + cur_category_boxes_idxs = paddle.reshape(cur_category_boxes_idxs, + [shape]) + if shape == 0: + continue + elif shape == 1: + mask[cur_category_boxes_idxs] = 1 + continue + cur_category_boxes = boxes[cur_category_boxes_idxs] + cur_category_scores = scores[cur_category_boxes_idxs] + cur_category_sorted_indices = paddle.argsort( + cur_category_scores, descending=True) + cur_category_sorted_boxes = cur_category_boxes[ + cur_category_sorted_indices] + + cur_category_keep_boxes_sub_idxs = cur_category_sorted_indices[_nms( + cur_category_sorted_boxes, iou_threshold)] + + updates = paddle.ones_like( + cur_category_boxes_idxs[cur_category_keep_boxes_sub_idxs], + dtype=paddle.int32) + mask = paddle.scatter( + mask, + 
cur_category_boxes_idxs[cur_category_keep_boxes_sub_idxs], + updates, + overwrite=True) + keep_boxes_idxs = paddle.where(mask)[0] + shape = keep_boxes_idxs.shape[0] + keep_boxes_idxs = paddle.reshape(keep_boxes_idxs, [shape]) + sorted_sub_indices = paddle.argsort( + scores[keep_boxes_idxs], descending=True) + + if top_k is None: + return keep_boxes_idxs[sorted_sub_indices] + + if _non_static_mode(): + top_k = shape if shape < top_k else top_k + _, topk_sub_indices = paddle.topk(scores[keep_boxes_idxs], top_k) + return keep_boxes_idxs[topk_sub_indices] + + return keep_boxes_idxs[sorted_sub_indices][:top_k] diff --git a/python/setup.py.in b/python/setup.py.in index 3e59e22fcbc63..e4637444be171 100755 --- a/python/setup.py.in +++ b/python/setup.py.in @@ -273,10 +273,12 @@ packages=['paddle', 'paddle.distributed.ps', 'paddle.distributed.ps.utils', 'paddle.incubate', + 'paddle.incubate.autograd', 'paddle.incubate.optimizer', 'paddle.incubate.checkpoint', 'paddle.incubate.operators', 'paddle.incubate.tensor', + 'paddle.incubate.multiprocessing', 'paddle.incubate.nn', 'paddle.incubate.passes', 'paddle.distribution', @@ -382,7 +384,10 @@ packages=['paddle', 'paddle.device', 'paddle.device.cuda', 'paddle.version', - 'paddle.profiler' + 'paddle.profiler', + 'paddle.sparse', + 'paddle.sparse.layer', + 'paddle.sparse.functional', ] with open('@PADDLE_SOURCE_DIR@/python/requirements.txt') as f: @@ -545,10 +550,6 @@ if '${WITH_XPU_BKCL}' == 'ON': shutil.copy('${XPU_BKCL_LIB}', libs_path) package_data['paddle.libs']+=['${XPU_BKCL_LIB_NAME}'] -if '${WITH_IPU}' == 'ON': - shutil.copy('${PADDLE_IPU_LIB}', libs_path) - package_data['paddle.libs'] += ['libpaddle_ipu' + ext_name] - # remove unused paddle/libs/__init__.py if os.path.isfile(libs_path+'/__init__.py'): os.remove(libs_path+'/__init__.py') diff --git a/tools/check_file_diff_approvals.sh b/tools/check_file_diff_approvals.sh index d2892d13fc401..e0598112c822a 100644 --- a/tools/check_file_diff_approvals.sh +++ b/tools/check_file_diff_approvals.sh @@ -304,8 +304,8 @@ for CHANGE_FILE in ${ALL_CHANGE_FILES}; do fi done if [ "${ALL_OPTEST_BAN_DYGRAPH_MESSAGE}" != "" ] && [ "${GIT_PR_ID}" != "" ]; then - echo_line="Developers are not allowed to set the check_dygraph field directly, which is set to True by default. If you need to change the check_dygraph field, you must have one RD (phlrain (Recommend), fuyinno4 (Recommend for kunlun) or lanxianghit) review and approve. \nThe code that do not meet the specification are as follows:\n${ALL_OPTEST_BAN_DYGRAPH_MESSAGE}\n" - check_approval 1 43953930 47554610 35824027 + echo_line="Developers are not allowed to set the check_dygraph field directly, which is set to True by default. If you need to change the check_dygraph field, you must have one RD (phlrain (Recommend), fuyinno4, QingshuChen (Recommend for kunlun) or lanxianghit) review and approve. 
\nThe code that do not meet the specification are as follows:\n${ALL_OPTEST_BAN_DYGRAPH_MESSAGE}\n" + check_approval 1 43953930 47554610 35824027 2002279 fi NEW_OP_ADDED=`git diff --name-only --diff-filter=A upstream/$BRANCH |grep -oE ".+_op..*" || true` @@ -326,8 +326,8 @@ fi HAS_INPLACE_TESTS=`git diff -U0 upstream/$BRANCH |grep "+" |grep -E "inplace_atol[[:space:]]*=.*" || true` if [ "${HAS_INPLACE_TESTS}" != "" ] && [ "${GIT_PR_ID}" != "" ]; then - echo_line="The calculation results of setting inplace enabled and disabled must be equal, that is, it's not recommended to set inplace_atol.\n If you do need to use inplace_atol, you must have one RD (XiaoguangHu01, lanxianghit, phlrain, luotao1) approval for the usage of inplace_atol.\nThe corresponding lines are as follows:\n${HAS_INPLACE_TESTS}\n" - check_approval 1 46782768 47554610 43953930 6836917 + echo_line="The calculation results of setting inplace enabled and disabled must be equal, that is, it's not recommended to set inplace_atol.\n If you do need to use inplace_atol, you must have one RD (XiaoguangHu01, lanxianghit, phlrain, luotao1, QingshuChen) approval for the usage of inplace_atol.\nThe corresponding lines are as follows:\n${HAS_INPLACE_TESTS}\n" + check_approval 1 46782768 47554610 43953930 6836917 2002279 fi OP_FILE_CHANGED=`git diff --name-only --diff-filter=AMR upstream/$BRANCH |grep -oE ".+_op..*" || true` @@ -373,7 +373,7 @@ if [ "${UNITTEST_FILE_CHANGED}" != "" ] && [ "${GIT_PR_ID}" != "" ]; then done if [ "${ERROR_LINES}" != "" ]; then ERROR_LINES=${ERROR_LINES//+/'\n+\t'} - echo_line="It is an Op accuracy problem, please take care of it. You must have one RD (zhangting2020 (Recommend), luotao1 or phlrain, qili93) approval for the usage (either add or delete) of @skip_check_grad_ci. For more information, please refer to: https://github.com/PaddlePaddle/Paddle/wiki/Gradient-Check-Is-Required-for-Op-Test. The corresponding lines are as follows:\n${ERROR_LINES}\n" + echo_line="It is an Op accuracy problem, please take care of it. You must have one RD (zhangting2020 (Recommend), luotao1 or phlrain, qili93, QingshuChen) approval for the usage (either add or delete) of @skip_check_grad_ci. For more information, please refer to: https://github.com/PaddlePaddle/Paddle/wiki/Gradient-Check-Is-Required-for-Op-Test. 
The corresponding lines are as follows:\n${ERROR_LINES}\n" check_approval 1 26615455 6836917 43953930 16605440 fi fi diff --git a/tools/infrt/get_compat_kernel_signature.py b/tools/infrt/get_compat_kernel_signature.py index 45dc931fac19d..a66a236b0f975 100644 --- a/tools/infrt/get_compat_kernel_signature.py +++ b/tools/infrt/get_compat_kernel_signature.py @@ -19,6 +19,13 @@ skip_list = ["adam_sig.cc", "adamw_sig.cc"] +def is_grad_kernel(kernel_info): + kernel_name = kernel_info.split(",")[0] + if kernel_name.endswith("_grad"): + return True + return False + + def parse_compat_registry(kernel_info): name, inputs_str, attrs_str, outputs_str = kernel_info.split(",{") kernel_info = {} @@ -62,6 +69,8 @@ def get_compat_kernels_info(): "").strip("return").strip("KernelSignature(").strip( "\);").replace("\"", "").replace("\\", "") registry = False + if is_grad_kernel(data): + continue name, registry_info = parse_compat_registry(data) if name in kernels_info: diff --git a/tools/infrt/skipped_phi_api.json b/tools/infrt/skipped_phi_api.json index 7e03e01d0fe5b..64fc4c618aebc 100644 --- a/tools/infrt/skipped_phi_api.json +++ b/tools/infrt/skipped_phi_api.json @@ -1,4 +1,4 @@ { -"phi_apis":["conj"], +"phi_apis":["conj", "dropout", "expand_as", "flatten", "nll_loss", "psroi_pool", "roi_align", "roi_pool", "label_smooth"], "phi_kernels":["equal_all"] } diff --git a/tools/parallel_UT_rule.py b/tools/parallel_UT_rule.py index f075439e54fe7..5088ad3457fb9 100755 --- a/tools/parallel_UT_rule.py +++ b/tools/parallel_UT_rule.py @@ -1174,6 +1174,7 @@ ] LOWEST_PARALLEL_JOB_NEW = [ + 'heter_cloud_comm_cpu_test', 'heter_server_test', 'test_scatter_op', 'test_trt_convert_hard_sigmoid', diff --git a/tools/static_mode_white_list.py b/tools/static_mode_white_list.py index 365047f7e8382..f907d51e4d038 100755 --- a/tools/static_mode_white_list.py +++ b/tools/static_mode_white_list.py @@ -349,6 +349,7 @@ 'test_nearest_interp_v2_op', 'test_network_with_dtype', 'test_nll_loss', + 'test_nms_op', 'test_nn_functional_embedding_static', 'test_nn_functional_hot_op', 'test_nonzero_api', diff --git a/tools/windows/run_unittests.sh b/tools/windows/run_unittests.sh index dd6a4ad288140..44dc4eac26118 100644 --- a/tools/windows/run_unittests.sh +++ b/tools/windows/run_unittests.sh @@ -12,55 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-set -e -set +x -NIGHTLY_MODE=$1 -PRECISION_TEST=$2 -WITH_GPU=$3 - -export PADDLE_ROOT="$(cd "$PWD/../" && pwd )" -if [ ${NIGHTLY_MODE:-OFF} == "ON" ]; then - nightly_label="" -else - nightly_label="(RUN_TYPE=NIGHTLY|RUN_TYPE=DIST:NIGHTLY|RUN_TYPE=EXCLUSIVE:NIGHTLY)" - echo "=========================================" - echo "Unittests with nightly labels are only run at night" - echo "=========================================" -fi - -if disable_ut_quickly=$(python ${PADDLE_ROOT}/tools/get_quick_disable_lt.py); then - echo "=========================================" - echo "The following unittests have been disabled:" - echo ${disable_ut_quickly} - echo "=========================================" -else - disable_ut_quickly='' -fi - -# check added ut -set +e -cp $PADDLE_ROOT/tools/check_added_ut.sh $PADDLE_ROOT/tools/check_added_ut_win.sh -bash $PADDLE_ROOT/tools/check_added_ut_win.sh -rm -rf $PADDLE_ROOT/tools/check_added_ut_win.sh -if [ -f "$PADDLE_ROOT/added_ut" ];then - added_uts=^$(awk BEGIN{RS=EOF}'{gsub(/\n/,"$|^");print}' $PADDLE_ROOT/added_ut)$ - ctest -R "(${added_uts})" --output-on-failure -C Release --repeat-until-fail 3;added_ut_error=$? - rm -f $PADDLE_ROOT/added_ut - if [ "$added_ut_error" != 0 ];then - echo "========================================" - echo "Added UT should pass three additional executions" - echo "========================================" - exit 8; - fi - if nvcc --version | grep 11.2; then - echo "Only test added_ut temporarily when running in CI-Windows-inference of CUDA 11.2." - exit 0; - fi -fi -set -e - -# /*==================Fixed Disabled Windows GPU MKL unittests==============================*/ +# /*================Fixed Disabled Windows CUDA10.x MKL(PR-CI-Windows) unittests===========================*/ # TODO: fix these unittest that is bound to fail disable_wingpu_test="^test_model$|\ ^test_dataloader_early_reset$|\ @@ -97,7 +50,7 @@ disable_wingpu_test="^test_model$|\ ^test_bilinear_interp_op$|\ ^disable_wingpu_test$" -# /*==================Fixed Disabled Windows GPU MKL unittests==============================*/ +# /*=================Fixed Disabled Windows TRT MKL unittests=======================*/ # TODO: fix these unittest that is bound to fail disable_win_trt_test="^test_trt_convert_conv2d$|\ ^test_trt_convert_conv2d_fusion$|\ @@ -119,7 +72,13 @@ disable_win_trt_test="^test_trt_convert_conv2d$|\ ^test_trt_convert_matmul$|\ ^test_trt_convert_scale$" -# /*==================Fixed Disabled Windows GPU inference_api_test unittests==============================*/ +# /*=============Fixed Disabled Windows CUDA11.x MKL(PR-CI-Windows-Inference) unittests=================*/ +# TODO: fix these unittest that is bound to fail +disable_wingpu11_test="^test_autograd_functional_dynamic$|\ +^disable_wingpu_test$" + + +# /*==========Fixed Disabled Windows CUDA11.x inference_api_test(PR-CI-Windows-Inference) unittests=============*/ disable_win_inference_api_test="^trt_quant_int8_yolov3_r50_test$|\ ^test_trt_dynamic_shape_ernie$|\ ^test_trt_dynamic_shape_ernie_fp16_ser_deser$|\ @@ -128,9 +87,8 @@ disable_win_inference_api_test="^trt_quant_int8_yolov3_r50_test$|\ ^lite_mul_model_test$|\ ^paddle_infer_api_copy_tensor_tester$" -# /*============================================================================*/ -# /*==================Fixed Disabled Windows CPU OPENBLAS unittests==============================*/ +# /*==========Fixed Disabled Windows CPU OPENBLAS((PR-CI-Windows-OPENBLAS)) unittests==============================*/ # TODO: fix these unittest that is bound to 
fail disable_wincpu_test="^jit_kernel_test$|\ ^test_analyzer_transformer$|\ @@ -189,6 +147,58 @@ long_time_test="^test_gru_op$|\ ^test_trt_matmul_quant_dequant$|\ ^test_strided_slice_op$" + +# /*============================================================================*/ + +set -e +set +x +NIGHTLY_MODE=$1 +PRECISION_TEST=$2 +WITH_GPU=$3 + +export PADDLE_ROOT="$(cd "$PWD/../" && pwd )" +if [ ${NIGHTLY_MODE:-OFF} == "ON" ]; then + nightly_label="" +else + nightly_label="(RUN_TYPE=NIGHTLY|RUN_TYPE=DIST:NIGHTLY|RUN_TYPE=EXCLUSIVE:NIGHTLY)" + echo "=========================================" + echo "Unittests with nightly labels are only run at night" + echo "=========================================" +fi + +if disable_ut_quickly=$(python ${PADDLE_ROOT}/tools/get_quick_disable_lt.py); then + echo "=========================================" + echo "The following unittests have been disabled:" + echo ${disable_ut_quickly} + echo "=========================================" +else + disable_ut_quickly='' +fi + +# check added ut + +set +e +cp $PADDLE_ROOT/tools/check_added_ut.sh $PADDLE_ROOT/tools/check_added_ut_win.sh +bash $PADDLE_ROOT/tools/check_added_ut_win.sh +rm -rf $PADDLE_ROOT/tools/check_added_ut_win.sh +if [ -f "$PADDLE_ROOT/added_ut" ];then + added_uts=^$(awk BEGIN{RS=EOF}'{gsub(/\n/,"$|^");print}' $PADDLE_ROOT/added_ut)$ + ctest -R "(${added_uts})" -E "$disable_wingpu11_test" --output-on-failure -C Release --repeat-until-fail 3;added_ut_error=$? + rm -f $PADDLE_ROOT/added_ut + if [ "$added_ut_error" != 0 ];then + echo "========================================" + echo "Added UT should pass three additional executions" + echo "========================================" + exit 8; + fi + if nvcc --version | grep 11.2; then + echo "Only test added_ut temporarily when running in CI-Windows-inference of CUDA 11.2." + exit 0; + fi +fi +set -e + + if [ ${WITH_GPU:-OFF} == "ON" ];then export CUDA_VISIBLE_DEVICES=0