diff --git a/README.md b/README.md
index 5f415f7e0a..afa2f99d3f 100644
--- a/README.md
+++ b/README.md
@@ -159,8 +159,7 @@ model.eval()
 Next, we need to define simple input and baseline tensors.
 Baselines belong to the input space and often carry no predictive signal.
 Zero tensor can serve as a baseline for many tasks.
-Some interpretability algorithms such as `Integrated
-Gradients`, `Deeplift` and `GradientShap` are designed to attribute the change
+Some interpretability algorithms such as `IntegratedGradients`, `DeepLift` and `GradientShap` are designed to attribute the change
 between the input and baseline to a predictive class or a value that the neural
 network outputs.
 
@@ -472,23 +471,23 @@ You can watch the recorded talk [here](https://www.youtube.com/watch?v=ayhBHZYje
 * `SmoothGrad`: [SmoothGrad: removing noise by adding noise, Daniel Smilkov et al. 2017](https://arxiv.org/abs/1706.03825)
 * `NoiseTunnel`: [Sanity Checks for Saliency Maps, Julius Adebayo et al. 2018](https://arxiv.org/abs/1810.03292)
 * `NeuronConductance`: [How Important is a neuron?, Kedar Dhamdhere et al. 2018](https://arxiv.org/abs/1805.12233)
-* `LayerConductance`: [Computationally Efficient Measures of Internal Neuron Importance, Avanti Shrikumar et al. 2018](https://arxiv.org/pdf/1807.09946.pdf)
-* `DeepLift`, `NeuronDeepLift`, `LayerDeepLift`: [Learning Important Features Through Propagating Activation Differences, Avanti Shrikumar et al. 2017](https://arxiv.org/pdf/1704.02685.pdf) and [Towards better understanding of gradient-based attribution methods for deep neural networks, Marco Ancona et al. 2018](https://openreview.net/pdf?id=Sy21R9JAW)
-* `NeuronIntegratedGradients`: [Computationally Efficient Measures of Internal Neuron Importance, Avanti Shrikumar et al. 2018](https://arxiv.org/pdf/1807.09946.pdf)
+* `LayerConductance`: [Computationally Efficient Measures of Internal Neuron Importance, Avanti Shrikumar et al. 2018](https://arxiv.org/abs/1807.09946)
+* `DeepLift`, `NeuronDeepLift`, `LayerDeepLift`: [Learning Important Features Through Propagating Activation Differences, Avanti Shrikumar et al. 2017](https://arxiv.org/abs/1704.02685) and [Towards better understanding of gradient-based attribution methods for deep neural networks, Marco Ancona et al. 2018](https://openreview.net/pdf?id=Sy21R9JAW)
+* `NeuronIntegratedGradients`: [Computationally Efficient Measures of Internal Neuron Importance, Avanti Shrikumar et al. 2018](https://arxiv.org/abs/1807.09946)
 * `GradientShap`, `NeuronGradientShap`, `LayerGradientShap`, `DeepLiftShap`, `NeuronDeepLiftShap`, `LayerDeepLiftShap`: [A Unified Approach to Interpreting Model Predictions, Scott M. Lundberg et al. 2017](http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions)
-* `InternalInfluence`: [Influence-Directed Explanations for Deep Convolutional Networks, Klas Leino et al. 2018](https://arxiv.org/pdf/1802.03788.pdf)
+* `InternalInfluence`: [Influence-Directed Explanations for Deep Convolutional Networks, Klas Leino et al. 2018](https://arxiv.org/abs/1802.03788)
 * `Saliency`, `NeuronGradient`: [Deep Inside Convolutional Networks: Visualising
-Image Classification Models and Saliency Maps, K. Simonyan, et. al. 2014](https://arxiv.org/pdf/1312.6034.pdf)
-* `GradCAM`, `Guided GradCAM`: [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Ramprasaath R. Selvaraju et al. 2017](https://arxiv.org/abs/1610.02391.pdf)
-* `Deconvolution`, `Neuron Deconvolution`: [Visualizing and Understanding Convolutional Networks, Matthew D Zeiler et al. 2014](https://arxiv.org/pdf/1311.2901.pdf)
-* `Guided Backpropagation`, `Neuron Guided Backpropagation`: [Striving for Simplicity: The All Convolutional Net, Jost Tobias Springenberg et al. 2015](https://arxiv.org/pdf/1412.6806.pdf)
+Image Classification Models and Saliency Maps, K. Simonyan et al. 2014](https://arxiv.org/abs/1312.6034)
+* `GradCAM`, `Guided GradCAM`: [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, Ramprasaath R. Selvaraju et al. 2017](https://arxiv.org/abs/1610.02391)
+* `Deconvolution`, `Neuron Deconvolution`: [Visualizing and Understanding Convolutional Networks, Matthew D Zeiler et al. 2014](https://arxiv.org/abs/1311.2901)
+* `Guided Backpropagation`, `Neuron Guided Backpropagation`: [Striving for Simplicity: The All Convolutional Net, Jost Tobias Springenberg et al. 2015](https://arxiv.org/abs/1412.6806)
 * `Feature Permutation`: [Permutation Feature Importance](https://christophm.github.io/interpretable-ml-book/feature-importance.html)
 * `Occlusion`: [Visualizing and Understanding Convolutional Networks](https://arxiv.org/abs/1311.2901)
 * `Shapley Value`: [A value for n-person games. Contributions to the Theory of Games 2.28 (1953): 307-317](https://apps.dtic.mil/dtic/tr/fulltext/u2/604084.pdf)
 * `Shapley Value Sampling`: [Polynomial calculation of the Shapley value based on sampling](https://www.sciencedirect.com/science/article/pii/S0305054808000804)
 * `Infidelity and Sensitivity`: [On the (In)fidelity and Sensitivity for Explanations](https://arxiv.org/abs/1901.09392)
 
-More details about the above mentioned [algorithms](https://captum.ai/docs/algorithms) and their pros and cons can be found on our [web-site](https://captum.ai/docs/algorithms_comparison_matrix).
+More details about the above-mentioned [attribution algorithms](https://captum.ai/docs/attribution_algorithms) and their pros and cons can be found on our [website](https://captum.ai/docs/algorithms_comparison_matrix).
 
 ## License
 Captum is BSD licensed, as found in the [LICENSE](LICENSE) file.
diff --git a/captum/_utils/av.py b/captum/_utils/av.py
index ac3c32a204..1b749162f8 100644
--- a/captum/_utils/av.py
+++ b/captum/_utils/av.py
@@ -80,7 +80,7 @@ def __getitem__(self, idx: int) -> Union[Tensor, Tuple[Tensor, ...]]:
             av = torch.load(fl)
             return av
 
-        def __len__(self):
+        def __len__(self) -> int:
             return len(self.files)
 
     AV_DIR_NAME: str = "av"
@@ -211,9 +211,9 @@ def save(
                     AV.generate_dataset_activations from batch index.
                     It assumes identifier is same for all layers if a list of
                     `layers` is provided.
-            layers (str or List of str): The layer(s) for which the activation vectors
+            layers (str or list[str]): The layer(s) for which the activation vectors
                     are computed.
-            act_tensors (Tensor or List of Tensor): A batch of activation vectors.
+            act_tensors (Tensor or list of Tensor): A batch of activation vectors.
                     This must match the dimension of `layers`.
             num_id (str): string representing the batch number for which the activation
                     vectors are computed
@@ -299,13 +299,15 @@ def _manage_loading_layers(
                     for the `layer` are stored.
             model_id (str): The name/version of the model for which layer activations
                     are being computed and stored.
-            layers (str or List of str): The layer(s) for which the activation vectors
+            layers (str or list[str]): The layer(s) for which the activation vectors
                     are computed.
+            load_from_disk (bool, optional): Whether or not to load from disk.
+                    Default: True
             identifier (str or None): An optional identifier for the layer
                     activations. Can be used to distinguish between activations for
                     different training batches.
-            num_id (str): An optional string representing the batch number for which the
-                    activation vectors are computed
+            num_id (str, optional): An optional string representing the batch number
+                    for which the activation vectors are computed.
 
         Returns:
             List of layer names for which activations should be generated
@@ -357,9 +359,9 @@ def _compute_and_save_activations(
                     define all of its layers as attributes of the model.
             model_id (str): The name/version of the model for which layer activations
                     are being computed and stored.
-            layers (str or List of str): The layer(s) for which the activation vectors
+            layers (str or list[str]): The layer(s) for which the activation vectors
                     are computed.
-            inputs (tensor or tuple of tensors): Batch of examples for
+            inputs (Tensor or tuple of Tensor): Batch of examples for
                     which influential instances are computed. They are passed to the
                     input `model`. The first dimension in `inputs` tensor or tuple of
                     tensors corresponds to the batch size.
@@ -368,7 +370,7 @@ def _compute_and_save_activations(
                     different training batches.
             num_id (str): An required string representing the batch number for which the
                     activation vectors are computed
-            additional_forward_args (optional):  Additional arguments that will be
+            additional_forward_args (Any, optional): Additional arguments that will be
                     passed to `model` after inputs.
                     Default: None
             load_from_disk (bool): Forces function to regenerate activations if False.
@@ -433,7 +435,7 @@ def generate_dataset_activations(
                     define all of its layers as attributes of the model.
             model_id (str): The name/version of the model for which layer activations
                     are being computed and stored.
-            layers (str or List of str): The layer(s) for which the activation vectors
+            layers (str or list[str]): The layer(s) for which the activation vectors
                     are computed.
             dataloader (torch.utils.data.DataLoader): DataLoader that yields Dataset
                     for which influential instances are computed. They are passed to
diff --git a/captum/_utils/gradient.py b/captum/_utils/gradient.py
index a15157d8d7..5b853cd435 100644
--- a/captum/_utils/gradient.py
+++ b/captum/_utils/gradient.py
@@ -730,7 +730,7 @@ def _compute_jacobian_wrt_params(
                 but must behave as a library loss function would if `reduction='none'`.
 
     Returns:
-        grads (Tuple of Tensor): Returns the Jacobian for the minibatch as a
+        grads (tuple of Tensor): Returns the Jacobian for the minibatch as a
                 tuple of gradients corresponding to the tuple of trainable parameters
                 returned by `model.parameters()`. Each object grads[i] references to the
                 gradients for the parameters in the i-th trainable layer of the model.
@@ -804,7 +804,7 @@ def _compute_jacobian_wrt_params_with_sample_wise_trick(
                 Defaults to 'sum'.
 
     Returns:
-        grads (Tuple of Tensor): Returns the Jacobian for the minibatch as a
+        grads (tuple of Tensor): Returns the Jacobian for the minibatch as a
                 tuple of gradients corresponding to the tuple of trainable parameters
                 returned by `model.parameters()`. Each object grads[i] references to the
                 gradients for the parameters in the i-th trainable layer of the model.
diff --git a/captum/_utils/models/linear_model/model.py b/captum/_utils/models/linear_model/model.py
index bfffdbf38a..24302d540c 100644
--- a/captum/_utils/models/linear_model/model.py
+++ b/captum/_utils/models/linear_model/model.py
@@ -20,7 +20,7 @@ def __init__(self, train_fn: Callable, **kwargs) -> None:
         Please note that this is an experimental feature.
 
         Args:
-            train_fn (callable)
+            train_fn (Callable)
                 The function to train with. See
                 `captum._utils.models.linear_model.train.sgd_train_linear_model`
                 and
@@ -65,14 +65,14 @@ def _construct_model_params(
                 normalization parameters used.
             bias (bool):
                 Whether to add a bias term. Not needed if normalized input.
-            weight_values (tensor, optional):
+            weight_values (Tensor, optional):
                 The values to initialize the linear model with. This must be a
                 1D or 2D tensor, and of the form `(num_outputs, num_features)` or
                 `(num_features,)`. Additionally, if this is provided you need not
                 to provide `in_features` or `out_features`.
-            bias_value (tensor, optional):
+            bias_value (Tensor, optional):
                 The bias value to initialize the model with.
-            classes (tensor, optional):
+            classes (Tensor, optional):
                 The list of prediction classes supported by the model in case it
                 performs classificaton. In case of regression it is set to None.
                 Default: None
diff --git a/captum/attr/_core/deep_lift.py b/captum/attr/_core/deep_lift.py
index 251e68dc23..ea059d7fcc 100644
--- a/captum/attr/_core/deep_lift.py
+++ b/captum/attr/_core/deep_lift.py
@@ -112,7 +112,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place nonlinear submodules; these are not
                         supported by the register_full_backward_hook PyTorch API
                         starting from PyTorch v1.9.
@@ -185,7 +185,7 @@ def attribute(  # type: ignore
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -194,7 +194,7 @@ def attribute(  # type: ignore
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, or tuple of scalar or Tensor, optional):
                         Baselines define reference samples that are compared with
                         the inputs. In order to assign attribution scores DeepLift
                         computes the differences between the inputs/outputs and
@@ -226,7 +226,7 @@ def attribute(  # type: ignore
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -251,7 +251,7 @@ def attribute(  # type: ignore
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -267,7 +267,7 @@ def attribute(  # type: ignore
                         is set to True convergence delta will be returned in
                         a tuple following attributions.
                         Default: False
-            custom_attribution_func (callable, optional): A custom function for
+            custom_attribution_func (Callable, optional): A custom function for
                         computing final attribution scores. This function can take
                         at least one and at most three arguments with the
                         following signature:
@@ -288,7 +288,7 @@ def attribute(  # type: ignore
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                 Attribution score computed based on DeepLift rescale rule with respect
                 to each input feature. Attributions will always be
                 the same size as the provided inputs, with each value
@@ -296,14 +296,14 @@ def attribute(  # type: ignore
                 If a single tensor is provided as inputs, a single tensor is
                 returned. If a tuple is provided for inputs, a tuple of
                 corresponding sized tensors is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                 This is computed using the property that
                 the total sum of forward_func(inputs) - forward_func(baselines)
                 must equal the total sum of the attributions computed
                 based on DeepLift's rescale rule.
                 Delta is calculated per example, meaning that the number of
                 elements in returned delta tensor is equal to the number of
-                of examples in input.
+                examples in input.
                 Note that the logic described for deltas is guaranteed when the
                 default logic for attribution computations is used, meaning that the
                 `custom_attribution_func=None`, otherwise it is not guaranteed and
@@ -611,12 +611,14 @@ class DeepLiftShap(DeepLift):
     each baseline and averages resulting attributions.
     More details about the algorithm can be found here:
 
-    http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
+    https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
 
     Note that the explanation model:
+
         1. Assumes that input features are independent of one another
         2. Is linear, meaning that the explanations are modeled through
             the additive composition of feature effects.
+
     Although, it assumes a linear model for each explanation, the overall
     model across multiple explanations can be complex and non-linear.
     """
@@ -625,7 +627,7 @@ def __init__(self, model: Module, multiply_by_inputs: bool = True) -> None:
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place nonlinear submodules; these are not
                         supported by the register_full_backward_hook PyTorch API.
             multiply_by_inputs (bool, optional): Indicates whether to factor
@@ -694,7 +696,7 @@ def attribute(  # type: ignore
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -703,7 +705,7 @@ def attribute(  # type: ignore
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            baselines (tensor, tuple of tensors, callable):
+            baselines (Tensor, tuple of Tensor, or Callable):
                         Baselines define reference samples that are compared with
                         the inputs. In order to assign attribution scores DeepLift
                         computes the differences between the inputs/outputs and
@@ -728,7 +730,7 @@ def attribute(  # type: ignore
 
                         It is recommended that the number of samples in the baselines'
                         tensors is larger than one.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -753,7 +755,7 @@ def attribute(  # type: ignore
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -769,7 +771,7 @@ def attribute(  # type: ignore
                         is set to True convergence delta will be returned in
                         a tuple following attributions.
                         Default: False
-            custom_attribution_func (callable, optional): A custom function for
+            custom_attribution_func (Callable, optional): A custom function for
                         computing final attribution scores. This function can take
                         at least one and at most three arguments with the
                         following signature:
@@ -789,7 +791,7 @@ def attribute(  # type: ignore
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution score computed based on DeepLift rescale rule with
                         respect to each input feature. Attributions will always be
                         the same size as the provided inputs, with each value
@@ -797,7 +799,7 @@ def attribute(  # type: ignore
                         If a single tensor is provided as inputs, a single tensor is
                         returned. If a tuple is provided for inputs, a tuple of
                         corresponding sized tensors is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         This is computed using the property that the
                         total sum of forward_func(inputs) - forward_func(baselines)
                         must be very close to the total sum of attributions
diff --git a/captum/attr/_core/feature_ablation.py b/captum/attr/_core/feature_ablation.py
index fd0007fc75..70de13e81c 100644
--- a/captum/attr/_core/feature_ablation.py
+++ b/captum/attr/_core/feature_ablation.py
@@ -47,8 +47,8 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
-                        any modification of it
+            forward_func (Callable): The forward function of the model or
+                        any modification of it.
         """
         PerturbationAttribution.__init__(self, forward_func)
         self.use_weights = False
@@ -68,7 +68,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which ablation
+            inputs (Tensor or tuple of Tensor): Input for which ablation
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -77,7 +77,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, or tuple of scalar or Tensor, optional):
                         Baselines define reference value which replaces each
                         feature when ablated.
                         Baselines can be provided as:
@@ -101,10 +101,11 @@ def attribute(
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -129,7 +130,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -144,7 +145,7 @@ def attribute(
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
-            feature_mask (tensor or tuple of tensors, optional):
+            feature_mask (Tensor or tuple of Tensor, optional):
                         feature_mask defines a mask for the input, grouping
                         features which should be ablated together. feature_mask
                         should contain the same number of tensors as inputs.
@@ -193,8 +194,8 @@ def attribute(
                         Default: None
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The attributions with respect to each input feature.
                         If the forward function returns
                         a scalar value per example, attributions will be
@@ -414,10 +415,10 @@ def _ith_input_ablation_generator(
         **kwargs,
     ):
         """
-        This method return an generator of ablation perturbations of the i-th input
+        This method returns a generator of ablation perturbations of the i-th input.
 
         Returns:
-            ablation_iter (generator): yields each perturbation to be evaluated
+            ablation_iter (Generator): yields each perturbation to be evaluated
                         as a tuple (inputs, additional_forward_args, targets, mask).
         """
         extra_args = {}
diff --git a/captum/attr/_core/feature_permutation.py b/captum/attr/_core/feature_permutation.py
index 544ff16ac6..9aac4c11a1 100644
--- a/captum/attr/_core/feature_permutation.py
+++ b/captum/attr/_core/feature_permutation.py
@@ -75,9 +75,9 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
-                any modification of it
-            perm_func (callable, optional): A function that accepts a batch of
+            forward_func (Callable): The forward function of the model or
+                any modification of it.
+            perm_func (Callable, optional): A function that accepts a batch of
                 inputs and a feature mask, and "permutes" the feature using
                 feature mask across the batch. This defaults to a function
                 which applies a random permutation, this argument only needs
@@ -101,14 +101,16 @@ def attribute(  # type: ignore
         **kwargs: Any,
     ) -> TensorOrTupleOfTensorsGeneric:
         r"""
-        This function is almost equivalent to `FeatureAblation.attribute`. The
-        main difference is the way ablated examples are generated. Specifically
-        they are generated through the `perm_func`, as we set the baselines for
-        `FeatureAblation.attribute` to None.
+        This function is almost equivalent to
+        :func:`FeatureAblation.attribute <captum.attr.FeatureAblation.attribute>`. The
+        main difference is the way ablated examples are generated. Specifically they
+        are generated through the ``perm_func``, as we set the baselines for
+        :func:`FeatureAblation.attribute <captum.attr.FeatureAblation.attribute>` to
+        ``None``.
 
 
         Args:
-                inputs (tensor or tuple of tensors):  Input for which
+                inputs (Tensor or tuple of Tensor): Input for which
                             permutation attributions are computed. If
                             forward_func takes a single tensor as input, a
                             single input tensor should be provided.  If
@@ -118,7 +120,7 @@ def attribute(  # type: ignore
                             0 corresponds to the number of examples (aka batch
                             size), and if multiple input tensors are provided,
                             the examples must be aligned appropriately.
-                target (int, tuple, tensor or list, optional):  Output indices for
+                target (int, tuple, Tensor, or list, optional): Output indices for
                             which difference is computed (for classification cases,
                             this is usually the target class).
                             If the network returns a scalar value per example,
@@ -143,7 +145,7 @@ def attribute(  # type: ignore
                               target for the corresponding example.
 
                             Default: None
-                additional_forward_args (any, optional): If the forward function
+                additional_forward_args (Any, optional): If the forward function
                             requires additional arguments other than the inputs for
                             which attributions should not be computed, this argument
                             can be provided. It must be either a single additional
@@ -158,7 +160,7 @@ def attribute(  # type: ignore
                             Note that attributions are not computed with respect
                             to these arguments.
                             Default: None
-                feature_mask (tensor or tuple of tensors, optional):
+                feature_mask (Tensor or tuple of Tensor, optional):
                             feature_mask defines a mask for the input, grouping
                             features which should be ablated together. feature_mask
                             should contain the same number of tensors as inputs.
@@ -196,14 +198,14 @@ def attribute(  # type: ignore
                             a simple output of progress.
                             Default: False
                 **kwargs (Any, optional): Any additional arguments used by child
-                            classes of FeatureAblation (such as Occlusion) to construct
-                            ablations. These arguments are ignored when using
-                            FeatureAblation directly.
+                            classes of :class:`.FeatureAblation` (such as
+                            :class:`.Occlusion`) to construct ablations. These
+                            arguments are ignored when using FeatureAblation directly.
                             Default: None
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The attributions with respect to each input feature.
                         If the forward function returns
                         a scalar value per example, attributions will be
diff --git a/captum/attr/_core/gradient_shap.py b/captum/attr/_core/gradient_shap.py
index 57d5e909af..f6ec8da302 100644
--- a/captum/attr/_core/gradient_shap.py
+++ b/captum/attr/_core/gradient_shap.py
@@ -50,7 +50,7 @@ class GradientShap(GradientAttribution):
     In some sense it can be viewed as an approximation of integrated gradients
     by computing the expectations of gradients for different baselines.
 
-    Current implementation uses Smoothgrad from `NoiseTunnel` in order to
+    Current implementation uses Smoothgrad from :class:`.NoiseTunnel` in order to
     randomly draw samples from the distribution of baselines, add noise to input
     samples and compute the expectation (smoothgrad).
     """
@@ -59,7 +59,7 @@ def __init__(self, forward_func: Callable, multiply_by_inputs: bool = True) -> N
         r"""
         Args:
 
-            forward_func (function): The forward function of the model or
+            forward_func (Callable): The forward function of the model or
                        any modification of it.
             multiply_by_inputs (bool, optional): Indicates whether to factor
                     model inputs' multiplier in the final attribution scores.
@@ -127,7 +127,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which SHAP attribution
+            inputs (Tensor or tuple of Tensor): Input for which SHAP attribution
                         values are computed. If `forward_func` takes a single
                         tensor as input, a single input tensor should be provided.
                         If `forward_func` takes multiple tensors as input, a tuple
@@ -135,7 +135,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (tensor, tuple of tensors, callable):
+            baselines (Tensor, tuple of Tensor, or Callable):
                         Baselines define the starting point from which expectation
                         is computed and can be provided as:
 
@@ -158,11 +158,11 @@ def attribute(
 
                         It is recommended that the number of samples in the baselines'
                         tensors is larger than one.
-            n_samples (int, optional):  The number of randomly generated examples
+            n_samples (int, optional): The number of randomly generated examples
                         per sample in the input batch. Random examples are
                         generated by adding gaussian random noise to each sample.
                         Default: `5` if `n_samples` is not provided.
-            stdevs    (float, or a tuple of floats optional): The standard deviation
+            stdevs    (float or tuple of float, optional): The standard deviation
                         of gaussian noise with zero mean that is added to each
                         input in the batch. If `stdevs` is a single float value
                         then that same value is used for all inputs. If it is
@@ -171,7 +171,7 @@ def attribute(
                         corresponds to the input with the same index in the inputs
                         tuple.
                         Default: 0.0
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -196,7 +196,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It can contain a tuple of ND tensors or
@@ -215,7 +215,7 @@ def attribute(
                         Default: False
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution score computed based on GradientSHAP with respect
                         to each input feature. Attributions will always be
                         the same size as the provided inputs, with each value
@@ -223,7 +223,7 @@ def attribute(
                         If a single tensor is provided as inputs, a single tensor is
                         returned. If a tuple is provided for inputs, a tuple of
                         corresponding sized tensors is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         This is computed using the property that the total
                         sum of forward_func(inputs) - forward_func(baselines)
                         must be very close to the total sum of the attributions
@@ -294,8 +294,8 @@ def __init__(self, forward_func: Callable, multiply_by_inputs=True) -> None:
         r"""
         Args:
 
-            forward_func (function): The forward function of the model or
-                        any modification of it
+            forward_func (Callable): The forward function of the model or
+                        any modification of it.
             multiply_by_inputs (bool, optional): Indicates whether to factor
                         model inputs' multiplier in the final attribution scores.
                         In the literature this is also known as local vs global
diff --git a/captum/attr/_core/guided_backprop_deconvnet.py b/captum/attr/_core/guided_backprop_deconvnet.py
index e1953ed5b9..ba2c2114c5 100644
--- a/captum/attr/_core/guided_backprop_deconvnet.py
+++ b/captum/attr/_core/guided_backprop_deconvnet.py
@@ -27,7 +27,7 @@ def __init__(self, model: Module, use_relu_grad_output: bool = False) -> None:
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance.
+            model (nn.Module): The reference to PyTorch model instance.
         """
         GradientAttribution.__init__(self, model)
         self.model = model
@@ -121,7 +121,7 @@ def __init__(self, model: Module) -> None:
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place ReLU submodules; these are not
                         supported by the register_full_backward_hook PyTorch API.
         """
@@ -139,7 +139,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -148,7 +148,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -173,7 +173,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -186,8 +186,8 @@ def attribute(
                         Default: None
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The guided backprop gradients with respect to each
                         input feature. Attributions will always
                         be the same size as the provided inputs, with each value
@@ -234,7 +234,7 @@ def __init__(self, model: Module) -> None:
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place ReLU submodules; these are not
                         supported by the register_full_backward_hook PyTorch API.
         """
@@ -250,7 +250,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -259,7 +259,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -284,7 +284,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -297,8 +297,8 @@ def attribute(
                         Default: None
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The deconvolution attributions with respect to each
                         input feature. Attributions will always
                         be the same size as the provided inputs, with each value
diff --git a/captum/attr/_core/guided_grad_cam.py b/captum/attr/_core/guided_grad_cam.py
index f6e29c4b29..3c7478bae8 100644
--- a/captum/attr/_core/guided_grad_cam.py
+++ b/captum/attr/_core/guided_grad_cam.py
@@ -38,7 +38,7 @@ class GuidedGradCam(GradientAttribution):
 
     More details regarding GuidedGradCAM can be found in the original
     GradCAM paper here:
-    https://arxiv.org/pdf/1610.02391.pdf
+    https://arxiv.org/abs/1610.02391
 
     Warning: Ensure that all ReLU operations in the forward function of the
     given model are performed using a module (nn.module.ReLU).
@@ -51,14 +51,14 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place ReLU submodules; these are not
                         supported by the register_full_backward_hook PyTorch API
                         starting from PyTorch v1.9.
             layer (torch.nn.Module): Layer for which GradCAM attributions are computed.
                           Currently, only layers with a single tensor output are
                           supported.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -80,7 +80,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which attributions
+            inputs (Tensor or tuple of Tensor): Input for which attributions
                         are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -88,7 +88,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -113,7 +113,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -151,8 +151,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* of **attributions**:
-            - **attributions** (*tensor*):
+            *Tensor* of **attributions**:
+            - **attributions** (*Tensor*):
                     Element-wise product of (upsampled) GradCAM
                     and Guided Backprop attributions.
                     If a single tensor is provided as inputs, a single tensor is
diff --git a/captum/attr/_core/input_x_gradient.py b/captum/attr/_core/input_x_gradient.py
index 7817466013..fcf1d85025 100644
--- a/captum/attr/_core/input_x_gradient.py
+++ b/captum/attr/_core/input_x_gradient.py
@@ -22,7 +22,7 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
         """
         GradientAttribution.__init__(self, forward_func)
@@ -37,7 +37,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -46,7 +46,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -71,7 +71,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -84,8 +84,8 @@ def attribute(
                         Default: None
 
         Returns:
-                *tensor* or tuple of *tensors* of **attributions**:
-                - **attributions** (*tensor* or tuple of *tensors*):
+                *Tensor* or tuple of *Tensor* of **attributions**:
+                - **attributions** (*Tensor* or tuple of *Tensor*):
                             The input x gradient with
                             respect to each input feature. Attributions will always be
                             the same size as the provided inputs, with each value
diff --git a/captum/attr/_core/integrated_gradients.py b/captum/attr/_core/integrated_gradients.py
index e96a826c32..04896fac61 100644
--- a/captum/attr/_core/integrated_gradients.py
+++ b/captum/attr/_core/integrated_gradients.py
@@ -53,7 +53,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                     modification of it
             multiply_by_inputs (bool, optional): Indicates whether to factor
                     model inputs' multiplier in the final attribution scores.
@@ -130,7 +130,7 @@ def attribute(  # type: ignore
 
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which integrated
+            inputs (Tensor or tuple of Tensor): Input for which integrated
                         gradients are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -138,7 +138,7 @@ def attribute(  # type: ignore
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, or tuple of scalar or Tensor, optional):
                         Baselines define the starting point from which integral
                         is computed and can be provided as:
 
@@ -162,11 +162,12 @@ def attribute(  # type: ignore
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -191,7 +192,7 @@ def attribute(  # type: ignore
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -210,7 +211,7 @@ def attribute(  # type: ignore
                         Default: None
             n_steps (int, optional): The number of steps used by the approximation
                         method. Default: 50.
-            method (string, optional): Method for approximating the integral,
+            method (str, optional): Method for approximating the integral,
                         one of `riemann_right`, `riemann_left`, `riemann_middle`,
                         `riemann_trapezoid` or `gausslegendre`.
                         Default: `gausslegendre` if no method is provided.
@@ -232,7 +233,7 @@ def attribute(  # type: ignore
                     Default: False
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                     Integrated gradients with respect to each input feature.
                     attributions will always be the same size as the provided
                     inputs, with each value providing the attribution of the
@@ -240,7 +241,7 @@ def attribute(  # type: ignore
                     If a single tensor is provided as inputs, a single tensor is
                     returned. If a tuple is provided for inputs, a tuple of
                     corresponding sized tensors is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                     The difference between the total approximated and true
                     integrated gradients. This is computed using the property
                     that the total sum of forward_func(inputs) -
@@ -248,7 +249,7 @@ def attribute(  # type: ignore
                     integrated gradient.
                     Delta is calculated per example, meaning that the number of
                     elements in returned delta tensor is equal to the number of
-                    of examples in inputs.
+                    examples in inputs.
 
         Examples::
 
diff --git a/captum/attr/_core/kernel_shap.py b/captum/attr/_core/kernel_shap.py
index 2826b30dfe..12da6991dc 100644
--- a/captum/attr/_core/kernel_shap.py
+++ b/captum/attr/_core/kernel_shap.py
@@ -29,8 +29,8 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
-                        any modification of it
+            forward_func (Callable): The forward function of the model or
+                        any modification of it.
         """
         Lime.__init__(
             self,
@@ -86,7 +86,7 @@ def attribute(  # type: ignore
 
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which KernelShap
+            inputs (Tensor or tuple of Tensor): Input for which KernelShap
                         is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -94,7 +94,7 @@ def attribute(  # type: ignore
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, or tuple of scalar or Tensor, optional):
                         Baselines define the reference value which replaces each
                         feature when the corresponding interpretable feature
                         is set to 0.
@@ -120,10 +120,11 @@ def attribute(  # type: ignore
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which surrogate model is trained
                         (for classification cases,
                         this is usually the target class).
@@ -149,7 +150,7 @@ def attribute(  # type: ignore
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -166,7 +167,7 @@ def attribute(  # type: ignore
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
-            feature_mask (tensor or tuple of tensors, optional):
+            feature_mask (Tensor or tuple of Tensor, optional):
                         feature_mask defines a mask for the input, grouping
                         features which correspond to the same
                         interpretable feature. feature_mask
@@ -184,7 +185,7 @@ def attribute(  # type: ignore
                         If None, then a feature mask is constructed which assigns
                         each scalar within a tensor as a separate feature.
                         Default: None
-            n_samples (int, optional):  The number of samples of the original
+            n_samples (int, optional): The number of samples of the original
                         model used to train the surrogate interpretable model.
                         Default: `50` if `n_samples` is not provided.
             perturbations_per_eval (int, optional): Allows multiple samples
@@ -219,8 +220,8 @@ def attribute(  # type: ignore
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The attributions with respect to each input feature.
                         If return_input_shape = True, attributions will be
                         the same size as the provided inputs, with each value
@@ -316,7 +317,9 @@ def kernel_shap_perturb_generator(
         Perturbations are sampled by the following process:
          - Choose k (number of selected features), based on the distribution
                 p(k) = (M - 1) / (k * (M - k))
+
             where M is the total number of features in the interpretable space
+
          - Randomly select a binary vector with k ones, each sample is equally
             likely. This is done by generating a random vector of normal
             values and thresholding based on the top k elements.
diff --git a/captum/attr/_core/layer/grad_cam.py b/captum/attr/_core/layer/grad_cam.py
index c650409149..bcbcb02af7 100644
--- a/captum/attr/_core/layer/grad_cam.py
+++ b/captum/attr/_core/layer/grad_cam.py
@@ -47,7 +47,7 @@ class LayerGradCam(LayerAttribution, GradientAttribution):
 
     More details regarding the GradCAM method can be found in the
     original paper here:
-    https://arxiv.org/pdf/1610.02391.pdf
+    https://arxiv.org/abs/1610.02391
     """
 
     def __init__(
@@ -59,13 +59,13 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                           Output size of attribute matches this layer's output
                           dimensions, except for dimension 2, which will be 1,
                           since GradCAM sums over channels.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -86,7 +86,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which attributions
+            inputs (Tensor or tuple of Tensor): Input for which attributions
                         are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -94,7 +94,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -119,7 +119,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -151,8 +151,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attributions based on GradCAM method.
                         Attributions will be the same size as the
                         output of the given layer, except for dimension 2,
diff --git a/captum/attr/_core/layer/internal_influence.py b/captum/attr/_core/layer/internal_influence.py
index 8976fe7344..46aba1ff61 100644
--- a/captum/attr/_core/layer/internal_influence.py
+++ b/captum/attr/_core/layer/internal_influence.py
@@ -30,7 +30,7 @@ class InternalInfluence(LayerAttribution, GradientAttribution):
     given input.
     If no baseline is provided, the default baseline is the zero tensor.
     More details on this approach can be found here:
-    https://arxiv.org/pdf/1802.03788.pdf
+    https://arxiv.org/abs/1802.03788
 
     Note that this method is similar to applying integrated gradients and
     taking the layer as input, integrating the gradient of the layer with
@@ -46,7 +46,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                           Output size of attribute matches this layer's input or
@@ -54,7 +54,7 @@ def __init__(
                           the inputs or outputs of the layer, corresponding to
                           attribution of each neuron in the input or output of
                           this layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -78,7 +78,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which internal
+            inputs (Tensor or tuple of Tensor): Input for which internal
                         influence is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -86,7 +86,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define a starting point from which integral
                         is computed and can be provided as:
 
@@ -115,7 +115,7 @@ def attribute(
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -140,7 +140,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -159,7 +159,7 @@ def attribute(
                         Default: None
             n_steps (int, optional): The number of steps used by the approximation
                         method. Default: 50.
-            method (string, optional): Method for approximating the integral,
+            method (str, optional): Method for approximating the integral,
                         one of `riemann_right`, `riemann_left`, `riemann_middle`,
                         `riemann_trapezoid` or `gausslegendre`.
                         Default: `gausslegendre` if no method is provided.
@@ -187,13 +187,13 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+              - **attributions** (*Tensor* or tuple of *Tensor*):
                         Internal influence of each neuron in given
                         layer output. Attributions will always be the same size
                         as the output or input of the given layer depending on
                         whether `attribute_to_layer_input` is set to `False` or
-                        `True`respectively.
+                        `True` respectively.
                         Attributions are returned in a tuple if
                         the layer inputs / outputs contain multiple tensors,
                         otherwise a single tensor is returned.
diff --git a/captum/attr/_core/layer/layer_activation.py b/captum/attr/_core/layer/layer_activation.py
index 86c511706b..c4244e5966 100644
--- a/captum/attr/_core/layer/layer_activation.py
+++ b/captum/attr/_core/layer/layer_activation.py
@@ -25,9 +25,9 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
-            layer (torch.nn.Module or list(torch.nn.Module)): Layer or layers
+            layer (torch.nn.Module or list of torch.nn.Module): Layer or layers
                           for which attributions are computed.
                           Output size of attribute matches this layer's input or
                           output dimensions, depending on whether we attribute to
@@ -36,7 +36,7 @@ def __init__(
                           this layer. If multiple layers are provided, attributions
                           are returned as a list, each element corresponding to the
                           activations of the corresponding layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -54,7 +54,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         activation is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -62,7 +62,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -87,8 +87,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* or *list* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors* or *list*):
+            *Tensor* or tuple of *Tensor* or *list* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor* or *list*):
                         Activation of each neuron in given layer output.
                         Attributions will always be the same size as the
                         output of the given layer.
diff --git a/captum/attr/_core/layer/layer_conductance.py b/captum/attr/_core/layer/layer_conductance.py
index 3d76569c10..b8d9bc563f 100644
--- a/captum/attr/_core/layer/layer_conductance.py
+++ b/captum/attr/_core/layer/layer_conductance.py
@@ -32,7 +32,7 @@ class LayerConductance(LayerAttribution, GradientAttribution):
 
     The details of the approach can be found here:
     https://arxiv.org/abs/1805.12233
-    https://arxiv.org/pdf/1807.09946.pdf
+    https://arxiv.org/abs/1807.09946
 
     Note that this provides the total conductance of each neuron in the
     layer's output. To obtain the breakdown of a neuron's conductance by input
@@ -49,7 +49,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                           Output size of attribute matches this layer's input or
@@ -57,7 +57,7 @@ def __init__(
                           the inputs or outputs of the layer, corresponding to
                           attribution of each neuron in the input or output of
                           this layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -120,7 +120,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         conductance is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -128,7 +128,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define the starting point from which integral
                         is computed and can be provided as:
 
@@ -152,11 +152,12 @@ def attribute(
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -181,7 +182,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -200,7 +201,7 @@ def attribute(
                         Default: None
             n_steps (int, optional): The number of steps used by the approximation
                         method. Default: 50.
-            method (string, optional): Method for approximating the integral,
+            method (str, optional): Method for approximating the integral,
                         one of `riemann_right`, `riemann_left`, `riemann_middle`,
                         `riemann_trapezoid` or `gausslegendre`.
                         Default: `gausslegendre` if no method is provided.
@@ -234,7 +235,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Conductance of each neuron in given layer input or
                         output. Attributions will always be the same size as
                         the input or output of the given layer, depending on
@@ -244,7 +245,7 @@ def attribute(
                         Attributions are returned in a tuple if
                         the layer inputs / outputs contain multiple tensors,
                         otherwise a single tensor is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         The difference between the total
                         approximated and true conductance.
                         This is computed using the property that the total sum of
@@ -252,7 +253,7 @@ def attribute(
                         the total sum of the attributions.
                         Delta is calculated per example, meaning that the number of
                         elements in returned delta tensor is equal to the number of
-                        of examples in inputs.
+                        examples in inputs.
 
         Examples::
 
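Note on the delta description in the hunk above: the check can be illustrated on a toy problem without Captum. The sketch below attributes along a straight path from baseline to input for a hypothetical model f(x) = sum(x_i^2), using two of the method names listed in the docstring (`riemann_left`, `riemann_middle`). It works on input features rather than layer neurons, and the sign convention for delta (attribution sum minus output difference) is an assumption made for the illustration.

```python
def f(x):
    """Toy model: sum of squares of the input features."""
    return sum(v * v for v in x)


def grad_f(x):
    """Analytic gradient of the toy model."""
    return [2.0 * v for v in x]


def step_positions(n_steps, method):
    """Interpolation coefficients in [0, 1] for two Riemann variants."""
    if method == "riemann_left":
        return [i / n_steps for i in range(n_steps)]
    if method == "riemann_middle":
        return [(i + 0.5) / n_steps for i in range(n_steps)]
    raise ValueError(f"unsupported method: {method}")


def path_attributions(inputs, baselines, n_steps=50, method="riemann_middle"):
    """Approximate the path integral of the gradient, scaled by input - baseline."""
    totals = [0.0] * len(inputs)
    for alpha in step_positions(n_steps, method):
        point = [b + alpha * (x - b) for x, b in zip(inputs, baselines)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g / n_steps
    return [(x - b) * t for x, b, t in zip(inputs, baselines, totals)]


def convergence_delta(attributions, inputs, baselines):
    """Attribution sum minus the change in model output (one example)."""
    return sum(attributions) - (f(inputs) - f(baselines))


inputs = [1.0, -2.0, 3.0]
baselines = [0.0, 0.0, 0.0]
attr_mid = path_attributions(inputs, baselines, method="riemann_middle")
attr_left = path_attributions(inputs, baselines, method="riemann_left")
delta_mid = convergence_delta(attr_mid, inputs, baselines)
delta_left = convergence_delta(attr_left, inputs, baselines)
```

For this toy model the gradient is linear along the path, so the midpoint rule is exact up to floating-point error and `delta_mid` is close to zero, while the left rule underestimates the integral and leaves a visibly non-zero `delta_left`.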
diff --git a/captum/attr/_core/layer/layer_deep_lift.py b/captum/attr/_core/layer/layer_deep_lift.py
index 71a8e9eb29..362f250170 100644
--- a/captum/attr/_core/layer/layer_deep_lift.py
+++ b/captum/attr/_core/layer/layer_deep_lift.py
@@ -69,7 +69,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place nonlinear submodules; these are not
                         supported by the register_full_backward_hook PyTorch API
                         starting from PyTorch v1.9.
@@ -144,7 +144,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         attributions are computed. If forward_func takes a
                         single tensor as input, a single input tensor should be
                         provided. If forward_func takes multiple tensors as input,
@@ -153,7 +153,7 @@ def attribute(
                         corresponds to the number of examples (aka batch size),
                         and if multiple input tensors are provided, the examples
                         must be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define reference samples that are compared with
                         the inputs. In order to assign attribution scores DeepLift
                         computes the differences between the inputs/outputs and
@@ -180,11 +180,12 @@ def attribute(
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -209,7 +210,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -236,7 +237,7 @@ def attribute(
                         attribute to the input or output, is a single tensor.
                         Support for multiple tensors will be added later.
                         Default: False
-            custom_attribution_func (callable, optional): A custom function for
+            custom_attribution_func (Callable, optional): A custom function for
                         computing final attribution scores. This function can take
                         at least one and at most three arguments with the
                         following signature:
@@ -255,7 +256,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                 Attribution score computed based on DeepLift's rescale rule with
                 respect to layer's inputs or outputs. Attributions will always be the
                 same size as the provided layer's inputs or outputs, depending on
@@ -264,14 +265,14 @@ def attribute(
                 just a tensor is returned; if the layer input / output
                 has multiple tensors, then a corresponding tuple
                 of tensors is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                 This is computed using the property that the total sum of
                 forward_func(inputs) - forward_func(baselines) must equal the
                 total sum of the attributions computed based on DeepLift's
                 rescale rule.
                 Delta is calculated per example, meaning that the number of
                 elements in returned delta tensor is equal to the number of
-                of examples in input.
+                examples in input.
                 Note that the logic described for deltas is guaranteed
                 when the default logic for attribution computations is used,
                 meaning that the `custom_attribution_func=None`, otherwise
@@ -381,12 +382,14 @@ class LayerDeepLiftShap(LayerDeepLift, DeepLiftShap):
     input flag `attribute_to_layer_input`.
     More details about the algorithm can be found here:
 
-    http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
+    https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
 
     Note that the explanation model:
+
         1. Assumes that input features are independent of one another
         2. Is linear, meaning that the explanations are modeled through
             the additive composition of feature effects.
+
     Although, it assumes a linear model for each explanation, the overall
     model across multiple explanations can be complex and non-linear.
     """
@@ -400,7 +403,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place nonlinear submodules; these are not
                         supported by the register_full_backward_hook PyTorch API
                         starting from PyTorch v1.9.
@@ -479,7 +482,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -488,7 +491,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            baselines (tensor, tuple of tensors, callable):
+            baselines (Tensor, tuple of Tensor, or Callable):
                         Baselines define reference samples that are compared with
                         the inputs. In order to assign attribution scores DeepLift
                         computes the differences between the inputs/outputs and
@@ -513,7 +516,7 @@ def attribute(
 
                         It is recommended that the number of samples in the baselines'
                         tensors is larger than one.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -538,7 +541,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -564,7 +567,7 @@ def attribute(
                         outputs of internal layers are single tensors.
                         Support for multiple tensors will be added later.
                         Default: False
-            custom_attribution_func (callable, optional): A custom function for
+            custom_attribution_func (Callable, optional): A custom function for
                         computing final attribution scores. This function can take
                         at least one and at most three arguments with the
                         following signature:
@@ -584,7 +587,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution score computed based on DeepLift's rescale rule
                         with respect to layer's inputs or outputs. Attributions
                         will always be the same size as the provided layer's inputs
@@ -595,7 +598,7 @@ def attribute(
                         from a forward hook. For standard modules, inputs of
                         a single tensor are usually wrapped in a tuple, while
                         outputs of a single tensor are not.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         This is computed using the property that the
                         total sum of forward_func(inputs) - forward_func(baselines)
                         must be very close to the total sum of attributions
diff --git a/captum/attr/_core/layer/layer_feature_ablation.py b/captum/attr/_core/layer/layer_feature_ablation.py
index 75ac885eac..ee7df14ff7 100644
--- a/captum/attr/_core/layer/layer_feature_ablation.py
+++ b/captum/attr/_core/layer/layer_feature_ablation.py
@@ -42,7 +42,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                           Output size of attribute matches this layer's input or
@@ -50,7 +50,7 @@ def __init__(
                           the inputs or outputs of the layer, corresponding to
                           attribution of each neuron in the input or output of
                           this layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself
@@ -75,7 +75,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -83,7 +83,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            layer_baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            layer_baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Layer baselines define reference values which replace each
                         layer input / output value when ablated.
                         Layer baselines should be a single tensor with dimensions
@@ -94,7 +94,7 @@ def attribute(
                         In the cases when `baselines` is not provided, we internally
                         use zero as the baseline for each neuron.
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -119,7 +119,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -131,7 +131,7 @@ def attribute(
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
-            layer_mask (tensor or tuple of tensors, optional):
+            layer_mask (Tensor or tuple of Tensor, optional):
                         layer_mask defines a mask for the layer, grouping
                         elements of the layer input / output which should be
                         ablated together.
@@ -171,8 +171,8 @@ def attribute(
                         Default: 1
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution of each neuron in given layer input or
                         output. Attributions will always be the same size as
                         the input or output of the given layer, depending on
diff --git a/captum/attr/_core/layer/layer_gradient_shap.py b/captum/attr/_core/layer/layer_gradient_shap.py
index 9473475cdf..b6dfda9106 100644
--- a/captum/attr/_core/layer/layer_gradient_shap.py
+++ b/captum/attr/_core/layer/layer_gradient_shap.py
@@ -29,7 +29,7 @@ class LayerGradientShap(LayerAttribution, GradientAttribution):
     #deep-learning-example-with-gradientexplainer-tensorflowkeraspytorch-models
 
     A Unified Approach to Interpreting Model Predictions
-    http://papers.nips.cc/paper\
+    https://papers.nips.cc/paper\
     7062-a-unified-approach-to-interpreting-model-predictions
 
     GradientShap approximates SHAP values by computing the expectations of
@@ -52,7 +52,7 @@ class LayerGradientShap(LayerAttribution, GradientAttribution):
     In some sense it can be viewed as an approximation of integrated gradients
     by computing the expectations of gradients for different baselines.
 
-    Current implementation uses Smoothgrad from `NoiseTunnel` in order to
+    Current implementation uses Smoothgrad from :class:`.NoiseTunnel` in order to
     randomly draw samples from the distribution of baselines, add noise to input
     samples and compute the expectation (smoothgrad).
     """
@@ -67,7 +67,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                         modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                         Output size of attribute matches this layer's input or
@@ -75,7 +75,7 @@ def __init__(
                         the inputs or outputs of the layer, corresponding to
                         attribution of each neuron in the input or output of
                         this layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -146,7 +146,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input which are used to compute
+            inputs (Tensor or tuple of Tensor): Inputs which are used to compute
                         SHAP attribution values for a given `layer`. If `forward_func`
                         takes a single tensor as input, a single input tensor should
                         be provided.
@@ -155,7 +155,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (tensor, tuple of tensors, callable):
+            baselines (Tensor, tuple of Tensor, or Callable):
                         Baselines define the starting point from which expectation
                         is computed and can be provided as:
 
@@ -178,11 +178,11 @@ def attribute(
 
                         It is recommended that the number of samples in the baselines'
                         tensors is larger than one.
-            n_samples (int, optional):  The number of randomly generated examples
+            n_samples (int, optional): The number of randomly generated examples
                         per sample in the input batch. Random examples are
                         generated by adding gaussian random noise to each sample.
                         Default: `5` if `n_samples` is not provided.
-            stdevs    (float, or a tuple of floats optional): The standard deviation
+            stdevs (float or tuple of float, optional): The standard deviation
                         of gaussian noise with zero mean that is added to each
                         input in the batch. If `stdevs` is a single float value
                         then that same value is used for all inputs. If it is
@@ -191,7 +191,7 @@ def attribute(
                         corresponds to the input with the same index in the inputs
                         tuple.
                         Default: 0.0
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -216,7 +216,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It can contain a tuple of ND tensors or
@@ -246,7 +246,7 @@ def attribute(
                         Default: False
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution score computed based on GradientSHAP with
                         respect to layer's input or output. Attributions will always
                         be the same size as the provided layer's inputs or outputs,
@@ -255,7 +255,7 @@ def attribute(
                         Attributions are returned in a tuple if
                         the layer inputs / outputs contain multiple tensors,
                         otherwise a single tensor is returned.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+            - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         This is computed using the property that the total
                         sum of forward_func(inputs) - forward_func(baselines)
                         must be very close to the total sum of the attributions
@@ -335,7 +335,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                         modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                         Output size of attribute matches this layer's input or
@@ -343,7 +343,7 @@ def __init__(
                         the inputs or outputs of the layer, corresponding to
                         attribution of each neuron in the input or output of
                         this layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
diff --git a/captum/attr/_core/layer/layer_gradient_x_activation.py b/captum/attr/_core/layer/layer_gradient_x_activation.py
index a63a5d7abe..385a1491c4 100644
--- a/captum/attr/_core/layer/layer_gradient_x_activation.py
+++ b/captum/attr/_core/layer/layer_gradient_x_activation.py
@@ -30,9 +30,9 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                         modification of it
-            layer (torch.nn.Module or list(torch.nn.Module)): Layer or layers
+            layer (torch.nn.Module or list of torch.nn.Module): Layer or layers
                           for which attributions are computed.
                           Output size of attribute matches this layer's input or
                           output dimensions, depending on whether we attribute to
@@ -41,7 +41,7 @@ def __init__(
                           this layer. If multiple layers are provided, attributions
                           are returned as a list, each element corresponding to the
                           attributions of the corresponding layer.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -80,7 +80,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which attributions
+            inputs (Tensor or tuple of Tensor): Input for which attributions
                         are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -88,7 +88,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -113,7 +113,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -134,8 +134,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* or *list* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors* or *list*):
+            *Tensor* or tuple of *Tensor* or list of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor* or *list*):
                         Product of gradient and activation for each
                         neuron in given layer output.
                         Attributions will always be the same size as the
diff --git a/captum/attr/_core/layer/layer_integrated_gradients.py b/captum/attr/_core/layer/layer_integrated_gradients.py
index 2e769a5658..d67f52cad5 100644
--- a/captum/attr/_core/layer/layer_integrated_gradients.py
+++ b/captum/attr/_core/layer/layer_integrated_gradients.py
@@ -41,7 +41,6 @@ class LayerIntegratedGradients(LayerAttribution, GradientAttribution):
     More details regarding the integrated gradients method can be found in the
     original paper:
     https://arxiv.org/abs/1703.01365
-
     """
 
     def __init__(
@@ -53,12 +52,12 @@ def __init__(
     ) -> None:
         r"""
         Args:
-            forward_func (callable):  The forward function of the model or any
+
+            forward_func (Callable): The forward function of the model or any
                         modification of it
-            layer (ModuleOrModuleList):
-                        Layer or list of layers for which attributions are computed.
-                        For each layer the output size of the attribute matches
-                        this layer's input or output dimensions, depending on
+            layer (ModuleOrModuleList): Layer or list of layers for which attributions
+                        are computed. For each layer the output size of the attribute
+                        matches this layer's input or output dimensions, depending on
                         whether we attribute to the inputs or outputs of the
                         layer, corresponding to the attribution of each neuron
                         in the input or output of this layer.
@@ -74,7 +73,7 @@ def __init__(
                         dependence, e.g.  if you pass in l2 you cannot pass in
                         l1 or l3.
 
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -101,7 +100,7 @@ def __init__(
         if isinstance(layer, list) and len(layer) > 1:
             warnings.warn(
                 "Multiple layers provided. Please ensure that each layer is"
-                "**not** solely solely dependent on the outputs of"
+                " **not** solely dependent on the outputs of "
                 "another layer. Please refer to the documentation for more"
                 "detail."
             )
@@ -192,7 +191,7 @@ def attribute(
 
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer integrated
+            inputs (Tensor or tuple of Tensor): Input for which layer integrated
                         gradients are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -200,7 +199,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, or tuple of scalar or Tensor, optional):
                         Baselines define the starting point from which integral
                         is computed and can be provided as:
 
@@ -214,6 +213,7 @@ def attribute(
 
                         - a tuple of tensors or scalars, the baseline corresponding
                           to each tensor in the inputs' tuple can be:
+
                             - either a tensor with matching dimensions to
                               corresponding tensor in the inputs' tuple
                               or the first dimension is one and the remaining
@@ -227,7 +227,7 @@ def attribute(
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -252,7 +252,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -261,17 +261,19 @@ def attribute(
                         tensors or any arbitrary python types. These arguments
                         are provided to forward_func in order following the
                         arguments in inputs.
+
                         For a tensor, the first dimension of the tensor must
                         correspond to the number of examples. It will be
                         repeated for each of `n_steps` along the integrated
                         path. For all other types, the given argument is used
                         for all forward evaluations.
+
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
             n_steps (int, optional): The number of steps used by the approximation
                         method. Default: 50.
-            method (string, optional): Method for approximating the integral,
+            method (str, optional): Method for approximating the integral,
                         one of `riemann_right`, `riemann_left`, `riemann_middle`,
                         `riemann_trapezoid` or `gausslegendre`.
                         Default: `gausslegendre` if no method is provided.
@@ -280,6 +282,7 @@ def attribute(
                         which are computed (forward / backward passes)
                         sequentially. internal_batch_size must be at least equal to
                         #examples.
+
                         For DataParallel models, each batch is split among the
                         available devices, so evaluations on each available
                         device contain internal_batch_size / num_devices examples.
@@ -297,16 +300,19 @@ def attribute(
                         then the attributions will be computed with respect to
                         layer input, otherwise it will be computed with respect
                         to layer output.
+
                         Note that currently it is assumed that either the input
                         or the output of internal layer, depending on whether we
                         attribute to the input or output, is a single tensor.
                         Support for multiple tensors will be added later.
                         Default: False
+
             Returns:
                 **attributions** or 2-element tuple of **attributions**, **delta**:
-                - **attributions** (*tensor*, tuple of *tensors* or tuple of *tensors*):
-                        Integrated gradients with respect to `layer`'s inputs or
-                        outputs. Attributions will always be the same size and
+
+                  - **attributions** (*Tensor* or tuple of *Tensor*):
+                        Integrated gradients with respect to `layer`'s inputs
+                        or outputs. Attributions will always be the same size and
                         dimensionality as the input or output of the given layer,
                         depending on whether we attribute to the inputs or outputs
                         of the layer which is decided by the input flag
@@ -323,7 +329,8 @@ def attribute(
                         multiple tensors: the corresponding output element will be
                         a tuple of tensors. The ordering of the outputs will be
                         the same order as the layers given in the constructor.
-                - **delta** (*tensor*, returned if return_convergence_delta=True):
+
+                  - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         The difference between the total approximated and true
                         integrated gradients. This is computed using the property
                         that the total sum of forward_func(inputs) -
@@ -331,7 +338,7 @@ def attribute(
                         integrated gradient.
                         Delta is calculated per example, meaning that the number of
                         elements in returned delta tensor is equal to the number of
-                        of examples in inputs.
+                        examples in inputs.
 
             Examples::
 
diff --git a/captum/attr/_core/layer/layer_lrp.py b/captum/attr/_core/layer/layer_lrp.py
index bdc328f47e..1f78d1fdd8 100644
--- a/captum/attr/_core/layer/layer_lrp.py
+++ b/captum/attr/_core/layer/layer_lrp.py
@@ -42,7 +42,7 @@ def __init__(self, model: Module, layer: ModuleOrModuleList) -> None:
         """
         Args:
 
-            model (module): The forward function of the model or
+            model (Module): The forward function of the model or
                         any modification of it. Custom rules for a given layer need to
                         be defined as attribute
                         `module.rule` and need to be of type PropagationRule.
@@ -50,8 +50,7 @@ def __init__(self, model: Module, layer: ModuleOrModuleList) -> None:
                         these are not supported by the register_full_backward_hook
                         PyTorch API starting from PyTorch v1.9.
 
-
-            layer (torch.nn.Module or list(torch.nn.Module)): Layer or layers
+            layer (torch.nn.Module or list of torch.nn.Module): Layer or layers
                           for which attributions are computed.
                           The size and dimensionality of the attributions
                           corresponds to the size and dimensionality of the layer's
@@ -110,9 +109,9 @@ def attribute(
         ],
     ]:
         r"""
-
         Args:
-            inputs (tensor or tuple of tensors):  Input for which relevance is
+
+            inputs (Tensor or tuple of Tensor): Input for which relevance is
                         propagated.
                         If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
@@ -121,12 +120,12 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
-                        which gradients are computed (for classification cases,
-                        this is usually the target class).
-                        If the network returns a scalar value per example,
-                        no target index is necessary.
-                        For general 2D outputs, targets can be either:
+            target (int, tuple, Tensor, or list, optional): Output indices for
+                    which gradients are computed (for classification cases,
+                    this is usually the target class).
+                    If the network returns a scalar value per example,
+                    no target index is necessary.
+                    For general 2D outputs, targets can be either:
 
                     - a single integer or a tensor containing a single
                         integer, which is applied to all input examples
@@ -176,9 +175,10 @@ def attribute(
                     Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions** or 2-element tuple of
-                **attributions**, **delta** or lists of **attributions** and **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions** or 2-element tuple of
+            **attributions**, **delta** or list of **attributions** and **delta**:
+
+              - **attributions** (*Tensor* or tuple of *Tensor*):
                         The propagated relevance values with respect to each
                         input feature. Attributions will always
                         be the same size as the provided inputs, with each value
@@ -190,14 +190,15 @@ def attribute(
                         implementations. If attributions for all layers are returned
                         (layer=None) a list of tensors or tuples of tensors is returned
                         with entries for each layer.
-            - **delta** (*tensor* or list of *tensors*
-                         returned if return_convergence_delta=True):
+              - **delta** (*Tensor* or list of *Tensor*,
+                        returned if return_convergence_delta=True):
                         Delta is calculated per example, meaning that the number of
                         elements in returned delta tensor is equal to the number of
-                        of examples in input.
+                        examples in input.
                         If attributions for all layers are returned (layer=None) a list
                         of tensors is returned with entries for
                         each layer.
+
         Examples::
 
                 >>> # ImageClassifier takes a single input tensor of images Nx3x32x32,
diff --git a/captum/attr/_core/lime.py b/captum/attr/_core/lime.py
index f5ad7877bc..1f94bb9cb2 100644
--- a/captum/attr/_core/lime.py
+++ b/captum/attr/_core/lime.py
@@ -82,7 +82,7 @@ def __init__(
         Args:
 
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                     modification of it. If a batch is provided as input for
                     attribution, it is expected that forward_func returns a scalar
                     representing the entire batch.
@@ -106,7 +106,7 @@ def __init__(
                     Note that calling fit multiple times should retrain the
                     interpretable model, each attribution call reuses
                     the same given interpretable model object.
-            similarity_func (callable): Function which takes a single sample
+            similarity_func (Callable): Function which takes a single sample
                     along with its corresponding interpretable representation
                     and returns the weight of the interpretable sample for
                     training interpretable model. Weight is generally
@@ -131,7 +131,7 @@ def __init__(
 
                     All kwargs passed to the attribute method are
                     provided as keyword arguments (kwargs) to this callable.
-            perturb_func (callable): Function which returns a single
+            perturb_func (Callable): Function which returns a single
                     sampled input, generally a perturbation of the original
                     input, which is used to train the interpretable surrogate
                     model. Function can return samples in either
@@ -171,7 +171,7 @@ def __init__(
                     input. Once sampled, inputs can be converted to / from
                     the interpretable representation with either
                     to_interp_rep_transform or from_interp_rep_transform.
-            from_interp_rep_transform (callable): Function which takes a
+            from_interp_rep_transform (Callable): Function which takes a
                     single sampled interpretable representation (tensor
                     of shape 1 x num_interp_features) and returns
                     the corresponding representation in the input space
@@ -194,7 +194,7 @@ def __init__(
                     All kwargs passed to the attribute method are
                     provided as keyword arguments (kwargs) to this callable.
 
-            to_interp_rep_transform (callable): Function which takes a
+            to_interp_rep_transform (Callable): Function which takes a
                     sample in the original input space and converts to
                     its interpretable representation (tensor
                     of shape 1 x num_interp_features).
@@ -266,7 +266,7 @@ def attribute(
 
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which LIME
+            inputs (Tensor or tuple of Tensor): Input for which LIME
                         is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -274,7 +274,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which surrogate model is trained
                         (for classification cases,
                         this is usually the target class).
@@ -300,7 +300,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -315,7 +315,7 @@ def attribute(
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
-            n_samples (int, optional):  The number of samples of the original
+            n_samples (int, optional): The number of samples of the original
                         model used to train the surrogate interpretable model.
                         Default: `50` if `n_samples` is not provided.
             perturbations_per_eval (int, optional): Allows multiple samples
@@ -569,7 +569,7 @@ def default_from_interp_rep_transform(curr_sample, original_inputs, **kwargs):
     ), "Must provide feature_mask to use default interpretable representation transform"
     assert (
         "baselines" in kwargs
-    ), "Must provide baselines to use default interpretable representation transfrom"
+    ), "Must provide baselines to use default interpretable representation transform"
     feature_mask = kwargs["feature_mask"]
     if isinstance(feature_mask, Tensor):
         binary_mask = curr_sample[0][feature_mask].bool()
@@ -603,7 +603,7 @@ def get_exp_kernel_similarity_function(
 
     Args:
 
-        distance_mode (str, optional):  Distance mode can be either "cosine" or
+        distance_mode (str, optional): Distance mode can be either "cosine" or
                     "euclidean" corresponding to either cosine distance
                     or Euclidean distance respectively. Distance is computed
                     by flattening the original inputs and perturbed inputs
@@ -732,7 +732,7 @@ def __init__(
         Args:
 
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                     modification of it
             interpretable_model (Model, optional): Model object to train
                     interpretable model.
@@ -760,14 +760,14 @@ def __init__(
                     Note that calling fit multiple times should retrain the
-                    interpretable model, each attribution call reuses
+                    interpretable model, since each attribution call reuses
                     the same given interpretable model object.
-            similarity_func (callable, optional): Function which takes a single sample
+            similarity_func (Callable, optional): Function which takes a single sample
                     along with its corresponding interpretable representation
                     and returns the weight of the interpretable sample for
                     training the interpretable model.
                     This is often referred to as a similarity kernel.
 
                     This argument is optional and defaults to a function which
-                    applies an exponential kernel to the consine distance between
+                    applies an exponential kernel to the cosine distance between
                     the original input and perturbed input, with a kernel width
                     of 1.0.
 
@@ -793,7 +793,7 @@ def __init__(
 
                     kwargs includes baselines, feature_mask, num_interp_features
                     (integer, determined from feature mask).
-            perturb_func (callable, optional): Function which returns a single
+            perturb_func (Callable, optional): Function which returns a single
                     sampled input, which is a binary vector of length
                     num_interp_features, or a generator of such tensors.
 
@@ -879,7 +879,7 @@ def attribute(  # type: ignore
 
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which LIME
+            inputs (Tensor or tuple of Tensor): Input for which LIME
                         is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -887,7 +887,7 @@ def attribute(  # type: ignore
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define reference value which replaces each
                         feature when the corresponding interpretable feature
                         is set to 0.
@@ -913,10 +913,11 @@ def attribute(  # type: ignore
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which surrogate model is trained
                         (for classification cases,
                         this is usually the target class).
@@ -942,7 +943,7 @@ def attribute(  # type: ignore
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -959,7 +960,7 @@ def attribute(  # type: ignore
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
-            feature_mask (tensor or tuple of tensors, optional):
+            feature_mask (Tensor or tuple of Tensor, optional):
                         feature_mask defines a mask for the input, grouping
                         features which correspond to the same
                         interpretable feature. feature_mask
@@ -977,7 +978,7 @@ def attribute(  # type: ignore
                         If None, then a feature mask is constructed which assigns
                         each scalar within a tensor as a separate feature.
                         Default: None
-            n_samples (int, optional):  The number of samples of the original
+            n_samples (int, optional): The number of samples of the original
                         model used to train the surrogate interpretable model.
                         Default: `50` if `n_samples` is not provided.
             perturbations_per_eval (int, optional): Allows multiple samples
@@ -1012,8 +1013,8 @@ def attribute(  # type: ignore
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The attributions with respect to each input feature.
                         If return_input_shape = True, attributions will be
                         the same size as the provided inputs, with each value
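The `similarity_func` docstring above describes a default that applies an exponential kernel to the cosine distance between the original and perturbed input, with a kernel width of 1.0. A minimal pure-Python sketch of such a kernel follows. The function names and the Gaussian-style form `exp(-d^2 / (2 * width^2))` are assumptions made for illustration, not a copy of the library's implementation.

```python
import math


def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity of two flattened inputs.
    # Assumes both vectors are non-zero (otherwise the norm product is 0).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exp_kernel_weight(original, perturbed, kernel_width=1.0):
    # Hypothetical helper: maps a distance to a sample weight in (0, 1].
    # Identical inputs get weight 1; more distant samples get less weight.
    d = cosine_distance(original, perturbed)
    return math.exp(-(d ** 2) / (2 * kernel_width ** 2))
```

A smaller `kernel_width` makes the weight fall off faster, so the surrogate model concentrates on perturbations close to the original input.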
diff --git a/captum/attr/_core/lrp.py b/captum/attr/_core/lrp.py
index e11d0b8544..d557f0ce20 100644
--- a/captum/attr/_core/lrp.py
+++ b/captum/attr/_core/lrp.py
@@ -45,7 +45,7 @@ def __init__(self, model: Module) -> None:
         r"""
         Args:
 
-            model (module): The forward function of the model or any modification of
+            model (Module): The forward function of the model or any modification of
                 it. Custom rules for a given layer need to be defined as attribute
                 `module.rule` and need to be of type PropagationRule. If no rule is
                 specified for a layer, a pre-defined default rule for the module type
@@ -98,7 +98,8 @@ def attribute(
     ]:
         r"""
         Args:
-            inputs (tensor or tuple of tensors):  Input for which relevance is
+
+            inputs (Tensor or tuple of Tensor): Input for which relevance is
                         propagated. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -106,12 +107,13 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
-                        which gradients are computed (for classification cases,
-                        this is usually the target class).
-                        If the network returns a scalar value per example,
-                        no target index is necessary.
-                        For general 2D outputs, targets can be either:
+
+            target (int, tuple, Tensor, or list, optional): Output indices for
+                    which gradients are computed (for classification cases,
+                    this is usually the target class).
+                    If the network returns a scalar value per example,
+                    no target index is necessary.
+                    For general 2D outputs, targets can be either:
 
                     - a single integer or a tensor containing a single
                         integer, which is applied to all input examples
@@ -153,9 +155,10 @@ def attribute(
                     of rules is printed during propagation.
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**
-            or 2-element tuple of **attributions**, **delta**::
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**
+            or 2-element tuple of **attributions**, **delta**:
+
+              - **attributions** (*Tensor* or tuple of *Tensor*):
                         The propagated relevance values with respect to each
                         input feature. The values are normalized by the output score
                         value (sum(relevance)=1). To obtain values comparable to other
@@ -168,10 +171,12 @@ def attribute(
                         corresponding sized tensors is returned. The sum of attributions
-                        is one and not corresponding to the prediction score as in other
+                        is one and does not match the prediction score as in other
                         implementations.
-            - **delta** (*tensor*, returned if return_convergence_delta=True):
+
+              - **delta** (*Tensor*, returned if return_convergence_delta=True):
                         Delta is calculated per example, meaning that the number of
                         elements in returned delta tensor is equal to the number of
-                        of examples in the inputs.
+                        examples in the inputs.
+
         Examples::
 
                 >>> # ImageClassifier takes a single input tensor of images Nx3x32x32,
@@ -241,7 +246,7 @@ def compute_convergence_delta(
 
         Args:
 
-            attributions (tensor or tuple of tensors): Attribution scores that
+            attributions (Tensor or tuple of Tensor): Attribution scores that
                         are precomputed by an attribution algorithm.
                         Attributions can be provided in form of a single tensor
                         or a tuple of those. It is assumed that attribution
@@ -249,12 +254,13 @@ def compute_convergence_delta(
                         examples, and if multiple input tensors are provided,
                         the examples must be aligned appropriately.
 
-            output (tensor with single element): The output value with respect to which
+            output (Tensor): The output value with respect to which
                         the attribution values are computed. This value corresponds to
-                        the target score of a classification model.
+                        the target score of a classification model. The given tensor
+                        should only have a single element.
 
         Returns:
-            *tensor*:
+            *Tensor*:
             - **delta** Difference of relevance in output layer and input layer.
         """
         if isinstance(attributions, tuple):
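The `compute_convergence_delta` docstring above describes delta as the difference of relevance between the output layer and the input layer, computed per example. A minimal sketch of that idea on plain Python lists follows. The function name, the list-based inputs, and the sign convention (output minus summed attributions) are assumptions for illustration only.

```python
def convergence_delta(attributions, outputs):
    # attributions: one flat list of attribution scores per example.
    # outputs: one target score per example.
    # Returns one delta per example; 0.0 means relevance is fully conserved.
    return [out - sum(attr) for attr, out in zip(attributions, outputs)]
```

A delta far from zero suggests that relevance was lost or created somewhere between the output and the input.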
diff --git a/captum/attr/_core/neuron/neuron_conductance.py b/captum/attr/_core/neuron/neuron_conductance.py
index dec6b39b01..004d941cb9 100644
--- a/captum/attr/_core/neuron/neuron_conductance.py
+++ b/captum/attr/_core/neuron/neuron_conductance.py
@@ -45,7 +45,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                         modification of it
             layer (torch.nn.Module): Layer for which neuron attributions are computed.
                         Attributions for a particular neuron in the input or output
@@ -62,7 +62,7 @@ def __init__(
                         Currently, it is assumed that the inputs or the outputs
                         of the layer, depending on which one is used for
                         attribution, can only be a single tensor.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -103,7 +103,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which neuron
+            inputs (Tensor or tuple of Tensor): Input for which neuron
                         conductance is computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -111,7 +111,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -143,7 +143,7 @@ def attribute(
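-                          the gradient of output with respect to the intermedite neuron,
-                          which cannot be computed for aggregations of multiple
-                          intemediate neurons.
+                          the gradient of output with respect to the intermediate
+                          neuron, which cannot be computed for aggregations of multiple
+                          intermediate neurons.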
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define the starting point from which integral
                         is computed and can be provided as:
 
@@ -172,7 +172,7 @@ def attribute(
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -197,7 +197,7 @@ def attribute(
                           target for the corresponding example.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -216,7 +216,7 @@ def attribute(
                         Default: None
             n_steps (int, optional): The number of steps used by the approximation
                         method. Default: 50.
-            method (string, optional): Method for approximating the integral,
+            method (str, optional): Method for approximating the integral,
                         one of `riemann_right`, `riemann_left`, `riemann_middle`,
                         `riemann_trapezoid` or `gausslegendre`.
                         Default: `gausslegendre` if no method is provided.
@@ -244,8 +244,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Conductance for
                         particular neuron with respect to each input feature.
                         Attributions will always be the same size as the provided
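The `n_steps` and `method` arguments documented above select how the path integral is approximated. The sketch below shows textbook versions of the four Riemann rules on the interval [0, 1]. The helper names are hypothetical, the weights follow the standard definitions rather than the library's internals, and `gausslegendre` is omitted because it needs quadrature nodes.

```python
def riemann_rule(n_steps, method="riemann_middle"):
    # Returns (alphas, weights): evaluation points in [0, 1] and their weights.
    if method == "riemann_left":
        alphas = [i / n_steps for i in range(n_steps)]
    elif method == "riemann_right":
        alphas = [(i + 1) / n_steps for i in range(n_steps)]
    elif method == "riemann_middle":
        alphas = [(i + 0.5) / n_steps for i in range(n_steps)]
    elif method == "riemann_trapezoid":
        if n_steps < 2:
            raise ValueError("trapezoid rule needs at least 2 points")
        h = 1.0 / (n_steps - 1)
        alphas = [i * h for i in range(n_steps)]
        weights = [h] * n_steps
        weights[0] = weights[-1] = h / 2  # endpoints count half
        return alphas, weights
    else:
        raise ValueError(f"unsupported method: {method}")
    return alphas, [1.0 / n_steps] * n_steps


def integrate(f, n_steps=50, method="riemann_middle"):
    # Weighted sum of f at the chosen points approximates the integral on [0, 1].
    alphas, weights = riemann_rule(n_steps, method)
    return sum(w * f(a) for a, w in zip(alphas, weights))
```

For a linear integrand the middle and trapezoid rules are exact, while the left and right rules under- and over-estimate by a term that shrinks as `n_steps` grows.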
diff --git a/captum/attr/_core/neuron/neuron_deep_lift.py b/captum/attr/_core/neuron/neuron_deep_lift.py
index aff216d37a..d486bdea51 100644
--- a/captum/attr/_core/neuron/neuron_deep_lift.py
+++ b/captum/attr/_core/neuron/neuron_deep_lift.py
@@ -46,7 +46,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place nonlinear submodules; these are not
                         supported by the register_full_backward_hook PyTorch API
                         starting from PyTorch v1.9.
@@ -90,7 +90,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         attributions are computed. If forward_func takes a
                         single tensor as input, a single input tensor should be
                         provided. If forward_func takes multiple tensors as input,
@@ -99,7 +99,7 @@ def attribute(
                         corresponds to the number of examples (aka batch size),
                         and if multiple input tensors are provided, the examples
                         must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -120,7 +120,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -133,7 +133,7 @@ def attribute(
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
 
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define reference samples that are compared with
                         the inputs. In order to assign attribution scores DeepLift
                         computes the differences between the inputs/outputs and
@@ -165,7 +165,7 @@ def attribute(
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -187,7 +187,7 @@ def attribute(
                         attribute to the input or output, is a single tensor.
                         Support for multiple tensors will be added later.
                         Default: False
-            custom_attribution_func (callable, optional): A custom function for
+            custom_attribution_func (Callable, optional): A custom function for
                         computing final attribution scores. This function can take
                         at least one and at most three arguments with the
                         following signature:
@@ -207,7 +207,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                 Computes attributions using Deeplift's rescale rule for
                 particular neuron with respect to each input feature.
                 Attributions will always be the same size as the provided
@@ -273,12 +273,13 @@ class NeuronDeepLiftShap(NeuronAttribution, GradientAttribution):
     by the input flag `attribute_to_layer_input`.
     More details about the algorithm can be found here:
 
-    http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
+    https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
 
     Note that the explanation model:
         1. Assumes that input features are independent of one another
         2. Is linear, meaning that the explanations are modeled through
             the additive composition of feature effects.
+
-    Although, it assumes a linear model for each explanation, the overall
+    Although it assumes a linear model for each explanation, the overall
     model across multiple explanations can be complex and non-linear.
     """
@@ -289,7 +290,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                         contain any in-place nonlinear submodules; these are not
                         supported by the register_full_backward_hook PyTorch API
                         starting from PyTorch v1.9.
@@ -334,7 +335,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which layer
+            inputs (Tensor or tuple of Tensor): Input for which layer
                         attributions are computed. If forward_func takes a
                         single tensor as input, a single input tensor should be
                         provided. If forward_func takes multiple tensors as input,
@@ -343,7 +344,7 @@ def attribute(
                         corresponds to the number of examples (aka batch size),
                         and if multiple input tensors are provided, the examples
                         must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -364,7 +365,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -376,7 +377,8 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            baselines (tensor, tuple of tensors, callable):
+
+            baselines (Tensor, tuple of Tensor, or Callable):
                         Baselines define reference samples that are compared with
                         the inputs. In order to assign attribution scores DeepLift
                         computes the differences between the inputs/outputs and
@@ -401,7 +403,7 @@ def attribute(
 
                         It is recommended that the number of samples in the baselines'
                         tensors is larger than one.
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -423,7 +425,7 @@ def attribute(
                         attribute to the input or output, is a single tensor.
                         Support for multiple tensors will be added later.
                         Default: False
-            custom_attribution_func (callable, optional): A custom function for
+            custom_attribution_func (Callable, optional): A custom function for
                         computing final attribution scores. This function can take
                         at least one and at most three arguments with the
                         following signature:
@@ -443,7 +445,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Computes attributions using Deeplift's rescale rule for
                         particular neuron with respect to each input feature.
                         Attributions will always be the same size as the provided
diff --git a/captum/attr/_core/neuron/neuron_feature_ablation.py b/captum/attr/_core/neuron/neuron_feature_ablation.py
index d706f71cb4..8ee73197da 100644
--- a/captum/attr/_core/neuron/neuron_feature_ablation.py
+++ b/captum/attr/_core/neuron/neuron_feature_ablation.py
@@ -35,7 +35,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                           Attributions for a particular neuron in the input or output
@@ -44,7 +44,7 @@ def __init__(
                           Currently, it is assumed that the inputs or the outputs
                           of the layer, depending on which one is used for
                           attribution, can only be a single tensor.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -67,7 +67,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which neuron
+            inputs (Tensor or tuple of Tensor): Input for which neuron
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -75,7 +75,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -96,7 +96,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -108,7 +108,8 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define reference value which replaces each
                         feature when ablated.
                         Baselines can be provided as:
@@ -132,10 +133,11 @@ def attribute(
                           - or a scalar, corresponding to a tensor in the
                             inputs' tuple. This scalar value is broadcasted
                             for corresponding input tensor.
+
                         In the cases when `baselines` is not provided, we internally
                         use zero scalar corresponding to each input tensor.
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -147,7 +149,7 @@ def attribute(
                         Note that attributions are not computed with respect
                         to these arguments.
                         Default: None
-            feature_mask (tensor or tuple of tensors, optional):
+            feature_mask (Tensor or tuple of Tensor, optional):
                         feature_mask defines a mask for the input, grouping
                         features which should be ablated together. feature_mask
                         should contain the same number of tensors as inputs.
@@ -187,8 +189,8 @@ def attribute(
                         Default: 1
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attributions of particular neuron with respect to each input
                         feature. Attributions will always be the same size as the
                         provided inputs, with each value providing the attribution
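
Reviewer note: the `baselines` and `feature_mask` docstrings above describe scalar broadcasting and grouped ablation. A minimal pure-Python sketch of that behaviour follows. It is not Captum's implementation: `ablate_features` and `toy_forward` are hypothetical names, it handles a single example with flat lists instead of tensors, and assigning the same output difference to every feature in a group is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Union


def ablate_features(
    forward: Callable[[List[float]], float],
    inputs: Sequence[float],
    feature_mask: Sequence[int],
    baselines: Union[float, Sequence[float]] = 0.0,
) -> List[float]:
    """Toy feature ablation for one example (hypothetical helper)."""
    # A scalar baseline is broadcast to every feature position.
    if isinstance(baselines, (int, float)):
        baselines = [float(baselines)] * len(inputs)

    original = forward(list(inputs))
    attributions = [0.0] * len(inputs)

    for group in sorted(set(feature_mask)):
        # Replace all features that share this mask id with their baselines.
        ablated = [
            b if g == group else x
            for x, b, g in zip(inputs, baselines, feature_mask)
        ]
        diff = original - forward(ablated)
        # Assumption: every feature in the group gets the same difference.
        for i, g in enumerate(feature_mask):
            if g == group:
                attributions[i] = diff
    return attributions


def toy_forward(x: List[float]) -> float:
    """Hypothetical linear model with weights 1, 2, 3."""
    return 1.0 * x[0] + 2.0 * x[1] + 3.0 * x[2]


# Features 0 and 1 are ablated together; feature 2 is ablated on its own.
attr = ablate_features(toy_forward, [1.0, 2.0, 3.0], feature_mask=[0, 0, 1])
print(attr)
```

Omitting `baselines` falls back to a zero scalar here, which mirrors the documented default of using a zero scalar for each input tensor.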
diff --git a/captum/attr/_core/neuron/neuron_gradient.py b/captum/attr/_core/neuron/neuron_gradient.py
index 5292990bbf..d948dfee1a 100644
--- a/captum/attr/_core/neuron/neuron_gradient.py
+++ b/captum/attr/_core/neuron/neuron_gradient.py
@@ -33,7 +33,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                           modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                           Output size of attribute matches this layer's input or
@@ -44,7 +44,7 @@ def __init__(
                           Currently, it is assumed that the inputs or the outputs
                           of the layer, depending on which one is used for
                           attribution, can only be a single tensor.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -64,7 +64,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which neuron
+            inputs (Tensor or tuple of Tensor): Input for which neuron
                         gradients are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -72,7 +72,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -93,7 +93,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -105,7 +105,7 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -130,8 +130,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Gradients of particular neuron with respect to each input
                         feature. Attributions will always be the same size as the
                         provided inputs, with each value providing the attribution
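
Reviewer note: the `neuron_selector` docstring states that a slice attributes the sum of the selected neurons, not each neuron independently. The sketch below illustrates that distinction on a hypothetical linear layer in plain Python. The weight matrix `W` and both helper functions are assumptions made for illustration and are not part of Captum.

```python
from typing import List

# Hypothetical 3-neuron linear layer: neuron j computes sum_i W[j][i] * x[i].
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]


def neuron_gradient(j: int) -> List[float]:
    """Gradient of neuron j's output w.r.t. the input, which is row j of W."""
    return list(W[j])


def slice_gradient(selector: slice) -> List[float]:
    """Gradient of the SUM of the selected neurons w.r.t. the input."""
    rows = W[selector]
    return [sum(column) for column in zip(*rows)]


# Selecting neurons one at a time yields one gradient per neuron ...
individual = [neuron_gradient(j) for j in (0, 1)]
# ... while a slice yields a single gradient for their sum.
summed = slice_gradient(slice(0, 2))

print(individual)
print(summed)
```

To get per-neuron attributions, call `attribute` once per integer index instead of passing a slice.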
diff --git a/captum/attr/_core/neuron/neuron_gradient_shap.py b/captum/attr/_core/neuron/neuron_gradient_shap.py
index 42a543b50d..338949352e 100644
--- a/captum/attr/_core/neuron/neuron_gradient_shap.py
+++ b/captum/attr/_core/neuron/neuron_gradient_shap.py
@@ -18,7 +18,7 @@ class NeuronGradientShap(NeuronAttribution, GradientAttribution):
     #deep-learning-example-with-gradientexplainer-tensorflowkeraspytorch-models
 
     A Unified Approach to Interpreting Model Predictions
-    http://papers.nips.cc/paper\
+    https://papers.nips.cc/paper\
     7062-a-unified-approach-to-interpreting-model-predictions
 
     GradientShap approximates SHAP values by computing the expectations of
@@ -41,7 +41,7 @@ class NeuronGradientShap(NeuronAttribution, GradientAttribution):
     In some sense it can be viewed as an approximation of integrated gradients
     by computing the expectations of gradients for different baselines.
 
-    Current implementation uses Smoothgrad from `NoiseTunnel` in order to
+    Current implementation uses Smoothgrad from :class:`.NoiseTunnel` in order to
     randomly draw samples from the distribution of baselines, add noise to input
     samples and compute the expectation (smoothgrad).
     """
@@ -56,17 +56,17 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                         modification of it
             layer (torch.nn.Module): Layer for which neuron attributions are computed.
                         The output size of the attribute method matches the
-                        dimensions of the inputs or ouputs of the neuron with
+                        dimensions of the inputs or outputs of the neuron with
                         index `neuron_selector` in this layer, depending on whether
                         we attribute to the inputs or outputs of the neuron.
                         Currently, it is assumed that the inputs or the outputs
                         of the neurons in this layer, depending on which one is
                         used for attribution, can only be a single tensor.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -106,7 +106,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which SHAP attribution
+            inputs (Tensor or tuple of Tensor): Input for which SHAP attribution
                         values are computed. If `forward_func` takes a single
                         tensor as input, a single input tensor should be provided.
                         If `forward_func` takes multiple tensors as input, a tuple
@@ -114,7 +114,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -135,7 +135,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -147,7 +147,7 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            baselines (tensor, tuple of tensors, callable):
+            baselines (Tensor, tuple of Tensor, or Callable):
                         Baselines define the starting point from which expectation
                         is computed and can be provided as:
 
@@ -170,11 +170,11 @@ def attribute(
 
                         It is recommended that the number of samples in the baselines'
                         tensors is larger than one.
-            n_samples (int, optional):  The number of randomly generated examples
+            n_samples (int, optional): The number of randomly generated examples
                         per sample in the input batch. Random examples are
                         generated by adding gaussian random noise to each sample.
                         Default: `5` if `n_samples` is not provided.
-            stdevs    (float, or a tuple of floats optional): The standard deviation
+            stdevs    (float or tuple of float, optional): The standard deviation
                         of gaussian noise with zero mean that is added to each
                         input in the batch. If `stdevs` is a single float value
                         then that same value is used for all inputs. If it is
@@ -183,7 +183,7 @@ def attribute(
                         corresponds to the input with the same index in the inputs
                         tuple.
                         Default: 0.0
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It can contain a tuple of ND tensors or
@@ -209,7 +209,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution score computed based on GradientSHAP with respect
                         to each input feature. Attributions will always be
                         the same size as the provided inputs, with each value
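
Reviewer note: the class docstring describes GradientShap as drawing random baselines, adding Gaussian noise to the inputs, and taking the expectation of gradients. The following is a rough pure-Python sketch of that idea under stated assumptions: `gradient_shap_sketch` and `linear_grad` are hypothetical names, a single `stdev` float is used for all features, and gradients are multiplied by (noisy input - baseline). It is not the library's code path, which goes through `NoiseTunnel`.

```python
import random
from typing import Callable, List, Sequence


def gradient_shap_sketch(
    grad_fn: Callable[[List[float]], List[float]],
    inputs: Sequence[float],
    baselines: Sequence[Sequence[float]],
    n_samples: int = 5,
    stdev: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo estimate of E[grad * (input - baseline)] for one example."""
    rng = random.Random(seed)
    totals = [0.0] * len(inputs)
    for _ in range(n_samples):
        # Add zero-mean Gaussian noise to the input.
        noisy = [x + rng.gauss(0.0, stdev) for x in inputs]
        # Draw one baseline from the baseline distribution.
        baseline = rng.choice(list(baselines))
        # Pick a random point on the line between baseline and noisy input.
        alpha = rng.random()
        point = [b + alpha * (x - b) for x, b in zip(noisy, baseline)]
        grads = grad_fn(point)
        for i, (g, x, b) in enumerate(zip(grads, noisy, baseline)):
            totals[i] += g * (x - b)
    return [t / n_samples for t in totals]


def linear_grad(_point: List[float]) -> List[float]:
    """Gradient of the hypothetical model f(x) = 2*x0 - x1 (constant)."""
    return [2.0, -1.0]


shap_attr = gradient_shap_sketch(
    linear_grad, inputs=[1.0, 3.0], baselines=[[0.0, 0.0]], n_samples=5
)
print(shap_attr)
```

With a single baseline and `stdev=0.0` the estimate is deterministic; the docstring's recommendation to pass more than one baseline sample is what makes the expectation over baselines meaningful.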
diff --git a/captum/attr/_core/neuron/neuron_guided_backprop_deconvnet.py b/captum/attr/_core/neuron/neuron_guided_backprop_deconvnet.py
index 7c69aed87a..b9a5e80b7f 100644
--- a/captum/attr/_core/neuron/neuron_guided_backprop_deconvnet.py
+++ b/captum/attr/_core/neuron/neuron_guided_backprop_deconvnet.py
@@ -35,7 +35,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                           contain any in-place ReLU submodules; these are not
                           supported by the register_full_backward_hook PyTorch API
                           starting from PyTorch v1.9.
@@ -48,7 +48,7 @@ def __init__(
                           Currently, it is assumed that the inputs or the outputs
                           of the layer, depending on which one is used for
                           attribution, can only be a single tensor.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -69,7 +69,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -78,7 +78,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -99,7 +99,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -111,7 +111,7 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -134,8 +134,8 @@ def attribute(
                         Support for multiple tensors will be added later.
                         Default: False
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Deconvolution attribution of
                         particular neuron with respect to each input feature.
                         Attributions will always be the same size as the provided
@@ -207,7 +207,7 @@ def __init__(
         r"""
         Args:
 
-            model (nn.Module):  The reference to PyTorch model instance. Model cannot
+            model (nn.Module): The reference to PyTorch model instance. Model cannot
                           contain any in-place ReLU submodules; these are not
                           supported by the register_full_backward_hook PyTorch API
                           starting from PyTorch v1.9.
@@ -217,7 +217,7 @@ def __init__(
                           in the attribute method.
                           Currently, only layers with a single tensor output are
                           supported.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                           applies a DataParallel model. This allows reconstruction of
                           intermediate outputs from batched results across devices.
                           If forward_func is given as the DataParallel model itself,
@@ -238,7 +238,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which
+            inputs (Tensor or tuple of Tensor): Input for which
                         attributions are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -247,7 +247,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -268,7 +268,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -280,7 +280,7 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -303,8 +303,8 @@ def attribute(
                         Support for multiple tensors will be added later.
                         Default: False
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Guided backprop attribution of
                         particular neuron with respect to each input feature.
                         Attributions will always be the same size as the provided
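
Reviewer note: both classes in this file override how gradients flow back through ReLU, which is why the docstrings mention backward hooks and in-place ReLU. As a hedged illustration, the sketch below shows the commonly described backward rules for plain backprop, guided backprop, and deconvnet on Python lists. The function names are hypothetical and this is not the hook code used in Captum.

```python
from typing import List


def relu_backward(forward_input: List[float], grad_output: List[float]) -> List[float]:
    """Standard ReLU gradient: pass gradient where the forward input was positive."""
    return [g if x > 0 else 0.0 for x, g in zip(forward_input, grad_output)]


def guided_backprop_relu(forward_input: List[float], grad_output: List[float]) -> List[float]:
    """Guided backprop: also require the incoming gradient to be positive."""
    return [g if (x > 0 and g > 0) else 0.0 for x, g in zip(forward_input, grad_output)]


def deconvnet_relu(forward_input: List[float], grad_output: List[float]) -> List[float]:
    """Deconvnet: apply ReLU to the incoming gradient, ignoring the forward input."""
    return [g if g > 0 else 0.0 for g in grad_output]


x = [1.0, -1.0, 2.0, -2.0]
g = [0.5, 0.5, -0.5, -0.5]

plain = relu_backward(x, g)
guided = guided_backprop_relu(x, g)
deconv = deconvnet_relu(x, g)
print(plain, guided, deconv)
```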
diff --git a/captum/attr/_core/neuron/neuron_integrated_gradients.py b/captum/attr/_core/neuron/neuron_integrated_gradients.py
index f67aec7e7e..2afc17180f 100644
--- a/captum/attr/_core/neuron/neuron_integrated_gradients.py
+++ b/captum/attr/_core/neuron/neuron_integrated_gradients.py
@@ -33,7 +33,7 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable):  The forward function of the model or any
+            forward_func (Callable): The forward function of the model or any
                         modification of it
             layer (torch.nn.Module): Layer for which attributions are computed.
                         Output size of attribute matches this layer's input or
@@ -44,7 +44,7 @@ def __init__(
                         Currently, it is assumed that the inputs or the outputs
                         of the layer, depending on which one is used for
                         attribution, can only be a single tensor.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model. This allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -84,7 +84,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which neuron integrated
+            inputs (Tensor or tuple of Tensor): Input for which neuron integrated
                         gradients are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -92,7 +92,7 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            neuron_selector (int, callable, or tuple of ints or slices):
+            neuron_selector (int, Callable, tuple of int, or slice):
                         Selector for neuron
                         in given layer for which attribution is desired.
                         Neuron selector can be provided as:
@@ -113,7 +113,7 @@ def attribute(
                           indexed output tensor is used for attribution. Note
                           that specifying a slice of a tensor would amount to
                           computing the attribution of the sum of the specified
-                          neurons, and not the individual neurons independantly.
+                          neurons, and not the individual neurons independently.
 
                         - a callable, which should
                           take the target layer as input (single tensor or tuple
@@ -125,7 +125,7 @@ def attribute(
                           this function returns either a tensor with one element
                           or a 1D tensor with length equal to batch_size (one scalar
                           per input example)
-            baselines (scalar, tensor, tuple of scalars or tensors, optional):
+            baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                         Baselines define the starting point from which integral
                         is computed.
                         Baselines can be provided as:
@@ -155,7 +155,7 @@ def attribute(
                         use zero scalar corresponding to each input tensor.
 
                         Default: None
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -174,7 +174,7 @@ def attribute(
                         Default: None
             n_steps (int, optional): The number of steps used by the approximation
                         method. Default: 50.
-            method (string, optional): Method for approximating the integral,
+            method (str, optional): Method for approximating the integral,
                         one of `riemann_right`, `riemann_left`, `riemann_middle`,
                         `riemann_trapezoid` or `gausslegendre`.
                         Default: `gausslegendre` if no method is provided.
@@ -202,8 +202,8 @@ def attribute(
                         Default: False
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Integrated gradients for particular neuron with
                         respect to each input feature.
                         Attributions will always be the same size as the provided
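
Reviewer note: the `n_steps` and `method` parameters control how the path integral is approximated. The sketch below uses textbook Riemann rules on a one-dimensional input to show how the choices differ. `riemann_rule` and `integrated_gradients_1d` are hypothetical helpers; the exact sample points and weights in Captum may differ, and `gausslegendre` is left out.

```python
from typing import Callable, List, Tuple


def riemann_rule(method: str, n_steps: int) -> Tuple[List[float], List[float]]:
    """Return (alphas, step_sizes) on [0, 1] for a textbook Riemann rule."""
    h = 1.0 / n_steps
    if method == "riemann_left":
        return [i * h for i in range(n_steps)], [h] * n_steps
    if method == "riemann_right":
        return [(i + 1) * h for i in range(n_steps)], [h] * n_steps
    if method == "riemann_middle":
        return [(i + 0.5) * h for i in range(n_steps)], [h] * n_steps
    if method == "riemann_trapezoid":
        # n_steps intervals, n_steps + 1 points, end points weighted by half.
        alphas = [i * h for i in range(n_steps + 1)]
        weights = [h / 2] + [h] * (n_steps - 1) + [h / 2]
        return alphas, weights
    raise ValueError(f"unsupported method: {method}")


def integrated_gradients_1d(
    grad_fn: Callable[[float], float],
    x: float,
    baseline: float = 0.0,
    method: str = "riemann_middle",
    n_steps: int = 50,
) -> float:
    """Approximate (x - baseline) * integral of grad along the straight path."""
    alphas, steps = riemann_rule(method, n_steps)
    integral = sum(
        w * grad_fn(baseline + a * (x - baseline)) for a, w in zip(alphas, steps)
    )
    return (x - baseline) * integral


def square_grad(v: float) -> float:
    """Gradient of the hypothetical model f(v) = v ** 2."""
    return 2.0 * v


# The exact attribution is f(3) - f(1) = 8 (completeness).
results = {
    m: integrated_gradients_1d(square_grad, x=3.0, baseline=1.0, method=m)
    for m in ("riemann_left", "riemann_right", "riemann_middle", "riemann_trapezoid")
}
print(results)
```

For this quadratic model the gradient is linear along the path, so the middle and trapezoid rules recover the exact value while the left and right rules bracket it; increasing `n_steps` narrows that gap.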
diff --git a/captum/attr/_core/noise_tunnel.py b/captum/attr/_core/noise_tunnel.py
index 0fbc32115e..eda936a048 100644
--- a/captum/attr/_core/noise_tunnel.py
+++ b/captum/attr/_core/noise_tunnel.py
@@ -43,10 +43,12 @@ class NoiseTunnel(Attribution):
     returned.
 
     More details about adding noise can be found in the following papers:
-        https://arxiv.org/abs/1810.03292
-        https://arxiv.org/abs/1810.03307
-        https://arxiv.org/abs/1706.03825
-        https://arxiv.org/pdf/1806.10758
+
+        * https://arxiv.org/abs/1810.03292
+        * https://arxiv.org/abs/1810.03307
+        * https://arxiv.org/abs/1706.03825
+        * https://arxiv.org/abs/1806.10758
+
     This method currently also supports batches of multiple examples input,
     however it can be computationally expensive depending on the model,
     the dimensionality of the data and execution environment.
@@ -93,7 +95,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which integrated
+            inputs (Tensor or tuple of Tensor): Input for which integrated
                         gradients are computed. If forward_func takes a single
                         tensor as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -101,21 +103,21 @@ def attribute(
                         that for all given input tensors, dimension 0 corresponds
                         to the number of examples, and if multiple input tensors
                         are provided, the examples must be aligned appropriately.
-            nt_type (string, optional): Smoothing type of the attributions.
+            nt_type (str, optional): Smoothing type of the attributions.
                         `smoothgrad`, `smoothgrad_sq` or `vargrad`
                         Default: `smoothgrad` if `type` is not provided.
-            nt_samples (int, optional):  The number of randomly generated examples
+            nt_samples (int, optional): The number of randomly generated examples
                         per sample in the input batch. Random examples are
                         generated by adding gaussian random noise to each sample.
                         Default: `5` if `nt_samples` is not provided.
-            nt_samples_batch_size (int, optional):  The number of the `nt_samples`
+            nt_samples_batch_size (int, optional): The number of the `nt_samples`
                         that will be processed together. With the help
                         of this parameter we can avoid out of memory situation and
                         reduce the number of randomly generated examples per sample
                         in each batch.
                         Default: None if `nt_samples_batch_size` is not provided. In
                         this case all `nt_samples` will be processed together.
-            stdevs    (float, or a tuple of floats optional): The standard deviation
+            stdevs    (float or tuple of float, optional): The standard deviation
                         of gaussian noise with zero mean that is added to each
                         input in the batch. If `stdevs` is a single float value
                         then that same value is used for all inputs. If it is
@@ -137,7 +139,7 @@ def attribute(
 
         Returns:
             **attributions** or 2-element tuple of **attributions**, **delta**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         Attribution with
                         respect to each input feature. attributions will always be
                         the same size as the provided inputs, with each value
@@ -166,7 +168,7 @@ def attribute(
             >>> nt = NoiseTunnel(ig)
             >>> # Generates 10 perturbed input tensors per image.
             >>> # Computes integrated gradients for class 3 for each generated
-            >>> # input and averages attributions accros all 10
+            >>> # input and averages attributions across all 10
             >>> # perturbed inputs per image
             >>> attribution = nt.attribute(input, nt_type='smoothgrad',
             >>>                            nt_samples=10, target=3)
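
Reviewer note: the `nt_type`, `nt_samples`, and `stdevs` parameters documented above can be illustrated without PyTorch. The sketch below is an assumption-laden simplification: `noise_tunnel_sketch` and `square_sum_grad` are hypothetical names, one `stdev` float is shared by all features, and `vargrad` is computed as the mean of squares minus the squared mean. It does not batch samples the way `nt_samples_batch_size` does.

```python
import random
from typing import Callable, List, Sequence


def noise_tunnel_sketch(
    attr_fn: Callable[[List[float]], List[float]],
    inputs: Sequence[float],
    nt_type: str = "smoothgrad",
    nt_samples: int = 5,
    stdev: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Aggregate attributions over noisy copies of a single example."""
    rng = random.Random(seed)
    samples = []
    for _ in range(nt_samples):
        # Perturb every feature with zero-mean Gaussian noise.
        noisy = [x + rng.gauss(0.0, stdev) for x in inputs]
        samples.append(attr_fn(noisy))

    n = float(nt_samples)
    mean = [sum(col) / n for col in zip(*samples)]
    mean_sq = [sum(v * v for v in col) / n for col in zip(*samples)]

    if nt_type == "smoothgrad":
        return mean
    if nt_type == "smoothgrad_sq":
        return mean_sq
    if nt_type == "vargrad":
        return [ms - m * m for ms, m in zip(mean_sq, mean)]
    raise ValueError(f"unsupported nt_type: {nt_type}")


def square_sum_grad(x: List[float]) -> List[float]:
    """Gradient of the hypothetical model f(x) = sum(x_i ** 2)."""
    return [2.0 * v for v in x]


# With stdev=0.0 there is no noise, so the results are deterministic.
sg = noise_tunnel_sketch(square_sum_grad, [1.0, -2.0], "smoothgrad", stdev=0.0)
sg_sq = noise_tunnel_sketch(square_sum_grad, [1.0, -2.0], "smoothgrad_sq", stdev=0.0)
vg = noise_tunnel_sketch(square_sum_grad, [1.0, -2.0], "vargrad", stdev=0.0)
print(sg, sg_sq, vg)
```

Setting `stdev=0.0` is only a sanity check: `smoothgrad` then reduces to the plain attribution and `vargrad` to zero. A positive `stdev` is what produces the smoothing effect.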
diff --git a/captum/attr/_core/occlusion.py b/captum/attr/_core/occlusion.py
index de148693fa..fedc2dae05 100644
--- a/captum/attr/_core/occlusion.py
+++ b/captum/attr/_core/occlusion.py
@@ -39,8 +39,8 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
-                        any modification of it
+            forward_func (Callable): The forward function of the model or
+                        any modification of it.
         """
         FeatureAblation.__init__(self, forward_func)
         self.use_weights = True
@@ -62,7 +62,7 @@ def attribute(  # type: ignore
         r"""
         Args:
 
-                inputs (tensor or tuple of tensors):  Input for which occlusion
+                inputs (Tensor or tuple of Tensor): Input for which occlusion
                             attributions are computed. If forward_func takes a single
                             tensor as input, a single input tensor should be provided.
                             If forward_func takes multiple tensors as input, a tuple
@@ -71,7 +71,7 @@ def attribute(  # type: ignore
                             to the number of examples (aka batch size), and if
                             multiple input tensors are provided, the examples must
                             be aligned appropriately.
-                sliding_window_shapes (tuple or tuple of tuples): Shape of patch
+                sliding_window_shapes (tuple or tuple of tuple): Shape of patch
                             (hyperrectangle) to occlude each input. For a single
                             input tensor, this must be a tuple of length equal to the
                             number of dimensions of the input tensor - 1, defining
@@ -80,7 +80,7 @@ def attribute(  # type: ignore
                             this must be a tuple containing one tuple for each input
                             tensor defining the dimensions of the patch for that
                             input tensor, as described for the single tensor case.
-                strides (int or tuple or tuple of ints or tuple of tuples, optional):
+                strides (int or tuple or tuple of int or tuple of tuple, optional):
                             This defines the step by which the occlusion hyperrectangle
                             should be shifted by in each direction for each iteration.
                             For a single tensor input, this can be either a single
@@ -100,7 +100,7 @@ def attribute(  # type: ignore
                             If None is provided, a stride of 1 is used for each
                             dimension of each input tensor.
                             Default: None
-                baselines (scalar, tensor, tuple of scalars or tensors, optional):
+                baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                             Baselines define reference value which replaces each
                             feature when occluded.
                             Baselines can be provided as:
@@ -124,10 +124,11 @@ def attribute(  # type: ignore
                               - or a scalar, corresponding to a tensor in the
                                 inputs' tuple. This scalar value is broadcasted
                                 for corresponding input tensor.
+
                             In the cases when `baselines` is not provided, we internally
                             use zero scalar corresponding to each input tensor.
                             Default: None
-                target (int, tuple, tensor or list, optional):  Output indices for
+                target (int, tuple, Tensor, or list, optional): Output indices for
                             which difference is computed (for classification cases,
                             this is usually the target class).
                             If the network returns a scalar value per example,
@@ -152,7 +153,7 @@ def attribute(  # type: ignore
                               target for the corresponding example.
 
                             Default: None
-                additional_forward_args (any, optional): If the forward function
+                additional_forward_args (Any, optional): If the forward function
                             requires additional arguments other than the inputs for
                             which attributions should not be computed, this argument
                             can be provided. It must be either a single additional
@@ -186,8 +187,8 @@ def attribute(  # type: ignore
                             Default: False
 
         Returns:
-                *tensor* or tuple of *tensors* of **attributions**:
-                - **attributions** (*tensor* or tuple of *tensors*):
+                *Tensor* or tuple of *Tensor* of **attributions**:
+                - **attributions** (*Tensor* or tuple of *Tensor*):
                             The attributions with respect to each input feature.
                             Attributions will always be
                             the same size as the provided inputs, with each value
diff --git a/captum/attr/_core/saliency.py b/captum/attr/_core/saliency.py
index 3790bd2068..505c35b28e 100644
--- a/captum/attr/_core/saliency.py
+++ b/captum/attr/_core/saliency.py
@@ -20,15 +20,15 @@ class Saliency(GradientAttribution):
     the default, the absolute value of the gradients is returned.
 
     More details about the approach can be found in the following paper:
-        https://arxiv.org/pdf/1312.6034.pdf
+        https://arxiv.org/abs/1312.6034
     """
 
     def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
-                        any modification of it
+            forward_func (Callable): The forward function of the model or
+                        any modification of it.
         """
         GradientAttribution.__init__(self, forward_func)
 
@@ -43,7 +43,7 @@ def attribute(
         r"""
         Args:
 
-            inputs (tensor or tuple of tensors):  Input for which saliency
+            inputs (Tensor or tuple of Tensor): Input for which saliency
                         is computed. If forward_func takes a single tensor
                         as input, a single input tensor should be provided.
                         If forward_func takes multiple tensors as input, a tuple
@@ -52,7 +52,7 @@ def attribute(
                         to the number of examples (aka batch size), and if
                         multiple input tensors are provided, the examples must
                         be aligned appropriately.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                         which gradients are computed (for classification cases,
                         this is usually the target class).
                         If the network returns a scalar value per example,
@@ -81,7 +81,7 @@ def attribute(
                         to True, otherwise returns the (signed) gradients if
                         False.
                         Default: True
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. It must be either a single additional
@@ -95,8 +95,8 @@ def attribute(
                         Default: None
 
         Returns:
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                         The gradients with respect to each input feature.
                         Attributions will always be
                         the same size as the provided inputs, with each value
diff --git a/captum/attr/_core/shapley_value.py b/captum/attr/_core/shapley_value.py
index 72af4e7237..4d5f244816 100644
--- a/captum/attr/_core/shapley_value.py
+++ b/captum/attr/_core/shapley_value.py
@@ -66,7 +66,7 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
+            forward_func (Callable): The forward function of the model or
                         any modification of it. The forward function can either
                         return a scalar per example, or a single scalar for the
                         full batch. If a single scalar is returned for the batch,
@@ -96,7 +96,7 @@ def attribute(
 
         Args:
 
-                inputs (tensor or tuple of tensors):  Input for which Shapley value
+                inputs (Tensor or tuple of Tensor): Input for which Shapley value
                             sampling attributions are computed. If forward_func takes
                             a single tensor as input, a single input tensor should
                             be provided.
@@ -106,7 +106,7 @@ def attribute(
                             to the number of examples (aka batch size), and if
                             multiple input tensors are provided, the examples must
                             be aligned appropriately.
-                baselines (scalar, tensor, tuple of scalars or tensors, optional):
+                baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                             Baselines define reference value which replaces each
                             feature when ablated.
                             Baselines can be provided as:
@@ -131,10 +131,11 @@ def attribute(
                               - or a scalar, corresponding to a tensor in the
                                 inputs' tuple. This scalar value is broadcasted
                                 for corresponding input tensor.
+
                             In the cases when `baselines` is not provided, we internally
                             use zero scalar corresponding to each input tensor.
                             Default: None
-                target (int, tuple, tensor or list, optional):  Output indices for
+                target (int, tuple, Tensor, or list, optional): Output indices for
                             which difference is computed (for classification cases,
                             this is usually the target class).
                             If the network returns a scalar value per example,
@@ -159,7 +160,7 @@ def attribute(
                               target for the corresponding example.
 
                             Default: None
-                additional_forward_args (any, optional): If the forward function
+                additional_forward_args (Any, optional): If the forward function
                             requires additional arguments other than the inputs for
                             which attributions should not be computed, this argument
                             can be provided. It must be either a single additional
@@ -174,7 +175,7 @@ def attribute(
                             Note that attributions are not computed with respect
                             to these arguments.
                             Default: None
-                feature_mask (tensor or tuple of tensors, optional):
+                feature_mask (Tensor or tuple of Tensor, optional):
                             feature_mask defines a mask for the input, grouping
                             features which should be added together. feature_mask
                             should contain the same number of tensors as inputs.
@@ -196,7 +197,7 @@ def attribute(
                             If None, then a feature mask is constructed which assigns
                             each scalar within a tensor as a separate feature
                             Default: None
-                n_samples (int, optional):  The number of feature permutations
+                n_samples (int, optional): The number of feature permutations
                             tested.
                             Default: `25` if `n_samples` is not provided.
                 perturbations_per_eval (int, optional): Allows multiple ablations
@@ -218,8 +219,8 @@ def attribute(
                             Default: False
 
         Returns:
-                *tensor* or tuple of *tensors* of **attributions**:
-                - **attributions** (*tensor* or tuple of *tensors*):
+                *Tensor* or tuple of *Tensor* of **attributions**:
+                - **attributions** (*Tensor* or tuple of *Tensor*):
                             The attributions with respect to each input feature.
                             If the forward function returns
                             a scalar value per example, attributions will be
@@ -519,7 +520,7 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable): The forward function of the model or
+            forward_func (Callable): The forward function of the model or
                         any modification of it. The forward function can either
                         return a scalar per example, or a single scalar for the
                         full batch. If a single scalar is returned for the batch,
@@ -548,7 +549,7 @@ def attribute(
 
         Args:
 
-                inputs (tensor or tuple of tensors):  Input for which Shapley value
+                inputs (Tensor or tuple of Tensor): Input for which Shapley value
                             sampling attributions are computed. If forward_func takes
                             a single tensor as input, a single input tensor should
                             be provided.
@@ -558,7 +559,7 @@ def attribute(
                             to the number of examples (aka batch size), and if
                             multiple input tensors are provided, the examples must
                             be aligned appropriately.
-                baselines (scalar, tensor, tuple of scalars or tensors, optional):
+                baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                             Baselines define reference value which replaces each
                             feature when ablated.
                             Baselines can be provided as:
@@ -583,10 +584,11 @@ def attribute(
                               - or a scalar, corresponding to a tensor in the
                                 inputs' tuple. This scalar value is broadcasted
                                 for corresponding input tensor.
+
                             In the cases when `baselines` is not provided, we internally
                             use zero scalar corresponding to each input tensor.
                             Default: None
-                target (int, tuple, tensor or list, optional):  Output indices for
+                target (int, tuple, Tensor, or list, optional): Output indices for
                             which difference is computed (for classification cases,
                             this is usually the target class).
                             If the network returns a scalar value per example,
@@ -611,7 +613,7 @@ def attribute(
                               target for the corresponding example.
 
                             Default: None
-                additional_forward_args (any, optional): If the forward function
+                additional_forward_args (Any, optional): If the forward function
                             requires additional arguments other than the inputs for
                             which attributions should not be computed, this argument
                             can be provided. It must be either a single additional
@@ -626,7 +628,7 @@ def attribute(
                             Note that attributions are not computed with respect
                             to these arguments.
                             Default: None
-                feature_mask (tensor or tuple of tensors, optional):
+                feature_mask (Tensor or tuple of Tensor, optional):
                             feature_mask defines a mask for the input, grouping
                             features which should be added together. feature_mask
                             should contain the same number of tensors as inputs.
@@ -666,8 +668,8 @@ def attribute(
                             a simple output of progress.
                             Default: False
         Returns:
-                *tensor* or tuple of *tensors* of **attributions**:
-                - **attributions** (*tensor* or tuple of *tensors*):
+                *Tensor* or tuple of *Tensor* of **attributions**:
+                - **attributions** (*Tensor* or tuple of *Tensor*):
                             The attributions with respect to each input feature.
                             If the forward function returns
                             a scalar value per example, attributions will be
diff --git a/captum/attr/_models/base.py b/captum/attr/_models/base.py
index d57646c0da..0b9e406d73 100644
--- a/captum/attr/_models/base.py
+++ b/captum/attr/_models/base.py
@@ -76,7 +76,7 @@ def indices_to_embeddings(self, *input, **kwargs):
 
         Args:
 
-            *input (Any, Optional): This can be a tensor(s) of input indices or any
+            *input (Any, optional): This can be a tensor(s) of input indices or any
                     other variable necessary to comput the embeddings. A typical
                     example of input indices are word or token indices.
             **kwargs (Any, optional): Similar to `input` this can be any sequence
@@ -99,10 +99,10 @@ class TokenReferenceBase:
     `TokenReferenceBase` class.
     """
 
-    def __init__(self, reference_token_idx=0) -> None:
+    def __init__(self, reference_token_idx: int = 0) -> None:
         self.reference_token_idx = reference_token_idx
 
-    def generate_reference(self, sequence_length, device):
+    def generate_reference(self, sequence_length, device: torch.device) -> torch.Tensor:
         r"""
         Generated reference tensor of given `sequence_length` using
         `reference_token_idx`.
@@ -137,22 +137,25 @@ def _set_deep_layer_value(obj, layer_names, value):
     setattr(reduce(getattr, layer_names[:-1], obj), layer_names[-1], value)
 
 
-def configure_interpretable_embedding_layer(model, embedding_layer_name="embedding"):
+def configure_interpretable_embedding_layer(
+    model: Module, embedding_layer_name: str = "embedding"
+) -> InterpretableEmbeddingBase:
     r"""
-    This method wraps model's embedding layer with an interpretable embedding
+    This method wraps a model's embedding layer with an interpretable embedding
     layer that allows us to access the embeddings through their indices.
 
     Args:
 
-        model (torch.nn.Model): An instance of PyTorch model that contains embeddings.
+        model (torch.nn.Module): An instance of PyTorch model that contains embeddings.
         embedding_layer_name (str, optional): The name of the embedding layer
                     in the `model` that we would like to make interpretable.
 
     Returns:
 
-        interpretable_emb (tensor): An instance of `InterpretableEmbeddingBase`
-                    embedding layer that wraps model's embedding layer that is being
-                    accessed through `embedding_layer_name`.
+        interpretable_emb (InterpretableEmbeddingBase): An instance of
+                    `InterpretableEmbeddingBase` embedding layer that wraps model's
+                    embedding layer that is being accessed through
+                    `embedding_layer_name`.
 
     Examples::
 
@@ -202,7 +205,9 @@ def configure_interpretable_embedding_layer(model, embedding_layer_name="embeddi
     return interpretable_emb
 
 
-def remove_interpretable_embedding_layer(model, interpretable_emb):
+def remove_interpretable_embedding_layer(
+    model: Module, interpretable_emb: InterpretableEmbeddingBase
+) -> None:
     r"""
     Removes interpretable embedding layer and sets back original
     embedding layer in the model.
@@ -210,8 +215,8 @@ def remove_interpretable_embedding_layer(model, interpretable_emb):
     Args:
 
         model (torch.nn.Module): An instance of PyTorch model that contains embeddings
-        interpretable_emb (tensor): An instance of `InterpretableEmbeddingBase`
-                    that was originally created in
+        interpretable_emb (InterpretableEmbeddingBase): An instance of
+                    `InterpretableEmbeddingBase` that was originally created in
                     `configure_interpretable_embedding_layer` function and has
                     to be removed after interpretation is finished.
 
diff --git a/captum/attr/_utils/approximation_methods.py b/captum/attr/_utils/approximation_methods.py
index 9d63e90c1a..755e701d6a 100644
--- a/captum/attr/_utils/approximation_methods.py
+++ b/captum/attr/_utils/approximation_methods.py
@@ -28,7 +28,7 @@ def approximation_parameters(
     r"""Retrieves parameters for the input approximation `method`
 
     Args:
-        method: The name of the approximation method. Currently only `riemann`
+        method (str): The name of the approximation method. Currently only `riemann`
                 and gauss legendre are
     """
     if method in SUPPORTED_RIEMANN_METHODS:
@@ -45,17 +45,16 @@ def riemann_builders(
 
     Args:
 
-         n: The number of integration steps
-         method: `left`, `right`, `middle` and `trapezoid` riemann
+         method (Riemann): `left`, `right`, `middle` and `trapezoid` riemann
 
     Returns:
         2-element tuple of **step_sizes**, **alphas**:
-        - **step_sizes** (*callable*):
+        - **step_sizes** (*Callable*):
                     `step_sizes` takes the number of steps as an
                     input argument and returns an array of steps sizes which
                     sum is smaller than or equal to one.
 
-        - **alphas** (*callable*):
+        - **alphas** (*Callable*):
                     `alphas` takes the number of steps as an input argument
                     and returns the multipliers/coefficients for the inputs
                     of integrand in the range of [0, 1]
@@ -104,18 +103,14 @@ def gauss_legendre_builders() -> Tuple[
     proposed by [Xue Feng and her intern Hauroun Habeeb]
     (https://research.fb.com/people/feng-xue/).
 
-    Args:
-
-        n (int): The number of integration steps
-
     Returns:
         2-element tuple of **step_sizes**, **alphas**:
-        - **step_sizes** (*callable*):
+        - **step_sizes** (*Callable*):
                     `step_sizes` takes the number of steps as an
                     input argument and returns an array of steps sizes which
                     sum is smaller than or equal to one.
 
-        - **alphas** (*callable*):
+        - **alphas** (*Callable*):
                     `alphas` takes the number of steps as an input argument
                     and returns the multipliers/coefficients for the inputs
                     of integrand in the range of [0, 1]
diff --git a/captum/attr/_utils/attribution.py b/captum/attr/_utils/attribution.py
index f4b6e9d35c..fed579eb92 100644
--- a/captum/attr/_utils/attribution.py
+++ b/captum/attr/_utils/attribution.py
@@ -31,7 +31,7 @@ class Attribution:
     def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
-            forward_func (callable or torch.nn.Module): This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                         of pytorch model or any modification of model's forward
                         function.
         """
@@ -47,17 +47,17 @@ def __init__(self, forward_func: Callable) -> None:
 
     Args:
 
-        inputs (tensor or tuple of tensors):  Input for which attribution
+        inputs (Tensor or tuple of Tensor): Input for which attribution
                     is computed. It can be provided as a single tensor or
                     a tuple of multiple tensors. If multiple input tensors
-                    are provided, the batch sizes must be aligned accross all
+                    are provided, the batch sizes must be aligned across all
                     tensors.
 
 
     Returns:
 
-        *tensor* or tuple of *tensors* of **attributions**:
-        - **attributions** (*tensor* or tuple of *tensors*):
+        *Tensor* or tuple of *Tensor* of **attributions**:
+        - **attributions** (*Tensor* or tuple of *Tensor*):
                     Attribution values for each
                     input tensor. The `attributions` have the same shape and
                     dimensionality as the inputs.
@@ -97,21 +97,21 @@ def has_convergence_delta(self) -> bool:
 
     Args:
 
-            attributions (tensor or tuple of tensors): Attribution scores that
+            attributions (Tensor or tuple of Tensor): Attribution scores that
                         are precomputed by an attribution algorithm.
                         Attributions can be provided in form of a single tensor
                         or a tuple of those. It is assumed that attribution
                         tensor's dimension 0 corresponds to the number of
                         examples, and if multiple input tensors are provided,
                         the examples must be aligned appropriately.
-            *args (optional): Additonal arguments that are used by the
+            *args (Any, optional): Additional arguments that are used by the
                         sub-classes depending on the specific implementation
                         of `compute_convergence_delta`.
 
     Returns:
 
-            *tensor* of **deltas**:
-            - **deltas** (*tensor*):
+            *Tensor* of **deltas**:
+            - **deltas** (*Tensor*):
                 Depending on specific implementaion of
                 sub-classes, convergence delta can be returned per
                 sample in form of a tensor or it can be aggregated
@@ -150,7 +150,7 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable or torch.nn.Module): This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                         of pytorch model or any modification of model's forward
                         function.
         """
@@ -184,26 +184,26 @@ def compute_convergence_delta(
 
         Args:
 
-                attributions (tensor or tuple of tensors): Precomputed attribution
+                attributions (Tensor or tuple of Tensor): Precomputed attribution
                             scores. The user can compute those using any attribution
-                            algorithm. It is assumed the the shape and the
+                            algorithm. It is assumed the shape and the
                             dimensionality of attributions must match the shape and
                             the dimensionality of `start_point` and `end_point`.
                             It also assumes that the attribution tensor's
                             dimension 0 corresponds to the number of
                             examples, and if multiple input tensors are provided,
                             the examples must be aligned appropriately.
-                start_point (tensor or tuple of tensors, optional): `start_point`
+                start_point (Tensor or tuple of Tensor, optional): `start_point`
                             is passed as an input to model's forward function. It
                             is the starting point of attributions' approximation.
                             It is assumed that both `start_point` and `end_point`
                             have the same shape and dimensionality.
-                end_point (tensor or tuple of tensors):  `end_point`
+                end_point (Tensor or tuple of Tensor): `end_point`
                             is passed as an input to model's forward function. It
                             is the end point of attributions' approximation.
                             It is assumed that both `start_point` and `end_point`
                             have the same shape and dimensionality.
-                target (int, tuple, tensor or list, optional):  Output indices for
+                target (int, tuple, Tensor, or list, optional): Output indices for
                             which gradients are computed (for classification cases,
                             this is usually the target class).
                             If the network returns a scalar value per example,
@@ -228,7 +228,7 @@ def compute_convergence_delta(
                               target for the corresponding example.
 
                             Default: None
-                additional_forward_args (any, optional): If the forward function
+                additional_forward_args (Any, optional): If the forward function
                             requires additional arguments other than the inputs for
                             which attributions should not be computed, this argument
                             can be provided. It must be either a single additional
@@ -245,8 +245,8 @@ def compute_convergence_delta(
 
         Returns:
 
-                *tensor* of **deltas**:
-                - **deltas** (*tensor*):
+                *Tensor* of **deltas**:
+                - **deltas** (*Tensor*):
                     This implementation returns convergence delta per
                     sample. Deriving sub-classes may do any type of aggregation
                     of those values, if necessary.
@@ -306,7 +306,7 @@ def __init__(self, forward_func: Callable) -> None:
         r"""
         Args:
 
-            forward_func (callable or torch.nn.Module): This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                         of pytorch model or any modification of model's forward
                         function.
         """
@@ -318,12 +318,13 @@ def multiplies_by_inputs(self):
 
 
 class InternalAttribution(Attribution, Generic[ModuleOrModuleList]):
-    layer: ModuleOrModuleList
     r"""
     Shared base class for LayerAttrubution and NeuronAttribution,
     attribution types that require a model and a particular layer.
     """
 
+    layer: ModuleOrModuleList
+
     def __init__(
         self,
         forward_func: Callable,
@@ -333,12 +334,12 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable or torch.nn.Module):  This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                         of pytorch model or any modification of model's forward
                         function.
             layer (torch.nn.Module): Layer for which output attributions are computed.
                         Output size of attribute matches that of layer output.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model, which allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -351,7 +352,7 @@ def __init__(
 
 class LayerAttribution(InternalAttribution):
     r"""
-    Layer attribution provides attribution values for the given layer, quanitfying
+    Layer attribution provides attribution values for the given layer, quantifying
     the importance of each neuron within the given layer's output. The output
     attribution of calling attribute on a LayerAttribution object always matches
     the size of the layer output.
@@ -366,12 +367,12 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable or torch.nn.Module):  This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                         of pytorch model or any modification of model's forward
                         function.
             layer (torch.nn.Module): Layer for which output attributions are computed.
                         Output size of attribute matches that of layer output.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model, which allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -392,13 +393,13 @@ def interpolate(
 
         Args:
 
-            layer_attribution (torch.Tensor):  Tensor of given layer attributions.
+            layer_attribution (Tensor): Tensor of given layer attributions.
             interpolate_dims (int or tuple): Upsampled dimensions. The
                         number of elements must be the number of dimensions
                         of layer_attribution - 2, since the first dimension
                         corresponds to number of examples and the second is
                         assumed to correspond to the number of channels.
-            interpolate_mode (str):  Method for interpolation, which
+            interpolate_mode (str): Method for interpolation, which
                         must be a valid input interpolation mode for
                         torch.nn.functional. These methods are
                         "nearest", "area", "linear" (3D-only), "bilinear"
@@ -407,8 +408,8 @@ def interpolate(
                         attribution.
 
         Returns:
-            *tensor* of upsampled **attributions**:
-            - **attributions** (*tensor*):
+            *Tensor* of upsampled **attributions**:
+            - **attributions** (*Tensor*):
                 Upsampled layer attributions with first 2 dimensions matching
                 slayer_attribution and remaining dimensions given by
                 interpolate_dims.
@@ -418,7 +419,7 @@ def interpolate(
 
 class NeuronAttribution(InternalAttribution):
     r"""
-    Neuron attribution provides input attribution for a given neuron, quanitfying
+    Neuron attribution provides input attribution for a given neuron, quantifying
     the importance of each input feature in the activation of a particular neuron.
     Calling attribute on a NeuronAttribution object requires also providing
     the index of the neuron in the output of the given layer for which attributions
@@ -436,12 +437,12 @@ def __init__(
         r"""
         Args:
 
-            forward_func (callable or torch.nn.Module):  This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                         of pytorch model or any modification of model's forward
                         function.
             layer (torch.nn.Module): Layer for which output attributions are computed.
                         Output size of attribute matches that of layer output.
-            device_ids (list(int)): Device ID list, necessary only if forward_func
+            device_ids (list[int]): Device ID list, necessary only if forward_func
                         applies a DataParallel model, which allows reconstruction of
                         intermediate outputs from batched results across devices.
                         If forward_func is given as the DataParallel model itself,
@@ -469,8 +470,8 @@ def __init__(
 
     Returns:
 
-            *tensor* or tuple of *tensors* of **attributions**:
-            - **attributions** (*tensor* or tuple of *tensors*):
+            *Tensor* or tuple of *Tensor* of **attributions**:
+            - **attributions** (*Tensor* or tuple of *Tensor*):
                     Attribution values for
                     each input vector. The `attributions` have the
                     dimensionality of inputs.
diff --git a/captum/attr/_utils/class_summarizer.py b/captum/attr/_utils/class_summarizer.py
index 2485711866..664088c299 100644
--- a/captum/attr/_utils/class_summarizer.py
+++ b/captum/attr/_utils/class_summarizer.py
@@ -36,11 +36,11 @@ def update(  # type: ignore
         This accepts either a single tensor to summarise or a tuple of tensors.
 
         Args:
-            x (Tensor or Tuple[Tensor, ...]):
+            x (Tensor or tuple of Tensor):
                 The input tensor to be summarised. The first
                 dimension of this input must be associated to
                 the batch size of the inputs.
-            labels (int, tuple, tensor or list, optional):
+            labels (int, tuple, Tensor, or list, optional):
                 The associated labels for `x`. If Any, we
                 assume `labels` represents the label for all inputs in `x`.
 
diff --git a/captum/attr/_utils/summarizer.py b/captum/attr/_utils/summarizer.py
index 874e5d263b..e4c5c860a0 100644
--- a/captum/attr/_utils/summarizer.py
+++ b/captum/attr/_utils/summarizer.py
@@ -173,10 +173,10 @@ class SummarizerSingleTensor:
     def __init__(self, stats: List[Stat], summary_stats_indices: List[int]) -> None:
         r"""
         Args:
-            stats (list of Stat): A list of all the Stat objects that
+            stats (list[Stat]): A list of all the Stat objects that
                 need to be updated. This must be in the appropriate order for
                 updates (see `_reorder_stats`)
-            summary_stats (list of int): A list of indicies, referencing `stats`,
+            summary_stats_indices (list[int]): A list of indices, referencing `stats`,
                 which are the stats you want to show in the .summary property. This
                 does not require any specific order.
         """
diff --git a/captum/attr/_utils/visualization.py b/captum/attr/_utils/visualization.py
index 0cfada9b7b..c4da31b6d6 100644
--- a/captum/attr/_utils/visualization.py
+++ b/captum/attr/_utils/visualization.py
@@ -118,18 +118,18 @@ def visualize_image_attr(
 
     Args:
 
-        attr (numpy.array): Numpy array corresponding to attributions to be
+        attr (numpy.ndarray): Numpy array corresponding to attributions to be
                     visualized. Shape must be in the form (H, W, C), with
                     channels as last dimension. Shape must also match that of
                     the original image if provided.
-        original_image (numpy.array, optional):  Numpy array corresponding to
+        original_image (numpy.ndarray, optional): Numpy array corresponding to
                     original image. Shape must be in the form (H, W, C), with
                     channels as the last dimension. Image can be provided either
                     with float values in range 0-1 or int values between 0-255.
                     This is a necessary argument for any visualization method
                     which utilizes the original image.
                     Default: None
-        method (string, optional): Chosen method for visualizing attribution.
+        method (str, optional): Chosen method for visualizing attribution.
                     Supported options are:
 
                     1. `heat_map` - Display heat map of chosen attributions
@@ -145,8 +145,9 @@ def visualize_image_attr(
 
                     5. `alpha_scaling` - Sets alpha channel of each pixel
                        to be equal to normalized attribution value.
+
                     Default: `heat_map`
-        sign (string, optional): Chosen sign of attributions to visualize. Supported
+        sign (str, optional): Chosen sign of attributions to visualize. Supported
                     options are:
 
                     1. `positive` - Displays only positive pixel attributions.
@@ -160,6 +161,7 @@ def visualize_image_attr(
                        values. This is not supported for `masked_image` or
                        `alpha_scaling` modes, since signed information cannot
                        be represented in these modes.
+
                     Default: `absolute_value`
         plt_fig_axis (tuple, optional): Tuple of matplotlib.pyplot.figure and axis
                     on which to visualize. If None is provided, then a new figure
@@ -172,7 +174,7 @@ def visualize_image_attr(
                     and scale value are computed using absolute value of
                     attributions.
                     Default: 2
-        cmap (string, optional): String corresponding to desired colormap for
+        cmap (str, optional): String corresponding to desired colormap for
                     heatmap visualization. This defaults to "Reds" for negative
                     sign, "Blues" for absolute value, "Greens" for positive sign,
                     and a spectrum from red to green for all. Note that this
@@ -182,18 +184,18 @@ def visualize_image_attr(
                     `blended_heat_map` visualization mode, which overlays the
                     heat map over the greyscaled original image.
                     Default: 0.5
-        show_colorbar (boolean, optional): Displays colorbar for heatmap below
+        show_colorbar (bool, optional): Displays colorbar for heatmap below
                     the visualization. If given method does not use a heatmap,
                     then a colormap axis is created and hidden. This is
                     necessary for appropriate alignment when visualizing
                     multiple plots, some with colorbars and some without.
                     Default: False
-        title (string, optional): Title string for plot. If None, no title is
+        title (str, optional): Title string for plot. If None, no title is
                     set.
                     Default: None
         fig_size (tuple, optional): Size of figure created.
                     Default: (6,6)
-        use_pyplot (boolean, optional): If true, uses pyplot to create and show
+        use_pyplot (bool, optional): If true, uses pyplot to create and show
                     figure and displays the figure after creating. If False,
                     uses Matplotlib object oriented API and simply returns a
                     figure object without showing.
@@ -347,29 +349,29 @@ def visualize_image_attr_multiple(
 
     Args:
 
-        attr (numpy.array): Numpy array corresponding to attributions to be
+        attr (numpy.ndarray): Numpy array corresponding to attributions to be
                     visualized. Shape must be in the form (H, W, C), with
                     channels as last dimension. Shape must also match that of
                     the original image if provided.
-        original_image (numpy.array, optional):  Numpy array corresponding to
+        original_image (numpy.ndarray, optional): Numpy array corresponding to
                     original image. Shape must be in the form (H, W, C), with
                     channels as the last dimension. Image can be provided either
                     with values in range 0-1 or 0-255. This is a necessary
                     argument for any visualization method which utilizes
                     the original image.
-        methods (list of strings): List of strings of length k, defining method
+        methods (list[str]): List of strings of length k, defining method
                         for each visualization. Each method must be a valid
                         string argument for method to visualize_image_attr.
-        signs (list of strings): List of strings of length k, defining signs for
+        signs (list[str]): List of strings of length k, defining signs for
                         each visualization. Each sign must be a valid
                         string argument for sign to visualize_image_attr.
-        titles (list of strings, optional):  List of strings of length k, providing
+        titles (list[str], optional): List of strings of length k, providing
                     a title string for each plot. If None is provided, no titles
                     are added to subplots.
                     Default: None
         fig_size (tuple, optional): Size of figure created.
                     Default: (8, 6)
-        use_pyplot (boolean, optional): If true, uses pyplot to create and show
+        use_pyplot (bool, optional): If true, uses pyplot to create and show
                     figure and displays the figure after creating. If False,
                     uses Matplotlib object oriented API and simply returns a
                     figure object without showing.
@@ -460,19 +462,19 @@ def visualize_timeseries_attr(
 
     Args:
 
-        attr (numpy.array): Numpy array corresponding to attributions to be
+        attr (numpy.ndarray): Numpy array corresponding to attributions to be
                     visualized. Shape must be in the form (N, C) with channels
                     as last dimension, unless `channels_last` is set to True.
                     Shape must also match that of the timeseries data.
-        data (numpy.array): Numpy array corresponding to the original,
+        data (numpy.ndarray): Numpy array corresponding to the original,
                     equidistant timeseries data. Shape must be in the form
                     (N, C) with channels as last dimension, unless
                     `channels_last` is set to true.
-        x_values (numpy.array, optional): Numpy array corresponding to the
+        x_values (numpy.ndarray, optional): Numpy array corresponding to the
                     points on the x-axis. Shape must be in the form (N, ). If
                     not provided, integers from 0 to N-1 are used.
                     Default: None
-        method (string, optional): Chosen method for visualizing attributions
+        method (str, optional): Chosen method for visualizing attributions
                     overlaid onto data. Supported options are:
 
                     1. `overlay_individual` - Plot each channel individually in
@@ -487,8 +489,9 @@ def visualize_timeseries_attr(
                         and color the graphs according to the attribution
                         values. Works best with color maps that does not contain
                         white or very bright colors.
+
                     Default: `overlay_individual`
-        sign (string, optional): Chosen sign of attributions to visualize.
+        sign (str, optional): Chosen sign of attributions to visualize.
                     Supported options are:
 
                     1. `positive` - Displays only positive pixel attributions.
@@ -500,8 +503,9 @@ def visualize_timeseries_attr(
 
                     4. `all` - Displays both positive and negative attribution
                         values.
+
                     Default: `absolute_value`
-        channel_labels (list of strings, optional): List of labels
+        channel_labels (list[str], optional): List of labels
                     corresponding to each channel in data.
                     Default: None
         channels_last (bool, optional): If True, data is expected to have
@@ -519,7 +523,7 @@ def visualize_timeseries_attr(
                     and scale value are computed using absolute value of
                     attributions.
                     Default: 2
-        cmap (string, optional): String corresponding to desired colormap for
+        cmap (str, optional): String corresponding to desired colormap for
                     heatmap visualization. This defaults to "Reds" for negative
                     sign, "Blues" for absolute value, "Greens" for positive sign,
                     and a spectrum from red to green for all. Note that this
@@ -529,14 +533,14 @@ def visualize_timeseries_attr(
                     `blended_heat_map` visualization mode, which overlays the
                     heat map over the greyscaled original image.
                     Default: 0.7
-        show_colorbar (boolean): Displays colorbar for heat map below
+        show_colorbar (bool): Displays colorbar for heat map below
                     the visualization.
-        title (string, optional): Title string for plot. If None, no title is
+        title (str, optional): Title string for plot. If None, no title is
                     set.
                     Default: None
         fig_size (tuple, optional): Size of figure created.
                     Default: (6,6)
-        use_pyplot (boolean): If true, uses pyplot to create and show
+        use_pyplot (bool): If true, uses pyplot to create and show
                     figure and displays the figure after creating. If False,
                     uses Matplotlib object oriented API and simply returns a
                     figure object without showing.
diff --git a/captum/concept/_core/cav.py b/captum/concept/_core/cav.py
index 39aa9fba85..6aedb24fff 100644
--- a/captum/concept/_core/cav.py
+++ b/captum/concept/_core/cav.py
@@ -14,7 +14,7 @@ class CAV:
     boundary of a classifier which distinguishes between activation
     vectors produced by different concepts.
     More details can be found in the paper:
-        https://arxiv.org/pdf/1711.11279.pdf
+        https://arxiv.org/abs/1711.11279
     """
 
     def __init__(
@@ -65,7 +65,7 @@ def assemble_save_path(
                     layer name.
             model_id (str): A unique model identifier associated with input
                     `layer` and `concepts`
-            concepts (list(Concept)): A list of concepts that are concatenated
+            concepts (list[Concept]): A list of concepts that are concatenated
                     together and used as a concept key using their ids. These
                     concept ids are retrieved from TCAV s`Concept` objects.
             layer (str): The name of the layer for which the activations are
@@ -146,7 +146,7 @@ def load(cavs_path: str, model_id: str, concepts: List[Concept], layer: str):
             model_id (str): A unique model identifier associated with the
                     CAVs. There exist a folder named `model_id` under
                     `cavs_path` path. The CAVs are loaded from this folder.
-            concepts (list[Concept]):  A List of concepts for which
+            concepts (list[Concept]): A list of concepts for which
                     we would like to load the cavs.
             layer (str): The layer name. Ex.: "inception4c". In case of nested
                     layers we use dots to specify the depth / hierarchy.
diff --git a/captum/concept/_core/concept.py b/captum/concept/_core/concept.py
index a550ab8a9d..b0adbd7f39 100644
--- a/captum/concept/_core/concept.py
+++ b/captum/concept/_core/concept.py
@@ -25,7 +25,7 @@ def __init__(
 
         r"""
         Args:
-            id (int):   The unique identifier of the concept.
+            id (int): The unique identifier of the concept.
             name (str): A unique name of the concept.
             data_iter (DataLoader): A pytorch DataLoader object that combines a dataset
                         and a sampler, and provides an iterable over a given
@@ -35,6 +35,7 @@ def __init__(
                         https://pytorch.org/docs/stable/data.html
 
         Example::
+
             >>> # Creates a Concept object named "striped", with a data_iter
             >>> # object to iterate over all files in "./concepts/striped"
             >>> concept_name = "striped"
@@ -79,7 +80,7 @@ def __init__(self, model: Module) -> None:
 
     Args:
 
-        inputs (tensor or tuple of tensors):  Inputs for which concept-based
+        inputs (Tensor or tuple of Tensor): Inputs for which concept-based
                     interpretation scores are computed. It can be provided as
                     a single tensor or a tuple of multiple tensors. If multiple
                     input tensors are provided, the batch size (the first
diff --git a/captum/concept/_core/tcav.py b/captum/concept/_core/tcav.py
index 8b6c996856..64977901ef 100644
--- a/captum/concept/_core/tcav.py
+++ b/captum/concept/_core/tcav.py
@@ -37,11 +37,13 @@ def __init__(self, datasets: List[AV.AVDataset], labels: List[int]) -> None:
         However, __get_item__ not only returns a batch of activation vectors,
         but also a batch of labels indicating which concept that batch of
         activation vectors is associated with.
+
         Args:
+
             datasets (list[Dataset]): The k-th element of datasets is a Dataset
                     representing activation vectors associated with the k-th
                     concept
-            labels (list[Int]): The k-th element of labels is the integer label
+            labels (list[int]): The k-th element of labels is the integer label
                     associated with the k-th concept
         """
         assert len(datasets) == len(
@@ -69,16 +71,17 @@ def _i_to_k(self, i):
             else:
                 right = mid
 
-    def __getitem__(self, i):
+    def __getitem__(self, i: int):
         """
         Returns a batch of activation vectors, as well as a batch of labels
         indicating which concept the batch of activation vectors is associated
         with.
 
-        args:
+        Args:
+
             i (int): which (activation vector, label) batch in the dataset to
                     return
-        returns:
+        Returns:
             inputs (Tensor): i-th batch in Dataset (representing activation
                     vectors)
             labels (Tensor): labels of i-th batch in Dataset
@@ -91,7 +94,7 @@ def __getitem__(self, i):
         labels = torch.tensor([self.labels[k]] * inputs.size(0), device=inputs.device)
         return inputs, labels
 
-    def __len__(self):
+    def __len__(self) -> int:
         """
         returns the total number of batches in the labelled_dataset
         """
@@ -113,6 +116,7 @@ def train_cav(
     Please see the TCAV class documentation for further information.
 
     Args:
+
         model_id (str): A unique identifier for the PyTorch model for which
                 we would like to load the layer activations and train a
                 model in order to compute CAVs.
@@ -120,7 +124,7 @@ def train_cav(
                 to train a classifier and learn decision boundaries between
                 those concepts for each layer defined in the `layers`
                 argument.
-        layers (str, list[str]): A list of layer names or a single layer
+        layers (str or list[str]): A list of layer names or a single layer
                 name that is used to compute the activations of all concept
                 examples per concept and train a classifier using those
                 activations.
@@ -203,7 +207,7 @@ class TCAV(ConceptInterpreter):
     This class implements ConceptInterpreter abstract class using an
     approach called Testing with Concept Activation Vectors (TCAVs),
     as described in the paper:
-    https://arxiv.org/pdf/1711.11279.pdf
+    https://arxiv.org/abs/1711.11279
 
     TCAV scores for a given layer, a list of concepts and input example
     are computed using the dot product between prediction's layer
@@ -251,9 +255,10 @@ def __init__(
     ) -> None:
         r"""
         Args:
+
             model (Module): An instance of pytorch model that is used to compute
                     layer activations and attributions.
-            layers (str, list[str]): A list of layer name(s) that are
+            layers (str or list[str]): A list of layer name(s) that are
                     used for computing concept activations (cavs) and layer
                     attributions.
             model_id (str, optional): A unique identifier for the PyTorch `model`
@@ -275,7 +280,7 @@ def __init__(
                     attribution algorithm.
             save_path (str, optional): The path for storing CAVs and
                     Activation Vectors (AVs).
-            classifier_kwargs (any, optional): Additional arguments such as
+            classifier_kwargs (Any, optional): Additional arguments such as
                     `test_split_ratio` that are passed to concept `classifier`.
 
         Examples::
@@ -342,7 +347,7 @@ def generate_activation(self, layers: Union[str, List], concept: Concept) -> Non
         the list of layer(s) `layers`.
 
         Args:
-            layers (str, list[str]): A list of layer names or a layer name
+            layers (str or list[str]): A list of layer names or a layer name
                     that is used to compute layer activations for the
                     specific `concept`.
             concept (Concept): A single Concept object that provides access
@@ -403,6 +408,7 @@ def load_cavs(
         of concepts and layer.
 
         Args:
+
             concepts (list[Concept]): A list of Concept objects for which we want
                     to load the CAV.
 
@@ -458,6 +464,7 @@ def compute_cavs(
         the argument.
 
         Args:
+
             experimental_sets (list[list[Concept]]): A list of lists of concept
                     instances for which the cavs will be computed.
             force_train (bool, optional): A flag that indicates whether to
@@ -469,6 +476,7 @@ def compute_cavs(
                     multi-processing, otherwise it will be performed sequentially
                     in a single process.
                     Default: None
+
         Returns:
             cavs (dict) : A mapping of concept ids and layers to CAV objects.
                     If CAVs for the concept_ids-layer pairs are present in the
@@ -569,7 +577,8 @@ def interpret(
         scores for specific predictions and CAV vectors.
 
         Args:
-            inputs (tensor or tuple of tensors): Inputs for which predictions
+
+            inputs (Tensor or tuple of Tensor): Inputs for which predictions
                     are performed and attributions are computed.
                     If model takes a single tensor as
                     input, a single input tensor should be provided.
@@ -581,7 +590,7 @@ def interpret(
                     provided, the examples must be aligned appropriately.
             experimental_sets (list[list[Concept]]): A list of list of Concept
                     instances.
-            target (int, tuple, tensor or list, optional):  Output indices for
+            target (int, tuple, Tensor, or list, optional): Output indices for
                     which attributions are computed (for classification cases,
                     this is usually the target class).
                     If the network returns a scalar value per example,
@@ -617,6 +626,7 @@ def interpret(
                     attribution algorithm's attribute method. This could be for
                     example `n_steps` in case of integrated gradients.
                     Default: None
+
         Returns:
             results (dict): A dictionary of sign and magnitude -based tcav scores
                     for each concept set per layer.
diff --git a/captum/concept/_utils/classifier.py b/captum/concept/_utils/classifier.py
index 5bdf605470..b8ba7d0a59 100644
--- a/captum/concept/_utils/classifier.py
+++ b/captum/concept/_utils/classifier.py
@@ -95,7 +95,7 @@ def weights(self) -> Tensor:
         C is the number of classes and F is the number of features.
 
         Returns:
-            weights (tensor): A torch Tensor with the weights resulting from
+            weights (Tensor): A torch Tensor with the weights resulting from
                 the model training.
         """
         pass
@@ -189,10 +189,10 @@ def weights(self) -> Tensor:
         r"""
         This function returns a C x F tensor weights, where
         C is the number of classes and F is the number of features.
-        In case of binary classification, C = 2 othewise it is > 2.
+        In case of binary classification, C = 2; otherwise it is > 2.
 
         Returns:
-            weights (tensor): A torch Tensor with the weights resulting from
+            weights (Tensor): A torch Tensor with the weights resulting from
                 the model training.
         """
         assert self.lm.linear is not None, (
diff --git a/captum/concept/_utils/data_iterator.py b/captum/concept/_utils/data_iterator.py
index 6a8a48f197..574bc71ae7 100644
--- a/captum/concept/_utils/data_iterator.py
+++ b/captum/concept/_utils/data_iterator.py
@@ -16,7 +16,7 @@ class CustomIterableDataset(IterableDataset):
     def __init__(self, transform_filename_to_tensor: Callable, path: str) -> None:
         r"""
         Args:
-            transform_filename_to_tensor (callable): Function to read a data
+            transform_filename_to_tensor (Callable): Function to read a data
                         file from path and return a tensor from that file.
             path (str): Path to dataset files. This can be either a path to a
                         directory or a file where input examples are stored.
diff --git a/captum/influence/_core/similarity_influence.py b/captum/influence/_core/similarity_influence.py
index 0fd21eedb7..db1484466e 100644
--- a/captum/influence/_core/similarity_influence.py
+++ b/captum/influence/_core/similarity_influence.py
@@ -82,7 +82,7 @@ def __init__(
         Args:
             module (torch.nn.Module): An instance of pytorch model. This model should
                     define all of its layers as attributes of the model.
-            layers (str or List of str): The fully qualified layer(s) for which the
+            layers (str or list[str]): The fully qualified layer(s) for which the
                     activation vectors are computed.
             influence_src_dataset (torch.utils.data.Dataset): PyTorch Dataset that is
                     used to create a PyTorch Dataloader to iterate over the dataset and
@@ -166,13 +166,13 @@ def influence(  # type: ignore[override]
     ) -> Dict:
         r"""
         Args:
-            inputs (tensor or tuple of tensors): Batch of examples for which influential
+            inputs (Tensor or tuple of Tensor): Batch of examples for which influential
                     instances are computed. They are passed to the forward_func. The
                     first dimension in `inputs` tensor or tuple of tensors corresponds
                     to the batch size. A tuple of tensors is only passed in if this
                     is the input form that `module` accepts.
             top_k (int): The number of top-matching activations to return
-            additional_forward_args (optional):  Additional arguments that will be
+            additional_forward_args (Any, optional): Additional arguments that will be
                     passed to forward_func after inputs.
             load_src_from_disk (bool): Loads activations for `influence_src_dataset`
                     where possible. Setting to False would force regeneration of
diff --git a/captum/influence/_core/tracincp.py b/captum/influence/_core/tracincp.py
index 15811e684b..8cb2ac7bfc 100644
--- a/captum/influence/_core/tracincp.py
+++ b/captum/influence/_core/tracincp.py
@@ -45,7 +45,7 @@
 
 Implements abstract DataInfluence class and provides implementation details for
 influence computation based on the logic provided in TracIn paper
-(https://arxiv.org/pdf/2002.08484.pdf).
+(https://arxiv.org/abs/2002.08484).
 
 The TracIn paper proposes an idealized notion of influence which can be represented by
 the total amount a training example reduces loss for a test example via a training
@@ -105,9 +105,10 @@ def __init__(
     ) -> None:
         r"""
         Args:
+
             model (torch.nn.Module): An instance of pytorch model. This model should
                     define all of its layers as attributes of the model.
-            train_dataset (torch.utils.data.Dataset or torch.utils.DataLoader):
+            train_dataset (torch.utils.data.Dataset or torch.utils.data.DataLoader):
                     In the `influence` method, we either compute the influence score of
                     training examples on examples in a test batch, or self influence
                     scores for those training examples, depending on which mode is used.
@@ -131,7 +132,7 @@ def __init__(
                     `model` accepts `L-1` arguments, and the last element of `batch` is
                     the label. In other words, `model(*batch[:-1])` gives the output of
                     `model`, and `batch[-1]` are the labels for the batch.
-            checkpoints (str or List of str or Iterator): Either the directory of the
+            checkpoints (str, list[str], or Iterator): Either the directory of the
                     path to store and retrieve model checkpoints, a list of
                     filepaths with checkpoints from which to load, or an iterator which
                     returns objects from which to load checkpoints.
@@ -140,7 +141,7 @@ def __init__(
                     learning rate if it is saved. By default uses a utility to load a
                     model saved as a state dict.
                     Default: _load_flexible_state_dict
-            layers (List of str or None, optional): A list of layer names for which
+            layers (list[str] or None, optional): A list of layer names for which
                     gradients should be computed. If `layers` is None, gradients will
                     be computed for all layers. Otherwise, they will only be computed
                     for the layers specified in `layers`.
@@ -215,7 +216,7 @@ def self_influence(
         with are not too large, so that there will not be an out-of-memory error.
 
         Args:
-            batches (Tuple, or DataLoader): Either a single tuple of any, or a
+            batches (tuple or DataLoader): Either a single tuple of any, or a
                     `DataLoader`, where each batch yielded is a tuple of any. In
                     either case, the tuple represents a single batch, where the last
                     element is assumed to be the labels for the batch. That is,
@@ -227,7 +228,7 @@ def self_influence(
                     more details on the assumed structure of a batch.
             show_progress (bool, optional): Computation of self influence scores can
                     take a long time if `inputs_dataset` represents many examples. If
-                    `show_progress`is true, the progress of this computation will be
+                    `show_progress` is true, the progress of this computation will be
                     displayed. In more detail, this computation will iterate over all
                     checkpoints (provided as the `checkpoints` initialization argument)
                     in an outer loop, and iterate over all batches that
@@ -261,9 +262,10 @@ def _get_k_most_influential(
     ) -> KMostInfluentialResults:
         r"""
         Args:
-            inputs (Tuple of Any): A tuple that represents a batch of examples. It does
+
+            inputs (tuple of Any): A tuple that represents a batch of examples. It does
                     not represent labels, which are passed as `targets`.
-            targets (tensor, optional): If computing influence scores on a loss
+            targets (Tensor, optional): If computing influence scores on a loss
                     function, these are the labels corresponding to the batch `inputs`.
                     Default: None
             k (int, optional): The number of proponents or opponents to return per test
@@ -274,7 +276,7 @@ def _get_k_most_influential(
                     Default: True
             show_progress (bool, optional): To compute the proponents (or opponents)
                     for the batch of examples, we perform computation for each batch in
-                    training dataset `train_dataset`, If `show_progress`is
+                    training dataset `train_dataset`. If `show_progress` is
                     true, the progress of this computation will be displayed. In
                     particular, the number of batches for which the computation has
                     been performed will be displayed. It will try to use tqdm if
@@ -309,23 +311,24 @@ def _influence(
     ) -> Tensor:
         r"""
         Args:
-            inputs (Tuple of Any): A batch of examples. Does not represent labels,
+
+            inputs (tuple of Any): A batch of examples. Does not represent labels,
                     which are passed as `targets`. The assumption is that
                     `model(*inputs)` produces the predictions for the batch.
-            targets (tensor, optional): If computing influence scores on a loss
+            targets (Tensor, optional): If computing influence scores on a loss
                     function, these are the labels corresponding to the batch
                     `inputs`.
                     Default: None
 
         Returns:
-            influence_scores (tensor): Influence scores over the entire
+            influence_scores (Tensor): Influence scores over the entire
                     training dataset `train_dataset`. Dimensionality is
                     (inputs_batch_size, src_dataset_size). For example:
                     influence_scores[i][j] = the influence score for the j-th training
                     example to the i-th input example.
             show_progress (bool, optional): To compute the influence of examples in
                     training dataset `train_dataset`, we compute the influence
-                    of each batch. If `show_progress`is true, the progress of this
+                    of each batch. If `show_progress` is true, the progress of this
                     computation will be displayed. In particular, the number of batches
                     for which influence has been computed will be displayed. It will
                     try to use tqdm if available for advanced features (e.g. time
@@ -369,6 +372,7 @@ def influence(  # type: ignore[override]
           opponent) on the test example.
 
         Args:
+
             inputs (Any, optional): If not provided or `None`, the self influence mode
                     will be run. Otherwise, `inputs` is the test batch that will be
                     used when running in either influence score or k-most influential
@@ -380,7 +384,7 @@ def influence(  # type: ignore[override]
                     `inputs` will need to be a tuple. In other words, `inputs` will be
                     unpacked as an argument when passing to `model`.
                     Default: None
-            targets (tensor, optional): If computing influence scores on a loss
+            targets (Tensor, optional): If computing influence scores on a loss
                     function, these are the labels corresponding to the batch `inputs`.
                     Default: None
             k (int, optional): If not provided or `None`, the influence score mode will
@@ -399,7 +403,7 @@ def influence(  # type: ignore[override]
             show_progress (bool, optional): For all modes, computation of results
                     requires "training dataset computations": computations for each
                     batch in the training dataset `train_dataset`, which may
-                    take a long time. If `show_progress`is true, the progress of
+                    take a long time. If `show_progress` is true, the progress of
                     "training dataset computations" will be displayed. In particular,
                     the number of batches for which computations have been performed
                     will be displayed. It will try to use tqdm if available for
@@ -501,9 +505,10 @@ def __init__(
     ) -> None:
         r"""
         Args:
+
             model (torch.nn.Module): An instance of pytorch model. This model should
                     define all of its layers as attributes of the model.
-            train_dataset (torch.utils.data.Dataset or torch.utils.DataLoader):
+            train_dataset (torch.utils.data.Dataset or torch.utils.data.DataLoader):
                     In the `influence` method, we either compute the influence score of
                     training examples on examples in a test batch, or self influence
                     scores for those training examples, depending on which mode is used.
@@ -527,7 +532,7 @@ def __init__(
                     `model` accepts `L-1` arguments, and the last element of `batch` is
                     the label. In other words, `model(*batch[:-1])` gives the output of
                     `model`, and `batch[-1]` are the labels for the batch.
-            checkpoints (str or List of str or Iterator): Either the directory of the
+            checkpoints (str, list[str], or Iterator): Either the directory of the
                     path to store and retrieve model checkpoints, a list of
                     filepaths with checkpoints from which to load, or an iterator which
                     returns objects from which to load checkpoints.
@@ -536,7 +541,7 @@ def __init__(
                     learning rate if it is saved. By default uses a utility to load a
                     model saved as a state dict.
                     Default: _load_flexible_state_dict
-            layers (List of str or None, optional): A list of layer names for which
+            layers (list[str] or None, optional): A list of layer names for which
                     gradients should be computed. If `layers` is None, gradients will
                     be computed for all layers. Otherwise, they will only be computed
                     for the layers specified in `layers`.
@@ -698,6 +703,7 @@ def influence(  # type: ignore[override]
           opponent) on the test example.
 
         Args:
+
             inputs (Any, optional): If not provided or `None`, the self influence mode
                     will be run. Otherwise, `inputs` is the test batch that will be
                     used when running in either influence score or k-most influential
@@ -709,7 +715,7 @@ def influence(  # type: ignore[override]
                     `inputs` will need to be a tuple. In other words, `inputs` will be
                     unpacked as an argument when passing to `model`.
                     Default: None
-            targets (tensor, optional): If computing influence scores on a loss
+            targets (Tensor, optional): If computing influence scores on a loss
                     function, these are the labels corresponding to the batch `inputs`.
                     Default: None
             k (int, optional): If not provided or `None`, the influence score mode will
@@ -728,7 +734,7 @@ def influence(  # type: ignore[override]
             show_progress (bool, optional): For all modes, computation of results
                     requires "training dataset computations": computations for each
                     batch in the training dataset `train_dataset`, which may
-                    take a long time. If `show_progress`is true, the progress of
+                    take a long time. If `show_progress` is true, the progress of
                     "training dataset computations" will be displayed. It will try to
                     use tqdm if available for advanced features (e.g. time estimation).
                     Otherwise, it will fallback to a simple output of progress.
@@ -827,15 +833,16 @@ def _influence(
         output of `self._basic_computation_tracincp`.
 
         Args:
-            inputs (Tuple of Any): A test batch of examples. Does not represent labels,
+
+            inputs (tuple of Any): A test batch of examples. Does not represent labels,
                     which are passed as `targets`. The assumption is that
                     `model(*inputs)` produces the predictions for the batch.
-            targets (tensor, optional): If computing influence scores on a loss
+            targets (Tensor, optional): If computing influence scores on a loss
                     function, these are the labels corresponding to the batch `inputs`.
                     Default: None
             show_progress (bool, optional): To compute the influence of examples in
                     training dataset `train_dataset`, we compute the influence
-                    of each batch. If `show_progress`is true, the progress of this
+                    of each batch. If `show_progress` is true, the progress of this
                     computation will be displayed. In particular, the number of batches
                     for which influence has been computed will be displayed. It will
                     try to use tqdm if available for advanced features (e.g. time
@@ -844,7 +851,7 @@ def _influence(
                     Default: False
 
         Returns:
-            influence_scores (tensor): Influence scores from the TracInCP method.
+            influence_scores (Tensor): Influence scores from the TracInCP method.
             Its shape is `(input_size, train_dataset_size)`, where `input_size`
             is the number of examples in the test batch, and
             `train_dataset_size` is the number of examples in
@@ -882,7 +889,8 @@ def _get_k_most_influential(
     ) -> KMostInfluentialResults:
         r"""
         Args:
-            inputs (Tuple of Any): A tuple that represents a batch of examples. It does
+
+            inputs (tuple of Any): A tuple that represents a batch of examples. It does
                     not represent labels, which are passed as `targets`.
             targets (Tensor, optional): If computing influence scores on a loss
                     function, these are the labels corresponding to the batch `inputs`.
@@ -895,7 +903,7 @@ def _get_k_most_influential(
                     Default: True
             show_progress (bool, optional): To compute the proponents (or opponents)
                     for the batch of examples, we perform computation for each batch in
-                    training dataset `train_dataset`, If `show_progress`is
+                    training dataset `train_dataset`. If `show_progress` is
                     true, the progress of this computation will be displayed. In
                     particular, the number of batches for which the computation has
                     been performed will be displayed. It will try to use tqdm if
@@ -964,7 +972,7 @@ def _self_influence_by_checkpoints(
         times.
 
         Args:
-            batches (Tuple, or DataLoader): Either a single tuple of any, or a
+            batches (tuple or DataLoader): Either a single tuple of any, or a
                     `DataLoader`, where each batch yielded is a tuple of any. In
                     either case, the tuple represents a single batch, where the last
                     element is assumed to be the labels for the batch. That is,
@@ -976,7 +984,7 @@ def _self_influence_by_checkpoints(
                     more details on the assumed structure of a batch.
             show_progress (bool, optional): Computation of self influence scores can
                     take a long time if `inputs_dataset` represents many examples. If
-                    `show_progress`is true, the progress of this computation will be
+                    `show_progress` is true, the progress of this computation will be
                     displayed. In more detail, this computation will iterate over all
                     checkpoints (provided as the `checkpoints` initialization argument)
                     in an outer loop, and iterate over all batches that
@@ -1126,7 +1134,7 @@ def self_influence(
         for each batch. For large models, loading checkpoints can be time-intensive.
 
         Args:
-            batches (Tuple, or DataLoader): Either a single tuple of any, or a
+            batches (tuple or DataLoader): Either a single tuple of any, or a
                     `DataLoader`, where each batch yielded is a tuple of any. In
                     either case, the tuple represents a single batch, where the last
                     element is assumed to be the labels for the batch. That is,
@@ -1178,7 +1186,8 @@ def _basic_computation_tracincp(
         and batches.
 
         Args:
-            inputs (Tuple of Any): A batch of examples, which could be a training batch
+
+            inputs (tuple of Any): A batch of examples, which could be a training batch
                     or test batch, depending which method is the caller. Does not
                     represent labels, which are passed as `targets`. The assumption is
                     that `model(*inputs)` produces the predictions for the batch.
diff --git a/captum/influence/_core/tracincp_fast_rand_proj.py b/captum/influence/_core/tracincp_fast_rand_proj.py
index 114d4c45bd..0a95a52dc1 100644
--- a/captum/influence/_core/tracincp_fast_rand_proj.py
+++ b/captum/influence/_core/tracincp_fast_rand_proj.py
@@ -36,7 +36,7 @@
 r"""
 Implements abstract DataInfluence class and also provides implementation details for
 influence computation based on the logic provided in TracIn paper
-(https://arxiv.org/pdf/2002.08484.pdf).
+(https://arxiv.org/abs/2002.08484).
 
 The TracIn paper proposes an idealized notion of influence which can be represented by
 the total amount a training example reduces loss for a test example via a training
@@ -92,6 +92,7 @@ def __init__(
     ) -> None:
         r"""
         Args:
+
             model (torch.nn.Module): An instance of pytorch model. This model should
                     define all of its layers as attributes of the model.
             final_fc_layer (torch.nn.Module or str): The last fully connected layer in
@@ -99,7 +100,7 @@ def __init__(
                     projection method. Can be either the layer module itself, or the
                     fully qualified name of the layer if it is a defined attribute of
                     the passed `model`.
-            train_dataset (torch.utils.data.Dataset or torch.utils.DataLoader):
+            train_dataset (torch.utils.data.Dataset or torch.utils.data.DataLoader):
                     In the `influence` method, we either compute the influence score of
                     training examples on examples in a test batch, or self influence
                     scores for those training examples, depending on which mode is used.
@@ -123,7 +124,7 @@ def __init__(
                     `model` accepts `L-1` arguments, and the last element of `batch` is
                     the label. In other words, `model(*batch[:-1])` gives the output of
                     `model`, and `batch[-1]` are the labels for the batch.
-            checkpoints (str or List of str or Iterator): Either the directory of the
+            checkpoints (str, list[str], or Iterator): Either the directory of the
                     path to store and retrieve model checkpoints, a list of
                     filepaths with checkpoints from which to load, or an iterator which
                     returns objects from which to load checkpoints.
@@ -236,6 +237,7 @@ def influence(  # type: ignore[override]
           opponent) on the test example.
 
         Args:
+
             inputs (Any, optional): If not provided or `None`, the self influence mode
                     will be run. Otherwise, `inputs` is the test batch that will be
                     used when running in either influence score or k-most influential
@@ -247,7 +249,7 @@ def influence(  # type: ignore[override]
                     `inputs` will need to be a tuple. In other words, `inputs` will be
                     unpacked as an argument when passing to `model`.
                     Default: None
-            targets (tensor, optional): The labels corresponding to the batch `inputs`.
+            targets (Tensor, optional): The labels corresponding to the batch `inputs`.
                     This method is designed to be applied for a loss function, so
                     `targets` is required, unless running in "self influence" mode.
                     Default: None
@@ -267,7 +269,7 @@ def influence(  # type: ignore[override]
             show_progress (bool, optional): For all modes, computation of results
                     requires "training dataset computations": computations for each
                     batch in the training dataset `train_dataset`, which may
-                    take a long time. If `show_progress`is true, the progress of
+                    take a long time. If `show_progress` is true, the progress of
                     "training dataset computations" will be displayed. It will try to
                     use tqdm if available for advanced features (e.g. time estimation).
                     Otherwise, it will fallback to a simple output of progress.
@@ -281,7 +283,7 @@ def influence(  # type: ignore[override]
               `train_dataset`. The length of this tensor is the number of
               examples in `train_dataset`, regardless of whether it is a
               Dataset or DataLoader.
-            - influence score mode: if this mode is run (`inputs is not None, `k` is
+            - influence score mode: if this mode is run (`inputs` is not None, `k` is
               None), returns a 2D tensor `influence_scores` of shape
               `(input_size, train_dataset_size)`, where `input_size` is
               the number of examples in the test batch, and
@@ -377,15 +379,16 @@ def _influence(  # type: ignore[override]
         output of `_basic_computation_tracincp_fast`.
 
         Args:
-            inputs (Tuple of Any): A batch of examples. Does not represent labels,
+
+            inputs (tuple of Any): A batch of examples. Does not represent labels,
                     which are passed as `targets`. The assumption is that
                     `model(*inputs)` produces the predictions for the batch.
-            targets (tensor): The labels corresponding to the batch `inputs`. This
+            targets (Tensor): The labels corresponding to the batch `inputs`. This
                     method is designed to be applied for a loss function, so labels
                     are required.
             show_progress (bool, optional): To compute the influence of examples in
                     training dataset `train_dataset`, we compute the influence
-                    of each batch. If `show_progress`is true, the progress of this
+                    of each batch. If `show_progress` is true, the progress of this
                     computation will be displayed. In particular, the number of batches
                     for which influence has been computed will be displayed. It will
                     try to use tqdm if available for advanced features (e.g. time
@@ -394,7 +397,7 @@ def _influence(  # type: ignore[override]
                     Default: False
 
         Returns:
-            influence_scores (tensor): Influence scores from the TracInCPFast method.
+            influence_scores (Tensor): Influence scores from the TracInCPFast method.
             Its shape is `(input_size, train_dataset_size)`, where `input_size`
             is the number of examples in the test batch, and
             `train_dataset_size` is the number of examples in
@@ -434,9 +437,10 @@ def _get_k_most_influential(  # type: ignore[override]
     ) -> KMostInfluentialResults:
         r"""
         Args:
-            inputs (Tuple of Any): A tuple that represents a batch of examples. It does
+
+            inputs (tuple of Any): A tuple that represents a batch of examples. It does
                     not represent labels, which are passed as `targets`.
-            targets (tensor): The labels corresponding to the batch `inputs`. This
+            targets (Tensor): The labels corresponding to the batch `inputs`. This
                     method is designed to be applied for a loss function, so labels
                     are required.
             k (int, optional): The number of proponents or opponents to return per test
@@ -447,7 +451,7 @@ def _get_k_most_influential(  # type: ignore[override]
                     Default: True
             show_progress (bool, optional): To compute the proponents (or opponents)
                     for the batch of examples, we perform computation for each batch in
-                    training dataset `train_dataset`, If `show_progress`is
+                    training dataset `train_dataset`. If `show_progress` is
                     true, the progress of this computation will be displayed. In
                     particular, the number of batches for which the computation has
                     been performed will be displayed. It will try to use tqdm if
@@ -516,7 +520,7 @@ def _self_influence_by_checkpoints(
         times.
 
         Args:
-            batches (Tuple, or DataLoader): Either a single tuple of any, or a
+            batches (tuple or DataLoader): Either a single tuple of any, or a
                     `DataLoader`, where each batch yielded is a tuple of any. In
                     either case, the tuple represents a single batch, where the last
                     element is assumed to be the labels for the batch. That is,
@@ -528,7 +532,7 @@ def _self_influence_by_checkpoints(
                     more details on the assumed structure of a batch.
             show_progress (bool, optional): Computation of self influence scores can
                     take a long time if `inputs_dataset` represents many examples. If
-                    `show_progress`is true, the progress of this computation will be
+                    `show_progress` is true, the progress of this computation will be
                     displayed. In more detail, this computation will iterate over all
                     checkpoints (provided as the `checkpoints` initialization argument)
                     in an outer loop, and iterate over all batches that
@@ -660,7 +664,7 @@ def self_influence(
         for each batch. For large models, loading checkpoints can be time-intensive.
 
         Args:
-            batches (Tuple, or DataLoader): Either a single tuple of any, or a
+            batches (tuple or DataLoader): Either a single tuple of any, or a
                     `DataLoader`, where each batch yielded is a tuple of any. In
                     either case, the tuple represents a single batch, where the last
                     element is assumed to be the labels for the batch. That is,
@@ -713,17 +717,18 @@ def _basic_computation_tracincp_fast(
     and batches.
 
     Args:
+
         influence_instance (TracInCPFast): A instance of TracInCPFast or its children.
                 We assume `influence_instance` has a `loss_fn` attribute, i.e. the loss
                 function applied to the output of the last fully-connected layer, as
                 well as a `reduction_type` attribute, which indicates whether `loss_fn`
                 reduces the per-example losses by using their mean or sum. The
                 `reduction_type` attribute must either be "mean" or "sum".
-        inputs (Tuple of Any): A batch of examples, which could be a training batch
+        inputs (tuple of Any): A batch of examples, which could be a training batch
                 or test batch, depending which method is the caller. Does not
                 represent labels, which are passed as `targets`. The assumption is
                 that `model(*inputs)` produces the predictions for the batch.
-        targets (tensor): If computing influence scores on a loss function,
+        targets (Tensor): If computing influence scores on a loss function,
                 these are the labels corresponding to the batch `inputs`.
 
     Returns:
@@ -809,7 +814,7 @@ class TracInCPFastRandProj(TracInCPFast):
     to obtain proponents / opponents or influence scores will be made in an
     "interactive" manner, and there is sufficient memory to store vectors for the
     entire `train_dataset`. This is because in order to enable interactive
-    analysis, this implementation incures overhead in ``__init__` to setup the
+    analysis, this implementation incurs overhead in `__init__` to set up the
     nearest-neighbors data structure, which is both time and memory intensive, as
     vectors corresponding to all training examples needed to be stored. To reduce
     memory usage, this implementation enables random projections of those vectors.
@@ -853,6 +858,7 @@ def __init__(
     ) -> None:
         r"""
         Args:
+
             model (torch.nn.Module): An instance of pytorch model. This model should
                     define all of its layers as attributes of the model.
             final_fc_layer (torch.nn.Module or str): The last fully connected layer in
@@ -860,7 +866,7 @@ def __init__(
                     projection method. Can be either the layer module itself, or the
                     fully qualified name of the layer if it is a defined attribute of
                     the passed `model`.
-            train_dataset (torch.utils.data.Dataset or torch.utils.DataLoader):
+            train_dataset (torch.utils.data.Dataset or torch.utils.data.DataLoader):
                     In the `influence` method, we either compute the influence score of
                     training examples on examples in a test batch, or self influence
                     scores for those training examples, depending on which mode is used.
@@ -884,7 +890,7 @@ def __init__(
                     `model` accepts `L-1` arguments, and the last element of `batch` is
                     the label. In other words, `model(*batch[:-1])` gives the output of
                     `model`, and `batch[-1]` are the labels for the batch.
-            checkpoints (str or List of str or Iterator): Either the directory of the
+            checkpoints (str, list[str], or Iterator): Either the directory of the
                     path to store and retrieve model checkpoints, a list of
                     filepaths with checkpoints from which to load, or an iterator which
                     returns objects from which to load checkpoints.
@@ -935,7 +941,7 @@ def __init__(
                     int, and random projection will be performed to ensure that the
                     vector is of dimension no more than `projection_dim` * C.
                     `projection_dim` corresponds to the variable d in the top of page
-                    15 of the TracIn paper: https://arxiv.org/pdf/2002.08484.pdf.
+                    15 of the TracIn paper: https://arxiv.org/abs/2002.08484.
                     Default: None
             seed (int, optional): Because this implementation chooses a random
                     projection, its output is random. Setting this seed specifies the
@@ -995,15 +1001,16 @@ def _influence(  # type: ignore[override]
     ) -> Tensor:
         r"""
         Args:
+
             inputs (tuple of Any): A batch of examples. Does not represent labels,
                     which are passed as `targets`. The assumption is that
                     `model(*inputs)` produces the predictions for the batch.
-            targets (tensor): The labels corresponding to the batch `inputs`. This
+            targets (Tensor): The labels corresponding to the batch `inputs`. This
                     method is designed to be applied for a loss function, so labels
                     are required.
 
         Returns:
-            influence_scores (tensor): Influence scores from the
+            influence_scores (Tensor): Influence scores from the
             TracInCPFastRandProj method. Its shape is
             `(input_size, train_dataset_size)`, where `input_size` is the
             number of examples in the test batch, and `train_dataset_size` is
@@ -1034,9 +1041,10 @@ def _get_k_most_influential(  # type: ignore[override]
     ) -> KMostInfluentialResults:
         r"""
         Args:
-            inputs (Tuple of Any): A tuple that represents a batch of examples. It does
+
+            inputs (tuple of Any): A tuple that represents a batch of examples. It does
                     not represent labels, which are passed as `targets`.
-            targets (tensor): The labels corresponding to the batch `inputs`. This
+            targets (Tensor): The labels corresponding to the batch `inputs`. This
                     method is designed to be applied for a loss function, so labels
                     are required.
             k (int, optional): The number of proponents or opponents to return per test
@@ -1101,7 +1109,7 @@ def self_influence(
         with are not too large, so that there will not be an out-of-memory error.
 
         Args:
-            batches (Tuple, or DataLoader): Either a single tuple of any, or a
+            batches (tuple or DataLoader): Either a single tuple of any, or a
                     `DataLoader`, where each batch yielded is a tuple of any. In
                     either case, the tuple represents a single batch, where the last
                     element is assumed to be the labels for the batch. That is,
@@ -1113,7 +1121,7 @@ def self_influence(
                     more details on the assumed structure of a batch.
             show_progress (bool, optional): Computation of self influence scores can
                     take a long time if `inputs_dataset` represents many examples. If
-                    `show_progress`is true, the progress of this computation will be
+                    `show_progress` is true, the progress of this computation will be
                     displayed. In more detail, this computation will iterate over all
                     checkpoints (provided as the `checkpoints` initialization argument)
                     and all batches that `inputs_dataset` represents. Therefore, the
@@ -1187,6 +1195,7 @@ def influence(  # type: ignore[override]
         gradients in the last fully-connected layer, please use `TracInCPFast` instead.
 
         Args:
+
             inputs (Any, optional): If not provided or `None`, the self influence mode
                     will be run. Otherwise, `inputs` is the test batch that will be
                     used when running in either influence score or k-most influential
@@ -1198,7 +1207,7 @@ def influence(  # type: ignore[override]
                     `inputs` will need to be a tuple. In other words, `inputs` will be
                     unpacked as an argument when passing to `model`.
                     Default: None
-            targets (tensor): The labels corresponding to the batch `inputs`. This
+            targets (Tensor): The labels corresponding to the batch `inputs`. This
                     method is designed to be applied for a loss function, so `targets`
                     is required.
             k (int, optional): If not provided or `None`, the influence score mode will
@@ -1219,7 +1228,7 @@ def influence(  # type: ignore[override]
 
             The return value of this method depends on which mode is run.
 
-            - influence score mode: if this mode is run (`inputs is not None, `k` is
+            - influence score mode: if this mode is run (`inputs` is not None, `k` is
               None), returns a 2D tensor `influence_scores` of shape
               `(input_size, train_dataset_size)`, where `input_size` is
               the number of examples in the test batch, and
@@ -1275,6 +1284,7 @@ def _set_projections_tracincp_fast_rand_proj(
         `TracInCPFastRandProj.__init__`.
 
         Args:
+
             dataloader (DataLoader): determining the projection requires knowing the
                     dimensionality of the last layer's parameters (`jacobian_dim`
                     below) and its input (`layer_input_dim` below). These are
@@ -1282,10 +1292,10 @@ def _set_projections_tracincp_fast_rand_proj(
                     provides that batch.
 
         Returns:
-            jacobian_projection (tensor or None): Projection matrix to apply to
+            jacobian_projection (Tensor or None): Projection matrix to apply to
                     Jacobian of last layer to reduce its dimension, if needed.
                     None otherwise.
-            input_projection (tensor or None): Projection matrix to apply to input of
+            input_projection (Tensor or None): Projection matrix to apply to input of
                     last layer to reduce its dimension, if needed. None otherwise.
         """
         # figure out projection dimensions, if needed
@@ -1326,7 +1336,7 @@ def _set_projections_tracincp_fast_rand_proj(
             # allowable dimension of the "partial" intermediate quantity. Therefore,
             # we only project if `jacobian_dim` * `layer_input_dim` > `projection_dim`.
             # `projection_dim` corresponds to the variable d in the top of page 15 of
-            # the TracIn paper: https://arxiv.org/pdf/2002.08484.pdf.
+            # the TracIn paper: https://arxiv.org/abs/2002.08484.
             if jacobian_dim * layer_input_dim > projection_dim:
                 jacobian_projection_dim = min(int(projection_dim**0.5), jacobian_dim)
                 layer_input_projection_dim = min(
@@ -1361,7 +1371,8 @@ def _process_src_intermediate_quantities_tracincp_fast_rand_proj(
         method creates that data structure. This method has side effects.
 
         Args:
-            src_intermediate_quantities (tensor): the output of the
+
+            src_intermediate_quantities (Tensor): the output of the
                     `_get_intermediate_quantities_tracin_fast_rand_proj` function when
                     applied to training dataset `train_dataset`. This
                     output is the vector representation of all training examples.
@@ -1401,7 +1412,7 @@ def _get_intermediate_quantities_tracincp_fast_rand_proj(
                     projection is to be applied.
 
         Returns:
-            intermediate_quantities (tensor): A tensor of dimension
+            intermediate_quantities (Tensor): A tensor of dimension
                     (N, D * C), where N is total number of examples in `dataloader`, C
                     is the number of checkpoints passed as the `checkpoints` argument
                     of `TracInCPFastRandProj.__init__`, and each row represents the
@@ -1417,7 +1428,7 @@ def _get_intermediate_quantities_tracincp_fast_rand_proj(
                     performed to ensure that the vector is of dimension no more than
                     `self.projection_dim` * C. `self.projection_dim` corresponds to
                     the variable d in the top of page 15 of the TracIn paper:
-                    https://arxiv.org/pdf/2002.08484.pdf.
+                    https://arxiv.org/abs/2002.08484.
         """
         # if `inputs_dataset` is not a `DataLoader`, turn it into one.
         inputs_dataset = _format_inputs_dataset(inputs_dataset)
@@ -1547,7 +1558,7 @@ def compute_intermediate_quantities(
                     structure of a batch.
 
         Returns:
-            intermediate_quantities (tensor): A tensor of dimension
+            intermediate_quantities (Tensor): A tensor of dimension
                     (N, D * C), where N is total number of examples in
                     `inputs_dataset`, C is the number of checkpoints passed as the
                     `checkpoints` argument of `TracInCPFastRandProj.__init__`, and each
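Reviewer note: the `influence` docstring in this file describes three modes selected by `inputs` and `k`. The following is a minimal sketch of that selection logic only, as read from the docstring; `select_influence_mode` is a hypothetical helper and is not part of Captum.

```python
from typing import Any, Optional


def select_influence_mode(inputs: Optional[Any], k: Optional[int]) -> str:
    """Return the name of the mode implied by `inputs` and `k`.

    Hypothetical helper mirroring the docstring: no `inputs` means self
    influence; `inputs` without `k` means influence scores; both given
    means the k-most influential mode.
    """
    if inputs is None:
        return "self influence"
    if k is None:
        return "influence score"
    return "k-most influential"


modes = [
    select_influence_mode(None, None),
    select_influence_mode(("x",), None),
    select_influence_mode(("x",), 5),
]
```

The real method does much more (checkpoint iteration, projections); this only illustrates how the two optional arguments pick a mode.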
diff --git a/captum/influence/_utils/common.py b/captum/influence/_utils/common.py
index cd989098c8..356f09b8e6 100644
--- a/captum/influence/_utils/common.py
+++ b/captum/influence/_utils/common.py
@@ -91,12 +91,12 @@ def _jacobian_loss_wrt_inputs(
                 torch.nn.Module. If a custom loss is provided, it can be either type,
                 but must behave as a library loss function would if `reduction='sum'`
                 or `reduction='mean'`.
-        out (tensor): This is a tensor that represents the batch of inputs to
+        out (Tensor): This is a tensor that represents the batch of inputs to
                 `loss_fn`. In practice, this will be the output of a model; this is
                 why this argument is named `out`. `out` is a 2D tensor of shape
                 (batch size, model output dimensionality). We will call `loss_fn` via
                 `loss_fn(out, targets)`.
-        targets (tensor): The labels for the batch of inputs.
+        targets (Tensor): The labels for the batch of inputs.
         vectorize (bool): Flag to use experimental vectorize functionality for
                 `torch.autograd.functional.jacobian`.
         reduction_type (str): The type of reduction used by `loss_fn`. If `loss_fn`
@@ -104,7 +104,7 @@ def _jacobian_loss_wrt_inputs(
                 only be "mean" or "sum".
 
     Returns:
-        jacobians (tensor): Returns the jacobian of the per-sample loss (implicitly
+        jacobians (Tensor): Returns the jacobian of the per-sample loss (implicitly
                 defined by `loss_fn` and `reduction_type`) w.r.t each sample
                 in the batch represented by `out`. This is a 2D tensor, where the
                 first dimension is the batch dimension.
@@ -153,8 +153,9 @@ def _load_flexible_state_dict(model: Module, path: str) -> float:
     state_dict and other information.
 
     Args:
-        model: The model for which to load a checkpoint
-        path: The filepath to the checkpoint
+
+        model (torch.nn.Module): The model for which to load a checkpoint
+        path (str): The filepath to the checkpoint
 
     The module state_dict is modified in-place, and the learning rate is returned.
     """
@@ -203,7 +204,7 @@ def _get_k_most_influential_helper(
         influence_batch_fn (Callable): A callable that will be called via
                 `influence_batch_fn(inputs, targets, batch)`, where `batch` is a batch
                 in the `influence_src_dataloader` argument.
-        inputs (Tuple of Any): A batch of examples. Does not represent labels,
+        inputs (tuple of Any): A batch of examples. Does not represent labels,
                 which are passed as `targets`.
         targets (Tensor, optional): If computing TracIn scores on a loss function,
                 these are the labels corresponding to the batch `inputs`.
@@ -216,7 +217,7 @@ def _get_k_most_influential_helper(
                 Default: True
         show_progress (bool, optional): To compute the proponents (or opponents)
                 for the batch of examples, we perform computation for each batch in
-                training dataset `influence_src_dataloader`, If `show_progress`is
+                training dataset `influence_src_dataloader`. If `show_progress` is
                 true, the progress of this computation will be displayed. In
                 particular, the number of batches for which the computation has
                 been performed will be displayed. It will try to use tqdm if
@@ -354,7 +355,7 @@ def _self_influence_by_batches_helper(
         instance_name (str): This is the name of the implementation class that
                 `self_influence_batch_fn` is a method of. This is used for displaying
                 warning messages.
-        batches (Tuple, or DataLoader): Either a single tuple of any, or a
+        batches (tuple or DataLoader): Either a single tuple of any, or a
                 `DataLoader`, where each batch yielded is a tuple of any. In
                 either case, the tuple represents a single batch, where the last
                 element is assumed to be the labels for the batch. That is,
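Reviewer note: the `batches` docstrings above assume the last element of each batch tuple holds the labels. Below is a small sketch of that convention; `split_batch` is a hypothetical name used only for illustration, and treating everything before the last element as the model inputs is an assumption.

```python
from typing import Any, Tuple


def split_batch(batch: Tuple[Any, ...]) -> Tuple[Tuple[Any, ...], Any]:
    """Split a batch tuple into (model inputs, labels).

    Assumes the last element is the labels and every earlier element is
    an input that would be unpacked into the model call.
    """
    if len(batch) < 2:
        raise ValueError("a batch needs at least one input and the labels")
    return batch[:-1], batch[-1]


features, labels = split_batch(([0.1, 0.2], [0.3, 0.4], [1, 0]))
```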
diff --git a/captum/influence/_utils/nearest_neighbors.py b/captum/influence/_utils/nearest_neighbors.py
index 3ecd452de3..fa8d6d7136 100644
--- a/captum/influence/_utils/nearest_neighbors.py
+++ b/captum/influence/_utils/nearest_neighbors.py
@@ -34,7 +34,7 @@ def get_nearest_neighbors(
         so that `query` is 2D.
 
         Args:
-            query (tensor): tensor representing the batch of tensors for which k-nearest
+            query (Tensor): tensor representing the batch of tensors for which k-nearest
                     neighbors are desired. `query` is of shape (N, *), where N is the
                     size of the batch, i.e. the 0-th dimension of `query` indexes the
                     batch. * denotes an arbitrary shape, so that each tensor in the
@@ -68,7 +68,7 @@ def setup(self, data: torch.Tensor) -> None:
         dimension indexes the tensors in the stored tensors.
 
         Args:
-            data (tensor): A tensor of shape (N, *) representing the stored tensors.
+            data (Tensor): A tensor of shape (N, *) representing the stored tensors.
                     The 0-th dimension indexes the tensors in the stored tensors,
                     so that `data[i]` is the tensor with index `i`. The nearest
                     neighbors of a query will be referred to by their index.
@@ -129,7 +129,7 @@ def setup(self, data: torch.Tensor) -> None:
         tensors.
 
         Args:
-            data (tensor): A tensor of shape (N, *) representing the stored tensors.
+            data (Tensor): A tensor of shape (N, *) representing the stored tensors.
                     The 0-th dimension indexes the tensors in the stored tensors,
                     so that `data[i]` is the tensor with index `i`. The nearest
                     neighbors of a query will be referred to by their index.
@@ -160,7 +160,7 @@ def get_nearest_neighbors(
         dot-product of the flattened version of tensors.
 
         Args:
-            query (tensor): tensor representing the batch of tensors for which k-nearest
+            query (Tensor): tensor representing the batch of tensors for which k-nearest
                     neighbors are desired. `query` is of shape (N, *), where N is the
                     size of the batch, i.e. the 0-th dimension of `query` indexes the
                     batch. * denotes an arbitrary shape, so that each tensor in the
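Reviewer note: the docstrings in this file describe nearest neighbors computed from the dot-product of flattened tensors, with neighbors referred to by their index in `data`. Here is a pure-Python sketch of that idea using nested lists in place of tensors. It assumes a larger dot-product means "nearer"; `flatten` and `nearest_by_dot_product` are hypothetical names, not the Captum implementation.

```python
from typing import Any, List, Tuple


def flatten(value: Any) -> List[float]:
    """Flatten an arbitrarily nested list of numbers into one flat list."""
    if isinstance(value, (list, tuple)):
        flat: List[float] = []
        for item in value:
            flat.extend(flatten(item))
        return flat
    return [float(value)]


def nearest_by_dot_product(
    query: List[Any], data: List[Any], k: int
) -> Tuple[List[List[int]], List[List[float]]]:
    """For each query item, return indices and scores of the top-k stored items."""
    stored = [flatten(item) for item in data]
    all_indices, all_scores = [], []
    for item in query:
        q = flatten(item)
        scores = [sum(a * b for a, b in zip(q, s)) for s in stored]
        # Sort stored indices by descending dot-product and keep the first k.
        order = sorted(range(len(stored)), key=lambda i: -scores[i])[:k]
        all_indices.append(order)
        all_scores.append([scores[i] for i in order])
    return all_indices, all_scores


data = [[[1, 0], [0, 0]], [[0, 1], [0, 0]], [[1, 1], [0, 0]]]
query = [[[2, 1], [0, 0]]]
indices, scores = nearest_by_dot_product(query, data, k=2)
```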
diff --git a/captum/insights/__init__.py b/captum/insights/__init__.py
index 48ba6fdfa0..2ba766cdd2 100644
--- a/captum/insights/__init__.py
+++ b/captum/insights/__init__.py
@@ -1 +1 @@
-from captum.insights.attr_vis import AttributionVisualizer, Batch  # noqa
+from captum.insights.attr_vis import AttributionVisualizer, Batch, features  # noqa
diff --git a/captum/insights/attr_vis/app.py b/captum/insights/attr_vis/app.py
index 9a0433090b..fe7e0bbcda 100644
--- a/captum/insights/attr_vis/app.py
+++ b/captum/insights/attr_vis/app.py
@@ -108,7 +108,7 @@ def __init__(
 
         Args:
 
-            inputs (tensor or tuple of tensors): Batch of inputs for a model.
+            inputs (Tensor or tuple of Tensor): Batch of inputs for a model.
                         These may be either a Tensor or tuple of tensors. Each tensor
                         must correspond to a feature for AttributionVisualizer, and
                         the corresponding input transform function of the feature
@@ -116,7 +116,7 @@ def __init__(
                         model. It is assumed that the first dimension of each
                         input tensor corresponds to the number of examples
                         (batch size) and is aligned for all input tensors.
-            labels (tensor): Tensor containing correct labels for input examples.
+            labels (Tensor): Tensor containing correct labels for input examples.
                         This must be a 1D tensor with length matching the first
                         dimension of each input tensor.
             additional_args (tuple, optional): If the forward function
@@ -149,11 +149,11 @@ def __init__(
         r"""
         Args:
 
-            models (torch.nn.module): One or more PyTorch modules (models) for
+            models (torch.nn.Module): One or more PyTorch modules (models) for
                           attribution visualization.
-            classes (list of string): List of strings corresponding to the names of
+            classes (list[str]): List of strings corresponding to the names of
                           classes for classification.
-            features (list of BaseFeature): List of BaseFeatures, which correspond
+            features (list[BaseFeature]): List of BaseFeatures, which correspond
                           to input arguments to the model. Each feature object defines
                           relevant transformations for converting to model input,
                           constructing baselines, and visualizing. The length of the
@@ -163,10 +163,10 @@ def __init__(
                           a single BaseFeature, while a multimodal classifier may
                           provide a list of features, each corresponding to a different
                           tensor input and potentially different modalities.
-            dataset (iterable of Batch): Defines the dataset to visualize attributions
+            dataset (Iterable of Batch): Defines the dataset to visualize attributions
                           for. This must be an iterable of batch objects, each of which
                           may contain multiple input examples.
-            score_func (callable, optional): This function is applied to the model
+            score_func (Callable, optional): This function is applied to the model
                           output to obtain the score for each class. For instance,
                           this function could be the softmax or final non-linearity
                           of the network, applied to the model output. The indices
@@ -175,7 +175,7 @@ def __init__(
                           are taken directly and assumed to correspond to the
                           class scores.
                           Default: None
-            use_label_for_attr (boolean, optional): If true, the class index is passed
+            use_label_for_attr (bool, optional): If true, the class index is passed
                           to the relevant attribution method. This is necessary in most
                           cases where there is an output neuron corresponding to each
                           class. When the model output is a scalar and class index
diff --git a/captum/insights/attr_vis/features.py b/captum/insights/attr_vis/features.py
index 0986170758..9a048e57a6 100644
--- a/captum/insights/attr_vis/features.py
+++ b/captum/insights/attr_vis/features.py
@@ -43,16 +43,16 @@ def __init__(
 
             name (str): The label of the specific feature. For example, an
                         ImageFeature's name can be "Photo".
-            baseline_transforms (list, callable, optional): Optional list of
+            baseline_transforms (list, Callable, optional): Optional list of
                         callables (e.g. functions) to be called on the input tensor
                         to construct multiple baselines. Currently only one baseline
                         is supported. See
                         :py:class:`.IntegratedGradients` for more
                         information about baselines.
-            input_transforms (list, callable, optional): Optional list of callables
+            input_transforms (list, Callable, optional): Optional list of callables
                         (e.g. functions) called on the input tensor sequentially to
                         convert it into the format expected by the model.
-            visualization_transform (callable, optional): Optional callable (e.g.
+            visualization_transform (Callable, optional): Optional callable (e.g.
                         function) applied as a postprocessing step of the original
                         input data (before ``input_transforms``) to convert it to a
                         format to be understood by the frontend visualizer as
@@ -89,16 +89,16 @@ def __init__(
         Args:
             name (str): The label of the specific feature. For example, an
                         ImageFeature's name can be "Photo".
-            baseline_transforms (list, callable, optional): Optional list of
+            baseline_transforms (list, Callable, optional): Optional list of
                         callables (e.g. functions) to be called on the input tensor
                         to construct multiple baselines. Currently only one baseline
                         is supported. See
                         :py:class:`.IntegratedGradients` for more
                         information about baselines.
-            input_transforms (list, callable, optional): A list of transforms
+            input_transforms (list, Callable, optional): A list of transforms
                         or transform to be applied to the input. For images,
                         normalization is often applied here.
-            visualization_transform (callable, optional): Optional callable (e.g.
+            visualization_transform (Callable, optional): Optional callable (e.g.
                         function) applied as a postprocessing step of the original
                         input data (before input_transforms) to convert it to a
                         format to be visualized.
@@ -164,7 +164,7 @@ def __init__(
         Args:
             name (str): The label of the specific feature. For example, an
                         ImageFeature's name can be "Photo".
-            baseline_transforms (list, callable, optional): Optional list of
+            baseline_transforms (list, Callable, optional): Optional list of
                         callables (e.g. functions) to be called on the input tensor
                         to construct multiple baselines. Currently only one baseline
                         is supported. See
@@ -174,7 +174,7 @@ def __init__(
                         corresponding to PAD with the same size as the input
                         tensor. See :py:class:`.TokenReferenceBase` for more
                         information.
-            input_transforms (list, callable, optional): A list of transforms
+            input_transforms (list, Callable, optional): A list of transforms
                         or transform to be applied to the input. For text, a common
                         transform is to convert the tokenized input tensor into an
                         interpretable embedding. See
@@ -182,7 +182,7 @@ def __init__(
                         and
                         :py:func:`~.configure_interpretable_embedding_layer`
                         for more information.
-            visualization_transform (callable, optional): Optional callable (e.g.
+            visualization_transform (Callable, optional): Optional callable (e.g.
                         function) applied as a postprocessing step of the original
                         input data (before ``input_transforms``) to convert it to a
                         suitable format for visualization. For text features,
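Reviewer note: `input_transforms` is documented above as either a single callable or a list of callables applied to the input sequentially. A minimal sketch of that behavior follows; `apply_transforms` is a hypothetical helper, and plain numbers stand in for tensors.

```python
from typing import Any, Callable, List, Optional, Union

Transform = Callable[[Any], Any]


def apply_transforms(
    value: Any, transforms: Optional[Union[Transform, List[Transform]]]
) -> Any:
    """Apply zero, one, or several transforms to `value` in order."""
    if transforms is None:
        return value
    if callable(transforms):
        transforms = [transforms]
    for transform in transforms:
        value = transform(value)
    return value


# Order matters: (3 + 1) * 2, not 3 * 2 + 1.
result = apply_transforms(3, [lambda v: v + 1, lambda v: v * 2])
```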
diff --git a/captum/metrics/_core/infidelity.py b/captum/metrics/_core/infidelity.py
index 33f485a78e..a10b2e2812 100644
--- a/captum/metrics/_core/infidelity.py
+++ b/captum/metrics/_core/infidelity.py
@@ -44,12 +44,12 @@ def sub_infidelity_perturb_func_decorator(pertub_func: Callable) -> Callable:
         r"""
         Args:
 
-            pertub_func(callable): Input perturbation function that takes inputs
+            pertub_func (Callable): Input perturbation function that takes inputs
                 and optionally baselines and returns perturbed inputs
 
         Returns:
 
-            default_perturb_func(callable): Internal default perturbation
+            default_perturb_func (Callable): Internal default perturbation
             function that computes the perturbations internally and returns
             perturbations and perturbed inputs.
 
@@ -126,7 +126,7 @@ def infidelity(
     and the differences between the predictor function at its input
     and perturbed input.
     More details about the measure can be found in the following paper:
-    https://arxiv.org/pdf/1901.09392.pdf
+    https://arxiv.org/abs/1901.09392
 
     It is derived from the completeness property of well-known attribution
     algorithms and is a computationally more efficient and generalized
@@ -134,7 +134,7 @@ def infidelity(
     of the attributions and the differences of the predictor function at
     its input and fixed baseline. More details about the Sensitivity-n can
     be found here:
-    https://arxiv.org/pdf/1711.06104.pdfs
+    https://arxiv.org/abs/1711.06104
 
     The users can perturb the inputs any desired way by providing any
     perturbation function that takes the inputs (and optionally baselines)
@@ -147,10 +147,10 @@ def infidelity(
 
     Args:
 
-        forward_func (callable):
+        forward_func (Callable):
                 The forward function of the model or any modification of it.
 
-        perturb_func (callable):
+        perturb_func (Callable):
                 The perturbation function of model inputs. This function takes
                 model inputs and optionally baselines as input arguments and returns
                 either a tuple of perturbations and perturbed inputs or just
@@ -205,12 +205,13 @@ def infidelity(
                    Similar to previous case here as well we need to return only
                    perturbed inputs in case `infidelity_perturb_func_decorator`
                    decorates out `perturb_func`.
+
                 It is important to note that for performance reasons `perturb_func`
                 isn't called for each example individually but on a batch of
                 input examples that are repeated `max_examples_per_batch / batch_size`
                 times within the batch.
 
-        inputs (tensor or tuple of tensors):  Input for which
+        inputs (Tensor or tuple of Tensor): Input for which
                 attributions are computed. If forward_func takes a single
                 tensor as input, a single input tensor should be provided.
                 If forward_func takes multiple tensors as input, a tuple
@@ -220,7 +221,7 @@ def infidelity(
                 multiple input tensors are provided, the examples must
                 be aligned appropriately.
 
-        baselines (scalar, tensor, tuple of scalars or tensors, optional):
+        baselines (scalar, Tensor, tuple of scalar, or Tensor, optional):
                 Baselines define reference values which sometimes represent ablated
                 values and are used to compare with the actual inputs to compute
                 importance scores in attribution algorithms. They can be represented
@@ -249,13 +250,13 @@ def infidelity(
 
                 Default: None
 
-        attributions (tensor or tuple of tensors):
+        attributions (Tensor or tuple of Tensor):
                 Attribution scores computed based on an attribution algorithm.
                 This attribution scores can be computed using the implementations
                 provided in the `captum.attr` package. Some of those attribution
                 approaches are so called global methods, which means that
                 they factor in model inputs' multiplier, as described in:
-                https://arxiv.org/pdf/1711.06104.pdf
+                https://arxiv.org/abs/1711.06104
                 Many global attribution algorithms can be used in local modes,
                 meaning that the inputs multiplier isn't factored in the
                 attribution scores.
@@ -271,7 +272,7 @@ def infidelity(
 
                 For local attributions we can use real-valued perturbations
                 whereas for global attributions that perturbation is binary.
-                https://arxiv.org/pdf/1901.09392.pdf
+                https://arxiv.org/abs/1901.09392
 
                 If we want to compute the infidelity of global attributions we
                 can use a binary perturbation matrix that will allow us to select
@@ -291,7 +292,7 @@ def infidelity(
                 tensor as well. If inputs is provided as a tuple of tensors
                 then attributions will be tuples of tensors as well.
 
-        additional_forward_args (any, optional): If the forward function
+        additional_forward_args (Any, optional): If the forward function
                 requires additional arguments other than the inputs for
                 which attributions should not be computed, this argument
                 can be provided. It must be either a single additional
@@ -304,7 +305,7 @@ def infidelity(
                 being passed to `perturb_func` as an input argument.
 
                 Default: None
-        target (int, tuple, tensor or list, optional): Indices for selecting
+        target (int, tuple, Tensor, or list, optional): Indices for selecting
                 predictions from output(for classification cases,
                 this is usually the target class).
                 If the network returns a scalar value per example, no target
@@ -365,7 +366,7 @@ def infidelity(
                 Default: False
     Returns:
 
-        infidelities (tensor): A tensor of scalar infidelity scores per
+        infidelities (Tensor): A tensor of scalar infidelity scores per
                 input example. The first dimension is equal to the
                 number of examples in the input batch and the second
                 dimension is one.
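Reviewer note: the `infidelity` docstring describes a mean-squared error between the attributions multiplied by a perturbation and the change in the model output between the input and the perturbed input. The sketch below works through that quantity for a single example in pure Python. It assumes `perturb_func` returns a `(perturbation, perturbed_input)` pair with `perturbed_input = input - perturbation`; `infidelity_sketch`, `linear_model`, and `gaussian_perturb` are hypothetical names, and batching and `target` selection are left out.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def infidelity_sketch(
    forward_func: Callable[[Vector], float],
    perturb_func: Callable[[Vector], Tuple[Vector, Vector]],
    inputs: Vector,
    attributions: Vector,
    n_perturb_samples: int = 10,
) -> float:
    """Average squared gap between explained and actual output change."""
    total = 0.0
    for _ in range(n_perturb_samples):
        perturbation, perturbed = perturb_func(inputs)
        explained = dot(perturbation, attributions)
        actual = forward_func(inputs) - forward_func(perturbed)
        total += (explained - actual) ** 2
    return total / n_perturb_samples


weights = [0.5, -1.0, 2.0]
rng = random.Random(0)


def linear_model(x: Vector) -> float:
    return dot(weights, x)


def gaussian_perturb(x: Vector) -> Tuple[Vector, Vector]:
    noise = [rng.gauss(0.0, 0.1) for _ in x]
    return noise, [xi - ni for xi, ni in zip(x, noise)]


x = [1.0, 2.0, 3.0]
# For a linear model, the weights themselves are a faithful local attribution.
faithful = infidelity_sketch(linear_model, gaussian_perturb, x, weights)
unfaithful = infidelity_sketch(linear_model, gaussian_perturb, x, [0.0, 0.0, 0.0])
```

With a linear model, the weight vector explains every perturbation exactly, so its score is zero up to rounding, while all-zero attributions score strictly higher.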
diff --git a/captum/metrics/_core/sensitivity.py b/captum/metrics/_core/sensitivity.py
index 77d87e6291..f0c841a5a1 100644
--- a/captum/metrics/_core/sensitivity.py
+++ b/captum/metrics/_core/sensitivity.py
@@ -30,8 +30,8 @@ def default_perturb_func(
 
     Args:
 
-        inputs (tensor or a tuple of tensors): The input tensors that we'd
-                like to perturb by adding a random noise sampled unifromly
+        inputs (Tensor or tuple of Tensor): The input tensors that we'd
+                like to perturb by adding random noise sampled uniformly at
                 random from an L_infinity ball with a radius `perturb_radius`.
 
         radius (float): A radius used for sampling from
@@ -39,8 +39,8 @@ def default_perturb_func(
 
     Returns:
 
-        perturbed_input (tuple(tensor)): A list of perturbed inputs that
-                are createed by adding noise sampled uniformly random
+        perturbed_input (tuple of Tensor): A list of perturbed inputs that
+                are created by adding noise sampled uniformly at random
                 from L_infiniy ball with a radius `perturb_radius` to the
                 original inputs.
 
@@ -90,7 +90,7 @@ def sensitivity_max(
 
     More about the Lipschitz Continuity Metric can also be found here
     `On the Robustness of Interpretability Methods`
-    https://arxiv.org/pdf/1806.08049.pdf
+    https://arxiv.org/abs/1806.08049
     and
     `Towards Robust Interpretability with Self-Explaining Neural Networks`
     https://papers.nips.cc/paper\
@@ -99,16 +99,16 @@ def sensitivity_max(
 
     More details about sensitivity max can be found here:
     `On the (In)fidelity and Sensitivity of Explanations`
-    https://arxiv.org/pdf/1901.09392.pdf
+    https://arxiv.org/abs/1901.09392
 
     Args:
 
-        explanation_func (callable):
+        explanation_func (Callable):
                 This function can be the `attribute` method of an
                 attribution algorithm or any other explanation method
                 that returns the explanations.
 
-        inputs (tensor or tuple of tensors):  Input for which
+        inputs (Tensor or tuple of Tensor): Input for which
                 explanations are computed. If `explanation_func` takes a
                 single tensor as input, a single input tensor should
                 be provided.
@@ -119,7 +119,7 @@ def sensitivity_max(
                 multiple input tensors are provided, the examples must
                 be aligned appropriately.
 
-        perturb_func (callable):
+        perturb_func (Callable):
                 The perturbation function of model inputs. This function takes
                 model inputs and optionally `perturb_radius` if
                 the function takes more than one argument and returns
@@ -138,7 +138,7 @@ def sensitivity_max(
         perturb_radius (float, optional): The epsilon radius used for sampling.
             In the `default_perturb_func` it is used as the radius of
             the L-Infinity ball. In a general case it can serve as a radius of
-            any L_p nom.
+            any L_p norm.
             This argument is passed to `perturb_func` if it takes more than
             one argument.
 
@@ -149,10 +149,12 @@ def sensitivity_max(
                 `perturb_func` function.
 
                 Default: 10
-        norm_ord (int, float, inf, -inf, 'fro', 'nuc', optional): The type of norm
-                that is used to compute the
-                norm of the sensitivity matrix which is defined as the difference
-                between the explanation function at its input and perturbed input.
+        norm_ord (int, float, or str, optional): The type of norm that is used to
+                compute the norm of the sensitivity matrix which is defined as the
+                difference between the explanation function at its input and perturbed
+                input. Acceptable values are either a string of 'fro' or 'nuc', or a
+                number in the range of [-inf, inf] (including float("-inf") and
+                float("inf")).
 
                 Default: 'fro'
         max_examples_per_batch (int, optional): The number of maximum input
@@ -176,7 +178,7 @@ def sensitivity_max(
 
     Returns:
 
-        sensitivities (tensor): A tensor of scalar sensitivity scores per
+        sensitivities (Tensor): A tensor of scalar sensitivity scores per
                input example. The first dimension is equal to the
                number of examples in the input batch and the second
                dimension is one. Returned sensitivities are normalized by
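Review note on the `norm_ord` docstring above: the "sensitivity matrix" is the difference between the explanation at the original input and at a perturbed input, and the metric keeps the largest norm over the sampled perturbations. A minimal stdlib sketch of that idea follows, using the default 'fro' norm. The names `frobenius_norm`, `max_sensitivity`, and `explanation_func` are illustrative assumptions, not Captum's implementation, and the sketch leaves out the normalization the docstring mentions.

```python
import math
from typing import Callable, List, Sequence


def frobenius_norm(values: Sequence[float]) -> float:
    # Frobenius norm of a flattened matrix: sqrt of the sum of squares.
    return math.sqrt(sum(v * v for v in values))


def max_sensitivity(
    explanation_func: Callable[[List[float]], List[float]],
    x: List[float],
    perturbed_inputs: Sequence[List[float]],
) -> float:
    # Hypothetical helper: largest norm of the explanation difference
    # over all supplied perturbed copies of x.
    base = explanation_func(x)
    worst = 0.0
    for x_pert in perturbed_inputs:
        pert = explanation_func(x_pert)
        diff = [p - b for p, b in zip(pert, base)]
        worst = max(worst, frobenius_norm(diff))
    return worst


# Toy explanation function (an assumption for illustration): doubles each feature.
toy_explanation = lambda values: [2.0 * v for v in values]
score = max_sensitivity(toy_explanation, [1.0, 0.0], [[1.0, 1.5], [3.0, 0.0]])
```

Here the two perturbed inputs give explanation differences of norm 3 and 4, so `score` is the larger of the two.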
diff --git a/captum/metrics/_utils/batching.py b/captum/metrics/_utils/batching.py
index ee3b38f58e..83a773bda3 100644
--- a/captum/metrics/_utils/batching.py
+++ b/captum/metrics/_utils/batching.py
@@ -28,9 +28,9 @@ def _divide_and_aggregate_metrics(
                         attributions for.
         n_perturb_samples (int): The number of samples per example that are used for
                         perturbation purposes for example.
-        metric_func (callable): This function takes the number of samples per
+        metric_func (Callable): This function takes the number of samples per
                         input batch and returns an overall metric for each example.
-        agg_func (callable, optional): This function is used to aggregate the
+        agg_func (Callable, optional): This function is used to aggregate the
                         metrics across multiple sub-batches and that are
                         generated by `metric_func`.
         max_examples_per_batch (int, optional): The maximum number of allowed examples
@@ -38,7 +38,7 @@ def _divide_and_aggregate_metrics(
 
         Returns:
 
-            metric (tensor): A metric score estimated by `metric_func` per
+            metric (Tensor): A metric score estimated by `metric_func` per
                         input example.
     """
     bsz = inputs[0].size(0)
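Review note on the `metric_func` / `agg_func` contract documented above: `metric_func` receives the number of perturbation samples to process for the current sub-batch, and `agg_func` combines the per-example results across sub-batches. The stdlib sketch below shows one way such a loop could look. The chunk-size rule and the name `divide_and_aggregate` are assumptions made for illustration and may differ from the actual helper.

```python
from typing import Callable, List, Optional


def divide_and_aggregate(
    batch_size: int,
    n_perturb_samples: int,
    metric_func: Callable[[int], List[float]],
    agg_func: Callable[[List[float], List[float]], List[float]],
    max_examples_per_batch: Optional[int] = None,
) -> List[float]:
    # Assumed rule: how many perturbation samples per example fit in one sub-batch.
    if max_examples_per_batch is None:
        chunk = n_perturb_samples
    else:
        chunk = min(max(max_examples_per_batch // batch_size, 1), n_perturb_samples)

    remaining = n_perturb_samples
    result: Optional[List[float]] = None
    while remaining > 0:
        current = min(chunk, remaining)
        partial = metric_func(current)  # one metric value per example
        result = partial if result is None else agg_func(result, partial)
        remaining -= current
    if result is None:
        raise ValueError("n_perturb_samples must be positive")
    return result


# Toy usage: record the chunk sizes and sum the partial metrics element-wise.
calls: List[int] = []


def toy_metric(n_samples: int) -> List[float]:
    calls.append(n_samples)
    return [float(n_samples)] * 2  # batch of two examples


def toy_sum(a: List[float], b: List[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


total = divide_and_aggregate(2, 10, toy_metric, toy_sum, max_examples_per_batch=8)
```

With a batch of two examples and a cap of eight examples per forward pass, the ten perturbation samples are processed in chunks of 4, 4, and 2.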
diff --git a/captum/robust/_core/fgsm.py b/captum/robust/_core/fgsm.py
index f717481ccd..0e42d08c37 100644
--- a/captum/robust/_core/fgsm.py
+++ b/captum/robust/_core/fgsm.py
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-from typing import Any, Callable, Tuple
+from typing import Any, Callable, Optional, Tuple
 
 import torch
 from captum._utils.common import (
@@ -21,37 +21,44 @@
 
 class FGSM(Perturbation):
     r"""
-    Fast Gradient Sign Method is an one-step method that can generate
-    adversarial examples. For non-targeted attack, the formulation is
-    x' = x + epsilon * sign(gradient of L(theta, x, y)).
-    For targeted attack on t, the formulation is
-    x' = x - epsilon * sign(gradient of L(theta, x, t)).
-    L(theta, x, y) is the model's loss function with respect to model
+    Fast Gradient Sign Method is a one-step method that can generate
+    adversarial examples.
+
+    For a non-targeted attack, the formulation is::
+
+        x' = x + epsilon * sign(gradient of L(theta, x, y))
+
+    For a targeted attack on t, the formulation is::
+
+        x' = x - epsilon * sign(gradient of L(theta, x, t))
+
+    ``L(theta, x, y)`` is the model's loss function with respect to model
     parameters, inputs and labels.
 
     More details on Fast Gradient Sign Method can be found in the original
-    paper:
-    https://arxiv.org/pdf/1412.6572.pdf
+    paper: https://arxiv.org/abs/1412.6572
     """
 
     def __init__(
         self,
         forward_func: Callable,
-        loss_func: Callable = None,
+        loss_func: Optional[Callable] = None,
         lower_bound: float = float("-inf"),
         upper_bound: float = float("inf"),
     ) -> None:
         r"""
         Args:
-            forward_func (callable): The pytorch model for which the attack is
+            forward_func (Callable): The pytorch model for which the attack is
                         computed.
-            loss_func (callable, optional): Loss function of which the gradient
+            loss_func (Callable, optional): Loss function of which the gradient
                         computed. The loss function should take in outputs of the
                         model and labels, and return a loss tensor.
                         The default loss function is negative log.
             lower_bound (float, optional): Lower bound of input values.
+                        Default: ``float("-inf")``
             upper_bound (float, optional): Upper bound of input values.
                         e.g. image pixels must be in the range 0-255
+                        Default: ``float("inf")``
 
         Attributes:
             bound (Callable): A function that bounds the input values based on
@@ -80,13 +87,13 @@ def perturb(
 
         Args:
 
-            inputs (tensor or tuple of tensors): Input for which adversarial
+            inputs (Tensor or tuple of Tensor): Input for which adversarial
                         attack is computed. It can be provided as a single
                         tensor or a tuple of multiple tensors. If multiple
                         input tensors are provided, the batch sizes must be
-                        aligned accross all tensors.
+                        aligned across all tensors.
             epsilon (float): Step size of perturbation.
-            target (any): True labels of inputs if non-targeted attack is
+            target (Any): True labels of inputs if non-targeted attack is
                         desired. Target class of inputs if targeted attack
                         is desired. Target will be passed to the loss function
                         to compute loss, so the type needs to match the
@@ -112,7 +119,8 @@ def perturb(
                           examples in inputs (dim 0), and each tuple containing
                           #output_dims - 1 elements. Each tuple is applied as the
                           label for the corresponding example.
-            additional_forward_args (any, optional): If the forward function
+
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. These arguments are provided to
@@ -124,7 +132,7 @@ def perturb(
 
         Returns:
 
-            - **perturbed inputs** (*tensor* or tuple of *tensors*):
+            - **perturbed inputs** (*Tensor* or tuple of *Tensor*):
                         Perturbed input for each
                         input tensor. The perturbed inputs have the same shape and
                         dimensionality as the inputs.
@@ -167,7 +175,7 @@ def _perturb(
         r"""
         A helper function to calculate the perturbed inputs given original
         inputs, gradient of loss function and epsilon. The calculation is
-        different for targetd v.s. non-targeted as described above.
+        different for targeted vs. non-targeted as described above.
         """
         multiplier = -1 if targeted else 1
         inputs = tuple(
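Review note on the FGSM docstring in this file: the two formulations differ only in the sign of the step, which matches the `multiplier = -1 if targeted else 1` line in `_perturb`. The sketch below restates that update on plain Python lists so it can be read without PyTorch. `fgsm_step` and `sign` are hypothetical names, the gradient is assumed to be precomputed, and the clamping stands in for the `lower_bound` / `upper_bound` arguments rather than reproducing the class's `bound` attribute.

```python
from typing import List


def sign(value: float) -> int:
    # Returns 1, -1, or 0, matching the usual sign convention.
    return (value > 0) - (value < 0)


def fgsm_step(
    inputs: List[float],
    grads: List[float],
    epsilon: float,
    targeted: bool = False,
    lower_bound: float = float("-inf"),
    upper_bound: float = float("inf"),
) -> List[float]:
    # Non-targeted: x' = x + epsilon * sign(grad); targeted: x' = x - epsilon * sign(grad).
    multiplier = -1 if targeted else 1
    perturbed = [x + multiplier * epsilon * sign(g) for x, g in zip(inputs, grads)]
    # Keep every value inside the allowed input range.
    return [min(max(p, lower_bound), upper_bound) for p in perturbed]


untargeted = fgsm_step([0.5, 0.5, 0.5], [2.0, -3.0, 0.0], epsilon=0.25)
targeted = fgsm_step([0.5, 0.5, 0.5], [2.0, -3.0, 0.0], epsilon=0.25, targeted=True)
bounded = fgsm_step([0.5, 0.5, 0.5], [2.0, -3.0, 0.0], epsilon=0.25, upper_bound=0.6)
```

Only the sign of each gradient entry matters, so a zero gradient leaves the corresponding input unchanged.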
diff --git a/captum/robust/_core/metrics/attack_comparator.py b/captum/robust/_core/metrics/attack_comparator.py
index b9ebb59ad6..7964711883 100644
--- a/captum/robust/_core/metrics/attack_comparator.py
+++ b/captum/robust/_core/metrics/attack_comparator.py
@@ -60,15 +60,15 @@ def __init__(
         self,
         forward_func: Callable,
         metric: Callable[..., MetricResultType],
-        preproc_fn: Callable = None,
+        preproc_fn: Optional[Callable] = None,
     ) -> None:
         r"""
         Args:
-            forward_func (callable or torch.nn.Module): This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                 of pytorch model or any modification of a model's forward
                 function.
 
-            metric (callable): This function is applied to the model output in
+            metric (Callable): This function is applied to the model output in
                 order to compute the desired performance metric or metrics.
                 This function should have the following signature::
 
@@ -85,9 +85,10 @@ def __init__(
                 If tensor metrics represent results for the full batch, the size of the
                 first dimension should be 1.
 
-            preproc_fn (callable, optional): Optional method applied to inputs. Output
+            preproc_fn (Callable, optional): Optional method applied to inputs. Output
                 of preproc_fn is then provided as input to model, in addition to
                 additional_forward_args provided to evaluate.
+                Default: ``None``
         """
         self.forward_func = forward_func
         self.metric: Callable = metric
@@ -113,7 +114,8 @@ def add_attack(
         Adds attack to be evaluated when calling evaluate.
 
         Args:
-            attack (perturbation or callable): This can either be an instance
+
+            attack (Perturbation or Callable): This can either be an instance
                 of a Captum Perturbation / Attack
                 or any other perturbation or attack function such
                 as a torchvision transform.
@@ -121,23 +123,29 @@ def add_attack(
             name (str, optional): Name or identifier for attack, used as key for
                 attack results. This defaults to attack.__class__.__name__
                 if not provided and must be unique for all added attacks.
+                Default: ``None``
 
-            num_attempts (int): Number of attempts that attack should be
+            num_attempts (int, optional): Number of attempts that attack should be
                 repeated. This should only be set to > 1 for non-deterministic
                 attacks. The minimum, maximum, and average (best, worst, and
                 average case) are tracked for attack attempts.
-
-            apply_before_preproc (bool): Defines whether attack should be applied
-                before or after preproc function.
-
-            attack_kwargs (dict): Additional arguments to be provided to given attack.
-                This should be provided as a dictionary of keyword arguments.
-
-            additional_attack_arg_names (list[str]): Any additional arguments for the
-                attack which are specific to the particular input example or batch.
-                An example of this is target, which is necessary for some attacks such
-                as FGSM or PGD. These arguments are included if provided as a kwarg
-                to evaluate.
+                Default: ``1``
+
+            apply_before_preproc (bool, optional): Defines whether attack should be
+                applied before or after preproc function.
+                Default: ``True``
+
+            attack_kwargs (dict, optional): Additional arguments to be provided to
+                given attack. This should be provided as a dictionary of keyword
+                arguments.
+                Default: ``None``
+
+            additional_attack_arg_names (list[str], optional): Any additional
+                arguments for the attack which are specific to the particular input
+                example or batch. An example of this is target, which is necessary
+                for some attacks such as FGSM or PGD. These arguments are included
+                if provided as a kwarg to evaluate.
+                Default: ``None``
         """
         if name is None:
             name = attack.__class__.__name__
@@ -239,7 +247,7 @@ def evaluate(
 
         Args:
 
-        inputs (any): Input for which attack metrics
+            inputs (Any): Input for which attack metrics
                 are computed. It can be provided as a tensor, tuple of tensors,
                 or any raw input type (e.g. PIL image or text string).
                 This input is provided directly as input to preproc function as well
@@ -247,7 +255,7 @@ def evaluate(
                 function is provided, this input is provided directly to the main
                 model and all attacks.
 
-        additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                 requires additional arguments other than the preprocessing
                 outputs (or inputs if preproc_fn is None), this argument
                 can be provided. It must be either a single additional
@@ -259,8 +267,8 @@ def evaluate(
                 For a tensor, the first dimension of the tensor must
                 correspond to the number of examples. For all other types,
                 the given argument is used for all forward evaluations.
-                Default: None
-        perturbations_per_eval (int, optional): Allows perturbations of multiple
+                Default: ``None``
+            perturbations_per_eval (int, optional): Allows perturbations of multiple
                 attacks to be grouped and evaluated in one call of forward_fn
                 Each forward pass will contain a maximum of
                 perturbations_per_eval * #examples samples.
@@ -272,9 +280,10 @@ def evaluate(
                 In order to apply this functionality, the output of preproc_fn
                 (or inputs itself if no preproc_fn is provided) must be a tensor
                 or tuple of tensors.
-                Default: 1
-        kwargs (any, optional): Additional keyword arguments provided to metric function
-                as well as selected attacks based on chosen additional_args
+                Default: ``1``
+            kwargs (Any, optional): Additional keyword arguments provided to metric
+                function as well as selected attacks based on chosen additional_args.
+                Default: ``None``
 
         Returns:
 
diff --git a/captum/robust/_core/metrics/min_param_perturbation.py b/captum/robust/_core/metrics/min_param_perturbation.py
index 99308727e4..95b2897a08 100644
--- a/captum/robust/_core/metrics/min_param_perturbation.py
+++ b/captum/robust/_core/metrics/min_param_perturbation.py
@@ -63,7 +63,7 @@ def __init__(
         corresponding perturbed input.
 
         Args:
-            forward_func (callable or torch.nn.Module): This can either be an instance
+            forward_func (Callable or torch.nn.Module): This can either be an instance
                 of pytorch model or any modification of a model's forward
                 function.
 
@@ -85,23 +85,23 @@ def __init__(
             arg_step (int, float): Minimum interval for increase of target variable.
 
             mode (str, optional): Mode for search of minimum attack value;
-                either 'linear' for linear search on variable, or 'binary' for
+                either ``linear`` for linear search on variable, or ``binary`` for
                 binary search of variable
-                Default: 'linear'
+                Default: ``linear``
 
             num_attempts (int, optional): Number of attempts or trials with
                 given variable. This should only be set to > 1 for non-deterministic
                 perturbation / attack functions
-                Default: 1
+                Default: ``1``
 
-            preproc_fn (callable, optional): Optional method applied to inputs. Output
+            preproc_fn (Callable, optional): Optional method applied to inputs. Output
                 of preproc_fn is then provided as input to model, in addition to
                 additional_forward_args provided to evaluate.
-                Default: None
+                Default: ``None``
 
             apply_before_preproc (bool, optional): Defines whether attack should be
                 applied before or after preproc function.
-                Default: False
+                Default: ``False``
 
             correct_fn (Callable, optional): This determines whether the perturbed input
                 leads to a correct or incorrect prediction. By default, this function
@@ -114,13 +114,15 @@ def __init__(
                 function must be provided which determines correctness.
 
                 The first argument to this function must be the model out;
-                any additional arguments should be provided through correct_fn_kwargs.
+                any additional arguments should be provided through
+                ``correct_fn_kwargs``.
 
                 This function should have the following signature:
+
                     def correct_fn(model_out: Tensor, **kwargs: Any) -> bool
 
                 Method should return a boolean if correct (True) and incorrect (False).
-                Default: None (applies standard correct_fn for classification)
+                Default: ``None`` (applies standard correct_fn for classification)
         """
         self.forward_func = forward_func
         self.attack = attack
@@ -363,7 +365,7 @@ def evaluate(
                     pre-processing function is provided,
                     this input is provided directly to the main model and all attacks.
 
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                     requires additional arguments other than the preprocessing
                     outputs (or inputs if preproc_fn is None), this argument
                     can be provided. It must be either a single additional
@@ -375,9 +377,9 @@ def evaluate(
                     For a tensor, the first dimension of the tensor must
                     correspond to the number of examples. For all other types,
                     the given argument is used for all forward evaluations.
-                    Default: None
+                    Default: ``None``
             target (TargetType): Target class for classification. This is required if
-                using the default correct_fn
+                using the default ``correct_fn``.
 
             perturbations_per_eval (int, optional): Allows perturbations of multiple
                     attacks to be grouped and evaluated in one call of forward_fn
@@ -391,10 +393,10 @@ def evaluate(
                     In order to apply this functionality, the output of preproc_fn
                     (or inputs itself if no preproc_fn is provided) must be a tensor
                     or tuple of tensors.
-                    Default: 1
-            attack_kwargs (dictionary, optional): Optional dictionary of keyword
+                    Default: ``1``
+            attack_kwargs (dict, optional): Optional dictionary of keyword
                     arguments provided to attack function
-            correct_fn_kwargs (dictionary, optional): Optional dictionary of keyword
+            correct_fn_kwargs (dict, optional): Optional dictionary of keyword
                     arguments provided to correct function
 
         Returns:
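Review note on the `mode` argument documented in this file ('linear' vs 'binary'): the sketch below contrasts the two search strategies for the smallest perturbation value that makes a prediction incorrect. It is an illustration under stated assumptions, not the class's code. `arg_min` and `arg_max` are assumed range parameters, `is_correct` is a hypothetical stand-in for running the attack and `correct_fn`, the range is treated as inclusive, and the binary variant additionally assumes that once a value causes a failure, every larger value does too.

```python
from typing import Callable, Optional


def min_param_linear(
    is_correct: Callable[[float], bool],
    arg_min: float,
    arg_max: float,
    arg_step: float,
) -> Optional[float]:
    # Try each value in increasing order and stop at the first failure.
    n_steps = int(round((arg_max - arg_min) / arg_step))
    for i in range(n_steps + 1):
        value = arg_min + i * arg_step
        if not is_correct(value):
            return value
    return None  # the model stayed correct over the whole range


def min_param_binary(
    is_correct: Callable[[float], bool],
    arg_min: float,
    arg_max: float,
    arg_step: float,
) -> Optional[float]:
    # Search over step indices; assumes failures are monotone in the value.
    lo, hi = 0, int(round((arg_max - arg_min) / arg_step))
    if is_correct(arg_min + hi * arg_step):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_correct(arg_min + mid * arg_step):
            lo = mid + 1
        else:
            hi = mid
    return arg_min + lo * arg_step


# Toy predicate: the prediction stays correct while the perturbation is below 0.5.
still_correct = lambda value: value < 0.5
linear_result = min_param_linear(still_correct, 0.0, 1.0, 0.125)
binary_result = min_param_binary(still_correct, 0.0, 1.0, 0.125)
```

Linear search makes no monotonicity assumption but evaluates every candidate up to the answer; binary search needs far fewer evaluations when that assumption holds.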
diff --git a/captum/robust/_core/perturbation.py b/captum/robust/_core/perturbation.py
index 9eb6d53481..c47b02dd78 100644
--- a/captum/robust/_core/perturbation.py
+++ b/captum/robust/_core/perturbation.py
@@ -18,15 +18,15 @@ class Perturbation:
 
     Args:
 
-        inputs (tensor or tuple of tensors): Input for which adversarial attack
+        inputs (Tensor or tuple of Tensor): Input for which adversarial attack
                     is computed. It can be provided as a single tensor or
                     a tuple of multiple tensors. If multiple input tensors
-                    are provided, the batch sizes must be aligned accross all
+                    are provided, the batch sizes must be aligned across all
                     tensors.
 
     Returns:
 
-        - **perturbed inputs** (*tensor* or tuple of *tensors*):
+        - **perturbed inputs** (*Tensor* or tuple of *Tensor*):
                     Perturbed input for each
                     input tensor. The perturbed inputs have the same shape and
                     dimensionality as the inputs.
diff --git a/captum/robust/_core/pgd.py b/captum/robust/_core/pgd.py
index b14239c681..733cbcc488 100644
--- a/captum/robust/_core/pgd.py
+++ b/captum/robust/_core/pgd.py
@@ -31,8 +31,7 @@ class PGD(Perturbation):
         x_(t+1) = Clip_r(x_t - alpha * sign(gradient of L(theta, x, t)))
 
     More details on Projected Gradient Descent can be found in the original
-    paper:
-    https://arxiv.org/pdf/1706.06083.pdf
+    paper: https://arxiv.org/abs/1706.06083
     """
 
     def __init__(
@@ -44,15 +43,17 @@ def __init__(
     ) -> None:
         r"""
         Args:
-            forward_func (callable): The pytorch model for which the attack is
+            forward_func (Callable): The pytorch model for which the attack is
                         computed.
-            loss_func (callable, optional): Loss function of which the gradient
+            loss_func (Callable, optional): Loss function of which the gradient
                         computed. The loss function should take in outputs of the
                         model and labels, and return the loss for each input tensor.
                         The default loss function is negative log.
             lower_bound (float, optional): Lower bound of input values.
+                        Default: ``float("-inf")``
             upper_bound (float, optional): Upper bound of input values.
                         e.g. image pixels must be in the range 0-255
+                        Default: ``float("inf")``
 
         Attributes:
             bound (Callable): A function that bounds the input values based on
@@ -82,17 +83,17 @@ def perturb(
 
         Args:
 
-            inputs (tensor or tuple of tensors): Input for which adversarial
+            inputs (Tensor or tuple of Tensor): Input for which adversarial
                         attack is computed. It can be provided as a single
                         tensor or a tuple of multiple tensors. If multiple
                         input tensors are provided, the batch sizes must be
-                        aligned accross all tensors.
+                        aligned across all tensors.
             radius (float): Radius of the neighbor ball centered around inputs.
                         The perturbation should be within this range.
             step_size (float): Step size of each gradient step.
             step_num (int): Step numbers. It usually guarantees that the perturbation
                         can reach the border.
-            target (any): True labels of inputs if non-targeted attack is
+            target (Any): True labels of inputs if non-targeted attack is
                         desired. Target class of inputs if targeted attack
                         is desired. Target will be passed to the loss function
                         to compute loss, so the type needs to match the
@@ -118,23 +119,23 @@ def perturb(
                           examples in inputs (dim 0), and each tuple containing
                           #output_dims - 1 elements. Each tuple is applied as the
                           label for the corresponding example.
-            additional_forward_args (any, optional): If the forward function
+            additional_forward_args (Any, optional): If the forward function
                         requires additional arguments other than the inputs for
                         which attributions should not be computed, this argument
                         can be provided. These arguments are provided to
                         forward_func in order following the arguments in inputs.
-                        Default: None.
+                        Default: ``None``
             targeted (bool, optional): If attack should be targeted.
-                        Default: False.
+                        Default: ``False``
             random_start (bool, optional): If a random initialization is added to
-                        inputs. Default: False.
+                        inputs. Default: ``False``
             norm (str, optional): Specifies the norm to calculate distance from
-                        original inputs: 'Linf'|'L2'.
-                        Default: 'Linf'.
+                        original inputs: ``Linf`` | ``L2``.
+                        Default: ``Linf``
 
         Returns:
 
-            - **perturbed inputs** (*tensor* or tuple of *tensors*):
+            - **perturbed inputs** (*Tensor* or tuple of *Tensor*):
                         Perturbed input for each
                         input tensor. The perturbed inputs have the same shape and
                         dimensionality as the inputs.
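Review note on the PGD docstring in this file: each iteration takes a signed gradient step of size `step_size` and then applies `Clip_r`, which projects the result back into the ball of the given `radius` around the original inputs. Below is a stdlib sketch for the ``Linf`` case only. `pgd_linf` and `grad_fn` are hypothetical names, the gradient function is assumed to be supplied by the caller, and `random_start` and the ``L2`` norm are left out.

```python
from typing import Callable, List


def pgd_linf(
    inputs: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    radius: float,
    step_size: float,
    step_num: int,
    targeted: bool = False,
) -> List[float]:
    multiplier = -1 if targeted else 1
    current = list(inputs)
    for _ in range(step_num):
        grads = grad_fn(current)
        # Signed gradient step, as in a single FGSM update.
        stepped = [
            x + multiplier * step_size * ((g > 0) - (g < 0))
            for x, g in zip(current, grads)
        ]
        # Clip_r: project back into the L-infinity ball around the original inputs.
        current = [
            min(max(s, x0 - radius), x0 + radius)
            for s, x0 in zip(stepped, inputs)
        ]
    return current


# Toy gradient (an assumption): constant, positive for the first feature, negative for the second.
constant_grad = lambda values: [1.0, -1.0]
adv = pgd_linf([1.0, 1.0], constant_grad, radius=0.25, step_size=0.125, step_num=4)
adv_targeted = pgd_linf(
    [1.0, 1.0], constant_grad, radius=0.25, step_size=0.125, step_num=4, targeted=True
)
```

After two steps the perturbation reaches the border of the ball, and the remaining steps are clipped back to it, which is the behavior the `step_num` description refers to.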
diff --git a/docs/algorithms.md b/docs/attribution_algorithms.md
similarity index 99%
rename from docs/algorithms.md
rename to docs/attribution_algorithms.md
index b06a8aa5f1..f1d00a8f53 100644
--- a/docs/algorithms.md
+++ b/docs/attribution_algorithms.md
@@ -1,5 +1,5 @@
 ---
-id: algorithms
+id: attribution_algorithms
 title: Algorithm Descriptions
 ---
 
diff --git a/docs/contribution_guide.md b/docs/contribution_guide.md
index f8aacf1c85..82e4f158a2 100644
--- a/docs/contribution_guide.md
+++ b/docs/contribution_guide.md
@@ -4,7 +4,7 @@ title: The Captum Contribution Process
 ---
 
 The Captum development process involves a healthy amount of open discussions between the core development team and the community.
-Captum operates similar to most open source projects on GitHub. However, if you've never contributed to an open source project before, here is the basic process.
+Captum operates similarly to most open source projects on GitHub. However, if you've never contributed to an open source project before, here is the basic process.
 
 
 1. **Figure out what you're going to work on.**
@@ -59,7 +59,7 @@ https://captum.ai/tutorials/Bert_SQUAD_Interpret
 https://captum.ai/tutorials/IMDB_TorchText_Interpret
 
 **Vision**
-- We provide a sample toy model for CIFAR dataset and examples with ResNet model.
+- We provide a sample toy model for the CIFAR dataset and examples with a ResNet model.
 https://captum.ai/tutorials/CIFAR_TorchVision_Interpret
 https://captum.ai/tutorials/Resnet_TorchVision_Interpret
 These would be great starting points for benchmarking.
diff --git a/docs/extension/integrated_gradients.md b/docs/extension/integrated_gradients.md
index 0a00fb0ad1..ebcca190ec 100644
--- a/docs/extension/integrated_gradients.md
+++ b/docs/extension/integrated_gradients.md
@@ -42,7 +42,7 @@ class ToyModel(nn.Module):
 
 Second, let's apply integrated gradients on the toy model's output layer using sample data.
 The code snippet below computes the attribution of output with respect to the inputs.
-`attribute` method of `IntegratedGradients` class returns input attributions which
+The `attribute` method of the `IntegratedGradients` class returns input attributions which
 have the same size and dimensionality as the inputs and an approximation error which
 is computed based on the completeness property of the integrated gradients.
 Completeness property is one of the axioms that integrated gradients satisfies.
@@ -114,7 +114,7 @@ class ToySoftmaxModel(nn.Module):
 Now, let's apply integrated gradients on the toy classification model defined
 above using inputs that contain a range of numbers. We also choose an arbitrary
 target class (target_class_index: 5) which we use to attribute our predictions to.
-Similar to previous example the output of attribution is a tensor with the same
+Similar to the previous example, the output of attribution is a tensor with the same
 dimensionality as the inputs and an approximation error computed based on the
 completeness property of integrated gradients.
 
@@ -157,9 +157,9 @@ Now, let's look at a model that besides input tensors takes input arguments of
 other types. In practice this can be used to pass the sequence length or the
 word/token indices in a sequence of a text, for instance. The example below
 demonstrates how to use `additional_forward_args`. In this particular example
-`additional_forward_args` represents single integer value.
-Those arguments are passed as `additional_forward_args` to `attribute` method and
-they will be passed to model's forward function followed by inputs in the oder
+`additional_forward_args` represents a single integer value.
+Those arguments are passed as `additional_forward_args` to the `attribute` method, which
+passes them to the model's forward function after the inputs, in the order
 provided in `additional_forward_args`. In the example below, we also demonstrate
 how to apply integrated gradients to a batch of samples. The first dimension of
 the input corresponds to the batch size.
diff --git a/docs/faq.md b/docs/faq.md
index de4e22ea4c..16bf59b54a 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -9,7 +9,7 @@ title: FAQ
 * [Are SmoothGrad or VarGrad supported in Captum?](#are-smoothgrad-or-vargrad-supported-in-captum)
 * [How do I use Captum with BERT models?](#how-do-i-use-captum-with-bert-models)
 * [My model inputs or outputs token indices, and when using Captum I see errors relating to gradients, how do I resolve this?](#my-model-inputs-or-outputs-token-indices-and-when-using-captum-i-see-errors-relating-to-gradients-how-do-i-resolve-this)
-* [Can my model using functional non-linearities (E.g. nn.functional.ReLU) or reused modules be used with Captum?](#can-my-model-using-functional-non-linearities-eg-nnfunctionalrelu-or-reused-modules-be-used-with-captum)
+* [Can my model use functional non-linearities (e.g. nn.functional.relu) or can reused modules be used with Captum?](#can-my-model-use-functional-non-linearities-eg-nnfunctionalrelu-or-can-reused-modules-be-used-with-captum)
 * [Do JIT models, DataParallel models, or DistributedDataParallel models work with Captum?](#do-jit-models-dataparallel-models-or-distributeddataparallel-models-work-with-captum)
 * [I am working on a new interpretability or attribution method and would like to add it to Captum. How do I proceed?](#i-am-working-on-a-new-interpretability-or-attribution-method-and-would-like-to-add-it-to-captum-how-do-i-proceed)
 * [I am using a gradient-based attribution algorithm such as integrated gradients for a RNN or LSTM network and I see 'cudnn RNN backward can only be called in training mode'. How can I resolve this issue ?](#how-can-I-resolve-cudnn-RNN-backward-error-for-RNN-or-LSTM-network)
@@ -53,7 +53,7 @@ For NLP models that take token indices as inputs, we cannot take gradients with
 
 If the output of the model is a token index, such as an image captioning cases, it is necessary to attribute with respect to the token score or probability rather than the index. Make sure that the model returns this and use target to choose the appropriate scalar score to attribute with respect to.
 
-### **Can my model using functional non-linearities (E.g. nn.functional.ReLU) or reused modules be used with Captum?**
+### **Can my model use functional non-linearities (e.g. nn.functional.relu) or can reused modules be used with Captum?**
 
 Most methods will work fine with functional non-linearities and arbitrary operations. Some methods, which require placing hooks during back-propagation, including DeepLift, DeepLiftShap, Guided Backpropagation, and Deconvolution will not work appropriately with functional non-linearities and must use the corresponding module activation (e.g. torch.nn.ReLU) which should be initialized in the module constructor. For DeepLift, it is important to also not reuse modules in the forward function, since this can cause issues in the propagation of multipliers. Computing layer or neuron attribution with layer modules that are used multiple times generally computes attributions for the last execution of the module. For more information regarding these restrictions, refer to the API documentation for the specific method, including DeepLift, DeepLiftShap, Guided Backpropagation, and Deconvolution.
 
diff --git a/scripts/install_via_pip.sh b/scripts/install_via_pip.sh
index 7a13dedb9e..de643e0687 100755
--- a/scripts/install_via_pip.sh
+++ b/scripts/install_via_pip.sh
@@ -37,7 +37,7 @@ export TERM=xterm
 # NOTE: All of the below installs use sudo, b/c otherwise pip will get
 # permission errors installing in the docker container. An alternative would be
 # to use a virtualenv, but that would lead to bifurcation of the CircleCI config
-# since we'd need to source the environemnt in each step.
+# since we'd need to source the environment in each step.
 
 # upgrade pip
 sudo pip install --upgrade pip
@@ -55,7 +55,7 @@ fi
 if [[ $PYTORCH_NIGHTLY == true ]]; then
   sudo pip install --upgrade --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
 else
-  # If no version specified, upgrade to latest release.
+  # If no version is specified, upgrade to the latest release.
   if [[ $CHOSEN_TORCH_VERSION == -1 ]]; then
     sudo pip install --upgrade torch
   else
diff --git a/sphinx/source/approximation_methods.rst b/sphinx/source/approximation_methods.rst
index b6b197d92e..4deec709bf 100644
--- a/sphinx/source/approximation_methods.rst
+++ b/sphinx/source/approximation_methods.rst
@@ -1,4 +1,4 @@
-Captum Approximation
+Approximation
 ====================
 
 .. automodule:: captum.attr._utils.approximation_methods
diff --git a/sphinx/source/base_classes.rst b/sphinx/source/base_classes.rst
index c337d666fc..a1f3d8117b 100644
--- a/sphinx/source/base_classes.rst
+++ b/sphinx/source/base_classes.rst
@@ -1,32 +1,32 @@
 Base Classes
-==========
+========================
 
 Attribution
-^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.Attribution
     :members:
 
 Layer Attribution
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerAttribution
     :members:
 
 Neuron Attribution
-^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronAttribution
     :members:
 
 Gradient Attribution
-^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.GradientAttribution
     :members:
 
 Perturbation Attribution
-^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.PerturbationAttribution
     :members:
diff --git a/sphinx/source/common.rst b/sphinx/source/common.rst
deleted file mode 100644
index 711a7e6fe5..0000000000
--- a/sphinx/source/common.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-Captum.Utils
-============
-
-.. automodule:: captum.attr._utils.common
-
-.. autofunction:: validate_input
-.. autofunction:: validate_noise_tunnel_type
-.. autofunction:: format_input
-.. autofunction:: _format_attributions
-.. autofunction:: zeros
-.. autofunction:: _reshape_and_sum
-.. autofunction:: _run_forward
diff --git a/sphinx/source/concept.rst b/sphinx/source/concept.rst
index 7aa60aabb9..19157398b7 100644
--- a/sphinx/source/concept.rst
+++ b/sphinx/source/concept.rst
@@ -1,29 +1,29 @@
 Concept-based Interpretability
-======
+==============================
 
 TCAV
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.concept.TCAV
     :members:
 
 
 ConceptInterpreter
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.concept.ConceptInterpreter
     :members:
 
 
 Concept
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.concept.Concept
     :members:
 
 
 Classifier
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.concept.Classifier
     :members:
diff --git a/sphinx/source/conf.py b/sphinx/source/conf.py
index 27bdc763fd..b01d1c8b81 100644
--- a/sphinx/source/conf.py
+++ b/sphinx/source/conf.py
@@ -10,7 +10,9 @@
 # -- Path setup --------------------------------------------------------------
 
 import os
+import re
 import sys
+from typing import List
 
 base_path = os.path.abspath(os.path.join(__file__, "..", "..", ".."))
 # read module from src instead of installation
@@ -75,6 +77,11 @@
 # Inlcude init docstrings into body of autoclass directives
 autoclass_content = "both"
 
+# Preserve signature defaults
+# Prevents entire tensors from being printed, & gives callable functions
+# proper names
+autodoc_preserve_defaults = True
+
 # Configuration for intersphinx: refer to the Python standard library and PyTorch
 intersphinx_mapping = {
     "python": ("https://docs.python.org/3", None),
@@ -201,3 +208,46 @@
 
 # If true, `todo` and `todoList` produce output, else they produce nothing.
 todo_include_todos = True
+
+
+# -- Docstring Improvements --------------------------------------------------
+
+
+# Regex code for typing replacements.
+# The "(?<![\.])" part is a negative lookbehind that rejects matches
+# preceded by a period, and "\b" denotes word boundaries.
+# Only words that aren't preceded by a period are replaced.
+_rt = [r"(?<![\.])(\b", r"\b)"]
+
+
+def autodoc_process_docstring(
+    app, what: str, name: str, obj, options, lines: List[str]
+) -> None:
+    """
+    Modify docstrings before creating html files.
+    Sphinx converts the 'Args:' and 'Returns:' sections of docstrings into
+    reStructuredText (rST) syntax, which can then be found via ':type' & ':rtype'.
+
+    See here for more information:
+    https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
+    """
+    for i in range(len(lines)):
+        # Skip unless line is a parameter doc or a return doc
+        if not lines[i].startswith(":type"):
+            continue
+        if ":py:data:" in lines[i]:
+            continue
+
+        # Ensure Any, Callable, Iterator, & Iterable types are linked via intersphinx.
+        # The tilde '~' character hides the 'typing.' portion of the string.
+        lines[i] = re.sub(_rt[0] + r"Any" + _rt[1], "~typing.Any", lines[i])
+        lines[i] = re.sub(_rt[0] + r"Callable" + _rt[1], "~typing.Callable", lines[i])
+        lines[i] = re.sub(_rt[0] + r"Iterator" + _rt[1], "~typing.Iterator", lines[i])
+        lines[i] = re.sub(_rt[0] + r"Iterable" + _rt[1], "~typing.Iterable", lines[i])
+
+        # Ensure Tensor type is hyperlinked by intersphinx
+        lines[i] = re.sub(_rt[0] + r"Tensor" + _rt[1], "~torch.Tensor", lines[i])
+
+
+def setup(app) -> None:
+    app.connect("autodoc-process-docstring", autodoc_process_docstring)
diff --git a/sphinx/source/deconvolution.rst b/sphinx/source/deconvolution.rst
index 61e092e768..d5813d3842 100644
--- a/sphinx/source/deconvolution.rst
+++ b/sphinx/source/deconvolution.rst
@@ -1,5 +1,5 @@
 Deconvolution
-=========
+=============
 
 .. autoclass:: captum.attr.Deconvolution
     :members:
diff --git a/sphinx/source/feature_ablation.rst b/sphinx/source/feature_ablation.rst
index 35484a0fe6..e337aecf73 100644
--- a/sphinx/source/feature_ablation.rst
+++ b/sphinx/source/feature_ablation.rst
@@ -1,5 +1,6 @@
 Feature Ablation
-=========
+================
 
 .. autoclass:: captum.attr.FeatureAblation
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/feature_permutation.rst b/sphinx/source/feature_permutation.rst
index d58f625aee..609ff1ff39 100644
--- a/sphinx/source/feature_permutation.rst
+++ b/sphinx/source/feature_permutation.rst
@@ -1,5 +1,6 @@
 Feature Permutation
-=========
+===================
 
 .. autoclass:: captum.attr.FeaturePermutation
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/gradient_shap.rst b/sphinx/source/gradient_shap.rst
index 2a676dcb06..8d94c31463 100644
--- a/sphinx/source/gradient_shap.rst
+++ b/sphinx/source/gradient_shap.rst
@@ -3,6 +3,3 @@ GradientShap
 
 .. autoclass:: captum.attr.GradientShap
     :members:
-
-.. autoclass:: captum.attr.InputBaselineXGradient
-    :members:
diff --git a/sphinx/source/guided_backprop.rst b/sphinx/source/guided_backprop.rst
index 6ef3a947ae..4c0685e8c5 100644
--- a/sphinx/source/guided_backprop.rst
+++ b/sphinx/source/guided_backprop.rst
@@ -1,5 +1,5 @@
 Guided Backprop
-=========
+===============
 
 .. autoclass:: captum.attr.GuidedBackprop
     :members:
diff --git a/sphinx/source/guided_grad_cam.rst b/sphinx/source/guided_grad_cam.rst
index 99f18d2af1..207d8e55fa 100644
--- a/sphinx/source/guided_grad_cam.rst
+++ b/sphinx/source/guided_grad_cam.rst
@@ -1,5 +1,5 @@
 Guided GradCAM
-=========
+==============
 
 .. autoclass:: captum.attr.GuidedGradCam
     :members:
diff --git a/sphinx/source/influence.rst b/sphinx/source/influence.rst
index 6366924a70..6b906d8c47 100644
--- a/sphinx/source/influence.rst
+++ b/sphinx/source/influence.rst
@@ -1,41 +1,41 @@
 Influential Examples
-======
+====================
 
 DataInfluence
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.influence.DataInfluence
     :members:
 
 
 SimilarityInfluence
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.influence.SimilarityInfluence
     :members:
 
 
 TracInCPBase
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.influence.TracInCPBase
     :members:
 
 
 TracInCP
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.influence.TracInCP
     :members:
 
 TracInCPFast
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.influence.TracInCPFast
     :members:
 
 TracInCPFastRandProj
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.influence.TracInCPFastRandProj
     :members:
diff --git a/sphinx/source/input_x_gradient.rst b/sphinx/source/input_x_gradient.rst
index cd5f222e27..5213eab69b 100644
--- a/sphinx/source/input_x_gradient.rst
+++ b/sphinx/source/input_x_gradient.rst
@@ -1,5 +1,5 @@
 Input X Gradient
-===============
+================
 
 .. autoclass:: captum.attr.InputXGradient
     :members:
diff --git a/sphinx/source/insights.rst b/sphinx/source/insights.rst
index ece9180971..1e0963d483 100644
--- a/sphinx/source/insights.rst
+++ b/sphinx/source/insights.rst
@@ -4,12 +4,12 @@ Insights
 Batch
 ^^^^^
 
-.. autoclass:: captum.insights.api.Batch
+.. autoclass:: captum.insights.Batch
     :members:
 
 AttributionVisualizer
 ^^^^^^^^^^^^^^^^^^^^^
-.. autoclass:: captum.insights.api.AttributionVisualizer
+.. autoclass:: captum.insights.AttributionVisualizer
     :members:
 
 
diff --git a/sphinx/source/kernel_shap.rst b/sphinx/source/kernel_shap.rst
index 48cfde3535..421ed0ea62 100644
--- a/sphinx/source/kernel_shap.rst
+++ b/sphinx/source/kernel_shap.rst
@@ -3,3 +3,4 @@ KernelShap
 
 .. autoclass:: captum.attr.KernelShap
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/layer.rst b/sphinx/source/layer.rst
index 7fbbd5bd85..466fbd97d2 100644
--- a/sphinx/source/layer.rst
+++ b/sphinx/source/layer.rst
@@ -1,70 +1,70 @@
 Layer Attribution
-======
+===========================
 
 Layer Conductance
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerConductance
     :members:
 
 
 Layer Activation
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerActivation
     :members:
 
 Internal Influence
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.InternalInfluence
     :members:
 
 Layer Gradient X Activation
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerGradientXActivation
     :members:
 
 GradCAM
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerGradCam
     :members:
 
 Layer DeepLift
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerDeepLift
     :members:
 
 Layer DeepLiftShap
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerDeepLiftShap
     :members:
 
 Layer GradientShap
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerGradientShap
     :members:
 
 Layer Integrated Gradients
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerIntegratedGradients
     :members:
 
 Layer Feature Ablation
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerFeatureAblation
     :members:
 
 
 Layer LRP
-^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.LayerLRP
     :members:
diff --git a/sphinx/source/lime.rst b/sphinx/source/lime.rst
index 4c722304f1..483458572c 100644
--- a/sphinx/source/lime.rst
+++ b/sphinx/source/lime.rst
@@ -3,6 +3,7 @@ Lime
 
 .. autoclass:: captum.attr.LimeBase
     :members:
+    :exclude-members: compute_convergence_delta
 .. autoclass:: captum.attr.Lime
     :members:
 
diff --git a/sphinx/source/metrics.rst b/sphinx/source/metrics.rst
index 47c11e4856..8e71a40b02 100644
--- a/sphinx/source/metrics.rst
+++ b/sphinx/source/metrics.rst
@@ -1,15 +1,15 @@
 Metrics
-======
+===========
 
 Infidelity
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^
 
 .. autoclass:: captum.metrics.infidelity
     :members:
 
 
 Sensitivity
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^
 
 .. autoclass:: captum.metrics.sensitivity_max
     :members:
diff --git a/sphinx/source/neuron.rst b/sphinx/source/neuron.rst
index 8ad1514378..897f237baf 100644
--- a/sphinx/source/neuron.rst
+++ b/sphinx/source/neuron.rst
@@ -1,56 +1,57 @@
 Neuron Attribution
-=======
+===========================
 
 Neuron Gradient
-^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronGradient
     :members:
 
 Neuron Integrated Gradients
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronIntegratedGradients
     :members:
 
 Neuron Conductance
-^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronConductance
     :members:
 
 Neuron DeepLift
-^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronDeepLift
     :members:
 
 Neuron DeepLiftShap
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronDeepLiftShap
     :members:
 
 Neuron GradientShap
-^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronGradientShap
     :members:
 
 Neuron Guided Backprop
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronGuidedBackprop
     :members:
 
 Neuron Deconvolution
-^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronDeconvolution
     :members:
 
 Neuron Feature Ablation
-^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.attr.NeuronFeatureAblation
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/noise_tunnel.rst b/sphinx/source/noise_tunnel.rst
index e1aff40b18..15b6ec7dbf 100644
--- a/sphinx/source/noise_tunnel.rst
+++ b/sphinx/source/noise_tunnel.rst
@@ -3,3 +3,4 @@ NoiseTunnel
 
 .. autoclass:: captum.attr.NoiseTunnel
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/occlusion.rst b/sphinx/source/occlusion.rst
index a05b236e24..5867d739b9 100644
--- a/sphinx/source/occlusion.rst
+++ b/sphinx/source/occlusion.rst
@@ -3,3 +3,4 @@ Occlusion
 
 .. autoclass:: captum.attr.Occlusion
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/pytext.rst b/sphinx/source/pytext.rst
index 66c847dcd9..f11a6a2099 100644
--- a/sphinx/source/pytext.rst
+++ b/sphinx/source/pytext.rst
@@ -1,11 +1,8 @@
 Captum.Models
 ==========================
 
-.. automodule:: captum.attr._models.pytext
-
-.. autoclass:: PyTextInterpretableEmbedding
+.. autoclass:: captum.attr._models.pytext.PyTextInterpretableEmbedding
     :members:
 
-
-.. autoclass:: BaselineGenerator
+.. autoclass:: captum.attr._models.pytext.BaselineGenerator
     :members:
diff --git a/sphinx/source/robust.rst b/sphinx/source/robust.rst
index 3b90a32ae5..48b360ad80 100644
--- a/sphinx/source/robust.rst
+++ b/sphinx/source/robust.rst
@@ -1,29 +1,29 @@
 Robustness
-======
+======================
 
 FGSM
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.robust.FGSM
     :members:
 
 
 PGD
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.robust.PGD
     :members:
 
 
 Attack Comparator
-^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.robust.AttackComparator
     :members:
 
 
 Min Param Perturbation
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^
 
 .. autoclass:: captum.robust.MinParamPerturbation
     :members:
diff --git a/sphinx/source/shapley_value_sampling.rst b/sphinx/source/shapley_value_sampling.rst
index c998125af9..4d40338540 100644
--- a/sphinx/source/shapley_value_sampling.rst
+++ b/sphinx/source/shapley_value_sampling.rst
@@ -1,7 +1,9 @@
 Shapley Value Sampling
-=========
+======================
 
 .. autoclass:: captum.attr.ShapleyValueSampling
     :members:
+    :exclude-members: compute_convergence_delta
 .. autoclass:: captum.attr.ShapleyValues
     :members:
+    :exclude-members: compute_convergence_delta
diff --git a/sphinx/source/utilities.rst b/sphinx/source/utilities.rst
index f4e3d7ace6..a19e75df9e 100644
--- a/sphinx/source/utilities.rst
+++ b/sphinx/source/utilities.rst
@@ -8,6 +8,8 @@ Visualization
 
 .. autofunction:: captum.attr.visualization.visualize_image_attr_multiple
 
+.. autofunction:: captum.attr.visualization.visualize_timeseries_attr
+
 
 Interpretable Embeddings
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -16,6 +18,7 @@ Interpretable Embeddings
     :members:
 
 .. autofunction:: captum.attr.configure_interpretable_embedding_layer
+
 .. autofunction:: captum.attr.remove_interpretable_embedding_layer
 
 
diff --git a/tests/influence/_core/test_tracin_intermediate_quantities.py b/tests/influence/_core/test_tracin_intermediate_quantities.py
index 7f3e806c28..9f0daebad3 100644
--- a/tests/influence/_core/test_tracin_intermediate_quantities.py
+++ b/tests/influence/_core/test_tracin_intermediate_quantities.py
@@ -179,7 +179,7 @@ def test_tracin_intermediate_quantities_consistent(
             else:
                 # `test_features` is a tuple, so we unpack it to place in tuple,
                 # along with `test_labels`
-                test_batch = (*test_features, test_labels)
+                test_batch = (*test_features, test_labels)  # type: ignore[assignment]
 
             # the influence score is the dot product of intermediate quantities
             intermediate_quantities_scores = torch.matmul(
diff --git a/website/sidebars.json b/website/sidebars.json
index 0337e1bbe9..9efb1fddb2 100644
--- a/website/sidebars.json
+++ b/website/sidebars.json
@@ -1,7 +1,7 @@
 {
   "docs": {
     "About": ["introduction"],
-    "General": ["getting_started", "captum_insights", "algorithms", "algorithms_comparison_matrix", "faq", "contribution_guidelines"],
+    "General": ["getting_started", "captum_insights", "attribution_algorithms", "algorithms_comparison_matrix", "faq", "contribution_guidelines"],
     "Usage": ["extension/integrated_gradients"]
   }
 }