docs: update readme, quickstart (#116)
* update & quick start
* update autoencoder and other docs
* add a lighter progress bar callback to make training faster and almost on par with native torch

Signed-off-by: Avik Basu <avikbasu93@gmail.com>
Co-authored-by: Vigith Maurice <vigith@gmail.com>
ab93 and vigith committed Jan 5, 2023
1 parent e8b5304 commit 2735d72
Showing 10 changed files with 805 additions and 479 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
ignore = E203, F821
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist,venv
max-complexity = 10
max-line-length = 100
2 changes: 2 additions & 0 deletions .gitignore
@@ -165,3 +165,5 @@ cython_debug/

# Mac related
*.DS_Store

.python-version
73 changes: 43 additions & 30 deletions docs/autoencoders.md
@@ -2,47 +2,60 @@

An Autoencoder is a type of Artificial Neural Network, used to learn efficient data representations (encoding) of unlabeled data.

It mainly consist of 2 components: an encoder and a decoder. The encoder compresses the input into a lower dimensional code, the decoder then reconstructs the input only using this code.
It mainly consists of 2 components: an encoder and a decoder. The encoder compresses the input into a lower dimensional code, the decoder then reconstructs the input only using this code.

### Autoencoder Pipelines
## Datamodules
PyTorch Lightning datamodules abstract and separate the data functionality from the model and the training loop.
Numalogic provides `TimeseriesDataModule` to help set up and load dataloaders.

Numalogic provides two types of pipelines for Autoencoders. These pipelines serve as a wrapper around the base network models, making it easier to train, predict and generate scores. Also, this module follows the sklearn API.
```python
import numpy as np
from numalogic.tools.data import TimeseriesDataModule

train_data = np.random.randn(100, 3)
datamodule = TimeseriesDataModule(12, train_data, batch_size=128)
```
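As a quick sketch of how the datamodule is consumed (assuming `TimeseriesDataModule` implements the usual LightningDataModule hooks, which the trainer below normally calls for you), the training dataloader can be inspected directly:

```python
# Assumed: standard LightningDataModule hooks; AutoencoderTrainer invokes these during fit()
datamodule.setup(stage="fit")
train_loader = datamodule.train_dataloader()
batch = next(iter(train_loader))  # typically windows of shape (batch, seq_len, n_features)
```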

#### AutoencoderPipeline
## Autoencoder Trainer

Here we are using `VanillAE`, a Vanilla Autoencoder model.
Numalogic provides a subclass of the PyTorch Lightning Trainer module specifically for Autoencoders.
This trainer provides a mechanism to train, validate and infer on data, and supports all the parameters of the Lightning Trainer.

Here we are using `VanillaAE`, a Vanilla Autoencoder model.

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder import SparseAEPipeline
from numalogic.models.autoencoder.variants import VanillaAE
from numalogic.models.autoencoder import AutoencoderTrainer

model = AutoencoderPipeline(
model=VanillaAE(signal_len=12, n_features=3), seq_len=seq_len
)
model.fit(X_train)
model = VanillaAE(seq_len=12, n_features=3)
trainer = AutoencoderTrainer(max_epochs=50, enable_progress_bar=True)
trainer.fit(model, datamodule=datamodule)
```
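For inference, the trainer follows the usual Lightning `predict` flow. Below is a minimal sketch, assuming `StreamingDataset` from `numalogic.tools.data` (which windows a 2-D array into sequences) can be used to build a test dataloader:

```python
import numpy as np
from torch.utils.data import DataLoader
from numalogic.tools.data import StreamingDataset

test_data = np.random.randn(50, 3)
# Assumed signature: StreamingDataset(data, seq_len) yields sliding windows of length 12
test_loader = DataLoader(StreamingDataset(test_data, 12), batch_size=64)

# Output is a list of per-batch predict_step results (assumed here to be the
# reconstruction error for numalogic's autoencoder modules)
recon_err = trainer.predict(model, dataloaders=test_loader)
```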

#### SparseAEPipeline
## Autoencoder Variants

A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically the loss function is constructed so that activations are penalized within a layer.
Numalogic supports 2 variants of Autoencoders currently.
More details can be found [here](https://www.deeplearningbook.org/contents/autoencoders.html).

So, by adding a sparsity regularization, we will be able to stop the neural network from copying the input and reduce overfitting.
### 1. Undercomplete autoencoders

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder import SparseAEPipeline
This is the simplest version of autoencoders, where the latent dimension is kept
smaller than the encoding and decoding dimensions.

model = SparseAEPipeline(
model=VanillaAE(signal_len=12, n_features=3), seq_len=36, num_epochs=30
)
model.fit(X_train)
```
Examples would be `VanillaAE`, `Conv1dAE`, `LSTMAE` and `TransformerAE`

### 2. Sparse autoencoders
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck.
Specifically, the loss function is constructed so that activations within a layer are penalized.
By adding this sparsity regularization, the network is discouraged from simply copying the input, which reduces overfitting.

Examples would be `SparseVanillaAE`, `SparseConv1dAE`, `SparseLSTMAE` and `SparseTransformerAE`
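As a sketch (assuming the sparse variants accept a regularization weight `beta`, as the `SparseConv1dAE` example further down does), a sparse model is constructed much like its undercomplete counterpart:

```python
from numalogic.models.autoencoder.variants import SparseVanillaAE

# beta (assumed keyword, mirroring the SparseConv1dAE example below) weighs the
# sparsity penalty that is added to the reconstruction loss
model = SparseVanillaAE(seq_len=12, n_features=3, beta=1e-3)
```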

### Autoencoder Variants
## Network architectures

Numalogic supports the following variants of Autoencoders
Numalogic currently supports the following architectures.

#### VanillaAE
#### Fully Connected

Vanilla Autoencoder model comprising only fully connected layers.

@@ -52,17 +65,17 @@ from numalogic.models.autoencoder.variants import VanillaAE
model = VanillaAE(seq_len=12, n_features=2)
```

#### Conv1dAE
#### 1d Convolutional

Conv1dAE is a one dimensional Convolutional Autoencoder with multichannel support.

```python
from numalogic.models.autoencoder.variants import Conv1dAE
from numalogic.models.autoencoder.variants import SparseConv1dAE

model=Conv1dAE(in_channels=3, enc_channels=8)
model = SparseConv1dAE(beta=1e-2, seq_len=12, in_channels=3, enc_channels=8)
```

#### LSTMAE
#### LSTM

An LSTM (Long Short-Term Memory) Autoencoder is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM architecture.

@@ -73,7 +86,7 @@ model = LSTMAE(seq_len=12, no_features=2, embedding_dim=15)

```

#### TransformerAE
#### Transformer

The transformer-based Autoencoder model was inspired by the [Attention is all you need](https://arxiv.org/abs/1706.03762) paper.

15 changes: 14 additions & 1 deletion docs/post-processing.md
@@ -3,7 +3,20 @@
The post-processing step is again optional; here we normalize the anomaly scores to a 0-10 range, mainly to make the scores easier to interpret.

```python
import numpy as np
from numalogic.postprocess import tanh_norm

test_anomaly_score_norm = tanh_norm(test_anomaly_score)
raw_anomaly_score = np.random.randn(10, 2)
test_anomaly_score_norm = tanh_norm(raw_anomaly_score)
```

A scikit-learn compatible API is also available.
```python
import numpy as np
from numalogic.postprocess import TanhNorm

raw_score = np.random.randn(10, 2)

norm = TanhNorm(scale_factor=10, smooth_factor=10)
norm_score = norm.fit_transform(raw_score)
```
4 changes: 2 additions & 2 deletions examples/numalogic-simple-pipeline/src/udf/inference.py
@@ -16,8 +16,8 @@
def inference(_: str, datum: Datum) -> Messages:
r"""
Here inference is done on the data, given, the ML model is present
in the registry. If a model does not exist, it moves on Otherwise, conditional forward the inferred data
to postprocess vertex for generating anomaly score for the payload.
in the registry. If a model does not exist, the payload is flagged for training.
It then passes to the threshold vertex.
For more information about the arguments, refer:
https://github.com/numaproj/numaflow-python/blob/main/pynumaflow/function/_dtypes.py
454 changes: 343 additions & 111 deletions examples/quick-start.ipynb

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions numalogic/models/autoencoder/trainer.py
@@ -2,6 +2,8 @@

import pytorch_lightning as pl
import torch

from numalogic.tools.callbacks import ProgressDetails
from numalogic.tools.data import TimeseriesDataModule
from pytorch_lightning import Trainer
from torch import Tensor
@@ -20,8 +22,12 @@ def __init__(
        enable_progress_bar=False,
        enable_model_summary=False,
        limit_val_batches=0,
        callbacks=None,
        **trainer_kw
    ):
        if (not callbacks) and enable_progress_bar:
            callbacks = ProgressDetails()

        super().__init__(
            logger=logger,
            max_epochs=max_epochs,
@@ -31,6 +37,7 @@ def __init__(
            enable_progress_bar=enable_progress_bar,
            enable_model_summary=enable_model_summary,
            limit_val_batches=limit_val_batches,
            callbacks=callbacks,
            **trainer_kw
        )

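A usage sketch for the new `callbacks` parameter (hypothetical values): an explicitly passed callback takes precedence over the auto-attach branch above, and any extra keyword arguments are still forwarded to the underlying Lightning Trainer.

```python
from numalogic.models.autoencoder import AutoencoderTrainer
from numalogic.tools.callbacks import ProgressDetails

# Explicit callback: the auto-attach branch above is skipped
trainer = AutoencoderTrainer(
    max_epochs=30,
    enable_progress_bar=True,
    callbacks=[ProgressDetails(log_freq=10)],
    accelerator="cpu",  # example of a kwarg forwarded via **trainer_kw
)
```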
34 changes: 34 additions & 0 deletions numalogic/tools/callbacks.py
@@ -0,0 +1,34 @@
import logging

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ProgressBarBase


_LOGGER = logging.getLogger(__name__)


class ProgressDetails(ProgressBarBase):
    r"""
    A lightweight training progress detail producer.
    Args:
        log_freq: Interval of epochs to log
    """

    def __init__(self, log_freq: int = 5):
        super().__init__()
        self._log_freq = log_freq
        self._enable = True

    def enable(self) -> None:
        self._enable = True

    def disable(self):
        self._enable = False

    def on_train_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        super().on_train_epoch_end(trainer, pl_module)
        metrics = self.get_metrics(trainer, pl_module)
        curr_epoch = trainer.current_epoch
        if curr_epoch % self._log_freq == 0:
            _LOGGER.info("epoch %s, loss: %s", curr_epoch, metrics["loss"])
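One practical note: `ProgressDetails` emits its per-epoch summary through the standard `logging` module at INFO level, so the messages only appear if the application's log configuration allows it, e.g.:

```python
import logging

# The root logger defaults to WARNING; raise it so the epoch/loss lines show up
logging.basicConfig(level=logging.INFO)
```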
