
After update, MWE does not produce same/similar output anymore (fit() or predict() problem) #1411

mhofert opened this issue Mar 1, 2024 · 2 comments


mhofert commented Mar 1, 2024

Hi,

After a recent update of Python/TensorFlow/Keras, a minimal working example (MWE) that I used to run to produce samples from a target distribution no longer produces such samples (close, but clearly from a different distribution; see the attached screenshots below). After more than 24 hours of searching for the needle in the haystack, I'm still clueless. A colleague ran the MWE under his setup on Windows with older versions of Python/TensorFlow/Keras and obtained the correct samples, as we always did; so did another colleague on macOS. Our loss functions also produce very similar values, so we are still unsure whether the problem lies in keras' fit() or predict().

Here is the full story, which by now I consider a 'bug'; I am posting it in the hope that others will find it when they realize their networks no longer train/predict properly. The biggest issue is that this can remain entirely undetected, as the loss functions don't indicate any problem... hence this post. It also means that certain R packages (e.g. 'gnn') can currently work for some users (my colleague) but not others (myself), without any warning.

The MWE trains a single-hidden-layer neural network (NN) to act as a random number generator (RNG). I pass iid N(0,1) samples through the NN and then compare them to given dependent multivariate samples from some target distribution (here: scaled ranks of absolute values of correlated normals) with the maximum mean discrepancy (MMD) loss function that we implemented (jointly with the NN, this is called a GMMN, a generative moment matching network).
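
For reference, the (biased) empirical MMD estimator that the loss below computes is

$$\widehat{\operatorname{MMD}}(X, Y) = \sqrt{\frac{1}{n^2}\sum_{i,j=1}^{n} k(x_i, x_j) + \frac{1}{m^2}\sum_{i,j=1}^{m} k(y_i, y_j) - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, y_j)},$$

where $k$ is an average of Gaussian (RBF) kernels $k_b(u, v) = \exp(-\lVert u - v\rVert^2/(2b^2))$ over several bandwidths $b$; this matches what radial_basis_function_kernel() and MMD() below implement.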

The MWE below worked well with R running inside a virtual Python environment (installed with Miniforge3 on my M1 14" MacBook Pro, first gen), with TensorFlow installed via "conda install -c apple tensorflow-deps" and "python -m pip install tensorflow-metal". This was until about a year ago. When I wanted to run the MWE again this week, I received:

Error: Valid installation of TensorFlow not found.

Python environments searched for 'tensorflow' package:
/usr/local/miniforge3/bin/python3.10
...
ModuleNotFoundError: No module named 'tensorflow'

You can install TensorFlow using the install_tensorflow() function.

After reinstalling Python/TensorFlow/Keras in exactly the way I had before, I still received this error. I then read on t-kalinowski/deep-learning-with-R-2nd-edition-code#3 that the following is the (now) recommended way to install Python/TensorFlow/Keras on all platforms, so I did:

install.packages("remotes")
remotes::install_github("rstudio/keras")
reticulate::install_python()
keras::install_keras()

After that, the MWE ran again. However, it no longer generated proper samples from the target distribution. I cannot go back to older versions of the R package 'keras', as the above error then reappears.
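
One can check which Python/TensorFlow installation R actually picks up with, for example (standard functions from the reticulate/tensorflow/keras packages):

reticulate::py_config()                       # which Python/virtualenv is resolved
tensorflow::tf_version()                      # TensorFlow version as seen from R
reticulate::py_module_available("tensorflow") # can the module be imported?
keras::is_keras_available()                   # is Keras usable?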

Here is the MWE, with sessionInfo() etc. for both my session and my colleague's (on Windows). Again, he obtains very similar loss values, but my generated samples look normal, no longer asymmetric as they should be (his are fine).

library(tensorflow) # only needed for our custom MMD loss function
library(keras)

## Generate training data U (scaled ranks of absolute values of correlated normals)
d <- 2 # bivariate case
P <- matrix(0.9, nrow = d, ncol = d); diag(P) <- 1 # correlation matrix
A <- t(chol(P)) # Cholesky factor
ntrn <- 50000 # training data sample size
set.seed(271)
Z <- matrix(rnorm(ntrn * d), ncol = d) # generate N(0,1)
X <- abs(Z %*% t(A)) # absolute values of N(0,P) samples
U <- apply(X, 2, rank) / (ntrn + 1) # training data
if(FALSE)
    plot(U, pch = ".") # ... to see the rough sample shape we are aiming for

## Helper function for custom MMD loss function (from 'gnn')
radial_basis_function_kernel <- function(x, y, bandwidth = 10^c(-3/2, -1, -1/2, -1/4, -1/8, -1/16))
{
    x. <- tf$expand_dims(x, axis = 1L) # shape (n, 1, d)
    y. <- tf$expand_dims(y, axis = 0L) # shape (1, m, d)
    dff2 <- tf$square(x. - y.) # squared componentwise differences, (n, m, d)
    dst2 <- tf$reduce_sum(dff2, axis = 2L) # squared Euclidean distances, (n, m)
    dst2.vec <- tf$reshape(dst2, shape = c(1L, -1L)) # flattened to (1, n*m)
    fctr <- tf$convert_to_tensor(as.matrix(1 / (2 * bandwidth^2)), dtype = dst2.vec$dtype) # one row per bandwidth
    kernels <- tf$exp(-tf$matmul(fctr, b = dst2.vec)) # RBF kernel values for all bandwidths
    tf$reshape(tf$reduce_mean(kernels, axis = 0L), # average over the bandwidth mixture
               shape = tf$shape(dst2)) # back to (n, m)
}

## Maximum mean discrepancy (MMD) loss function (from 'gnn')
MMD <- function(x, y, ...)
{
    is.R.x <- !tf$is_tensor(x) # convert plain R objects to tensors if needed
    is.R.y <- !tf$is_tensor(y)
    if(is.R.x) x <- tf$convert_to_tensor(x, dtype = "float64")
    if(is.R.y) y <- tf$convert_to_tensor(y, dtype = "float64")
    res <- tf$sqrt(tf$reduce_mean(radial_basis_function_kernel(x, y = x, ...)) + # within-x term
                   tf$reduce_mean(radial_basis_function_kernel(y, y = y, ...)) - # within-y term
                   2 * tf$reduce_mean(radial_basis_function_kernel(x, y = y, ...))) # cross term
    if(is.R.x || is.R.y) as.numeric(res) else res # return an R number if the inputs were R objects
}
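
## Optional sanity check of the loss (a sketch, not needed for the MWE itself):
## for two independent samples from the same distribution, MMD() should be
## close to 0; for two clearly different distributions it should be noticeably larger
if(FALSE) {
    MMD(matrix(rnorm(2000), ncol = 2), y = matrix(rnorm(2000), ncol = 2)) # small
    MMD(matrix(rnorm(2000), ncol = 2), y = matrix(runif(2000), ncol = 2)) # larger
}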

## Setup model
in.lay <- layer_input(shape = 2)
hid.lay <- layer_dense(in.lay,  units = 300, activation = "relu")
out.lay <- layer_dense(hid.lay, units = 2,   activation = "sigmoid")
model <- keras_model(in.lay, out.lay)
compile(model, optimizer = "adam", loss = function(x, y) MMD(x, y = y))
## Note:
## 1) Even with loss = "mse" I get different sample shapes than before
##    (before they were scattered around (1/2, 1/2), now they seem to be normal around (1/2, 1/2))
## 2) With optimizer = optimizer_adam() instead of optimizer = "adam", I get the following
##    (but training seems to remain unaffected):
##    WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.Adam` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.Adam`.
##    WARNING:absl:There is a known slowdown when using v2.11+ Keras optimizers on M1/M2 Macs. Falling back to the legacy Keras optimizer, i.e., `tf.keras.optimizers.legacy.Adam`.
## 3) I also tried optimizer = keras$optimizers$legacy$Adam() but it makes no difference

## Train
fit(model,
    x = matrix(rnorm(ntrn * d), ncol = 2), # prior sample (here: training)
    y = U, # training data to match (here: target data)
    batch_size = 500, epochs = 10) # small values here, but enough so that we should see barely any difference between the generated samples and those in the training data

## Generate from trained model by passing through new prior samples
N <- matrix(rnorm(2000 * d), ncol = 2)
V <- predict(model, x = N)

## Compare with training data
layout(t(1:2))
opar <- par(pty = "s", pch = 20, cex = 0.7)
plot(U[1:2000,], xlab = expression(U[1]), ylab = expression(U[2]))
plot(V,          xlab = expression(V[1]), ylab = expression(V[2])) # => not close anymore!
par(opar)
layout(1)
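
Besides eyeballing the plots, a quick numeric sanity check is to evaluate the MMD() helper from above directly between the generated and the training samples (small values indicate matching distributions):

MMD(V, y = U[1:2000,]) # small if V matches the target, noticeably larger otherwise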

My colleague saved the weights and the whole model he trained based on the above code, and if I pass 'N' through those, the samples are also off (more mass towards the corners). The same happens the other way around (if I send him my trained model/weights). What could possibly have changed to cause such a serious difference?

I saw on t-kalinowski/deep-learning-with-R-2nd-edition-code#6 (comment) that one might need to tell the optimizer before fit() which variables it will be modifying... Is this related? But why are the losses close yet the samples so different (they are always symmetric, more normally distributed, when they should be asymmetric)?

Below is more information about the two sessions (mine, my colleague's). The only difference we found is that if we both run class(model), his output starts with "keras.engine.training.Model" while mine starts with "keras.engine.functional.Functional" (followed by "keras.engine.training.Model"). But even calling keras:::predict.keras.engine.training.Model() directly did not make a difference. Nothing in the above code was modified since the last time it worked for me, so the cause must be a change in TensorFlow/Keras (perhaps on macOS only?). Any hunch? I'm happy to provide (even) more details.

Thanks & cheers,
Marius

Info about my session

Python, TensorFlow, Keras were installed via:

install.packages("remotes")
remotes::install_github("rstudio/keras")
reticulate::install_python()
keras::install_keras()

reticulate::py_config() shows:

python:         /Users/mhofert/.virtualenvs/r-tensorflow/bin/python
libpython:      /Users/mhofert/.pyenv/versions/3.9.18/lib/libpython3.9.dylib
pythonhome:     /Users/mhofert/.virtualenvs/r-tensorflow:/Users/mhofert/.virtualenvs/r-tensorflow
version:        3.9.18 (main, Feb 29 2024, 14:28:41)  [Clang 15.0.0 (clang-1500.1.0.2.5)]
numpy:          /Users/mhofert/.virtualenvs/r-tensorflow/lib/python3.9/site-packages/numpy
numpy_version:  1.24.3
tensorflow:     /Users/mhofert/.virtualenvs/r-tensorflow/lib/python3.9/site-packages/tensorflow
NOTE: Python version was forced by import("tensorflow")

sessionInfo() shows (note: I also installed the R package tensorflow in version 2.13.0
but it didn't solve the problem):

## Output:
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.3.1

## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

## time zone: Asia/Hong_Kong
## tzcode source: internal

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base

## other attached packages:
## [1] keras_2.13.0      tensorflow_2.15.0

## loaded via a namespace (and not attached):
##  [1] R6_2.5.1          base64enc_0.1-3   Matrix_1.6-1.1    lattice_0.21-9
##  [5] reticulate_1.35.0 magrittr_2.0.3    generics_0.1.3    png_0.1-8
##  [9] lifecycle_1.0.4   cli_3.6.2         grid_4.3.2        zeallot_0.1.0
## [13] tfruns_1.5.2      compiler_4.3.2    rprojroot_2.0.4   here_1.0.1
## [17] whisker_0.4.1     Rcpp_1.0.12       rlang_1.1.3       jsonlite_1.8.8

Info about my colleague's session

His reticulate::py_config() shows:

python:         C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate/python.exe
libpython:      C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate/python36.dll
pythonhome:     C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate
version:        3.6.12 |Anaconda, Inc.| (default, Sep  9 2020, 00:29:25) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
numpy_version:  1.19.5

His sessionInfo() shows:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] keras_2.9.0      tensorflow_2.9.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3    here_1.0.1      lattice_0.20-41 png_0.1-7       rprojroot_2.0.3 zeallot_0.1.0
 [7] rappdirs_0.3.3  grid_4.0.2      R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3  cli_3.2.0
[13] rlang_1.0.2     tfruns_1.5.0    whisker_0.4     Matrix_1.2-18   reticulate_1.24 generics_0.1.2
[19] tools_4.0.2     compiler_4.0.2  base64enc_0.1-3
[Screenshot: my (incorrect) output]

mhofert commented Mar 2, 2024

I cleaned everything (Python, TensorFlow, Keras) and reinstalled Keras the way I used to (essentially manually). It then ran without errors but still produced wrong samples. I then realized that

install.packages("keras") 
reticulate::install_python() 
keras::install_keras() 

is essentially doing the same thing and actually ignores whatever I install manually (conda, location of virtual environments, ...). I then looked into keras::install_keras() and realized that it uses version = "default" as the default, which corresponds to 2.13 (but I know that my colleague used TensorFlow 2.15 and got the code to produce the correct samples). I then did:

install.packages("keras") 
reticulate::install_python() 
keras::install_keras(version = "release") 

and it solved the problem! This is reproducible: if I call keras::install_keras() again, it fails again. As mentioned before, note that nothing indicates the failure (very similar loss values, no sign of wrong training).
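
In case one prefers pinning an exact version over "release" (which moves over time), install_keras() also accepts a version number; a sketch, assuming 2.15 is the version that works here:

keras::install_keras(version = "2.15") # pin the TensorFlow/Keras version explicitly
tensorflow::tf_version()               # confirm which version the R session now binds to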

Here is a plot of the correct samples:

[Screenshot: plot of the correct samples]

t-kalinowski (Member) commented

Hi, thanks for reporting.

Running your code, I can't reproduce the issue. I suspect that this ultimately boils down to an issue with older builds of tensorflow-metal or tensorflow-macos, the M1-specific builds provided by Apple. The early versions of them had some bugs related to random tensor generation, and it's possible the current versions have them too.

Fortunately, beginning with TF 2.16 (available as an RC now, should be in release soon), we'll no longer need to install tensorflow-macos, as the necessary parts to make TensorFlow work on M1 Macs are now part of the official build.

If for some reason you require running an older version of TensorFlow on an M1 Mac, you can skip tensorflow-macos and force the tensorflow-cpu package.

tensorflow::install_tensorflow(metal = FALSE, version = "2.13-cpu")
