
ignore one class subfolder while using image_dataset_from_directory() function #1386

maitra opened this issue Oct 29, 2023 · 10 comments

@maitra

maitra commented Oct 29, 2023

I am looking into the R package keras, specifically the function image_dataset_from_directory().

According to the help page:

 If your directory structure is:

 main_directory/
 ...class_a/
 ......a_image_1.jpg
 ......a_image_2.jpg
 ...class_b/
 ......b_image_1.jpg
 ......b_image_2.jpg

Then calling ‘image_dataset_from_directory(main_directory, labels='inferred')’ will return a ‘tf.data.Dataset’ that yields batches of images from the subdirectories class_a and class_b, together with labels 0 and 1 (0 corresponding to class_a and 1 corresponding to class_b).
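
So, for that two-class layout, the basic call is just this (a minimal sketch, assuming keras is installed and "main_directory" exists on disk):

library(keras)
ds <- image_dataset_from_directory("main_directory", labels = "inferred")
# yields batches of (images, labels), with class_a -> 0 and class_b -> 1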

However, I have three folders:

 main_directory/
 ...class_a/
 ......a_image_1.jpg
 ......a_image_2.jpg
 ...
 ...class_b/
 ......b_image_1.jpg
 ......b_image_2.jpg
 ...
 ...class_c/
 ......c_image_1.jpg
 ......c_image_2.jpg
 ...

I want to read only two of these classes (and ignore the third). Is there a way to do this using image_dataset_from_directory() or some other function?

@t-kalinowski (Member)

t-kalinowski commented Oct 30, 2023

This is not directly supported by the convenience function image_dataset_from_directory(), but it should be straightforward to hack around the limitations of the function to achieve what you want.

The simplest fix is probably to use {tfdatasets}: either dataset_map() or dataset_filter() can drop the labels you're not interested in. This is an expedient path if data loading is not the bottleneck in your pipeline.

# first get the labels in the same order as keras infers them
# (keras sorts the subdirectories of main_directory alphabetically)
library(reticulate)
library(magrittr)
library(tensorflow)
os <- import("os")
# the second element yielded by os.walk() is the vector of subdirectory names
sorted_labels <- os$walk(main_directory) |> iter_next() |> _[[2]] |> sort()

labels <- seq(0, along.with = sorted_labels)
names(labels) <- sorted_labels
my_unwanted_labels <- labels %>% .[names(.) %in% c("class_c")] %>% unname()

library(keras)
library(tfdatasets)
ds <- image_dataset_from_directory(....) %>%
  dataset_map(\(images, labels) {
    # keep only the observations whose label is not one of the unwanted ones
    keep <- my_unwanted_labels |>
      lapply(\(bad_label) labels != bad_label) |>
      purrr::reduce(`&`)
    tuple(images[keep], labels[keep])
  })

or using dataset_filter():

my_unwanted_labels %<>% as_tensor()
ds <- image_dataset_from_directory(....) %>%
  dataset_unbatch() %>%
  dataset_filter(\(image, label) !k_any(label == my_unwanted_labels)) %>%
  dataset_batch(batch_size = 32)

Alternatively, instead of fixing up the output of image_dataset_from_directory(), you can fix up the input by creating a directory with a curated set of symlinks. Something like:

library(fs)
library(keras)

curated_dataset <- fs::path("curated_dataset") |> path_abs()
dir_create(curated_dataset)
class_dirs <- dir_ls(main_directory, recurse = FALSE) %>%
  .[!basename(.) %in% c("class_c")] %>% 
  path_abs()
link_create(class_dirs,                                  # link target
            path(curated_dataset, basename(class_dirs))) # link location

ds <- image_dataset_from_directory(curated_dataset, follow_links = FALSE)

(All the code snippets above are untested, but I trust you can figure out the rest.)

@maitra (Author)

maitra commented Oct 31, 2023

Thanks for this! Very helpful.

The fix-up-the-input solution works.

However, I think the fix-up-the-output solution is more desirable: for one, it does not create all those curated symlinks. I cannot see, though, whether the other arguments of image_dataset_from_directory() can be made to work.


Trying the dataset_map() code above, I first wanted to note that I get:


Warning messages:
1: In `[.tensorflow.tensor`(images, keep) :
  Incorrect number of dimensions supplied. The number of supplied arguments, (not counting any NULL, tf$newaxis or np$newaxis) must match the number of dimensions in the tensor, unless an all_dims() was supplied (this will produce an error in the future)
2: In force(if_any_TRUE) :
  Indexing tensors are passed as-is to python, no index offsetting or R to python translation is performed. Selected options for one_based and inclusive_stop are ignored and treated as FALSE. To silence this warning, set options(tensorflow.extract.warn_tensors_passed_asis = FALSE)

But I would also like to pass some arguments to the image_dataset_from_directory() function, like:

image_size = c(180, 180),
validation_split = 0.6,
subset = "training",
seed = random.seed,
batch_size = 32

How do I do this? Thanks again for this wonderful resource that allows me to use R with keras!

@t-kalinowski (Member)

t-kalinowski commented Oct 31, 2023

To get rid of this warning:

In `[.tensorflow.tensor`(images, keep) :
  Incorrect number of dimensions supplied....

You can change the call images[keep] to images[keep, all_dims()] (and likewise for labels[keep], and any other call to [ where you are implicitly slicing along only the first dim of a multidimensional tensor).

(tensorflow::all_dims(), reticulate::py_ellipsis(), and reticulate::py_eval("...") all return the same thing.)
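
For instance, a minimal sketch (untested) of what the extra all_dims() buys you:

library(tensorflow)
imgs <- tf$zeros(shape(4, 180, 180, 3))        # a batch of 4 images
keep <- as_tensor(c(TRUE, FALSE, TRUE, TRUE))  # boolean mask along the batch axis
# without all_dims(), `[` warns that only 1 of the 4 dims was supplied;
# with it, the remaining dims are passed through untouched:
imgs[keep, all_dims()]                         # shape: (3, 180, 180, 3)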

The second warning is issued when you are subsetting a tensor with another tensor. It's a one-time warning per R session, to help remind you that x[1] is not the same as x[as_tensor(1L)]. You can silence it globally by calling options(tensorflow.extract.warn_tensors_passed_asis = FALSE).
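
A quick sketch of the difference (again untested, assuming the tensorflow package is attached):

x <- as_tensor(c(10, 20, 30))
x[1]             # R-style, 1-based indexing      -> 10
x[as_tensor(1L)] # passed to python as-is, 0-based -> 20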

I think that all the other arguments should still work. The one thing that might change is the exact output shape of the tfdataset that is returned, and you'd have to adjust the formals of the function passed to dataset_map() or dataset_filter() to match.

When in doubt about the exact signature needed, and to avoid a guessing game, you can quickly test by passing a function with ..., something like this:

image_dataset_from_directory(<many args>) %>%
  dataset_map(function(...) {
    str(list(...))
    # you can also do "browser-driven development", writing the body of the
    # function with live references to the symbolic "graph-mode" tensors
    # available for interactive, line-by-line testing, by dropping into a
    # browser() context here:
    browser()
    # just be sure to exit the browser() by "(c)ontinuing" and not by "(q)uitting".
    # If you quit, tensorflow keeps the tracing context open, leaving the session
    # in a broken state that requires an R session restart to fix.
  })

Then when you are done experimenting/writing, you can update the function signature for future readability:

image_dataset_from_directory(...., validation_split = ....) %>%
  dataset_map(function(train, val) {
    names(train) <- names(val) <- c("images", "labels")
    for (nm in c("images", "labels")) {
      train[[nm]] %<>% .[keep, all_dims()]
      val[[nm]]   %<>% .[keep, all_dims()]
    }
    tuple(lapply(list(train, val), unname))
  })

@maitra (Author)

maitra commented Nov 3, 2023

My apologies: I have been trying several things for a while, but I am still confused about this. Let us use the AFHQ dataset from your co-authored book. I want to focus only on the cats and dogs, roughly matching what you do there in Chapter 8, but as a learning experience I do not want to create a new folder of images as you have done there.

base_dir <- fs::path("afhq")

library(fs)
library(keras)
random.seed <- 415588819

library(reticulate)
os <- reticulate::import("os")
sorted_labels <- os$walk(base_dir / "train") |> iter_next() |> _[[2]] 

labels <- seq(0, along = sorted_labels)
names(labels) <- sorted_labels 
my_wanted_labels <- labels %>% .[!names(.) %in% c("wild")] %>% unname()

library(keras)
library(tfdatasets)
ds <- image_dataset_from_directory(base_dir / "train",
                                   validation_split = 0.8,
                                   image_size = c(180, 180),
                                   batch_size = 32,
                                   subset = "both",
                                   seed = random.seed) |>
  dataset_map(\(images, labels) {
    keep <- my_wanted_labels  |> 
      lapply(\(bad_label) labels != bad_label) |>
      purrr::reduce(`&`)
    tuple(images[keep, all_dims()], labels[keep, all_dims()])
})

However, I get:

Found 14630 files belonging to 3 classes.
Using 2926 files for training.
Using 11704 files for validation.
Error in dataset$map(map_func = as_py_function(map_func), num_parallel_calls = as_integer_tensor(num_parallel_calls,  : 
  attempt to apply non-function

I feel like I am almost there, but I am still stuck.

Thanks again for all your help! And thanks also for the book, and the resource!

@t-kalinowski (Member)

Here is a working example using the MNIST dataset (the most convenient for me right now):

library(purrr)
library(fs)
library(keras)
library(tfdatasets)

class_names <- xfun::n2w(0:9)
unwanted_class_names <- xfun::n2w(c(6, 9))

class_labels <- seq.int(from = 0, along.with = class_names)
names(class_labels) <- class_names

unwanted_labels <- local({
  class_labels %>% .[names(.) %in% unwanted_class_names]
})

dir <- tempfile("mnist-")
dir_create(dir, class_names)

mnist <- dataset_mnist()

walk(seq_len(nrow(mnist$train$x)), \(i) {
  img <- mnist$train$x[i,,]/255
  lbl <- mnist$train$y[i]
  jpeg::writeJPEG(image = img,
                  target = path(dir, xfun::n2w(lbl), i, ext = "jpeg"))
})

ds <- image_dataset_from_directory(dir, class_names = class_names)

ds <- ds %>%
  dataset_unbatch() %>%
  dataset_filter(\(img, lbl) k_all(lbl != unwanted_labels)) %>%
  dataset_batch(32)

# confirm the unwanted labels aren't there
seen_labels <- ds %>%
  dataset_take(10) %>%
  as_array_iterator() %>%
  reticulate::iterate(\(x) {
    c(images, labels) %<-% x
    unique(labels)
  }) %>%
  unlist() %>% unique() %>% sort()
# 0 1 2 3 4 5 7 8
stopifnot(!unwanted_labels %in% seen_labels)

# Note: in the upcoming keras 3 / keras_core, passing a subset of names
# to `class_names` will work:
ds <- image_dataset_from_directory(dir, class_names = class_names[1:3])

> And thanks also for the book, and the resource!

Thank you! I'm glad to hear you find it helpful.

@maitra (Author)

maitra commented Nov 5, 2023

> Here is a working example using the MNIST dataset (the most convenient for me right now): …

Thank you! Btw, length(ds) no longer works after the filtering: I get NA. We need the dataset length later in the code for pretraining. How do we get at it? Many thanks again!

@t-kalinowski (Member)

Making length() of a TF Dataset non-NA after applying dataset_filter() requires manually injecting the length information into the pipeline. There isn't a non-experimental way to do this yet, but the following works in TF 2.14:

# count the images that will survive the filter
n_images <- list.files(dir, full.names = TRUE) %>%
  .[!basename(.) %in% unwanted_class_names] %>%
  list.files(pattern = "\\.jpe?g$") %>%
  length()

ds <- image_dataset_from_directory(dir, class_names = class_names)

ds <- ds %>%
  dataset_unbatch() %>%
  dataset_filter(\(img, lbl) k_all(lbl != unwanted_labels)) %>%
  # inject the known cardinality so that length(ds) is no longer NA
  { .$apply(tf$data$experimental$assert_cardinality(n_images)) } %>%
  dataset_batch(32)

length(ds) # 1505

@maitra (Author)

maitra commented Nov 6, 2023

Odd. I have a problem with the AFHQ dataset (sorry):

base_dir <- fs::path("afhq")
library(tfdatasets)
library(keras)

class_names <- c("cat", "dog", "wild")
unwanted_class_names <- c("wild")

class_labels <- seq.int(from = 0, along.with = class_names)
names(class_labels) <- class_names

unwanted_labels <- local({
  class_labels %>% .[names(.) %in% unwanted_class_names]
})

ds <- image_dataset_from_directory(base_dir / "train",
                                   class_names = class_names)

n_images <- list.files(base_dir, full.names = TRUE) %>%
  .[!basename(.) %in% unwanted_class_names] %>%
  list.files(pattern = "\\.jpe?g$") %>%
  length()

ds <- ds %>%
  dataset_unbatch() %>%
  dataset_filter(\(img, lbl) k_all(lbl != unwanted_labels)) %>%
  { .$apply(tf$data$experimental$assert_cardinality(n_images)) } %>%
  dataset_batch(32)

I get:

Found 14630 files belonging to 3 classes.

Then,

length(ds)  # 0

I don't quite understand what is going wrong here. Thanks!

@t-kalinowski (Member)

t-kalinowski commented Nov 7, 2023

I think rather than working around the current TF Dataset cardinality limitations, it's simpler to create temporary links:

library(fs)
library(keras)

image_dataset_from_directory_subset <- function(directory, ..., class_names) {
  # stage a temporary directory containing links to only the wanted class subdirs
  directory2 <- dir_create(path_temp(file_temp(), path_file(directory)))
  stopifnot(class_names %in% list.files(directory))
  link_create(path(directory, class_names),  # link target
              path(directory2, class_names)) # link location
  keras::image_dataset_from_directory(directory2, ..., class_names = class_names)
}

ds <- image_dataset_from_directory_subset(dir, class_names = class_names[1:5]) 
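
For the AFHQ case above, that would be something like (an untested sketch, reusing base_dir and random.seed from earlier):

ds <- image_dataset_from_directory_subset(
  base_dir / "train",
  image_size = c(180, 180),
  batch_size = 32,
  seed = random.seed,
  class_names = c("cat", "dog")  # "wild" is simply never linked
)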
