Improve performance of transpose_list() #396

Open · wants to merge 3 commits into master
Conversation

@halhen commented Jul 14, 2022

This improves the runtime significantly for loading data with many
columns. The order of loop nesting as well as a much more efficient
binary search does the trick.

In a real world example, fetching ~300k rows with ~50 columns from
MongoDB, this brings the query + load time from 70 seconds to ~40.
Used to be: ~10 seconds query, ~30 seconds transpose_list, and ~30
seconds simplifying columns. transpose_list now takes <2 seconds.

Microbenchmark with synthetic data on an AMD 5950X, 128GB RAM, Fedora
Linux 36, R 4.1.3, jsonlite 1.8.0.9000 commit 8085435

```
> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+                          use.names = FALSE))
```

Before this change

```
> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10
```

With this change

```
> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10
```

```
@@ -1,4 +1,12 @@
#' @useDynLib jsonlite C_transpose_list
transpose_list <- function(x, names) {
.Call(C_transpose_list, x, names)
# Sort names before entering C, allowing for a binary search
```
@halhen (Author):
Sorting names lets us use a binary search in C, and sorting is much easier to do in R. If you are willing to add {stringi} or {withr} as a dependency, we could sort in C-locale order without this ugly Sys.setlocale() dance. There may well be other, cleaner ways to do this.

```
for(size_t k = 0; k < Rf_length(listnames); k++){
  if(!strcmp(CHAR(STRING_ELT(listnames, k)), targetname)){
    SET_VECTOR_ELT(out, i, col);
    UNPROTECT(1);
```
@halhen (Author):

I'm very inexperienced at integrating R and C. Please take extra care when reviewing my PROTECT() calls.

If the data contains a name that sorts before the smallest requested
name, the previous code would end up in an infinite loop.