Improve performance of transpose_list() #396

Open · wants to merge 3 commits into master
Conversation

@halhen commented Jul 14, 2022

This improves the runtime significantly for loading data with many
columns. The order of loop nesting as well as a much more efficient
binary search does the trick.

In a real world example, fetching ~300k rows with ~50 columns from
MongoDB, this brings the query + load time from 70 seconds to ~40.
Used to be: ~10 seconds query, ~30 seconds transpose_list, and ~30
seconds simplifying columns. transpose_list now takes <2 seconds.

Microbenchmark with synthetic data on an AMD 5950X, 128GB RAM, Fedora
Linux 36, R 4.1.3, jsonlite 1.8.0.9000 commit 8085435

```
> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+                          use.names = FALSE))
```

Before this change

```
> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10
```

With this change

```
> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10
```

```
@@ -1,4 +1,12 @@
#' @useDynLib jsonlite C_transpose_list
transpose_list <- function(x, names) {
.Call(C_transpose_list, x, names)
# Sort names before entering C, allowing for a binary search
```
@halhen (Author):
Sorting names lets us use a binary search in C, and sorting is much easier to do in R. If you are willing to add {stringi} or {withr} as a dependency, we could sort in C-locale order without this ugly Sys.setlocale() dance. There may well be other, cleaner ways to do this.

```
for(size_t k = 0; k < Rf_length(listnames); k++){
  if(!strcmp(CHAR(STRING_ELT(listnames, k)), targetname)){
    SET_VECTOR_ELT(out, i, col);
    UNPROTECT(1);
```
@halhen (Author):

I'm very inexperienced at integrating R and C. Please take extra care when reviewing my PROTECT() calls.

If the data contains a name that sorts before the smallest requested
name, the previous code would end up in an infinite loop.