Performance regression with `Time.parse` from 1.7 to 9k #3922

cprice404 · 2016-05-24T23:37:26Z

Environment

JRuby 1.7.20.1, JRuby 9.1.0.0, JRuby 9.1.1.0; run from Java via ScriptingContainer, with various values tested for the setCompileMode method. Standalone script execution with no frameworks.

Expected Behavior

Expected performance for Time.parse to be on par with performance for 1.7.x.

Actual Behavior

Time.parse seems to be roughly 60% slower in JRuby 9k than in 1.7.x.

I have set up a github repo with a simple reproducer here:
https://github.com/cprice404/jruby9k-benchmarks/tree/19a2719e4edddc17e3f1fabbf311b9d80c6eba80/time-parse-comparison

Here are some relevant numbers from the benchmark output (the rest of it is available in the linked reproducer repro if you are interested):

JRUBY VERSION: 1.7.20.1
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.213000   0.000000   0.213000 (  0.213000)
middle (10000 runs):   1.179000   0.000000   1.179000 (  1.179000)
tail (100 runs):       0.009000   0.000000   0.009000 (  0.009000)

JRUBY VERSION: 9.1.1.0
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.730000   0.010000   0.740000 (  0.138776)
middle (10000 runs):   4.890000   0.300000   5.190000 (  1.905043)
tail (100 runs):       0.030000   0.010000   0.040000 (  0.019285)

Also tested with compileMode set to JIT; that doesn't seem to make much difference in 1.7, and makes the numbers look worse in 9k (though the benchmark may not have had a long enough execution time to give a fair characterization of JIT).

The text was updated successfully, but these errors were encountered:

headius · 2016-05-25T03:57:10Z

The interpreter in 9k is known to be slower (for now) so this and your other bug are not terribly surprising. However, these give us some good examples to work with. We would like to get get the interpreter running as fast as possible.

I am curious about your setup. Is there a reason you need to throw the ScriptingContainer away every time? Using a single container and letting things JIT should bring better performance.

headius · 2016-05-25T04:08:48Z

Here's a JVM sampled profile of execution of your benchmark script, loaded repeatedly in a loop with -X-C to disable JIT. This should be roughly analogous to running through ScriptingContainer.

     Compiled + native   Method                        
 26.3%  1403  +     2    org.joni.ByteCodeMachine.matchAt
 11.3%   605  +     0    org.jruby.ir.interpreter.InterpreterEngine.interpret
  4.5%   241  +     0    org.jruby.ir.interpreter.InterpreterEngine.processCall
  4.2%   222  +     0    org.jruby.ir.interpreter.InterpreterEngine.processOtherOp
  3.7%   196  +     0    org.jruby.ir.interpreter.InterpreterEngine.processBookKeepingOp
  2.8%   147  +     0    org.joni.Matcher.forwardSearchRange
  2.6%   140  +     1    org.jruby.ir.instructions.CallBase.interpret
  2.3%   125  +     0    org.jruby.ir.instructions.BFalseInstr.interpretAndGetNewIPC
  2.0%   109  +     0    org.jruby.ir.IRScope.getFileName
  1.9%   100  +     0    org.jruby.RubySymbol$SymbolTable.getSymbol
  1.6%    84  +     0    org.jruby.ir.instructions.specialized.OneArgOperandAttrAssignInstr.interpret
  1.5%    82  +     0    org.jruby.RubyArray.join19
  1.4%    74  +     0    java.nio.charset.Charset.decode
  1.4%    73  +     0    org.jruby.RubyString.sub_bang19
  1.2%    64  +     0    org.jruby.RubyString.chompBangCommon19
  1.1%    60  +     0    org.jruby.RubyHash.op_aref
  1.1%    58  +     0    org.jruby.ir.instructions.AttrAssignInstr.interpret
  1.0%    55  +     0    org.jruby.util.TypeConverter.convertToType
  1.0%    53  +     0    org.joni.SearchAlgorithm$8.search

There's a very large amount of samples triggering inside our regexp engine Joni, and then a number of interpreter methods after that.

This also shows me that we are getting to the fully-processed IR and not staying in the simple, unoptimized IR (InterpreterEngine versus SimpleInterpreterEngine).

headius · 2016-05-25T04:26:01Z

Interesting to note: JRuby 1.7 with JIT is about 2x faster on this benchmark than JRuby 9k with JIT...so there may be something other than Ruby execution slowing this down.

headius · 2016-05-25T04:27:51Z

The profile for JIT is also interesting...it's almost looking like the important difference is in Regexp.

     Compiled + native   Method                        
 19.1%   630  +     1    org.joni.ByteCodeMachine.matchAt
 10.0%   332  +     0    org.joni.ByteCodeMachine.opExactNIC
  6.5%   216  +     0    org.joni.ByteCodeMachine.opExact1IC
  3.3%   106  +     3    Users.headius.projects.jruby.lib.ruby.stdlib.date.format.RUBY$method$method_missing$0
  2.8%    92  +     0    org.jruby.RubyRegexp.search19
  2.2%    72  +     0    org.jruby.RubyString$INVOKER$i$0$0$intern19.call
  2.0%    66  +     0    org.jruby.RubySymbol$SymbolTable.getSymbol
  1.7%    56  +     0    org.jruby.internal.runtime.methods.JavaMethod$JavaMethodOneOrTwoOrNBlock.call
  1.6%    52  +     0    org.jruby.RubyHash.internalGetEntry
  1.6%    52  +     0    org.jruby.RubyArray.join19
  1.5%    49  +     0    org.joni.Analyser.optimizeNodeLeft
  1.4%    47  +     0    org.joni.Matcher.forwardSearchRange
  1.3%    43  +     0    java.lang.invoke.LambdaForm$DMH.invokeStatic_L8_L
  1.1%    37  +     0    org.jruby.RubyString.gsubCommon19
  1.1%    36  +     0    org.joda.time.base.BaseDateTime.<init>
  1.1%    36  +     0    Users.headius.projects.jruby.lib.ruby.stdlib.date.format.RUBY$method$s3e$0
  1.0%    34  +     0    org.joni.Parser.parseExp
  1.0%    34  +     0    org.jruby.util.TypeConverter.convertToType
  1.0%    33  +     0    org.jruby.RubyHash.internalPut
  1.0%    32  +     0    org.jruby.RubyString$INVOKER$i$chomp_bang19.call

headius · 2016-05-25T04:40:04Z

I gathered another profile using -J-Xrunhprof:cpu=samples,depth=20,interval=1 to run the "hprof" profiler shipped with OpenJDK. This takes samples on a finer grain than --sample (which I used for the other sampled profiles) and I can get stack traces from each point.

We are definitely hitting Regexp hard but the code in question has not changed significantly between 1.7 and 9k. Here's the Ruby code getting hit for the top item:

    if str.sub!(
                /(
                   (?:
                     \d+\s*:\s*\d+
                     (?:
                       \s*:\s*\d+(?:[,.]\d*)?
                     )?
                   |
                     \d+\s*h(?:\s*\d+m?(?:\s*\d+s?)?)?
                   )
                   (?:
                     \s*
                     [ap](?:m\b|\.m\.)
                   )?
                 |
                   \d+\s*[ap](?:m\b|\.m\.)
                 )
                 (?:
                   \s*
                   (
                     (?:gmt|utc?)?[-+]\d+(?:[,.:]\d+(?::\d+)?)?
                   |
                     [[:alpha:].\s]+(?:standard|daylight)\stime\b
                   |
                     [[:alpha:]]+(?:\sdst)?\b
                   )
                 )?
                /ix,
                ' ')

Multi-line regex always mess with my head, but this does appear to be a pretty complex regular expression.

I also did an allocation profile. The top item was ByteCodeMachine from Joni. I'll try to compare with 1.7 and see if anything jumps out.

headius · 2016-05-25T05:08:44Z

A quick patch to cache the ByteCodeMachine for a given Regexp did not appear to improve performance any, so at least it doesn't seem like the construction of the machine is part of this cost.

headius · 2016-05-25T05:12:13Z

@lopex For what it's worth, here's the patch I came up with in Joni for caching most-recently-used Matcher:

diff --git a/src/org/joni/ByteCodeMachine.java b/src/org/joni/ByteCodeMachine.java
index 5cf643e..dabf439 100644
--- a/src/org/joni/ByteCodeMachine.java
+++ b/src/org/joni/ByteCodeMachine.java
@@ -36,6 +36,8 @@ import org.joni.constants.OPSize;
 import org.joni.exception.ErrorMessages;
 import org.joni.exception.InternalException;

+import java.util.Arrays;
+
 class ByteCodeMachine extends StackMachine {
     private static final int INTERRUPT_CHECK_EVERY = 30000;
     int interruptCheckCounter = 0; // we modulos this to occasionally check for interrupts
diff --git a/src/org/joni/Matcher.java b/src/org/joni/Matcher.java
index 5b623fe..7aad74a 100644
--- a/src/org/joni/Matcher.java
+++ b/src/org/joni/Matcher.java
@@ -33,9 +33,9 @@ public abstract class Matcher extends IntHolder {
     protected final Regex regex;
     protected final Encoding enc;

-    protected final byte[]bytes;
-    protected final int str;
-    protected final int end;
+    protected byte[]bytes;
+    protected int str;
+    protected int end;

     protected int msaStart;
     protected int msaOptions;
@@ -83,6 +83,32 @@ public abstract class Matcher extends IntHolder {
         return msaEnd;
     }

+    public final Regex getRegex() {
+        return regex;
+    }
+
+    public final byte[] getBytes() {
+        return bytes;
+    }
+
+    public void init(byte[] bytes, int begin, int end) {
+        this.bytes = bytes;
+        this.str = begin;
+        this.end = end;
+    }
+
+    public void clear() {
+        bytes = null;
+        if (msaRegion != null) msaRegion.clear();
+        msaBegin = 0;
+        msaBestLen = 0;
+        msaBestS = 0;
+        msaEnd = 0;
+        msaOptions = 0;
+        this.str = 0;
+        this.end = 0;
+    }
+
     protected final void msaInit(int option, int start) {
         msaOptions = option;
         msaStart = start;
diff --git a/src/org/joni/Regex.java b/src/org/joni/Regex.java
index 3b77a3a..be6886e 100644
--- a/src/org/joni/Regex.java
+++ b/src/org/joni/Regex.java
@@ -25,6 +25,7 @@ import static org.joni.Option.isDontCaptureGroup;

 import java.util.IllegalFormatConversionException;
 import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

 import org.jcodings.Encoding;
 import org.jcodings.specific.ASCIIEncoding;
@@ -60,6 +61,9 @@ public final class Regex {
     public WarnCallback warnings;
     public MatcherFactory factory;

+    volatile Matcher lastMatcher;
+    static final AtomicReferenceFieldUpdater lastMatcherUpdater = AtomicReferenceFieldUpdater.newUpdater(Regex.class, Matcher.class, "lastMatcher");
+
     final Encoding enc;
     int options;
     int userOptions;
@@ -166,9 +170,19 @@ public final class Regex {
     }

     public Matcher matcher(byte[]bytes, int p, int end) {
+        Matcher last = lastMatcher;
+        if (last != null && lastMatcherUpdater.compareAndSet(this, last, null)) {
+            last.init(bytes, p, end);
+            return last;
+        }
         return factory.create(this, bytes, p, end);
     }

+    public void returnMatcher(Matcher matcher) {
+        matcher.clear(matcher);
+        lastMatcher = matcher;
+    }
+
     public int numberOfCaptures() {
         return numMem;
     }

cprice404 · 2016-05-25T19:58:48Z

@headius thanks for looking at these - greatly appreciated.

I am curious about your setup. Is there a reason you need to throw the ScriptingContainer away every time? Using a single container and letting things JIT should bring better performance.

Unfortunately, there is a reason we need to terminate them. We don't do it on every single execution, but we have a setting that basically causes them to be destroyed and replaced after a certain period of time.

The reason is that we allow our users to extend our applications via Ruby, and many of our users don't have a deep background in software development. Therefore, it's not entirely unlikely that the Ruby code we end up loading will contain memory leaks, and throwing away the ScriptingContainer instances allows us to free up the leaked memory rather than eventually OOM'ing the application server.

Many of our users come from a world where the Ruby code was run as a Rack application under an Apache/Passenger setup. In that world, Passenger spawns a specific number of MRI Ruby processes to handle requests, and it will kill off and replace those MRI processes periodically, for the same reasons I described above. Because of that, memory leaks in user extension code don't tend to cause major problems, and that's a trait of our application that it's important for us to retain when running under JRuby.

lopex · 2016-08-25T19:33:48Z

@headius Is the one-off matcher still a suspect here ? I remember our irc discussion wrt earlier String implementations that were reusing the matcher.

headius · 2016-09-26T23:18:24Z

Link with #4112.

headius · 2016-09-26T23:18:50Z

@lopex I never managed to get my matcher-caching to improve perf more than noise effects. I don't suspect it as a primary bottleneck at this point.

enebo · 2018-09-19T17:33:42Z

think we finally caught up in upcoming 9.2.1.0:

Detected JRuby version 1.7.x, setting compatVersion 1.9.
Setting compileMode to: 'JIT'
JRUBY VERSION: 1.7.20.1
COMPILE MODE: jit
                           user     system      total        real
warmup (100 runs):     0.105000   0.000000   0.105000 (  0.105000)
middle (10000 runs):   0.772000   0.000000   0.772000 (  0.772000)
tail (100 runs):       0.005000   0.000000   0.005000 (  0.005000)
Detected JRuby version 1.7.x, setting compatVersion 1.9.
Setting compileMode to: 'OFF'
JRUBY VERSION: 1.7.20.1
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.086000   0.000000   0.086000 (  0.086000)
middle (10000 runs):   0.913000   0.000000   0.913000 (  0.912000)
tail (100 runs):       0.010000   0.000000   0.010000 (  0.010000)
JRUBY VERSION: 9.2.1.0-SNAPSHOT
COMPILE MODE: jit
                           user     system      total        real
warmup (100 runs):     0.460000   0.010000   0.470000 (  0.089430)
middle (10000 runs):   2.990000   0.010000   3.000000 (  0.648920)
tail (100 runs):       0.020000   0.000000   0.020000 (  0.004793)
Detected JRuby version 9k, skipping compatVersion.
Setting compileMode to: 'OFF'
JRUBY VERSION: 9.2.1.0-SNAPSHOT
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.380000   0.000000   0.380000 (  0.067444)
middle (10000 runs):   2.770000   0.020000   2.790000 (  0.628573)
tail (100 runs):       0.030000   0.000000   0.030000 (  0.006038)

cprice404 mentioned this issue Aug 25, 2016

End-of-life for JRuby 1.7.x #4112

Closed

headius added JRuby 9000 performance regression labels Dec 15, 2016

headius added this to the JRuby 9.2.0.0 milestone Dec 15, 2016

headius modified the milestones: JRuby 9.2.0.0, JRuby 9.2.1.0 May 15, 2018

enebo closed this as completed Sep 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regression with `Time.parse` from 1.7 to 9k #3922

Performance regression with `Time.parse` from 1.7 to 9k #3922

cprice404 commented May 24, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

cprice404 commented May 25, 2016 •

edited

lopex commented Aug 25, 2016

headius commented Sep 26, 2016

headius commented Sep 26, 2016

enebo commented Sep 19, 2018

Performance regression with Time.parse from 1.7 to 9k #3922

Performance regression with Time.parse from 1.7 to 9k #3922

Comments

cprice404 commented May 24, 2016

Environment

Expected Behavior

Actual Behavior

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

headius commented May 25, 2016

cprice404 commented May 25, 2016 • edited

lopex commented Aug 25, 2016

headius commented Sep 26, 2016

headius commented Sep 26, 2016

enebo commented Sep 19, 2018

Performance regression with `Time.parse` from 1.7 to 9k #3922

Performance regression with `Time.parse` from 1.7 to 9k #3922

cprice404 commented May 25, 2016 •

edited