Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression with Time.parse from 1.7 to 9k #3922

Closed
cprice404 opened this issue May 24, 2016 · 12 comments
Closed

Performance regression with Time.parse from 1.7 to 9k #3922

cprice404 opened this issue May 24, 2016 · 12 comments

Comments

@cprice404
Copy link
Contributor

Environment

JRuby 1.7.20.1, JRuby 9.1.0.0, JRuby 9.1.1.0; run from Java via ScriptingContainer, with various values tested for the setCompileMode method. Standalone script execution with no frameworks.

Expected Behavior

Expected performance for Time.parse to be on par with performance for 1.7.x.

Actual Behavior

Time.parse seems to be roughly 60% slower in JRuby 9k than in 1.7.x.

I have set up a github repo with a simple reproducer here:
https://github.com/cprice404/jruby9k-benchmarks/tree/19a2719e4edddc17e3f1fabbf311b9d80c6eba80/time-parse-comparison

Here are some relevant numbers from the benchmark output (the rest of it is available in the linked reproducer repro if you are interested):

JRUBY VERSION: 1.7.20.1
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.213000   0.000000   0.213000 (  0.213000)
middle (10000 runs):   1.179000   0.000000   1.179000 (  1.179000)
tail (100 runs):       0.009000   0.000000   0.009000 (  0.009000)

JRUBY VERSION: 9.1.1.0
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.730000   0.010000   0.740000 (  0.138776)
middle (10000 runs):   4.890000   0.300000   5.190000 (  1.905043)
tail (100 runs):       0.030000   0.010000   0.040000 (  0.019285)

Also tested with compileMode set to JIT; that doesn't seem to make much difference in 1.7, and makes the numbers look worse in 9k (though the benchmark may not have had a long enough execution time to give a fair characterization of JIT).

@headius
Copy link
Member

headius commented May 25, 2016

The interpreter in 9k is known to be slower (for now) so this and your other bug are not terribly surprising. However, these give us some good examples to work with. We would like to get get the interpreter running as fast as possible.

I am curious about your setup. Is there a reason you need to throw the ScriptingContainer away every time? Using a single container and letting things JIT should bring better performance.

@headius
Copy link
Member

headius commented May 25, 2016

Here's a JVM sampled profile of execution of your benchmark script, loaded repeatedly in a loop with -X-C to disable JIT. This should be roughly analogous to running through ScriptingContainer.

     Compiled + native   Method                        
 26.3%  1403  +     2    org.joni.ByteCodeMachine.matchAt
 11.3%   605  +     0    org.jruby.ir.interpreter.InterpreterEngine.interpret
  4.5%   241  +     0    org.jruby.ir.interpreter.InterpreterEngine.processCall
  4.2%   222  +     0    org.jruby.ir.interpreter.InterpreterEngine.processOtherOp
  3.7%   196  +     0    org.jruby.ir.interpreter.InterpreterEngine.processBookKeepingOp
  2.8%   147  +     0    org.joni.Matcher.forwardSearchRange
  2.6%   140  +     1    org.jruby.ir.instructions.CallBase.interpret
  2.3%   125  +     0    org.jruby.ir.instructions.BFalseInstr.interpretAndGetNewIPC
  2.0%   109  +     0    org.jruby.ir.IRScope.getFileName
  1.9%   100  +     0    org.jruby.RubySymbol$SymbolTable.getSymbol
  1.6%    84  +     0    org.jruby.ir.instructions.specialized.OneArgOperandAttrAssignInstr.interpret
  1.5%    82  +     0    org.jruby.RubyArray.join19
  1.4%    74  +     0    java.nio.charset.Charset.decode
  1.4%    73  +     0    org.jruby.RubyString.sub_bang19
  1.2%    64  +     0    org.jruby.RubyString.chompBangCommon19
  1.1%    60  +     0    org.jruby.RubyHash.op_aref
  1.1%    58  +     0    org.jruby.ir.instructions.AttrAssignInstr.interpret
  1.0%    55  +     0    org.jruby.util.TypeConverter.convertToType
  1.0%    53  +     0    org.joni.SearchAlgorithm$8.search

There's a very large amount of samples triggering inside our regexp engine Joni, and then a number of interpreter methods after that.

This also shows me that we are getting to the fully-processed IR and not staying in the simple, unoptimized IR (InterpreterEngine versus SimpleInterpreterEngine).

@headius
Copy link
Member

headius commented May 25, 2016

Interesting to note: JRuby 1.7 with JIT is about 2x faster on this benchmark than JRuby 9k with JIT...so there may be something other than Ruby execution slowing this down.

@headius
Copy link
Member

headius commented May 25, 2016

The profile for JIT is also interesting...it's almost looking like the important difference is in Regexp.

     Compiled + native   Method                        
 19.1%   630  +     1    org.joni.ByteCodeMachine.matchAt
 10.0%   332  +     0    org.joni.ByteCodeMachine.opExactNIC
  6.5%   216  +     0    org.joni.ByteCodeMachine.opExact1IC
  3.3%   106  +     3    Users.headius.projects.jruby.lib.ruby.stdlib.date.format.RUBY$method$method_missing$0
  2.8%    92  +     0    org.jruby.RubyRegexp.search19
  2.2%    72  +     0    org.jruby.RubyString$INVOKER$i$0$0$intern19.call
  2.0%    66  +     0    org.jruby.RubySymbol$SymbolTable.getSymbol
  1.7%    56  +     0    org.jruby.internal.runtime.methods.JavaMethod$JavaMethodOneOrTwoOrNBlock.call
  1.6%    52  +     0    org.jruby.RubyHash.internalGetEntry
  1.6%    52  +     0    org.jruby.RubyArray.join19
  1.5%    49  +     0    org.joni.Analyser.optimizeNodeLeft
  1.4%    47  +     0    org.joni.Matcher.forwardSearchRange
  1.3%    43  +     0    java.lang.invoke.LambdaForm$DMH.invokeStatic_L8_L
  1.1%    37  +     0    org.jruby.RubyString.gsubCommon19
  1.1%    36  +     0    org.joda.time.base.BaseDateTime.<init>
  1.1%    36  +     0    Users.headius.projects.jruby.lib.ruby.stdlib.date.format.RUBY$method$s3e$0
  1.0%    34  +     0    org.joni.Parser.parseExp
  1.0%    34  +     0    org.jruby.util.TypeConverter.convertToType
  1.0%    33  +     0    org.jruby.RubyHash.internalPut
  1.0%    32  +     0    org.jruby.RubyString$INVOKER$i$chomp_bang19.call

@headius
Copy link
Member

headius commented May 25, 2016

I gathered another profile using -J-Xrunhprof:cpu=samples,depth=20,interval=1 to run the "hprof" profiler shipped with OpenJDK. This takes samples on a finer grain than --sample (which I used for the other sampled profiles) and I can get stack traces from each point.

We are definitely hitting Regexp hard but the code in question has not changed significantly between 1.7 and 9k. Here's the Ruby code getting hit for the top item:

    if str.sub!(
                /(
                   (?:
                     \d+\s*:\s*\d+
                     (?:
                       \s*:\s*\d+(?:[,.]\d*)?
                     )?
                   |
                     \d+\s*h(?:\s*\d+m?(?:\s*\d+s?)?)?
                   )
                   (?:
                     \s*
                     [ap](?:m\b|\.m\.)
                   )?
                 |
                   \d+\s*[ap](?:m\b|\.m\.)
                 )
                 (?:
                   \s*
                   (
                     (?:gmt|utc?)?[-+]\d+(?:[,.:]\d+(?::\d+)?)?
                   |
                     [[:alpha:].\s]+(?:standard|daylight)\stime\b
                   |
                     [[:alpha:]]+(?:\sdst)?\b
                   )
                 )?
                /ix,
                ' ')

Multi-line regex always mess with my head, but this does appear to be a pretty complex regular expression.

I also did an allocation profile. The top item was ByteCodeMachine from Joni. I'll try to compare with 1.7 and see if anything jumps out.

@headius
Copy link
Member

headius commented May 25, 2016

A quick patch to cache the ByteCodeMachine for a given Regexp did not appear to improve performance any, so at least it doesn't seem like the construction of the machine is part of this cost.

@headius
Copy link
Member

headius commented May 25, 2016

@lopex For what it's worth, here's the patch I came up with in Joni for caching most-recently-used Matcher:

diff --git a/src/org/joni/ByteCodeMachine.java b/src/org/joni/ByteCodeMachine.java
index 5cf643e..dabf439 100644
--- a/src/org/joni/ByteCodeMachine.java
+++ b/src/org/joni/ByteCodeMachine.java
@@ -36,6 +36,8 @@ import org.joni.constants.OPSize;
 import org.joni.exception.ErrorMessages;
 import org.joni.exception.InternalException;

+import java.util.Arrays;
+
 class ByteCodeMachine extends StackMachine {
     private static final int INTERRUPT_CHECK_EVERY = 30000;
     int interruptCheckCounter = 0; // we modulos this to occasionally check for interrupts
diff --git a/src/org/joni/Matcher.java b/src/org/joni/Matcher.java
index 5b623fe..7aad74a 100644
--- a/src/org/joni/Matcher.java
+++ b/src/org/joni/Matcher.java
@@ -33,9 +33,9 @@ public abstract class Matcher extends IntHolder {
     protected final Regex regex;
     protected final Encoding enc;

-    protected final byte[]bytes;
-    protected final int str;
-    protected final int end;
+    protected byte[]bytes;
+    protected int str;
+    protected int end;

     protected int msaStart;
     protected int msaOptions;
@@ -83,6 +83,32 @@ public abstract class Matcher extends IntHolder {
         return msaEnd;
     }

+    public final Regex getRegex() {
+        return regex;
+    }
+
+    public final byte[] getBytes() {
+        return bytes;
+    }
+
+    public void init(byte[] bytes, int begin, int end) {
+        this.bytes = bytes;
+        this.str = begin;
+        this.end = end;
+    }
+
+    public void clear() {
+        bytes = null;
+        if (msaRegion != null) msaRegion.clear();
+        msaBegin = 0;
+        msaBestLen = 0;
+        msaBestS = 0;
+        msaEnd = 0;
+        msaOptions = 0;
+        this.str = 0;
+        this.end = 0;
+    }
+
     protected final void msaInit(int option, int start) {
         msaOptions = option;
         msaStart = start;
diff --git a/src/org/joni/Regex.java b/src/org/joni/Regex.java
index 3b77a3a..be6886e 100644
--- a/src/org/joni/Regex.java
+++ b/src/org/joni/Regex.java
@@ -25,6 +25,7 @@ import static org.joni.Option.isDontCaptureGroup;

 import java.util.IllegalFormatConversionException;
 import java.util.Iterator;
+import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

 import org.jcodings.Encoding;
 import org.jcodings.specific.ASCIIEncoding;
@@ -60,6 +61,9 @@ public final class Regex {
     public WarnCallback warnings;
     public MatcherFactory factory;

+    volatile Matcher lastMatcher;
+    static final AtomicReferenceFieldUpdater lastMatcherUpdater = AtomicReferenceFieldUpdater.newUpdater(Regex.class, Matcher.class, "lastMatcher");
+
     final Encoding enc;
     int options;
     int userOptions;
@@ -166,9 +170,19 @@ public final class Regex {
     }

     public Matcher matcher(byte[]bytes, int p, int end) {
+        Matcher last = lastMatcher;
+        if (last != null && lastMatcherUpdater.compareAndSet(this, last, null)) {
+            last.init(bytes, p, end);
+            return last;
+        }
         return factory.create(this, bytes, p, end);
     }

+    public void returnMatcher(Matcher matcher) {
+        matcher.clear(matcher);
+        lastMatcher = matcher;
+    }
+
     public int numberOfCaptures() {
         return numMem;
     }

@cprice404
Copy link
Contributor Author

cprice404 commented May 25, 2016

@headius thanks for looking at these - greatly appreciated.

I am curious about your setup. Is there a reason you need to throw the ScriptingContainer away every time? Using a single container and letting things JIT should bring better performance.

Unfortunately, there is a reason we need to terminate them. We don't do it on every single execution, but we have a setting that basically causes them to be destroyed and replaced after a certain period of time.

The reason is that we allow our users to extend our applications via Ruby, and many of our users don't have a deep background in software development. Therefore, it's not entirely unlikely that the Ruby code we end up loading will contain memory leaks, and throwing away the ScriptingContainer instances allows us to free up the leaked memory rather than eventually OOM'ing the application server.

Many of our users come from a world where the Ruby code was run as a Rack application under an Apache/Passenger setup. In that world, Passenger spawns a specific number of MRI Ruby processes to handle requests, and it will kill off and replace those MRI processes periodically, for the same reasons I described above. Because of that, memory leaks in user extension code don't tend to cause major problems, and that's a trait of our application that it's important for us to retain when running under JRuby.

@lopex
Copy link
Member

lopex commented Aug 25, 2016

@headius Is the one-off matcher still a suspect here ? I remember our irc discussion wrt earlier String implementations that were reusing the matcher.

@headius
Copy link
Member

headius commented Sep 26, 2016

Link with #4112.

@headius
Copy link
Member

headius commented Sep 26, 2016

@lopex I never managed to get my matcher-caching to improve perf more than noise effects. I don't suspect it as a primary bottleneck at this point.

@enebo
Copy link
Member

enebo commented Sep 19, 2018

think we finally caught up in upcoming 9.2.1.0:

Detected JRuby version 1.7.x, setting compatVersion 1.9.
Setting compileMode to: 'JIT'
JRUBY VERSION: 1.7.20.1
COMPILE MODE: jit
                           user     system      total        real
warmup (100 runs):     0.105000   0.000000   0.105000 (  0.105000)
middle (10000 runs):   0.772000   0.000000   0.772000 (  0.772000)
tail (100 runs):       0.005000   0.000000   0.005000 (  0.005000)
Detected JRuby version 1.7.x, setting compatVersion 1.9.
Setting compileMode to: 'OFF'
JRUBY VERSION: 1.7.20.1
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.086000   0.000000   0.086000 (  0.086000)
middle (10000 runs):   0.913000   0.000000   0.913000 (  0.912000)
tail (100 runs):       0.010000   0.000000   0.010000 (  0.010000)
JRUBY VERSION: 9.2.1.0-SNAPSHOT
COMPILE MODE: jit
                           user     system      total        real
warmup (100 runs):     0.460000   0.010000   0.470000 (  0.089430)
middle (10000 runs):   2.990000   0.010000   3.000000 (  0.648920)
tail (100 runs):       0.020000   0.000000   0.020000 (  0.004793)
Detected JRuby version 9k, skipping compatVersion.
Setting compileMode to: 'OFF'
JRUBY VERSION: 9.2.1.0-SNAPSHOT
COMPILE MODE: off
                           user     system      total        real
warmup (100 runs):     0.380000   0.000000   0.380000 (  0.067444)
middle (10000 runs):   2.770000   0.020000   2.790000 (  0.628573)
tail (100 runs):       0.030000   0.000000   0.030000 (  0.006038)

@enebo enebo closed this as completed Sep 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants