Speed Optimizations #6

Closed
wants to merge 1 commit into from

Conversation

@metabench

Please note: the code is not tidied up; there are lots of commented-out console.log statements, and I have left in code that is yet to be deleted. I'm still working on the code and plan to get more optimizations done. I didn't want to rearrange the code yet, but wanted to send what I had once I had seen very large speed increases (about 334%, depending on how it's calculated).

----------------------------------------

Have replaced the use of many Int32Arrays with a single 'matrix'
Int32Array for each component. Created new versions of functions that
interact with that array.

Removed the 'buffer copy' code at the end by writing
getData_rgb_to_rgba_buffer, which writes directly to an RGBA buffer.

In the test I did, the observed decoding time decreased from 3300ms to 760ms.

I also had it precompute a few variables that get used multiple times, like XscaleX and YscaleY. I don't know whether those other optimizations really sped it up; it was changing to the blocks_matrix that really made the difference, and then making it write the result directly to the RGBA Node Buffer made a big difference too.

There are further optimizations that can yet be made, to do with better
use of typed arrays, and not allocating various variables multiple
times, but reusing space in typed arrays.
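
(To illustrate the two changes described above: a sketch of the single-matrix layout and the direct RGBA write; all names here, such as make_blocks_matrix and rgb_to_rgba_write, are illustrative rather than the PR's actual ones.)

    // One Int32Array per component instead of one small Int32Array per
    // 8x8 block; each block occupies a contiguous run of 64 coefficients.
    function make_blocks_matrix(blocksPerLine, blocksPerColumn) {
        return new Int32Array(blocksPerLine * blocksPerColumn * 64);
    }

    // Locating a block becomes plain index arithmetic, so the rewritten
    // functions take a matrix_offset instead of a per-block array.
    function block_offset(blocksPerLine, blockRow, blockCol) {
        return (blockRow * blocksPerLine + blockCol) * 64;
    }

    // Simplified picture of the RGBA change: fill the 4-byte pixels
    // directly rather than building an RGB buffer and copying it later.
    function rgb_to_rgba_write(rgb, rgba) {
        for (var i = 0, j = 0; i < rgb.length; i += 3, j += 4) {
            rgba[j] = rgb[i];           // R
            rgba[j + 1] = rgb[i + 1];   // G
            rgba[j + 2] = rgb[i + 2];   // B
            rgba[j + 3] = 255;          // opaque alpha
        }
    }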

@metabench
Author

I think I left out this function:

        // Decodes the first scan of AC coefficients in a progressive JPEG,
        // writing into the component's shared blocks_matrix at matrix_offset
        // instead of into a per-block Int32Array. Relies on decoder-scope
        // state: eobrun, spectralStart, spectralEnd, successive, dctZigZag.
        function matrix_decodeACFirst(component, matrix_offset) {
            if (eobrun > 0) {
                // Still inside an end-of-band run: this block stays zero.
                eobrun--;
                return;
            }
            var k = spectralStart, e = spectralEnd;
            while (k <= e) {
                var rs = decodeHuffman(component.huffmanTableAC);
                var s = rs & 15, r = rs >> 4; // s: magnitude bits, r: zero run length
                if (s === 0) {
                    if (r < 15) {
                        // EOB run: the remaining coefficients (and blocks) are zero.
                        eobrun = receive(r) + (1 << r) - 1;
                        break;
                    }
                    k += 16; // ZRL: skip sixteen zero coefficients
                    continue;
                }
                k += r;
                var z = dctZigZag[k];
                component.blocks_matrix[z + matrix_offset] = receiveAndExtend(s) * (1 << successive);
                k++;
            }
        }

@eugeneware
Collaborator

This is looking great @metabench! Let me know when it's ready for a CR / merge.

@metabench
Author

I gather a CR is a code review, but I don't know how that feature works on GitHub / git.

I have done more optimization, to do with making a lines_matrix for the components rather than having all those separate arrays for each line. However, so far I have only got my test case down to about 710ms.

Also, the code I am now working on has a different API. I now have

module.exports = {
    'decode': decode,
    'size': size
};

I wanted to be able to get the size of the image without doing as much decoding.
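
(Illustrative usage of that API; the require path and the exact shape of the object returned by size are assumptions:)

    var fs = require('fs');
    var jpeg = require('./jsgui_decoder');     // hypothetical local path

    var jpegData = fs.readFileSync('photo.jpg');
    var dims = jpeg.size(jpegData);            // reads headers only, no full decode
    console.log(dims.width, dims.height);      // assuming a { width, height } result
    var image = jpeg.decode(jpegData);         // full decode when pixels are needed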

Now I am thinking about some more optimizations:

Keeping all of the components' lines in one typed array.

Possibly consolidating the scale variables for each component into one, but do you know if the components will always have the same scale?

I'm going to be doing a bit more work on this soon, but I'm not that confident I will get significant speed improvements.

@eugeneware
Collaborator

Hi @metabench - happy to merge it once you've done the tidy up and it passes the test suite.

If we change the public API we'll need to bump the major version of course.

@metabench
Author

Hi @eugeneware,

I've made more changes to the code, trying to get more optimizations, and it seems as though I have sped it up a really small amount more, but the code is a lot larger now and harder to follow.

I'm not planning on doing any tidying up for this JPEG code, but am happy to share the code I am working on. My opinion at the moment is that the structure of the Huffman tree in the optimized code I have been working on is still very suboptimal with respect to the JavaScript runtime. Storing the data in a flat structure would likely speed things up a lot.

Perhaps this codebase could be a starting point for re-engineering the Huffman tree, but I'm busy with other things at the moment, and having spent a little while focusing on JPEG programming and gotten results I can use, I have moved on to other things. At some point I'm likely to do more work on JPEGs, but how much work I do using this codebase remains to be seen. I'm quite interested in writing in Haxe, so it can be compiled to optimized JavaScript and to other languages such as C++, which could then be compiled and run quickly.
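
(To make the flat-structure idea concrete, here is one possible layout, not taken from this code: each tree node becomes two slots in a single Int32Array, the child for bit 0 and the child for bit 1, with negative entries encoding leaf symbols, so decoding is index arithmetic instead of object traversal.)

    // flat[2*i] / flat[2*i + 1]: index of node i's child for bit 0 / bit 1.
    // A negative entry encodes a leaf; the symbol is -(entry + 1).
    // readBit is assumed to return the next bit (0 or 1) from the stream.
    function flat_decodeHuffman(flat, readBit) {
        var node = 0;
        while (true) {
            var next = flat[2 * node + readBit()];
            if (next < 0) return -(next + 1);   // reached a leaf: decoded symbol
            node = next;                        // internal node: keep walking
        }
    }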

@eugeneware
Collaborator

Hi @metabench, thanks for this. I think you're right, probably not worth complicating the code base for a small performance increase.

Certainly put the code in as a PR for future reference for others to pick up and work on - you never know!

I've been using emscripten with good results for doing audio encoding work, so potentially porting some of libjpeg or something like that might also be a good way forward for a performant implementation given that asm.js is becoming a thing, and even v8 looks to be tuning for it in the next iteration of their JIT compiler (TurboFan).

BTW, thanks so much for your work!

@metabench
Author

Hi @eugeneware,

I'm going to have another look at the JPEG decoder; it occurred to me that I could make an object-oriented Huffman tree object that internally uses a flat data structure. I don't want to send a pull request without having first tested it a little bit, and I have not had time to do that recently. If anyone wants a look at the latest published version of the code I have written, it's at https://github.com/metabench/jsgui/blob/master/js/image/node/jpeg-js/jsgui_decoder.js.

Regarding emscripten, I think it's a very good idea. Right at the moment I don't think I have the skills to compile libjpeg using it; I've not done any work with emscripten, but I want to get into that way of doing things.

Also, my objective has been to get fast JPEG access in Node, so compiling libjpeg so it can be called from Node would be useful to me, but that seems outside the scope of this project, as implied by its name.

How tricky / time-consuming does it look to make use of libjpeg, either with Node C bindings or by compiling through emscripten?

Can we talk some more about compiling libjpeg (or even libjpeg-turbo) to work with Node, despite it most likely being outside of this project's scope?

I also think that a really good blog article or gist could be made about compiling libjpeg to JavaScript. For me it would be a learning experience, but it would be a great help to collaborate on compiling libjpeg to JavaScript and to write about how we did it.

@eugeneware
Collaborator

Hi @metabench - good luck on the huffman tree approach. It looks promising!

On the emscripten side of things: we have done a bit of that at our company to port MP3 and various audio codecs to JavaScript. Another dev on my team did it, so it's still a bit of black magic to me TBH.

It definitely is possible, though, as I've seen ffmpeg and mpeg4 codecs get ported over with emscripten.

A quick round of googling didn't unearth much in terms of others having success in building it with emscripten.

I did find this though: https://github.com/jvrousseau/openjpeg

But there's no doco about how to use it, and he didn't include the original source code and makefile.

From our experience you end up with a JavaScript API that is the same as the C API, and then you wrap a nice JS-friendly API around that.

Often it's good to write a simple C program that has a nice higher level API that calls the underlying C libraries.

Which really means you end up writing a bit of C code to get it to work. So it does help to have some C experience.

@metabench
Author

My C experience is limited to writing PNG encoding and decoding in C, calling it from Node. It's here: https://github.com/metabench/jsgui/blob/master/js/image/node/c/binding.cpp. I was surprised that I managed to write something actually useful and fast as a beginner at C.

Using C for this kind of thing with JPEGs is where I'd like to expand my C experience.

Is this other developer on your team in a position to put any time at all into helping to solve this problem?

@eugeneware
Collaborator

Hey @metabench - great job on the native binding!

I spoke with the dev on our team, and he's pretty busy at the moment, unfortunately :-(

He did say that it was pretty straightforward. He suggested:

  1. Write a basic C program that has an interface close to what you want to call in JS.
  2. Then build with emscripten, using emconfigure and emmake in place of configure and make.
  3. You'll then end up with a big JS file with some entry points, which you can then wrap in your node.js module (sketched below).
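
(A sketch of what step 3 can look like, using emscripten's standard cwrap helper; the file name libjpeg.js and the exported decode_jpeg function are hypothetical, and this assumes the output was built so it can be require()d from node:)

    // libjpeg.js is the emscripten output from step 2 (hypothetical name).
    var Module = require('./libjpeg.js');

    // cwrap binds an exported C function to a plain JS function.
    // The signature here is made up: (inPtr, inLen, outPtr, outLen) -> status.
    var decode_jpeg = Module.cwrap('decode_jpeg', 'number',
        ['number', 'number', 'number', 'number']);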

@metabench
Author

Hi @eugeneware (and anyone else interested),

Recently I've been busy working on a different project involving optimized JavaScript. I wound up making an OO structure to access a TypedArray, with functions to get and set values. I found it logical to program with, but it ran a lot slower than I wanted. I then inlined various get and set function calls, and it ran really quickly. I'm still using the class that contains a TypedArray inside, but referring directly to its Float64Array.
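
(Roughly the pattern described, with made-up names; the Grid class and sum function are illustrative, not from the project:)

    // OO wrapper: readable, but each access pays a function call.
    function Grid(w, h) {
        this.w = w;
        this.ta = new Float64Array(w * h);
    }
    Grid.prototype.get = function (x, y) { return this.ta[y * this.w + x]; };
    Grid.prototype.set = function (x, y, v) { this.ta[y * this.w + x] = v; };

    // Inlined hot loop: keep the class for structure, but refer directly
    // to the underlying Float64Array inside the loop.
    function sum(grid) {
        var ta = grid.ta, total = 0;
        for (var i = 0; i < ta.length; i++) total += ta[i];
        return total;
    }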

I've given up on replacing local variables with typed array references. It's hard to keep track of, and I think the V8 optimizer is able to deal efficiently with local variables that are only used for one type. I had previously been thinking more in terms of the asm.js style (or my misunderstanding of it), but I'm writing the code for the V8 engine, which does not focus on the asm.js optimizations anyway.

I think this is the technique that should be used to optimize the Huffman table, if it is to get done. I'm not planning on doing it all that soon, maybe sometime though.

@metabench
Author

So I think this pull request is ready to merge; at the least, I don't think I'll be able to send an algorithmically improved version of it within the next few months.

@eugeneware
Collaborator

Hi @metabench - thanks for the update! Good luck with your TypedArray optimization in the future.

The current PR doesn't rebase cleanly onto HEAD. If you could rebase and do a pass to remove any debug statements and commented-out code, then we can merge and release.

We should probably bump the major version too, just in case, given the size of the change.

@metabench
Author

I'm much better at optimizing JavaScript than I am at using git and rebasing anything. Would it be OK for you to rebase the code that I have, or to provide more detailed instructions about how I would do the rebase?

@neophob

neophob commented Sep 13, 2015

+1

@Lurk mentioned this pull request Mar 24, 2016
@benwiley4000
Collaborator

@metabench I'd like to help you get this merged into master. Would you be able to make sure the "allow edits from maintainers" box is checked on the right side of this PR?

[screenshot: the "allow edits from maintainers" checkbox]

@benwiley4000
Collaborator

@metabench I also could be pushing my luck here, but do you still have access to the benchmarks you used to test these optimizations? 😃

@patrickhulce
Collaborator

@benwiley4000 is your plan to try and minimize the diff here? It's quite hard to check what's going on with all the whitespace changes, and it feels quite risky in its current state :)

Definitely worth re-benchmarking too to make sure there's a significant improvement in perf in modern JS engines.

@benwiley4000
Collaborator

benwiley4000 commented Jul 16, 2017

Right, I'd definitely like to run some benchmarks.

My hope was to first merge master into this branch, and then see if we can distill the changes down to what's most useful. Not totally sure yet we should merge, but seems like it's worth some effort if it really provides a >300% speed boost on decode.

ETA: Also maybe this should wait at least until resolution of #30.

@metabench
Author

@benwiley4000 I've allowed edits from maintainers. I don't have these benchmarks immediately available. If I find them, I could add them to the project. Alternatively, I could redo them or describe them; they were not substantial pieces of code, just running an operation a few times (such as 10) and taking timer measurements.

It's possible there have been optimizations on the master branch since my work, so I can't be sure that it would be >300%, as the measurements were taken at an earlier stage in the codebase. V8 was less optimized back in August 2014, so I'd expect different benchmark results.

@benwiley4000
Collaborator

@metabench thanks! if you could at least describe them, that would be great 🙂

@metabench
Author

metabench commented Jul 26, 2017

@benwiley4000 From my recollection, I found a large, high-resolution, and (to my perception) high-quality image of Mount Fuji, and used that as input to the decoder testers. I can't recall whether a single pass of decoding was normally so slow (around 3.5s) that I only needed one pass in the benchmarks.

I made two separate programs to perform the same operations using both the old and new code. Time differences were taken and the results used for comparison. I was careful to take time measurements only at the points in the code where the image decoding was actually running, rather than during library initialization. I had the programs save the images in a different format, such as bitmap, to check that the images were not corrupted.
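
(A minimal version of that kind of harness; the decoder path and image name are placeholders:)

    var fs = require('fs');
    var jpeg = require('./decoder');           // old or new decoder under test

    var data = fs.readFileSync('fuji.jpg');    // large high-resolution test image
    var runs = 10;
    var start = Date.now();                    // time only the decoding itself
    for (var i = 0; i < runs; i++) jpeg.decode(data);
    console.log(((Date.now() - start) / runs) + ' ms per decode');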

    p.children[p.index] = q.children;
    p = q;
    }
    "use strict";
Collaborator

What happened to these lines?

Author

I don't know. It does not look like I made any significant or useful changes to that part. Maybe they were some formatting changes that my IDE automatically made.

@patrickhulce
Collaborator

Thanks very much for your contributions @metabench! Since this has become quite stale, I'm going to go ahead and close for now.

If you're able to generate a cleaner diff again, we'd be happy to reopen discussion around it :)
