
Implementation suggestion for complex multiplication to improve performance #91

Open
Fulguritude opened this issue Jul 4, 2021 · 5 comments

Comments

@Fulguritude

The current implementation of complex multiplication uses the "naive" algorithm (see lib.rs, lines 683-692):

// (a + i b) * (c + i d) == (a*c - b*d) + i (a*d + b*c)
impl<T: Clone + Num> Mul<Complex<T>> for Complex<T> {
    type Output = Self;

    #[inline]
    fn mul(self, other: Self) -> Self::Output {
        let re = self.re.clone() * other.re.clone() - self.im.clone() * other.im.clone();
        let im = self.re * other.im + self.im * other.re;
        Self::Output::new(re, im)
    }
}

There is a slightly faster alternative for complex numbers, a variation on the Karatsuba algorithm. It uses only 3 multiplications instead of 4, at the cost of 5 additions/subtractions instead of 2. On platforms where addition and multiplication cost the same it might not be an improvement, but I've consistently seen this algorithm improve performance on various platforms in C (though I'm not familiar with the Rust compiler, so you might want to test things with some benchmarks).

This is the algorithm's pseudocode:

    For z =  a + ib and z' = c + id 
    k1 = c · (a + b)
    k2 = a · (d − c)
    k3 = b · (c + d)
    Re(zz') = k1 − k3
    Im(zz') = k1 + k2

source: https://en.wikipedia.org/wiki/Multiplication_algorithm#Complex_multiplication_algorithm
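As a quick sanity check of the identity, here is a standalone sketch (free functions on f64 for illustration, not the num-complex API):

```rust
// Verify that the 3-multiplication formula matches the naive product.
// (a + ib) * (c + id), each complex number represented as a tuple (re, im).
fn mul_naive(a: f64, b: f64, c: f64, d: f64) -> (f64, f64) {
    (a * c - b * d, a * d + b * c)
}

fn mul_karatsuba(a: f64, b: f64, c: f64, d: f64) -> (f64, f64) {
    let k1 = c * (a + b);
    let k2 = a * (d - c);
    let k3 = b * (c + d);
    // Expanding k1 - k3 gives ac - bd, and k1 + k2 gives ad + bc.
    (k1 - k3, k1 + k2)
}

fn main() {
    let (a, b, c, d) = (1.5, -2.0, 0.5, 3.0);
    let (nr, ni) = mul_naive(a, b, c, d);
    let (kr, ki) = mul_karatsuba(a, b, c, d);
    assert!((nr - kr).abs() < 1e-12 && (ni - ki).abs() < 1e-12);
    println!("naive = ({nr}, {ni}), karatsuba = ({kr}, {ki})");
}
```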

If I'm not mistaken, this would correspond to the following code. (I'm currently learning Rust by drawing fractals, which is how I noticed this possible performance improvement, but I'm not sure where to .clone() or not, so I put it everywhere.)

    #[inline]
    fn mul(self, other: Self) -> Self::Output {
        let k1 = other.re.clone() * (self.re.clone() + self.im.clone());
        let k2 = self.re.clone() * (other.im.clone() - other.re.clone());
        let k3 = self.im.clone() * (other.re.clone() + other.im.clone());
        let re = k1 - k3;
        let im = k1 + k2;
        Self::Output::new(re, im)
    }
@cuviper
Member

cuviper commented Jul 8, 2021

A few thoughts:

  • k2 = a · (d − c) may be negative even when the result will only have positive parts.

    • That's potentially an issue for Complex<T> with unsigned T, though I don't know if anyone realistically uses that.
    • This could be avoided by branching: if c <= d { /* normal k2 */ } else { /* flip to c-d and then subtract k2 */ }
  • More generally, this changes the overflow conditions, but it's not necessarily better or worse.

    • i.e. some inputs will overflow intermediate values in one algorithm but not the other, in both directions.
    • That would mostly affect Gaussian integers, but floating-point Complex could also be near infinity or zero-underflow.
    • Maybe this warrants waiting for a breaking-change release? Or maybe nobody would notice...
  • Do you have any code which can be benchmarked with this change?
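The branching idea in the first bullet could be sketched like this for a concrete unsigned scalar (u32, hypothetical free function; note the re subtraction keeps the same underflow caveat as the naive formula whenever Re(zz') would be negative):

```rust
// Gauss's 3-multiplication trick for unsigned integers, branching so the
// k2 factor (d - c) never goes negative. Inputs are z = a + ib, z' = c + id.
fn mul_unsigned(a: u32, b: u32, c: u32, d: u32) -> (u32, u32) {
    let k1 = c * (a + b);
    let k3 = b * (c + d);
    // Underflows iff Re(zz') < 0, exactly like the naive a*c - b*d path.
    let re = k1 - k3;
    let im = if d >= c {
        k1 + a * (d - c) // normal k2
    } else {
        // k1 = ac + bc >= ac >= a*(c - d), so this never underflows.
        k1 - a * (c - d)
    };
    (re, im)
}

fn main() {
    // (2 + i) * (3 + i) = 5 + 5i  (exercises the d < c branch)
    assert_eq!(mul_unsigned(2, 1, 3, 1), (5, 5));
    // (3 + i) * (2 + 2i) = 4 + 8i  (exercises the d >= c branch)
    assert_eq!(mul_unsigned(3, 1, 2, 2), (4, 8));
}
```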

not sure about where to .clone() or not, so I put it everywhere

It's reasonable to start with that everywhere, and then you can remove the clone on the last use of a given value to instead consume it. So in your code k3 doesn't need any cloning, and k2 doesn't need to clone self.re.
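Following that rule, a trimmed version of the suggested mul might look like this (self-contained sketch with a stand-in Complex struct, not the actual num-complex type; note that k1 itself is used twice, so it also needs one clone):

```rust
use std::ops::{Add, Mul, Sub};

// Stand-in for the num-complex type, just to make the sketch compile.
#[derive(Clone, Debug, PartialEq)]
struct Complex<T> {
    re: T,
    im: T,
}

impl<T> Mul for Complex<T>
where
    T: Clone + Add<Output = T> + Sub<Output = T> + Mul<Output = T>,
{
    type Output = Self;

    fn mul(self, other: Self) -> Self {
        // Everything here is still needed later, so clone.
        let k1 = other.re.clone() * (self.re.clone() + self.im.clone());
        // Last use of self.re: consume it; other.re/other.im are still needed.
        let k2 = self.re * (other.im.clone() - other.re.clone());
        // Last use of everything else: no clones needed.
        let k3 = self.im * (other.re + other.im);
        // k1 feeds both components, so one clone here.
        Complex { re: k1.clone() - k3, im: k1 + k2 }
    }
}

fn main() {
    let z = Complex { re: 1i32, im: 2 };
    let w = Complex { re: 3, im: 4 };
    // (1 + 2i) * (3 + 4i) = -5 + 10i
    assert_eq!(z * w, Complex { re: -5, im: 10 });
}
```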

@Fulguritude
Author

Fulguritude commented Jul 9, 2021

That's potentially an issue for Complex<T> with unsigned T, though I don't know if anyone realistically uses that.

Wouldn't this also be an issue with the current multiplication? I.e., i^2 = -1 would cause a problem for any complex multiplication algorithm over any unsigned (and non-cyclic) T. But in any case, I think your solution works.

More generally, this changes the overflow conditions, but it's not necessarily better or worse.

I'm really not an expert on generic Rust (though I'd very much like to become one :D) so I don't feel confident giving you my input here.

Do you have any code which can be benchmarked with this change?

I don't have any such benchmarks in Rust, but if you tell me it's relevant, I can dig through my files this weekend, adapt some old C of mine, and provide you the source so you can run the benchmark. I could also try to fork this repo and make the changes, but I'm not sure how to set up such a fork or dependency with cargo. Otherwise, I could try to figure out how to use the system clock in Rust. The simplest protocol would be to:

  • generate a random array of complex numbers,
  • start a timer,
  • run a defined number of multiplications on this list with protocol 1,
  • end the timer: mark the operation time,
  • start another timer,
  • run the same multiplications on this list with protocol 2,
  • end the timer: mark the operation time,
  • compare the times

I can do that in C for you right now if you'd like.
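Those steps could be sketched in plain Rust with std::time::Instant (deterministic trig-based inputs instead of a real RNG, and a crude accumulator to keep the loop from being optimized away; a proper harness like #[bench] would be more rigorous):

```rust
use std::time::Instant;

fn mul_naive(a: f64, b: f64, c: f64, d: f64) -> (f64, f64) {
    (a * c - b * d, a * d + b * c)
}

fn mul_gauss(a: f64, b: f64, c: f64, d: f64) -> (f64, f64) {
    let k1 = c * (a + b);
    (k1 - b * (c + d), k1 + a * (d - c))
}

// Time one protocol over consecutive pairs of the input list.
fn time_ns(f: fn(f64, f64, f64, f64) -> (f64, f64), data: &[(f64, f64)]) -> u128 {
    let start = Instant::now();
    let mut acc = (0.0, 0.0);
    for w in data.windows(2) {
        let (re, im) = f(w[0].0, w[0].1, w[1].0, w[1].1);
        acc.0 += re;
        acc.1 += im;
    }
    let elapsed = start.elapsed().as_nanos();
    // Consume the accumulator so the compiler cannot drop the loop.
    assert!(acc.0.is_finite() && acc.1.is_finite());
    elapsed
}

fn main() {
    // Deterministic pseudo-random-ish inputs (no external rand crate).
    let data: Vec<(f64, f64)> = (0..1_000_000)
        .map(|i| ((i as f64 * 0.37).sin(), (i as f64 * 0.73).cos()))
        .collect();
    println!("naive: {} ns", time_ns(mul_naive, &data));
    println!("gauss: {} ns", time_ns(mul_gauss, &data));
}
```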

It's reasonable to start with that everywhere, and then you can remove the clone on the last use of a given value to instead consume it. So in your code k3 doesn't need any cloning, and k2 doesn't need to clone self.re.

Thanks for the helpful advice! Given that logic, I suppose I forgot to clone k1 in my line let re = k1 - k3;?

@twirrim

twirrim commented Jul 14, 2021

Not sure if this is useful, but a hopefully fairly simple benchmark for complex number multiplication could come from a straightforward fractal loop:

// Note: #[bench] requires a nightly toolchain with #![feature(test)],
// extern crate test; and use test::Bencher;
#[bench]
fn bench_mul(b: &mut Bencher) {
    // This leverages part of the julia set to benchmark multiplication
    let c = Complex {
        re: -0.7,
        im: 0.27015,
    };
    b.iter(|| {
        let mut i = 0;
        let mut z = Complex {
            re: -0.8099464512195294,
            im: 0.07329678182974636,
        };
        while z.norm() <= 2.0 && i < 255 {
            z = z * z + c;
            i += 1;
        }
    });
}

I tweaked some Rust-based Julia fractal code I've been fiddling with recently to spit out example values where it had to do lots of iterations, which gave me useful values for c and z. That seems to result in enough multiplications to provide a measurement of performance. I can easily influence the result quite drastically by deliberately breaking the mul implementation, e.g. changing let im = self.re * other.im + self.im * other.re; to just let im = self.im;

Here's the benchmark result from this benchmark against current master:

running 1 test
test bench_mul ... bench:         270 ns/iter (+/- 17)

@cuviper
Member

cuviper commented Jul 14, 2021

With my Ryzen 7 3800X, I get 248 ns/iter on that benchmark with current master, and 264 ns/iter with the suggested changes. Most of that time is spent in norm's call to hypot, though. If I change that to z.norm_sqr() <= 4.0, I get 79 ns/iter and 105 ns/iter for the old and new code, respectively.
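The norm_sqr substitution works because |z| <= 2.0 holds exactly when |z|^2 <= 4.0, and the squared form skips the sqrt/hypot call entirely; a minimal sketch with stand-in functions:

```rust
// Stand-ins for Complex::norm and Complex::norm_sqr on f64 parts.
fn norm(re: f64, im: f64) -> f64 {
    re.hypot(im)
}

fn norm_sqr(re: f64, im: f64) -> f64 {
    re * re + im * im
}

fn main() {
    for &(re, im) in &[(-0.81, 0.073), (1.9, 0.6), (3.0, 4.0)] {
        // Both escape tests agree on every point.
        assert_eq!(norm(re, im) <= 2.0, norm_sqr(re, im) <= 4.0);
    }
    println!("norm and norm_sqr escape tests agree");
}
```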

@cuviper
Member

cuviper commented Jul 14, 2021

Here are the two implementations on godbolt, using f64: https://rust.godbolt.org/z/558nzKaso

It doesn't look surprising at this level: the old code indeed shows 4 mulsd versus only 3 in the new, but the new code has one more instruction in total.

No branches or pull requests · 3 participants