saltelli.sample returns several times the exact same samples #447

chahank · 2021-07-08T15:53:23Z

I recently upgraded to SALib 1.4.0.2 and noticed behaviour that looks incorrect to me. When using saltelli.sample, most of the returned samples are identical, which would mean that the model is evaluated several times with the exact same input variables. Is this really how it should be?

Code from the SALib example:

from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-3.14159265359, 3.14159265359],
               [-3.14159265359, 3.14159265359],
               [-3.14159265359, 3.14159265359]]
}

x = saltelli.sample(problem, 2)

Output:

x = array([[-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]])
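
Counting the distinct rows makes the duplication explicit. The following is a minimal sketch using only the standard library; the rows are re-typed by hand from the printed output above, and the variable names are purely illustrative.

```python
# Rows re-typed from the printed output: 8 rows at the lower bound, 8 rows at 0.
lower = -3.14159265
samples = [(lower, lower, lower)] * 8 + [(0.0, 0.0, 0.0)] * 8

# A set of tuples keeps a single copy of each distinct row.
unique_rows = set(samples)

print(len(samples), "rows,", len(unique_rows), "unique")
```

With a real NumPy array, `numpy.unique(x, axis=0)` would give the same count.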


ConnectedSystems · 2021-07-08T16:18:19Z

Hi @chahank

I believe this is expected behaviour - from memory the Sobol' sequence has some repetition initially.
See http://arxiv.org/abs/2008.08051 for why skip_values now defaults to 0 (raised in #442)

I do recognise that it is wasteful to do repeated runs with same values.
I'll have to think on how best to deal with that in this case.

In the meantime, you can revert to the previous behaviour by providing a value for skip_values (the previous default was 1024).

There are some caveats though, so do take note of the documentation:

SALib/src/SALib/sample/saltelli.py

Line 30 in 4012c13

    
               If skipping values, one recommendation is that the largest possible `n` such that

chahank · 2021-07-08T16:23:38Z

Thanks for the quick response! I have read the documentation, but to be honest could not understand what the whole description about skipping values meant, in particular not how this compared to the previous implementation. The linked PR and Issue discussion also did not help me out there.

Is it possible to remove the duplicate samples a posteriori, or would this break the sensitivity analysis?

ConnectedSystems · 2021-07-08T16:49:14Z

You mean you're unsure of what "skipping values" actually means?

The brief explanation I can offer is that:

The Sobol' sequence is used to generate points in parameter space to sample from. It is a deterministic sequence.
A common practice was to skip the first points of the sequence to obtain a more uniform sampling.
This was shown to have an adverse effect (in the document I linked to above), actually increasing the number of samples needed to achieve convergence.

To avoid the duplicate samples, you can skip a given number of points in the Sobol' sequence (using the skip_values argument). Avoiding the first point should avoid the duplicate samples.

The caveat is simply that ideally

The number of samples you want and the number of points to skip (given with skip_values) will be a power of 2
Failing the above, the value of skip_values can be set to (2^n)-1 (e.g., you pick n) and this value (2^n)-1 would be <= the desired number of samples

As I understand it, not following the above will still produce usable results (for some value of "usable") but may take more samples than necessary (mentioned above).

As for removing samples, this is not recommended. As noted above, the Sobol' sequence is deterministic so changing the samples destroys its structure, and I think the subsequent analysis likely won't be usable. I recommend skipping values to avoid the duplicates rather than filtering.

chahank · 2021-07-08T17:54:13Z

Amazing, thanks for the detailed explanation! I would maybe suggest adding this description to the documentation at some level, it would help users to understand the changes from the previous implementation.

ConnectedSystems · 2021-07-08T22:43:49Z

Glad it was helpful, and thanks for raising that the documentation may be too cryptic for some - I guess I assumed too much when making adjustments.

Please keep this issue open for now - I will close it after I adjust the documentation again for clarity.

chahank · 2021-07-12T08:37:41Z

May I give some feedback on the latest changes to the documentation? I am still somewhat confused.
The docstring now reads:

A separate recommendation is that `skip_values` be set to a value equal to the 
    largest possible ``(2^m)-1 <= N`` (see [6]). In other words, the user selects
    an `m` such that `skip_values` is equal to ``(2^m)-1``.

What does this mean? skip_values = m ? or skip_values = 2^m - 1 = N ? What is then the value of m, an integer? a base-2 number? a float? a user parameter?

Also, the code gives a warning:

Convergence properties of the Sobol' sequence is only valid if  `skip_values` (1) is equal to `2^m

Again, the question here is what this 2^m is. The message only tells me that something might be suboptimal; it does not help me understand what the issue might be, as the value m is undefined. Furthermore, it appears to contradict the docstring, where 2^m - 1 is given.

ConnectedSystems · 2021-07-12T09:13:43Z

Hi again,

Yes, I am still thinking through how to better express the suggestions.

To answer your questions directly: there are, broadly speaking, two somewhat separate recommendations in the literature as far as I am aware.

One recommendation is that both skip_values and N should be a power of 2 (the first mentioned in the updated documentation). Violating this has been shown to increase error. Owen (in the referenced article) says "[u]sing 1000 points of a Sobol' sequence may well be less accurate than using 512".

An earlier perspective, the one you ask about here, is that skipping some points of the Sobol' sequence produces improved uniformity in the samples.

If following the second of the two:

m here is an arbitrary integer selected such that skip_values == (2^m)-1. In other words, the parameter skip_values represents this (2^m)-1
This value (2^m)-1 should ideally be <= N
If following this suggestion, then the warning emitted can be ignored (we assume you know what you're doing)
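
To make the role of m concrete, here is a small sketch of the second recommendation as described above: choose the largest integer m with (2^m)-1 <= N and pass (2^m)-1 as skip_values. The helper name `largest_skip_below` is made up for illustration and is not part of SALib.

```python
def largest_skip_below(n_samples):
    """Return (m, 2**m - 1) for the largest integer m with 2**m - 1 <= n_samples.

    Hypothetical helper for illustration; not a SALib function.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be a positive integer")
    # 2**m - 1 <= N is equivalent to 2**m <= N + 1, so m is the position
    # of the highest set bit of N + 1.
    m = (n_samples + 1).bit_length() - 1
    return m, 2**m - 1


for n in (2, 512, 1000, 1024):
    m, skip = largest_skip_below(n)
    print(f"N={n}: m={m}, skip_values={skip}")
```

For N = 1000, for example, this gives m = 9 and skip_values = 511. So m is just an integer exponent chosen by the user, and skip_values is the resulting count of skipped points.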

I take your point that the warning message is likely confusing as it requires awareness of these specific details to discern, and likely needs adjustment.

Nevertheless I hope this cleared things up for you, and I'll try to come up with a better approach to explaining this in the documentation.

Hope you don't mind if I ask for your input on the documentation again in the future?

chahank · 2021-07-12T15:42:57Z

Sure, I am happy to give feedback also in the future. Thanks for the quick responses!

Maybe for a bit of context: I integrated SALib into our CLIMADA package to perform uncertainty and sensitivity analysis. CLIMADA can be used to model the impact and risk of natural catastrophes today and in the future.

chahank · 2021-07-12T15:45:08Z

Coming back to the original question: why would the Sobol sequence produce identical samples? Is this a rounding error? I could not find any mention of repeated samples in any publications so far.
As I understand it, the issue with the uniformity (and the reason to skip values) is a different one.

ConnectedSystems · 2021-07-13T14:14:49Z

The first point in the sequence is always the origin, so it will always produce identical values if it is not skipped (see Table 2 in [1], and a brief mention in [2]).
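
A small illustration of why an unskipped first point lands on the lower bound of every parameter. This is a sketch, not SALib's generator: it uses the base-2 radical inverse (the van der Corput sequence), which is assumed here to match the first coordinate of a Sobol' sequence in its natural ordering, and a hypothetical `scale` helper for the linear rescaling to the problem bounds.

```python
def radical_inverse_base2(i):
    """Mirror the binary digits of i around the binary point (van der Corput)."""
    value, weight = 0.0, 0.5
    while i > 0:
        if i & 1:
            value += weight
        i >>= 1
        weight /= 2
    return value


def scale(u, lower, upper):
    """Map a point of the unit interval linearly onto [lower, upper]."""
    return lower + u * (upper - lower)


bound = 3.14159265359  # same bounds as in the example problem
unit_points = [radical_inverse_base2(i) for i in range(4)]
print(unit_points)  # [0.0, 0.5, 0.25, 0.75]
print([scale(u, -bound, bound) for u in unit_points[:2]])
```

Index 0 maps to the lower bound and index 1 to the midpoint, which is consistent with the -3.14159265 and 0 rows in the output at the top of the thread.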

The way Saltelli's sampler works is to cross-sample between two matrices. The skip_values argument controls how many "rows" to skip in the base matrix. See the code for the Saltelli sampler for implementation details, or if you want the full glorious details, see [3].
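
The cross-sampling step can be sketched as follows. This is a simplified, pure-Python rendering of the usual Saltelli layout with second-order terms (one base row split into an A half and a B half, giving 2D+2 model inputs), not a copy of SALib's code; `cross_sample` is a name made up here.

```python
def cross_sample(a_row, b_row):
    """Build the 2D+2 model inputs derived from one base row (A half, B half)."""
    d = len(a_row)
    rows = [list(a_row)]  # the A row itself
    # A with its i-th entry replaced by the matching entry of B
    for i in range(d):
        row = list(a_row)
        row[i] = b_row[i]
        rows.append(row)
    # B with its i-th entry replaced by the matching entry of A
    for i in range(d):
        row = list(b_row)
        row[i] = a_row[i]
        rows.append(row)
    rows.append(list(b_row))  # the B row itself
    return rows


def n_unique(rows):
    """Count distinct rows."""
    return len({tuple(r) for r in rows})


origin = cross_sample([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
distinct = cross_sample([0.1, 0.2, 0.3], [0.6, 0.7, 0.8])
print(len(origin), n_unique(origin))      # 8 1
print(len(distinct), n_unique(distinct))  # 8 8
```

When the A and B halves of a base row are equal, as they are at the origin, swapping entries between them changes nothing, so all 2D+2 = 8 rows for D = 3 coincide. That matches the blocks of eight identical rows reported at the top of the thread.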

[1] https://doi.org/10.1016/j.cpc.2010.12.039
[2] http://janroman.dhis.org/finance/Numerical%20Methods/Sobol.pdf
[3] https://doi.org/10.1016/S0010-4655(02)00280-1

PS: I recently became aware of CLIMADA - looks very interesting!

ConnectedSystems · 2021-07-13T14:47:17Z

Hmm, I think to avoid overloading the docs and confusing users I will simplify to outlining just one of the recommendations: that both skip_values and N be a power of 2, and that skip_values be >= N.

scipy/scipy#10844 (comment)
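
A sketch of how one might check this simplified recommendation before sampling, assuming it is read as stated: N is a power of 2, and skip_values is a power of 2 that is at least N. Both helper names are hypothetical and are not SALib functions.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (positive integers with a single set bit)."""
    return n > 0 and (n & (n - 1)) == 0


def smallest_power_of_two_at_least(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


n_samples = 1000
if not is_power_of_two(n_samples):
    # Round N up so that it satisfies the recommendation as well.
    n_samples = smallest_power_of_two_at_least(n_samples)
skip_values = smallest_power_of_two_at_least(n_samples)
print(n_samples, skip_values)  # 1024 1024
```

Rounding N up rather than down is a choice made for this example only; rounding down to 512 would satisfy the same rule with fewer model runs.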

chahank · 2021-07-14T13:13:05Z

The first point in the sequence is always the origin, so it will always produce identical values if it is not skipped (see Table 2 in [1], and a brief mention in [2]).

Thanks for the references (I have not yet had the time to read them in detail). I think we might not be talking about the same thing when saying "identical samples". If I look at Table 2 in [1], all of the 8 (rows in the table) 10-dimensional points are different samples. This is different from the output of the saltelli.sample() example I gave above. There, out of the 16 (rows of the 2d numpy.array) 3-dimensional points, there are only 2 unique points.

ConnectedSystems · 2021-07-14T14:39:08Z

I think I understand you, but I also understand the confusion.

The rows of Table 2 in [1] are not samples; they are points in the Sobol' sequence.

The first row shows that, for a 10-dimensional problem, the coordinates of the first point are identical in every dimension (all set to 0.5).

As Campolongo et al. describe (in [1]): "As in the first points of the Sobol' sequence the values of the coordinates tend to repeat (i.e. for the first point they are all equal to 0.5, for the second they are alternates couples of 0.25 and 0.75 and so on"

This repetition is what causes the initial samples to be identical:

"... in order to achieve different coordinates' values for the points a and b, we need to generate a quasi-random matrix of Sobol' numbers of size (R, 2k), with R > r, and discard the first few points for the auxiliary points ..."

chahank · 2021-07-15T12:56:39Z

I am still confused as to how a (good uniform) sampling method from a continuous high-dimensional space can produce identical samples, but I must also admit that I have not yet understood fully the Sobol sequence algorithm.

ConnectedSystems · 2021-09-04T05:42:27Z

Closing - resolved in v1.4.5

hudsonb22 · 2023-12-15T20:01:25Z

This seems to still be happening for me. All of the outputs from my sobol sample are exactly the same, even using the recommended methods for setting a skip value. Is anyone else encountering this problem?

ConnectedSystems · 2023-12-15T21:22:05Z

@hudsonb22 could you open a new issue with an example of how you're using SALib and the results you're seeing? I can help more then. Note however that saltelli.sample is now deprecated in favour of sobol

hudsonb22 · 2023-12-15T23:25:37Z

@ConnectedSystems Yes! It's issue #600

ConnectedSystems mentioned this issue Jul 8, 2021

Adjust Saltelli documentation for clarity #448

Merged

ConnectedSystems self-assigned this Jul 12, 2021

ConnectedSystems added documentation in progress labels Jul 12, 2021

ConnectedSystems added the enhancement label Jul 13, 2021

ConnectedSystems closed this as completed Sep 4, 2021
