saltelli.sample returns several times the exact same samples #447

Closed
chahank opened this issue Jul 8, 2021 · 18 comments

@chahank
Contributor

chahank commented Jul 8, 2021

I recently upgraded to SALib 1.4.0.2 and noticed behaviour that looks incorrect to me. When using saltelli.sample, most of the returned samples are identical, which would mean that the model is evaluated several times with the exact same input variables. Is this really how it should be?

Code from the SALib example:

from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-3.14159265359, 3.14159265359],
               [-3.14159265359, 3.14159265359],
               [-3.14159265359, 3.14159265359]]
}

x = saltelli.sample(problem, 2)

Output:

x = array([[-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]])
@ConnectedSystems
Member

ConnectedSystems commented Jul 8, 2021

Hi @chahank

I believe this is expected behaviour - from memory the Sobol' sequence has some repetition initially.
See http://arxiv.org/abs/2008.08051 for why skip_values now defaults to 0 (raised in #442)

I do recognise that it is wasteful to do repeated runs with same values.
I'll have to think on how best to deal with that in this case.

In the meantime, you can revert to the previous behavior by providing a value for skip_values (the previous default was 1024).

There are some caveats though, so do take note of the documentation:

If skipping values, one recommendation is that the largest possible `n` such that

@chahank
Contributor Author

chahank commented Jul 8, 2021

Thanks for the quick response! I have read the documentation, but to be honest could not understand what the whole description about skipping values meant, in particular not how this compared to the previous implementation. The linked PR and Issue discussion also did not help me out there.

Is it possible to remove the duplicate samples a posteriori, or would this break the sensitivity analysis?

@ConnectedSystems
Member

ConnectedSystems commented Jul 8, 2021

You mean you're unsure of what "skipping values" actually means?

The brief explanation I can offer is that:

  • The Sobol' sequence is used to generate points in parameter space to sample from. It is a deterministic sequence.
  • A common practice was to skip the first few points to obtain a more uniform sampling
  • This was shown to have an adverse effect (in the document I linked to above), actually increasing the number of samples needed to achieve convergence

To avoid the duplicate samples, you can skip a given number of points in the Sobol' sequence (using the skip_values argument). Avoiding the first point should avoid the duplicate samples.
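To make "skipping" concrete, here is a minimal sketch using the base-2 van der Corput sequence as a one-dimensional stand-in for the Sobol' sequence. This is an illustrative assumption, not SALib's generator, and the function names are hypothetical. Like the Sobol' sequence, it is deterministic and starts at the origin, so skipping the first point is what removes the zero.

```python
def van_der_corput(index, base=2):
    """Return the index-th point of the van der Corput sequence (radical inverse)."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def sequence(n_points, skip=0):
    """Take n_points consecutive points after skipping the first `skip` points."""
    return [van_der_corput(i) for i in range(skip, skip + n_points)]


print(sequence(4))           # starts at the origin: [0.0, 0.5, 0.25, 0.75]
print(sequence(4, skip=1))   # origin skipped: [0.5, 0.25, 0.75, 0.125]
```

The sequence itself is unchanged by skipping; only the window taken from it moves.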

The caveat is simply that ideally

  1. The number of samples you want and the number of points to skip (given with skip_values) should both be powers of 2
  2. Failing the above, the value of skip_values can be set to (2^n)-1 (i.e., you pick n), where this value (2^n)-1 should be <= the desired number of samples

As I understand it, not following the above will still produce usable results (for some value of "usable") but may take more samples than necessary (mentioned above).
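As a rough sketch of the first caveat, the check below reports whether a given sample count and skip count are both powers of 2. The helper names are hypothetical and not part of SALib, and treating a skip of 0 as "no skipping, nothing to check" is my own assumption.

```python
def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... (a single bit set in the binary representation)."""
    return value > 0 and (value & (value - 1)) == 0


def check_sizes(n_samples, skip_values):
    """Return a list of human-readable problems; an empty list means both values pass."""
    problems = []
    if not is_power_of_two(n_samples):
        problems.append(f"N={n_samples} is not a power of 2")
    # Assumption: skip_values == 0 means no points are skipped, so it is not flagged.
    if skip_values != 0 and not is_power_of_two(skip_values):
        problems.append(f"skip_values={skip_values} is not a power of 2")
    return problems


print(check_sizes(512, 1024))   # []
print(check_sizes(1000, 1000))  # two problems reported
```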

As for removing samples, this is not recommended. As noted above, the Sobol' sequence is deterministic so changing the samples destroys its structure, and I think the subsequent analysis likely won't be usable. I recommend skipping values to avoid the duplicates rather than filtering.

@chahank
Contributor Author

chahank commented Jul 8, 2021

Amazing, thanks for the detailed explanation! I would maybe suggest adding this description to the documentation at some level, it would help users to understand the changes from the previous implementation.

@ConnectedSystems
Member

Glad it was helpful, and thanks for raising that the documentation may be too cryptic for some - I guess I assumed too much when making adjustments.

Please keep this issue open for now - I will close it after I adjust the documentation again for clarity.

@chahank
Contributor Author

chahank commented Jul 12, 2021

May I give some feedback on the latest changes to the documentation? I am still somewhat confused.
The docstring now reads:

A separate recommendation is that `skip_values` be set to a value equal to the 
    largest possible ``(2^m)-1 <= N`` (see [6]). In other words, the user selects
    an `m` such that `skip_values` is equal to ``(2^m)-1``.

What does this mean? skip_values = m ? or skip_values = 2^m - 1 = N ? What is then the value of m, an integer? a base-2 number? a float? a user parameter?

Also, the code gives a warning:

Convergence properties of the Sobol' sequence is only valid if  `skip_values` (1) is equal to `2^m

Again, the question here is what this 2^m is. This message only helps me understand that something might be suboptimal; it does not help me understand what the issue might be, as the value m is undefined. Furthermore, it appears to contradict the docstring, where 2^m - 1 is given.

@ConnectedSystems
Member

ConnectedSystems commented Jul 12, 2021

Hi again,

Yes, I am still thinking through how to better express the suggestions.

To answer your questions directly: there are, broadly speaking, two somewhat separate recommendations in the literature, as far as I am aware.

One recommendation is that both skip_values and N should be a power of 2 (the first mentioned in the updated documentation). Violating this has been shown to increase error. Owen (in the referenced article) says "[u]sing 1000 points of a Sobol' sequence may well be less accurate than using 512".

An earlier perspective, the one you ask about here, is that skipping some points of the Sobol' sequence produces improved uniformity in the samples.

If following the second of the two:

  • m here is an arbitrary integer selected such that skip_values == (2^m)-1. In other words, the parameter skip_values represents this (2^m)-1

  • This value (2^m)-1 should ideally be <= N

  • If following this suggestion, then the warning emitted can be ignored (we assume you know what you're doing)
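For the second recommendation, a small worked sketch may help: given a desired N, pick the largest integer m such that (2^m)-1 <= N, and pass that value (2^m)-1 as skip_values. The function name below is hypothetical and only illustrates the arithmetic.

```python
def largest_skip_for(n_samples):
    """Return (m, skip) where skip = 2**m - 1 is the largest such value <= n_samples."""
    # 2**m - 1 <= N  is equivalent to  2**m <= N + 1,
    # so m is the position of the highest set bit of N + 1.
    m = (n_samples + 1).bit_length() - 1
    return m, 2**m - 1


for n in (8, 1000, 1023, 1024):
    m, skip = largest_skip_for(n)
    print(f"N={n}: m={m}, skip_values={skip}")
# N=8: m=3, skip_values=7
# N=1000: m=9, skip_values=511
# N=1023: m=10, skip_values=1023
# N=1024: m=10, skip_values=1023
```

So m is a plain non-negative integer, and skip_values is the number (2^m)-1 itself, not m.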

I take your point that the warning message is likely confusing as it requires awareness of these specific details to discern, and likely needs adjustment.

Nevertheless I hope this cleared things up for you, and I'll try to come up with a better approach to explaining this in the documentation.

Hope you don't mind if I ask for your input on the documentation again in the future?

@chahank
Contributor Author

chahank commented Jul 12, 2021

Sure, I am happy to give feedback also in the future. Thanks for the quick responses!

Maybe for a bit of context: I integrated SALib into our CLIMADA package to perform uncertainty and sensitivity analysis. CLIMADA can be used to model the impact and risk of natural catastrophes today and in the future.

@chahank
Contributor Author

chahank commented Jul 12, 2021

Coming back to the original question: why would the Sobol sequence produce identical samples? Is this a rounding error? I could not find any mention of repeated samples in any publications so far.
As I understand it, the issue with the uniformity (and the reason to skip values) is a different one.

@ConnectedSystems
Member

ConnectedSystems commented Jul 13, 2021

The first point in the sequence is always the origin, so it will always produce identical values if it is not skipped (see Table 2 in [1], and a brief mention in [2]).

The way Saltelli's sampler works is to cross-sample between two matrices. The skip_values argument controls how many "rows" to skip in the base matrix. See the code for the Saltelli sampler for implementation details, or if you want the full glorious details, see [3].

[1] https://doi.org/10.1016/j.cpc.2010.12.039
[2] http://janroman.dhis.org/finance/Numerical%20Methods/Sobol.pdf
[3] https://doi.org/10.1016/S0010-4655(02)00280-1
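To illustrate how cross-sampling turns repeated coordinates into repeated samples, here is a simplified sketch in plain Python. It is not SALib's code: the row ordering (A, then each AB_i, then each BA_i, then B) and the function names are my assumptions. The two base rows are taken to be the origin followed by the all-0.5 point, which is consistent with the output shown at the top of this thread.

```python
import math


def cross_sample(base, num_vars):
    """Expand each base row (2 * num_vars values) into 2 * num_vars + 2 sample rows."""
    rows = []
    for point in base:
        a, b = point[:num_vars], point[num_vars:]
        rows.append(list(a))                 # A
        for i in range(num_vars):            # AB_i: A with column i taken from B
            ab = list(a)
            ab[i] = b[i]
            rows.append(ab)
        for i in range(num_vars):            # BA_i: B with column i taken from A
            ba = list(b)
            ba[i] = a[i]
            rows.append(ba)
        rows.append(list(b))                 # B
    return rows


def scale(rows, lower, upper):
    """Map unit-interval values onto [lower, upper]."""
    return [[lower + x * (upper - lower) for x in row] for row in rows]


num_vars = 3
# Assumed first two base points: every coordinate equal within each row.
base = [[0.0] * (2 * num_vars), [0.5] * (2 * num_vars)]
samples = scale(cross_sample(base, num_vars), -math.pi, math.pi)
unique = {tuple(row) for row in samples}
print(len(samples), len(unique))  # 16 2
```

Because every coordinate in a base row is the same, swapping a column between A and B changes nothing, so each base row yields eight identical samples.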

PS: I recently became aware of CLIMADA - looks very interesting!

@ConnectedSystems
Member

ConnectedSystems commented Jul 13, 2021

Hmm, I think to avoid overloading the docs and confusing users I will simplify to outlining just one of the recommendations: that both skip_values and N be a power of 2, and that skip_values be >= N.

scipy/scipy#10844 (comment)

@chahank
Contributor Author

chahank commented Jul 14, 2021

The first point in the sequence is always the origin, so it will always produce identical values if it is not skipped (see Table 2 in [1], and a brief mention in [2]).

Thanks for the references (I have not yet had the time to read them in detail). I think we might not be talking about the same thing when saying "identical samples". If I look at Table 2 in [1], all of the 8 (rows in the table) 10-dimensional points are different samples. This is different from the output of the saltelli.sample() example I gave above. There, out of the 16 (rows of the 2d numpy.array) 3-dimensional points, there are only 2 unique points.

@ConnectedSystems
Member

ConnectedSystems commented Jul 14, 2021

I think I understand you, but I also understand the confusion.

The rows of Table 2 in [1] are not samples; they are points in the Sobol' sequence.

The first row shows that, for a 10-dimensional problem (and actually for any number of dimensions), all coordinates of the first point are identical (i.e., all set to 0.5).

As Campolongo et al. describe (in [1]): "As in the first points of the Sobol' sequence the values of the coordinates tend to repeat (i.e. for the first point they are all equal to 0.5, for the second they are alternates couples of 0.25 and 0.75 and so on"

This repetition is what causes the initial samples to be identical:

"... in order to achieve different coordinates' values for the points a and b, we need to generate a quasi-random matrix of Sobol' numbers of size (R, 2k), with R > r, and discard the first few points for the auxiliary points ..."

@chahank
Contributor Author

chahank commented Jul 15, 2021

I am still confused as to how a (good uniform) sampling method from a continuous high-dimensional space can produce identical samples, but I must also admit that I have not yet understood fully the Sobol sequence algorithm.

@ConnectedSystems
Member

Closing - resolved in v1.4.5

@hudsonb22

This seems to still be happening for me. All of the outputs from my sobol sample are exactly the same, even using the recommended methods for setting a skip value. Is anyone else encountering this problem?

@ConnectedSystems
Member

@hudsonb22 could you open a new issue with an example of how you're using SALib and the results you're seeing? I can help more then. Note however that saltelli.sample is now deprecated in favour of sobol

@hudsonb22

@ConnectedSystems Yes! It's issue #600
