Some Thoughts on Properties of Similarity Metrics #24
philswatton started this conversation in Ideas
Two sets of thoughts on how some of the metrics are likely to behave, and how we would ideally want them to behave. The first concerns the behaviour of the metrics when dropping observations is the only source of difference between datasets. The second is on whether we want our measures to satisfy symmetry (i.e. the property that dist[A,B] = dist[B,A]).
Dropping Observations Alone
As a starting point, consider the CIFAR-10 dataset: 10 classes, each with 6,000 32x32 images. Most of our measures (or families of measures) are based on treating a dataset as a probability distribution over the features (MMD, kernel density estimation, OT). So the question emerges: what happens to this distribution when we drop observations from the dataset?
As a rough ballpark figure, a normal distribution is increasingly well approximated after about 100 observations (I've also heard the number 30). Of course, CIFAR's data isn't normally distributed: the features take on integer values and have natural limits (0 and 255 for each feature). It's worth noting, though, that the dimensionality of a Gaussian distribution doesn't change the number of observations we need to successfully approximate it, since every marginal is estimated from the same n observations. There's a nice Cross Validated post explaining this here: https://stats.stackexchange.com/questions/59478/when-data-has-a-gaussian-distribution-how-many-samples-will-characterise-it.
Possibly relevant, possibly not: in survey sampling (and statistical sampling in general), the size of the overall population has little bearing on the margin of error of estimated statistics: https://medium.com/swlh/sampling-fractions-and-populations-dc48bc482187.
So: there are good reasons to think that even at a very low number of samples the distribution of each class will be well approximated (assuming essentially random sampling of a given population), and good reasons to suspect this won't be affected by the dimensionality of the data.
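To make the dimensionality point concrete, here is a minimal simulation sketch. The standard-normal features and the particular sample sizes and dimensions are my own illustrative choices: the average per-feature error in the estimated mean shrinks with n and is essentially flat in d.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 100, 3072]:  # 3072 = 32 * 32 * 3, CIFAR's flattened dimensionality
    for n in [30, 100, 1000]:
        samples = rng.standard_normal((n, d))  # true mean is 0 in every dimension
        avg_error = np.abs(samples.mean(axis=0)).mean()
        print(f"d={d:5d}  n={n:5d}  average per-feature error in the mean: {avg_error:.3f}")
```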
In this case, randomly dropping observations stratified by class (i.e. a 10% drop means dropping 10% from each class) should mean both that the within-class feature distributions are more or less unchanged and that the overall dataset distribution is unchanged (this would not be the case if the dropping were not stratified by class).
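A sketch of the stratified drop described above; the function and array names are hypothetical placeholders, not tied to any particular data loader:

```python
import numpy as np

def stratified_drop(X, y, drop_frac, seed=0):
    """Randomly drop `drop_frac` of the observations within each class."""
    rng = np.random.default_rng(seed)
    keep_idx = []
    for c in np.unique(y):
        class_idx = np.flatnonzero(y == c)
        n_keep = int(round(len(class_idx) * (1 - drop_frac)))
        keep_idx.append(rng.choice(class_idx, size=n_keep, replace=False))
    keep_idx = np.concatenate(keep_idx)
    return X[keep_idx], y[keep_idx]

# e.g. keep 20% of every class:
# X_small, y_small = stratified_drop(X, y, drop_frac=0.8)
```

Because the same fraction is removed from every class, the class proportions, and hence the mixture distribution, are preserved by construction.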
If we then consider the case where we drop observations but do not otherwise transform the data to produce datasets A and B, the distributions of A and B should be approximately the same. It follows that the metrics based on treating the dataset as a distribution should report the datasets as close to identical no matter how many observations are dropped (and indeed, dropping 80% from A instead of 20% is not guaranteed to produce a larger overall difference). This should be true even if A and B contain none of the same observations!
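A rough way to check this claim with MMD on small subsamples; this is an illustration, not our actual pipeline, and the median-heuristic bandwidth is an assumption (the pairwise matrix also makes this impractical beyond a few thousand points):

```python
import numpy as np

def mmd2_rbf(X, Y):
    """Biased MMD^2 estimate with an RBF kernel (median-heuristic bandwidth)."""
    Z = np.vstack([X, Y])
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    bandwidth = np.median(sq_dists[sq_dists > 0])  # median heuristic
    K = np.exp(-sq_dists / bandwidth)
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

# With A and B drawn disjointly from the same underlying data, mmd2_rbf(A, B)
# should stay near zero whether B retains 80% or 20% of the observations.
```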
Broadly, I think this is a desirable feature. Since in this case we'd expect transfer between A and B to be successful (training on the smaller/larger CIFAR and then transferring either the model or an attack to the other should work well), this is in line with how a similarity metric should behave. In other words, a good axiom for a metric is that where the distributions are the same, the similarity should be close to identical regardless of the actual observations (or the distance/dissimilarity should be close to 0, since most of our metrics really belong to that category).
However, PAD is based on the error on a held-out test set after training a model to predict which dataset each observation came from. If there are more observations in A than in B but the distributions are essentially the same, the model will be biased towards the larger dataset and can achieve lower test error than it could with balanced datasets simply by favouring A's label; lower error in turn translates into a larger PAD. It's likely, then, that PAD will turn out not to satisfy the above axiom.
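A back-of-the-envelope sketch of why, assuming the usual proxy A-distance definition PAD = 2 * (1 - 2 * err) from Ben-David et al., where err is the domain classifier's test error (if we define it differently the numbers shift, but the argument stands): when the distributions are identical, the best a classifier can do is predict the larger dataset's label, so the error floor is the smaller dataset's share of the pooled data, and PAD comes out positive rather than zero.

```python
def pad_from_error(err):
    """Proxy A-distance as commonly defined: 2 * (1 - 2 * err)."""
    return 2 * (1 - 2 * err)

for frac_a in [0.5, 0.6, 0.8, 0.9]:  # A's share of the pooled A + B data
    err_floor = 1 - frac_a           # trivial classifier: always predict "A"
    print(f"A share {frac_a:.0%}: error ~ {err_floor:.2f}, PAD ~ {pad_from_error(err_floor):.2f}")
```

With an 80/20 split this gives PAD of roughly 1.2 even though the distributions are identical, exactly the violation of the axiom described above.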
Symmetry as a Property
A second thought concerns the fact that almost all of our proposed metrics satisfy the symmetry property; that is, for two datasets A and B: dist[A,B] = dist[B,A].
We want 'dataset similarity' (which is not an entirely meaningful concept on its own) to correspond to real-world criteria such as transfer learning and attack transfer success. There is no clear a priori reason to assume that transfers from A to B will be as successful as transfers from B to A (we may want to pay attention to this in general).
The only proposed metric that does not satisfy symmetry is the KL divergence within the kernel density estimation family of metrics. We should therefore pay special attention to how well it does compared to the other metrics, especially if we check against transfer attacks both from A to B and from B to A.
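A toy 1-D illustration of the asymmetry; the KDE and Monte Carlo details here are illustrative assumptions, not necessarily how the KDE family is implemented in the repo:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=500)  # narrow distribution
B = rng.normal(0.0, 3.0, size=500)  # wide distribution

p, q = gaussian_kde(A), gaussian_kde(B)

def kl_mc(p, q, samples_from_p):
    """Monte Carlo estimate of KL(p || q) using samples drawn from p."""
    return np.mean(np.log(p(samples_from_p)) - np.log(q(samples_from_p)))

print("KL(A || B):", kl_mc(p, q, A))  # small: B's density covers A's support
print("KL(B || A):", kl_mc(q, p, B))  # larger: A's density is tiny in B's tails
```

The two directions disagree because KL penalises regions where the second distribution assigns low density to the first's samples, which maps loosely onto the transfer direction we care about.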