Defining Datasets and Transferring Between Them #34
Replies: 2 comments
I've done some work following the initial dump of ideas and our discussion, writing it up on Overleaf, with the goal of also tying into some of the theoretical predictions made in #24. Apologies in advance as there's a lot of repetition of the above, but hopefully this is a more coherent account with more consistent notation, and the implications follow clearly from it.
The key implications here are:
- If we set …
- Broadly, we would expect transfer success to depend on the similarity of the task (and therefore also of the domain, which is relevant to the task). E.g. learning a supervised CIFAR classifier should aid an unsupervised CIFAR classifier, because the domain is identical even if the task is not.
- The task, in other words, depends on …
- Unless we have all possible members of the feature (and/or label) space(s) in our dataset, we cannot measure the space generally. We do have the aforementioned empirical estimators.
- It follows that e.g. …
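Since we only ever observe a finite sample, the spaces and distributions can only be estimated empirically, as the list above notes. A minimal sketch of what such empirical estimators could look like (all names here are illustrative, not from the write-up):

```python
# Empirical estimators of the label space and the marginal P(X) from a
# finite dataset of (x, y) pairs. Illustrative sketch only.
from collections import Counter

def empirical_label_space(dataset):
    """Estimate the label space Y as the set of labels actually observed."""
    return {y for _, y in dataset}

def empirical_marginal(dataset):
    """Estimate P(X) as relative frequencies of the observed feature values."""
    counts = Counter(x for x, _ in dataset)
    n = len(dataset)
    return {x: c / n for x, c in counts.items()}

dataset = [("a", 0), ("b", 1), ("a", 0), ("c", 1)]
print(empirical_label_space(dataset))    # {0, 1}
print(empirical_marginal(dataset)["a"])  # 0.5
```

The estimators can only ever see the support of the sample, which is exactly why the space itself cannot be measured generally from the dataset.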
A point I'm stuck on:
This also helps generate a typology for the metrics:
all of which is relevant to the potential generalisation of these metrics (assuming they perform well) in the conclusion. It's also relevant to the prediction that PAD might struggle with data dropping alone, since the relative size of the two datasets may influence its result even though this isn't relevant to e.g. domain similarity. Hopefully this helps make the above more coherent, and the extra ideas are clear.
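To make the PAD point concrete, here's a deliberately degenerate sketch (a majority-vote "classifier" stands in for a trained domain classifier; names and numbers are illustrative): even when the two distributions are identical, dropping data from one side alone can shift the PAD estimate.

```python
# Proxy A-Distance (PAD) sketch: train a classifier to tell the two
# datasets apart, then PAD = 2 * (1 - 2 * err), where err is the
# domain-classifier error rate.

def proxy_a_distance(err):
    """PAD from the domain-classifier error rate."""
    return 2.0 * (1.0 - 2.0 * err)

def majority_classifier_error(n_source, n_target):
    """Error of a trivial classifier that always predicts the larger
    dataset's domain label."""
    return min(n_source, n_target) / (n_source + n_target)

# Balanced datasets: majority guessing is at chance, so PAD = 0.
print(proxy_a_distance(majority_classifier_error(1000, 1000)))  # 0.0

# Drop 90% of the target data: the same trivial classifier now looks
# "accurate", and PAD rises purely because of the size imbalance
# (err = 100/1100, so PAD = 18/11, roughly 1.64).
print(proxy_a_distance(majority_classifier_error(1000, 100)))
```

A real PAD implementation would fit an actual classifier on held-out splits, but the size sensitivity shown here is the mechanism behind the data-dropping prediction above.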
Other Definitions of Datasets, Domains, and Tasks
To set up theoretical expectations, it would be good to have a rough definition of datasets to use. Initially, I considered using the definition from the optimal transport dataset distance paper, which is that a dataset $\mathcal{D}$ is a set of feature-label pairs $(x,y) \in \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is a feature space and $\mathcal{Y}$ is a label space.
However, thinking this definition through made me realise a few things, some more relevant than others:
I then considered the paper A Survey on Transfer Learning (originally linked An introduction to domain adaptation and transfer learning), which instead of defining a dataset defines a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, where $\mathcal{X}$ is a feature space and $P(X)$ is a marginal distribution with $X = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}$. Given a specific domain, a task is denoted by $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$, where $\mathcal{Y}$ is a label space and $f(\cdot)$ is a function that is learned from pairs of training data.
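The survey's two definitions can be written down almost verbatim as data structures, which may help when making the framework concrete later (class and field names here are illustrative, and distributions are finite dicts for simplicity):

```python
# A direct encoding of the survey's definitions: a domain is a feature
# space plus a marginal distribution, and a task is a label space plus
# a predictive function f(.) learned from training pairs.
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set

@dataclass
class Domain:
    feature_space: Set[Any]          # the space X
    marginal: Dict[Any, float]       # P(X), here over a finite support

@dataclass
class Task:
    label_space: Set[Any]            # the space Y
    predictor: Callable[[Any], Any]  # f(.), learned from (x, y) pairs

# A toy binary task over a two-element feature space.
domain = Domain(feature_space={"a", "b"}, marginal={"a": 0.5, "b": 0.5})
task = Task(label_space={0, 1}, predictor=lambda x: 0 if x == "a" else 1)
print(task.predictor("b"))  # 1
```

Note that the dataset itself appears in neither structure: it is a sample used to estimate the marginal and to learn $f(\cdot)$, which matches the separation the survey draws.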
What I find helpful about this framework is that, with a small adjustment, it can be made more agnostic to which of the three broad types of learning is being performed.
Defining Transfer Learning
In the Survey on Transfer Learning paper, transfer learning is defined as starting from a source domain $\mathcal{D}_S$ and learning task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ and learning task $\mathcal{T}_T$, and using the knowledge from $\mathcal{D}_S$ and $\mathcal{T}_S$ to aid in the learning of the target predictive function $f_T(\cdot)$.
An important condition is that either $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$. If neither of these holds, we are no longer performing transfer learning.
The first condition implies that either $\mathcal{X}_S \neq \mathcal{X}_T$ or $P_S(X) \neq P_T(X)$. So either the feature space must be different, or the marginal distribution of variables drawn from that feature space must be different. If we consider learning $f(\cdot)$ as learning $P(Y|X)$, then the second condition implies that either $\mathcal{Y}_S \neq \mathcal{Y}_T$ or $P(Y_S|X_S) \neq P(Y_T|X_T)$.
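These two conditions are easy to state as a decision procedure, which is a useful sanity check on the definition (the dict-based representation of spaces and distributions is purely illustrative):

```python
# Classify a source/target pair by the survey's two conditions:
# domain shift (X or P(X) differ) and task shift (Y or P(Y|X) differ).
# Spaces and distributions are represented as comparable values here.

def transfer_setting(source, target):
    """source/target are dicts with keys 'X', 'PX', 'Y', 'PYX'."""
    domain_shift = source["X"] != target["X"] or source["PX"] != target["PX"]
    task_shift = source["Y"] != target["Y"] or source["PYX"] != target["PYX"]
    if not (domain_shift or task_shift):
        return "same domain and task: not transfer learning"
    if domain_shift and not task_shift:
        return "domain shift only"
    if task_shift and not domain_shift:
        return "task shift only"
    return "domain and task shift"

# CIFAR vs rotated CIFAR: same feature space, different marginal P(X),
# same label space and conditional.
src = {"X": "images", "PX": "cifar", "Y": {0, 1}, "PYX": "f"}
tgt = {"X": "images", "PX": "rotated-cifar", "Y": {0, 1}, "PYX": "f"}
print(transfer_setting(src, tgt))  # domain shift only
```

The first branch captures the condition above exactly: if neither a domain nor a task shift is present, the setting degenerates to ordinary learning.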
From the perspective of datasets, this has some implications:
Defining Datasets for Our Purposes
So, some considerations:
So, given the above definitions of dataset, domain, and task (albeit with some confusingly overlapping notation between the two papers), I think where I'm at is to go down the following path:
If we're agreed that this all makes sense, then I'll write it out more formally (and with non-overlapping notation for e.g. domain and dataset).
Transfer Learning vs Transfer Attacks
It's also worth considering another point: that the task of a transfer attack differs from transfer learning. In a transfer attack, we learn $f_S: \mathcal{X}_S \rightarrow \mathcal{Y}_S$ for the surrogate dataset and $f_T: \mathcal{X}_T \rightarrow \mathcal{Y}_T$ for the target. We then want to train an attack $A$ on $f_S$ such that the cost when applied to $f_T$ is maximised (assuming we just want inputs to be classified incorrectly; if we want to assign a specific label, it's not easy to imagine what that kind of attack looks like when the label spaces are not the same). Here, though $\mathcal{X}_S$ and $\mathcal{X}_T$ must allow for the same inputs (although both elements of the domain could be different - consider e.g. CIFAR vs rotated CIFAR), we can't assume that $Y_S$ and $Y_T$ belong to the same label space. So the mapping of $X$ to $Y$ becomes particularly relevant, even when the feature space is the same.
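The untargeted case above can be sketched in a few lines. This is a toy illustration under the assumption that the surrogate and target share a feature space: an FGSM-style perturbation is crafted against a linear surrogate and then applied to a similar (but not identical) target model. All models, data, and the step size are made up for illustration.

```python
# Toy untargeted transfer attack: craft a perturbation that raises the
# surrogate's loss (FGSM-style sign step) and apply it to the target.
import numpy as np

w_s = np.array([1.0, -1.0])   # surrogate linear classifier f_S
w_t = np.array([0.9, -1.1])   # target classifier f_T, similar but distinct

x = np.array([0.5, -0.5])     # input correctly classified as +1 by both
y = 1.0

# For a linear score y * (w . x), the loss gradient w.r.t. x is
# proportional to -y * w, so the FGSM attack direction is sign(-y * w_s).
eps = 1.2
x_adv = x + eps * np.sign(-y * w_s)

def predict(w, x):
    """Sign classifier: +1 if w . x >= 0 else -1."""
    return 1.0 if w @ x >= 0 else -1.0

print(predict(w_s, x), predict(w_t, x))          # both correct: 1.0 1.0
print(predict(w_s, x_adv), predict(w_t, x_adv))  # attack transfers: -1.0 -1.0
```

The attack transfers here because the two weight vectors are close; with mismatched label spaces, as noted above, even defining "incorrect" for $f_T$ requires some mapping between $\mathcal{Y}_S$ and $\mathcal{Y}_T$.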