MetaDataset with sequence list filter file #1504

Open
vieting opened this issue Jan 25, 2024 · 0 comments

vieting commented Jan 25, 2024

I have a MetaDataset that contains two HDFDatasets, and I want to apply a sequence list filter file to it. The MetaDataset has a seq_list_file option, but its docstring says:

You only need it if the tag name is not the same for all datasets.
It will currently not act as filter,
as the subdataset controls the sequence order (and thus what seqs to use).

Since the tag names are identical in my case, this does not seem to help. Therefore, I set seq_list_filter_file on each HDFDataset, roughly like this:

dev = {
    "class": "MetaDataset",
    "datasets": {
        "features": {"class": "HDFDataset", ..., "seq_list_filter_file": "/path/to/dev_segments"},
        "alignment": {"class": "HDFDataset", ..., "seq_list_filter_file": "/path/to/dev_segments"},
    },
    "seq_order_control_dataset": "features",
}

When running this config, RETURNN complains:

Reading sequence list for MetaDataset 'dev' from sub-dataset 'dev_features'
Dataset 'alignment' has less sequences (252366) than in sequence list (252377) read from 'features', this cannot work out!
Seq tag 'switchboard-1/sw02663A/sw2663A-ms98-a-0022' in dataset 'features' but not in dataset 'alignment'.

This happens even though the sequence list file contains only 300 lines, none of which is the reported seq tag.
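
For anyone who wants to reproduce that check, here is a minimal sketch. It assumes the filter file is plain text with one seq tag per line (other formats would need different loading), and the helper names load_seq_tags and check_tag are made up for this example; they are not part of RETURNN.

```python
def load_seq_tags(path):
    """Read a plain-text sequence list: one seq tag per line, blank lines ignored."""
    with open(path, "rt", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def check_tag(path, tag):
    """Return (number of tags in the file, whether `tag` is one of them)."""
    tags = load_seq_tags(path)
    return len(tags), tag in set(tags)
```

Calling check_tag("/path/to/dev_segments", "switchboard-1/sw02663A/sw2663A-ms98-a-0022") should then return the line count of the file and False if the tag is indeed absent.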

If no seq_list_file is provided for the MetaDataset, it calls get_all_tags() of the default dataset. HDFDataset.get_all_tags() then returns all tags that are included in the hdf files and does not apply the seq list. This seems unexpected to me and results in the error above. Modifying HDFDataset.get_all_tags() to apply the filter, however, leads to issues in Dataset.get_seq_order_for_epoch().
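
The suspected mechanism can be illustrated with a toy model in plain Python. This is only a sketch of the behaviour described above, not RETURNN code: ToyDataset, get_filtered_tags and missing_tags are hypothetical names, and the real consistency check in MetaDataset may differ in detail.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset with a seq list filter (not RETURNN code)."""

    def __init__(self, all_tags, seq_list_filter=None):
        self.all_tags = list(all_tags)
        self.seq_list_filter = set(seq_list_filter) if seq_list_filter is not None else None

    def get_all_tags(self):
        # Mirrors the behaviour described above: the filter is ignored here.
        return list(self.all_tags)

    def get_filtered_tags(self):
        # What one might expect instead: only tags that pass the filter.
        if self.seq_list_filter is None:
            return list(self.all_tags)
        return [t for t in self.all_tags if t in self.seq_list_filter]


def missing_tags(control, other, use_filter):
    """Tags reported by the control dataset that the other dataset lacks."""
    if use_filter:
        control_tags, other_tags = control.get_filtered_tags(), other.get_filtered_tags()
    else:
        control_tags, other_tags = control.get_all_tags(), other.get_all_tags()
    other_set = set(other_tags)
    return [t for t in control_tags if t not in other_set]


seq_filter = ["seq-1", "seq-2"]
features = ToyDataset(["seq-1", "seq-2", "seq-3"], seq_filter)
alignment = ToyDataset(["seq-1", "seq-2"], seq_filter)  # "seq-3" has no alignment

print(missing_tags(features, alignment, use_filter=False))  # ['seq-3']
print(missing_tags(features, alignment, use_filter=True))   # []
```

In this toy setup, comparing unfiltered tag lists flags "seq-3" as missing even though the filter would never select it, which matches the shape of the error above. Comparing filtered lists gives no mismatch.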

What is a good way to fix the described issues?
