Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pagination support to Files Source plugins #18059

Merged
merged 34 commits into from
May 21, 2024

Conversation

davelopez
Copy link
Contributor

@davelopez davelopez commented Apr 26, 2024

Currently, when listing contents in a remote file source, we retrieve all possible files and directories inside, then perform client-side pagination and search. While this approach is enough for small sources with a limited number of files, larger repositories could potentially contain a vast number of files, making navigation within these sources very slow and difficult.

This addresses the issue by passing limit and offset parameters to paginate the contents. However, this approach comes with its own drawbacks. Specifically, we must carefully handle sorting, filtering, and searching server-side, which may prove challenging depending on the plugin implementation. Also, pagination is only applicable in a "non-recursive" view of the file source.

The pull request includes the following changes:

  • A new serializable property for each plugin called supports_pagination. This property determines if the plugin can paginate the contents server-side so the client can delegate that.
  • Implementation of pagination for every PyFilesystem2-based and Invenio-based plugin.
  • A new plugin named TempFileSource, which simplifies testing of PyFilesystem2-based plugins. This plugin utilizes the built-in OS Filesystem. I wonder if we could even replace the posix plugin with this... unless there are specific functionalities unique to the posix plugin 🤔
  • Fix an issue in PyFilesystem2 plugins when listing them recursively, which was detected during testing with the new TempFileSource plugin.
  • A new serializable property for each plugin called supports_search. This property determines if the plugin can filter results by a search query. The syntax of the query is up to the plugin implementation, but a basic "search by name" should be supported as a minimum.
  • A new serializable property for each plugin called supports_sorting. Most plugins (if any) won't support this, but it can signal the client to disable sorting in the UI when pagination is supported but server-side sorting is not, so the listings are always in sync.
  • Listing file source contents will now return the total number of matches for the request. This is to be able to handle pagination correctly when using server-side pagination.

FilesSourcesPagination

TODO

  • Support server-side pagination.
    • Add total_matches as an additional return value when listing contents.
  • Support server-side filtering/search.
  • Support server-side sorting.
  • Update UI to use server-side pagination

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jmchilton
Copy link
Member

Looking great so far!

@davelopez davelopez force-pushed the file_sources_pagination branch 2 times, most recently from 591200e to d3cef4e Compare May 3, 2024 11:31
@davelopez davelopez force-pushed the file_sources_pagination branch 2 times, most recently from f128313 to b614ee6 Compare May 17, 2024 08:46
@davelopez
Copy link
Contributor Author

davelopez commented May 17, 2024

I guess this test failure:

tests/files/test_gcsfs.py:23: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/files/_util.py:193: in assert_simple_file_realize
    res, count = file_source.list("/", recursive=recursive, user_context=user_context)
galaxy/files/sources/__init__.py:441: in list
    return self._list(path, recursive, user_context, opts, limit, offset, query)
galaxy/files/sources/_pyfilesystem2.py:76: in _list
    count = self._get_total_matches_count(h, path, filter)
galaxy/files/sources/_pyfilesystem2.py:93: in _get_total_matches_count
    directories_count = fs.glob(directory_glob_pattern).count()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Globber(GCSFS('genomics-public-data', root_path=''), '//*/')

    def count(self):
        # type: () -> Counts
        """Count files / directories / data in matched paths.
    
        Example:
            >>> my_fs.glob('**/*.py').count()
            Counts(files=2, directories=0, data=55)
    
        Returns:
            `~Counts`: A named tuple containing results.
    
        """
        directories = 0
        files = 0
        data = 0
        for _path, info in self._make_iter(namespaces=["details"]):
            if info.is_dir:
                directories += 1
            else:
                files += 1
>           data += info.size
E           TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

Is caused either by an upstream implementation error or by this known limitation of GCSFS https://fs-gcsfs.readthedocs.io/en/stable/#limitations

As a workaround, I will catch the error and default to 0 as directories are not a "real" thing in Google Cloud Storage or try to call fix_storage() as stated in the documentation but this might make navigating this file source slower 🤔

Update

Using fix_storage() didn't solve the issue and created more issues, so I'll try the other workaround even if pagination might not work correctly with the Google Cloud Storage plugging. In the worst case, we can set supports_pagination to false for this plugin and fall back to the existing behavior.

@davelopez davelopez marked this pull request as ready for review May 17, 2024 09:37
@github-actions github-actions bot added this to the 24.1 milestone May 17, 2024
@davelopez
Copy link
Contributor Author

This should be finally ready for review. The remaining converter and integration test failures seem unrelated.

test/unit/files/_util.py Outdated Show resolved Hide resolved
test/unit/files/_util.py Outdated Show resolved Hide resolved
test/unit/files/_util.py Outdated Show resolved Hide resolved
test/unit/files/test_webdav.py Outdated Show resolved Hide resolved
test/unit/files/test_webdav.py Outdated Show resolved Hide resolved
lib/galaxy/files/sources/__init__.py Show resolved Hide resolved
lib/galaxy/files/sources/invenio.py Show resolved Hide resolved
lib/galaxy/files/sources/__init__.py Show resolved Hide resolved
@davelopez davelopez marked this pull request as draft May 21, 2024 08:41
@davelopez davelopez changed the base branch from dev to release_24.1 May 21, 2024 09:29
@davelopez davelopez marked this pull request as ready for review May 21, 2024 09:30
Copy link
Member

@jdavcs jdavcs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you for addressing all the comments!

@jdavcs jdavcs merged commit 529ec35 into galaxyproject:release_24.1 May 21, 2024
56 of 57 checks passed
@davelopez davelopez deleted the file_sources_pagination branch May 21, 2024 11:45
Copy link
Member

@jmchilton jmchilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing! I'm glad it got into the release - very nice work.

@jdavcs jdavcs added the highlight Included in user-facing release notes at the top label May 24, 2024
@galaxyproject-sentryintegration

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ InternalServerError: Problem listing file source path gxfiles://noaa-ufs-prototype5-pds/ /api/remote_files View Issue

Did you find this useful? React with a 👍 or 👎

@davelopez
Copy link
Contributor Author

Sentry, you are so cool! 😆

This seems like a user configuration error:

FileNotFoundError: The specified bucket does not exist

It probably should be a 400 instead of a 500 though. I wonder how we could discriminate better between those here 🤔

@mvdbeek
Copy link
Member

mvdbeek commented May 31, 2024

I'm not sure ... in any case eu isn't running 24.1 yet, so this PR is unrelated. If this is an admin config that is wrong it seems correct to raise that error, if this is a user than I agree we don't need to make this a 500 ...

@davelopez
Copy link
Contributor Author

Ahh! You're right! I didn't notice it was on EU. I thought it was on test with someone testing the new user-defined file sources.

Then this does look indeed like a typo or maybe the bucket was renamed. https://registry.opendata.aws/noaa-ufs-s2s/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend highlight Included in user-facing release notes at the top kind/enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants