[WIP][Dependent] Fix task creation with cloud storage data #7903

Marishka17 · 2024-05-17T07:45:41Z

Motivation and context

This PR contains several fixes:

Optimized memory usage by not keeping all downloaded images/part of images in memory while creating a manifest file
Optimized the number of requests to CS providers by downloading only images from a specified range (use_cache in True/False)

How has this been tested?

This table contains metrics (required time and memory usage) obtained before this fix and after.

with/without fix	Threads count	Time for partially files downloading & preparing manifest	Memory usage diff after `_create_task_manifest_from_cloud_data` (MB)
without	1	0:01:02.797817	222
without	2	0:00:11.307688	225
with	1	0:00:12.622483	16.5
with	2	0:00:13.004692	17.2

Testing dataset contains 4000 images, its size is about 400MB, 64 KB was downloaded on average for each image.

Details

  after fix:
  1 thread
  Required time:  0:00:12.622483
  280.7 MiB      0.0 MiB           1                   if data['stop_frame'] is not None:
     727                                                             stop_frame = min(stop_frame, data['stop_frame'])
     728    297.2 MiB     16.5 MiB           2                   _create_task_manifest_from_cloud_data(
     729    280.7 MiB      0.0 MiB           1                       db_data.cloud_storage, sorted_media, manifest,
     730    280.7 MiB      0.0 MiB           1                       start_frame=db_data.start_frame, stop_frame=stop_frame, step=db_data.get_frame_step()
     731                                                         )
     732                                         
     733    297.2 MiB      0.0 MiB           1       av_scan_paths(upload_dir)
  
  2 threads
  Required time:  0:00:13.004692
  726    281.1 MiB      0.0 MiB           1                   if data['stop_frame'] is not None:
     727                                                             stop_frame = min(stop_frame, data['stop_frame'])
     728    298.3 MiB     17.2 MiB           2                   _create_task_manifest_from_cloud_data(
     729    281.1 MiB      0.0 MiB           1                       db_data.cloud_storage, sorted_media, manifest,
     730    281.1 MiB      0.0 MiB           1                       start_frame=db_data.start_frame, stop_frame=stop_frame, step=db_data.get_frame_step()
     731                                                         )
     732                                         
     733    298.3 MiB      0.0 MiB           1       av_scan_paths(upload_dir)
  
  without fix:
  1 thread
  Required time:  0:01:02.797817
  688    280.5 MiB      0.0 MiB           1               if manifest_file:
     689                                                         # Define task manifest content based on cloud storage manifest content and uploaded files
     690                                                         _create_task_manifest_based_on_cloud_storage_manifest(
     691                                                             sorted_media, cloud_storage_manifest_prefix,
     692                                                             cloud_storage_manifest, manifest)
     693                                                     else: # without manifest file but with use_cache option
     694                                                         # Define task manifest content based on list with uploaded files
     695    502.6 MiB    222.1 MiB           1                   _create_task_manifest_from_cloud_data(db_data.cloud_storage, sorted_media, manifest)
     696                                         
     697    502.6 MiB      0.0 MiB           1       av_scan_paths(upload_dir)
  
  2 threads
  Required time:  0:00:11.307688
  280.8 MiB      0.0 MiB           1               if manifest_file:
     689                                                         # Define task manifest content based on cloud storage manifest content and uploaded files
     690                                                         _create_task_manifest_based_on_cloud_storage_manifest(
     691                                                             sorted_media, cloud_storage_manifest_prefix,
     692                                                             cloud_storage_manifest, manifest)
     693                                                     else: # without manifest file but with use_cache option
     694                                                         # Define task manifest content based on list with uploaded files
     695    506.0 MiB    225.2 MiB           1                   _create_task_manifest_from_cloud_data(db_data.cloud_storage, sorted_media, manifest)
     696                                         
     697    506.0 MiB      0.0 MiB           1       av_scan_paths(upload_dir)

Checklist

I submit my changes into the develop branch
I have created a changelog fragment
I have updated the documentation accordingly
I have added tests to cover my changes
I have linked related issues (see GitHub docs)
I have increased versions of npm packages if it is necessary
(cvat-canvas,
cvat-core,
cvat-data and
cvat-ui)

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

New Features
- Introduced threading logic for bulk downloads to enhance performance based on CPU capabilities and file count.
- Added new classes Named and NamedBytesIO for better handling of file attributes in image processing.
Improvements
- Optimized memory usage and request handling for creating manifest files.
- Enhanced handling of image sources and iteration logic in DatasetImagesReader.
Configuration
- Added global settings for cloud data downloading thread management.
Bug Fixes
- Improved retrieval of preview information from manifests to ensure robustness.
Documentation
- Updated internal documentation to reflect new threading logic and image processing enhancements.

coderabbitai · 2024-05-17T07:45:46Z

Walkthrough

The recent updates enhance the CVAT application by optimizing cloud data handling, improving memory usage, and refining threading logic. Key changes include updating methods for preview retrieval, bulk downloading with threading, and updating image processing in manifests. New classes and global variables were introduced to support these enhancements.

Changes

File(s)	Change Summary
`cvat/apps/engine/cache.py`	Modified preview retrieval method to use `get_first_not_empty_item()`.
`cvat/apps/engine/cloud_provider.py`	Added threading logic for bulk downloads and updated return type of `bulk_download_to_memory`.
`cvat/apps/engine/task.py`	Introduced new parameters and methods for task creation and progress updates.
`utils/dataset_manifest/core.py`	Enhanced `DatasetImagesReader` to handle different source types and improved iteration logic.
`cvat/settings/base.py`	Introduced global variables for cloud data downloading thread management.
`changelog.d/20240523_120716_maria_fix_task_creation_with_cs_data.md`	Documented optimizations in memory usage and request handling for image processing.
`utils/dataset_manifest/types.py`	Added `Named` and `NamedBytesIO` classes for better data handling.
`utils/dataset_manifest/__init__.py`	Added import statement for `InvalidManifestError`.

In the world of code, where bytes do flow,
Changes bring a brighter glow.
Threads now dance, in harmony they play,
Making tasks swift, in a cloud ballet.
With names and streams, the data sings,
Progress leaps on rapid wings.
🌟✨🐇

Tip

New Features and Improvements

Review Settings

Introduced new personality profiles for code reviews. Users can now select between "Chill" and "Assertive" review tones to tailor feedback styles according to their preferences. The "Assertive" profile posts more comments and nitpicks the code more aggressively, while the "Chill" profile is more relaxed and posts fewer comments.

AST-based Instructions

CodeRabbit offers customizing reviews based on the Abstract Syntax Tree (AST) pattern matching. Read more about AST-based instructions in the documentation.

Community-driven AST-based Rules

We are kicking off a community-driven initiative to create and share AST-based rules. Users can now contribute their AST-based rules to detect security vulnerabilities, code smells, and anti-patterns. Please see the ast-grep-essentials repository for more information.

New Static Analysis Tools

We are continually expanding our support for static analysis tools. We have added support for biome, hadolint, and ast-grep. Update the settings in your .coderabbit.yaml file or head over to the settings page to enable or disable the tools you want to use.

Tone Settings

Users can now customize CodeRabbit to review code in the style of their favorite characters or personalities. Here are some of our favorite examples:

Mr. T: "You must talk like Mr. T in all your code reviews. I pity the fool who doesn't!"
Pirate: "Arr, matey! Ye must talk like a pirate in all yer code reviews. Yarrr!"
Snarky: "You must be snarky in all your code reviews. Snark, snark, snark!"

Revamped Settings Page

We have redesigned the settings page for a more intuitive layout, enabling users to find and adjust settings quickly. This change was long overdue; it not only improves the user experience but also allows our development team to add more settings in the future with ease. Going forward, the changes to .coderabbit.yaml will be reflected in the settings page, and vice versa.

Miscellaneous

Turn off free summarization: You can switch off free summarization of PRs opened by users not on a paid plan using the enable_free_tier setting.
Knowledge-base scope: You can now set the scope of the knowledge base to either the repository (local) or the organization (global) level using the knowledge_base setting. In addition, you can specify Jira project keys and Linear team keys to limit the knowledge base scope for those integrations.
High-level summary placement: You can now customize the location of the high-level summary in the PR description using the high_level_summary_placeholder setting (default @coderabbitai summary).
Revamped request changes workflow: You can now configure CodeRabbit to auto-approve or request changes on PRs based on the review feedback using the request_changes_workflow setting.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to full the review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

Marishka17 · 2024-05-17T08:02:07Z

/check

github-actions · 2024-05-17T08:02:20Z

❌ Some checks failed
📄 See logs here

Marishka17 · 2024-05-17T11:54:59Z

/check

github-actions · 2024-05-17T11:55:09Z

❌ Some checks failed
📄 See logs here

github-actions · 2024-05-17T12:20:38Z

❌ Some checks failed
📄 See logs here

github-actions · 2024-05-17T12:47:37Z

❌ Some checks failed
📄 See logs here

utils/dataset_manifest/core.py

Marishka17 · 2024-05-20T07:57:43Z

cvat/apps/engine/task.py

+                stop_frame = len(sorted_media)
+                if data['stop_frame'] is not None:


refactor code to remove duplicates

Marishka17 · 2024-05-20T08:00:08Z

@zhiltsov-max, Could you please take a look at the PR? I've added small comments that should be fixed but I can fix them only in the evening probably.

coderabbitai

Actionable comments posted: 2

Out of diff range and nitpick comments (1)

cvat/apps/engine/task.py (1)
Line range hint 893-893: Change the membership test to use not in for correct logical evaluation.
- if not chunk_path.endswith(f"{properties['name']}{properties['extension']}"):
+ if chunk_path not in f"{properties['name']}{properties['extension']}":

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1a86ccd and 40fb577.

Files selected for processing (4)

cvat/apps/engine/cache.py (1 hunks)
cvat/apps/engine/cloud_provider.py (4 hunks)
cvat/apps/engine/task.py (4 hunks)
utils/dataset_manifest/core.py (8 hunks)

Additional Context Used

Ruff (2)

cvat/apps/engine/task.py (1)

893-893: Test for membership should be not in

utils/dataset_manifest/core.py (1)

788-788: Test for membership should be not in

Additional comments not posted (6)

cvat/apps/engine/cache.py (1)
315-315: Change to use manifest.get_first_not_empty_item() improves robustness by ensuring a non-empty item is retrieved.

Verification successful

The _prepare_cloud_preview method is properly integrated within the _get_or_set_cache_item method, ensuring that the returned values are correctly handled and cached.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify integration of `_prepare_cloud_preview` with other components.

# Test: Search for usages of `_prepare_cloud_preview`. Expect: Proper handling of the returned values.
rg --type python '_prepare_cloud_preview'
Length of output: 72

Script:
#!/bin/bash
# Description: Verify integration of `_prepare_cloud_preview` with other components.

# Test: Search for usages of `_prepare_cloud_preview`. Expect: Proper handling of the returned values.
rg --type py '_prepare_cloud_preview'
Length of output: 220

Script:
#!/bin/bash
# Extract and review the context around the usages of `_prepare_cloud_preview`.

# Extract 5 lines before and after the first usage
rg --type py '_prepare_cloud_preview' -A 5 -B 5
Length of output: 1472

Script:
#!/bin/bash
# Extract the `_get_or_set_cache_item` method for review.

# Search for the `_get_or_set_cache_item` method definition and its content
ast-grep --lang python --pattern $'def _get_or_set_cache_item($_, $_):\n    $$$\n'
Length of output: 81

Script:
#!/bin/bash
# Search for the `_get_or_set_cache_item` method definition and its content.

# Search for the method definition
rg 'def _get_or_set_cache_item' -A 20
Length of output: 1417
utils/dataset_manifest/core.py (1)
196-222: Updates to _get_img_properties enhance handling of different image source types. Verify integration with other components.

Verification successful

The _get_img_properties method is used multiple times within the utils/dataset_manifest/core.py file, primarily in generator expressions. The context shows that the method is called to yield image properties, and the returned values are handled appropriately.

Lines where _get_img_properties is called:

utils/dataset_manifest/core.py: yield self._get_img_properties(image)

The integration appears to be correct, with the method's return values being used as expected.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify integration of `_get_img_properties` with other components.

# Test: Search for usages of `_get_img_properties`. Expect: Proper handling of the returned values.
rg --type python '_get_img_properties'
Length of output: 69

Script:
#!/bin/bash
# Description: Verify integration of `_get_img_properties` with other components.

# Test: Search for usages of `_get_img_properties`. Expect: Proper handling of the returned values.
rg --type py '_get_img_properties'
Length of output: 510

Script:
#!/bin/bash
# Description: Extract and review the context of `_get_img_properties` calls.

# Extract lines where `_get_img_properties` is called along with some surrounding lines for context.
rg --type py '_get_img_properties' -A 3 -B 3
Length of output: 2351
cvat/apps/engine/cloud_provider.py (2)

9-9: Introduce threading and constants for performance optimization.

The introduction of threading using ThreadPoolExecutor and the constants CPU_NUMBER, MAX_THREADS_NUMBER, and NUMBER_OF_FILES_PER_THREAD are well-aligned with the PR's objective to optimize performance. This should effectively manage the number of threads based on the CPU capabilities and the number of files, which is crucial for handling large volumes of data efficiently.

Also applies to: 14-15, 40-42

185-185: Enhance file downloading functions with optimal threading.

The modifications in bulk_download_to_memory and bulk_download_to_dir to incorporate threading logic are commendable. These changes should significantly reduce the time required for downloading files by parallelizing the process. The use of ThreadPoolExecutor and the conditional logic to determine the number of threads based on the workload are appropriate and should ensure that the system resources are used efficiently.

Also applies to: 205-219, 225-237

cvat/apps/engine/task.py (2)

484-502: Optimized task manifest creation from cloud data to handle large datasets more efficiently.

716-722: Refactor to remove duplicate code as suggested in previous comments.

cvat/apps/engine/task.py

utils/dataset_manifest/core.py

codecov-commenter · 2024-05-20T08:30:46Z

Codecov Report

Attention: Patch coverage is 90.51095% with 13 lines in your changes are missing coverage. Please review.

Project coverage is 83.37%. Comparing base (f892bd4) to head (c9867b0).

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #7903      +/-   ##
===========================================
- Coverage    83.37%   83.37%   -0.01%     
===========================================
  Files          381      382       +1     
  Lines        40053    40123      +70     
  Branches      3769     3769              
===========================================
+ Hits         33395    33451      +56     
- Misses        6658     6672      +14

Components	Coverage Δ
cvat-ui	`79.25% <ø> (-0.06%)`	⬇️
cvat-server	`87.15% <90.51%> (+0.02%)`	⬆️

zhiltsov-max

There are quite a few places where stop is changed from (the last valid index) to (the first index after the last valid index). In most places in CVAT it's used in the first meaning, I suggest avoiding changing this to avoid confusion.

cvat/apps/engine/cloud_provider.py

cvat/apps/engine/task.py

cvat/apps/engine/cloud_provider.py

utils/dataset_manifest/core.py

cvat/apps/engine/task.py

utils/dataset_manifest/core.py

coderabbitai

Actionable comments posted: 6

Outside diff range and nitpick comments (3)

utils/dataset_manifest/core.py (2)
144-153: Consider documenting the DatasetImagesReader constructor parameters.

Adding docstrings to the constructor would improve code readability and maintainability by clearly explaining the purpose and expected types of each parameter, especially for complex functions like this one.

Line range hint 761-761: Correct the membership test in the condition.
- if not _dict['version'] in _Manifest.SupportedVersion.choices():
+ if _dict['version'] not in _Manifest.SupportedVersion.choices():
This change corrects the syntax for the membership test to follow Python best practices and improve readability.
cvat/apps/engine/task.py (1)
Line range hint 895-895: Consider using not in for membership tests to enhance readability and performance.
- if not manifest_file or not os.path.isfile(os.path.join(manifest_root, manifest_file)):
+ if manifest_file not in os.path.isfile(os.path.join(manifest_root, manifest_file)):

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 40fb577 and 64e3218.

Files selected for processing (4)

changelog.d/20240523_120716_maria_fix_task_creation_with_cs_data.md (1 hunks)
cvat/apps/engine/cloud_provider.py (4 hunks)
cvat/apps/engine/task.py (4 hunks)
utils/dataset_manifest/core.py (8 hunks)

Files skipped from review due to trivial changes (1)

changelog.d/20240523_120716_maria_fix_task_creation_with_cs_data.md

Additional Context Used

Ruff (2)

cvat/apps/engine/task.py (1)

895-895: Test for membership should be not in

utils/dataset_manifest/core.py (1)

761-761: Test for membership should be not in

Additional comments not posted (5)

utils/dataset_manifest/core.py (1)
168-170: Clarify the necessity of the stop parameter with generators.

The current implementation raises a ValueError if the stop parameter is not passed when a generator is used. This is crucial for preventing runtime errors due to undefined behavior when the end of the generator is not known.

Verification successful

The DatasetImagesReader class constructor includes a check for whether the sources parameter is a generator using the isgenerator function. This check sets the _is_generator_used attribute. The code snippet provided in the review comment ensures that if a generator is used, the stop parameter must be provided to avoid runtime errors due to undefined behavior when the end of the generator is not known.

The stop parameter is crucial when using generators to define the endpoint of the iteration, preventing potential runtime errors.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Test: Search for usages of DatasetImagesReader with a generator. Expect: Proper handling of the `stop` parameter.
rg --type python 'DatasetImagesReader\([^)]*isgenerator'
Length of output: 89

Script:
#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Test: Search for usages of DatasetImagesReader with a generator. Expect: Proper handling of the `stop` parameter.
rg 'DatasetImagesReader\([^)]*isgenerator'
Length of output: 44

Script:
#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Search for the `isgenerator` parameter usage in the codebase.
rg 'isgenerator'

# Inspect the `DatasetImagesReader` class definition to understand how it handles generators.
ast-grep --lang python --pattern $'class DatasetImagesReader {
  $$$
}'
Length of output: 696

Script:
#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Inspect the `DatasetImagesReader` class definition to understand how it handles generators.
ast-grep --lang python --pattern $'class DatasetImagesReader {
  $$$
}' utils/dataset_manifest/core.py
Length of output: 561

Script:
#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Inspect the `DatasetImagesReader` class definition to understand how it handles generators.
ast-grep --lang python --pattern 'class DatasetImagesReader { $$$ }' utils/dataset_manifest/core.py

# Fallback: Search for the `DatasetImagesReader` class definition using `rg`.
rg 'class DatasetImagesReader' utils/dataset_manifest/core.py
Length of output: 648

Script:
#!/bin/bash
# Description: Extract the `DatasetImagesReader` class definition for manual inspection.

# Extract the `DatasetImagesReader` class definition from `utils/dataset_manifest/core.py`.
sed -n '/class DatasetImagesReader:/,/^$/p' utils/dataset_manifest/core.py
Length of output: 511
cvat/apps/engine/cloud_provider.py (1)

191-191: Ensure proper error handling in image download logic.

The method optimally_image_download attempts to parse an image and falls back to downloading the entire file if the initial chunk is insufficient. Verify that this fallback mechanism works correctly across different image formats and sizes to ensure robustness.

cvat/apps/engine/task.py (3)

484-504: The updates to _create_task_manifest_from_cloud_data function align well with the PR objectives to optimize cloud storage interactions.

677-688: The implementation of _download_data_from_cloud_storage is efficient and uses cloud storage capabilities effectively.

484-504: The data validation logic in _validate_data is robust, ensuring that the uploaded data conforms to expected formats and quantities.

utils/dataset_manifest/core.py

cvat/apps/engine/cloud_provider.py

utils/dataset_manifest/core.py

coderabbitai

Actionable comments posted: 6

Outside diff range and nitpick comments (1)

utils/dataset_manifest/core.py (1)

144-153: Ensure consistent use of type hints.

The constructor for DatasetImagesReader uses type hints for some parameters but not for others. It would improve readability and maintainability if all parameters had explicit type hints.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 64e3218 and 914b7bc.

Files selected for processing (3)

cvat/apps/engine/cloud_provider.py (4 hunks)
cvat/settings/base.py (1 hunks)
utils/dataset_manifest/core.py (8 hunks)

Additional Context Used

Ruff (5)

cvat/settings/base.py (4)

34-34: Module level import not at top of file

36-36: Module level import not at top of file

80-80: keys.secret_key.SECRET_KEY imported but unused

718-718: Module level import not at top of file

utils/dataset_manifest/core.py (1)

761-761: Test for membership should be not in

Additional comments not posted (3)

utils/dataset_manifest/core.py (1)

162-163: Good enforcement of method constraints.

Raising a ValueError when a non-predefined sorting method is used with a generator is a good practice as it enforces the method's constraints explicitly.

cvat/apps/engine/cloud_provider.py (2)

223-231: Implementation of bulk_download_to_dir handles threading effectively.

223-231: Cloud storage instance creation is well-handled with comprehensive support for different providers and attributes.

cvat/apps/engine/cloud_provider.py

cvat/settings/base.py

utils/dataset_manifest/core.py

cvat/apps/engine/cloud_provider.py

utils/dataset_manifest/core.py

cvat/apps/engine/cloud_provider.py

…led tests

coderabbitai

Actionable comments posted: 4

Outside diff range and nitpick comments (4)

utils/dataset_manifest/core.py (2)
Line range hint 7-7: Remove unused import io.BytesIO.
- from io import StringIO, BytesIO
+ from io import StringIO
This change cleans up the imports by removing the unused BytesIO, which is not referenced anywhere in this file.

Line range hint 763-763: Correct the membership test in is_empty.
- if self._index.is_empty():
+ if not self._index.is_empty():
This change corrects the logic to properly check if the index is not empty before loading it, which aligns with the intended functionality.
cvat/apps/engine/cloud_provider.py (2)

9-9: Consider making the math import configurable or ensure it's necessary here as it's only used for thread normalization.

Line range hint 184-212: The method optimally_image_download is well-implemented with detailed logging and error handling. However, consider adding more comments to explain the logic, especially around the conditions that determine whether the entire file needs to be downloaded.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 914b7bc and c1b6c8b.

Files selected for processing (3)

cvat/apps/engine/cloud_provider.py (8 hunks)
utils/dataset_manifest/core.py (10 hunks)
utils/dataset_manifest/types.py (1 hunks)

Additional Context Used

Ruff (2)

utils/dataset_manifest/core.py (2)

7-7: io.BytesIO imported but unused

763-763: Test for membership should be not in

Additional comments not posted (5)

utils/dataset_manifest/types.py (1)

8-9: The Named protocol is well-defined and serves its purpose effectively.

utils/dataset_manifest/core.py (2)

199-225: Refactor the _get_img_properties method for clarity.

This refactor simplifies the method by reducing the number of conditionals and improving the structure of the code for better readability and maintainability.

227-233: Optimize the iteration logic in __iter__.

This change uses islice directly in the loop to simplify the iteration and avoid unnecessary checks and range creations inside the loop.

cvat/apps/engine/cloud_provider.py (2)

35-43: The implementation of NamedBytesIO with filename as a property is well-done. This encapsulation enhances the maintainability of the code.

149-149: The method download_fileobj in different storage classes (_CloudStorage, AWS_S3, AzureBlobContainer, GoogleCloudStorage) is consistently implemented. This ensures uniform behavior across different cloud providers, which is good for maintainability and debugging.

Also applies to: 533-541, 735-744, 849-854

cvat/apps/engine/cloud_provider.py

utils/dataset_manifest/types.py

coderabbitai

Actionable comments posted: 2

Outside diff range and nitpick comments (1)

utils/dataset_manifest/core.py (1)
Line range hint 763-763: Use not in for membership tests to improve readability.
- if not _dict['version'] in _Manifest.SupportedVersion.choices():
+ if _dict['version'] not in _Manifest.SupportedVersion.choices():
This change makes the membership test more readable and aligns with common Python practices.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c1b6c8b and b0efba5.

Files selected for processing (1)

utils/dataset_manifest/core.py (10 hunks)

Additional Context Used

Ruff (1)

utils/dataset_manifest/core.py (1)

763-763: Test for membership should be not in

utils/dataset_manifest/core.py

coderabbitai

Actionable comments posted: 4

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b0efba5 and 0e71a05.

Files selected for processing (1)

cvat/apps/engine/cloud_provider.py (8 hunks)

Additional comments not posted (1)

cvat/apps/engine/cloud_provider.py (1)
Line range hint 200-228: The method optimally_image_download attempts to download just enough of the image to parse its properties, which is a good optimization. However, ensure that the error handling is robust, especially in cases where the image properties cannot be determined with the initial chunk. Consider logging or handling these cases explicitly.

Verification successful

The optimally_image_download method does include a warning log when the initial chunk is insufficient to determine the image properties. This log provides detailed information about the image size and resolution, which is a good practice for error handling and debugging.

The method logs a warning if the initial chunk is not enough to parse the image properties.

It then proceeds to download the entire file if the initial chunk is insufficient.

This approach ensures that the method handles cases where the image properties cannot be determined with the initial chunk, providing robust error handling.
Scripts Executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify robust error handling in optimally_image_download.

# Test: Search for error handling related to image property determination failures.
rg 'optimally_image_download' --type py
Length of output: 300

Script:
#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis
ast-grep --lang python --pattern $'def optimally_image_download($_, $_) {
  $$$
}' cvat/apps/engine/cloud_provider.py
Length of output: 587

Script:
#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis
ast-grep --lang python --pattern 'def optimally_image_download($_, $$$): $$$' cvat/apps/engine/cloud_provider.py
Length of output: 114

Script:
#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis using ripgrep
rg -A 20 -B 5 'def optimally_image_download' cvat/apps/engine/cloud_provider.py
Length of output: 1055

Script:
#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis using ripgrep
rg -A 40 -B 5 'def optimally_image_download' cvat/apps/engine/cloud_provider.py
Length of output: 1920

cvat/apps/engine/cloud_provider.py

coderabbitai

Actionable comments posted: 3

Outside diff range and nitpick comments (3)

cvat/apps/engine/cloud_provider.py (3)
9-9: Consider making the math import conditional or localized within functions where it's used, if it's only needed for specific operations. This can help reduce the import overhead when the module is loaded but the specific functionality requiring math is not used.

48-70: The normalize_threads_number function is well-implemented, ensuring that the number of threads does not exceed the maximum allowed by the system or settings. However, consider adding a brief comment explaining the logic, especially the use of math.ceil and the conditions checked, to improve code readability.

Line range hint 200-228: The method optimally_image_download is well-documented and implements a robust approach to downloading images optimally. However, consider handling exceptions that might arise during the download process, such as network issues or corrupted data, to ensure the application's stability.
+        try:
+            image_parser.feed(chunk)
+        except Exception as e:
+            slogger.glob.error(f"Failed to parse image chunk for '{key}': {str(e)}")
+            raise

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0e71a05 and b432acd.

Files selected for processing (2)

cvat/apps/engine/cloud_provider.py (8 hunks)
cvat/settings/base.py (1 hunks)

Additional Context Used

Ruff (4)

cvat/settings/base.py (4)

34-34: Module level import not at top of file

36-36: Module level import not at top of file

80-80: keys.secret_key.SECRET_KEY imported but unused

718-718: Module level import not at top of file

Additional comments not posted (1)

cvat/apps/engine/cloud_provider.py (1)

35-43: The implementation of the NamedBytesIO class with filename as a property is a good approach for encapsulating file-related data. This should help in managing file operations more effectively.

cvat/settings/base.py

coderabbitai

Actionable comments posted: 3

Outside diff range and nitpick comments (4)

cvat/settings/base.py (1)
Line range hint 34-34: Move module-level imports to the top of the file to adhere to Python's PEP 8 style guide.
+ import mimetypes
+ import os
+ import sys
+ import tempfile
+ from datetime import timedelta
+ from enum import Enum
+ import urllib
+ from attr.converters import to_bool
+ from corsheaders.defaults import default_headers
+ from logstash_async.constants import constants as logstash_async_constants
+ from cvat import __version__
+ from pathlib import Path
+ from django.core.exceptions import ImproperlyConfigured
+ from cvat.rq_patching import update_started_job_registry_cleanup
Also applies to: 36-36, 718-718
cvat/apps/engine/cloud_provider.py (3)
9-9: Consider making the math import conditional or localized within functions where it's used, if it's only needed for specific operations. This can help reduce the import overhead when the module is loaded but the specific functionality requiring math is not used.

48-70: The normalize_threads_number function is well-implemented, ensuring that the number of threads does not exceed the limits set by the system or configuration. However, consider adding a brief comment explaining the logic, especially how it handles the threads_number parameter when it's None.

Line range hint 200-228: The method optimally_image_download is a significant improvement in handling image downloads by first trying with a chunk and then downloading the full image if necessary. However, consider adding error handling for cases where the image cannot be parsed even after downloading the full file.
+        try:
+            image_parser.feed(chunk)
+        except Exception as ex:
+            slogger.glob.error(f"Failed to parse image from chunk: {ex}")
+            raise

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0e71a05 and b432acd.

Files selected for processing (2)

cvat/apps/engine/cloud_provider.py (8 hunks)
cvat/settings/base.py (1 hunks)

Additional Context Used

Ruff (4)

cvat/settings/base.py (4)

34-34: Module level import not at top of file

36-36: Module level import not at top of file

80-80: keys.secret_key.SECRET_KEY imported but unused

718-718: Module level import not at top of file

Additional comments not posted (1)

cvat/apps/engine/cloud_provider.py (1)

35-43: The implementation of NamedBytesIO with a custom filename property is a good approach to enhance the functionality of BytesIO. This allows for better handling of file-related metadata which is often necessary in cloud operations.

cvat/settings/base.py

Marishka17 · 2024-05-24T10:17:18Z

@coderabbitai, can you please remove all your comments from this PR?

coderabbitai · 2024-05-24T10:19:06Z

@Marishka17, I have processed the request to remove all my comments from PR #7903. If there's anything else you need, feel free to let me know!

Analysis chain

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Fetch all comments made by @coderabbitai[bot] in PR #7903 and delete them
gh pr view 7903 --json comments --jq '.comments[] | select(.author.login == "coderabbitai[bot]") | .id' | xargs -I {} gh api -X DELETE /repos/cvat-ai/cvat/issues/comments/{}

Length of output: 179

Actions Performed

Comments resolved.

Marishka17 · 2024-05-24T10:28:56Z

@zhiltsov-max, I've applied all your comments, could you please take a look at the PR one more time?

coderabbitai

Actionable comments posted: 1

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b432acd and 7f91dd8.

Files selected for processing (3)

cvat/apps/engine/cache.py (2 hunks)
cvat/apps/engine/task.py (7 hunks)
utils/dataset_manifest/init.py (1 hunks)

Files skipped from review as they are similar to previous changes (1)

cvat/apps/engine/cache.py

Additional Context Used

Ruff (5)

cvat/apps/engine/task.py (1)

899-899: Test for membership should be not in

utils/dataset_manifest/__init__.py (4)

4-4: .core.VideoManifestManager imported but unused; consider removing, adding to __all__, or using a redundant alias

4-4: .core.ImageManifestManager imported but unused; consider removing, adding to __all__, or using a redundant alias

4-4: .core.is_manifest imported but unused; consider removing, adding to __all__, or using a redundant alias

5-5: .errors.InvalidManifestError imported but unused; consider removing, adding to __all__, or using a redundant alias

coderabbitai · 2024-05-24T16:42:25Z

utils/dataset_manifest/__init__.py

 from .core import VideoManifestManager, ImageManifestManager, is_manifest
+from .errors import InvalidManifestError


Consider removing unused imports or explicitly including them in the __all__ list if they are intended for external use.

- from .core import VideoManifestManager, ImageManifestManager, is_manifest - from .errors import InvalidManifestError + __all__ = ['VideoManifestManager', 'ImageManifestManager', 'is_manifest', 'InvalidManifestError']

Committable suggestion was skipped due low confidence.

… created with random sorting & cloud data & use_cache==True

coderabbitai

Actionable comments posted: 2

Outside diff range and nitpick comments (4)

cvat/apps/engine/cloud_provider.py (3)

9-9: Consider making the math import conditional or localized within functions where it's used, if it's only needed for specific operations. This can help reduce the import overhead when the module is loaded but the specific functionality requiring math is not used.

48-70: The normalize_threads_number function is well-implemented, providing a robust way to determine the optimal number of threads based on system and user constraints. However, consider adding a brief comment explaining the logic, especially the use of min and max functions, to improve code readability.

Line range hint 200-228: The method optimally_image_download introduces an efficient way to download images by first trying with a chunk and then downloading the full image if necessary. This is a good optimization. However, ensure that the logging level is appropriate for the warnings about insufficient chunk sizes, as this might clutter the log files in normal operations.

cvat/apps/engine/task.py (1)

Line range hint 900-900: The static analysis tool flagged a potential issue with a membership test on line 900, but this line is not included in the provided code snippet. This might be a false positive or an error in the tool's output.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7f91dd8 and 5c50222.

Files selected for processing (2)

cvat/apps/engine/cloud_provider.py (8 hunks)
cvat/apps/engine/task.py (7 hunks)

Additional Context Used

Ruff (1)

cvat/apps/engine/task.py (1)

900-900: Test for membership should be not in

Additional comments not posted (2)

cvat/apps/engine/cloud_provider.py (1)

35-43: The implementation of the NamedBytesIO class with a filename property is a good addition for handling file names in memory streams. This should enhance the handling of file objects within the cloud storage operations.

cvat/apps/engine/task.py (1)

484-504: The updates to _create_task_manifest_from_cloud_data effectively utilize the new parameters for frame slicing and dimension handling, aligning well with the PR's objectives to optimize cloud interactions.

coderabbitai · 2024-05-26T22:00:57Z

cvat/apps/engine/cloud_provider.py

@@ -128,7 +162,7 @@ def get_file_last_modified(self, key):
        pass

    @abstractmethod
-    def download_fileobj(self, key):
+    def download_fileobj(self, key: str) -> NamedBytesIO:


The method download_fileobj is implemented in multiple classes but seems to have a similar structure. Consider refactoring to reduce duplication and improve maintainability. Perhaps a base class implementation could be used, with subclasses overriding only the necessary parts.

Also applies to: 551-559, 753-762, 867-872

coderabbitai · 2024-05-26T22:00:57Z

cvat/apps/engine/cloud_provider.py

+        *,
+        threads_number: Optional[int] = None,
        _use_optimal_downloading: bool = True,
-    ) -> List[BytesIO]:
+    ) -> Iterator[BytesIO]:
        func = self.optimally_image_download if _use_optimal_downloading else self.download_fileobj
-        if threads_number > 1:
-            with ThreadPool(threads_number) as pool:
-                return pool.map(func, files)
-        else:
-            slogger.glob.warning('Download files to memory in series in one thread.')
-            return [func(f) for f in files]
+        threads_number = normalize_threads_number(threads_number, len(files))
+
+        with ThreadPoolExecutor(max_workers=threads_number) as executor:
+            yield from executor.map(func, files)


The implementation of bulk_download_to_memory and bulk_download_to_dir using threading is a robust enhancement. However, ensure that exceptions during thread execution are handled to prevent the application from crashing. Consider wrapping the thread execution in a try-except block.

+ try: + with ThreadPoolExecutor(max_workers=threads_number) as executor: + yield from executor.map(func, files) + except Exception as ex: + slogger.glob.error(f"Error during bulk download: {ex}") + raise

Also applies to: 248-258

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

*,

threads_number: Optional[int] = None,

_use_optimal_downloading: bool = True,

) -> List[BytesIO]:

) -> Iterator[BytesIO]:

func = self.optimally_image_download if _use_optimal_downloading else self.download_fileobj

if threads_number > 1:

with ThreadPool(threads_number) as pool:

return pool.map(func, files)

else:

slogger.glob.warning('Download files to memory in series in one thread.')

return [func(f) for f in files]

threads_number = normalize_threads_number(threads_number, len(files))

with ThreadPoolExecutor(max_workers=threads_number) as executor:

yield from executor.map(func, files)

*,

threads_number: Optional[int] = None,

_use_optimal_downloading: bool = True,

) -> Iterator[BytesIO]:

func = self.optimally_image_download if _use_optimal_downloading else self.download_fileobj

threads_number = normalize_threads_number(threads_number, len(files))

try:

with ThreadPoolExecutor(max_workers=threads_number) as executor:

yield from executor.map(func, files)

except Exception as ex:

slogger.glob.error(f"Error during bulk download: {ex}")

raise

coderabbitai

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5c50222 and c9867b0.

Files selected for processing (1)

utils/dataset_manifest/init.py (1 hunks)

Files skipped from review as they are similar to previous changes (1)

utils/dataset_manifest/init.py

cvat/apps/engine/cloud_provider.py

Do not keep all downloaded images in memory

f15cc60

Marishka17 added bug Something isn't working server labels May 17, 2024

Fix type

f189599

fix pylint issue

b702500

Marishka17 commented May 20, 2024

View reviewed changes

Merge branch 'develop' into mk/fix_task_creation_with_cs_data

40fb577

Marishka17 marked this pull request as ready for review May 20, 2024 07:58

Marishka17 requested a review from azhavoro as a code owner May 20, 2024 07:58

Marishka17 requested review from zhiltsov-max and removed request for azhavoro May 20, 2024 08:00

coderabbitai bot reviewed May 20, 2024

View reviewed changes

cvat/apps/engine/task.py Outdated Show resolved Hide resolved

utils/dataset_manifest/core.py Outdated Show resolved Hide resolved

zhiltsov-max reviewed May 20, 2024

View reviewed changes

Marishka17 added 2 commits May 23, 2024 13:01

Apply comments

1e41676

Add changelog fragment

64e3218

Marishka17 requested a review from nmanovic as a code owner May 23, 2024 11:02

coderabbitai bot reviewed May 23, 2024

View reviewed changes

utils/dataset_manifest/core.py Show resolved Hide resolved

utils/dataset_manifest/core.py Show resolved Hide resolved

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved

Marishka17 added 2 commits May 23, 2024 14:00

Move settings

a8b55db

Fix stop frame

914b7bc

Marishka17 commented May 23, 2024

View reviewed changes

utils/dataset_manifest/core.py Show resolved Hide resolved

coderabbitai bot reviewed May 23, 2024

View reviewed changes

zhiltsov-max reviewed May 23, 2024

View reviewed changes

utils/dataset_manifest/core.py Show resolved Hide resolved

zhiltsov-max reviewed May 23, 2024

View reviewed changes

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved

Marishka17 added 3 commits May 24, 2024 00:43

Fix typing && use ThreadPoolExecutor instead of ThreadPool && fix fai…

7b21a8c

…led tests

Merge branch 'develop' into mk/fix_task_creation_with_cs_data

c1b6c8b

Remove unused import

b0efba5

coderabbitai bot reviewed May 23, 2024

View reviewed changes

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved

utils/dataset_manifest/types.py Show resolved Hide resolved

coderabbitai bot reviewed May 23, 2024

View reviewed changes

utils/dataset_manifest/core.py Show resolved Hide resolved

utils/dataset_manifest/core.py Show resolved Hide resolved

Ensure that the threads number is always non-negative

0e71a05

coderabbitai bot reviewed May 24, 2024

View reviewed changes

cvat/apps/engine/cloud_provider.py Show resolved Hide resolved

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved

zhiltsov-max reviewed May 24, 2024

View reviewed changes

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved

Marishka17 added 2 commits May 24, 2024 12:05

Use wait && rename settings

025c01a

Merge branch 'develop' into mk/fix_task_creation_with_cs_data

b432acd

coderabbitai bot reviewed May 24, 2024

View reviewed changes

cvat/settings/base.py Show resolved Hide resolved

coderabbitai bot reviewed May 24, 2024

View reviewed changes

cvat/settings/base.py Show resolved Hide resolved

Handle InvalidManifestError exception

b813a85

coderabbitai bot reviewed May 24, 2024

View reviewed changes

Fix one more bug: soring was applied 2 times therefore task cannot be…

5c50222

… created with random sorting & cloud data & use_cache==True

Marishka17 force-pushed the mk/fix_task_creation_with_cs_data branch from 7f91dd8 to 5c50222 Compare May 26, 2024 21:57

coderabbitai bot reviewed May 26, 2024

View reviewed changes

Apply comment

c9867b0

coderabbitai bot reviewed May 26, 2024

View reviewed changes

zhiltsov-max reviewed May 27, 2024

View reviewed changes

cvat/apps/engine/cloud_provider.py Show resolved Hide resolved

zhiltsov-max reviewed May 27, 2024

View reviewed changes

cvat/apps/engine/cloud_provider.py Show resolved Hide resolved

Marishka17 changed the title ~~Fix task creation with cloud storage data~~ [WIP][Dependent] Fix task creation with cloud storage data May 30, 2024

zhiltsov-max mentioned this pull request May 30, 2024

Fixed several issues related to creating tasks with cloud data #7969

Open

7 tasks

		stop_frame = len(sorted_media)
		if data['stop_frame'] is not None:

		from .core import VideoManifestManager, ImageManifestManager, is_manifest
		from .errors import InvalidManifestError

[WIP][Dependent] Fix task creation with cloud storage data #7903

Are you sure you want to change the base?

[WIP][Dependent] Fix task creation with cloud storage data #7903

Conversation

Marishka17 commented May 17, 2024 • edited by coderabbitai bot

Motivation and context

How has this been tested?

Checklist

License

Summary by CodeRabbit

coderabbitai bot commented May 17, 2024 • edited

Walkthrough

Changes

Review Settings

AST-based Instructions

Community-driven AST-based Rules

New Static Analysis Tools

Tone Settings

Revamped Settings Page

Miscellaneous

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (.coderabbit.yaml)

Documentation and Community

Marishka17 commented May 17, 2024

github-actions bot commented May 17, 2024 • edited

Marishka17 commented May 17, 2024

github-actions bot commented May 17, 2024 • edited

github-actions bot commented May 17, 2024 • edited

github-actions bot commented May 17, 2024 • edited

Marishka17 May 20, 2024

Choose a reason for hiding this comment

Marishka17 commented May 20, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

codecov-commenter commented May 20, 2024 • edited

Codecov Report

zhiltsov-max left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

Marishka17 commented May 24, 2024

coderabbitai bot commented May 24, 2024

Marishka17 commented May 24, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot May 24, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot May 26, 2024

Choose a reason for hiding this comment

coderabbitai bot May 26, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

Marishka17 commented May 17, 2024 •

edited by coderabbitai bot

coderabbitai bot commented May 17, 2024 •

edited

CodeRabbit Configration File (`.coderabbit.yaml`)

github-actions bot commented May 17, 2024 •

edited

github-actions bot commented May 17, 2024 •

edited

github-actions bot commented May 17, 2024 •

edited

github-actions bot commented May 17, 2024 •

edited

codecov-commenter commented May 20, 2024 •

edited