Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ModelBuilder and RandomBuilder #971

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jrycw
Copy link
Contributor

@jrycw jrycw commented Mar 22, 2024

I encountered ModelBuilder._randomize_attribute and initially found the numerous if-elif-else checks puzzling. However, further investigation revealed that certain factories necessitate additional information from the column. I attempted to consolidate these logics into RandomBuilder._build, aiming for improved clarity. Nevertheless, the final code may not be as pristine as desired. I'm looking forward to hearing the community's thoughts on whether this refactoring improves the codebase.

@codecov-commenter
Copy link

codecov-commenter commented Mar 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.82%. Comparing base (91d6238) to head (ad763c3).
Report is 4 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #971      +/-   ##
==========================================
+ Coverage   92.78%   92.82%   +0.03%     
==========================================
  Files         108      108              
  Lines        8182     8222      +40     
==========================================
+ Hits         7592     7632      +40     
  Misses        590      590              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@dantownsend dantownsend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great - thanks 👍

ModelBuilder is something which started fairly simple, but as more and more edge cases were discovered it has become quite complex.

I agree that your solution is cleaner and easier to follow.

I left a few comments.

piccolo/testing/random_builder.py Outdated Show resolved Hide resolved
piccolo/testing/random_builder.py Outdated Show resolved Hide resolved
Comment on lines 42 to 45
random_value_callable = partial(
cls.next_list,
mapper[t.cast(Array, column).base_column.value_type],
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if anything besides an Array could end up in this block.

It wonder if we can add list to mapper in a clean way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The handling of list is quite complex, involving two aspects: the simple callable, such as RandomBuilder.next_bool, and additional logic that requires knowledge of the column to manufacture another callable. Since list relies on obtaining the type from the mapper, we cannot determine the exact type until the last moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - arrays are always tricky edge cases.

Maybe we could add a check if isinstance(column, Array)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my latest comment regarding this issue.

piccolo/testing/random_builder.py Outdated Show resolved Hide resolved


class RandomBuilder:
@classmethod
def _build(cls, column: Column) -> t.Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about calling this something like get_value_for_column? Do you think it should be a private method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially considered treating it as a private method, but also wanted to find a way to provide a mechanism for registering a new type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice.

piccolo/testing/random_builder.py Outdated Show resolved Hide resolved
@jrycw
Copy link
Contributor Author

jrycw commented Mar 22, 2024

This is great - thanks 👍

ModelBuilder is something which started fairly simple, but as more and more edge cases were discovered it has become quite complex.

I agree that your solution is cleaner and easier to follow.

I left a few comments.

@dantownsend I completely agree with you; the task turned out to be more complex than I initially anticipated. Building the mapper required careful handling of many cases, such as using partial, dealing with the return from get, and modifying the current next-ish method with default values to becoming a callable. While this PR is functional at the moment, I acknowledge that it may become challenging to maintain in the future. I am currently exploring alternative ideas and hope to provide another version soon.

@dantownsend
Copy link
Member

@dantownsend I completely agree with you; the task turned out to be more complex than I initially anticipated. Building the mapper required careful handling of many cases, such as using partial, dealing with the return from get, and modifying the current next-ish method with default values to becoming a callable. While this PR is functional at the moment, I acknowledge that it may become challenging to maintain in the future. I am currently exploring alternative ideas and hope to provide another version soon.

I think there's a lot of potential in this approach. Feel free to play around with other ideas, but this is a nice start.

@jrycw
Copy link
Contributor Author

jrycw commented Mar 22, 2024

@dantownsend I completely agree with you; the task turned out to be more complex than I initially anticipated. Building the mapper required careful handling of many cases, such as using partial, dealing with the return from get, and modifying the current next-ish method with default values to becoming a callable. While this PR is functional at the moment, I acknowledge that it may become challenging to maintain in the future. I am currently exploring alternative ideas and hope to provide another version soon.

I think there's a lot of potential in this approach. Feel free to play around with other ideas, but this is a nice start.

I've just pushed another version. This version introduces a hook for users to register their own random type. I'll review your comments tomorrow as it's already midnight here in Asia.

@jrycw
Copy link
Contributor Author

jrycw commented Mar 23, 2024

The third version may be a bit easier to understand:

  • RandomBuilder now offers a public API called get_mapper, enabling users to access the default random mapper.
  • Forming the type-callable pairs into a single dictionary involves several steps, which are encapsulated in the public API ModelBuilder.get_registry:
    • Initially, we import information from RandomBuilder.get_mapper, a one-time operation.
    • Next, we address situations where callables may require information from the column in ModelBuilder._get_local_mapper.
    • We then incorporate user-registered types into ModelBuilder._get_other_mapper.
    • Finally, we integrate the logic for list into reg = { **default_mapper, **cls._get_local_mapper(column), **cls._get_other_mapper(column)}, returning reg as a MappingProxyType.
  • I made an effort to comprehend the relationship between list and Array, but I'm afraid I may have gotten a bit lost in the codebase. I'm uncertain about the extent to which isinstance(column, Array) can handle the edge case effectively.
  • Handling enum.Enum remains a challenge.
  • I made a minor adjustment in ModelBuilder._build, where defaults = defaults or { } seems clearer to me.
  • The naming of variables and functions could be improved; I welcome any advice you may have on this matter.
  • Providing a method for users to unregister their custom types might be handy:
    @classmethod
    def unregister_types(cls) -> None:
        cls.__OTHER_MAPPER.clear()

Comment on lines +206 to +209
precision, scale = column._meta.params.get("digits") or (4, 2)
local_mapper[Decimal] = partial(
RandomBuilder.next_decimal, precision, scale
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If RandomBuilder.next_decimal incorporates default values for precision and scale (like this PR), the logic can be simplified to:

        if precision_scale := column._meta.params.get("digits"):
            local_mapper[Decimal] = partial(
                RandomBuilder.next_decimal, *precision_scale
            )

cls.__DEFAULT_MAPPER[base_type]() for _ in range(length)
]
elif column._meta.choices:
reg = cls.get_registry(column)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can probably go into the else block, as we don't use it if the column has choices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid observation. It seems that the presence of reg is a result of the requirements in the previous version, where it was needed for multiple elif-else blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is a good idea.

    @classmethod
    def _randomize_attribute(cls, column: Column) -> t.Any:
        reg = cls.get_registry(column)
        random_value = reg.get(enum.Enum, reg[column.value_type])()
        if isinstance(column, (JSON, JSONB)):
            return json.dumps({"value": random_value})
        return random_value


    @classmethod
    def _get_local_mapper(cls, column: Column) -> t.Dict[t.Type, t.Callable]:
	...
        if _choices := column._meta.choices:
            local_mapper[enum.Enum] = partial(
                RandomBuilder.next_enum, _choices)

        return local_mapper

if column.value_type == list:
reg[list] = partial(
RandomBuilder.next_list,
reg[t.cast(Array, column).base_column.value_type],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to make things too insane, but multidimensional arrays are possible:

Array(Array(Integer())

I wonder what would happen here in that situation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing this situation to my attention. I was curious about the behavior in our current codebase, so I conducted a quick test. It seems that the current implementation will throw a KeyError for this situation (please correct me if I'm mistaken).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a bit off-topic, but I wanted to try out this behavior in the Piccolo playground, and it doesn't seem to be working. The code works fine if I just launch a terminal and enter the shell.

piccolo playground run --engine=sqlite3
In [1]: from piccolo.table import Table

In [2]: from piccolo.columns import Array, BigInt

In [3]: class MyTable(Table):
   ...:     my_column = Array(Array(BigInt()))
   ...:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 class MyTable(Table):
      2     my_column = Array(Array(BigInt()))

Cell In[3], line 2, in MyTable()
      1 class MyTable(Table):
----> 2     my_column = Array(Array(BigInt()))

NameError: name 'Array' is not defined

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrycw This happens because the Array and BigInt columns are not imported into the Playground application. If you patch the local Piccolo installation and add these two columns, everything works fine. Maybe should add all the possible columns to import and that would solve the problem..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinisaos Thank you very much for identifying the source of the issue and providing the solution. It's working now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that the current implementation of ModelBuilder doesn't handle multidimensional arrays, so I wouldn't worry about it if it's a tricky fix. Array columns now have two methods to help with this kind of thing: _get_dimensions and _get_inner_value_type.

piccolo/testing/model_builder.py Show resolved Hide resolved
Comment on lines +253 to +258
@classmethod
def register_type(cls, typ: t.Type, callable_: t.Callable) -> None:
cls.__OTHER_MAPPER[typ] = callable_

@classmethod
def unregister_type(cls, typ: t.Type) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being able to register custom types is cool, but I wonder about the main use cases.

You can specify defaults at the moment:

await ModeBuilder.build(MyTable, defaults={MyTable.some_column: "foo"})

I'm not against being able to override how types are handled, but we just need to articulate to users when it's appropriate vs using defaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing up this question, it prompted me to reflect on the code and its implications.

The main difference between using default and register_type lies in their respective purposes:

  • Utilizing default is suitable when users generally find our provided random logic satisfactory, but they require a hardcoded value for a specific column on a one-time basis.
  • On the other hand, employing register_type is appropriate when users desire a custom implementation for a type, effectively overwriting our default random logic for that specific type.
    Here are three distinct use cases:
    • (1) Types not provided by us: For instance, the type like next_decimal is not available in the current release version, but users can implement their own logic for the type and inject it into the ModelBuilder.
    • (2) User preference for specific logic: In scenarios where we introduce new features, such as the shiny next_decimal logic (returning decimal.Decimal if column.value_type is decimal.Decimal), users may prefer the previous implementation or have specific requirements. With register_type, they have the flexibility to override the default behavior.
    • (3) Unanticipated user cases: This aspect is particularly valuable for registration. For example, consider a user who initially builds a successful e-commerce platform in the UK using Piccolo. Later, they expand into Asia and encounter legal requirements necessitating the storage of customer names in local languages. If ModelBuilder does not support non-English characters, users can register their own implementations to address this issue.

A draft test for situations (1) and (2) might look like this:

class TableWithDecimal(Table):
    numeric = Numeric()
    numeric_with_digits = Numeric(digits=(4, 2))
    decimal = Decimal()
    decimal_with_digits = Decimal(digits=(4, 2))

class TestModelBuilder(unittest.TestCase):
    ...
    def test_registry_overwritten1(self):
        table = ModelBuilder.build_sync(TableWithDecimal)
        for key, value in table.to_dict().items():
            if key != "id":
                self.assertIsInstance(value, decimal.Decimal)

        def fake_next_decimal(column: Column) -> float:
            """will return `float` instead of `decimal.Decimal`"""
            precision, scale = column._meta.params["digits"] or (4, 2)
            return RandomBuilder.next_float(
                maximum=10 ** (precision - scale), scale=scale
            )

        ModelBuilder.register_type(decimal.Decimal, fake_next_decimal)
        overwritten_table = ModelBuilder.build_sync(TableWithDecimal)
        for key, value in overwritten_table.to_dict().items():
            if key != "id":
                self.assertIsInstance(value, float)

A draft test for situations (3) might look like this:

class TestModelBuilder(unittest.TestCase):
    ...
    def test_registry_overwritten2(self):
        choices = "一二三"  # Chinese characters

        def next_str(length: int = 3) -> str:
            # Chinese names often consist of three Chinese characters
            return "".join(random.choice(choices) for _ in range(length))

        ModelBuilder.register_type(str, next_str)
        manager1 = ModelBuilder.build_sync(Manager)
        self.assertTrue(all(char_ in choices for char_ in manager1.name))

        poster1 = ModelBuilder.build_sync(Poster)
        self.assertTrue(all(char_ in choices for char_ in poster1.content))

        ModelBuilder.unregister_type(str)
        manager2 = ModelBuilder.build_sync(Manager)
        self.assertTrue(all(char_ not in choices for char_ in manager2.name))

        poster2 = ModelBuilder.build_sync(Poster)
        self.assertTrue(all(char_ not in choices for char_ in poster2.content))

The scenario is as follows: Manager1 is a locally hired individual, while Manager2 is dispatched from the UK. Both are working on the poster using their native languages.

Finally, I realized I had overlooked the magic behavior of Python's name mangling rules. For instance:

>>> class ModelBuilder:
...     __OTHER_MAPPER = {}
...
>>> ModelBuilder.__OTHER_MAPPER
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'ModelBuilder' has no attribute '__OTHER_MAPPER'
>>> ModelBuilder._ModelBuilder__OTHER_MAPPER
{}

As a result, the previous test code might be a bit off. I need to use the following code for the setup and teardown phases for each test:

    def setUp(self) -> None:
        ModelBuilder._ModelBuilder__OTHER_MAPPER.clear()  # type: ignore

    def tearDown(self) -> None:
        ModelBuilder._ModelBuilder__OTHER_MAPPER.clear()  # type: ignore

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the rationale behind it - it makes sense.

If I was to completely redesign ModelBuilder, I probably wouldn't have class methods. Instead of:

await ModelBuilder.build(MyTable)

I would have:

await ModelBuilder(some_option=True).build(MyTable)

So we can configure ModelBuilder's behaviour easier. For registering types we could have:

custom_builder = ModelBuilder(types={...})

await custom_builder.build(MyTable)

We could allow the types to be passed in via the build method instead:

await ModelBuilder.build(MyTable, types={...})

If register and unregister work globally, there are pros and cons. The main pro is you only need to set it up once (e.g. in a session fixture of Pytest). But if you were to somehow run your tests in parallel, it might be problematic.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing your thoughts with me. Personally, I am inclined towards the instance method approach. However, implementing this change might break the current interface. I propose a three-stage transition plan:

  1. First stage: Utilize the concept of descriptors to distinguish between calls from the class or instance. Initially, we move the current implementation to the class branch to maintain user experience. Simultaneously, we start implementing the new concept in the instance branch, issuing an experimental warning.

  2. Second stage: If the new concept gains appreciation from users or developers, we add a deprecated warning to the class branch.

  3. Third stage: Remove the class branch and clean up the code to ensure all methods are instance methods by the end.

During the first two stages, we'll keep the class branch unchanged and encourage users to try out the new syntax and the new features. If we reach the third stage, users who prefer the class branch might need to adjust their habits from using await ModelBuilder.build(...) to await ModelBuilder().build() or ModelBuilder.build_sync(...) to ModelBuilder().build_sync(...).

The concept of descriptors is relatively straightforward, but it can sometimes feel too magical to grasp. I often need a refresher before coding if I haven't touched it for a long time. Fortunately, we don't need the complex __get__ and __set__ logic for the data descriptor. A simple non-data descriptor should suffice for our use case. With the help of this post, I've drafted a concept code as follows:

import asyncio
import inspect
import typing as t
from concurrent.futures import ThreadPoolExecutor


def run_sync(coroutine: t.Coroutine):
    try:
        # We try this first, as in most situations this will work.
        return asyncio.run(coroutine)
    except RuntimeError:
        # An event loop already exists.
        with ThreadPoolExecutor(max_workers=1) as executor:
            future = executor.submit(asyncio.run, coroutine)
            return future.result()


class dichotomy:
    def __init__(self, f):
        self.f = f

    def __get__(self, instance, owner):
        cls_or_inst = instance if instance is not None else owner
        if inspect.iscoroutine(self.f):
            async def newfunc(*args, **kwargs):
                return await self.f(cls_or_inst, *args, **kwargs)
        else:
            def newfunc(*args, **kwargs):
                return self.f(cls_or_inst, *args, **kwargs)
        return newfunc


class ModelBuilder:
    def __init__(self, *args, **kwargs):
        self._types = "..."  # Some information for instance method

    @dichotomy
    async def build(self_or_cls, *args, **kwargs):
        if inspect.isclass(self_or_cls):
            print("called as a class method from build")
            cls = self_or_cls
            await cls._build()
        else:
            print("called as an instance method from build")
            self = self_or_cls
            await self._build()

    @dichotomy
    def build_sync(self_or_cls, *args, **kwargs):
        return run_sync(self_or_cls.build())

    @dichotomy
    async def _build(self_or_cls, *args, **kwargs):
        if inspect.isclass(self_or_cls):
            print("called as a class method from _build", end="\n"*2)
            cls = self_or_cls  # noqa: F841
            # Current implementation remains here.
        else:
            print("called as an instance method from _build")
            self = self_or_cls
            # Some information can be retrieved.
            print(f'{self._types=}', end="\n"*2)
            # Our new logics


async def main():
    print('Async ModelBuilder.build: ')
    await ModelBuilder.build()

    print('Async ModelBuilder().build: ')
    await ModelBuilder().build()

    print('Sync ModelBuilder.build: ')
    ModelBuilder.build_sync()

    print('Sync ModelBuilder().build: ')
    ModelBuilder().build_sync()

if __name__ == '__main__':
    asyncio.run(main())
Async ModelBuilder.build: 
called as a class method from build
called as a class method from _build

Async ModelBuilder().build: 
called as an instance method from build
called as an instance method from _build
self._types='...'

Sync ModelBuilder.build: 
called as a class method from build
called as a class method from _build

Sync ModelBuilder().build: 
called as an instance method from build
called as an instance method from _build
self._types='...'

Finally, I agree that making register and unregister work globally could make it challenging to verify test results in parallel scenarios. I might lean towards using instance methods for the registering issue again.

These are just rough ideas that came to mind. I'm open to further discussions and refinements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using descriptors is an interesting idea. I've used them sparingly before - as you say, they're very powerful, but can be confusing.

There's a lot of really good ideas in this PR, and I don't want to bog things down. I wonder if we could add this in a subsequent PR.

Copy link
Contributor Author

@jrycw jrycw Mar 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly! Here are some options to consider for closing this PR:

  • Closing the PR without merging any changes.
  • Keeping the current code as is, while implementing the next_decimal functionality and updating related code.
  • Utilizing the latest commit of this PR while removing the option for users to register custom types.
  • Merging the PR with its latest commit.
  • Considering any other suggestions or alternatives.

I'm open to any of these choices. @dantownsend , what are your thoughts on this matter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrycw Sorry for the delay on this - I haven't forgotten about it, I'm just trying to finish off a couple of PRs.

I'm interested to know your thoughts on this:

#978

If you think it's a good idea or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dantownsend no worries at all. It's great to see the project progressing on various fronts. I'll make an effort to review it and share any opinions or feedback I may have.

piccolo/testing/random_builder.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants