Update code for better backward and forward compatibility #476

Draft
wants to merge 3 commits into
base: master

Conversation

leogama
Contributor

@leogama leogama commented May 16, 2022

Changes possible thanks to recent improvements in the code base.

@anivegesana
Contributor

anivegesana commented May 16, 2022

That is a good idea. Using FunctionType directly would make pickling faster for newer versions of dill. Unfortunately, pickles that have already been pickled in older versions of dill will stop working with this PR because they will expect the _create_function function to be present, so forward compatibility of existing pickles will be lost.

This is actually a good opportunity to see if the _shims.py module I made has good documentation. See if you can write a shim for the _create_function function that will fix this problem. Feel free to ask questions. It will improve documentation.

Actually, I was trying to remember why I didn't do this before because I would have done it if I had thought of it. I don't believe that keeping perfect compatibility is possible. Either pickles you write now will be unable to be opened in older versions of dill or pickles from older versions of dill cannot be opened now. Props to you if you get it to work in a way that supports both backward and forward compatibility, but this PR as it stands loses forward compatibility of old pickle files.
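To illustrate why removing a constructor breaks old pickles, here is a minimal sketch using only the standard library. The module `legacy_mod`, the function `_create_pair`, and the class `Pair` are hypothetical stand-ins, not dill code; the point is only that a pickle stores the constructor by module and name, so the name has to stay importable for old files to load.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a library module such as dill._dill.
legacy = types.ModuleType("legacy_mod")
sys.modules["legacy_mod"] = legacy


def _create_pair(a, b):
    """Constructor that old pickles refer to by module and name."""
    return (a, b)


# Make pickle record the constructor as legacy_mod._create_pair.
_create_pair.__module__ = "legacy_mod"
_create_pair.__qualname__ = "_create_pair"
legacy._create_pair = _create_pair


class Pair:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def __reduce__(self):
        # The pickle stores a reference to the constructor, not its code.
        return (legacy._create_pair, (self.a, self.b))


old_pickle = pickle.dumps(Pair(1, 2))

# Simulate a release that deletes the constructor.
del legacy._create_pair
try:
    pickle.loads(old_pickle)
    failed_without_shim = False
except (AttributeError, pickle.UnpicklingError):
    failed_without_shim = True

# Simulate a release that keeps the old name available as a shim.
legacy._create_pair = _create_pair
restored = pickle.loads(old_pickle)
```

Keeping the old name importable, even if new pickles no longer reference it, is what preserves the ability to load existing files.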

I am looking into a way to both fix the bug you pointed out (#466) and make the function pickling code more readable and simple. Unfortunately, I am blocked by my work and don't have time to look into it right now. You can pick it up if you are interested in fixing the pickling of functions (which you seem to be, considering this PR). I will be able to finish it up in 3 weeks anyway, so it doesn't really matter whether you pick it up or not.

@leogama
Contributor Author

leogama commented May 16, 2022

You are 100% right, my change added backward compatibility, but removed forward compatibility. I think I did it too late last night to figure...

But why couldn't we keep _create_function() and just not use it for pickling? And move it to _shims?


Talking about compatibility, I'm playing with a "portable mode" that could bootstrap _create_code() from its reduced code object and FunctionType. It would then be able to unpickle any _create_*() constructor functions necessary for unpickling the user objects in the file. Such a portable pickle file would be loadable even by stock pickle. (It's not working yet, and it's only for Python 3.8+.)

Also thinking of a decorator for the constructor functions to unpack positional args, ignoring any extras, as a future-proof feature. It would avoid compatibility breakage like what happened with _create_function().

@mmckerns mmckerns changed the title from "Update code for better backard and forward compatibility" to "Update code for better backward and forward compatibility" May 16, 2022
@anivegesana
Contributor

anivegesana commented May 16, 2022

Yep. I was looking into the same plan that you described in order to create a clear separation between the private functions in _dill and the _create_* functions which are, I guess, public due to their use in save functions. I think the plan you said (move all of the create functions to _shims and "@move_to" them back) would be the best option out of everything I thought of. Adding on to that, I think we should add them to __all__ so sometime, many years in the future, we can move them to the main dill package and shave off the ._dill in their names.

Actually, strangely enough, _create_code is probably the only function that you cannot bootstrap in that way. This is because Python routinely adds arguments to the code class, which would break the _create_code function. The only way for old pickles containing code from Python 3.9 to be opened in Python 3.10 is if the _create_code function inside the pickle is compatible with Python 3.10 to begin with, which is fine. That can work. The problem is we cannot predict the same thing for, say, Python 3.13 because it hasn't been written yet and existing pickles would break if they change the code object between now and then. Because we can't bootstrap _create_code, I abandoned trying to bootstrap dill, but on the condition that the Python version is less than or equal to the pickling version (the opposite of what most people would want from this functionality), it would be possible. I personally don't think this is feasible, but if you find another solution, go for it.

I would note that you would want a feature like this to be a separate setting since it would bloat otherwise small pickles containing just one function to contain all of the create functions, which I would guesstimate would add ~10kB to the size of the pickle. If someone had tens of thousands of pickles, this could balloon to include tens of thousands of copies which would add up to a majority of the space taken up by the pickles. This is not always a bad thing. Pickling torch models, most of the space is taken up by the tensors, not the pickle, although it would not be ideal in cases where you would have a collection of many small pickles, like mystic.

@anivegesana
Contributor

anivegesana commented May 16, 2022

I have an idea that would work, but this is at the bottom of my priorities. It is to create a new option that stores the source code of the function it is trying to save (or its AST) in the pickle and uses the compile function to rebuild it. This would increase the size of the pickle file, which is fine in this particular case since the pickles will be much larger anyway.

Line numbers are a little difficult. Using the string directly would save space, but I don't know how to change the line numbers during compilation. Saving the AST instead balloons the size way too much, requiring hundreds of bytes for a single AST node, which is too inefficient for any practical purpose. I hope that gives you a good starting point if you care to pursue this further. The best solution I could think of was prepending newlines to the string until the line number is met. Inelegant, but it works if you want to try it.

The problem that I didn't quite figure out is how to compile inner functions, which are compiled differently than outer functions.
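The newline-padding workaround mentioned above can be sketched as follows. The source string, the target line number, and the filename label are made up for illustration; this handles only a top-level function, not the inner-function case.

```python
# Source text of a function as it might be stored in a pickle (illustrative).
source = "def f(x):\n    return x + 1\n"
original_first_line = 40  # line where the function started in its original file

# Prepend blank lines so the compiled code reports the original line numbers.
padded = "\n" * (original_first_line - 1) + source
code = compile(padded, "<pickled source>", "exec")

namespace = {}
exec(code, namespace)
f = namespace["f"]
first_line = f.__code__.co_firstlineno
```

The padding costs one byte per skipped line, which is small next to a serialized AST.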

@leogama
Contributor Author

leogama commented May 16, 2022

Yep. I was looking into the same plan that you described in order to create a clear separation between the private functions in _dill and the _create_* functions which are, I guess, public due to their use in save functions.

Yes, I also think that even if constructor functions are internal in a sense, they should be considered public, not private. They should be named without a leading underscore and exported to the root module dill. A quote from Python docs about variable and method naming in classes:

“Private” instance variables that cannot be accessed except from inside an object don’t exist in Python. However, there is a convention that is followed by most Python code: a name prefixed with an underscore (e.g. _spam) should be treated as a non-public part of the API (whether it is a function, a method or a data member). It should be considered an implementation detail and subject to change without notice.

These functions are not exactly an implementation detail, and changes to this internal API should be rare and incremental. The functions could be renamed with the underscore stripped out, but an alias with the old name should be kept in dill._dill for compatibility.

I think the plan you said (move all of the create functions to _shims and "@move_to" them back) would be the best option out of everything I thought of. Adding on to that, I think we should add them to __all__ so sometime, many years in the future, we can move them to the main dill package and shave off the ._dill in their names.

In reality, my suggestion is to move just _create_function() to _shims as it becomes exclusively a compatibility feature. I believe that the _create_*() functions should stay in the same file as the corresponding save_*() functions, because their logic is intrinsically related. Also, there's the problem of function namespace if these functions were in a separate module but referred to variables and functions in _dill.

Note: If you are thinking of reducing the size and complexity of the main submodule _dill, it'll likely get a lot better with cleaning the logic for unsupported Python versions after the next release and moving code related to session saving into a new submodule (these are almost completely independent and just need a couple of private functions from _dill).

Actually, strangely enough, _create_code is probably the only function that you cannot bootstrap in that way. This is because Python routinely adds arguments to the code class, which would break the _create_code function. The only way for old pickles containing code from Python 3.9 to be opened in Python 3.10 is if the _create_code function inside the pickle is compatible with Python 3.10 to begin with, which is fine. That can work.

There are a few strategies that I'm thinking of:

  • Probably the easiest to implement, the first is to leverage the code.replace() method added in Python 3.8. It uses an existing code object to create a new one by substituting one or more of its members. This method uses keyword-only arguments instead of positional-only. What I'll try is to apply it to some trivial function in the Standard Library (it can't be a built-in function) to create a first bootstrapped _create_code() with just the essential components and then use this to create a second, complete version using CodeType.

  • The second idea is to pickle all the argument tuple variants in the bootstrap header, in a dictionary —which shouldn't use much space with memoization, and retrieve the correct one with sys.version_info to pass to CodeType directly.

The problem is we cannot predict the same thing for, say, Python 3.13 because it hasn't been written yet and existing pickles would break if they change the code object between now and then. Because we can't bootstrap _create_code, I abandoned trying to bootstrap dill, but on the condition that the Python version is less than or equal to the pickling version (the opposite of what most people would want from this functionality), it would be possible. I personally don't think this is feasible, but if you find another solution, go for it.

  • This is really hard to tackle, but if the version of the Python doing the unpickle is greater than the version that the bootstrap _create_code() supports, it could try to use inspection to figure out the correct signature for CodeType at runtime. It should work if future Python versions just add new arguments, but if they were completely changed there's not much to do. The problem with this approach is that signature inspection for built-in functions depends on function annotations, and these can be missing depending on optimization options. Therefore, this strategy can't be used in general, but would serve as a decent fallback.
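As a small illustration of the first strategy, the sketch below uses code.replace() (Python 3.8+) to derive a new code object from an existing one with keyword arguments. The function `add` is a throwaway example, and only fields that exist in every version with replace() are touched; this is not the bootstrap itself, just the mechanism it would rely on.

```python
from types import FunctionType


def add(a, b):
    return a + b


# replace() takes keyword arguments named after the code object's attributes,
# so the call does not depend on the positional layout of CodeType.
renamed_code = add.__code__.replace(co_name="add_copy", co_firstlineno=1)

# Build a function around the derived code object.
add_copy = FunctionType(renamed_code, globals(), renamed_code.co_name)
```

Fields that a given Python version does not know about would still be rejected, so a real bootstrap would have to restrict itself to fields valid on the running interpreter.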

I would note that you would want a feature like this to be a separate setting since it would bloat otherwise small pickles containing just one function to contain all of the create functions, which I would guesstimate would add ~10kB to the size of the pickle. If someone had tens of thousands of pickles, this could balloon to include tens of thousands of copies which would add up to a majority of the space taken up by the pickles. This is not always a bad thing. Pickling torch models, most of the space is taken up by the tensors, not the pickle, although it would not be ideal in cases where you would have a collection of many small pickles, like mystic.

Of course, this should be optional. I'm thinking of a "portable" setting/option. Plus, at least for simple objects, it should be possible to include just the necessary constructors other than _create_code(). I don't know how hard it would be to infer these though. Maybe the only way is to do a "dry run" pickling with no-op or in-memory writes and either, if it writes more than a threshold size, abort and then just include all the constructors or, if it finishes, check which of the save_*() functions were called.
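The "dry run" idea could be sketched with the standard pickle module as below. This is an assumption-laden illustration, not dill code: it records the types visited through the persistent_id hook rather than tracking save_*() calls, and the names `DryRunTooLarge`, `_CountingSink`, `_RecordingPickler`, and `types_seen_in_dry_run` are hypothetical.

```python
import pickle


class DryRunTooLarge(Exception):
    """Raised when the dry run writes more than the allowed number of bytes."""


class _CountingSink:
    """File-like object that discards data and enforces a size threshold."""

    def __init__(self, limit):
        self.limit = limit
        self.written = 0

    def write(self, data):
        self.written += len(data)
        if self.written > self.limit:
            raise DryRunTooLarge(self.written)
        return len(data)


class _RecordingPickler(pickle.Pickler):
    """Pickler that remembers the type of every object it visits."""

    def __init__(self, file, protocol=None):
        super().__init__(file, protocol)
        self.seen_types = set()

    def persistent_id(self, obj):
        self.seen_types.add(type(obj))
        return None  # None means: pickle the object normally


def types_seen_in_dry_run(obj, limit):
    """Return the set of visited types, or None if the dry run was aborted."""
    pickler = _RecordingPickler(_CountingSink(limit))
    try:
        pickler.dump(obj)
    except DryRunTooLarge:
        return None
    return pickler.seen_types


small = types_seen_in_dry_run({"a": [1, 2.0]}, limit=10**6)
aborted = types_seen_in_dry_run(list(range(100000)), limit=100)
```

A None result would correspond to the "just include all the constructors" branch; otherwise the recorded types could be mapped to the constructors that are actually needed.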

@anivegesana
Contributor

anivegesana commented May 17, 2022

These functions are not exactly an implementation detail, and changes to this internal API should be rare and incremental. The functions could be renamed with the underscore stripped out, but an alias with the old name should be kept in dill._dill for compatibility.

RTFD hides names beginning with _ and I don't know if the name change will pollute the documentation. It for sure will pollute tab completion in the REPL with functions that are not meant to be used directly by the user, so I am still in favor of the names starting with _, but just clear documentation, separation, or just anything at all to distinguish them from the other functions in _dill.

In reality, my suggestion is to move just _create_function() to _shims as it becomes exclusively a compatibility feature. I believe that the _create_*() functions should stay in the same file as the corresponding save_*() functions, because their logic is intrinsically related. Also, there's the problem of function namespace if these functions were in a separate module but referred to variables and functions in _dill.

Note: If you are thinking of reducing the size and complexity of the main submodule _dill, it'll likely get a lot better with cleaning the logic for unsupported Python versions after the next release and moving code related to session saving into a new submodule (these are almost completely independent and just need a couple of private functions from _dill).

👍

There are a few strategies that I'm thinking of:

  • Probably the easiest to implement, the first is to leverage the code.replace() method added in Python 3.8. It uses an existing code object to create a new one by substituting one or more of its members. This method uses keyword-only arguments instead of positional-only. What I'll try is to apply it to some trivial function in the Standard Library (it can't be a built-in function) to create a first bootstrapped _create_code() with just the essential components and then use this to create a second, complete version using CodeType.

Yep. That would be preferable and should probably be added. The problem is that it could not be called _create_code() anymore since that name already exists. Fortunately, all you need is a prototype object to generate any other code object from and don't need to create a new _create_* function to do this. You could just get a global variable (dill._dill._prototype_code if it exists and dill.copy.__code__ if it doesn't for example) and run .replace on it to create the new code object. That is not currently possible because Python 3.7 is still in the security fix phase. I was actually going to open a PR to do this in a year when Python 3.7 is fully deprecated. Good observation!

  • The second idea is to pickle all the argument tuple variants in the bootstrap header, in a dictionary —which shouldn't use much space with memoization, and retrieve the correct one with sys.version_info to pass to CodeType directly.

I did not think about this before, but I agree that saving the version info would be a useful thing to do whenever we have a code object present.
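A minimal sketch of the version-keyed lookup might look like this. The table lists only the Python 3.7 and 3.8 positional layouts of CodeType as field names (3.8 added posonlyargcount); later versions are deliberately left out, and the names `ARG_LAYOUTS` and `pick_layout` are hypothetical. Check the layouts against the CPython documentation before relying on them.

```python
import sys

# Positional argument layouts of CodeType, keyed by (major, minor).
# Only two versions are listed here, purely for illustration.
ARG_LAYOUTS = {
    (3, 7): (
        "argcount", "kwonlyargcount", "nlocals", "stacksize", "flags",
        "codestring", "constants", "names", "varnames", "filename",
        "name", "firstlineno", "lnotab", "freevars", "cellvars",
    ),
    (3, 8): (
        "argcount", "posonlyargcount", "kwonlyargcount", "nlocals",
        "stacksize", "flags", "codestring", "constants", "names",
        "varnames", "filename", "name", "firstlineno", "lnotab",
        "freevars", "cellvars",
    ),
}


def pick_layout(layouts, version=None):
    """Return the newest layout that is not newer than the given version."""
    if version is None:
        version = sys.version_info[:2]
    candidates = [v for v in layouts if v <= version]
    if not candidates:
        raise LookupError("no known layout for Python %d.%d" % tuple(version))
    return layouts[max(candidates)]


layout_37 = pick_layout(ARG_LAYOUTS, (3, 7))
layout_39 = pick_layout(ARG_LAYOUTS, (3, 9))
```

Falling back to the newest known layout for an unlisted version is itself an assumption; it is only correct if that version left the constructor unchanged.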

  • This is really hard to tackle, but if the version of the Python doing the unpickle is greater than the version that the bootstrap _create_code() supports, it could try to use inspection to figure out the correct signature for CodeType at runtime. It should work if future Python versions just add new arguments, but if they were completely changed there's not much to do. The problem with this approach is that signature inspection for built-in functions depends on function annotations, and these can be missing depending on optimization options. Therefore, this strategy can't be used in general, but would serve as a decent fallback.

Unfortunately, the names in the signature don't 100% line up with the attribute names, as frustrating as it is.

Of course, this should be optional. I'm thinking of a "portable" setting/option. Plus, at least for simple objects, it should be possible to include just the necessary constructors other than _create_code(). I don't know how hard it would be to infer these though. Maybe the only way is to do a "dry run" pickling with no-op or in-memory writes and either, if it writes more than a threshold size, abort and then just include all the constructors or, if it finishes, check which of the save_*() functions were called.

👍

I like your observations and insights.

The problem with getting rid of _create_function was that you need to get the builtins dictionary at runtime, not during pickling. This is because new versions of Python can add functions to the standard library. Thankfully, I think with the new additions in the _shims.py file, it is actually possible to do that now. Thank you for looking into this. It is good to have a fresh set of eyes on things. I wouldn't have noticed that it was now possible to do what you described.

dill/_dill.py Outdated
@@ -1905,6 +1889,7 @@ def save_function(pickler, obj):
# recurse to get all globals referred to by obj
from .detect import globalvars
globs_copy = globalvars(obj, recurse=True, builtin=True)
globs_copy.setdefault('__builtins__', globals()['__builtins__'])
Contributor


Actually, ...

@mmckerns Why does this copy the builtins from the globals of _dill? Was there a reason that you didn't just use the __builtin__ module directly? It should be the same and would allow for us to move the code into save_function like @leogama describes.

Member


I'm not sure what you are asking me... I didn't write the line you are referring to, @leogama did.

Contributor Author


@mmckerns, This is just a variation of

dill/dill/_dill.py

Lines 770 to 771 in a16219e

if "__builtins__" not in func.__globals__:
    func.__globals__["__builtins__"] = globals()["__builtins__"]

But I didn't understand the question either. By default, globals()['__builtins__'] in a module is "an alias for the dictionary of the builtins module itself". It could be substituted by the __builtin__ alias in _dill though.

Member


Ah, ok. So, unless I'm forgetting something... no reason. I believe __builtin__ could have been used here without issue, as that should point to the same module object as the current code.
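For reference, a small sketch of resolving the builtins namespace at load time rather than copying it at pickling time. The helper name `builtins_dict` is hypothetical. It normalizes the two forms `__builtins__` can take in CPython: the builtins module itself or that module's dictionary.

```python
import builtins


def builtins_dict(namespace):
    """Return the builtins dictionary that a globals namespace would use."""
    found = namespace.get("__builtins__", builtins)
    # In CPython, __builtins__ may be the module or the module's __dict__.
    return found if isinstance(found, dict) else vars(found)


def ensure_builtins(namespace):
    """Give a function's globals access to the current interpreter's builtins."""
    namespace.setdefault("__builtins__", builtins)
    return namespace


restored_globals = ensure_builtins({})
```

Looking the module up when unpickling means the restored function sees whatever builtins the loading interpreter provides, including names added after the pickle was written.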

@leogama
Contributor Author

leogama commented May 17, 2022

That is not currently possible because Python 3.7 is still in the security fix phase. I was actually going to open a PR to do this in a year when Python 3.7 is fully deprecated.

Even though Python 3.7 will be supported for some time, this (maybe) more reliable method could be used for 3.8+ only, falling back to the second one I described for Python <3.8. The overhead should be minimal.

@leogama
Contributor Author

leogama commented May 17, 2022

I've tried the first commit with the old dill 0.3.0 and noticed that it (actually, the save_function() updates) won't fix #341 because _create_code() was introduced in dill 0.3.2 and it completely broke compatibility for pickled functions. But it was inevitable due to the ever increasing list of arguments to CodeType. Just for the record...

@leogama
Contributor Author

leogama commented May 17, 2022

What do you think of this: for every new save_<class>() function that doesn't have a corresponding _create_<class>() constructor function, define an alias* _create_<class> = <Class>Type and pickle that instead of the real constructor? This way, if one day a custom constructor for <class> is added, it can be made to mimic the <Class>Type signature, possibly with extra arguments, and keep backward compatibility.

(*) Actually, it wouldn't work because pickle would save it as obj.__module__ and obj.__name__. It needs to be a wrapper function. A fictitious example with CodeType:

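from functools import wraps
from types import CodeType

# A wrapper, not a plain alias, so pickle records this function's own module and name.
@wraps(CodeType, assigned=('__doc__', '__annotations__'), updated=())
def _create_code(*args):
    return CodeType(*args)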
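The reason a plain alias does not work can be checked with stock pickle. In this sketch `_create_ordered` is a hypothetical alias and OrderedDict merely stands in for a `<Class>Type`; the pickle records the class's own module and name, and the alias name never reaches the file.

```python
import pickle
from collections import OrderedDict

# Hypothetical alias, analogous to _create_<class> = <Class>Type.
_create_ordered = OrderedDict

payload = pickle.dumps(_create_ordered)

# The pickle refers to collections.OrderedDict, not to the alias.
uses_real_name = b"OrderedDict" in payload
uses_alias_name = b"_create_ordered" in payload
loaded = pickle.loads(payload)
```

Because the alias leaves no trace in the pickle, a later custom constructor under that name would never be called for old files, which is why a wrapper function is needed.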

@anivegesana
Contributor

I've tried the first commit with the old dill 0.3.0 and noticed that it (actually, the save_function() updates) won't fix #341 because _create_code() was introduced in dill 0.3.2 and it completely broke compatibility for pickled functions. But it was inevitable due to the ever increasing list of arguments to CodeType. Just for the record...

Yeah. I think Mike was forced into a brick wall on that one. There was no way to maintain compatibility, so he had to break it. Now that the functionality is working, even though it may not be elegant or sustainable, compatibility should not be broken because people will notice. You can look through the issues and see that there were a handful that came up because of incompatible changes like that one, and that is to be avoided.

@anivegesana
Contributor

anivegesana commented May 17, 2022

What do you think of this: for every new save_<class>() function that doesn't have a corresponding _create_<class>() constructor function, define an alias* _create_<class> = <Class>Type and pickle that instead of the real constructor? This way, if one day a custom constructor for <class> is added, it can be made to mimic the <Class>Type signature, possibly with extra arguments, and keep backward compatibility.

(*) Actually, it wouldn't work because pickle would save it as obj.__module__ and obj.__name__. It needs to be a wrapper function. A fictitious example with CodeType:

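from functools import wraps
from types import CodeType

# A wrapper, not a plain alias, so pickle records this function's own module and name.
@wraps(CodeType, assigned=('__doc__', '__annotations__'), updated=())
def _create_code(*args):
    return CodeType(*args)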

Actually, a mechanism similar to what you are intending is outlined in _shims.py. The reason I didn't jump on it yet is that very extensive changes to the dill library would have to be made, and currently there is no framework to test the compatibility of pickles with other versions of dill or Python. I would say that this framework needs to be created first before any progress can be made on your suggestions. I would say that the perfect time to work on it is when dill 0.3.5 is out and Python 2 cruft is removed from the repo and extensive changes are being made anyway. Just my two cents.

Also, the sort of breaking change presented by _create_code is why I would recommend creating a shim function called _function_constructor or something like that that will point to FunctionType for now, but can be changed if there is an incompatible change to Python that changes the constructor arguments.
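One way to read this suggestion is sketched below. The name `_function_constructor` is taken from the comment above, but its signature and body are assumptions; the real shim machinery in _shims.py is not shown here. The idea is that pickles reference a stable name that currently forwards to FunctionType and can be given different behavior later.

```python
from types import FunctionType


def _function_constructor(code, fglobals, name=None, argdefs=None, closure=None):
    """Stable entry point that currently just forwards to FunctionType.

    If a future Python changes the FunctionType arguments, only this body
    would need to change; pickles referencing this name would keep loading.
    """
    return FunctionType(code, fglobals, name, argdefs, closure)


def add(a, b=10):
    return a + b


# Rebuild the function the way an unpickler would, from its parts.
rebuilt = _function_constructor(
    add.__code__, {}, "add", add.__defaults__, add.__closure__
)
```

Compared with pickling FunctionType directly, this costs one extra call per function at load time in exchange for a place to absorb constructor changes.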


3 participants