Future ABI compatibility #34

Closed
ryanolson opened this issue Mar 20, 2019 · 55 comments
@ryanolson

DLPack’s vision of a common in-memory tensor format that spans device and memory types is fantastic.

However, in its current form, there is no upgrade path for adding new fields to either the DLTensor or DLManagedTensor structs in a way that would maintain ABI compatibility.

I would like to propose the addition of two components to the DLTensor struct. This will break current ABI compatibility, but will future-proof the design in the long run.

Additions to DLTensor

  1. uint8_t/uint16_t version
  2. uint64_t future_bytes

Adding the version allows the receiving library to determine whether the DLTensor can be consumed. The receiver may not have a matching version, but as long as it knows the version it can decide whether the data can be used correctly.

Adding future_bytes allows new options to be added to DLTensor. One of these might be data layout, i.e. row-major or column-major (C vs. Fortran order). I will open a separate issue for this feature.
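For illustration, a minimal ctypes sketch of what such a version-prefixed layout could look like (the field widths and names here are hypothetical, not part of any released dlpack.h):

import ctypes

# Hypothetical sketch of the proposed additions: a version field first, plus
# a future_bytes count so an older receiver can skip trailing fields it does
# not understand. Nothing here is part of dlpack.h.
class VersionedHeader(ctypes.Structure):
    _fields_ = [
        ("version", ctypes.c_uint16),       # proposed addition 1
        ("future_bytes", ctypes.c_uint64),  # proposed addition 2
    ]

hdr = VersionedHeader(version=1, future_bytes=0)
# A receiver checks the version before interpreting anything else.
assert hdr.version == 1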

@junrushao
Member

I feel that we could write additional information in managed_ctx. For example, when borrowing NumPy ndarrays, we could save a pointer to a Python object inside the managed_ctx. As long as the two frameworks have a protocol for how to interpret the managed_ctx, it will be fine.

@rgommers
Contributor

Has this been discussed in more detail elsewhere? The lack of a version attribute inside one of the structs is indeed odd; it came up in the conversation around NumPy adopting DLPack. There's #define DLPACK_VERSION 050 in dlpack.h, but that seems to serve only as documentation, since it's not accessible to a library receiving a DLPack capsule.

@tqchen
Member

tqchen commented May 19, 2021

I agree. Given that we have stayed ABI-compatible so far, this hasn't been an issue yet.

How about we allow attaching version information to the capsule for now? That way we do not need to update the struct while still allowing libraries to check the version.

@rgommers
Contributor

That seems good, because breaking the ABI for it would be painful. Do you mean in the name string of the PyCapsule, or passing it as a second value separately, like:

# return version as second argument, `(major, minor, micro)`
def __dlpack__(self, stream=None) -> Tuple[PyCapsule, Tuple[int, int, int]]:

@tqchen
Member

tqchen commented May 19, 2021

We can directly attach it as a version attribute of the PyCapsule.

@hameerabbasi

hameerabbasi commented May 24, 2021

It seems PyCapsules cannot accept additional attributes:

>>> handle.version = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PyCapsule' object has no attribute 'version'

This happens because PyCapsules don't have a __dict__.

@tqchen
Member

tqchen commented May 24, 2021

I see. For backward-compatibility reasons with the previously defined protocol, perhaps dlpack_version is a better way to handle this.

@rgommers
Contributor

I'd prefer to avoid a third attribute at the Python level if possible. The two options I suggested still seem feasible:

  • a second return value from __dlpack__
  • Using PyCapsule_SetName and _GetName with some convention like "name must be DLPACK_VERSION=x.y.z".

The first one may be easier to implement (parsing version strings in C is a pain; a naive implementation typically breaks once you hit 0.10.0), the second is a bit cleaner from the API perspective.

@hameerabbasi

  • a second return value from __dlpack__

I'd prefer this one if at all possible. Also, we should have an ABI version along with a library version.

@rgommers
Contributor

rgommers commented May 24, 2021

Also, we should have an ABI version along with a library version.

That's actually the only thing we really care about, right?

@hameerabbasi

That's actually the only thing we really care about, right?

I'd think so, but currently the version is 050, which I assume means 0.5.0. We can also use the major version to indicate an ABI break.

@rgommers
Contributor

I'd prefer to avoid a third attribute at the Python level if possible.

One advantage of doing this, though, is that it's more flexible - it can be checked before actually calling __dlpack__. Especially when streams are involved, it's possible that a consumer calls __dlpack__, the producer inserts a "wait for event" on the stream, then returns the capsule + version, and only then does the consumer raise.

Another small issue is that TVM (and perhaps some other library?) has already implemented support.

So I'm coming around to maybe using __dlpack_version__, which could then include both an API and an ABI version.

Would be nice to get some more opinions. @leofang, @emcastillo, @kkraus14?

@kkraus14

Why are we planning on handling this at the Python level? DLPack is a C header and is usable from C libraries as well. If we solve the problem for C libraries it kinda naturally solves the problem for Python as well.

Can we please avoid Python only solutions?

@rgommers
Contributor

The concern is breaking the ABI; you can't just add a field. Can you think of a way to do it in C in a reasonable manner, @kkraus14?

@seberg
Contributor

seberg commented May 24, 2021

One thing I do not quite like about __dlpack_version__ is that it is imaginable that it is not constant. I.e. an exporter should probably export the lowest compatible version, even if it supports a higher one.
The use case is: some additional information is added to the struct, but it is only necessary for GPU and not CPU. NumPy might not support the newer version, so the CPU export would use a lower version than the GPU one.

But I admit it should be constant for a single tensor/array (although it is not unimaginable that it isn't).

Putting it in the DLManagedTensor seems nicest... But yeah, the struct can only be extended if we are OK with crashing when the consumer needs the new field but the producer is too old. On the up-side, it might be years until most consumers even have a reason to read it, and within the Python ecosystem we could probably standardize quickly on a recent enough DLPack.

Otherwise, maybe a new __dlpack_info__ struct could work? It could contain the version, device, and any other information we may need in the future (replacing __dlpack_device__), and have a C counterpart in the DLPack header.

In a sense that would add __dlpack_info__ as an additional "exchange information" channel next to DLManagedTensor. We could then consider having __dlpack__ return a tuple (DLManagedTensor, __dlpack_info__), where __dlpack_info__ can be queried on its own for stream support.
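A rough Python sketch of that idea (the shape and field names are hypothetical, nothing here is standardized):

from collections import namedtuple

# Hypothetical shape for the proposed __dlpack_info__: the version plus the
# device info that __dlpack_device__ returns today. device_type=1 mirrors
# kDLCPU from dlpack.h.
DLPackInfo = namedtuple("DLPackInfo", ["version", "device_type", "device_id"])

class Producer:
    def __dlpack_info__(self):
        return DLPackInfo(version=(0, 6), device_type=1, device_id=0)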

@kkraus14

The concern is breaking the ABI; you can't just add a field. Can you think of a way to do it in C in a reasonable manner, @kkraus14?

It's a 0.x project so we could just break the ABI 😄. Otherwise we'd need to do something similar to what we did with the stream on the Python side and pass it out of band.

@leofang
Contributor

leofang commented May 25, 2021

  • Using PyCapsule_SetName and _GetName with some convention like "name must be DLPACK_VERSION=x.y.z".

@rgommers This is not OK. I think we forgot to specify in the Array API standard that a live capsule must have the name "dltensor", which has in fact been part of the DLPack protocol since day one. (We did say a dead one is named "used_dltensor".) Capsules must be looked up by name, so we cannot have the name vary across libraries. I can push a PR to array-api for clarification.

Why are we planning on handling this at the Python level? DLPack is a C header and is usable from C libraries as well. If we solve the problem for C libraries it kinda naturally solves the problem for Python as well.

This makes sense to me, although let's not forget that the current exchange protocol is fully specified only in Python (via __dlpack__, __dlpack_device__, and from_dlpack()).

I feel it's about time for us to revisit @kkraus14's request #65 --- maybe it is sufficient to "backport" all the Python components to C? Then we can create a new C function, say get_dlpack_info(), that backs @seberg's __dlpack_info__ idea and returns the version. Adding new functions does not break ABI, AFAIK.

A separate question is whether returning DLPACK_VERSION alone is sufficient. @rgommers You mentioned __dlpack_version__ should return both API and ABI versions. Is a single version not sufficient?

@hameerabbasi

Is a single version not sufficient?

I believe we can just put ABI breaks in a major version if needed, but IMHO it's best to have a separate version for feature support and ABI.

@rgommers
Contributor

A separate question is whether returning DLPACK_VERSION alone is sufficient. @rgommers You mentioned __dlpack_version__ should return both API and ABI versions. Is a single version not sufficient?

I think both matter. For example, when DLPack was upgraded in PyTorch from 0.2 to 0.4, the question was whether that was ABI-compatible. Without a separate ABI version, the only way to know would be to look through the DLPack commit history, or ask someone. It gets worse when you receive a version from another library that is higher than what you implement in your own library - should you raise an exception or not? There's no way to know whether that newer version is ABI-compatible. Unless you indeed require that any ABI break increases the major version. But there are other reasons to increase the major version number, so a separate ABI version seems nicer.

@rgommers
Contributor

I feel it's about time for us to revisit @kkraus14's request #65 --- maybe it is sufficient to "backport" all the Python components to C? Then we can create a new C function, say get_dlpack_info(), that backs @seberg's __dlpack_info__ idea and returns the version. Adding new functions does not break ABI, AFAIK.

This makes sense I think - although coupling version info and stream support isn't necessary; those can be done independently.

Right now we don't have a pressing need for other new fields; adding version info is "future proofing". Making breaking changes just to future-proof things seems a bit gratuitous. So a new function sounds good.

It did bring up another question for me. Are there examples of libraries adding dlpack.h to their own public C/C++ API? I'm curious how they do it (e.g. with an unchanged name, or with a prefix). Multiple libraries all exporting the same symbol is not great. So far I've only looked at Python libraries, and they don't seem to offer a C/C++ API.

@tirthasheshpatel
Contributor

tirthasheshpatel commented Feb 28, 2022

I think it would be great to add version info to DLPack. There have been some requests for ABI-breaking changes (data-apis/array-api#191, #97). Unless versioning is added, each ABI-breaking change will require everyone to update in lockstep.

we can create a new C function, say get_dlpack_info(), that backs @seberg's __dlpack_info__ idea and returns the version.

Wouldn't this cause symbol conflicts (since get_dlpack_info() will need to be exported somehow), as @rgommers said? I don't see a way around an ABI break. We can add __dlpack_info__ at the Python level but, again, ABI-breaking changes will still break C/C++ libraries that have integrated support for DLPack. Any suggestions on how versioning can be achieved without an ABI break?

@seberg
Contributor

seberg commented Feb 28, 2022

I will bring up an alternative once more: add a "request flags" API. That is, you add a new kwarg to __dlpack__, e.g. requests=uint64 (or similar).
The consumer can use it to "request" things from the exporter. You could separate the flags into two parts, required flags and requested flags:

  • Some flags are "required" (either the high 32 bits, or you pass two values). If the exporter doesn't know what a required flag means, it must raise a BufferError.
  • Most flags could probably be "requests", in the sense that the exporter can just ignore them.

As soon as you really want the exporter to pass out additional information, you would add that first "required" flag. We would assume that 32 flags is plenty for the foreseeable future; most likely there will be very few flags, but nobody will have to worry about how to add something.

The transition will require a try/except by the consumer (if they want to request anything).
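A consumer-side sketch of that transition (the flag names and values are made up for illustration):

# Hypothetical flag values, for illustration only.
REQUEST_READONLY_INFO = 1 << 0   # low bits: requests the exporter may ignore
REQUIRE_VERSION_FIELD = 1 << 32  # high bits: must be understood, else BufferError

def import_with_requests(producer):
    try:
        # A new-enough producer understands the kwarg; per the scheme above
        # it must raise BufferError only for unknown *required* flags.
        return producer.__dlpack__(requests=REQUEST_READONLY_INFO)
    except TypeError:
        # The producer predates the requests kwarg: fall back to the old call.
        return producer.__dlpack__()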

The above is of course only for the Python API. I think it makes sense for a C API/ABI too, but I don't think the C side has a defined exchange protocol anyway.

@tirthasheshpatel
Contributor

tirthasheshpatel commented Mar 10, 2022

I think there are three options that could future-proof DLPack against ABI breaks/changes at the Python level:

Option 1

Return a 3-tuple from __dlpack__. The third return value is the ABI version.

Option 2

Introduce a __dlpack_info__ dunder method that returns the API and ABI versions.

One problem with this: in the future, when we break the ABI, say by adding the readonly flag, previous versions of Python libraries that support from_dlpack would start crashing, since the ABI has changed and there is no way for them to know __dlpack_info__ exists. To avoid that, along with __dlpack_info__, we will need to add some version info to the exported capsule so that imports immediately fail due to an invalid capsule name.

One solution is to add the ABI version to the capsule name, e.g. "dltensor_v0.1". Since an ABI change breaks both forward and backward compatibility, the importer immediately errors out if it doesn't get a capsule with the same ABI version, i.e. any change in ABI version means the array can't be imported. This causes errors in Python at a much higher level, which is much better than crashing.
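For illustration, the importer-side name check could be done from Python through the CPython capsule API (a sketch; the versioned capsule name is the hypothetical convention above):

import ctypes

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

def abi_matches(capsule, supported_name=b"dltensor_v0.1"):
    # True only if the producer exported the ABI version this consumer
    # understands; any other name fails loudly here instead of crashing
    # later when the struct is reinterpreted with the wrong layout.
    return bool(PyCapsule_IsValid(capsule, supported_name))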

Option 3

Implement the requests API as @seberg suggested. I think this is more work than the above two options.


I vote for the first option since it is the easiest to do. For the C API, adding a function that returns an API/ABI version and specifying it in the C spec seems like the best option, since DLPack previously had no way to guard C applications against ABI breaks. Maybe we can proceed with #72 to address this problem at the C level.

@tqchen @seberg @leofang @rgommers @mattip Please let me know which option sounds good to you. We can update the spec with whichever option has consensus.

@seberg
Contributor

seberg commented Mar 21, 2022

I would really like to nudge towards doing the annoying thing now, i.e. introduce a new C struct (with version and flag space at the beginning, as the initial poster also suggested), and then rename the capsule for safety.
(You could do more breaks now, but even without anything else this seems perfectly OK to me.)

Why? Many small reasons:

  • It will make life easy for anyone who wants to add stuff in the future (i.e. us in a year). They will not have to worry about how to change the API/ABI, but only about the exact change they are interested in.
  • Having two calls means moving ABI information through Python (that includes the larger struct). While race conditions are probably not a big deal, it would be nice to keep things together: the C code can then verify everything, rather than having to trust that __dlpack_info__ returned the correct version. That is only clearly possible if all information is passed in a single capsule. (Also, I expect it will be simpler C code.)
  • You avoid creating __dlpack_info__, which may become unnecessary whenever a "real" ABI break comes.

Now, I could see this not replacing the current DLManagedTensor for now (or ever), for use cases where a C++ stack is compiled in one go. I.e. the new struct may just be used by dynamically linked libraries that want to exchange data (like Python libs).

@tirthasheshpatel
Contributor

tirthasheshpatel commented Mar 22, 2022

It looks to me like that would naturally fit at the end of DLTensor. Note that adding new fields at the end of a struct is not an ABI break (unless it changes the semantics of other fields, which readonly doesn't); older code that is unaware of the new field will continue to work unchanged.

@rgommers But adding it to the end of DLTensor will change the size of the first field of DLManagedTensor, and accessing the fields after DLTensor might segfault. I checked by importing torch and tensorflow tensors in NumPy: both fail with a readonly field appended to DLTensor. We can append the field to DLManagedTensor instead, but it would be much better if we could add it to DLTensor.

I would really like to nudge towards doing the annoying thing now, i.e. introduce a new C struct (with version and flag space at the beginning, as the initial poster also suggested), and then rename the capsule for safety.

@seberg I agree with the first suggestion: introduce a new C struct with API and ABI version fields, added at the beginning of the DLTensor struct. This can also be used by pure C/C++ implementations to guard against ABI breaks. But I think renaming the capsule can be avoided (since it is such an integral part of the spec) by adding this to the spec:

The __dlpack_info__ method will return the supported DLPack version. Consumers will check whether this version is supported and raise a BufferError in case of incompatibility. If __dlpack_info__ isn't present, the consumer can fall back to the old ABI or raise a BufferError.

This can help transition Python libs smoothly from the old ABI to the new one (libs that want to support both can do so by checking whether __dlpack_info__ exists [1]). It also remains possible to support the old ABI (without version info in the struct). If, afterwards, we add the version keyword to __dlpack__, it'd be possible to support multiple API and ABI versions at the same time. In summary, we need to:

  • Add an ABI version to DLPack (a macro like DLPACK_ABI_VERSION, similar to DLPACK_VERSION). This could just be a number like 1 (i.e. ABI v1) which we increment every time there is an ABI break (since we don't expect to break the ABI often, this could work).
  • Add the version info to DLTensor and increment the ABI version (since we are breaking it). We can also batch in things like the readonly flag and endianness info when we do this, so we don't need another ABI break.
  • Add the __dlpack_info__ dunder to the Python spec, with the semantics explained above (a rough consumer-side sketch follows this list).
  • Add the version keyword to __dlpack__ if we expect the API/ABI to evolve often and want Python libs to support multiple API/ABI versions.
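A rough consumer-side sketch of the fallback in the third bullet (the constant and the _import_capsule helper are hypothetical):

SUPPORTED_ABI = 1  # hypothetical: the newest ABI this consumer understands

def from_dlpack(x):
    if hasattr(x, "__dlpack_info__"):
        dlpack_version, abi_version = x.__dlpack_info__()
        if abi_version > SUPPORTED_ABI:
            raise BufferError("producer ABI is newer than this consumer supports")
        return _import_capsule(x.__dlpack__(), abi=abi_version)
    # Pre-versioning producer: only the original ABI is possible.
    return _import_capsule(x.__dlpack__(), abi=1)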

If this sounds good to you @seberg @rgommers @mattip @tqchen, I will propose a PR for the first three and then the last one if required.

[1] These checks could be a problem if we want to remove __dlpack_info__ later for whatever reason, but I don't think that's a credible possibility, since we expect the DLPack spec for Python to be quite stable after this change.

@mattip

mattip commented Mar 22, 2022

But I think renaming the capsule can be avoided ...

This remains to be seen. If the contents of the DLPack struct are not compatible with what the version in NumPy 1.22 (and comparable versions of other libraries) expects, it must error out when loading the incompatible struct.

@tqchen
Member

tqchen commented Mar 22, 2022

Thanks for all the discussion. I agree that __dlpack_info__ is a great starting point.

Let us deliberate where/when/how to add version info to DLTensor, and the path for upgrading. The nature of a standard does mean that we favor stability and compatibility with the currently supporting frameworks.

@leofang
Contributor

leofang commented Mar 22, 2022

Great discussion. I like __dlpack_info__ too. My only question is when DLPACK_VERSION will be bumped during this ABI-breaking process.

@seberg
Contributor

seberg commented Mar 22, 2022

Let us deliberate where/when/how to add version info to DLTensor, and the path for upgrading. The nature of a standard does mean that we favor stability and compatibility with the currently supporting frameworks.

This seems the important part to me. Once it is settled how the C ABI/API will evolve, the Python-side __dlpack__ should become clearer. For myself, I currently doubt there will be an immediate need for __dlpack_info__ once the C side is clarified.

As Matti also alluded to: __dlpack_info__ doesn't solve the ABI problem, because it relies on the consumer making careful checks, and an outdated consumer won't make them.

@tirthasheshpatel
Contributor

Let us deliberate where/when/how to add version info to DLTensor, and the path for upgrading. The nature of a standard does mean that we favor stability and compatibility with the currently supporting frameworks.

Where can we add version info?

The version info can be added at the beginning of the DLTensor struct:

typedef struct {
  /*!
   * \brief The DLPack and ABI versions of the tensor. The exporting library
   * should set the DLPack version to DLPACK_VERSION and the ABI version to
   * DLPACK_ABI_VERSION.
   */
  struct {
    uint8_t dlpack;
    uint8_t abi;
  } version;

  ...

  // The new fields go here (at the end of the struct)

  /*! \brief Mark the data readonly. */
  uint8_t readonly;
  /*!
   * \brief Endianness of the data. 1 for non-native endianness and
   * 0 for native endianness.
   */
  uint8_t endianness;
  /*! \brief Alignment of the data. */
  uint32_t alignment;
} DLTensor;

typedef struct DLManagedTensor {
  DLTensor dl_tensor;
  void * manager_ctx;
  void (*deleter)(struct DLManagedTensor * self);
} DLManagedTensor;

If we want to add more fields in the future, we can do so by appending them at the end of the struct (this doesn't affect the offset of the version struct, so the versions can still be accessed to check compatibility).

Upgrade Policy

Note that the __dlpack_info__ dunder will not be required (as pointed out by @seberg and @mattip), since we are adding version info to the DLTensor C struct itself.

After adding version info to the DLTensor, we can define the following new semantics in the Python spec:

  • __dlpack__ should accept a version keyword (a Python tuple (dlpack_version, dlpack_abi_version)), set to None by default. Consumers can use this kwarg to request certain versions of DLPack. If version=None or ABI version 1 is requested:
    • a capsule named "dltensor" which uses the old ABI should be returned (if the producer wants to keep supporting it), or
    • a BufferError should be raised (if the producer doesn't want to keep supporting the old ABI).
      Otherwise, a capsule named "versioned_dltensor" should be returned, which uses the new ABI. If the requested version is not supported, __dlpack__ should raise a BufferError.
  • Consumers should pass a version keyword to the __dlpack__ and __dlpack_device__ methods requesting a particular DLPack version and DLPack ABI version.
  • If the __dlpack__ method doesn't accept the version kwarg, the consumer should fall back to the old API, i.e. passing no arguments to __dlpack__. Consumers can check the capsule name: if a "dltensor" capsule is received, the old ABI can be used to import the data; if a "versioned_dltensor" capsule is received, the version in the DLTensor struct can be used to check for compatibility.

(The version keyword mechanism is required to preserve backward compatibility with Python libraries exporting capsules that use the old ABI. A producer-side sketch of the dispatch follows.)
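For illustration (the _capsule_* helpers are hypothetical, and dlpack_version is assumed to itself be a (major, minor) tuple):

class Producer:
    MAX_VERSION = (0, 7)  # hypothetical: newest DLPack version supported
    MAX_ABI = 2           # hypothetical: newest ABI version supported

    def __dlpack__(self, stream=None, version=None):
        if version is None or version[1] == 1:
            # Old ABI requested (or an old consumer): "dltensor" capsule.
            return self._capsule_old_abi()
        if version[0] > self.MAX_VERSION or version[1] > self.MAX_ABI:
            raise BufferError("requested DLPack version not supported")
        # New ABI: "versioned_dltensor" capsule with the version in the struct.
        return self._capsule_versioned(version)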

I think this concretely defines a roadmap to upgrade. If this sounds good, we can start with merging #101 and then upgrading the Python implementations according to the new spec.

@leofang
Contributor

leofang commented Mar 25, 2022

This seems the important part to me. Once it is settled how the C ABI/API will evolve, the Python-side __dlpack__ should become clearer. For myself, I currently doubt there will be an immediate need for __dlpack_info__ once the C side is clarified.

As Matti also alluded to: __dlpack_info__ doesn't solve the ABI problem, because it relies on the consumer making careful checks, and an outdated consumer won't make them.

Copying my #101 (comment) back here because I don't understand why, all of a sudden, we are now talking about removing __dlpack_info__ when we just agreed upon adding it:

Removing __dlpack_info__ is a design flaw: without the producer providing this info upon the consumer's request, how does the consumer know

  • the max version supported by the producer (without a try-except loop until no error)?
  • the sensible combinations of dlpack_version and dlpack_abi_version? For example, I can pass meaningless combos like (0.5, 2) (such a combo does not exist).

Shouldn't the burden fall on the producer, since they know best? Maybe we should move the discussion back to #34?

@leofang
Contributor

leofang commented Mar 25, 2022

  • the sensible combinations of dlpack_version and dlpack_abi_version? For example, I can pass meaningless combos like (0.5, 2) (such a combo does not exist).

Expanding on this: in this case the producer would most likely raise a BufferError, but how does the consumer know whether it's because the producer does not support it, or because the consumer is being silly and asking for nonsense?

@tirthasheshpatel
Contributor

tirthasheshpatel commented Mar 25, 2022

how does the consumer know

  • the sensible combinations of dlpack_version and dlpack_abi_version? For example, I can pass meaningless combos like (0.5, 2) (such a combo does not exist).

Expanding on this: in this case the producer would most likely raise a BufferError, but how does the consumer know whether it's because the producer does not support it, or because the consumer is being silly and asking for nonsense?

The consumer can theoretically ask for non-existent version combinations, but I don't think anyone would do that, since we don't let the user call __dlpack__ directly (and to_dlpack has not been included in the spec). So the versions are requested internally, and it'd make sense for the consumer to ask for a valid version=(DLPACK_VERSION, DLPACK_ABI_VERSION) and the producer to raise a BufferError if it isn't supported.

  • the max version supported by the producer (without a try-except loop until no error)?

I think this point is valid. IIUC, the idea is that the consumer can ask for a version lower than the max supported one returned by __dlpack_info__. But then we are forcing the producer to support all versions below a given DLPack and ABI version (which IMO should be left up to producers to decide).

Shouldn't the burden fall on the producer, since they know best?

IIUC it's equivalent to put the burden on either side. Sorry, but I didn't understand what "since they [producers] know best" exactly means. What do producers know better than consumers?

@seberg
Contributor

seberg commented Mar 25, 2022

__dlpack_info__ seemed like a decent way to get more information with as few changes as possible to me as well, initially. However, I am unsure there is any use for it, especially if the ABI is bumped now (or an equivalent change happens in Python, even if not in C).
In that case it seems better to pass the relevant information directly rather than splitting it up into two calls?

Personally, I think that should happen, because it seems to me we should expect backwards-incompatible breaks eventually if not now (both gh-97 and gh-41 at least play with the idea of incompatible changes?).

"Breaking" things now gives a clear path for future incompatible changes. I do not quite see that __dlpack_info__ provides a clear path for backwards-incompatible ABI/API changes.
(Because __dlpack_info__ as a new attribute only helps updated consumers: if consumers are outdated you still need to intentionally break them, e.g. by renaming the capsule.)

From a Python API point of view, it seems to me that adding __dlpack_info__ is not more convenient than passing a version into __dlpack__ and renaming the capsule. Both require the consumer to add a try/except eventually, if they want the new information? Or am I missing something?

However, I have no idea what the plan for ABI breakage is in C. My suggestion right now would be to create a new DLManagedTensorEx with the version/info as the first part. Keep the current DLManagedTensor unchanged. C users will have to start using DLManagedTensorEx when they are ready.

DLManagedTensorEx would be written in a way that the version is guaranteed to be unchanged (i.e. guarantee that it casts to a DLPackInfo struct which includes the version and maybe flags).
What follows would be versioned, probably by adding a new struct name when the ABI breaks. Although for appending fields and certain types of incompatible changes that may not be strictly necessary.
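In ctypes terms, that guarantee amounts to a stable prefix that any consumer can read regardless of what follows it (a sketch; the field widths are illustrative):

import ctypes

# The stable prefix: always safe to read, whatever follows it.
class DLPackInfo(ctypes.Structure):
    _fields_ = [("version", ctypes.c_uint32), ("flags", ctypes.c_uint32)]

# Hypothetical versioned container; only the prefix layout is guaranteed.
class DLManagedTensorEx(ctypes.Structure):
    _fields_ = [
        ("info", DLPackInfo),
        # versioned fields would follow here
    ]

tensor = DLManagedTensorEx()
tensor.info.version = 2
# A consumer casts to the prefix type alone and decides from the version
# whether it understands the rest of the struct.
prefix = ctypes.cast(ctypes.pointer(tensor), ctypes.POINTER(DLPackInfo)).contents
assert prefix.version == 2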

[can pass meaningless combos like (0.5, 2) ...] Expanding on this: in this case the producer would most likely raise a BufferError, but how does the consumer know whether it's because the producer does not support it, or because the consumer is being silly and asking for nonsense?

I don't really think it matters. If the consumer asks for clearly invalid things, an error must be raised, but I am not even sure a BufferError would be best. This is arguably a consumer bug, so IMO all bets are off and it doesn't matter what the producer does.

A BufferError should be raised if the producer refuses to export due to an incompatible version. In practice, I think it would make sense to say that the producer should raise a BufferError, but the consumer must check version compatibility and deal with it (i.e. usually raise a BufferError). That is, the consumer must not trust that the producer ensured version compatibility - but it would probably be better if the producer does.

@leofang
Contributor

leofang commented Mar 25, 2022

Sorry, I am still not convinced. First, to correct: by "user" here I meant consumer, which should be obvious from context.

I was saying that even for consumers it's not OK to guess with an N^2 for loop (for two entries): which version tuple should they start trying downward from? If the producer only supports a version tuple above the consumer's top try, the export would fail (but it could be that the particular version tuple is supported by the consumer as well, we're just not forgiving enough for the consumer to know; UPDATE: or, it should really just fail, but then why are we paying the loop cost to fail slowly?).

What do producers know better than consumers?

This is an odd question... As a producer, I know everything about what I'm exporting and can export... Why would the consumer know better than the producer?

With the producer providing the info it's one shot: either the ABI version on both sides matches, or it doesn't. Am I missing anything? (We need to check the ABI version first in order to determine the struct layout for subsequent interpretation.)

In that case it seems better to pass the relevant information directly rather than splitting it up into two calls?

Splitting it up into two calls here is similar to why we have both __dlpack__ and __dlpack_device__: it prepares the consumer to react accordingly. With __dlpack_info__ I can error out early at the Python level without unpacking the PyCapsule and reading into the struct members.

Please note: I am not objecting to adding the version info as new C structs (or struct members, I don't really care). In fact that's good as we move on to standardize C APIs for DLPack in addition to Python APIs (see the meta issue #74). Let's focus on the Python handshake convention, OK?

From a Python API point of view, it seems to me that adding __dlpack_info__ is not more convenient than passing a version into __dlpack__ and renaming the capsule. Both require the consumer to add a try/except eventually, if they want the new information? Or am I missing something?

Adding __dlpack_info__ does not require using Python C APIs; looking up the capsule name does. Also, I thought we already agreed that encoding the version info into the capsule name is a bad idea? Why do we keep backtracking? 🙁

@seberg
Contributor

seberg commented Mar 25, 2022

I am not sure I follow, so let us just take a step back. Can we agree that there are exactly two relevant versions:

  • The maximum version supported by the consumer: version_consumer
  • The maximum version supported by the producer: version_producer

and that a successful exchange must use min(version_consumer, version_producer)? If that minimum version is too old to be supported by one side, a BufferError will be raised (by someone).

Using min(version_consumer, version_producer) requires information to be passed both ways. So I think there are probably exactly two possible designs:

  1. We ask the producer to do the work, and the consumer to check it:
    • The consumer calls __dlpack__(version=version_consumer)
    • The producer calculates min(version_consumer, version_producer)
    • The producer returns min(version_consumer, version_producer) as part of the Capsule/struct.
    • The consumer double-checks the returned version and raises if necessary (too old or a faulty producer); see the sketch after this list.
  2. We ask the consumer to do the work and the producer to ensure compatibility:
    • The consumer asks the producer for its maximum version using version_producer = producer.__dlpack_info__["version"]
    • The consumer calculates min(version_consumer, version_producer)
    • The consumer calls __dlpack__(min(version_consumer, version_producer))
    • The producer checks the requested version and either raises an error or returns that exact version.
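For illustration, the consumer side of design 1 might look as follows (a sketch; the negotiated version is shown as a second return value for brevity, though in the design it would travel inside the capsule's struct so C code can verify it):

MIN_SUPPORTED = (0, 5)  # hypothetical: oldest version this consumer accepts
MAX_SUPPORTED = (0, 7)  # hypothetical: newest version this consumer accepts

def negotiate(producer):
    capsule, got = producer.__dlpack__(version=MAX_SUPPORTED)
    if got > MAX_SUPPORTED:
        raise BufferError("faulty producer: returned a version we never offered")
    if got < MIN_SUPPORTED:
        raise BufferError("negotiated version too old for this consumer")
    return capsule, got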

I admit, I missed the second as a serious alternative. I still prefer the first a bit, because it allows the consumer to confirm that the right version was passed back without any need to trust the producer or intermediate Python code.
That is, all relevant information is in one place in C (OTOH, I am not sure that is quite true for stream=... already, so it doesn't matter much in practice).

I do not think it matters that __dlpack_info__ is available in Python. All we need for Option 1 is a single call to PyCapsule_IsValid(capsule, "dltensor") in the same code that already has a call to PyCapsule_GetPointer(capsule, "dltensor").

The one advantage of __dlpack_info__ is if we anticipate usages like __dlpack_device__, where the information is necessary to pass stream=. Since I had missed option 2 as a serious alternative, my opinion was that there is no need to anticipate this yet (it can be added on an as-needed basis, unlike versioning itself).

In the end, maybe what matters is again how you want to move the C API forward. Assuming the ABI is broken in C: would you want to add the version into the new DLManagedTensor struct, or put it somewhere else? If yes, I would go with option 1: it keeps Python and C closer. If no, maybe option 2 is better.

@mattip

mattip commented Mar 26, 2022

I thought we already agreed that encoding the version info into the capsule name is a bad idea?

Could you point to where that was settled? I encouraged Tirth to change the capsule name when adding fields to the struct, as it seems the only way to continue to use consumer.from_dlpack(capsule) where the capsule is produced with the newer ABI.

@mattip

mattip commented Mar 28, 2022

I admit I am confused about where we want to go with the interaction between consumers and producers:

  • Do we require specifying a version in np.from_dlpack(obj, version=3)? Or is the interface np.from_dlpack(obj, min_version=3) or np.from_dlpack(obj, max_version=3)?
  • Or are we proposing a way to query the object obj.__dlpack_info__, and what do we expect to get back?
  • And then the question of backward compatibility: how do we get from where we are now, with DLPack implementations already shipped, to where we want to be? Is there a way to do that without changing the capsule's name?
  • Why do we need both an ABI version and a DLPack version? Is there a world where, for a specific DLPack version, there is more than one ABI version? That seems counterintuitive.

Perhaps we should try to arrange a meeting with all interested parties, say next week, to try to resolve the open questions.

@rgommers
Contributor

@rgommers But adding it to the end of DLTensor will change the size of the first field of DLManagedTensor, and accessing the fields after DLTensor might segfault. I checked by importing torch and tensorflow tensors in NumPy: both fail with a readonly field appended to DLTensor. We can append the field to DLManagedTensor instead, but it would be much better if we could add it to DLTensor.

Ugh, I missed this - in my mind DLManagedTensor was using a pointer to a DLTensor. The current structure is unfortunate, and if there's a new C struct it probably shouldn't be retained, because it prevents adding more fields to DLTensor in a backwards-compatible fashion.

Do we require specifying a version in np.from_dlpack(obj, version=3)?

No, no one has suggested anything like this, I think. Exposing the version to end users doesn't make sense; this should be private info that's passed between libraries.

And then the question of backward compatibility: how do we get from where we are now, with DLPack implementations already shipped, to where we want to be? Is there a way to do that without changing the capsule's name?

This is a critical point, and should be worked out in more detail. It's something like:

  1. Add __dlpack_info__
  2. Update all known libraries to check for __dlpack_info__ before making any breaking changes in DLPack itself
  3. Update libraries to be able to make the transition smoothly, by either making the producers or consumers deal with two ABI versions at once (probably producers I think, and continue producing ABI v1 by default to avoid breakage).
  4. Update DLPack with "new things people want" (will contain backwards compat break, TBD how exactly)
  5. Incorporate new DLPack version in libraries.

I think it's worth emphasizing again that DLPack is used fairly heavily in production code, so old-style code shouldn't break.

Why do we need both an ABI version and a DLPack version? Is there a world where, for a specific DLPack version, there is more than one ABI version? That seems counterintuitive.

No, but there are many DLPack versions with the same ABI version. Right now, DLPack v0.2 to v0.6 are all ABI-compatible, and should work together just fine even if libraries are on different versions. How would you know that without checking both?

Perhaps we should try to arrange a meeting with all interested parties, say next week, to try to resolve the open questions.

Some synthesis work and writing things down seems needed first; I don't think a call is going to resolve anything without that.

@seberg
Contributor

seberg commented Mar 28, 2022

I agree that a synthesis would be great. Maybe a meeting will even be unnecessary. Or maybe plan a meeting and someone prepares a rough summary as a basis? (EDIT: ideally a summary plus a write-up of the alternatives discussed.)

Update all known libraries to check for __dlpack_info__ before making any breaking changes in DLPack itself

Ideally, the breaking change should be designed in a way that it is OK if a library fails to update (either producer or consumer). Of course it is OK for such a library to just cause an error eventually or even very soon. (But it is not nice if a library that fails to update can run into ABI issues.)

Why do we need both an ABI version and a DLPack version?

Hmmm, I agree we do not need both a major and a minor version. But it may make "more common" changes, like adding additional information at the end or adding new datatypes/devices, easier.
OTOH, especially if we expect version bumps to be pretty rare, maybe a single version is just simpler. Both seem fine to me.

@mattip

mattip commented Mar 30, 2022

This is how I understand it:

  1. array = np.from_dlpack(obj) will call obj.__dlpack_info__() and get back the highest DLPack version number that obj supports. This should be an int; there is no reason to have two different ABIs for a single DLPack version, so the DLPack version uniquely specifies the ABI used.
  2. If the attribute does not exist, then the current protocol will be used.
  3. Then np will call obj.__dlpack__(version=x) where x is either the return value from (1) or a lower version, at the discretion of np. In case (2), the call will be obj.__dlpack__() as before.
  4. Perhaps there will be additional kwargs to __dlpack__ to specify the stream or other attributes that np requires.
  5. The call can fail if obj does not support the requested version.
  6. Before returning array, np will check all the struct attributes (alignment, read/write, dtype, device, etc.) to make sure it can actually use what obj.__dlpack__() returned. If it cannot, the call will fail.

Is that right? What I missed before is that from_dlpack does not "know" about all the underlying version negotiation. And since the negotiation is done in C, the PyCapsule name does not need to change.

@seberg
Contributor

seberg commented Mar 30, 2022

Yeah, I think that is one good option that ensures full backward compatibility. And it has the nice property that __dlpack_info__ can be used to query whether features are supported (not necessary as of now, but it could become useful).

The alternative I see right now is the following:

  1. The consumer calls
    try:
        obj.__dlpack__(version=max_version)  # the max version the consumer supports
    except TypeError:
        obj.__dlpack__()
    
    (an outdated consumer will always call the second branch, which is the same as version=0)
  2. The producer, then:
    • Returns the current capsule for the second branch (version=0/unspecified)
    • In the first branch, we return a new "dltensorex"-named capsule. The returned version of that capsule may be lower than max_version. (It should not be higher, but the consumer must double-check that.)
  3. The new capsule will include the version number as a stable ABI field. That way the consumer knows which version it actually got (no matter what the producer did). If it gets the "wrong" capsule for any reason, either the capsule-name mismatch or a version check will catch it.

I like the fact that this is not "out of band" (as Keith said early on). The version is included in the capsule, so it can be verified in C and is exchanged in a single Python "call" (fetching __dlpack_info__ is a second Python call in some sense).
It would be my preference because it feels cleaner not to separate the version info from the struct. But I don't want to ride this too much...

However, I do feel it would be much more natural if we create a new struct in C which includes the version info as its first field. (But I still somewhat expect that will happen?)

@tirthasheshpatel
Contributor

  • In the first branch, we return a new "dltensorex"-named capsule.

@seberg We don't need to change the capsule name. The except block in your example will always return a capsule with the current ABI. A name change might provide more protection against crashes but isn't necessary. I have (roughly) implemented something for numpy here for reference.

However, I do feel it would be much more natural if we create a new struct in C which includes the version info as its first field. (But I still somewhat expect that will happen?)

Yes, I too think adding the version info to the new struct will be better (libraries that want to support multiple ABI versions will need to do that anyway). But once we have a new struct with version info, we can keep appending new fields to it in case of an ABI break in the future. (Perhaps we can also remove the old struct after a few years.)

@seberg
Contributor

seberg commented Mar 30, 2022

A name change might provide more protection against crashes but isn't necessary

Right, I like the extra layer of protection against a wrong capsule being passed out accidentally (the capsule has full version information). And I feel there is not much of a downside if the name reflects a struct name change/addition in C.

@tirthasheshpatel
Contributor

Right, I like the extra layer of protection against a wrong capsule being passed out accidentally (the capsule has full version information). And I feel there is not much of a downside if the name reflects a struct name change/addition in C.

Makes sense. I think @leofang was against a name change (if everyone follows the spec, it ideally shouldn't be required). @leofang Did you have anything in mind that would break if we change the capsule name (conditionally, i.e. we still export a capsule named dltensor for the old ABI)? If not, it doesn't seem too inconvenient to implement a name change (I think it's worth the extra protection).

@leofang
Contributor

leofang commented Apr 5, 2022

Sorry for getting back late guys. Still trying to digest... 😅

I think @leofang was against a name change (if everyone follows the spec, it ideally shouldn't be required)

Again, that's not true. The usage of the names dltensor and used_dltensor is a contract that has existed in the wild for many years already, long before any of us in this thread started working on DLPack. (Even before __dlpack__ and __dlpack_device__ were devised.) By doing this we are guaranteeing a breaking change that is incompatible with old ways of using a DLPack capsule, because in the old days, due to the agreed-upon contract, no one bothered to call PyCapsule_GetName to fetch the capsule name before calling PyCapsule_GetPointer with the name.

If we are talking about immediate breaks to move forward faster, then yes, we can do it intentionally to signal that we're breaking things (for the better). Then I don't think it's a bad idea to name a capsule something like "dltensor_0.6_1".

libraries that want to support multiple ABI versions will need to do that anyway

I honestly haven't thought about it seriously enough, and I now see that it'd be hard to support multiple ABI versions in practice, at least unlikely with NumPy/CuPy/PyTorch? (@seberg @kmaehashi @rgommers to chime in.) It means we'd need to build multiple Python modules, and in each module's source code include a different DLPack header with a different ABI.

I feel if we're breaking it we should consider doing it collectively, forget about supporting multiple ABIs, and don't look back.

@seberg
Contributor

seberg commented Apr 6, 2022

The usage of the names dltensor and used_dltensor is a contract that has existed in the wild

Nobody is suggesting breaking the contract; renaming the capsule rather amends it. Any current user will still get the old name and can still use the old name (for the time being). You just won't be able to write:

capsule = obj.__dlpack__(version=1)
from_dlpack(capsule)  # or variations

But that is not the intended use of __dlpack__. And without renaming, it would just crash due to ABI differences, which is worse.

no one bothered to call PyCapsule_GetName to fetch the capsule name

PyCapsule_GetPointer requires the name as its second argument, and you have to check for used_dltensor. So I think it is safe to say that all code checks the capsule name.
I would argue that it is nice if capsule names uniquely correspond to the name of the C struct they contain. (Although the C struct itself could be underspecified if the first field is the version of the full struct.)

and I now see that it'd be hard to support multiple ABI versions in practice

I don't think it is particularly hard. Yes, it will require additional code paths in all libraries that try to be friendly to users. But that is unavoidable if we want to support multiple ABI versions. And supporting multiple ABI versions seems unavoidable if we want users to not notice transitions when ABI (or incompatible API) changes happen.
The only alternative would seem to be to never break the ABI (or introduce incompatible API changes).

In practice, code duplication can hopefully be mostly avoided using templating or a helper. But even if not, the good thing is that we can deprecate old versions after a year (or even just remove support). So the duplicated code has a clear "expiry date", which makes the maintenance burden less bad.

I feel if we're breaking it we should consider doing it collectively, forget about supporting multiple ABIs, and don't look back.

For NumPy, I don't care; np.from_dlpack is not public in any released version. But in general, my opinion is that the priority should be that users don't notice anything. And all that seems necessary to achieve that is to keep the old code around for a year or so (plus a bit of cruft to pick between new and old versions).
We should break "collectively", but we can't synchronize perfectly, so there will need to be a ~1 year transition period.

@tqchen
Member

tqchen commented Apr 7, 2022

related discussion #104
