Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(json-schema): support propertyNames for sample generation #9739

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

twankamp
Copy link
Contributor

Use propertyNames schema to generate property names for samples with additionalProperties

Refs #8913
Resolves #5713

Description

Json Schema 2020-12 has support for supplying a propertyNames schema, which the names of the (additional) properties in the object have to adhere to. However when generating samples this was currently ignored and the the properties would always be named additionalProp1, additionalProp2 etc. With this PR, the property names in the samples will follow the propertyNames schema. Property names are always strings, so this is mostly reusing the existing code of sample generation for string values. Except objects cannot have multiple properties with the same name, so the string generation now supports adding in an index in the places where it was previously producing fixed values.

Motivation and Context

Many schemas describe dictionaries that map variable property names to property values, which is modeled with additionalProperties. As described in #5713 , the generated samples with additionalPropx have issues: 1) they are not good for readability as they don't provide any context; 2) they may actually be violating the propertyNames schema; and 3) the number of additional property samples cannot be controlled. Writing custom examples is not a good solution, because typically these additional properties have full-fledged object schemas as values, so then all the power of the very useful auto-generated samples is lost. This feature generates additional property names according to the propertyNames schema (partial implementation of #8913), and the propertyNames examples can optionally be used to control the names and number of sample properties (resolution for #5713).

How Has This Been Tested?

Added unit tests
Performed manual test

Screenshots (if appropriate):

propertynames

Checklist

My PR contains...

  • No code changes (src/ is unmodified: changes to documentation, CI, metadata, etc.)
  • Dependency changes (any modification to dependencies in package.json)
  • Bug fixes (non-breaking change which fixes an issue)
  • Improvements (misc. changes to existing features)
  • Features (non-breaking change which adds functionality)

My changes...

  • are breaking changes to a public API (config options, System API, major UI change, etc).
  • are breaking changes to a private API (Redux, component props, utility functions, etc.).
  • are breaking changes to a developer API (npm script behavior changes, new dev system dependencies, etc).
  • are not breaking changes.

Documentation

  • My changes do not require a change to the project documentation.
  • My changes require a change to the project documentation.
  • If yes to above: I have updated the documentation accordingly.

Automated tests

  • My changes can not or do not need to be tested.
  • My changes can and should be tested by unit and/or integration tests.
  • If yes to above: I have added tests to cover my changes.
  • If yes to above: I have taken care to cover edge cases in my tests.
  • All new and existing tests passed.

@char0n
Copy link
Member

char0n commented Mar 27, 2024

Hi,

Thanks for the PR!

I have number of questions and suggestions around the PR.

1. FormatRegistry

Why creating FormatRegistry at all? formatAPI was created to reflect the API of json-schema-faker and was intended to only register new formats that we don't support by default. The change doesn't seem to be related to the feature you're proposing. It was designed in this specific way, as it was suspected that formats are the most frequest thing to be added (compared to encoder or media type). When new format is added, with your change, you'll always have to remember to add it to FormatRegistry.

2. Modifying all generators to support arbitrary index

Are these changes necessary at all? Solving the case for idx inside generators is breaking separation of concerns.
I think I do understand what were your tring to solve, but IMHO we could solve this by more systematic way:

When we have ability to turn on randomnes on on demand, it will help up adding propertyNames.

What do you think?

@twankamp
Copy link
Contributor Author

Thanks for the review and feedback! Let me first respond to the items and wait for your confirmation which direction you prefer with information of my reasoning in mind. Then I will change accordingly.

1. FormatRegistry

Why creating FormatRegistry at all? formatAPI was created to reflect the API of json-schema-faker and was intended to only register new formats that we don't support by default. The change doesn't seem to be related to the feature you're proposing. It was designed in this specific way, as it was suspected that formats are the most frequent thing to be added (compared to encoder or media type). When new format is added, with your change, you'll always have to remember to add it to FormatRegistry.

I wanted to add an options argument to the generator and I noticed I had to repeat it multiple times in this large case statement: https://github.com/swagger-api/swagger-ui/blob/master/src/core/plugins/json-schema-2020-12-samples/fn/types/string.js#L40 . Then I saw that format is the only place where such case statement is used, whereas for media type and encoder it's reusing the API with default values to make it look a lot cleaner: https://github.com/swagger-api/swagger-ui/blob/master/src/core/plugins/json-schema-2020-12-samples/fn/types/string.js#L145 . So I thought by doing this, we make it consistent between format/mediatype/encoder while removing a lot of case statements and imports in string.js.

For the scenario of adding new formats, I didn't really see a difference in effort between having to add it to FormatRegistry or having to add a case-statement?

So it looked to me more like an accidental inconsistency, but if you say it was designed in this specific way, then it may not make sense to change. Let me know your final verdict if it's better to keep the change or add back the case statement?

You are certainly right: this change doesn't strictly relate to the feature at hand.

2. Modifying all generators to support arbitrary index

Are these changes necessary at all? Solving the case for idx inside generators is breaking separation of concerns. I think I do understand what were your trying to solve, but IMHO we could solve this by more systematic way:

When we have ability to turn on randomness on on demand, it will help up adding propertyNames.

Thanks for pointing to json-schema-faker. I looked into it, and if I understand correctly, your suggestion would entail 3 parts.

1. Supply options via a separate OptionsAPI instead of via a function argument

I like the idea of an OptionsAPI in general, but the reason I thought it would be more logical to supply options via an argument in this specific case, is that for this feature it's only needed when generating propertyNames, so only for specific function calls. Configuring options through an OptionsAPI sounds like a good idea in general, but felt more appropriate for longer living options.

It's still possible to use an OptionsAPI, but then I would store the current option setting as a backup, enforce it to be what I need, and put it back when done.

Hope to hear your preference.

(As for the separation of concerns, I guess that either way how the options are passed, the generator will need to be aware about them?)

2. Use randomness instead of indexing

I did explicitly choose for indexing over full randomness, for these reasons:

  • This is an sample generator for a documentation UI that aims for readability, so I think john.doe1@example.com, john.doe2@example.com looks better than fdjskalvc@sjsc8fw.cew, wkjcx3wi@kapeic.xe. (this would be different if our purpose was for example to generate test vectors).
  • Theoretically with random you can hit the same value multiple times, but propertyNames must be unique
  • It's also generates deterministic schemas and simplifies unit tests

This wouldn't rule out that in the future we can also support an option for full randomness, so options could be static, indexed, or random, but that could be a separate PR. Let me know your thoughts.

3. Make the option a function instead of a parameter

In json-schema-faker they don't enable/disable randomness through e.g. a boolean parameter, but rather let you supply a random function. It gives a bit more flexibility. Supposedly if you want no randomness you can supply a function returning a fixed value. I guess that makes sense so I'll look into it.

Note that these 3 are independent, so we can mix and match. I could create an option to supply an indexing function and pass it through either an argument or OptionsAPI.

Unrelated side question

While looking at this, I did notice the way string constraints are apply is not really great. For example if format is e-mail and the generator gives john.doe@example.com, and the schema also has maxLength: 10 which is applied after the generator, you would get john.doe@e which is not a valid e-mail. If instead the generator would be aware about the maxLength then it could generate jon@ex.com which is at least valid. I saw that the schema was supplied when using the API https://github.com/swagger-api/swagger-ui/blob/master/src/core/plugins/json-schema-2020-12-samples/fn/types/string.js#L37 but not in the case statement, and it is never used. Do you happen to know if this choice was made intentionally?

I don't think it's a big deal, so not something I want to improve right now. But was wondering about the direction here, which maybe could help put the choices above in perspective.

@char0n
Copy link
Member

char0n commented Apr 9, 2024

So it looked to me more like an accidental inconsistency, but if you say it was designed in this specific way, then it may not make sense to change. Let me know your final verdict if it's better to keep the change or add back the case statement?

OK let's introduce FormatRegistry. The argument it not being there looking as accidental inconsistency is a strong one from me. Could you introduce FormatRegistry as a separate PR please?\

Configuring options through an OptionsAPI sounds like a good idea in general, but felt more appropriate for longer living options.

Yes, it's build for long living options. I've introduced optionAPI in #9795. Unless it's really necessary and cannot be avoided, we want to use optionAPI exclusively.

In json-schema-faker they don't enable/disable randomness through e.g. a boolean parameter, but rather let you supply a random function.

Yes, but json-schema-faker doesn't assume that the generators could generate constant values. This is how Draft 5 sample generator was used and people expect Draft 2020-12 to work in similar way. That's why I mentioned that when random option is set to null instead of a function it means that we want sample generator to return stable constant values as it does now. When random option is supplied as function it will start working more like json-schema-faker.

It gives a bit more flexibility. Supposedly if you want no randomness you can supply a function returning a fixed value. I guess that makes sense so I'll look into it.

It's not as simple as that. For example look at https://github.com/json-schema-faker/json-schema-faker/blob/master/src/lib/core/random.mjs#L31

If we supply fixed/constant value we're always get the undefined value from the pick function. This also applies for other psedo-random functions there.

While looking at this, I did notice the way string constraints are apply is not really great. For example if format is e-mail and the generator gives john.doe@example.com, and the schema also has maxLength: 10 which is applied after the generator, you would get john.doe@e which is not a valid e-mail. If instead the generator would be aware about the maxLength then it could generate jon@ex.com which is at least valid. I saw that the schema was supplied when using the API https://github.com/swagger-api/swagger-ui/blob/master/src/core/plugins/json-schema-2020-12-samples/fn/types/string.js#L37 but not in the case statement, and it is never used. Do you happen to know if this choice was made intentionally?

I've implemented the current code with that bug in mind. As the sample generator now generates constant/fixed values and there is now way for us to randomly reiterate and generate next random value. In order to fix this we should do the same thing that json-schema-faker does and that is not constraining the generated values of type string that came from format generators. In our case we have to apply this for media encoders as well.

I've addressed that in separate PR: #9796


Overall IMHO the idea of solving property names with idx on generator level doesn't look like a way to go to me. IMHO we should either

1.

introduce proper randomness and turn it on when generating property names

2.

create specialized propertyNames generator that will use underlying generators producing constant values and somehow modifying those generated constant values with idx.

I think option 2. is unnecessarily complex to implement. We would have to know where to add idx in generated values and also make sure those values with idx still comply with original schema constrains.

@twankamp
Copy link
Contributor Author

twankamp commented Apr 9, 2024

Ok will have a look at it again with the above in mind.

@char0n One clarification on:

Yes, it's build for long living options.

So you say it's build for long living options, but you also say you want to use it exclusively, so I assume that includes for short living options like during propertyName generation? In other words, you agree I would before starting to generate property names, store the current option setting as a backup, enforce it to be what I need for property name generation, and put it back when done.

Or are you rather saying that for propertyName generation to work, this option should be enabled for the full schema, and if not enabled then propertyName generation should keep working in the broken way? And if so would you want it default enabled, or is there a UI to enable it?

@char0n
Copy link
Member

char0n commented Apr 10, 2024

So you say it's build for long living options, but you also say you want to use it exclusively, so I assume that includes for short living options like during propertyName generation? In other words, you agree I would before starting to generate property names, store the current option setting as a backup, enforce it to be what I need for property name generation, and put it back when done.

Yes. Sorry for contradiction. It's designed to handle long living options, but can also be used to solve ad-hoc problems like these.

We want to introduce all of this in small PRs (layers). I can image following steps:

  • adding the `FormatRegistry
  • adding support for randomness (optionsAPI and core/random.js)
  • adding support for randomness to bunch of generators (multipe smallers PR, by default they still return constant values)
  • adding support for randomness to media type generators
  • compensating for randomness changes (probably how we apply constraints would need to be redone?)
  • introducing support for propertyNames

@twankamp
Copy link
Contributor Author

Ok started with adding FormatRegistry in #9799

char0n pushed a commit that referenced this pull request Apr 11, 2024
@twankamp
Copy link
Contributor Author

Now added support for randomness in #9810

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to rename additionalProperties keys
2 participants