Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend collection.yml to accept Metanorma source files as entries #328

Closed
kwkwan opened this issue Apr 18, 2024 · 15 comments
Closed

Extend collection.yml to accept Metanorma source files as entries #328

kwkwan opened this issue Apr 18, 2024 · 15 comments
Assignees
Labels
enhancement New feature or request

Comments

@kwkwan
Copy link

kwkwan commented Apr 18, 2024

No description provided.

@kwkwan kwkwan self-assigned this Apr 18, 2024
@kwkwan
Copy link
Author

kwkwan commented Apr 18, 2024

Refer to https://github.com/metanorma/iso-10303-srl/commit/5472f9468bf0eff1d2ae2defcd18bc8a3627e891, there are several proposed changes in yaml files.

The collection.yml in collection level (root folder) will be changed from

manifest:
  level: collection
  title: XXX
  manifest:
  - level: document
    title: Document
    docref:
    - fileref: documents/<document-1>/document.xml
      identifier: <document-1>
      sectionsplit: true
    - fileref: documents/<document-2>/document.xml
      identifier: <document-2>
      sectionsplit: true
...

to

manifest:
  level: collection
  title: XXX
  sectionsplit: true
  docref:
  - file: documents/<document-1>/collection.yml
  - file: documents/<document-2>/collection.yml
...

The collection.yml in document level (e.g. documents/document-1 folder) will be changed from

 manifest:
    - level: document
      title: Document
      docref:
        - fileref: document.xml
          identifier: <document-1>
          sectionsplit: true

to

  manifest:
    - level: document
      title: Document
      sectionsplit: true
      docref:
        - fileref: document.adoc

@kwkwan
Copy link
Author

kwkwan commented Apr 18, 2024

The goal of this change is to change the way to collect resources to build the collection.

Previously, metanorma.yml is pointed to a list of documents to build to collection. Metanorma compile command are run to build each document, then a file named collection.yml is generated based on the list of documents. Finally, metanorma collection command is run to build the collection based on the generated file collection.yml.

The proposed change is to find the resources to build the collection in collection.yml. The file attributes in collection.yml (collection level) are pointed to collection.yml (document level) under different document folders. When the attributes fileref are pointed to .adoc files, metanorma should compile them. This will ideally make metanorma collection command run once to build and/or compile the resources.

As the format of collection.yml has been changed, metanorma collection may not able to provide backward compatibility of the previous format of collection.yml. Moreover, source codes, tests and user guides in metanorma and/or metanorma-cli may be needed to update accordingly.

@ronaldtse @opoudjis If there are any concerns or suggestions, please let me know.

@kwkwan
Copy link
Author

kwkwan commented Apr 19, 2024

In order to minimize the impact of the original code, we can first create a separate function say render_nested_collections in Metanorma::Collection to handle compilation of documents and collections. What do you think?

@ronaldtse
Copy link
Contributor

The only collections used are in BSI and BIPM.

@opoudjis
Copy link
Contributor

opoudjis commented Apr 20, 2024

As the format of collection.yml has been changed, metanorma collection may not able to provide backward compatibility of the previous format of collection.yml. Moreover, source codes, tests and user guides in metanorma and/or metanorma-cli may be needed to update accordingly.

You know very well that my response to you wanting to break backward compatibility is violent disagreement. There is in fact no particular need to break anything at all from what I am seeing, all of this can be handled as manifest preprocessing, and I've laid out to Koonwa in chat what that would look like.

The collection-level manifest change can be handled separately, because there is no existing support of YAML files as file references in a manifest. So if you iterate the manifest and find a file reference, sure, process it differently. This can go straight in:

manifest:
  level: collection
  title: XXX
  sectionsplit: true
  docref:
  - fileref: documents/<document-1>/collection.yml
  - fileref: documents/<document-2>/collection.yml
...

There is no reason to change fileref to file, and you won't do that.

In my own work, I am introducing separate directives on how files will be preprocessed for ISO 10303 specific processing:

  docref:
  - file: documents/iso-10303-41/collection.yml
    schemas-only: true
  - file: documents/iso-10303-42/collection.yml
    schema-anchors: true

Those directives WILL BE IGNORED by metanorma code. They will be used in the preprocessing of Lutaml that happens out of scope of metanorma code, to determine which templates to use and how; see collection-nn.rb.

Coming to the document-level manifest, there are three changes. Two are straight forward, one is not well thought out, and I object to it.

 manifest:
    - level: document
      title: Document
      sectionsplit: true
      docref:
        - fileref: document.adoc
  • Sectionsplit has been moved from an attribute of docref to an attribute of manifest. That's not a problem, you just preprocess the manifest so that, if the manifest element has a sectionsplit attribute, you copy it to each of the docref listings under it. The same applies for other drectives such as attachment and bare.

So in https://github.com/metanorma/metanorma/blob/main/lib/metanorma/collection_manifest.rb, in both from_yaml and from_xml, you need to introduce a step at the start, before the recursive processing of manifests, to recurse it down. To take the YAML instance:

%w(bare sectionsplit attachment).each do |w|
  mnf[w] or next
  mnf["docref"]&.each do |dr|
     dr[w] ||= mnf[w]
  end
  mnf["manifest"]&.each do |m|
     m[w] ||= mnf[w]
  end
end
  • fileref can point to *.adoc. As I said to you in Skype, because so much processing presupposes the XML is already there, any filerefs pointing to adoc need to trigger parsing to semantic XML early: Metanorma::Document::parse_file() already assumes that it can parse XML to find bibliographic information. I think the right place to do that parse is at the very beginning of parse_file :
def parse_file(file, attachment, identifier = nil, index = true)
         # new preprocessing code
         if file.ends_in(".adoc")
            # find a way to propagate @compile from the collection class ( https://github.com/metanorma/metanorma/blob/06dd1299e817a92a8e5e11211697feed62ba1e02/lib/metanorma/collection.rb#L51 ) into the initialised Document class
            @compile.compile(file, { format: :asciidoc, extension_keys: :xml } )
            # there may be more options here, but let's not complicate things. Refer to https://github.com/metanorma/metanorma/blob/06dd1299e817a92a8e5e11211697feed62ba1e02/lib/metanorma/collection_fileprocess.rb#L12 
            file.sub!(/\.adoc$/, ".xml")
        end
        # existing code
        new(bibitem(file, attachment, identifier), file,
            { attachment: attachment, index: index })
      end

The document identifier in the manifest has to match the document identifier provided inside documents for links between documents, for bibliographic processing. That identifier is how links between documents are resolved, which is the entire point of collections.

So the identifier is essential to all processing.

  • The only way you can skip identifiers in the manifest is if you presuppose the document is in Semantic XML (which it is only after Metanorma::Document::parse_file() is invoked), so that you can extract the bibliographic metadata (:bibdata) and then the authoritative identifier (Metanorma::CollectionRender::CollectionRenderer() ). And that means you have to change around the entire processing ordering, so that the XML can be parsed the minute you start generating the manifest.

It is a complete waste of a week for a marginal improvement in user laziness. The suggestion is unjustifiable, and if you insist on it, it's your problem.

@ronaldtse
Copy link
Contributor

It is a complete waste of a week for a marginal improvement in user laziness. The suggestion is unjustifiable, and if you insist on it, it's your problem.

Ease of use is extremely important and in fact likely the most important thing here. This also helps streamline our documentation needs when the way things work match user expectations.

We can discuss with the exact parameters are but naming keys as fileref instead of file seems like pedagogy and not really accurate...

The proposal is doing away with identifier in the manifest

There are 3 cases:

  • If a document already has a Metanorma encoded identifier
  • If the document does not have a Metanorma encoded identifier (i.e. it is an attachment)
  • If a document already has a Metanorma encoded identifier but needs to be overridden (i.e. including two documents that share the same identifier/namespace for its anchors)

For the first case, there is no need to have an additional "identifier". This is what we're facing now. I know I've not made this clear but here it is.

@ronaldtse
Copy link
Contributor

ronaldtse commented Apr 20, 2024

There is in fact no particular need to break anything at all from what I am seeing, all of this can be handled as manifest preprocessing

Indeed nothing "breaks" here. What I'm saying that if things "have to break" for improvements in collection.yml, so be it because they are under our control.

@ronaldtse ronaldtse changed the title Make metanorma-cli consumes the new format of collection.yml. Extend collection.yml to accept Metanorma source files as entries Apr 20, 2024
@ronaldtse
Copy link
Contributor

With regards to sectionsplit: true being moved one level up, it is to inherit the options to prevent users from needing to manually specify for every item in that particular branch.

@opoudjis
Copy link
Contributor

OK, I have been clear in what I have said. There is some ease of use introduced, but attachments can only be referenced with an identifier supplied in the manifest. That means that skipping identifiers in the manifest is not encouraging users to be careful.

Requiring the identifier up front in processing the manifest is how the code has been built, and if the identifier is not available upfront early, the code will have to be refactored. I have said how that needs to be done, and what code is used to extract the identifier from the document, if it is in Semantic XML.

I'll also add that, in the recent work to process interleaved manifests and document references, I deal with manifests under docref by making their key be a GUID. That solution can in fact be used as a temporary solution, to defer extracting the document ID until you actually need it. But the key for the entry will then need to be updated from the GUID to the document ID, once the document is parsed.

@kwkwan this is all for your benefit since you're the one doing the refactoring, hope you're monitoring this. :-)

@kwkwan
Copy link
Author

kwkwan commented Apr 29, 2024

@opoudjis Thanks for the detail explanation, the importance of the identifier is clear. It is quite difficult to remove identifier. Therefore, the identifier is just removed from the collection.yml. A function set_indentifier_resolver is implemented to allow user to customize the identifier.

You may refer to the changes in
metanorma/metanorma@77c8d32

cc @ronaldtse

@opoudjis
Copy link
Contributor

opoudjis commented May 9, 2024

I see you have enacted @ronaldtse perverse change of fileref to file.

I don't approve of that, but there is no point in me arguing this.

@kwkwan You will need to update the documentation for collection manifests as well. And the documentation needs to document both the old and the new formats.

https://www.metanorma.org/author/topics/collections/

Source on https://github.com/metanorma/metanorma.org

@opoudjis
Copy link
Contributor

opoudjis commented May 15, 2024

The refactoring you've introduced @kwkwan has been useful, but there is a lot that is not clearly thought through about this. So:

  • The new format of YAML introduces three things: change of keywords, flattening of structure, and introduction of YAML recursion.
  • I am now taking care of the former two in Simplify manifests metanorma#383 and that is maintaining backward compatibility. Your work needs to align with that now.
  • There must not be distinction between new and old formats based on structure, and whether there are attachment and document children. That is bogus and rigid, and needs to drop
  • The new code for processing YAML with hooks is good and can stay
  • I am making references to files be absolute rather than relative now; that is dealing with https://github.com/metanorma/iso-10303-srl/issues/224
  • The one thing that does need to be kept as new vs old processing of YAML, as "new structure"-specific code, new_yaml_format?() is the YAML recursion
  • And YAML files can show up as docrefs anywhere in the process, at any depth of recursion of files: a yaml can contain references to a yaml referencing a yaml. So I don't think we really even will need new_yaml_format?() any more once I fold the Shale code in: I think there will be just the one processing chain.
  • Those recursed YAML files are in fact submanifests, and should be processed as such: docref: x.yaml is equivalent to docref: manifest: (content of x.yaml.manifest). And we do already have methods to process manifest. In the new refactoring, file: x.yaml = entry: (contents of x.yaml.manifest)
  • I am now making the file references absolute. Once we start chaining YAML to YAML calls, I want you to update that code so that the relative directories are resolved. You will see in my code that I have introduced File.expand_path(ref_folder), to make the references to ref_folder be absolute. So if we get
    • YAML1 is at absolute location //x/y, references YAML2 at relative location z/a
    • YAML 2 references YAML3 at relative location b/c
    • Then I want you to process file references within YAML3 to be at absolute location //x/y/z/a/b/c, chaining together the ref_folder container locations recursively
    • While I am working on Simplify manifests metanorma#383 , I think making YAML calls track chaining of locations, and be allowed at any level of docref, is the most useful thing you can be doing
  • fileref_resolver must return file locations relative to the working directory. (The path from the working directory to the current YAML is given in the first, ref_folder parameter)
  • some of the calls now being made to fileref_resolver in collection.rb are I think redundant now, because of the switch to absolute file locations I have made. I have already changed the fileref_resolver in collection.rb to work.
  • @ronaldtse and you need to get familiar with what I am doing with the new manifest folder
  • For Simplify manifests metanorma#383, I will update the code to work with the existing YAML processing, but I will not intervene too much in the new code in collection_construct_model.rb. That will be your job once the rest of the code is stable.
  • Ronald and Koonwa, please coordinate about how to proceed.

@opoudjis
Copy link
Contributor

There are now three active branches of Metanorma work:

I am now taking over this ticket, and I am going to start collapsing these branches:

  • fix/absolute-filepaths-manifest is built on top of 377-allow-metanorma-to-adopt-new-format-of-collectionyml, and I will now close 377-allow-metanorma-to-adopt-new-format-of-collectionyml: it is superseded by fix/absolute-filepaths-manifest
  • fix/manifest-refactor had started on an earlier version of 377-allow-metanorma-to-adopt-new-format-of-collectionyml; it will now be updated to incorporate fix/absolute-filepaths-manifest (if possible, a lot has changed)
  • I will work on fix/manifest-refactor
  • Once that is reasonably completed, I am going to close fix/absolute-filepaths-manifest, and start undoing some of the assumptions in this code that I am querying above (particularly recursively calling collection manifest, and chain-propagating manifest locations for YAML nesting), to do away with the old/new distinction
  • I may ask you @kwkwan to do some updates, but I currently think it is better for me to do the lot now. I will advise as I go.

@opoudjis opoudjis added the enhancement New feature or request label May 17, 2024
@opoudjis opoudjis self-assigned this May 17, 2024
@opoudjis
Copy link
Contributor

Progress report from the schedule I introduced 5 days ago:

  • The new format of YAML introduces three things: change of keywords, flattening of structure, and introduction of YAML recursion.
  • I am now taking care of the former two in Simplify manifests metanorma#383 and that is maintaining backward compatibility. Your work needs to align with that now.

Done, and I've confirmed the updates have stuck in rspec with the old structure. I am working on the tests for the new structure now.

  • There must not be distinction between new and old formats based on structure, and whether there are attachment and document children. That is bogus and rigid, and needs to drop

I've dropped it, and moved code specific to the new structure into manifest postprocessing.

  • The new code for processing YAML with hooks is good and can stay
  • I am making references to files be absolute rather than relative now; that is dealing with https://github.com/metanorma/iso-10303-srl/issues/224
  • The one thing that does need to be kept as new vs old processing of YAML, as "new structure"-specific code, new_yaml_format?() is the YAML recursion
  • And YAML files can show up as docrefs anywhere in the process, at any depth of recursion of files: a yaml can contain references to a yaml referencing a yaml. So I don't think we really even will need new_yaml_format?() any more once I fold the Shale code in: I think there will be just the one processing chain.

I am already skipping that new_yaml_format check.

  • Those recursed YAML files are in fact submanifests, and should be processed as such: docref: x.yaml is equivalent to docref: manifest: (content of x.yaml.manifest). And we do already have methods to process manifest. In the new refactoring, file: x.yaml = entry: (contents of x.yaml.manifest)

Done, processing of YAML references are handled as recursive calls to YAML Shale manifests, and made in postprocessing.

  • I am now making the file references absolute. Once we start chaining YAML to YAML calls, I want you to update that code so that the relative directories are resolved. You will see in my code that I have introduced File.expand_path(ref_folder), to make the references to ref_folder be absolute. So if we get
    • YAML1 is at absolute location //x/y, references YAML2 at relative location z/a
    • YAML 2 references YAML3 at relative location b/c
    • Then I want you to process file references within YAML3 to be at absolute location //x/y/z/a/b/c, chaining together the ref_folder container locations recursively

Done and tested.

  • While I am working on Simplify manifests metanorma#383 , I think making YAML calls track chaining of locations, and be allowed at any level of docref, is the most useful thing you can be doing

Done.

  • fileref_resolver must return file locations relative to the working directory. (The path from the working directory to the current YAML is given in the first, ref_folder parameter)

Done.

  • some of the calls now being made to fileref_resolver in collection.rb are I think redundant now, because of the switch to absolute file locations I have made. I have already changed the fileref_resolver in collection.rb to work.

Will look at this after I'm done.

  • @ronaldtse and you need to get familiar with what I am doing with the new manifest folder
  • For Simplify manifests metanorma#383, I will update the code to work with the existing YAML processing, but I will not intervene too much in the new code in collection_construct_model.rb. That will be your job once the rest of the code is stable.

I have intervened, and the code needed out of collection_construct_model has already been moved out of it, but for the hooks.

  • Ronald and Koonwa, please coordinate about how to proceed.

Wait on me.

@opoudjis
Copy link
Contributor

All done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Done
Development

No branches or pull requests

3 participants