Provide a shortcut for typing character markup #4462

Open
r12a opened this issue Jun 22, 2023 · 20 comments

Comments

@r12a
Contributor

r12a commented Jun 22, 2023

Is your feature request related to a problem? Please describe.
The i18n WG is developing recommendations for referring to one or more characters in markup (see https://w3c.github.io/bp-i18n-specdev/#char_ref_template).

The most basic template for the expanded markup is:

<span class="codepoint" translate="no"><bdi lang="xx">&#xXXXX;</bdi><span class="uname">U+XXXX UNICODE_CHARACTER_NAME</span></span>

This is not complicated, but it's a bit lengthy and fiddly for authors to type in full, especially if a sequence of characters is involved. We'd therefore like to propose a macro that can be used with respec docs to automatically create the full markup from a more concise base.

Describe the solution you'd like
We propose the following expansions, where

  • the textContent can be a code point value, eg. 00E9, or a sequence of space-separated values, eg. 0928 093F;
  • the textContent can be a character, eg. é, or a sequence of characters, eg. नि
  • the lang attribute is strongly recommended, and has a BCP47 language code as its value
  • there is no limit on the number of values provided
  • hex and character values can't be mixed – the former can be requested using class="hx", and the latter using class="ch"
  • the character name(s) are automatically inserted by respec

Examples:
[1]

<span class="hx" lang="fr">00E9</span>

OR

<span class="ch" lang="fr">é</span>

--->

<span class="codepoint" translate="no"><bdi lang="fr">&#x00E9;</bdi><span class="uname">U+00E9 LATIN SMALL LETTER E WITH ACUTE</span></span>

[2]

<span class="hx" lang="hi">0928 093F</span>

OR

<span class="ch" lang="hi">नि</span>

--->

<span class="codepoint" translate="no"><bdi lang="hi">&#x0928;&#x093F;</bdi><span class="uname">U+0928 DEVANAGARI LETTER NA</span> + <span class="uname">U+093F DEVANAGARI VOWEL SIGN I</span></span>
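The expansion shown in examples [1] and [2] can be sketched roughly as below. This is only an illustration, not existing ReSpec code: the function names (`expandShorthand`, `toCodePoints`) and the tiny `NAMES` table are hypothetical, and the sketch works on strings rather than DOM elements. A real implementation would take names from the Unicode database and would also need to escape the `lang` value.

```javascript
// Hypothetical name table; a real one would be derived from the Unicode database.
const NAMES = new Map([
  [0x00e9, "LATIN SMALL LETTER E WITH ACUTE"],
  [0x0928, "DEVANAGARI LETTER NA"],
  [0x093f, "DEVANAGARI VOWEL SIGN I"],
]);

// Format a code point as at least four uppercase hex digits, e.g. 233 -> "00E9".
const toHex = cp => cp.toString(16).toUpperCase().padStart(4, "0");

// kind is "hx" (space-separated hex values) or "ch" (literal characters).
function toCodePoints(kind, text) {
  if (kind === "hx") {
    return text.trim().split(/\s+/).map(value => {
      if (!/^[0-9A-Fa-f]{1,6}$/.test(value)) {
        throw new Error(`Not a hex code point value: ${value}`);
      }
      return parseInt(value, 16);
    });
  }
  // Spread iterates by code point, so astral characters stay whole.
  return [...text.trim()].map(ch => ch.codePointAt(0));
}

// Build the full markup for one shorthand element.
function expandShorthand(kind, lang, text, names = NAMES) {
  const codePoints = toCodePoints(kind, text);
  const refs = codePoints.map(cp => `&#x${toHex(cp)};`).join("");
  const unames = codePoints
    .map(cp => `<span class="uname">U+${toHex(cp)} ${names.get(cp) ?? "UNKNOWN"}</span>`)
    .join(" + ");
  const langAttr = lang ? ` lang="${lang}"` : "";
  return `<span class="codepoint" translate="no"><bdi${langAttr}>${refs}</bdi>${unames}</span>`;
}

console.log(expandShorthand("hx", "hi", "0928 093F"));
```

The `"UNKNOWN"` fallback is a placeholder chosen for this sketch; how missing names should be reported is not settled in this issue.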

It may also be useful to have a way of indicating that no bdi element is wanted (although much of the time an image would be useful as a replacement). Maybe something like:

<span class="hx nobdi" lang="en">00A0</span>

For invisible characters or tricky-to-display characters (such as certain combining marks), a more complete solution would allow for an image in the expanded markup. For example:

<span class="codepoint" translate="no"><img src="mypath/2003.png" alt="&#x2003;"><span class="uname">U+2003: EM SPACE</span></span>

If it's possible to standardise or accept user input wrt the image location, this could be achieved with a shorthand such as the following, where an additional class name of img or svg is used.

<span class="hx img" lang="ja">2003</span>

(Btw, I can provide a set of images for invisible characters, eg. U+2003.)

Additional context
Note that there is intentionally no space between </bdi> and <span>. The gap will be provided by styling (which avoids problems with variable space widths and makes it possible to reduce the gap or change it at scale if needed).

@r12a
Contributor Author

r12a commented Jun 29, 2023

The nobdi class name may be better as nochar, or some such.

@r12a
Contributor Author

r12a commented Jun 29, 2023

Btw, i have code that could be adapted to make this work.

@sidvishnoi
Member

sidvishnoi commented Jun 29, 2023

Hi @r12a. Having some example code would be good if you've done such a thing before. I'm not sure if there's some API we can use or whether we need to maintain a list of all the characters (to convert 00E9 to U+00E9 LATIN SMALL LETTER E WITH ACUTE). We can perhaps use Intl.Segmenter to split chars as needed (no Firefox support there, though). If we need a list, I'm not sure if it'd make sense to bundle it for all users or even how that list would be maintained. It may be better suited as a plugin.

@r12a
Contributor Author

r12a commented Jul 13, 2023

@sidvishnoi The code i use for my own pages may help.

Note, however, that my code has some differences, built in to the way i use it. These include:

  • i can get the language from the context, rather than a parameter
  • i know where to look for the images - i have my own set - that will need a different solution
  • i use dedicated character databases to retrieve the name (spreadsheetRows) - you'll need to use a list derived from the Unicode database (and updated for each Unicode release) - i have one of these at https://github.com/r12a/shared/blob/gh-pages/code/all-names.js - probably best to do this conversion on the server though, given the size of the file (even if it's compacted)

But there's probably a good deal of the algorithm that's useful.

Search for the expandCharMarkup function at https://github.com/r12a/scripts/blob/gh-pages/common29/functions.js

to convert 00E9 to U+00E9 LATIN SMALL LETTER E WITH ACUTE; we can perhaps use Intl.Segmenter to split chars as needed

Not quite sure what you mean here. If you simply want to get a list of the characters in the textContent, that's easy, use the ... operator (eg. charlist = [... charMarkup[i].textContent])
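A small sketch of what the spread approach yields, for illustration only. The helper names `codePointsOf` and `toUPlus` are made up for this example. The point it shows is that spreading a string splits it by code point, not by UTF-16 code unit.

```javascript
// Split a string into code points using spread (string iteration is by code point).
function codePointsOf(text) {
  return [...text].map(ch => ch.codePointAt(0));
}

// Format a code point in the "U+XXXX" notation used in the expanded markup.
function toUPlus(cp) {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

const ni = "\u0928\u093F"; // नि
console.log(codePointsOf(ni).map(toUPlus)); // [ 'U+0928', 'U+093F' ]

// A character outside the BMP is two UTF-16 code units but one code point.
const astral = "\u{10330}";
console.log(astral.length, codePointsOf(astral).map(toUPlus)); // 2 [ 'U+10330' ]
```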

Don't know whether that helps. Let me know.

@r12a
Contributor Author

r12a commented Jul 13, 2023

Note that i just changed one of the links in the previous comment.

@sidvishnoi
Member

We likely want to go with the approach we use in core/xref here, as the database is too large (1.4MB) to be included in the ReSpec main bundle. That is, we'll have an endpoint at respec.org and fetch details from there.

I'll try to find some time this month. PRs welcome to respec-web-services as well as ReSpec - having either will help us move forward.

@r12a
Contributor Author

r12a commented Oct 19, 2023

Hello @sidvishnoi . The i18n WG is asking me whether we are able to make progress on this. The full markup is now described in https://www.w3.org/StyleSheets/TR/2021/README.html#unicode-codepoints

@sidvishnoi
Member

@r12a Not at the moment from my side, sorry. I'll have to get free from my daily job (or take time off) to focus on this. Happy to review any pull requests related to this though.

@r12a
Contributor Author

r12a commented Jan 24, 2024

@sidvishnoi i'm just checking whether you are likely to have time to look at this again? Cheers.

@sidvishnoi
Member

@r12a not at the moment unfortunately. Maybe in March as my current work contract ends then.

@r12a
Contributor Author

r12a commented Mar 27, 2024

@sidvishnoi ping ?

@sidvishnoi
Member

sidvishnoi commented Mar 27, 2024

I'll try after next Thursday... Sorry to keep you hanging.
Would definitely appreciate a PR from the community, even if partial though.

@r12a
Contributor Author

r12a commented Mar 27, 2024

Thanks @sidvishnoi . I can't really create any PR, but there's a link to my code, fwiw, above. I also have a (new) list of class names that i use to manage the output at https://r12a.github.io/scripts/template/xx.html#template_codepoints – this may well be far more than we need for W3C docs, but i point to it for what value it may have. I think the key thing is to be able to go from <span class="ch">x</span> or <span class="hx">XXX</span> to the full syntax. hth

@sidvishnoi
Member

sidvishnoi commented Apr 30, 2024

Here's the plan:

  1. Create a backend API at respec.org/unicode/names.
  2. In ReSpec:
    1. Get code points with [...textContent.trim()].map(e => e.codePointAt(0).toString(16))
    2. De-dupe queries, bulk POST request to above endpoint (similar to xref, use IndexedDB as a cache too), and
    3. map elements to names, expanding the shorthand.
    4. The class names ch, hx, img and nobdi seem good to me (can probably use char and hex). IDK if we want to support graphemes here, i think not.
    • Might be better idea to use a custom element to avoid adding so many (global) classes? @marcoscaceres e.g. <respec-unicode hex="HEX" img></respec-unicode> or <respec-unicode hex>HEX</respec-unicode> (it'll replace itself with right markup). We can then maybe even publish it (in future) as separate script, without needing to include it in ReSpec core (added benefits like popups with details).
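The ReSpec side of the plan (steps 2.1 to 2.3) could look something like the sketch below. Treat all specifics as assumptions: the endpoint is only planned, the `{ queries: [...] }` request body and the hex-to-name response object are invented for this example, and a plain `Map` stands in for the IndexedDB cache. The `fetchImpl` parameter is there only so the function can be exercised without a network.

```javascript
// Planned endpoint from step 1; request and response shapes below are hypothetical.
const ENDPOINT = "https://respec.org/unicode/names";

// In-memory stand-in for the IndexedDB cache mentioned in step 2.2.
const cache = new Map();

// Step 2.1: code points of one shorthand element's text, as lowercase hex strings.
function hexCodePoints(textContent) {
  return [...textContent.trim()].map(ch => ch.codePointAt(0).toString(16));
}

// Steps 2.2 and 2.3: de-dupe, fetch only what is not cached, then map texts to names.
async function lookupNames(texts, fetchImpl = fetch) {
  const perText = texts.map(hexCodePoints);
  const missing = [...new Set(perText.flat())].filter(hex => !cache.has(hex));

  if (missing.length > 0) {
    const response = await fetchImpl(ENDPOINT, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ queries: missing }),
    });
    if (!response.ok) throw new Error(`Lookup failed: ${response.status}`);
    // Assumed response shape: { "e9": "LATIN SMALL LETTER E WITH ACUTE", ... }
    const names = await response.json();
    for (const hex of missing) cache.set(hex, names[hex] ?? null);
  }

  // One array per input text, one { hex, name } entry per code point.
  return perText.map(hexes => hexes.map(hex => ({ hex, name: cache.get(hex) })));
}
```

Because all shorthand elements in a document are collected first, a repeated character costs nothing extra, and a second call for already-seen characters makes no request at all.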

Images

  1. I think we'd want W3C to host them.
    • We can return the URL in above response too. Can cache images indefinitely by adding a hash in filename.
  2. I think I won't add support for images in first pass.
  3. I wonder if we'd need to support images for graphemes @r12a? Like returning नि as image instead of individual code point images.

respecConfig

I don't think there is any needed, but can probably allow overriding image URLs, something like:

respecConfig.unicode = {
  images: (codePointAsNumberOrGraphemeAsString) => URL
}

My plan is to implement it this Sunday. I've done enough reading to get started now :)

@r12a
Contributor Author

r12a commented May 2, 2024

Later, we'd need to parse https://unicode.org/Public/UNIDATA/UnicodeData.txt to get the latest mapping (@r12a I assume you've written a parser for this file?).

I have, indeed, and i update the file as soon as each new Unicode release occurs (it's needed for my own tools, such as UniView). We can decide how to manage updates later. There are a couple of choices.
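For reference, a minimal sketch of such a parser, which is not the parser mentioned above. It relies only on the documented layout of UnicodeData.txt (semicolon-separated fields, code point first, name second). The function name `parseUnicodeData` is hypothetical, and the sketch deliberately skips bracketed entries such as `<control>` and range markers, which a complete version would need to handle.

```javascript
// Parse UnicodeData.txt content into a Map of code point -> character name.
// Each line has semicolon-separated fields; field 0 is the hex code point, field 1 the name.
function parseUnicodeData(text) {
  const names = new Map();
  for (const line of text.split("\n")) {
    if (!line.trim()) continue;
    const [hex, name] = line.split(";");
    // Entries such as "<control>" or "<CJK Ideograph, First>" are labels, not
    // names; this sketch skips them (ranges would need separate handling).
    if (!name || name.startsWith("<")) continue;
    names.set(parseInt(hex, 16), name);
  }
  return names;
}

// A few sample lines in the file's format.
const sample = [
  "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
  "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9",
  "0928;DEVANAGARI LETTER NA;Lo;0;L;;;;;N;;;;;",
  "093F;DEVANAGARI VOWEL SIGN I;Mc;0;L;;;;;N;;;;;",
].join("\n");

const names = parseUnicodeData(sample);
console.log(names.get(0x0928)); // DEVANAGARI LETTER NA
```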

The class names ch, hx, img and nobdi seem good to me (can probably use char and hex). IDK if we want to support graphemes here, i think not.

I originally used char and hex class names, but it slowed down the content authoring, so i switched to ch and hx. That makes it much faster to type the code (esp. in DreamWeaver, where i just type span.ch[tab] to get <span class="ch">|</span>). So i recommend keeping the shorter forms.

I'm not sure why you suggest nobdi. I can't think of a situation where you'd not want to have bdi. (It's harmless when not needed.)

Not sure what you mean by supporting graphemes, but it's absolutely important to allow a sequence of characters rather than just a single character – eg. <span class="ch">abc</span> should work. Remember that many languages use combining marks and multi-character text units as a basic typographic unit. And the positioning of elements within a sequence can often be problematic if you don't have a webfont or image to control what it looks like.
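To make the distinction concrete, here is a rough sketch comparing code points with grapheme clusters. It assumes an environment where `Intl.Segmenter` is available; the helper name `describe` is made up for this example. A sequence of code points can form a single user-perceived unit, which is why accepting sequences matters even if grapheme segmentation itself is not needed.

```javascript
// Count code points and grapheme clusters in a string.
function describe(text, lang) {
  const codePoints = [...text];
  const segmenter = new Intl.Segmenter(lang, { granularity: "grapheme" });
  const graphemes = [...segmenter.segment(text)].map(s => s.segment);
  return { codePoints: codePoints.length, graphemes: graphemes.length };
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme.
console.log(describe("e\u0301", "fr"));

// Plain ASCII: three code points, three graphemes.
console.log(describe("abc", "en"));
```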

Might be better idea to use a custom element

It's a lot of typing and it's not portable (For example, I'm likely to want to copy (a lot of) stuff between my own stuff and the i18n lreq docs), so i prefer the span & classname approach.

Images. I think we'd want W3C to host them.
I think I won't add support for images in first pass.

They are very useful, though – especially when talking about invisible or ambiguous Unicode characters, so i'd encourage you to support them out of the gate if you can. Of course

I understand that my own setup is a lot simpler and more efficient than what we'd implement for respec. Not least because the images (aiui) would need to be packaged in the same directory as a document that is being published, to pass publication rules. So my assumption was that WGs would make or find images for use here, and store them locally. (I don't mind if people copy images from my set on GH, but i wouldn't expect documents to pull images from that location.)

That said, i think it is important for people to be able to include images in the document, rather than only characters – especially for ambiguous or invisible characters. (The i18n WG already does this in some of their documents.)

hope that helps

@r12a
Contributor Author

r12a commented May 2, 2024

Btw, the other class name values, such as split, circle, coda, init, etc. are very useful, and shouldn't be too hard to implement given that i've already done so in my code (albeit my function could do with rewriting to simplify it, but the logic is there). (See https://r12a.github.io/scripts/template/xx.html#template_codepoints)

@r12a
Contributor Author

r12a commented May 2, 2024

Oh, and if nobdi means 'show only the Unicode name', perhaps it would be better named as 'nameonly' or some such. Most people don't know what bdi is, let alone know that it will appear in the resulting code.

@sidvishnoi
Member

where i just type span.ch[tab]...

Agreed. This is a strong argument for using classes over custom element.

Remember that many languages use combining marks and multi-character text units as a basic typographic unit. And the positioning of elements within a sequence can often be problematic if you don't have a webfont or image to control what it looks like.

This is my concern. Consider <span class="ch img" lang="hi">नि</span>. Will we need to return an image as नि, or as separate characters? Do we have images for all such combined characters somewhere? Also, if नि, then would we need to return images for full words such as नियुक्ति too? I guess it would make sense to use a webfont in that case - but then would ReSpec need to add these webfonts too?

Support for images for control/invisible characters is reasonable. That's something we can definitely support out of the box in the first pass.

Btw, the other class name values, such as split, circle, coda, init, etc. are very useful, and shouldn't be too hard to implement given that i've already done so in my code

With all these classes and special features, I wonder if it would make sense for ReSpec to support it. How about we provide a backend API (via respec.org), and then i18n specs can use a custom preProcess or postProcess plugin to handle this specific logic?

My concern with all these classes is that they're tied to the Unicode expansion plugin but, being classes, they're "too global". This is why I was looking to encapsulate it with a custom element. But I guess we can take these classes into account only with .ch or .hex prefixes to avoid clashes in future, and remove the classes as soon as element gets processed.
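One way to scope the modifier classes, sketched under assumptions: the function name `readShorthandClasses` is hypothetical, `ch` and `hx` are used as the trigger classes as in the original proposal, and the modifier list is just the names mentioned in this thread. Modifiers only count when a trigger class is present, and both are dropped from what remains on the element after processing.

```javascript
const TRIGGERS = ["ch", "hx"];
// Modifier names taken from this thread; the exact set is still under discussion.
const MODIFIERS = ["img", "svg", "nobdi", "split", "circle", "coda", "init"];

// Given a class attribute value, return the trigger, the modifiers that apply,
// and the class string to leave behind once the element has been processed.
function readShorthandClasses(className) {
  const classes = className.split(/\s+/).filter(Boolean);
  const trigger = classes.find(c => TRIGGERS.includes(c)) ?? null;
  if (!trigger) {
    // Without ch/hx, names like "img" are ordinary classes and are left alone.
    return { trigger: null, modifiers: [], remaining: classes.join(" ") };
  }
  const modifiers = classes.filter(c => MODIFIERS.includes(c));
  const remaining = classes.filter(
    c => !TRIGGERS.includes(c) && !MODIFIERS.includes(c)
  );
  return { trigger, modifiers, remaining: remaining.join(" ") };
}

console.log(readShorthandClasses("hx img note"));
console.log(readShorthandClasses("img note"));
```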

Not least because the images (aiui) would need to be packaged in the same directory as a document that is being published, to pass publication rules. So my assumption was that WGs would make or find images for use here, and store them locally.

I think if we host images on W3C/Unicode servers, pubrules can be modified to allow those URL prefixes. Storing images locally is fine too - I'm hoping we can make a page at W3C or Unicode servers (or even respec.org in worst case) to support that.
Do note that tools like w3c/spec-prod could download these remotely referenced images before publishing to /TR, so hosting them anywhere shouldn't be a problem.

@r12a
Contributor Author

r12a commented May 2, 2024

But I guess we can take these classes into account only with .ch or .hex prefixes to avoid clashes in future, and remove the classes as soon as element gets processed.

Yes, that's what i would expect to happen (both points).

I think if we host images on W3C/Unicode servers, pubrules can be modified to allow those URL prefixes. Storing images locally is fine too

There are a few problems here with hosting images:

  1. you need to source the images. I have a set of just over 71,000 images for my own documents, but those don't cover Chinese, Korean, or Tangut blocks (which contain many tens of thousands more characters).
  2. each year when the Unicode Standard is updated it's necessary to create new images for the new characters, but also usually there are additional changes to existing reference character shapes which also need to be updated. That's a lot of work, and we probably don't want images already used for a published spec/document to change by default during these updates anyway.
  3. most people won't use the vast majority of the available images anyway. They'll only need a few per document (though they might use them multiple times).
  4. for my own stuff, i have to have webfonts anyway, so i only use the images for particular cases (mostly invisible / ambiguous, but sometimes for combining marks if the font isn't great – i'm working with many long tail languages). This gives me sufficient control over the rendering that i don't usually need graphics for character sequences. But for W3C specs, i think there will be more interest in using images rather than webfonts for showing the characters. And people will likely want to be able to use images for some sequences, such as नि etc. So being able to create your own images and reference them using a simple syntax seems the best option to me (for the W3C use case).

would we need to return images for full words such as नियुक्ति too?

In principle, yes, but bear in mind that this is really aimed at single characters or small numbers of characters. Otherwise the following Unicode names grow very long. When i want to show full words i will typically create a figure or another mechanism which hides the character names but allows you to discover them, if needed.

Do we have images for all such combined characters somewhere?

No. That would be a vast collection. I'm proposing that the WG creates or sources just the images it needs, but that the respec authoring would allow them to easily show those images with attached Unicode names.

@sidvishnoi
Member

Seems like a respecConfig.unicode.images function would be the best way to support images then. We can pass the codepoint, the full text (of the shorthand element) as well as a reference to that element as parameters. The WG/spec can store images at a convenient location with file names chosen so that the images function stays simple.
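A sketch of what such a function might look like on the spec author's side. Nothing here is an implemented ReSpec API: the hook is only proposed in this thread, and the base URL and the `0928-093F.png` naming convention are assumptions made for illustration.

```javascript
// Hypothetical location where the WG stores its images.
const IMAGE_BASE = "https://example.org/my-spec/images/";

// Format a code point as at least four uppercase hex digits.
const hex4 = cp => cp.toString(16).toUpperCase().padStart(4, "0");

// Proposed hook parameters: code point, full shorthand text, and the element.
// The element is unused here but could carry extra hints (e.g. data attributes).
function images(codePoint, text, element) {
  // Prefer one image for the whole sequence, e.g. "0928-093F.png"
  // (this file naming is an assumption, not a convention from the thread).
  const codePoints = text ? [...text].map(ch => ch.codePointAt(0)) : [codePoint];
  return new URL(codePoints.map(hex4).join("-") + ".png", IMAGE_BASE).href;
}

const respecConfig = { unicode: { images } };

console.log(respecConfig.unicode.images(0x2003, "\u2003"));
// https://example.org/my-spec/images/2003.png
```

With file names derived mechanically from the code points, the function needs no lookup table, and a WG only has to create the handful of images it actually uses.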
