ZWNJ in Persian #61

Closed
theodore-s-beers opened this issue Aug 22, 2019 · 7 comments

@theodore-s-beers

theodore-s-beers commented Aug 22, 2019

Feel free to close this if there's nothing to be done about it. I saw that Annex 29 repeatedly excludes the zero-width non-joiner.

In Persian, this character (U+200C) is used to prevent connection of letters between certain prefixes and suffixes, and the words to which they are attached. I know it has other purposes in other languages, but Persian is what I'm working with. (I also work in Arabic, where the ZWNJ is not used in any context that I know of.)

I was tinkering with a Rust program that involves (among other things) taking Arabic or Persian text input and segmenting the graphemes. Once I found this package, it worked immediately, with few exceptions. And I understood the exceptions that occurred. For example, if an Arabic letter is followed by a vowel mark or other diacritic, those code points stay together as a unit. That seems right, since the letter plus diacritic(s) can be said to represent the "user-perceived character."

But I have a problem with the ZWNJ in Persian. It does not create a new "user-perceived character" along with the preceding letter—which is how it's being treated in this segmentation scheme. Rather, the intention is, "act as though there's a space after this letter, but leave out the space."

At issue is the fact that letters in the Arabic or Persian alphabet have up to four contextual forms: isolated, initial, medial, and final. As you probably know, setting the correct form in a given context tends to be taken care of by the shaping engine. (Otherwise, typing would be incredibly tedious.) When a ZWNJ is added, it's an instruction not to use the medial form of the preceding letter, where it might otherwise be used. The result is that one of the other standard forms will be set instead, depending on the context.

When segmenting graphemes in Persian, then, I don't think it makes sense to exclude the ZWNJ as a boundary. It would better be segmented out, the way that spaces are. In fact, unless I've missed something, U+200C could be treated as a grapheme boundary when it occurs after any code point in the Arabic block. (It should not, however, be treated as a word or sentence boundary by default.)
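A minimal sketch of the tailoring proposed above, assuming the input is the sequence of extended grapheme clusters that this crate's `graphemes(true)` would yield (for می‌کند, the ZWNJ arrives attached to the ی). The helper names `in_arabic_block` and `split_zwnj_after_arabic` are hypothetical, and "the Arabic block" is taken to mean U+0600..=U+06FF only:

```rust
/// U+200C ZERO WIDTH NON-JOINER.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper: true for code points in the Arabic block
/// (assumed here to be U+0600..=U+06FF only).
fn in_arabic_block(c: char) -> bool {
    ('\u{0600}'..='\u{06FF}').contains(&c)
}

/// Hypothetical tailoring: when a cluster ends in ZWNJ and the code point
/// just before it is in the Arabic block, emit the ZWNJ as its own segment.
/// All other clusters pass through unchanged.
fn split_zwnj_after_arabic<'a>(
    clusters: impl IntoIterator<Item = &'a str>,
) -> Vec<&'a str> {
    let mut out = Vec::new();
    for cluster in clusters {
        match cluster.strip_suffix(ZWNJ) {
            Some(head) if head.chars().next_back().map_or(false, in_arabic_block) => {
                out.push(head);
                // The remainder of the cluster is exactly the trailing ZWNJ.
                out.push(&cluster[head.len()..]);
            }
            _ => out.push(cluster),
        }
    }
    out
}
```

This only changes grapheme-level output; it says nothing about word or sentence boundaries, which would stay as they are.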

But I could be wrong. There are people who would know better. And if the mandate here is to follow Annex 29 faithfully, then I suppose it doesn't matter. I found a workaround for my immediate purposes.

Thank you for your work on this project!

@Manishearth
Member

Yes, we understand how Arabic contextual forms work.

The mandate is to follow that spec faithfully. That spec does call out that context-specific tailorings can exist, but those are usually use-case-dependent and this crate doesn't go that far.

You can build such a tailoring on top of this to filter for ZWNJ.
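One way such a filter could look, as a sketch rather than anything this crate provides: it takes the clusters produced by segmentation (for example from `graphemes(true)`) and removes U+200C from each. The name `strip_zwnj` is hypothetical.

```rust
/// U+200C ZERO WIDTH NON-JOINER.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper: remove every ZWNJ from each cluster and drop
/// any cluster that is left empty as a result.
fn strip_zwnj<'a>(clusters: impl IntoIterator<Item = &'a str>) -> Vec<String> {
    clusters
        .into_iter()
        .map(|cluster| cluster.replace(ZWNJ, ""))
        .filter(|cluster| !cluster.is_empty())
        .collect()
}
```

Because the filter runs after segmentation, the default boundaries remain the ones defined by the spec; only the contents of each cluster change.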

@Manishearth
Member

Also, with my Unicode hat on, I think the spec is doing the right thing here: the ZWNJ has semantic meaning there -- it's not just a space -- and it's not its own "perceived character" -- so it has to be rolled into something, and the spec rolls it into the previous one. Force-final forms aren't considered different in Arabic or Persian; it's a property of the word, not the letter. But grapheme segmentation isn't about equality. A jeem and a jeem with a ZWNJ are both a user-perceived character, even though they are perceived by the user as the same character, because sameness isn't about encoding.

If you're relying on equality while segmenting, you need to tweak how you look at equality for this to work. With Unicode algorithms it is important to understand if the algorithm is precisely for the conceptual purpose you want to use it for. Grapheme boundaries provide a simple bare-minimum logical places to do a bunch of segmentation operations (backspace, arrow keys, hyphenated linebreaking). They don't necessarily produce "graphemes" that are equal when you need them to be.
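As an illustration of tweaking equality rather than segmentation, here is a small sketch that treats two clusters as equal when they differ only by ZWNJ. The function name `eq_ignoring_zwnj` is hypothetical, and ignoring only U+200C is an assumption about what "equal" should mean for a given use case.

```rust
/// U+200C ZERO WIDTH NON-JOINER.
const ZWNJ: char = '\u{200C}';

/// Hypothetical helper: compare two grapheme clusters code point by
/// code point, skipping any ZWNJ on either side.
fn eq_ignoring_zwnj(a: &str, b: &str) -> bool {
    a.chars()
        .filter(|&c| c != ZWNJ)
        .eq(b.chars().filter(|&c| c != ZWNJ))
}
```

With this, a jeem followed by ZWNJ (`"\u{062C}\u{200C}"`) compares equal to a bare jeem (`"\u{062C}"`), while the segmentation itself is left untouched.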

@theodore-s-beers
Author

theodore-s-beers commented Aug 22, 2019

Thanks for your responses. Filtering for ZWNJ is easy enough. And I appreciate your last point; maybe I had the wrong idea of what grapheme segmentation is primarily meant for.

Having said that, I'm not sure I agree about the spec itself. Where the ZWNJ is mentioned, it isn't related to the way it's used in Persian. The reference is to Indic languages, and those cases, from what I've seen, are less ambiguous (i.e., a difference in the user-perceived character). I'm not totally convinced that Persian usage was considered in the drafting of this spec… though again, I could be wrong.

It's worth keeping in mind that the effort to get people to use the ZWNJ in Persian has been gradual, and it's still quite common to see spaces used instead. (That's the way I first learned to type in Persian, around fifteen years ago. Many of my academic colleagues continue to do so.) There are also contexts in which, failing the use of the ZWNJ, the letters can be allowed to connect. This is true of the verbal prefix می. So you might see, for example, می‌کند or می کند or میکند and it's not the end of the world. The version with the ZWNJ is just the best option, since it's easy to read but also space-efficient and keeps the word together.

Anyway, I could ramble about this for days, but my point is that the "semantic content" of the ZWNJ in Persian is debatable. Is it just like a space? Not quite, but a space can stand in for it in a pinch. Does it produce something new in combination with the preceding letter? No, or not perceptibly. Does it need to be treated as part of a grapheme cluster with the preceding letter? I don't think anyone could argue that it needs to, but maybe someone thinks it's better this way for encoding purposes. I'll get to the bottom of it eventually. Maybe I'll ask Thomas Milo; if he tells me I'm full of it, that'll shut me up real quick. Thanks again.

@Manishearth
Member

I suspect Roozbeh would have noticed if the spec was wrong, but I can ask him.

@theodore-s-beers
Author

It could easily be the "bare default spec" argument ("if you want something different for working with Persian, go ahead and customize").

I don't know. It occurred to me that much of what I wrote about Persian would also apply to the use of the ZWNJ in German to prevent a ligature across the stems of a compound word. Someone would have spoken up if it seemed wrong to include the ZWNJ in the preceding grapheme cluster.

I'll ask around.

@Manishearth
Member

So I asked Roozbeh (Unicode expert, script expert, native Persian speaker) about this and he agrees with me, but he felt that you should submit feedback through https://unicode.org/reporting.html anyway so we can discuss this at the next UTC.

@theodore-s-beers
Author

Ok. I'll just ask that a sentence or two be added to explain how the treatment of ZWNJ in this spec fits with languages where it's used to prevent a connection or ligature. That much would have cowed me from the start. If the way it works now seems right to highly placed Iranians in the free software world (not a small group), then what can I do but eat my words.
