feature request: unicode in source #1406

tpapp · 2016-05-09T17:29:11Z

Introduction

Some languages now support Unicode (mostly UTF8) in source code. It would be great if Stan source could use Unicode as well. (Note that comments in UTF8, or in any other encoding that is a superset of ASCII, already work, in the sense that the parser simply ignores them.)

Broadly, there are two possible levels of support:

in variable and function names (e.g. ϕ), and
in operators (e.g. ≤), which would provide synonyms for existing ones (e.g. <=)
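
To make the second level concrete, here is a minimal Python sketch of the "synonym" idea: each Unicode operator is rewritten to its ASCII equivalent before parsing. Only the ≤ / <= pair comes from the list above; the other pairs, the table name, and the function name are assumptions made for illustration, and this is not how the Stan parser works. A plain text replacement like this would also touch comments and string literals, which a real implementation would have to handle in the lexer.

```python
# Hypothetical table of Unicode operator glyphs and their ASCII synonyms.
# Only "≤" -> "<=" is taken from the proposal; the rest are assumptions.
UNICODE_OPERATOR_SYNONYMS = {
    "≤": "<=",
    "≥": ">=",
    "≠": "!=",
}


def to_ascii_operators(source: str) -> str:
    """Replace each Unicode operator glyph with its ASCII synonym."""
    for glyph, ascii_op in UNICODE_OPERATOR_SYNONYMS.items():
        source = source.replace(glyph, ascii_op)
    return source


# Operators are rewritten; Unicode identifiers such as ϕ are left alone.
print(to_ascii_operators("if (ϕ ≤ y) z = 1;"))
```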

Example

This is how the eight schools example would look in Unicode:

data {
  int<lower=0> J;             // number of schools
  real y[J];                  // estimated treatment effect (school j)
  real<lower=0> σ[J];         // std err of effect estimate (school j)
}
parameters {
  real μ;
  real θ[J];
  real<lower=0> τ;
}
model {
  θ ~ normal(μ, τ); 
  y ~ normal(θ, σ);
}

Possible benefits

more compact source code
better mapping to equations in papers

Possible downsides

editor/entry support
font support
possibly corrupted files

The first two are mitigated by the fact that ASCII is a subset of UTF8, so using the feature is optional.
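
The subset relationship, and the "corrupted files" concern, can be illustrated with a short Python sketch. The sample strings and the helper name `is_valid_utf8` are made up for this example; the corruption is simulated by cutting a two-byte character in half.

```python
ascii_source = "data { int<lower=0> J; }"
unicode_source = "parameters { real μ; }"

# Pure-ASCII text produces identical bytes under ASCII and UTF-8, so
# existing ASCII-only programs are already valid UTF-8.
assert ascii_source.encode("ascii") == ascii_source.encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good_bytes = unicode_source.encode("utf-8")
# "μ" takes two bytes in UTF-8; keeping only the first one simulates a
# truncated (corrupted) file.
corrupted_bytes = "μ".encode("utf-8")[:1]

print(is_valid_utf8(good_bytes))
print(is_valid_utf8(corrupted_bytes))
```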

UTF8 support in various languages which have interfaces for Stan

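| language | literals | identifiers | operators | would UTF8 variables work for interfacing with Stan? |
|----------|----------|-------------|-----------|------------------------------------------------------|
| R | yes | yes | no | yes |
| Python | yes | only from version 3 | no | yes, even in Python 2, as they are used as literal keys |
| Julia | yes | yes | yes | yes |
| Matlab | yes | yes, but needs to be enabled | no | yes |
| Stata | yes | yes, from version 14 | no | probably? |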

Editor support

Emacs

See this list for various UTF8 implementations using autocomplete, company-mode, and quail.

On May 11, 2016, at 1:24 PM, Allen Riddell notifications@github.com wrote:

Here's a link to the exact section: https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code


ariddell · 2016-05-11T20:07:07Z

You're right. In Python 2.7 you can't have unicode variables. In Python 3 you can. But why does that matter? We only need unicode in the Stan program code. (Parameter lookups aren't affected since keys are and always were strings.)

bob-carpenter · 2016-05-11T20:48:11Z

It looks like you use a dictionary structure for variable
names.

schools_dat = {
    'J': 8,
    'y': [28, 8, -3, 7, -1, 1, 18, 12],
    'sigma': [15, 10, 16, 11, 9, 11, 10, 18]
}

Can the keys be unicode?

RStan can read data values out of the environment if they're
named after variables in the Stan program. And it can attach
the resulting draws as variables in the environment.

Bob


ariddell · 2016-05-12T12:20:20Z

Python 2 has no problems with unicode dictionary keys. In fact, it can
have unicode variables in the environment but you have to reference them
via strings indirectly. For example, this works in Python 2:

>>> locals()[u'é'] = 9
>>> locals()[u'é']

(locals is something like baseenv or .GlobalEnv in R)

In short, there is nothing Python 2 can't do that's relevant to supporting
unicode in Stan code. The table above is inaccurate.
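
As a rough illustration of the dictionary-key point, here is a Python 3 sketch of the eight schools data with a Unicode key. Renaming `sigma` to `σ` is an assumption that follows the example program in the issue text; nothing here calls PyStan, and whether any interface accepts such a key today is not shown.

```python
# Eight schools data with the standard-error vector keyed by "σ" instead of
# "sigma". The key is an ordinary string, so no Unicode identifier support
# is needed on the Python side.
schools_dat = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "σ": [15, 10, 16, 11, 9, 11, 10, 18],
}

# Lookup by Unicode key works like any other string key.
first_std_err = schools_dat["σ"][0]
print(first_std_err)
print(len(schools_dat["σ"]) == schools_dat["J"])
```

In Python 2 the same lookup would use a `u'σ'` key and a source-encoding declaration at the top of the file.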


bob-carpenter · 2016-05-12T15:33:15Z


You should have edit permission on the issues.

Bob

ariddell · 2016-05-12T16:18:21Z

I was just recording my thoughts on the matter. I appreciate @tpapp putting work into drafting the issue text and would prefer to leave any edits to him.

tpapp · 2016-05-13T06:54:35Z

@ariddell: The table was accurate, but since not all Stan interfaces work the way R and Julia do, I extended it with the information that is probably most relevant: whether the interfaces, as they currently operate, would support UTF8 variables for (1) passing data to Stan and (2) extracting MCMC results. Thanks for pointing this out; it is much more important than the details of UTF8 support in those languages per se.

Not being a Stata user, I am reluctant to make a definitive statement about it. If someone could help with that, it would be great.

ariddell · 2016-05-14T23:28:09Z

One use of unicode in Stan program code that should definitely be supported is in comments. Leaving code comments in one's native language is fairly routine in Python, Java, etc. We should at least support that in Stan.

bob-carpenter · 2016-05-14T23:51:31Z

Unicode in comments is OK now.

Bob


tpapp · 2016-05-15T07:02:41Z

Indeed UTF8 comments work fine, and I have been using them for a while. Made a clarification in the issue.

ariddell · 2016-05-15T14:16:53Z

UTF8 comments aren't supported in PyStan right now (non-ASCII characters will generate an error). I'll fix this. stan-dev/pystan#223
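
For anyone who wants to check whether a program uses non-ASCII characters only in comments, a rough Python sketch follows. It is not the PyStan fix; the function name is hypothetical, it assumes `//` and `#` start line comments and `/* ... */` delimits block comments, and it does not treat string literals specially.

```python
def non_ascii_outside_comments(source: str) -> list:
    """Return (line, column, char) for each non-ASCII char outside comments."""
    hits = []
    in_block = False  # True while inside a /* ... */ comment
    for line_no, line in enumerate(source.splitlines(), start=1):
        col = 0
        while col < len(line):
            if in_block:
                end = line.find("*/", col)
                if end == -1:
                    break  # rest of the line is still inside the block comment
                in_block = False
                col = end + 2
            elif line.startswith("//", col) or line[col] == "#":
                break  # rest of the line is a line comment
            elif line.startswith("/*", col):
                in_block = True
                col += 2
            else:
                if ord(line[col]) > 127:
                    hits.append((line_no, col + 1, line[col]))
                col += 1
    return hits


program = "real y; // écart-type\nreal σ;\n/* θ is fine here */ real x;"
print(non_ascii_outside_comments(program))
```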

bob-carpenter added the feature New feature or request label Nov 24, 2016

bob-carpenter self-assigned this Nov 24, 2016

bob-carpenter mentioned this issue Mar 9, 2018

illegal encoding warning stan-dev/stan#2485

Closed

This was referenced Mar 12, 2019

feature request: unicode in source alashworth/test-issue-import#95

Open

illegal encoding warning alashworth/test-issue-import#188

Open

WardBrian transferred this issue from stan-dev/stan Feb 15, 2024

WardBrian unassigned bob-carpenter Feb 15, 2024

WardBrian added the parsing issues related to the parser and syntax errors label Feb 15, 2024

WardBrian linked a pull request Feb 15, 2024 that will close this issue

Experimental support for unicode identifiers. #1407

Draft

2 tasks
