It would be nice to have a mapping from arpabet to IPA for the cmudict #3238

Open
fcbond opened this issue Mar 13, 2024 · 1 comment
fcbond commented Mar 13, 2024

I am not sure this handles the stress markers correctly (I just converted them to superscripts), and I am not sure where it should go, but I thought it might inspire someone to add it properly.

_MAP_ARPA_IPA = {"AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ",
                "AX": "ə", "AXR": "ɚ", "AY": "aɪ", "EH": "ɛ", "ER": "ɝ",
                "EY": "eɪ", "IH": "ɪ", "IX": "ɨ", "IY": "i", "OW": "oʊ",
                "OY": "ɔɪ", "UH": "ʊ", "UW": "u", "UX": "ʉ", "B": "b",
                "CH": "tʃ", "D": "d", "DH": "ð", "DX": "ɾ", "EL": "l̩",
                "EM": "m̩", "EN": "n̩", "F": "f", "G": "ɡ", "HH": "h",
                "H": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m",
                "N": "n", "NG": "ŋ", "NX": "ɾ̃", "P": "p", "Q": "ʔ",
                "R": "ɹ", "S": "s", "SH": "ʃ", "T": "t", "TH": "θ",
                "V": "v", "W": "w", "WH": "ʍ", "Y": "j", "Z": "z",
                "ZH": "ʒ"}

_MAP_ARPA_AUX = {
    "0": "⁰", "1": "¹", "2": "²", "3": "³", "4": "⁴",
    "5": "⁵", "6": "⁶", "7": "⁷", "8": "⁸", "9": "⁹",
    "-": "-",   "!": "!", "+": "+",
    "/": "/",   "#": "#", ":": ":"}

def arpa2ipa(tag, mapping, aux_map):
    """
    Map a single ARPAbet tag to IPA.

    A trailing auxiliary symbol, such as a stress digit, is looked up in
    aux_map and appended (stress is shown as a superscript digit).

    >>> arpa2ipa("AA", _MAP_ARPA_IPA, _MAP_ARPA_AUX)
    'ɑ'
    >>> arpa2ipa("IY0", _MAP_ARPA_IPA, _MAP_ARPA_AUX)
    'i⁰'
    """
    # Split off a trailing auxiliary symbol (e.g. a stress digit), if any.
    if tag and tag[-1] in aux_map:
        phone, aux = tag[:-1], aux_map[tag[-1]]
    else:
        phone, aux = tag, ""
    # Raise instead of assert so the check survives `python -O`.
    if phone not in mapping:
        raise ValueError(f"Unexpected arpabet: {phone}")
    return mapping[phone] + aux
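
On the stress question: the function above renders stress as a superscript digit, whereas IPA conventionally marks primary stress with ˈ and secondary stress with ˌ. A minimal sketch of that alternative follows. It assumes the CMUdict convention that 0 means unstressed, 1 primary and 2 secondary, and it puts the mark directly before the stressed vowel. Strict IPA places it before the whole syllable, which would need syllabification that is not attempted here. The names `STRESS_MARKS`, `arpa2ipa_stress` and `demo_map` are hypothetical and not part of NLTK.

```python
# Hypothetical helper (not part of NLTK): CMUdict stress digit -> IPA stress mark.
# Assumption: 0 = unstressed, 1 = primary stress, 2 = secondary stress.
STRESS_MARKS = {"0": "", "1": "ˈ", "2": "ˌ"}


def arpa2ipa_stress(tag, mapping):
    """Convert one ARPAbet tag to IPA, prefixing a stressed vowel with its mark."""
    if tag and tag[-1] in STRESS_MARKS:
        # Simplification: the mark goes before the vowel, not the syllable onset.
        return STRESS_MARKS[tag[-1]] + mapping[tag[:-1]]
    return mapping[tag]


# Small subset of the full mapping, repeated so the sketch runs on its own.
demo_map = {"G": "ɡ", "R": "ɹ", "IY": "i", "N": "n"}

green = ["G", "R", "IY1", "N"]
print("".join(arpa2ipa_stress(t, demo_map) for t in green))  # ɡɹˈin
```

Strict IPA would write this as ˈɡɹin, so the sketch is only a starting point for whoever adds this properly.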

Usage example:

>>> green = ["G", "R", "IY0", "N"]
>>> print([arpa2ipa(tag, _MAP_ARPA_IPA, _MAP_ARPA_AUX) for tag in green])
['ɡ', 'ɹ', 'i⁰', 'n']

I ran it over the whole lexicon and it converts every pronunciation without any errors.
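
A check like that could be sketched roughly as follows. This is not the exact script used: `find_unmapped` is a hypothetical helper, and it assumes that only a trailing digit needs stripping before the lookup. It collects every tag whose base symbol is missing from the mapping instead of stopping at the first failure.

```python
def find_unmapped(entries, known_phones, aux_symbols="0123456789"):
    """Return the set of tags whose base phone is not in known_phones.

    entries is an iterable of (word, pronunciation) pairs, where a
    pronunciation is a list of ARPAbet tags such as ["G", "R", "IY1", "N"].
    """
    missing = set()
    for _word, pron in entries:
        for tag in pron:
            # Drop a trailing stress digit before looking the phone up.
            phone = tag[:-1] if tag and tag[-1] in aux_symbols else tag
            if phone not in known_phones:
                missing.add(tag)
    return missing


# Tiny made-up sample so the sketch runs without NLTK or its data.
sample_entries = [("green", ["G", "R", "IY1", "N"]), ("bogus", ["ZZ1"])]
sample_phones = {"G", "R", "IY", "N"}
print(find_unmapped(sample_entries, sample_phones))  # {'ZZ1'}
```

With NLTK and the cmudict data installed, `nltk.corpus.cmudict.entries()` could be passed as `entries` and `set(_MAP_ARPA_IPA)` as `known_phones`. An empty result would correspond to the error-free run described above.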

ekaf commented Mar 17, 2024

This would be a useful addition to the cmudict corpus reader. It could eventually go into a separate module, especially if NLTK had more IPA phonetics or TTS material. Meanwhile, it is most tightly connected to cmudict.
