It would be nice to have a mapping from arpabet to IPA for the cmudict #3238

fcbond · 2024-03-13T20:31:59Z

I am not sure this handles the stress markers correctly (I just converted them to superscripts), and I am not sure where it should go, but I thought it might inspire someone to add it properly.

_MAP_ARPA_IPA = {"AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ",
                "AX": "ə", "AXR": "ɚ", "AY": "aɪ", "EH": "ɛ", "ER": "ɝ",
                "EY": "eɪ", "IH": "ɪ", "IX": "ɨ", "IY": "i", "OW": "oʊ",
                "OY": "ɔɪ", "UH": "ʊ", "UW": "u", "UX": "ʉ", "B": "b",
                "CH": "tʃ", "D": "d", "DH": "ð", "DX": "ɾ", "EL": "l̩",
                "EM": "m̩", "EN": "n̩", "F": "f", "G": "ɡ", "HH": "h",
                "H": "h", "JH": "dʒ", "K": "k", "L": "l", "M": "m",
                "N": "n", "NG": "ŋ", "NX": "ɾ̃", "P": "p", "Q": "ʔ",
                "R": "ɹ", "S": "s", "SH": "ʃ", "T": "t", "TH": "θ",
                "V": "v", "W": "w", "WH": "ʍ", "Y": "j", "Z": "z",
                "ZH": "ʒ"}

_MAP_ARPA_AUX = {
    "0": "⁰", "1": "¹", "2": "²", "3": "³", "4": "⁴",
    "5": "⁵", "6": "⁶", "7": "⁷", "8": "⁸", "9": "⁹",
    "-": "-",   "!": "!", "+": "+",
    "/": "/",   "#": "#", ":": ":"}

def arpa2ipa(tag, mapping, aux_map):
  """
  Map a single ARPAbet tag to its IPA equivalent.
  A trailing stress digit is shown as a superscript.

  >>> arpa2ipa("AA", _MAP_ARPA_IPA, _MAP_ARPA_AUX)
  'ɑ'
  >>> arpa2ipa("IY0", _MAP_ARPA_IPA, _MAP_ARPA_AUX)
  'i⁰'
  """
  if tag[-1] in aux_map:
    assert tag[:-1] in mapping, f"Unexpected arpabet: {tag[:-1]}"
    return mapping[tag[:-1]] + aux_map[tag[-1]]
  else:
    assert tag in mapping, f"Unexpected arpabet: {tag}"
    return mapping[tag]

Usage example:

>>> green = ["G", "R", "IY0", "N"]
>>> print(green, [arpa2ipa(l, _MAP_ARPA_IPA, _MAP_ARPA_AUX) for l in green])
['G', 'R', 'IY0', 'N'] ['ɡ', 'ɹ', 'i⁰', 'n']

I ran it over the whole lexicon and it converts every pronunciation without any errors.
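
A lexicon-wide check like this could be sketched as follows. This is only an illustration, not the script that was actually run: `find_unmapped`, `SAMPLE_MAP`, and `sample_lexicon` are hypothetical names, the mapping is a small subset of the table above, and the lexicon is a tiny in-memory stand-in. With NLTK and the cmudict corpus installed, `nltk.corpus.cmudict.dict()` returns data of the same shape (word mapped to a list of pronunciations). Rather than stopping at the first failed assertion, the sketch collects every symbol the mapping cannot handle.

```python
# Hypothetical helper, not part of NLTK: collect every ARPAbet symbol in a
# lexicon that the mapping cannot convert, instead of failing on the first one.

# Subset of the ARPAbet-to-IPA table, just enough for the sample below.
SAMPLE_MAP = {"G": "ɡ", "R": "ɹ", "IY": "i", "N": "n", "T": "t"}

# cmudict marks stress with a trailing digit: 0, 1, or 2.
STRESS_DIGITS = {"0", "1", "2"}


def find_unmapped(lexicon, mapping, stress_digits=STRESS_DIGITS):
    """Return the set of base symbols in `lexicon` that are missing from `mapping`.

    `lexicon` maps each word to a list of pronunciations, and each
    pronunciation is a list of ARPAbet symbols such as "IY1".
    """
    missing = set()
    for pronunciations in lexicon.values():
        for pron in pronunciations:
            for symbol in pron:
                # Strip a trailing stress digit, if any, before the lookup.
                base = symbol[:-1] if symbol[-1:] in stress_digits else symbol
                if base not in mapping:
                    missing.add(base)
    return missing


# Tiny in-memory stand-in for the real lexicon (assumed shape: word -> prons).
sample_lexicon = {
    "green": [["G", "R", "IY1", "N"]],
    "tree": [["T", "R", "IY1"]],
    "zen": [["Z", "EH1", "N"]],
}

# "Z" and "EH" are missing only because SAMPLE_MAP is a deliberate subset.
print(sorted(find_unmapped(sample_lexicon, SAMPLE_MAP)))
```

An empty result means every pronunciation in the lexicon can be converted.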


ekaf · 2024-03-17T11:45:29Z

This would be a useful addition to the cmudict corpus reader. It could eventually go into a separate module, especially if NLTK had more IPA phonetics or TTS material. Meanwhile, it is most tightly connected to cmudict.
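
As a rough idea of what a helper next to the cmudict reader might look like, here is a minimal sketch that converts a whole pronunciation to one IPA string and uses the conventional IPA stress marks (ˈ for primary, ˌ for secondary) instead of superscript digits. Everything in it is an assumption for illustration: `pron_to_ipa`, `IPA_STRESS`, and `SAMPLE_MAP` are hypothetical names, not NLTK API, and the mapping is a small subset of the table in the first comment. The mark is placed directly before the stressed vowel, which simplifies standard IPA practice (mark before the whole syllable) because cmudict does not record syllable boundaries.

```python
# Hypothetical sketch: IPA stress marks instead of superscript digits.
# None of these names exist in NLTK; they are for illustration only.

# Subset of the ARPAbet-to-IPA table, enough for the examples below.
SAMPLE_MAP = {
    "G": "ɡ", "R": "ɹ", "IY": "i", "N": "n",
    "AH": "ʌ", "B": "b", "AW": "aʊ", "T": "t",
}

# cmudict stress digits: 0 = unstressed, 1 = primary, 2 = secondary.
IPA_STRESS = {"0": "", "1": "ˈ", "2": "ˌ"}


def pron_to_ipa(pron, mapping=SAMPLE_MAP, stress=IPA_STRESS):
    """Convert a list of ARPAbet symbols into a single IPA string.

    The stress mark goes directly before the stressed vowel, a
    simplification of the usual before-the-syllable placement.
    """
    pieces = []
    for symbol in pron:
        if symbol[-1:] in stress:
            # Split off the trailing stress digit and look up its mark.
            base, mark = symbol[:-1], stress[symbol[-1]]
        else:
            base, mark = symbol, ""
        if base not in mapping:
            raise KeyError(f"Unexpected ARPAbet symbol: {symbol!r}")
        pieces.append(mark + mapping[base])
    return "".join(pieces)


# Pronunciations written out by hand for illustration.
print(pron_to_ipa(["G", "R", "IY1", "N"]))    # ɡɹˈin
print(pron_to_ipa(["AH0", "B", "AW1", "T"]))  # ʌbˈaʊt
```

Raising `KeyError` rather than using `assert` keeps the check active when Python runs with `-O`, which strips assertions.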
