[Feature Request]: Add support for uppercase / title cases in lexicon. #1033

Open
MarketingPip opened this issue Sep 6, 2023 · 4 comments

@MarketingPip
Contributor

Hoping this gets done, though it will be a big task.

It would be nice to have support added for this:

let lexicon = {"House":["ProperNoun"],
               "house":["Noun"]
              }
nlp("I am watching the House of Commons at my house", lexicon)

Some words like house can represent a different meaning when title cased.

Example: House of Commons - house is used as a proper noun.

Whereas in "This is my house and my family's ancestral home", house is used as a common noun.

This would improve the part-of-speech tagger considerably. It would also be useful for things like country codes.

For example, the current tagger would detect US in "then there was two of us".

I think this would be much easier than having regex plugins do matches for things like this.
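A minimal sketch of the lookup this request implies, assuming the lexicon keeps its keys in their original case. `lookupTags` is a hypothetical helper written for illustration and is not part of compromise: it prefers an exact-case entry and only falls back to the lowercase entry.

```javascript
// Hypothetical helper (not a compromise API): case-aware lexicon lookup.
const hasKey = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

function lookupTags(token, lexicon) {
  // 1. An entry written in exactly this case wins ("House", "US").
  if (hasKey(lexicon, token)) {
    return lexicon[token];
  }
  // 2. Otherwise fall back to the lowercase entry, if there is one.
  const lower = token.toLowerCase();
  if (hasKey(lexicon, lower)) {
    return lexicon[lower];
  }
  // 3. Unknown word: no tags from the lexicon.
  return [];
}

const caseLexicon = {
  House: ['ProperNoun'],
  house: ['Noun'],
  US: ['Place'],
};

const sentence = 'I am watching the House of Commons at my house';
const tagged = sentence.split(' ').map((t) => [t, lookupTags(t, caseLexicon)]);
console.log(tagged);
```

Because the fallback only goes to lowercase, the "us" in "then there was two of us" never picks up the `US` entry. A sentence-initial "House" would still be read as the title-cased entry, so a real tagger would need an extra rule for that position.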

Then, ideally, we would re-tag all the title-cased words in the current dataset.

@MarketingPip
Contributor Author

@spencermountain - proposed solution to this:

Have separate lists of words for things that are:

  • upper case
  • title case
  • exact match (title and lower case mixes)

When the lexicon is merged, add a rule such as #UPPERCASE_ONLY.

This will help with tagging things like the US in "He lives in the US" as a place.

Or words such as "House" when used as proper noun.
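The three lists proposed above could be derived from a single mixed-case lexicon. This is only a sketch under stated assumptions: `classifyCase` and `splitLexicon` are hypothetical names, and the bucket names (`lower`, `upperCase`, `titleCase`, `exact`) are made up for illustration rather than taken from compromise.

```javascript
// Hypothetical helpers (not compromise APIs): bucket lexicon keys by casing.
function classifyCase(word) {
  const hasLetters = word.toLowerCase() !== word.toUpperCase();
  if (!hasLetters || word === word.toLowerCase()) return 'lower';
  if (word === word.toUpperCase()) return 'upperCase';
  const first = word.charAt(0);
  const rest = word.slice(1);
  // Title case: uppercase first letter followed by no other uppercase letters.
  if (first !== first.toLowerCase() && rest === rest.toLowerCase()) {
    return 'titleCase';
  }
  // Anything else ("iPhone", "McDonald") must match exactly.
  return 'exact';
}

function splitLexicon(lexicon) {
  const buckets = { lower: {}, upperCase: {}, titleCase: {}, exact: {} };
  for (const [word, tags] of Object.entries(lexicon)) {
    buckets[classifyCase(word)][word] = tags;
  }
  return buckets;
}

const buckets = splitLexicon({
  house: ['Noun'],
  House: ['ProperNoun'],
  US: ['Place'],
  iPhone: ['Product'],
});
console.log(buckets);
```

A single uppercase letter such as "A" lands in the `upperCase` bucket here; whether it belongs there or under `titleCase` is a design decision this sketch does not settle.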

@spencermountain
Owner

If you want to do matches based on case, I recommend using the @methods, like:

nlp('House of Commons').match('(house && @isTitleCase) of commons')

nlp('in the US').match('(us && @isUpperCase)')

I know that's a little awkward. I'm not keen to make a change to the lexicon, as it would be a breaking change.

@MarketingPip
Contributor Author

@spencermountain - not ideal. This should be done for the whole project. Even if it makes a breaking change, it will be well worth it.

This code isn't complete, but it's a rough idea of what I'm saying we should do.

// Wrap a single tag in an array so strings and arrays are handled alike
const toArray = (tags) => (Array.isArray(tags) ? tags : [tags]);

function mergeLexiconLists(list1, list2) {
  const mergedLexicon = { ...list1 };

  for (const word of Object.keys(list2)) {
    const lowercaseWord = word.toLowerCase();
    const newTags = toArray(list2[word]);

    if (Object.prototype.hasOwnProperty.call(mergedLexicon, lowercaseWord)) {
      const mergedCategories = new Set([
        ...toArray(mergedLexicon[lowercaseWord]),
        ...newTags,
      ]);

      // Record the casing of the list2 entry as a suffixed tag.
      // Note: an all-caps word such as "US" passes both checks.
      if (isFirstLetterUpperCase(word)) {
        for (const tag of newTags) {
          mergedCategories.add(`${tag}_titleCase`);
        }
      }
      if (isAllCaps(word)) {
        for (const tag of newTags) {
          mergedCategories.add(`${tag}_UPPERCASE`);
        }
      }

      mergedLexicon[lowercaseWord] = Array.from(mergedCategories);
    } else {
      // No lowercase entry yet: store the tags as-is under the lowercase key
      mergedLexicon[lowercaseWord] = list2[word];
    }
  }

  return mergedLexicon;
}

function isAllCaps(str) {
  // True when the string has at least one cased letter and no lowercase letters
  return str === str.toUpperCase() && str !== str.toLowerCase();
}

function isFirstLetterUpperCase(str) {
  // True only when the first character is an uppercase letter
  // (digits and punctuation fail the second comparison)
  const first = str.charAt(0);
  return first === first.toUpperCase() && first !== first.toLowerCase();
}

const lexicon1 = {
  apple: 'Fruit',
  a: 'Fruit',
  house: ['Verb'],
  us: ['#Verb'],
  world: ['noun'],
};

const lexicon2 = {
  amazing: ['#Test'],
  Apple: ['Noun'],
  House: ['#Noun'],
  US: ['#Place'],
  hello: ['#Tests'],
};

const mergedLexicon = mergeLexiconLists(lexicon1, lexicon2);
console.log(mergedLexicon);

If you want to hack on that / end up hacking on it, send me a copy back haha!

But this will substantially help Compromise.js tag words better, while keeping the data the EXACT same size (besides 3 tags - which again). Think how much this will help the rule set and lots more.

We will have to re-tag words (look at their meaning when used in title case / upper case). Plus, this will help SO much with acronyms and MUCH more.

Before dismissing this HUGELY needed feature, think of the enhancements it will bring.

Plus, think of how useful #Place_titleCase would be for other rules, etc.
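One way a tagger could consume suffixed tags like #Place_titleCase, sketched under assumptions: the `_titleCase` / `_UPPERCASE` suffixes follow the naming used in this thread, `resolveTags` and `casingOf` are hypothetical helpers (not compromise APIs), and the rule "a base tag with a suffixed sibling only applies when the casing matches" is one possible interpretation, not a settled design.

```javascript
// Hypothetical sketch: pick tags for a token from a merged, case-suffixed entry.
const SUFFIXES = { title: '_titleCase', upper: '_UPPERCASE' };

function casingOf(token) {
  if (token === token.toLowerCase()) return 'lower';
  if (token === token.toUpperCase()) return 'upper';
  const first = token.charAt(0);
  // 'title' here only means "starts with an uppercase letter"
  return first !== first.toLowerCase() ? 'title' : 'mixed';
}

// Returns the tag without its case suffix, or null if it has none.
function stripSuffix(tag) {
  const suffix = Object.values(SUFFIXES).find((s) => tag.endsWith(s));
  return suffix ? tag.slice(0, -suffix.length) : null;
}

function resolveTags(token, entry) {
  const tags = Array.isArray(entry) ? entry : [entry];
  const wanted = SUFFIXES[casingOf(token)]; // undefined for lower / mixed

  // Tags recorded specifically for this token's casing take priority.
  const specific = wanted
    ? tags.filter((t) => t.endsWith(wanted)).map((t) => t.slice(0, -wanted.length))
    : [];
  if (specific.length > 0) return specific;

  // Otherwise keep plain tags that are not restricted to some other casing.
  const restricted = new Set(tags.map(stripSuffix).filter((b) => b !== null));
  return tags.filter((t) => stripSuffix(t) === null && !restricted.has(t));
}

const usEntry = ['#Verb', '#Place', '#Place_titleCase', '#Place_UPPERCASE'];
console.log(resolveTags('US', usEntry)); // [ '#Place' ]
console.log(resolveTags('us', usEntry)); // [ '#Verb' ]
```

With this rule the lowercase "us" keeps only its own tag, while "US" resolves to the place reading, all from a single lowercase-keyed entry.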

@MarketingPip
Contributor Author

MarketingPip commented Mar 9, 2024

@spencermountain - see this! Page 133 (PDF) - here

Explains those POS rules I referenced earlier.

As well, all the data / rules can be found here.

That PDF might change your mind about doing something like this for the old issue / feature request I made here.

Taken from PDF source.

 Two additional lexicons exist - one for texts in all uppercase (lexicon cap), and
one for texts in all lowercase (lexicon lower).
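The quoted setup can be read as choosing a lexicon per text based on its overall casing. The following is a rough sketch of that reading only; `pickLexicon` and the `standard` / `cap` / `lower` property names are hypothetical, and how the referenced system actually selects its lexicons is not described in this thread.

```javascript
// Hypothetical sketch: select one of three lexicons from the casing of the text.
function pickLexicon(text, lexicons) {
  const hasLetters = text.toLowerCase() !== text.toUpperCase();
  if (hasLetters && text === text.toUpperCase()) {
    return lexicons.cap; // text is entirely uppercase
  }
  if (hasLetters && text === text.toLowerCase()) {
    return lexicons.lower; // text is entirely lowercase
  }
  return lexicons.standard; // mixed case, or no letters at all
}

// Placeholder lexicons, labelled so the selection is visible.
const lexicons = {
  standard: { name: 'standard' },
  cap: { name: 'cap' },
  lower: { name: 'lower' },
};

console.log(pickLexicon('HE LIVES IN THE US', lexicons).name);
console.log(pickLexicon('he lives in the us', lexicons).name);
console.log(pickLexicon('He lives in the US', lexicons).name);
```

Only in the mixed-case text does casing carry information about individual words, which is why the other two cases get their own lexicon in this reading.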

I would think this would ease some pain, instead of writing rules based on context to match...

And it would solve some old issues, more than likely including currently persisting ones like this one.

ps; enjoy your weekend. 🥂


2 participants