
Feature: .slashes() tokenize transform #1100

Open
retorquere opened this issue Apr 1, 2024 · 6 comments

@retorquere

I'm tokenizing using compromise/one. Can I have 'IEEE/WIC/ACM' be recognized as 3 slash-separated words rather than one?

@spencermountain
Owner

hey Emiliano, good idea.
I've changed the tokenizer to allow for 3 slashes per word.

nlp(`IEEE/WIC/ACM`).match('wic').found //true

Note that the words are not actually split. We have an awkward, but safe, interpretation of slashes, so they don't get bunged-up by other transformations.

cheers

@spencermountain
Owner

released in 14.13.0, thanks

@retorquere
Author

On 14.13 I still see

const nlp = require('compromise/one')
const doc = nlp('IEEE/WIC/ACM')

for (const sentence of doc.json({offset:true})) {
  for (const term of sentence.terms) console.log(term)
}

showing one token with the combined words. How do I extract the separate words in 14.13?

@spencermountain
Owner

Hey, sorry for the delay. Yes, this isn't possible right now, but it's a good idea.
The thinking was that slashed words should stay one term for most purposes, but be individually accessible via matches and such. You can see the slashed words tokenized in the .json() response, in an 'alias' property.

It would be cool (and possible) to add a .slashes().split() method. I can try to add it in an upcoming release.
Cheers

@spencermountain spencermountain changed the title IEEE/WIC/ACM recognized as one word Feature: .slashes() tokenize transform Apr 6, 2024
@retorquere
Author

I find them there, but they've been lowercased. I use the tokenizer for a sentence-casing algorithm, so I need the case kept intact.
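(A possible interim workaround, not part of compromise's API, just a sketch: since the combined term's `text` field keeps the original casing, you can split it on slashes yourself. The `term` object below is hand-built to mimic one entry of `sentence.terms` from `doc.json()`.)

```javascript
// Sketch of a workaround: split a slashed token ourselves so the
// original casing survives (the 'alias' entries are lowercased).
// `term` here stands for one entry of sentence.terms from doc.json().
function splitSlashes(term) {
  return term.text.split('/').filter(piece => piece.length > 0);
}

const pieces = splitSlashes({ text: 'IEEE/WIC/ACM' });
console.log(pieces); // ['IEEE', 'WIC', 'ACM'] — case intact
```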

@retorquere
Author

Would the split method recreate location info? And would this slashes.split be something I run on individual terms?
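(Recreating the location info looks mechanically straightforward, since each piece's start position can be derived from the parent term's offset. A rough sketch, with hypothetical object shapes loosely mimicking what `doc.json({offset: true})` returns, not compromise's actual internals:)

```javascript
// Sketch: derive per-piece offsets from a parent term's offset.
// `term` mimics a { text, offset: { start, length } } shape; the
// splitWithOffsets helper is hypothetical, not a compromise API.
function splitWithOffsets(term) {
  const pieces = [];
  let cursor = term.offset.start;
  for (const text of term.text.split('/')) {
    if (text.length > 0) {
      pieces.push({ text, offset: { start: cursor, length: text.length } });
    }
    cursor += text.length + 1; // +1 skips the '/' separator
  }
  return pieces;
}

const out = splitWithOffsets({
  text: 'IEEE/WIC/ACM',
  offset: { start: 0, length: 12 },
});
// out[1] → { text: 'WIC', offset: { start: 5, length: 3 } }
```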
