
Feature: .slashes() tokenize transform #1100

Open
retorquere opened this issue Apr 1, 2024 · 6 comments

@retorquere

I'm tokenizing using compromise/one. Can I have 'IEEE/WIC/ACM' be recognized as 3 slash-separated words rather than one?

@spencermountain
Owner

hey Emiliano, good idea.
I've changed the tokenizer to allow for 3 slashes per word.

nlp(`IEEE/WIC/ACM`).match('wic').found //true

Note that the words are not actually split. We have an awkward, but safe, interpretation of slashes, so they don't get bunged-up by other transformations.

cheers

@spencermountain
Owner

released in 14.13.0, thanks

@retorquere
Author

On 14.13 I still see

const nlp = require('compromise/one')
const doc = nlp('IEEE/WIC/ACM')

for (const sentence of doc.json({offset:true})) {
  for (const term of sentence.terms) console.log(term)
}

showing one token with the combined words. How do I extract the separate words in 14.13?

@spencermountain
Owner

Hey, sorry for the delay. Yes, this isn't possible right now, but it's a good idea.
The thinking was that slashed words should stay one term for most purposes, but be individually accessible via matches and such. You can see the slashed words tokenized in the .json() response, in an 'alias' property.

It would be cool (and possible) to add a .slashes().split() method. I can try to add it in an upcoming release.
Cheers

@spencermountain spencermountain changed the title IEEE/WIC/ACM recognized as one word Feature: .slashes() tokenize transform Apr 6, 2024
@retorquere
Author

I find them there, but they've been lowercased. I use the tokenizer for a sentence-casing algorithm, so I need the case kept intact.
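(A possible interim workaround, not part of compromise's API, just a sketch: since the combined term's `text` field keeps the original casing, you can split it on slashes yourself. The `term` object below is hand-built to mimic one entry of `sentence.terms` from `doc.json()`.)

```javascript
// Sketch of a workaround: split a slashed token ourselves so the
// original casing survives (the 'alias' entries are lowercased).
// `term` here stands for one entry of sentence.terms from doc.json().
function splitSlashes(term) {
  return term.text.split('/').filter(piece => piece.length > 0);
}

const pieces = splitSlashes({ text: 'IEEE/WIC/ACM' });
console.log(pieces); // ['IEEE', 'WIC', 'ACM'] — case intact
```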

@retorquere
Author

Would the split method recreate location info? And would this slashes.split be something I run on individual terms?
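(Recreating the location info looks mechanically straightforward, since each piece's start position can be derived from the parent term's offset. A rough sketch, with hypothetical object shapes loosely mimicking what `doc.json({offset: true})` returns, not compromise's actual internals:)

```javascript
// Sketch: derive per-piece offsets from a parent term's offset.
// `term` mimics a { text, offset: { start, length } } shape; the
// splitWithOffsets helper is hypothetical, not a compromise API.
function splitWithOffsets(term) {
  const pieces = [];
  let cursor = term.offset.start;
  for (const text of term.text.split('/')) {
    if (text.length > 0) {
      pieces.push({ text, offset: { start: cursor, length: text.length } });
    }
    cursor += text.length + 1; // +1 skips the '/' separator
  }
  return pieces;
}

const out = splitWithOffsets({
  text: 'IEEE/WIC/ACM',
  offset: { start: 0, length: 12 },
});
// out[1] → { text: 'WIC', offset: { start: 5, length: 3 } }
```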
