Skip to content

turian/topia.termextract

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This package determines important terms within a given piece of content. It
uses linguistic tools such as Parts-Of-Speech (POS) and some simple
statistical analysis to determine the terms and their strength.

NOTE: This is a fork by Joseph Turian of topia.termextract 1.1.0
CONTRIBUTIONS:
    * Unicode alphabetic characters are tokenized correctly.
    I changed TERM_SPEC in topic.termextract.tag:
        Old = [u'S', u'\xe3o', u'Paulo', u'was', u'home', u'to']
        New = [u'S\xe3o', u'Paulo', u'was', u'home', u'to']
    * extractor.extract() now has a parameter KEEP_ORIGINAL_SPACING=True,
    which allows you to keep the original spacing of the term:
        Old = [u'Mr . Smith']
        New = [u'Mr. Smith']
    * Fixed a bug where a term wouldn't be found if it was literally
    the last token of the sentence.
    * Fixed a bug (?) where unigram terms were included even if their
    tokens were part of a multiterm.

About

Updates to Zope's keyphrase extractor (forked from 1.1.0)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages