
Added comma condition to PunktWordTokeniser #746


Closed
smithsimonj wants to merge 1 commit

Conversation

smithsimonj

This addition to the word tokeniser splits words that are separated by commas, while leaving commas inside numbers intact. For example, the sentence:
'I have 13,000,000 bottles,in,my,bag'
tokenises as:
['I', 'have', '13,000,000', 'bottles', ',', 'in', ',', 'my', ',', 'bag']
rather than the awkward:
['I', 'have', '13,000,000', 'bottles,in,my,bag']

This is useful because commas without a following space occur quite a lot in blogs and tweets, and this has caused errors in my processing chain: we once had some 1600 words joined in this way, leading to one massive word being created...
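
For illustration only, here is a minimal standalone sketch of the behaviour described above. It is not the actual patch (which changes the tokenizer's internal regular expressions); the split_commas helper and its regex are hypothetical:

import re

# Split off a comma unless it sits between two digits (so 13,000,000 stays intact).
_COMMA = re.compile(r'(?<!\d),|,(?!\d)')

def split_commas(token):
    # 'bottles,in,my,bag' -> ['bottles', ',', 'in', ',', 'my', ',', 'bag']
    parts, last = [], 0
    for m in _COMMA.finditer(token):
        if token[last:m.start()]:
            parts.append(token[last:m.start()])
        parts.append(',')
        last = m.end()
    if token[last:]:
        parts.append(token[last:])
    return parts

print([p for tok in "I have 13,000,000 bottles,in,my,bag".split()
         for p in split_commas(tok)])
# ['I', 'have', '13,000,000', 'bottles', ',', 'in', ',', 'my', ',', 'bag']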

@jnothman
Contributor

The PunktWordTokenizer shouldn't be used for tokenizing words except within Punkt sentence boundary detection. Is that the context in which you are using it? I don't understand why a long token in dirty text is a substantial problem in that context.

@smithsimonj
Author

Yes I am using the Punkt sentence tokeniser prior to this; I sent you an example document via email which it does not do well on. It is relatively common for people to miss out the space between comma-separated words in, e.g., social media data. Personally I think the splitting should be in the tokeniser, as otherwise I (and presumably others) would have to add an additional tokeniser after the NLTK one. If the performance hit is negligible, that is!

@jnothman
Contributor

Yes I am using the Punkt sentence tokeniser prior to this

I don't know what you mean by this. After performing sentence boundary detection with the Punkt sentence tokenizer, you should use another tokenizer, not the PunktWordTokenizer, which in my opinion should only be used internal to the Punkt sentence tokenizer -- it was never designed for public use.

So if you are using it external to the PunktSentenceTokenizer then IMO this is a documentation error: PunktWordTokenizer should not be available for public use, and its coverage in that module's docstring, for instance, should probably be deleted. Googling it, there is a raft of inappropriate usage of this class online, and I think it should be explicitly taken out of the public API.


@jnothman
Contributor

I now remind myself that I made PunktWordTokenizer obsolete in my changes to PunktSentenceTokenizer: it was overkill when the tokens immediately surrounding a period / full-stop were the only ones necessary to extract.

To be clear, the PunktWordTokenizer used to be necessitated by the collection of statistics over token sequences when training Punkt to disambiguate the use of a period / full-stop as either sentence-internal or sentence-final. It should serve no other purpose, although I recall that it was left in the public API because someone had deemed it useful in and of itself. I hold the opinion that it should not be used.

And in the context of sentence boundary detection, I don't see how its non-splitting on commas in a long list of comma-separated items is relevant.


@smithsimonj
Author

Right ok, that is unfortunate... I'm doing this sort of thing:

for sentence, s_start, s_end in punkt_sent_tokeniser.span_tokenizer_generator(text):
    tokens = punkt_word_tokeniser.span_tokenize_list(sentence, s_start)

@jnothman
Contributor

Sure, but what benefit does PunktWordTokenizer provide over other tokenizers in NLTK?


@smithsimonj
Author

Well I agree with you, but you could see how I would draw my conclusions from the docs, e.g.:

https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html

Personally, I thought picking a 'matching pair' of tokenisers would be safer as they would be designed to work together.

Thanks for the advice anyway, I will look at switching it out of our code or creating a local copy - though it does work really quite well.

@jnothman
Contributor

np. Most "word tokenizers" that I know of operate over sentences, so one
sentence boundary detection has been performed (unless it's performed
together with word tokenization), any tokenizer appropriate to the language
that expects sentences as input should do.
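
A minimal sketch of the pipeline being recommended here, for illustration only (it assumes the stock English Punkt model behind nltk.sent_tokenize and the Treebank word tokenizer; the example text is made up):

import nltk
from nltk.tokenize import TreebankWordTokenizer

text = "I have 13,000,000 bottles. They are in my bag."
word_tok = TreebankWordTokenizer()

# Sentence boundary detection first, then a sentence-level word tokenizer on each sentence.
tokens = [tok
          for sent in nltk.sent_tokenize(text)
          for tok in word_tok.tokenize(sent)]
print(tokens)
# expected: ['I', 'have', '13,000,000', 'bottles', '.', 'They', 'are', 'in', 'my', 'bag', '.']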


@stevenbird
Member

The Google Code version of the HOWTO documentation is stale, and the project there is flagged as closed. I would have thought that the repository would become hidden. Instead I have scheduled nltk@googlecode for deletion. Now we will lose the redirection information.

Here is the current version of that file: http://www.nltk.org/howto/tokenize.html. We've been migrating its contents into the code, to be accessed using help().

I'll remove the stale PunktWordTokenizer and its documentation to avoid this problem arising in future.

stevenbird added a commit that referenced this pull request Sep 23, 2014
stevenbird closed this Sep 23, 2014
@markuskiller

What is the recommended way to tokenize German text, using nltk's builtin tokenizers? So far, I've used a combination of PunktSentenceTokenizer and PunktWordTokenizer (i.e. the changes to the nltk.tokenize API will break my implementation):

import nltk

tokens = []

sent_tok = nltk.tokenize.load('tokenizers/punkt/german.pickle')
word_tok = nltk.tokenize.punkt.PunktWordTokenizer()

text = "Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag. Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider habe ich nur noch EUR 3.50 in meiner Brieftasche."

sents = sent_tok.tokenize(text)

for s in sents:
    # PunktWordTokenizer separates text into tokens
    # leaving all periods attached to words, but separating off other punctuation.
    s_tokens = word_tok.tokenize(s)
    # separate periods from last word of a sentence
    last_word = s_tokens[-1]
    if last_word.endswith('.') and len(last_word) > 1:
        s_tokens = s_tokens[:-1] + [last_word[:-1], '.']     
    tokens.extend(s_tokens)

print(tokens)
#Output
['Heute', 'ist', 'der', '3.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert', 'seinen', '43.', 'Geburtstag', '.', 'Ich', 'muss',  'unbedingt', 'daran', 'denken', ',', 'Mehl', ',', 'usw.', 'für', 'einen', 'Kuchen', 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch', 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

Are there any builtin word tokenizers that use German models (i.e. do not separate the period from ordinal numbers AND use the same list of German abbreviations that is used in the German PunktSentenceTokenizer model)? If I use word_tokenize, the period will be separated from ordinal numbers and from unknown German abbreviations (e.g. "usw."):

#Output, using nltk.tokenize.word_tokenize
['Heute', 'ist', 'der', '3', '.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert', 'seinen', '43', '.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran', 'denken', ',', 'Mehl', ',', 'usw', '.', 'für', 'einen', 'Kuchen', 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch', 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

@jnothman
Contributor

jnothman commented Oct 2, 2014

As far as I'm concerned that's a bug in word_tokenize. Abbreviation lists aren't used by the tokenizers; they assume they are passed sentences, and by convention should leave a period attached to any token that is not sentence-final.

@jnothman
Contributor

jnothman commented Oct 2, 2014

Abbreviation lists aren't used by the tokenizers

I should clarify: the PTB tokenizer that word_tokenize uses knows some English contractions and expands them to be standalone words (e.g. 'll, 've, 's, n't).
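
A small illustration of both points, not from the original thread (output shown is approximate):

from nltk.tokenize import TreebankWordTokenizer

tb = TreebankWordTokenizer()
# The non-final abbreviation period stays attached, the sentence-final period is
# split off, and English contractions become standalone tokens.
print(tb.tokenize("Dr. Smith can't come, she's busy."))
# roughly: ['Dr.', 'Smith', 'ca', "n't", 'come', ',', 'she', "'s", 'busy', '.']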

@markuskiller

Thanks for your quick reply. In this case, if abbreviations don't come into play, I don't understand why the default tokenizer used in nltk.tokenize.word_tokenize would leave the period attached to Dr. (Element 9 in list above) and separate it from usw (Element 24, German for 'etc.'). I suspect that there must be regular expressions at work that are clearly intended for use with English-language sentence input.

@markuskiller

@jnothman in your comment 10 days ago you mentioned 'once sentence boundary detection has been performed [...], any tokenizer appropriate to the language that expects sentences as input should do.' My question now is: is there a builtin tokenizer appropriate for German? Because if there isn't, the PunktWordTokenizer (now made obsolete) was pretty close to what we need (except for the sentence-final period, which could certainly be fixed more elegantly and efficiently than in my example above).

@jnothman
Contributor

jnothman commented Oct 2, 2014

why the default tokenizer used in nltk.tokenize.word_tokenize would leave the period attached to Dr. (Element 9 in list above) and separate it from usw

After spending some time investigating this as a bug, I realise now that the problem is that word_tokenize first applies an English Punkt sentence tokenizer model (which is a bit cruel and not clearly documented). So where you see the full-stop split off, it is because that Punkt model split the sentence there (in ignorance).

Applying the TreebankWordTokenizer directly shouldn't give you that problem. It is intended to be a port of the sed script at http://www.cis.upenn.edu/~treebank/tokenizer.sed (an approximation of the Penn Treebank tokenisation, with known limitations), which states, as I did: "Assume sentence tokenization has been done first, so split FINAL periods only."

The TreebankWordTokenizer is still designed for English, but shouldn't do terribly on other Latin-alphabet languages.

@markuskiller

Thanks a lot for investigating the problem. The proposed solution works perfectly:

import nltk

text = "Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen " \
       "43. Geburtstag. Ich muss unbedingt daran denken, Mehl, " \
       "usw. für einen Kuchen einzukaufen. Aber leider habe ich " \
       "nur noch EUR 3.50 in meiner Brieftasche."

sent_tok = nltk.tokenize.load('tokenizers/punkt/german.pickle')
word_tok = nltk.tokenize.TreebankWordTokenizer()

sents = sent_tok.tokenize(text)

tokens = []
for s in sents:
    # alternatively, you could use tokens.append to keep a nested
    # list of tokens per sentence
    tokens.extend(word_tok.tokenize(s))
print(tokens)

#Output
['Heute', 'ist', 'der', '3.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert',
 'seinen', '43.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran',
 'denken', ',', 'Mehl', ',', 'usw.', 'für', 'einen', 'Kuchen',
 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch',
 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

markuskiller added a commit to markuskiller/textblob-de that referenced this pull request Oct 2, 2014
pquentin pushed a commit to pquentin/nltk that referenced this pull request Oct 3, 2014

stevenbird added a commit that referenced this pull request Mar 11, 2015
inteldict pushed a commit to inteldict/nltk that referenced this pull request Jul 15, 2015
peschue added a commit to knowlp/nltk_contrib that referenced this pull request Nov 24, 2015
@ghost

ghost commented Aug 10, 2017

Hi Team,

I am trying to learn data analysis using Python.

I tried to label data in Python and got this error:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import punktSentenceTokenizer

train_text = state_union.raw ("2005-GWBush.txt")
sample_text = state_union.raw ("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

============== RESTART: C:/Users/Chakir/Desktop/mypyprogram.py ==============
Traceback (most recent call last):
File "C:/Users/Chakir/Desktop/mypyprogram.py", line 3, in
from nltk.tokenize import punktSentenceTokenizer
ImportError: cannot import name 'punktSentenceTokenizer'

I don't understand. Could you please help?

Thanks
