
Added comma condition to PunktWordTokeniser #746


Closed
smithsimonj wants to merge 1 commit

Conversation

smithsimonj

This addition to the word tokeniser splits words that are separated by commas, while leaving commas inside numbers intact. For example, the sentence:
'I have 13,000,000 bottles,in,my,bag'
tokenises as:
['I', 'have', '13,000,000', 'bottles', ',', 'in', ',', 'my', ',', 'bag']
rather than the awkward:
['I', 'have', '13,000,000', 'bottles,in,my,bag']

This is useful because commas without a following space occur quite a lot in blogs and tweets, and this has caused errors in my processing chain: we once had some 1600 words joined in this way, leading to one massive word being created...
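
For illustration only, here is a minimal standalone sketch of the behaviour described above. It is not the actual patch (which changes the tokenizer's internal regular expressions); the split_commas helper and its regex are hypothetical:

import re

# Split off a comma unless it sits between two digits (so 13,000,000 stays intact).
_COMMA = re.compile(r'(?<!\d),|,(?!\d)')

def split_commas(token):
    # 'bottles,in,my,bag' -> ['bottles', ',', 'in', ',', 'my', ',', 'bag']
    parts, last = [], 0
    for m in _COMMA.finditer(token):
        if token[last:m.start()]:
            parts.append(token[last:m.start()])
        parts.append(',')
        last = m.end()
    if token[last:]:
        parts.append(token[last:])
    return parts

print([p for tok in "I have 13,000,000 bottles,in,my,bag".split()
         for p in split_commas(tok)])
# ['I', 'have', '13,000,000', 'bottles', ',', 'in', ',', 'my', ',', 'bag']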

@jnothman
Contributor

The PunktWordTokenizer shouldn't be used for tokenizing words except within Punkt sentence boundary detection. Is that the context in which you are using it? I don't understand why a long token in dirty text is a substantial problem in that context.

@smithsimonj
Author

Yes I am using the Punkt sentence tokeniser prior to this; I sent you an example document via email which it does not do well on. It is relatively common for people to miss out the space between comma-separated words in, e.g., social media data. Personally I think the splitting should be in the tokeniser, as otherwise I (and presumably others) would have to add an additional tokeniser after the NLTK one. If the performance hit is negligible, that is!

@jnothman
Contributor

Yes I am using the Punkt sentence tokeniser prior to this

I don't know what you mean by this. After performing sentence boundary detection with the Punkt sentence tokenizer, you should use another tokenizer, not the PunktWordTokenizer, which in my opinion should only be used internal to the Punkt sentence tokenizer -- it was never designed for public use.

So if you are using it external to the PunktSentenceTokenizer then IMO this is a documentation error: PunktWordTokenizer should not be available for public use, and its coverage in that module's docstring, for instance, should probably be deleted. Googling it, there is a raft of inappropriate usage of this class online, and I think it should be explicitly taken out of the public API.


@jnothman
Contributor

I now remind myself that I made PunktWordTokenizer obsolete in my changes to PunktSentenceTokenizer: it was overkill when the tokens immediately surrounding a period / full-stop were the only ones necessary to extract.

To be clear, the PunktWordTokenizer used to be necessitated by the collection of statistics over token sequences when training Punkt to disambiguate the use of a period / full-stop as either sentence-internal or sentence-final. It should serve no other purpose, although I recall that it was left in the public API because someone had deemed it useful in and of itself. I hold the opinion that it should not be used.

And in the context of sentence boundary detection, I don't see how its non-splitting on commas in a long list of comma-separated items is relevant.


@smithsimonj
Author

Right ok, that is unfortunate... I'm doing this sort of thing:

for sentence, s_start, s_end in punkt_sent_tokeniser.span_tokenizer_generator(text):
    tokens = punkt_word_tokeniser.span_tokenize_list(sentence, s_start)

@jnothman
Contributor

Sure, but what benefit does PunktWordTokenizer provide over other tokenizers in NLTK?


@smithsimonj
Author

Well I agree with you, but you could see how I would draw my conclusions from the docs, e.g.:

https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html

Personally, I thought picking a 'matching pair' of tokenisers would be safer as they would be designed to work together.

Thanks for the advice anyway, I will look at switching it out of our code or creating a local copy - though it does work really quite well.

@jnothman
Contributor

np. Most "word tokenizers" that I know of operate over sentences, so one
sentence boundary detection has been performed (unless it's performed
together with word tokenization), any tokenizer appropriate to the language
that expects sentences as input should do.
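
A minimal sketch of the pipeline being recommended here, for illustration only (it assumes the stock English Punkt model behind nltk.sent_tokenize and the Treebank word tokenizer; the example text is made up):

import nltk
from nltk.tokenize import TreebankWordTokenizer

text = "I have 13,000,000 bottles. They are in my bag."
word_tok = TreebankWordTokenizer()

# Sentence boundary detection first, then a sentence-level word tokenizer on each sentence.
tokens = [tok
          for sent in nltk.sent_tokenize(text)
          for tok in word_tok.tokenize(sent)]
print(tokens)
# expected: ['I', 'have', '13,000,000', 'bottles', '.', 'They', 'are', 'in', 'my', 'bag', '.']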


@stevenbird
Member

The Google Code version of the HOWTO documentation is stale, and the project there is flagged as closed. I would have thought that the repository would become hidden. Instead I have scheduled nltk@googlecode for deletion. Now we will lose the redirection information.

Here is the current version of that file: http://www.nltk.org/howto/tokenize.html. We've been migrating its contents into the code, to be accessed using help().

I'll remove the stale PunktWordTokenizer and its documentation to avoid this problem arising in future.

stevenbird added a commit that referenced this pull request Sep 23, 2014
stevenbird closed this Sep 23, 2014
@markuskiller

What is the recommended way to tokenize German text, using nltk's builtin tokenizers? So far, I've used a combination of PunktSentenceTokenizer and PunktWordTokenizer (i.e. the changes to the nltk.tokenize API will break my implementation):

import nltk

tokens = []

sent_tok = nltk.tokenize.load('tokenizers/punkt/german.pickle')
word_tok = nltk.tokenize.punkt.PunktWordTokenizer()

text = "Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag. Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider habe ich nur noch EUR 3.50 in meiner Brieftasche."

sents = sent_tok.tokenize(text)

for s in sents:
    # PunktWordTokenizer separates text into tokens
    # leaving all periods attached to words, but separating off other punctuation.
    s_tokens = word_tok.tokenize(s)
    # separate periods from last word of a sentence
    last_word = s_tokens[-1]
    if last_word.endswith('.') and len(last_word) > 1:
        s_tokens = s_tokens[:-1] + [last_word[:-1], '.']     
    tokens.extend(s_tokens)

print(tokens)
#Output
['Heute', 'ist', 'der', '3.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert', 'seinen', '43.', 'Geburtstag', '.', 'Ich', 'muss',  'unbedingt', 'daran', 'denken', ',', 'Mehl', ',', 'usw.', 'für', 'einen', 'Kuchen', 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch', 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

Are there any builtin word tokenizers that use German models (i.e. do not separate the period from ordinal numbers AND use the same list of German abbreviations that is used in the German PunktSentenceTokenizer model)? If I use word_tokenize, the period will be separated from ordinal numbers and from unknown German abbreviations (e.g. "usw."):

#Output, using nltk.tokenize.word_tokenize
['Heute', 'ist', 'der', '3', '.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert', 'seinen', '43', '.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran', 'denken', ',', 'Mehl', ',', 'usw', '.', 'für', 'einen', 'Kuchen', 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch', 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

@jnothman
Contributor

jnothman commented Oct 2, 2014

As far as I'm concerned that's a bug in word_tokenize. Abbreviation lists aren't used by the tokenizers; they assume they are passed sentences, and by convention should leave a period attached to any token that is not sentence-final.

@jnothman
Contributor

jnothman commented Oct 2, 2014

Abbreviation lists aren't used by the tokenizers

I should clarify: the PTB tokenizer that word_tokenize uses knows some English contractions and expands them to be standalone words (e.g. 'll, 've, 's, n't).
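
A small illustration of both points, not from the original thread (output shown is approximate):

from nltk.tokenize import TreebankWordTokenizer

tb = TreebankWordTokenizer()
# The non-final abbreviation period stays attached, the sentence-final period is
# split off, and English contractions become standalone tokens.
print(tb.tokenize("Dr. Smith can't come, she's busy."))
# roughly: ['Dr.', 'Smith', 'ca', "n't", 'come', ',', 'she', "'s", 'busy', '.']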

@markuskiller

Thanks for your quick reply. In this case, if abbreviations don't come into play, I don't understand why the default tokenizer used in nltk.tokenize.word_tokenize would leave the period attached to Dr. (Element 9 in list above) and separate it from usw (Element 24, German for 'etc.'). I suspect that there must be regular expressions at work that are clearly intended for use with English-language sentence input.

@markuskiller

@jnothman in your comment 10 days ago you mentioned 'once sentence boundary detection has been performed [...], any tokenizer appropriate to the language that expects sentences as input should do.' My question now is: is there a builtin tokenizer appropriate for German? Because if there isn't, the PunktWordTokenizer (now made obsolete) was pretty close to what we need (except for the sentence-final period, which could certainly be fixed more elegantly and efficiently than in my example above).

@jnothman
Contributor

jnothman commented Oct 2, 2014

why the default tokenizer used in nltk.tokenize.word_tokenize would leave the period attached to Dr. (Element 9 in list above) and separate it from usw

After spending some time investigating this as a bug, I realise now that the problem is that word_tokenize first applies an English Punkt sentence tokenizer model (which is a bit cruel and not clearly documented). So where you see the full-stop split off, it is because that Punkt model split the sentence there (in ignorance).

Applying the TreebankWordTokenizer directly shouldn't give you that problem. It is intended to be a port of the sed script at http://www.cis.upenn.edu/~treebank/tokenizer.sed (an approximation of the Penn Treebank tokenisation, with known limitations), which states, as I did: "Assume sentence tokenization has been done first, so split FINAL periods only."

The TreebankWordTokenizer is still designed for English, but shouldn't do terribly on other Latin-alphabet languages.

@markuskiller

Thanks a lot for investigating the problem. The proposed solution works perfectly:

import nltk

text = "Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen " \
       "43. Geburtstag. Ich muss unbedingt daran denken, Mehl, " \
       "usw. für einen Kuchen einzukaufen. Aber leider habe ich " \
       "nur noch EUR 3.50 in meiner Brieftasche."

sent_tok = nltk.tokenize.load('tokenizers/punkt/german.pickle')
word_tok = nltk.tokenize.TreebankWordTokenizer()

sents = sent_tok.tokenize(text)

tokens = []
for s in sents:
    # alternatively, you could use tokens.append to keep a nested
    # list of tokens per sentence
    tokens.extend(word_tok.tokenize(s))
print(tokens)

#Output
['Heute', 'ist', 'der', '3.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert',
 'seinen', '43.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran',
 'denken', ',', 'Mehl', ',', 'usw.', 'für', 'einen', 'Kuchen',
 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch',
 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

markuskiller added a commit to markuskiller/textblob-de that referenced this pull request Oct 2, 2014
pquentin pushed a commit to pquentin/nltk that referenced this pull request Oct 3, 2014

stevenbird added a commit that referenced this pull request Mar 11, 2015
inteldict pushed a commit to inteldict/nltk that referenced this pull request Jul 15, 2015
peschue added a commit to knowlp/nltk_contrib that referenced this pull request Nov 24, 2015
@ghost

ghost commented Aug 10, 2017

Hi Team,

I am trying to learn data analysis using Python.

I tried to label data in Python and got this error:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import punktSentenceTokenizer

train_text = state_union.raw ("2005-GWBush.txt")
sample_text = state_union.raw ("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)

    except Exception as e:
        print(str(e))

============== RESTART: C:/Users/Chakir/Desktop/mypyprogram.py ==============
Traceback (most recent call last):
File "C:/Users/Chakir/Desktop/mypyprogram.py", line 3, in
from nltk.tokenize import punktSentenceTokenizer
ImportError: cannot import name 'punktSentenceTokenizer'

I don't understand. Could you please help?

Thanks
