Added comma condition to PunktWordTokeniser #746
Conversation
This addition to the word tokeniser splits words that are separated by commas but not numbers. For example the sentence:
'I have 13,000,000 bottles,in,my,bag'
tokenises as:
['I', 'have', '13,000,000', 'bottles', ',', 'in', ',', 'my', ',', 'bag']
rather than the awkward:
['I', 'have', '13,000,000', 'bottles,in,my,bag']
This is useful as this occurs quite a lot in blogs and tweets, and it has caused errors in my processing chain when we had some 1600 words joined in this way, leading to a massive word being created.
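To make the intended behaviour concrete, here is a minimal sketch of the idea (my own illustration, not the actual patch, which presumably adjusts the Punkt word-tokenisation regex): split on commas that are not sitting between two digits, and keep the comma itself as a token.

import re

# Sketch only: commas flanked by digits (as in 13,000,000) are left alone,
# while other commas are split off as separate tokens.
_STRAY_COMMA = re.compile(r'((?<!\d),(?!\d))')

def split_stray_commas(token):
    # drop empty strings produced by leading/trailing commas
    return [part for part in _STRAY_COMMA.split(token) if part]

print(split_stray_commas('13,000,000'))        # ['13,000,000']
print(split_stray_commas('bottles,in,my,bag')) # ['bottles', ',', 'in', ',', 'my', ',', 'bag']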
The […]
Yes, I am using the Punkt sentence tokeniser prior to this; I sent you an example document via email which it does not do well on. It is relatively common for people to miss out a space between comma-separated words in e.g. social media data. Personally I think it should be in the tokeniser, as otherwise I (and presumably others) would have to add an additional tokeniser after the NLTK one. If the performance hit is negligible, that is!
I don't know what you mean by this. After performing sentence boundary […] So if you are using it external to the PunktSentenceTokenizer then IMO this […]
I now remind myself that I made PunktWordTokenizer obsolete in my changes […] To be clear, the PunktWordTokenizer used to be necessitated by the […] And in the context of sentence boundary detection, I don't see how its […]
Right, OK, that is unfortunate... I'm doing this sort of thing:

for sentence, s_start, s_end in punkt_sent_tokeniser.span_tokenizer_generator(text):
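For reference, a sketch of how the same loop looks with the stock span_tokenize API (the span_tokenizer_generator call above looks project-specific); span_tokenize yields (start, end) character offsets for each sentence:

import nltk

# Illustration only: load the English Punkt model and iterate sentence spans.
punkt_sent_tokeniser = nltk.data.load('tokenizers/punkt/english.pickle')
text = "I have 13,000,000 bottles,in,my,bag. This happens a lot in tweets."
for s_start, s_end in punkt_sent_tokeniser.span_tokenize(text):
    sentence = text[s_start:s_end]
    print(s_start, s_end, sentence)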
Sure, but what benefit does PunktWordTokenizer provide over other word tokenizers?
Well, I agree with you, but you can see how I would draw my conclusions from the docs, e.g.: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html Personally, I thought picking a 'matching pair' of tokenisers would be safer as they would be designed to work together. Thanks for the advice anyway; I will look at switching it out of our code or creating a local copy, though it does work really quite well.
np. Most "word tokenizers" that I know of operate over sentences, so one […]
The Google Code version of the HOWTO documentation is stale, and the project there is flagged as closed. I would have thought that the repository would become hidden. Instead I have scheduled nltk@googlecode for deletion. Now we will lose the redirection information. Here is the current version of that file; we've been migrating contents into the code, to be accessed using […] I'll remove the stale […]
What is the recommended way to tokenize German text using nltk's builtin tokenizers? So far, I've used a combination of the German Punkt sentence tokenizer and the PunktWordTokenizer:

import nltk

tokens = []
sent_tok = nltk.tokenize.load('tokenizers/punkt/german.pickle')
word_tok = nltk.tokenize.punkt.PunktWordTokenizer()
text = "Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag. Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider habe ich nur noch EUR 3.50 in meiner Brieftasche."
sents = sent_tok.tokenize(text)
for s in sents:
    # PunktWordTokenizer separates text into tokens,
    # leaving all periods attached to words but separating off other punctuation.
    s_tokens = word_tok.tokenize(s)
    # separate the period from the last word of a sentence
    last_word = s_tokens[-1]
    if last_word.endswith('.') and len(last_word) > 1:
        s_tokens = s_tokens[:-1] + [last_word[:-1], '.']
    tokens.extend(s_tokens)
print(tokens)

#Output
['Heute', 'ist', 'der', '3.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert', 'seinen', '43.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran', 'denken', ',', 'Mehl', ',', 'usw.', 'für', 'einen', 'Kuchen', 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch', 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']

Are there any builtin word tokenizers that use German models (i.e. do not separate the period from ordinal numbers AND use the same list of German abbreviations that is used in the German Punkt sentence tokenizer)?

#Output, using nltk.tokenize.word_tokenize
['Heute', 'ist', 'der', '3', '.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert', 'seinen', '43', '.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran', 'denken', ',', 'Mehl', ',', 'usw', '.', 'für', 'einen', 'Kuchen', 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch', 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']
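As a side check (my own, and it relies on a private attribute that may change between NLTK versions), you can inspect which abbreviations the German Punkt model actually knows, which is what keeps tokens like 'usw.' from being treated as sentence boundaries:

import nltk

# Illustration only: peek at the learned abbreviation list of the German model.
sent_tok = nltk.data.load('tokenizers/punkt/german.pickle')
abbrevs = sent_tok._params.abbrev_types
print('usw' in abbrevs)
print(sorted(abbrevs)[:20])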
As far as I'm concerned that's a bug in […]
I should clarify: the PTB tokenizer that […]
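To illustrate what feeding the PTB-style tokenizer whole texts does (my own example, not from the thread): the Treebank tokenizer only detaches the period at the very end of its input, so sentence-internal periods stay glued to their words unless you split into sentences first.

from nltk.tokenize import TreebankWordTokenizer

tok = TreebankWordTokenizer()
# Two sentences fed at once: only the final '.' is split off,
# while 'eins.' keeps its sentence-internal period attached.
print(tok.tokenize("Das ist Satz eins. Das ist Satz zwei."))
# expected: ['Das', 'ist', 'Satz', 'eins.', 'Das', 'ist', 'Satz', 'zwei', '.']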
Thanks for your quick reply. In this case, if abbreviations don't come into play, I don't understand why the default tokenizer used in […]
@jnothman in your comment 10 days ago you mentioned 'once sentence boundary detection has been performed [...], any tokenizer appropriate to the language that expects sentences as input should do.' My question now is: is there a builtin tokenizer appropriate for German? Because if there isn't, the […]
After spending some time investigating this as a bug, I realise now that the problem is that […] Applying the […] The […]
Thanks a lot for investigating the problem. The proposed solution works perfectly:

import nltk

text = "Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen " \
       "43. Geburtstag. Ich muss unbedingt daran denken, Mehl, " \
       "usw. für einen Kuchen einzukaufen. Aber leider habe ich " \
       "nur noch EUR 3.50 in meiner Brieftasche."

sent_tok = nltk.tokenize.load('tokenizers/punkt/german.pickle')
word_tok = nltk.tokenize.TreebankWordTokenizer()

sents = sent_tok.tokenize(text)
tokens = []
for s in sents:
    # alternatively you could use tokens.append for nested
    # lists of sentences
    tokens.extend(word_tok.tokenize(s))
print(tokens)

#Output
['Heute', 'ist', 'der', '3.', 'Mai', '2014', 'und', 'Dr.', 'Meier', 'feiert',
 'seinen', '43.', 'Geburtstag', '.', 'Ich', 'muss', 'unbedingt', 'daran',
 'denken', ',', 'Mehl', ',', 'usw.', 'für', 'einen', 'Kuchen',
 'einzukaufen', '.', 'Aber', 'leider', 'habe', 'ich', 'nur', 'noch',
 'EUR', '3.50', 'in', 'meiner', 'Brieftasche', '.']
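A note for later readers (my addition, and it assumes a newer NLTK release where word_tokenize accepts a language argument): the builtin wrapper can run the German Punkt model before word-tokenizing each sentence, which is essentially the pipeline shown above.

import nltk

# Assumes an NLTK version where word_tokenize takes language=;
# it sentence-splits with the German Punkt model first, then word-tokenizes.
text = ("Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen "
        "43. Geburtstag.")
print(nltk.word_tokenize(text, language='german'))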
…aced it with 'TreebankWordTokenizer'; see nltk/nltk#746 (comment) for details
Hi Team, I am trying to learn data analysis using Python. I tried to label data in Python and got this error.

import nltk
train_text = state_union.raw("2005-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():

============== RESTART: C:/Users/Chakir/Desktop/mypyprogram.py ==============

I don't understand. Could you please help? Thanks
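For what it's worth, here is a hypothetical, runnable version of that snippet (the imports, the sample_text, and the choice of the 2006 speech are my own assumptions filling in what the post leaves undefined; the corpus needs nltk.download('state_union') first):

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Assumed completion: train a Punkt model on one speech, tokenize another.
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
print(tokenized[:2])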