
What are the ways of treating missing values in XGBoost? #21

Closed
@naggar1

Description


Generally, does model performance get better with that?

Activity

Title changed from "What are the ways of treatng missing values in Xgboost?" to "What are the ways of treatng missing values in XGboost?" on Aug 12, 2014
tqchen (Member) commented on Aug 12, 2014

XGBoost naturally accepts a sparse feature format: you can feed the data in directly as a sparse matrix that contains only the non-missing values.

That is, features that are not present in the sparse feature matrix are treated as 'missing'. XGBoost handles this internally, and you do not need to do anything about it.
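
For illustration, here is a minimal sketch of that workflow (an editor's assumption, not from the thread: it uses the Python xgboost package with numpy and scipy, and invented data):

```python
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# 4 rows x 3 features; only the listed entries are stored.
# Every entry absent from the sparse matrix is treated as missing.
rows = np.array([0, 0, 1, 2, 3])
cols = np.array([0, 2, 1, 0, 2])
vals = np.array([1.0, 3.5, 2.0, 0.5, 1.2])
X = sp.csr_matrix((vals, (rows, cols)), shape=(4, 3))
y = np.array([0, 1, 0, 1])

# No imputation step is needed before training.
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 2}
booster = xgb.train(params, dtrain, num_boost_round=10)
```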

tqchen (Member) commented on Aug 12, 2014

Internally, XGBoost will automatically learn the best direction to send a missing value at each split. Equivalently, this can be viewed as automatically learning the best imputation for missing values, based on the reduction in training loss.
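
As a toy sketch of that idea (hypothetical helper names, not xgboost's actual implementation): for each candidate split, the rows with a missing value are tried on both sides, and the direction that gives the larger reduction in training loss is stored as the default.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Standard gradient-boosting split gain from gradient/hessian sums."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))

def choose_default_direction(g_l, h_l, g_r, h_r, g_miss, h_miss):
    """Try sending the missing rows left, then right; keep the better gain."""
    gain_left = split_gain(g_l + g_miss, h_l + h_miss, g_r, h_r)
    gain_right = split_gain(g_l, h_l, g_r + g_miss, h_r + h_miss)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Here the missing rows resemble the left child's rows (negative gradients),
# so the learned default direction is "left".
print(choose_default_direction(g_l=-4.0, h_l=3.0, g_r=5.0, h_r=4.0,
                               g_miss=-2.0, h_miss=1.5))
```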

tqchen (Member) commented on Aug 12, 2014

I haven't done a formal comparison with other methods, but I think it should be comparable, and it also gives a computational benefit when your feature matrix is sparse.

rkirana commented on Aug 29, 2014

Well - if values are not provided, it takes them as missing. So are all 0 values also treated as missing?

Example: a column has 25 values; 15 are 1, 5 are missing/NA, and 5 are 0.
Are the 5 + 5 = 10 values all treated as missing?

tqchen (Member) commented on Aug 29, 2014

It depends on how you present the data. If you put the data in LIBSVM format and explicitly list zero-valued features there, they will not be treated as missing.
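
To make the distinction concrete (invented file contents; this assumes a recent Python xgboost that accepts a `?format=libsvm` URI): any `index:value` pair written out in a LIBSVM line counts as observed, even when the value is 0, while omitted indices are missing.

```python
import xgboost as xgb

with open("explicit_zero.libsvm", "w") as f:
    f.write("1 1:0 2:3.5\n")   # feature 1 is observed with value 0
with open("omitted.libsvm", "w") as f:
    f.write("1 2:3.5\n")       # feature 1 is missing

# Same label, same value for feature 2, but feature 1 is an observed zero
# in the first matrix and a missing value in the second.
d_zero = xgb.DMatrix("explicit_zero.libsvm?format=libsvm")
d_miss = xgb.DMatrix("omitted.libsvm?format=libsvm")
```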

rkirana commented on Aug 30, 2014

It may be extremely difficult to list the 0 features in the case of sparse data. So should we avoid xgboost in cases where there are missing data and many 0 features?

maxliu commented on Aug 30, 2014

Just gave the code a quick glance (it is beautiful, by the way). The way you treat missing values is very interesting: it depends on what makes the tree better. Does this method/algorithm have a name?

tqchen (Member) commented on Aug 30, 2014

Normally, it is fine to treat missing and zero all as zero :)
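
For completeness, the Python `DMatrix` constructor also takes a `missing` argument choosing which sentinel value marks missing entries in dense input; it defaults to NaN. A sketch with invented data (the opposite convention, treating zeros as missing, is rarely what you want):

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, 0.0],
              [np.nan, 2.0]])
y = np.array([0, 1])

d_default = xgb.DMatrix(X, label=y)                    # NaN missing, 0 observed
d_zero_missing = xgb.DMatrix(X, label=y, missing=0.0)  # zeros treated as missing
```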


tqchen (Member) commented on Aug 30, 2014

I invented the protocol and tricks myself; maybe you can just call it xgboost. The general algorithm, however, fits into the framework of gradient boosting.


maxliu commented on Aug 30, 2014

I am not surprised by the speed of xgboost, but the score is also better than sklearn-GBR's. The trick with missing values might be one of the reasons.

Have you published any paper on the boosting algorithm you used for xgboost? Unlike random forest, I could not find much code for boosting with a parallel algorithm - may need to improve my Google skills though.

tqchen (Member) commented on Aug 30, 2014

I haven't yet published any paper describing xgboost.

For parallel boosted tree code, the only one I am aware of so far is
http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt. You can try it out
and compare it with xgb if you are interested.


Acriche commented on Jul 6, 2015

A follow-up question:

While I understand how XGBoost handles missing values within discrete variables, I'm not sure how it handles continuous (numeric) variables.
Can you please explain?

(17 remaining items not shown)
