
What are the ways of treating missing values in XGBoost? #21

Closed
@naggar1

Description


Generally, does model performance get better with that?

Activity

Title changed from "What are the ways of treatng missing values in Xgboost?" to "What are the ways of treatng missing values in XGboost?" on Aug 12, 2014
tqchen (Member) commented on Aug 12, 2014

XGBoost naturally accepts a sparse feature format: you can feed the data in directly as a sparse matrix that contains only the non-missing values.

That is, features that are not present in the sparse feature matrix are treated as 'missing'. XGBoost handles this internally, and you do not need to do anything about it.
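
For illustration, here is a minimal sketch of that workflow (an editor's assumption, not from the thread: it uses the Python xgboost package with numpy and scipy, and invented data):

```python
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# 4 rows x 3 features; only the listed entries are stored.
# Every entry absent from the sparse matrix is treated as missing.
rows = np.array([0, 0, 1, 2, 3])
cols = np.array([0, 2, 1, 0, 2])
vals = np.array([1.0, 3.5, 2.0, 0.5, 1.2])
X = sp.csr_matrix((vals, (rows, cols)), shape=(4, 3))
y = np.array([0, 1, 0, 1])

# No imputation step is needed before training.
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 2}
booster = xgb.train(params, dtrain, num_boost_round=10)
```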

tqchen (Member) commented on Aug 12, 2014

Internally, XGBoost will automatically learn the best direction to send a missing value at each split. Equivalently, this can be viewed as automatically learning the best imputation for missing values, based on the reduction in training loss.
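
As a toy sketch of that idea (hypothetical helper names, not xgboost's actual implementation): for each candidate split, the rows with a missing value are tried on both sides, and the direction that gives the larger reduction in training loss is stored as the default.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Standard gradient-boosting split gain from gradient/hessian sums."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))

def choose_default_direction(g_l, h_l, g_r, h_r, g_miss, h_miss):
    """Try sending the missing rows left, then right; keep the better gain."""
    gain_left = split_gain(g_l + g_miss, h_l + h_miss, g_r, h_r)
    gain_right = split_gain(g_l, h_l, g_r + g_miss, h_r + h_miss)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Here the missing rows resemble the left child's rows (negative gradients),
# so the learned default direction is "left".
print(choose_default_direction(g_l=-4.0, h_l=3.0, g_r=5.0, h_r=4.0,
                               g_miss=-2.0, h_miss=1.5))
```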

tqchen (Member) commented on Aug 12, 2014

I haven't done a formal comparison with other methods, but I think it should be comparable, and it also gives a computational benefit when your feature matrix is sparse.

rkirana commented on Aug 29, 2014

Well - if values are not provided, it takes them as missing. So are all 0 values also treated as missing?

Example: a column has 25 values; 15 are 1, 5 are missing/NA, and 5 are 0.
Are the 5 + 5 = 10 values all treated as missing?

tqchen (Member) commented on Aug 29, 2014

It depends on how you present the data. If you put the data in LIBSVM format and explicitly list zero-valued features there, they will not be treated as missing.
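
To make the distinction concrete (invented file contents; this assumes a recent Python xgboost that accepts a `?format=libsvm` URI): any `index:value` pair written out in a LIBSVM line counts as observed, even when the value is 0, while omitted indices are missing.

```python
import xgboost as xgb

with open("explicit_zero.libsvm", "w") as f:
    f.write("1 1:0 2:3.5\n")   # feature 1 is observed with value 0
with open("omitted.libsvm", "w") as f:
    f.write("1 2:3.5\n")       # feature 1 is missing

# Same label, same value for feature 2, but feature 1 is an observed zero
# in the first matrix and a missing value in the second.
d_zero = xgb.DMatrix("explicit_zero.libsvm?format=libsvm")
d_miss = xgb.DMatrix("omitted.libsvm?format=libsvm")
```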

rkirana commented on Aug 30, 2014

It may be extremely difficult to list the 0 features in the case of sparse data. So should we avoid xgboost in cases where there are missing data and many 0 features?

maxliu commented on Aug 30, 2014

Just gave the code a quick glance (it is beautiful, by the way). The way you treat missing values is very interesting: it depends on what makes the tree better. Does this method/algorithm have a name?

tqchen (Member) commented on Aug 30, 2014

Normally, it is fine to treat missing and zero all as zero :)
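
For completeness, the Python `DMatrix` constructor also takes a `missing` argument choosing which sentinel value marks missing entries in dense input; it defaults to NaN. A sketch with invented data (the opposite convention, treating zeros as missing, is rarely what you want):

```python
import numpy as np
import xgboost as xgb

X = np.array([[1.0, 0.0],
              [np.nan, 2.0]])
y = np.array([0, 1])

d_default = xgb.DMatrix(X, label=y)                    # NaN missing, 0 observed
d_zero_missing = xgb.DMatrix(X, label=y, missing=0.0)  # zeros treated as missing
```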


tqchen (Member) commented on Aug 30, 2014

I invented the protocol and tricks myself; maybe you can just call it xgboost. The general algorithm, however, fits into the framework of gradient boosting.


maxliu commented on Aug 30, 2014

I am not surprised by the speed of xgboost, but the score is also better than sklearn-GBR's. The trick with missing values might be one of the reasons.

Have you published any paper on the boosting algorithm you used for xgboost? Unlike random forest, I could not find much code for boosting with a parallel algorithm - may need to improve my Google skills though.

tqchen (Member) commented on Aug 30, 2014

I haven't yet published any paper describing xgboost.

For parallel boosted tree code, the only one I am aware of so far is
http://machinelearning.wustl.edu/pmwiki.php/Main/Pgbrt. You can try it out
and compare it with xgb if you are interested.


Acriche commented on Jul 6, 2015

A follow-up question:

While I understand how XGBoost handles missing values within discrete variables, I'm not sure how it handles continuous (numeric) variables.
Can you please explain?

(17 remaining items not shown)
