Slim Retraining #1877

Closed

eshirima opened this issue Jul 6, 2017 · 6 comments


eshirima commented Jul 6, 2017

I finally started the training process of the object detection API on my own dataset. Since none of the currently available models contained my object class, I removed the checkpoint options from my configuration file.
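
For reference, the checkpoint fields I removed from train_config looked roughly like this in the stock sample config (the path is the sample's placeholder):

fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
from_detection_checkpoint: true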

A snapshot of the logged info so far:

INFO:tensorflow:global step 12470: loss = 2.5983 (5.387 sec/step)
INFO:tensorflow:global step 12471: loss = 3.4173 (5.339 sec/step)
INFO:tensorflow:global step 12472: loss = 1.7505 (5.242 sec/step)
INFO:tensorflow:global step 12473: loss = 2.2654 (5.029 sec/step)
INFO:tensorflow:global step 12474: loss = 1.7415 (5.705 sec/step)
INFO:tensorflow:global step 12475: loss = 2.6800 (5.247 sec/step)
INFO:tensorflow:global step 12476: loss = 2.0485 (5.179 sec/step)
INFO:tensorflow:global step 12477: loss = 2.3963 (5.116 sec/step)
INFO:tensorflow:global step 12478: loss = 1.9042 (5.262 sec/step)
INFO:tensorflow:global step 12479: loss = 1.8263 (5.185 sec/step)

From step 0 until now, my loss has decreased dramatically, but for the past couple of hours it has been fluctuating between 1 and 2. My questions are:

  1. Am I at risk of overfitting my model?
  2. Is there a way of knowing the remaining number of global steps?

My config file:

# SSD with Inception v2 configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
        reduce_boxes_in_lowest_layer: true
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_inception_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
          anchorwise_output: true
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          anchorwise_output: true
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 10
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "#####/models/object_detection/pascal_train.record"
  }
  label_map_path: "#####/models/object_detection/data/mine_label_map.pbtxt"
}

eval_config: {
  num_examples: 58
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "#####/models/object_detection/pascal_val.record"
  }
  label_map_path: "#####/models/object_detection/data/mine_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

jch1 commented Jul 6, 2017

Hi @eshirima - the best way is to just run the eval.py binary. We typically run it in parallel with training, pointing it at the directory holding the checkpoint that is being trained. eval.py writes logs to an eval_dir that you specify, which you can then point TensorBoard at.

You want to see the mAP "lift off" in the first few hours, and then watch for when it converges. Without looking at those plots, it's hard to tell how many steps you need.
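
Concretely, the parallel setup looks roughly like this (a sketch; paths are placeholders, and the flags match the 2017-era binaries):

# Terminal 1: training
python object_detection/train.py \
  --logtostderr \
  --pipeline_config_path=PATH_TO/pipeline.config \
  --train_dir=PATH_TO/train_dir

# Terminal 2: continuous evaluation of the latest checkpoint in train_dir
python object_detection/eval.py \
  --logtostderr \
  --pipeline_config_path=PATH_TO/pipeline.config \
  --checkpoint_dir=PATH_TO/train_dir \
  --eval_dir=PATH_TO/eval_dir

# Then point TensorBoard at the parent of both directories
tensorboard --logdir=PATH_TO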


eshirima commented Jul 6, 2017

@jch1 Thank you so much for your help both here and on SO!!! I just finished my training and it works really well.

eshirima closed this as completed Jul 6, 2017
oscarorti commented

Hi @eshirima, I'm trying to retrain the SSD model with my own dataset and my loss reaches around 2-5, but when I run prediction nothing is detected because the scores are around 0.01. I'm training a binary classifier like you. Could you tell me what your label map looks like and how you created the TFRecord file? I followed the TensorFlow tutorial, but now I really don't know where the error is; my config file is like yours.
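
For reference, my label map follows the standard pbtxt format, roughly like this (class names here are placeholders):

item {
  id: 1  # ids start at 1; 0 is reserved for the background class
  name: 'class_one'
}

item {
  id: 2
  name: 'class_two'
}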

Thanks

eshirima commented

@oscarorti I shared my experience on SO.


Abduoit commented Aug 22, 2017

Hi @jch1 and @eshirima,
I checked this helpful link, but I have a question, please.

I have 8 GB of GPU memory, and train.py runs perfectly on its own. However, following your suggestion, I cannot run train.py and eval.py at the same time; it fails for memory reasons. How can I do that?
Thanks


Abduoit commented Aug 23, 2017

The solution for running train.py and eval.py at the same time on a single GPU is to add the following lines to train.py:

def main(_):
  # Cap this process's GPU allocator at 50% of device memory. In TF1 the
  # first session created in a process fixes the allocator's size, so the
  # session that slim later creates for training inherits the cap.
  gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
  sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

This caps the training process at 50% of GPU memory; eval.py can then run in the remaining memory.
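
Another option (same idea, a sketch): use allow_growth so TensorFlow allocates GPU memory on demand instead of reserving a fixed fraction up front, leaving whatever training does not use free for eval.py:

def main(_):
  # Allocate GPU memory lazily instead of grabbing a fixed share at startup.
  gpu_options = tf.GPUOptions(allow_growth=True)
  sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))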
