Closed
Description
I finally started training the Object Detection API on my own dataset. Since none of the currently available pre-trained models included my object class, I removed the checkpoint options from my configuration file.
A snippet of the logged info so far:
INFO:tensorflow:global step 12470: loss = 2.5983 (5.387 sec/step)
INFO:tensorflow:global step 12471: loss = 3.4173 (5.339 sec/step)
INFO:tensorflow:global step 12472: loss = 1.7505 (5.242 sec/step)
INFO:tensorflow:global step 12473: loss = 2.2654 (5.029 sec/step)
INFO:tensorflow:global step 12474: loss = 1.7415 (5.705 sec/step)
INFO:tensorflow:global step 12475: loss = 2.6800 (5.247 sec/step)
INFO:tensorflow:global step 12476: loss = 2.0485 (5.179 sec/step)
INFO:tensorflow:global step 12477: loss = 2.3963 (5.116 sec/step)
INFO:tensorflow:global step 12478: loss = 1.9042 (5.262 sec/step)
INFO:tensorflow:global step 12479: loss = 1.8263 (5.185 sec/step)
From step 0 until now, my loss has decreased dramatically, but for the past couple of hours it has been fluctuating between 1 and 2. My questions are:
- Am I at risk of overfitting my model?
- Is there a way of knowing the remaining number of global steps? (See the sketch below.)
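(For anyone landing here: by default train.py runs until stopped, so there is no fixed "remaining" step count unless you set a budget. A minimal sketch, assuming the standard train_config proto, which supports a num_steps field:)

# Sketch: capping training at a fixed number of global steps.
train_config: {
  num_steps: 200000   # hypothetical budget; training stops at this global step
  # ... rest of train_config unchanged ...
}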
My config file:
# SSD with Inception v2 configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  ssd {
    num_classes: 2
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
        reduce_boxes_in_lowest_layer: true
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 3
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_inception_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
          anchorwise_output: true
        }
      }
      localization_loss {
        weighted_smooth_l1 {
          anchorwise_output: true
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}
train_config: {
  batch_size: 10
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "#####/models/object_detection/pascal_train.record"
  }
  label_map_path: "#####/models/object_detection/data/mine_label_map.pbtxt"
}
eval_config: {
  num_examples: 58
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "#####/models/object_detection/pascal_val.record"
  }
  label_map_path: "#####/models/object_detection/data/mine_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Activity
jch1 commented on Jul 6, 2017
Hi @eshirima - the best way is to just run the eval.py binary. We typically run this binary in parallel to training, pointing it at the directory holding the checkpoint that is being trained. The eval.py binary will write logs to an eval_dir that you specify, which you can then point to with TensorBoard. You want to see that the mAP has "lifted off" in the first few hours, and then you want to see when it converges. It's hard to tell how many steps you need without looking at these plots.
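(For reference, the 2017-era invocation looked roughly like the following; all paths here are placeholders and flag names follow the object_detection/eval.py of that time:)

# Run evaluation in parallel with training; paths are placeholders.
python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=path/to/your_pipeline.config \
    --checkpoint_dir=path/to/train_dir \
    --eval_dir=path/to/eval_dir

# Point TensorBoard at the parent directory to see both training and eval curves.
tensorboard --logdir=path/to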
eshirima commented on Jul 6, 2017
@jch1 Thank you so much for your help both here and on SO!!! I just finished my training and it works really well.
oscarorti commented on Jul 18, 2017
Hi @eshirima, I'm trying to retrain the SSD model with my own dataset and my loss stays around 2-5, but when I run prediction nothing is detected because the scores are around 0.01. I'm training a binary classifier like you. Could you tell me what your label map looks like and how you created your TFRecord file? (I followed the TensorFlow tutorial.) I really don't know where the error is; my config file is like yours.
Thanks
eshirima commented on Jul 18, 2017
@oscarorti I shared my experience on SO
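(For reference, a label map for a two-class setup like the config above is just a small pbtxt file; the class names below are placeholders, and IDs start at 1 because 0 is reserved for the background class:)

item {
  id: 1
  name: 'my_object'
}
item {
  id: 2
  name: 'other_object'
}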
Abduoit commented on Aug 22, 2017
Hi @jch1 and @eshirima
I checked this helpful link, but I have a question, please.
I have 8 GB of GPU memory and can run train.py perfectly on its own.
However, based on your suggestion, I cannot run train.py and eval.py at the same time; it fails for memory reasons. So how do I do that?
Thx
Abduoit commented on Aug 23, 2017
The solution for running train.py and eval.py at the same time on a single GPU is to cap the training session's memory allocation in the train.py file. This limits the training process to 50% of GPU memory, so eval.py can run with the rest.
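(The original snippet wasn't preserved above; a common TF1-era way to do this, sketched here, is to set per_process_gpu_memory_fraction on the session config that train.py hands to its session:)

import tensorflow as tf

# Let the training session claim only half of the GPU's memory,
# leaving the remainder free for a concurrently running eval.py.
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5

# Pass this config to whatever creates the session in train.py
# (the exact plumbing depends on the train.py version in use).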