
Confusion about total loss is negative #18

Open
lixiangpengcs opened this issue Sep 12, 2019 · 28 comments

@lixiangpengcs

I am new to OCR. I ran your code, and both the total Loss and the Recognition Loss are negative while the Detection Loss is positive. I am a little confused by this result. Is it correct?

Train Epoch: 3 [576/856 (67%)] Loss: -4.654196 Detection Loss: 0.016792 Recognition Loss:-4.670988
Train Epoch: 3 [592/856 (69%)] Loss: -3.796435 Detection Loss: 0.017061 Recognition Loss:-3.813496
Train Epoch: 3 [608/856 (71%)] Loss: -4.000570 Detection Loss: 0.014988 Recognition Loss:-4.015558

Besides, the evaluation results for the first few epochs are all 0. It is really strange:

['img_100.jpg', 'img_213.jpg', 'img_624.jpg', 'img_362.jpg', 'img_491.jpg', 'img_469.jpg'] Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for cudnn_ctc_loss)
epoch : 3
loss : -3.987356729596575
det_loss : 0.021637324957507795
rec_loss : -4.008994055685596
precious : 0.0
recall : 0.0
hmean : 0.0
val_loss : 0.0
val_det_loss : 0.0
val_rec_loss : 0.0
val_precious : 0.0
val_recall : 0.0
val_hmean : 0.0
Saving checkpoint: ./saved_model/united_2019-09-12/checkpoint-epoch003-loss--3.9874.pth.tar ...

@novioleo
Owner

novioleo commented Sep 12, 2019

#17 It is common for the CTC loss to come out negative; you can search for it on Google.
^.^
As your error message above mentions, you need to convert your tensors to CPU; the ground-truth label tensor and the prediction tensor are on inconsistent devices.
@lixiangpengcs
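For reference, a minimal sketch of that device fix, assuming the standard torch.nn.functional.ctc_loss call (all shapes and names below are illustrative, not the repository's actual code). The cuDNN CTC path keeps the log-probs on CUDA but wants the targets and lengths as int32 tensors on CPU:

```python
import torch
import torch.nn.functional as F

T, N, C = 50, 4, 37                                        # time steps, batch, classes (blank=0)
log_probs = torch.randn(T, N, C, device="cuda").log_softmax(2)
targets = torch.randint(1, C, (N, 10), dtype=torch.int32)  # on CPU, no blank labels
input_lengths = torch.full((N,), T, dtype=torch.int32)     # on CPU
target_lengths = torch.full((N,), 10, dtype=torch.int32)   # on CPU

# Passing GPU targets here triggers the "Expected tensor to have CPU Backend"
# error; keeping them on CPU (as above, or via .cpu()) resolves it.
loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                  blank=0, reduction="mean")
```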

@lixiangpengcs
Author

lixiangpengcs commented Sep 12, 2019

@novioleo, thanks for your quick reply. I solved the ctc-loss problem by using torch version 1.0.1.post2. However, the experiment results still seem strange: val_precision and val_recall are still 0 after training for 10 epochs. Is that a normal phenomenon? Also, the loss is stable around -4.0; is that correct?

Start validate
epoch : 8
loss : -3.9846485162449774
det_loss : 0.01933059994583932
rec_loss : -4.003979111386236
precious : 0.0
recall : 0.0
hmean : 0.0
val_loss : 184.1101213319129
val_det_loss : 89.47653757918037
val_rec_loss : 94.63358375273253
val_precious : 0.0
val_recall : 0.0
val_hmean : 0.0
Saving checkpoint: ./saved_model/united_2019-09-12/checkpoint-epoch008-loss--3.9846.pth.tar ...
Train Epoch: 9 [0/856 (0%)] Loss: -4.143412 Detection Loss: 0.016749 Recognition Loss:-4.160161
Train Epoch: 9 [16/856 (2%)] Loss: -3.958525 Detection Loss: 0.016582 Recognition Loss:-3.975107
Train Epoch: 9 [32/856 (4%)] Loss: -4.779177 Detection Loss: 0.013592 Recognition Loss:-4.792768
Train Epoch: 9 [48/856 (6%)] Loss: -4.109888 Detection Loss: 0.015186 Recognition Loss:-4.125073
Train Epoch: 9 [64/856 (7%)] Loss: -4.520103 Detection Loss: 0.013715 Recognition Loss:-4.533818
Train Epoch: 9 [80/856 (9%)] Loss: -3.883961 Detection Loss: 0.015919 Recognition Loss:-3.899881

@novioleo
Owner

novioleo commented Sep 12, 2019 via email

@lixiangpengcs
Author

Can you tell me why my code is running like this? Or can you tell me what a correct training log should look like?

@novioleo
Owner

@lixiangpengcs Replace the original ctc_loss with torch-baidu-ctc, because of the bug in the built-in CTC loss...
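A minimal usage sketch, following the ctc_loss signature documented in the torch-baidu-ctc README (the tensor values here are illustrative):

```python
import torch
from torch_baidu_ctc import ctc_loss

# Activations: (T, N, C), raw scores -- warp-ctc applies log-softmax internally.
x = torch.rand(10, 3, 6)
# Labels for the whole batch, concatenated into one 1-D CPU int tensor (no blanks).
y = torch.tensor([1, 1, 2, 3, 4, 4, 2, 5, 5, 3], dtype=torch.int)
xs = torch.tensor([10, 6, 9], dtype=torch.int)  # length of each activation sequence
ys = torch.tensor([4, 3, 3], dtype=torch.int)   # length of each label sequence

loss = ctc_loss(x, y, xs, ys, reduction="mean")
```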

@novioleo
Owner

If the error still comes up, I think you need to check your ground truth.

@lixiangpengcs
Author

I modified the common_str character set. The results are still weird, even though the training precision and recall are no longer 0:
Start validate
epoch : 38
loss : 3.4252535650663285
det_loss : 0.013250550838344009
rec_loss : 3.4120030135751884
precious : 0.014009486778199243
recall : 0.014009486778199243
hmean : 0.014009486778199243
val_loss : 14.513418368259934
val_det_loss : 5.277081434320855
val_rec_loss : 9.23633693393908
val_precious : 0.0
val_recall : 0.0
val_hmean : 0.0

@novioleo
Owner

@lixiangpengcs How many images are in your dataset?

@lixiangpengcs
Author

> How many images are in your dataset?

I use icdar2015 as my training set.

@lixiangpengcs
Author

[image: detection result on img_364]
The detection performance doesn't look bad. Maybe the error is in the recognition branch.

@novioleo
Owner

novioleo commented Sep 18, 2019 via email

@novioleo
Owner

novioleo commented Sep 20, 2019 via email

@novioleo
Owner

@lixiangpengcs Have you ever tried your own dataset?

@lixiangpengcs
Author

Yes, I tried the COCO-Text dataset, but it is hard to get it to converge: the test performance improves only a little after training for 50 epochs, and it is still not satisfying. Can you give me some training suggestions?

@novioleo
Owner

@lixiangpengcs Have you replaced my CRNN with the original CRNN?

@lixiangpengcs
Author

No, I use your CRNN with the default configuration.

@novioleo
Owner

@lixiangpengcs What is your learning rate setting?

@lixiangpengcs
Author

"lr_scheduler_type": "StepLR",
"lr_scheduler_freq": 100,
"lr_scheduler": {
"gamma": 0.9,
"step_size": 100
},
As you suggestion in another issue.

@novioleo
Owner

@lixiangpengcs If you train on a public dataset such as ICDAR2015, you need to set a multi-step learning rate schedule and set the initial learning rate a little higher; that helps avoid local optima. The exact values depend on your practical trials...

I call this "metaphysical parameter tuning"... god bless you...
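For example, a minimal sketch using PyTorch's MultiStepLR (the milestones, gamma, and initial learning rate below are illustrative guesses, not values validated on ICDAR2015):

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(8, 8)  # stand-in for the detection + recognition network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # slightly larger initial lr
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    # ... run one epoch of training, calling optimizer.step() per batch ...
    scheduler.step()  # lr drops by 10x at epochs 30, 60, and 90
```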

@lixiangpengcs
Author

@novioleo Can this code run on multiple GPUs? I hit an error when running on multiple GPUs.

@novioleo
Owner

@lixiangpengcs You can parallelize the network following the PyTorch multi-GPU tutorial; I haven't adapted the code for that.
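The standard wrap from that tutorial looks like this (a sketch; the model below is a stand-in for the actual network):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual network
if torch.cuda.device_count() > 1:
    # Replicates the model on each GPU and splits each batch along dim 0.
    model = torch.nn.DataParallel(model)
model = model.to("cuda")
```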

@lixiangpengcs
Author

@novioleo This code may not adapt to multi-GPU as-is. I would have to rewrite the dataset part.

@novioleo
Owner

@lixiangpengcs Can you post your error log?

@lixiangpengcs
Author

The DataParallel class splits the batch into two subsets to train the model on the two GPUs. The mapping and text data cannot be split, because their first dimension is not the batch size, unlike the image data.

@novioleo
Owner

@lixiangpengcs I think you can try this method: concatenate all the labels of one image into a single string, and split it back apart during the training process.
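A minimal sketch of that idea (the separator and function names are illustrative, not the repository's actual code):

```python
# Use a separator that will not appear in real transcriptions.
SEP = "\x1f"

def join_labels(texts):
    """Dataset side: collapse all transcriptions of one image into one string,
    so the label list has exactly one entry per image (first dim == batch size)."""
    return SEP.join(texts)

def split_labels(joined_batch):
    """Training side: recover the per-box transcriptions for each image."""
    return [s.split(SEP) for s in joined_batch]

packed = join_labels(["STOP", "EXIT", "CAFE"])
assert split_labels([packed]) == [["STOP", "EXIT", "CAFE"]]
```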

@lixiangpengcs
Author

That is a great idea!

@novioleo
Owner

@lixiangpengcs Expecting your good news.

@feitiandemiaomi

@lixiangpengcs Is the COCO-Text dataset you used the 2014 version? Could you tell me the specific name?

@novioleo novioleo changed the title Confiusion about total loss is negative Confusion about total loss is negative Oct 18, 2019