
Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested entry not participate in competition

Object detection (DET)

Task 1a: Object detection with provided training data

Ordered by number of categories won

Team name Entry description Number of object categories won mean AP
CUImage Ensemble of 6 models using provided data 109 0.662751
Hikvision Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2 30 0.652704
Hikvision Ensemble B of 3 RPN and 5 FRCN models, mean AP is 66.9, median AP is 69.3 on val2 18 0.652003
NUIST submission_1 15 0.608752
NUIST submission_2 9 0.607124
Trimps-Soushen Ensemble 2 8 0.61816
360+MCG-ICT-CAS_DET 9 models ensemble with validation and 2 iterations 4 0.615561
360+MCG-ICT-CAS_DET Baseline: Faster R-CNN with Res200 4 0.590596
Hikvision Best single model, mAP is 65.1 on val2 2 0.634003
CIL Ensemble of 2 Models 1 0.553542
360+MCG-ICT-CAS_DET 9 models ensemble 0 0.613045
360+MCG-ICT-CAS_DET 3 models 0 0.605708
Trimps-Soushen Ensemble 1 0 0.57956
360+MCG-ICT-CAS_DET res200+dasc+obj+sink+impneg+seg 0 0.576742
CIL Single Model (Pre-activation ResNet + Faster R-CNN on TensorFlow; still training, 1/3 of total epochs finished) 0 0.551189
KAIST-SLSP 2 models ensemble with box rescoring 0 0.535393
MIL_UT ensemble of ResNet101, ResNet152 based Faster RCNN 0 0.532216
KAIST-SLSP 2 models ensemble 0 0.515472
Faceall-BUPT ensemble plan B; validation map 52.28 0 0.488839
Faceall-BUPT ensemble plan A; validation map 52.24 0 0.486977
Faceall-BUPT multi-scale roi; best single model; validation map 51.73 0 0.484141
VB Ensemble Detection Model E3 0 0.481285
Hitsz_BCC Combined 500x500 with 300x300 model 0 0.479929
VB Ensemble Detection Model E1 0 0.479043
ToConcoctPellucid Ensemble of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting 0 0.477484
Hitsz_BCC Self-implement SSD 500x500 model with ResNet-101 0 0.472984
ToConcoctPellucid Ensemble of different topology of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting 0 0.470133
ToConcoctPellucid ResNet-101 + Faster-RCNN single model 0 0.469716
Faceall-BUPT faster rcnn baseline; validation map 49.30 0 0.461085
Hitsz_BCC Self-implement SSD 300x300 model with ResNet-152 0 0.451462
VB Ensemble Detection Model E2 0 0.45063
SunNMoon ensemble FRCN and SSD based on Resnet101 networks. 0 0.434906
Choong Ensemble of Deep learning model based on VGG16 & ResNet 0 0.434323
VB Single Detection Model S1 0 0.421331
hustvision convbox-googlenet 0 0.413457
LZDTX A deconv-ssd network with input size 300x300. 0 0.403113
OutOfMemory ResNet-152+FasterRCNN 0 0.393259
Lean-T A single model, Faster R-CNN baseline,continuous iterations(~230K) 0 0.314391
BUAA ERCACAT combined model for detection 0 0.269069
BUAA ERCACAT A single model for detection 0 0.265055
Lean-T A single model, Faster R-CNN baseline,discontinuous iterations(~600K) 0 0.259508
CUImage Single GBD-Net model using provided data --- 0.633634
CUImage Single Cluster-Net using provided data --- 0.618024
Trimps-Soushen Single model --- 0.581434
DPFly detection algorithm 1 --- 0.491905
VIST Single model A using ResNet for detection --- 0.459305
VIST Single model B using ResNet for detection --- 0.455689

Ordered by mean average precision

Team name Entry description mean AP Number of object categories won
CUImage Ensemble of 6 models using provided data 0.662751 109
Hikvision Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2 0.652704 30
Hikvision Ensemble B of 3 RPN and 5 FRCN models, mean AP is 66.9, median AP is 69.3 on val2 0.652003 18
Hikvision Best single model, mAP is 65.1 on val2 0.634003 2
CUImage Single GBD-Net model using provided data 0.633634 ---
Trimps-Soushen Ensemble 2 0.61816 8
CUImage Single Cluster-Net using provided data 0.618024 ---
360+MCG-ICT-CAS_DET 9 models ensemble with validation and 2 iterations 0.615561 4
360+MCG-ICT-CAS_DET 9 models ensemble 0.613045 0
NUIST submission_1 0.608752 15
NUIST submission_2 0.607124 9
360+MCG-ICT-CAS_DET 3 models 0.605708 0
360+MCG-ICT-CAS_DET Baseline: Faster R-CNN with Res200 0.590596 4
Trimps-Soushen Single model 0.581434 ---
Trimps-Soushen Ensemble 1 0.57956 0
360+MCG-ICT-CAS_DET res200+dasc+obj+sink+impneg+seg 0.576742 0
CIL Ensemble of 2 Models 0.553542 1
CIL Single Model (Pre-activation ResNet + Faster R-CNN on TensorFlow; still training, 1/3 of total epochs finished) 0.551189 0
KAIST-SLSP 2 models ensemble with box rescoring 0.535393 0
MIL_UT ensemble of ResNet101, ResNet152 based Faster RCNN 0.532216 0
KAIST-SLSP 2 models ensemble 0.515472 0
DPFly detection algorithm 1 0.491905 ---
Faceall-BUPT ensemble plan B; validation map 52.28 0.488839 0
Faceall-BUPT ensemble plan A; validation map 52.24 0.486977 0
Faceall-BUPT multi-scale roi; best single model; validation map 51.73 0.484141 0
VB Ensemble Detection Model E3 0.481285 0
Hitsz_BCC Combined 500x500 with 300x300 model 0.479929 0
VB Ensemble Detection Model E1 0.479043 0
ToConcoctPellucid Ensemble of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting 0.477484 0
Hitsz_BCC Self-implement SSD 500x500 model with ResNet-101 0.472984 0
ToConcoctPellucid Ensemble of different topology of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting 0.470133 0
ToConcoctPellucid ResNet-101 + Faster-RCNN single model 0.469716 0
Faceall-BUPT faster rcnn baseline; validation map 49.30 0.461085 0
VIST Single model A using ResNet for detection 0.459305 ---
VIST Single model B using ResNet for detection 0.455689 ---
Hitsz_BCC Self-implement SSD 300x300 model with ResNet-152 0.451462 0
VB Ensemble Detection Model E2 0.45063 0
SunNMoon ensemble FRCN and SSD based on Resnet101 networks. 0.434906 0
Choong Ensemble of Deep learning model based on VGG16 & ResNet 0.434323 0
VB Single Detection Model S1 0.421331 0
hustvision convbox-googlenet 0.413457 0
LZDTX A deconv-ssd network with input size 300x300. 0.403113 0
OutOfMemory ResNet-152+FasterRCNN 0.393259 0
Lean-T A single model, Faster R-CNN baseline,continuous iterations(~230K) 0.314391 0
BUAA ERCACAT combined model for detection 0.269069 0
BUAA ERCACAT A single model for detection 0.265055 0
Lean-T A single model, Faster R-CNN baseline,discontinuous iterations(~600K) 0.259508 0

Task 1b: Object detection with additional training data

Ordered by number of categories won

Team name Entry description Description of outside data used Number of object categories won mean AP
CUImage Our model using our labeled landmarks on ImageNet Det data We used the labeled landmarks on ImageNet Det data 176 0.660081
Trimps-Soushen Ensemble 3 With extra annotations. 22 0.616836
NUIST submission_4 refine the training data, add labels neglected, remove noisy labels for multi-instance images 1 0.542942
NUIST submission_3 refine the training data, add labels neglected, remove noisy labels for multi-instance images 1 0.540981
NUIST submission_5 refine the training data, add labels neglected, remove noisy labels for multi-instance images 0 0.540619
DPAI Vison multi-model ensemble, multiple classifier ensemble add extra data for class num<1000 0 0.534943
DPAI Vison multi-model ensemble, multiple context classifier ensemble add extra data for class num<1000 0 0.534543
DPAI Vison multi-model ensemble, extra classifier add extra data for class num<1000 0 0.534203
DPAI Vison multi-model ensemble, one-scale context classifier add extra data for class num<1000 0 0.533838
DPAI Vison multi-model ensemble add extra data for class num<1000 0 0.526699

Ordered by mean average precision

Team name Entry description Description of outside data used mean AP Number of object categories won
CUImage Our model using our labeled landmarks on ImageNet Det data We used the labeled landmarks on ImageNet Det data 0.660081 176
Trimps-Soushen Ensemble 3 With extra annotations. 0.616836 22
NUIST submission_4 refine the training data, add labels neglected, remove noisy labels for multi-instance images 0.542942 1
NUIST submission_3 refine the training data, add labels neglected, remove noisy labels for multi-instance images 0.540981 1
NUIST submission_5 refine the training data, add labels neglected, remove noisy labels for multi-instance images 0.540619 0
DPAI Vison multi-model ensemble, multiple classifier ensemble add extra data for class num<1000 0.534943 0
DPAI Vison multi-model ensemble, multiple context classifier ensemble add extra data for class num<1000 0.534543 0
DPAI Vison multi-model ensemble, extra classifier add extra data for class num<1000 0.534203 0
DPAI Vison multi-model ensemble, one-scale context classifier add extra data for class num<1000 0.533838 0
DPAI Vison multi-model ensemble add extra data for class num<1000 0.526699 0

Object localization (LOC)

Task 2a: Classification+localization with provided training data

Ordered by localization error

Team name Entry description Localization error Classification error
Trimps-Soushen Ensemble 3 0.077087 0.02991
Trimps-Soushen Ensemble 4 0.077429 0.02991
Trimps-Soushen Ensemble 2 0.077668 0.02991
Trimps-Soushen Ensemble 1 0.079068 0.03144
Hikvision Ensemble of 3 Faster R-CNN models for localization 0.087377 0.03711
Hikvision Ensemble of 4 Faster R-CNN models for localization 0.087533 0.03711
NUIST prefer multi box prediction with refine 0.090593 0.03461
NUIST prefer multi class prediction 0.094058 0.03351
CU-DeepLink GrandUnion + Fused-scale EnsembleNet 0.098892 0.03042
CU-DeepLink GrandUnion + Basic Ensemble 0.098954 0.03049
CU-DeepLink GrandUnion + Multi-scale EnsembleNet 0.099006 0.03046
KAISTNIA_ETRI Ensembles B (further tuned in class-dependent models I) 0.099286 0.03352
CU-DeepLink GrandUnion + Class-reweighted Ensemble with Per-instance Normalization 0.099349 0.03103
CU-DeepLink GrandUnion + Class-reweighted Ensemble 0.099369 0.03096
KAISTNIA_ETRI Ensembles A (further tuned in class-dependent model I ) 0.100552 0.03352
KAISTNIA_ETRI Ensembles B 0.100676 0.03256
KAISTNIA_ETRI Ensembles A 0.102015 0.03256
KAISTNIA_ETRI Ensembles C 0.102056 0.03256
NUIST prefer multi box prediction without refine 0.11743 0.03473
SamExynos 3 model only for classification 0.236561 0.03171
SamExynos single model only for classification 0.238791 0.03614
Faceall-BUPT Single localization network (II) fine-tuned with object-level annotations of training data. 0.31649 0.05184
Faceall-BUPT Ensemble of 5 models for classification, single model for localization. 0.320754 0.04574
Faceall-BUPT Ensemble of 3 models for classification, single model for localization. 0.325235 0.0466
WQF_BTPZ Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0645. The top-5 cls-loc error on validation is 0.4029. 0.374499 0.06414
Faceall-BUPT Single localization network (I) fine-tuned with object-level annotations of training data. 0.415558 0.05184
DGIST-KAIST Weighted sum #1 (five models) 0.489969 0.03297
DGIST-KAIST Averaging four models 0.490373 0.03378
WQF_BTPZ For classification, we merge two ResNet models, the top-5 cls-error on validation is 0.0639. For localization, we use a single faster RCNN model with ResNet, the top-5 cls-loc error on validation is 0.4025. 0.524586 0.06407
ResNeXt Ensemble C, weighted average, tuned on val. [No bounding box results] 0.737308 0.03031
ResNeXt Ensemble B, weighted average, tuned on val. [No bounding box results] 0.737484 0.03092
ResNeXt Ensemble A, simple average. [No bounding box results] 0.737505 0.0315
ResNeXt Ensemble C, weighted average. [No bounding box results] 0.737526 0.03124
ResNeXt Ensemble B, weighted average. [No bounding box results] 0.737681 0.03203
SIIT_KAIST-TECHWIN Ensemble B 0.931565 0.03416
SIIT_KAIST-TECHWIN Ensemble C 0.931565 0.03458
SIIT_KAIST-TECHWIN Ensemble A 0.931596 0.03436
SIIT_KAIST-TECHWIN Single model 0.931596 0.03651
DEEPimagine ImagineNet ensemble for classification only [ALL] 0.995757 0.03536
DEEPimagine ImagineNet ensemble for classification only [PART#2] 0.995757 0.03592
DEEPimagine ImagineNet ensemble for classification only [PART#1] 0.995768 0.03643
NEU_SMILELAB An ensemble of five models. Top-5 error 3.92% on validation set. 0.999077 0.03981
NEU_SMILELAB An ensemble of six models. Top-5 error 4.24% on validation set. 0.999077 0.04268
NEU_SMILELAB A single resnet-200 layer trained with small batch size. Top-5 error 4.57% on validation set. 0.999077 0.04511
NEU_SMILELAB Our single model with a partition of the 1000 classes. Top-5 error 7.62% on validation set. 0.999097 0.07288
DeepIST EnsembleC 1.0 0.03291
DeepIST EnsembleD 1.0 0.03294
DGIST-KAIST Weighted sum #2 (five models) 1.0 0.03324
DGIST-KAIST Averaging five models 1.0 0.03357
DGIST-KAIST Averaging six models 1.0 0.03357
DeepIST EnsembleB 1.0 0.03446
DeepIST EnsembleA 1.0 0.03449

Ordered by classification error

Team name Entry description Classification error Localization error
Trimps-Soushen Ensemble 2 0.02991 0.077668
Trimps-Soushen Ensemble 3 0.02991 0.077087
Trimps-Soushen Ensemble 4 0.02991 0.077429
ResNeXt Ensemble C, weighted average, tuned on val. [No bounding box results] 0.03031 0.737308
CU-DeepLink GrandUnion + Fused-scale EnsembleNet 0.03042 0.098892
CU-DeepLink GrandUnion + Multi-scale EnsembleNet 0.03046 0.099006
CU-DeepLink GrandUnion + Basic Ensemble 0.03049 0.098954
ResNeXt Ensemble B, weighted average, tuned on val. [No bounding box results] 0.03092 0.737484
CU-DeepLink GrandUnion + Class-reweighted Ensemble 0.03096 0.099369
CU-DeepLink GrandUnion + Class-reweighted Ensemble with Per-instance Normalization 0.03103 0.099349
ResNeXt Ensemble C, weighted average. [No bounding box results] 0.03124 0.737526
Trimps-Soushen Ensemble 1 0.03144 0.079068
ResNeXt Ensemble A, simple average. [No bounding box results] 0.0315 0.737505
SamExynos 3 model only for classification 0.03171 0.236561
ResNeXt Ensemble B, weighted average. [No bounding box results] 0.03203 0.737681
KAISTNIA_ETRI Ensembles A 0.03256 0.102015
KAISTNIA_ETRI Ensembles C 0.03256 0.102056
KAISTNIA_ETRI Ensembles B 0.03256 0.100676
DeepIST EnsembleC 0.03291 1.0
DeepIST EnsembleD 0.03294 1.0
DGIST-KAIST Weighted sum #1 (five models) 0.03297 0.489969
DGIST-KAIST Weighted sum #2 (five models) 0.03324 1.0
NUIST prefer multi class prediction 0.03351 0.094058
KAISTNIA_ETRI Ensembles A (further tuned in class-dependent model I ) 0.03352 0.100552
KAISTNIA_ETRI Ensembles B (further tuned in class-dependent models I) 0.03352 0.099286
DGIST-KAIST Averaging five models 0.03357 1.0
DGIST-KAIST Averaging six models 0.03357 1.0
DGIST-KAIST Averaging four models 0.03378 0.490373
SIIT_KAIST-TECHWIN Ensemble B 0.03416 0.931565
SIIT_KAIST-TECHWIN Ensemble A 0.03436 0.931596
DeepIST EnsembleB 0.03446 1.0
DeepIST EnsembleA 0.03449 1.0
SIIT_KAIST-TECHWIN Ensemble C 0.03458 0.931565
NUIST prefer multi box prediction with refine 0.03461 0.090593
NUIST prefer multi box prediction without refine 0.03473 0.11743
DEEPimagine ImagineNet ensemble for classification only [ALL] 0.03536 0.995757
DEEPimagine ImagineNet ensemble for classification only [PART#2] 0.03592 0.995757
SamExynos single model only for classification 0.03614 0.238791
DEEPimagine ImagineNet ensemble for classification only [PART#1] 0.03643 0.995768
SIIT_KAIST-TECHWIN Single model 0.03651 0.931596
Hikvision Ensemble of 3 Faster R-CNN models for localization 0.03711 0.087377
Hikvision Ensemble of 4 Faster R-CNN models for localization 0.03711 0.087533
NEU_SMILELAB An ensemble of five models. Top-5 error 3.92% on validation set. 0.03981 0.999077
NEU_SMILELAB An ensemble of six models. Top-5 error 4.24% on validation set. 0.04268 0.999077
NEU_SMILELAB A single resnet-200 layer trained with small batch size. Top-5 error 4.57% on validation set. 0.04511 0.999077
Faceall-BUPT Ensemble of 5 models for classification, single model for localization. 0.04574 0.320754
Faceall-BUPT Ensemble of 3 models for classification, single model for localization. 0.0466 0.325235
Faceall-BUPT Single localization network (I) fine-tuned with object-level annotations of training data. 0.05184 0.415558
Faceall-BUPT Single localization network (II) fine-tuned with object-level annotations of training data. 0.05184 0.31649
WQF_BTPZ For classification, we merge two ResNet models, the top-5 cls-error on validation is 0.0639. For localization, we use a single faster RCNN model with ResNet, the top-5 cls-loc error on validation is 0.4025. 0.06407 0.524586
WQF_BTPZ Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0645. The top-5 cls-loc error on validation is 0.4029. 0.06414 0.374499
NEU_SMILELAB Our single model with a partition of the 1000 classes. Top-5 error 7.62% on validation set. 0.07288 0.999097

Task 2b: Classification+localization with additional training data

Ordered by localization error

Team name Entry description Description of outside data used Localization error Classification error
Trimps-Soushen Ensemble 5 With extra annotations. 0.077377 0.02991
NUIST prefer multi box prediction ensemble one model trained on CLS+Place2 (1365) 0.094992 0.04093
NUIST prefer multi class prediction ensemble one model trained on CLS+Place2(1365) 0.097782 0.03877

Ordered by classification error

Team name Entry description Description of outside data used Classification error Localization error
Trimps-Soushen Ensemble 5 With extra annotations. 0.02991 0.077377
NUIST prefer multi class prediction ensemble one model trained on CLS+Place2(1365) 0.03877 0.097782
NUIST prefer multi box prediction ensemble one model trained on CLS+Place2 (1365) 0.04093 0.094992

Object detection from video (VID)

Task 3a: Object detection from video with provided training data

Ordered by number of categories won

Team name Entry description Number of object categories won mean AP
NUIST cascaded region regression + tracking 10 0.808292
NUIST cascaded region regression + tracking 10 0.803154
CUVideo 4-model ensemble with Multi-Context Suppression and Motion-Guided Propagation 9 0.767981
Trimps-Soushen Ensemble 2 1 0.709651
MCG-ICT-CAS ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo 0 0.733116
MCG-ICT-CAS ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification 0 0.730793
MCG-ICT-CAS ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification 0 0.720318
MCG-ICT-CAS ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo 0 0.706204
MCG-ICT-CAS ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification 0 0.700729
Trimps-Soushen Ensemble 3 0 0.684258
KAIST-SLSP set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2) 0 0.642787
NUS_VISENZE fused ssd vgg+resnet nms 0 0.64062
RUC_BDAI We use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. Then we utilize the contextual information of the video to reduce noise and add missing detections. 0 0.562668
SRA Object detection using temporal and contextual information 0 0.511638
SRA object detection without contextual information 0 0.492785
Faceall-BUPT faster rcnn, brute force detection, only used DET data to train; map on val is 53.51 0 0.490969
CIGIT_Media adopt a new method for merging the scores of R-FCN and SSD detectors 0 0.483239
CIGIT_Media object detection from video without tracking 0 0.47782
F205_CV ssd resnet 101 0.01 confidence rate 0 0.472255
F205_CV ssd resnet 101 0.1 confidence rate 0 0.439478
F205_CV ssd resnet 101 0.2 confidence rate 0 0.41922
F205_CV ssd with resnet101 filtered by nms with a 0.6 overlap rate and 0.1 confidence rate 0 0.357711
F205_CV ssd with resnet101 filtered by nms with a 0.6 overlap rate and 0.02 confidence rate 0 0.340852
SIS ITMO University SSD 0 0.272236
ASTAR_VA Our model takes into account spatial and temporal information from several previous frames. 0 0.270755
MCC --- 0 0.25457
RUC_BDAI We only use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. 0 0.108039
CUVideo 4-model ensemble without MCS & MGP --- 0.740812
CUVideo Single GBD-Net with Multi-Context Suppression & Motion-Guided Propagation --- 0.732857

Ordered by mean average precision

Team name Entry description mean AP Number of object categories won
NUIST cascaded region regression + tracking 0.808292 10
NUIST cascaded region regression + tracking 0.803154 10
CUVideo 4-model ensemble with Multi-Context Suppression and Motion-Guided Propagation 0.767981 9
CUVideo 4-model ensemble without MCS & MGP 0.740812 ---
MCG-ICT-CAS ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo 0.733116 0
CUVideo Single GBD-Net with Multi-Context Suppression & Motion-Guided Propagation 0.732857 ---
MCG-ICT-CAS ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification 0.730793 0
MCG-ICT-CAS ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification 0.720318 0
Trimps-Soushen Ensemble 2 0.709651 1
MCG-ICT-CAS ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo 0.706204 0
MCG-ICT-CAS ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification 0.700729 0
Trimps-Soushen Ensemble 3 0.684258 0
KAIST-SLSP set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2) 0.642787 0
NUS_VISENZE fused ssd vgg+resnet nms 0.64062 0
RUC_BDAI We use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. Then we utilize the contextual information of the video to reduce noise and add missing detections. 0.562668 0
SRA Object detection using temporal and contextual information 0.511638 0
SRA object detection without contextual information 0.492785 0
Faceall-BUPT faster rcnn, brute force detection, only used DET data to train; map on val is 53.51 0.490969 0
CIGIT_Media adopt a new method for merging the scores of R-FCN and SSD detectors 0.483239 0
CIGIT_Media object detection from video without tracking 0.47782 0
F205_CV ssd resnet 101 0.01 confidence rate 0.472255 0
F205_CV ssd resnet 101 0.1 confidence rate 0.439478 0
F205_CV ssd resnet 101 0.2 confidence rate 0.41922 0
F205_CV ssd with resnet101 filtered by nms with a 0.6 overlap rate and 0.1 confidence rate 0.357711 0
F205_CV ssd with resnet101 filtered by nms with a 0.6 overlap rate and 0.02 confidence rate 0.340852 0
SIS ITMO University SSD 0.272236 0
ASTAR_VA Our model takes into account spatial and temporal information from several previous frames. 0.270755 0
MCC --- 0.25457 0
RUC_BDAI We only use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. 0.108039 0

Task 3b: Object detection from video with additional training data

Ordered by number of categories won

Team name Entry description Description of outside data used Number of object categories won mean AP
NUIST cascaded region regression + tracking proposal network is finetuned from COCO 17 0.79593
NUIST cascaded region regression + tracking proposal network is finetuned from COCO 5 0.781144
Trimps-Soushen Ensemble 6 Extra data from ImageNet dataset(out of the ILSVRC2016) 5 0.720704
ITLab-Inha An ensemble for detection, MCMOT for tracking pre-trained model from COCO detection, extra data collected by ourselves (100 images per class) 3 0.731471
DPAI Vison single model extra data 0 0.615196
DPAI Vison single model and iteration regression extra data 0 0.532302
TEAM1 VGG-16 + Faster R-CNN Imagenet DET dataset 0 0.217933
TEAM1 Ensemble of 6 models Imagenet DET dataset 0 0.207165
TEAM1 Ensemble of 7 models Imagenet DET dataset 0 0.189227

Ordered by mean average precision

Team name Entry description Description of outside data used mean AP Number of object categories won
NUIST cascaded region regression + tracking proposal network is finetuned from COCO 0.79593 17
NUIST cascaded region regression + tracking proposal network is finetuned from COCO 0.781144 5
ITLab-Inha An ensemble for detection, MCMOT for tracking pre-trained model from COCO detection, extra data collected by ourselves (100 images per class) 0.731471 3
Trimps-Soushen Ensemble 6 Extra data from ImageNet dataset(out of the ILSVRC2016) 0.720704 5
DPAI Vison single model extra data 0.615196 0
DPAI Vison single model and iteration regression extra data 0.532302 0
TEAM1 VGG-16 + Faster R-CNN Imagenet DET dataset 0.217933 0
TEAM1 Ensemble of 6 models Imagenet DET dataset 0.207165 0
TEAM1 Ensemble of 7 models Imagenet DET dataset 0.189227 0

Task 3c: Object detection/tracking from video with provided training data

Team name Entry description mean AP
CUVideo 4-model ensemble 0.558557
NUIST cascaded region regression + tracking 0.548781
CUVideo Single GBD-Net 0.526137
MCG-ICT-CAS ResNet101+ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification++, MDNet tracking 0.488632
MCG-ICT-CAS ResNet101+ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking 0.484771
MCG-ICT-CAS ResNet101 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking 0.462013
MCG-ICT-CAS ResNet101 models for detection, optical flow for tracking, Coherent tubelet reclassification 0.395057
MCG-ICT-CAS ResNet101+ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification 0.393705
KAIST-SLSP set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2) 0.327421
CIGIT_Media object detection from video with tracking 0.229714
CIGIT_Media adopt a new method for merging the scores of R-FCN and SSD detectors 0.221176
F205_CV a simple track with ssd_resnet101 0.164678
NUS_VISENZE 17Sept_result_final_ss_ssd_resnet_nms_fused 0.148463
NUS_VISENZE test 0.148463
F205_CV a simple track with ssd_resnet101 with 0.1 confidence 0.139524
F205_CV a simple track with ssd_resnet101 with 0.2 confidence 0.132039
NUS_VISENZE fused 3 models with tracking 0.112528
NUS_VISENZE fused 3 models with tracking max 8 classes 0.112524
BSC- UPC This is the longest run without error. 0.002263
BSC- UPC This run had some errors; I don't know if it's complete. 0.002263

Task 3d: Object detection/tracking from video with additional training data

Team name Entry description Description of outside data used mean AP
NUIST cascaded region regression + tracking proposal network is finetuned from COCO 0.583898
ITLab-Inha An ensemble for detection, MCMOT for tracking pre-trained model from COCO detection, extra data collected by ourselves (100 images per class) 0.490863

Scene Classification (Scene)

Team name Entry description Top-5 classification error
Hikvision Model D 0.0901
Hikvision Model E 0.0908
Hikvision Model C 0.0939
Hikvision Model B 0.0948
MW Model ensemble 2 0.1019
MW Model ensemble 3 0.1019
MW Model ensemble 1 0.1023
Hikvision Model A 0.1026
Trimps-Soushen With extra data. 0.103
Trimps-Soushen Ensemble 2 0.1042
SIAT_MMLAB 10 models fusion 0.1043
SIAT_MMLAB 7 models fusion 0.1044
SIAT_MMLAB fusion with softmax 0.1044
SIAT_MMLAB learning weights with cnn 0.1044
SIAT_MMLAB 6 models fusion 0.1049
Trimps-Soushen Ensemble 4 0.1049
Trimps-Soushen Ensemble 3 0.105
MW Single model B 0.1073
MW Single model A 0.1076
NTU-SC Product of 5 ensembles (top-5) 0.1085
NTU-SC Product of 3 ensembles (top-5) 0.1086
NTU-SC Sum of 3 ensembles (top-5) 0.1086
NTU-SC Sum of 5 ensembles (top-3) 0.1086
NTU-SC Single ensemble of 5 models (top-5) 0.1088
NQSCENE Four models 0.1093
NQSCENE Three models 0.1101
Samsung Research America: General Purpose Acceleration Group Simple Ensemble, 3 Inception v3 models w/various hyper param changes, 32 multi-crop (60.11 top-1, 88.98 top-5 on val) 0.1113
fusionf Fusion with average strategy (12 models) 0.1115
fusionf Fusion with scoring strategy (14 models) 0.1117
fusionf Fusion with average strategy (13 models) 0.1118
YoutuLab weighted average1 at scale level using greedy search 0.1125
YoutuLab weighted average at model level using greedy search 0.1127
YoutuLab weighted average2 at scale level using greedy search 0.1129
fusionf Fusion with scoring strategy (13 models) 0.113
fusionf Fusion with scoring strategy (12 models) 0.1132
YoutuLab simple average using models in entry 3 0.1139
Samsung Research America: General Purpose Acceleration Group Model A0, weakly scaled, multi-crop. (59.61 top-1, 88.64 top-5 on val) 0.1142
SamExynos 3 model 0.1143
Samsung Research America: General Purpose Acceleration Group Ensemble B, 3 Inception v3 models w/various hyper param changes + Inception v4 res2, 128 multi-crop 0.1152
YoutuLab average on base models 0.1162
NQSCENE Model B 0.117
Samsung Research America: General Purpose Acceleration Group Model A2, weakly scaled, single-crop & mirror. (58.84 top-1, 88.09 top-5 on val) 0.1188
NQSCENE Model A 0.1192
Samsung Research America: General Purpose Acceleration Group Model A1, weakly scaled, single-crop. (58.65 top-1, 88.07 top-5 on val) 0.1193
Trimps-Soushen Ensemble 1 0.1196
Rangers ensemble model 1 0.1208
SamExynos single model 0.121
Rangers ensemble model 2 0.1212
Everphoto ensemble by learned weights - 1 0.1213
Everphoto ensemble by product strategy 0.1218
Everphoto ensemble by learned weights - 2 0.1218
Everphoto ensemble by average strategy 0.1223
MIPAL_SNU Ensemble of two ResNet-50 with balanced sampling 0.1232
KPST_VB Model II 0.1233
KPST_VB Ensemble of Model I and II 0.1235
Rangers single model result of 69 0.124
Everphoto ensemble by product strategy (without specialist models) 0.1242
KPST_VB Model II with adjustment 0.125
KPST_VB Model I 0.1251
Rangers single model result of 66 0.1253
KPST_VB Ensemble of Model I and II with adjustment 0.1253
SJTU-ReadSense Ensemble 5 models with learnt weights 0.1272
SJTU-ReadSense Ensemble 5 models with weighted validation accuracies 0.1273
iMCB A combination of CNN models based on researched influential factors 0.1277
SJTU-ReadSense Ensemble 6 models with learnt weights 0.1278
SJTU-ReadSense Ensemble 4 models with learnt weights 0.1287
iMCB A combination of CNN models with a strategy w.r.t.validation accuracy 0.1299
Choong Based on VGG16, features are extracted from multiple layers. ROI proposal network is not applied. Every neuron from each feature layer is center of ROI candidate. 0.131
SIIT_KAIST 101-depth single model (val.error 12.90%) 0.131
DPAI Vison An ensemble model 0.1355
isia_ICT spectral clustering on confusion matrix 0.1355
isia_ICT fusion of 4 models with average strategy 0.1357
NUIST inception+shortcut CNN 0.137
isia_ICT MP_multiCNN_multiscale 0.1372
NUIST inception+shortcut CNN 0.1381
Viz Insight Multiple Deep Metaclassifiers 0.1386
iMCB FeatureFusion_2L 0.1396
iMCB FeatureFusion_3L 0.1404
DPAI Vison Single Model 0.1425
isia_ICT 2 models with size of 288 0.1433
Faceall-BUPT A single model with 150crops 0.1471
iMCB A Single Model 0.1506
SJTU-ReadSense A single model (based on Inception-BN) trained on the Places365-Challenge dataset 0.1511
OceanVision A result obtained by VGG-16 0.1635
OceanVision A result obtained by alexnet 0.1867
OceanVision A result obtained by googlenet 0.1867
ABTEST GoogLeNet Model trained on LSUN dataset and fined tuned on Places2 0.3245
Vladimir Iglovikov VGG16 trained on 128x128 0.3552
Vladimir Iglovikov VGG19 trained on 128x128 0.3593
Vladimir Iglovikov average of VGG16 and VGG19 trained on 128x128 0.3712
Vladimir Iglovikov Resnet 50 trained on 128x128 0.4577
scnu407 VGG16+4D lstm 0.8831

Scene Parsing

Team name Entry description Average of mIoU and pixel accuracy
SenseCUSceneParsing ensemble more models on trainval data 0.57205
SenseCUSceneParsing dense ensemble model on trainval data 0.5711
SenseCUSceneParsing ensemble model on trainval data 0.5705
SenseCUSceneParsing ensemble model on train data 0.5674
Adelaide Multiple models, multiple scales, refined with CRFs 0.56735
Adelaide Multiple models, multiple scales 0.56615
Adelaide Single model, multiple scales 0.5641
Adelaide Multiple models, single scale 0.5617
360+MCG-ICT-CAS_SP fusing 152, 101, 200 layers front models with global context aggregation, iterative boosting and high resolution training 0.55565
Adelaide Single model, single scale 0.5539
SenseCUSceneParsing best single model on train data 0.5538
360+MCG-ICT-CAS_SP fusing 152, 101, 200 layers front models with global context aggregation, iterative boosting and high resolution training, some models adding local refinement network before fusion 0.55335
360+MCG-ICT-CAS_SP fusing 152, 101, 200 layers front models with global context aggregation, iterative boosting and high resolution training, some models adding local refinement network before and after fusion 0.55215
360+MCG-ICT-CAS_SP 152 layers front model with global context aggregation, iterative boosting and high resolution training 0.54675
SegModel ensemble of 5 models, bilateral filter, 42.7 mIoU on val set 0.5465
SegModel ensemble of 5 models,guided filter, 42.5 mIoU on val set 0.5449
CASIA_IVA casia_iva_model4:DeepLab, Multi-Label 0.5433
CASIA_IVA casia_iva_model3:DeepLab, OA-Seg, Multi-Label 0.5432
CASIA_IVA casia_iva_model5:Aug_data,DeepLab, OA-Seg, Multi-Label 0.5425
NTU-SP Fusion models from two source models (Train + TrainVal) 0.53565
NTU-SP 6 ResNet initialized models (models are trained from TrainVal) 0.5354
NTU-SP 8 ResNet initialized models + 2 VGG initialized models (with different bn statistics) 0.5346
SegModel ensemble by joint categories and guided filter, 42.7 on val set 0.53445
NTU-SP 8 ResNet initialized models + 2 VGG initialized models (models are trained from Train only) 0.53435
NTU-SP 8 ResNet initialized models + 2 VGG initialized models (models are trained from TrainVal) 0.53435
Hikvision Ensemble models 0.53355
SegModel ensemble by joint categories and bilateral filter, 42.8 on val set 0.5332
ACRV-Adelaide use DenseCRF 0.5326
SegModel single model, 41.3 mIoU on valset 0.53225
DPAI Vison different denseCRF parameters of 3 models(B) 0.53065
Hikvision Single model 0.53055
ACRV-Adelaide an ensemble 0.53035
360+MCG-ICT-CAS_SP baseline,152 layers front model with iterative boosting 0.52925
CASIA_IVA casia_iva_model2:DeepLab, OA-Seg 0.52785
DPAI Vison different denseCRF parameters of 3 models(C) 0.52645
DPAI Vison average ensemble of 3 segmentation models 0.52575
DPAI Vison different denseCRF parameters of 3 models(A) 0.52575
CASIA_IVA casia_iva_model1:DeepLab 0.5243
SUXL scene parsing network 5 0.52355
SUXL scene parsing network 0.52325
SUXL scene parsing network 3 0.5224
ACRV-Adelaide a single model 0.5221
SUXL scene parsing network 2 0.5212
SYSU_HCP-I2_Lab cascade nets 0.52085
SYSU_HCP-I2_Lab DCNN with skipping layers 0.5136
SYSU_HCP-I2_Lab DeepLab_CRF 0.51205
SYSU_HCP-I2_Lab Pixel normalization networks 0.5077
SYSU_HCP-I2_Lab ResNet101 0.50715
S-LAB-IIE-CAS Multi-Scale CNN + Bbox_Refine + FixHole 0.5066
S-LAB-IIE-CAS Combined with the results of other models 0.50625
S-LAB-IIE-CAS Multi-Scale CNN + Attention 0.50515
NUS_FCRN trained with training set and val set 0.5006
NUS-AIPARSE model3 0.4997
NUS_FCRN trained with training set only 0.49885
NUS-AIPARSE model2 0.49855
F205_CV Model fusion of ResNet101 and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. 0.49805
F205_CV Model fusion of ResNet101 and FCN, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. 0.4933
F205_CV Model fusion of ResNet101, FCN and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. 0.4899
Faceall-BUPT 6 models finetuned by pre-trained fcn8s and dilatedNet with 3 different image sizes. 0.4893
NUS-AIPARSE model1 0.48915
Faceall-BUPT We use six models finetuned by pre-trained fcn8s and dilatedNet with 3 different image sizes. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552. 0.48905
F205_CV Model fusion of ResNet101, FCN and DilatedNet, with data augmentation, fine-tuned from places2 scene classification/parsing 2016 pretrained models. 0.48425
S-LAB-IIE-CAS Multi-Scale CNN + Bbox_Refine 0.4814
Faceall-BUPT 3 models finetuned by pre-trained fcn8s with 3 different image sizes. 0.4795
Faceall-BUPT 3 models finetuned by pre-trained dilatedNet with 3 different image sizes. 0.4793
F205_CV Model fusion of FCN and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. 0.47855
S-LAB-IIE-CAS Multi-Scale CNN 0.4757
Multiscale-FCN-CRFRNN Multi-scale CRF-RNN 0.47025
Faceall-BUPT one model finetuned by pre-trained dilatedNet with image size 384*384. The pixel-wise accuracy is 75.14% and the mean class-wise IoU is 0.3291. 0.46565
Deep Cognition Labs Modified Deeplab Vgg16 with CRF 0.41605
mmap-o FCN-8s with classification 0.39335
NuistParsing SegNet+Smoothing 0.3608
XKA SegNet trained on ADE20k +CRF 0.3603
VikyNet Fine tuned version of ParseNet 0.0549
VikyNet Fine tuned version of ParseNet 0.0549

Team information

Team name Team members Abstract
360+MCG-ICT-CAS_SP Rui Zhang (1,2)
Min Lin (1)
Sheng Tang (2)
Yu Li (1,2)
YunPeng Chen (3)
YongDong Zhang (2)
JinTao Li (2)
YuGang Han (1)
ShuiCheng Yan (1,3)

(1) Qihoo 360
(2) Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China
(3) National University of Singapore (NUS)
Technique Details for the Scene Parsing Task:

There are two core and general contributions for our scene parsing system: 1) Local-refinement-network for object boundary refinement, and 2) Iterative-boosting-network for overall parsing refinement.
These two networks collaboratively refine the parsing results from two perspectives, and the details are as below:
1) Local-refinement-network for object boundary refinement. This network takes the original image and the K object probability maps (one per class) as inputs, and outputs m*m feature maps indicating how each of the m*m neighbors propagates its probability vector to the center point for local refinement. In spirit it works similarly to bounding-box refinement in the object detection task, but here it locally refines the object boundary rather than the object bounding box.
2) Iterative-boosting-network for overall parsing refinement. This network takes the original image and the K object probability maps (one per class) as inputs, and outputs refined probability maps for all classes. It iteratively boosts the parsing results in a global way (a minimal sketch of this idea follows this entry).

Two other tricks are also used:
1) Global context aggregation: the scene classification information can provide global context for the decision and capture the co-occurrence relationship between the scene and the objects/stuff it contains. We therefore feed features from an independent scene classification model, trained on the ILSVRC 2016 Scene Classification dataset, into our scene parsing system as context.
2) Multi-scale scheme: considering the limited amount of training data and the varying scales of objects across training samples, we use multi-scale data augmentation in both the training and inference stages. High-resolution models are also trained on magnified images to capture details and small objects.
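Below is a minimal, illustrative sketch (not the authors' code) of the iterative-boosting refinement idea described above: a small network that takes the RGB image concatenated with the K per-class probability maps and re-estimates the probability maps over a few boosting steps. The layer sizes, number of steps, and PyTorch framing are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class BoostingRefiner(nn.Module):
    """Toy stand-in for an iterative-boosting refinement network."""
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        # Input: RGB image (3 channels) concatenated with K probability maps.
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_classes, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, 3, padding=1),
        )

    def forward(self, image, prob_maps, steps=2):
        # Repeatedly re-estimate the per-class probability maps conditioned on the image.
        for _ in range(steps):
            logits = self.net(torch.cat([image, prob_maps], dim=1))
            prob_maps = torch.softmax(logits, dim=1)
        return prob_maps

# Example: refine scene-parsing maps with 150 classes (the ADE20K label set).
# refined = BoostingRefiner(150)(image, initial_probs)
```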

360+MCG-ICT-CAS_DET Yu Li (1,2),
Sheng Tang (2),
Min Lin (1),
Rui Zhang (1,2),
YunPeng Chen (3),
YongDong Zhang (2),
JinTao Li (2),
YuGang Han (1),
ShuiCheng Yan (1,3),

(1) Qihoo 360,
(2) Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China,
(3) National University of Singapore (NUS)
Technique Details for Object Detection Task

The new contributions of this system are three-fold: 1) Implicit sub-categories of background class, 2) Sink class when necessary, and 3) new semantic segmentation features.

For training:

(1) Implicit sub-categories of the background class: in Faster R-CNN [1], the "background" class is treated as ONE class, on the same footing as the individual object classes, yet it is quite diverse and cannot be described by a single pattern. We therefore use K output nodes, i.e. K patterns, to implicitly represent sub-categories of the background class, which considerably improves the ability to identify background.
(2) Sink class when necessary: it often happens that the ground-truth class receives a low probability and the prediction is therefore incorrect, since the probabilities over all classes sum to 1. To improve the chance of a low-probability ground-truth class winning, we add a so-called "sink" class, which absorbs some probability mass when the ground-truth class has low probability, pushing the other classes to probabilities even lower than the ground truth's so that the ground truth can win. We use the sink class in the loss function only when necessary, namely when the ground-truth class is not in the top-k list (a hedged sketch of this idea follows this entry).
(3) New semantic segmentation features: on one hand, motivated by [2], we generate weakly supervised segmentation features, which are used to train region-proposal scoring functions and let gradients flow among all branches. On the other hand, an independent segmentation model trained on the ILSVRC Scene Parsing dataset provides features for our detection network, which is expected to bring in both stuff and object information for the decision.
(4) Dilation as context: motivated by the dilated convolutions [3] widely used in segmentation, we introduce dilated convolutional layers (initialized as identity mappings) to obtain effective context for training.

For testing:
We utilize box refinement, box voting, multi-scale testing, co-occurrence refinement, and model ensembling to benefit the inference stage.

References:
[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[2] Gidaris, Spyros, and Nikos Komodakis. "Object detection via a multi-region and semantic segmentation-aware cnn model." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[3] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." International Conference on Learning Representations. 2016.
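The following is one possible reading of the "sink class when necessary" loss described above, sketched for illustration only; the top-k test, the extra cross-entropy term for the sink class, and the PyTorch framing are assumptions rather than the team's actual formulation.

```python
import torch
import torch.nn.functional as F

def sink_class_loss(logits, targets, k=5):
    """logits: (N, C+1), where the last column is the extra "sink" class;
    targets: (N,) ground-truth labels in [0, C-1]."""
    sink_index = logits.size(1) - 1
    # Standard cross-entropy on the ground-truth class.
    ce = F.cross_entropy(logits, targets, reduction="none")
    # Is the ground truth already among the top-k real classes?
    topk = logits[:, :sink_index].topk(k, dim=1).indices
    gt_in_topk = (topk == targets.unsqueeze(1)).any(dim=1)
    # Only when the ground truth is NOT in the top-k, also encourage the sink
    # class to absorb probability mass from the competing (wrong) classes.
    sink_targets = torch.full_like(targets, sink_index)
    sink_term = F.cross_entropy(logits, sink_targets, reduction="none")
    sink_term = torch.where(gt_in_topk, torch.zeros_like(sink_term), sink_term)
    return (ce + sink_term).mean()
```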
ABTEST Ankan Bansal We used a 22-layer GoogLeNet [1] model to classify scenes. The model was trained on the LSUN [2] dataset and then fine-tuned on the Places dataset for 365 categories. We did not use any intelligent data-selection techniques: the network is simply trained on all the available data, without considering the data distribution across classes.

Before training on LSUN, this network was trained using the Places205 dataset. The model was trained till it saturated at around 85% (Top-1) accuracy on the validation dataset of the LSUN challenge. Then the model was fine-tuned on the 365 categories in the Places2 challenge.

We did not use the trained models provided by the organisers to initialise our network.

References:
[1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2] Yu, Fisher, et al. "Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop." arXiv preprint arXiv:1506.03365 (2015).
ACRV-Adelaide Guosheng Lin;
Chunhua Shen;
Anton van den Hengel;
Ian Reid;

Affiliations: ACRV; University of Adelaide;
Our method is based on multi-level information fusion. We generate multi-level representation of the input image and develop a number of fusion networks with different architectures.
Our models are initialized from the pre-trained residual nets [1] with 50 and 101 layers. A part of the network design in our system is inspired by the multi-scale network with pyramid pooling which is described in [2] and the FCN network in [3].

Our system achieves good performance on the validation set: the IoU score is 40.3 when using a single model, which is clearly better than the reported results of the baseline methods in [4]. Applying DenseCRF [5] slightly improves the result.


We are preparing a technical report on our method; it will be available on arXiv soon.


References:
[1] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.
[2] "Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation", Guosheng Lin, Chunhua Shen, Anton van den Hengel, Ian Reid; CVPR 2016
[3] "Fully convolutional networks for semantic segmentation", J Long, E Shelhamer, T Darrell; CVPR 2015
[4] "Semantic Understanding of Scenes through ADE20K Dataset" B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442
[5] "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials", Philipp Krahenbuhl, Vladlen Koltun; NIPS 2012.
Adelaide Zifeng Wu, University of Adelaide
Chunhua Shen, University of Adelaide
Anton van den Hengel, University of Adelaide
We have trained networks with several newly designed structures. One of them performs as well as the Inception-ResNet-v2 network on the classification task. It was further tuned for several epochs on the Places365 dataset, which ultimately gave even better results on the validation set for the segmentation task. For the FCNs, we mostly followed the settings in our previous technical reports [1, 2]. The best result was obtained by combining FCNs initialized from the two pre-trained networks.

[1] High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks. https://arxiv.org/abs/1604.04339
[2] Bridging Category-level and Instance-level Semantic Image Segmentation. https://arxiv.org/abs/1605.06885
ASTAR_VA Romain Vial (VA Master Intern Student)
Zhu Hongyuan (VA Scientist)
Su Bolan (ex ASTAR Scientist)
Shijian Lu (VA Head)
The problem of object detection from videos is an important part of computer vision that has yet to be solved. The diversity of scenes and the presence of motion make this task very challenging.

Our system localizes and recognizes objects from various scales, positions and classes. It takes into account spatial (local and global) and temporal information from several previous frames.

The model has been trained on both the training and validation set. We achieve a final score on the validation set of 76.5% mAP.
BSC- UPC Andrea Ferri This is the result of my thesis: implementing a deep learning environment on a computational server and developing object tracking in video with TensorFlow, suitable for the ImageNet VID challenge.
BUAA ERCACAT Biao Leng (Beihang University), Guanglu Song (Beihang University), Cheng Xu (Beihang University), Jiongchao Jin (Beihang University), Zhang Xiong (Beihang University)
Our group utilizes two image object detection architectures, Fast R-CNN [2] and Faster R-CNN [1], for the object detection task. The Faster R-CNN detection system can be divided into two modules: an RPN (region proposal network), a fully convolutional network that proposes regions indicating where the detector should focus in an image, and a Fast R-CNN detector that uses the region proposals and classifies the objects within them.
Our training model is based on the VGG-16 model, and we utilize a combined model for higher RPN recall.

[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," arXiv:1506.01497.
[2] Ross Girshick, "Fast R-CNN," ICCV 2015.
CASIA_IVA Jun Fu, Jing Liu, Xinxin Zhu, Longteng Guo, Zhenwei Shen, Zhiwei Fang, Hanqing Lu We implement image semantic segmentation based on the fused result of three deep models: DeepLab [1], OA-Seg [2], and the official public model released for this challenge (a minimal fusion sketch follows this entry). DeepLab is trained with the ResNet-101 framework and is further improved with object proposals and multi-scale prediction combination. OA-Seg is trained with VGG, with object proposals and multi-scale supervision. We augment the training data with multi-scale and mirrored variants for both of the above models. We additionally employ multi-label image annotations to refine the segmentation results.
[1] Liang-Chieh Chen et al., "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," arXiv:1606.00915, 2016.
[2] Yuhang Wang et al., "Objectness-aware Semantic Segmentation," ACM Multimedia, 2016.
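As a concrete illustration of the kind of late fusion described in the CASIA_IVA entry above, here is a minimal sketch (an assumption, not the team's code) that averages per-pixel class probabilities from several segmentation models and takes the per-pixel argmax.

```python
import numpy as np

def fuse_segmentations(prob_maps):
    """prob_maps: list of (H, W, C) arrays of per-pixel class probabilities
    from different models; returns the fused (H, W) label map."""
    fused = np.mean(np.stack(prob_maps, axis=0), axis=0)  # average the models
    return fused.argmax(axis=-1)                          # best class per pixel
```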
Choong Choong Hwan Choi (KAIST) Ensemble of deep learning models based on VGG16 & ResNet.
Based on VGG16, features are extracted from multiple layers. No ROI proposal network is applied; every neuron in each feature layer serves as the center of an ROI candidate (see the sketch after this entry).
References:
[1] Liu, Wei, et al., "SSD: Single Shot MultiBox Detector"
[2] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"
[3] Kaiming He, et. al., "Deep Residual Learning for Image Recognition"
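As a hedged illustration of the anchor scheme described in the Choong entry above (every feature-map neuron acting as the center of an ROI candidate), the sketch below maps each spatial position of a feature map back to image coordinates and emits candidate boxes at a few scales; the stride and scale values are placeholder assumptions, not the team's settings.

```python
import itertools

def roi_candidates(feature_h, feature_w, stride, scales=(32, 64, 128)):
    """Generate ROI candidates centered on every feature-map position."""
    rois = []
    for y, x in itertools.product(range(feature_h), range(feature_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # center in image coordinates
        for s in scales:
            rois.append((cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2))
    return rois
```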

CIGIT_Media Youji Feng, Jiangjing Lv, Xiaohu Shao, Pengcheng Liu, Cheng Cheng

Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences
We present a simple method that combines still-image object detection and object tracking for the ImageNet VID task. Object detection is first performed on each frame of the video, and the detected targets are then tracked through the nearby frames. Each tracked target is also assigned a detection score by the object detector. According to these scores, non-maximum suppression (NMS) is applied to all detected and tracked targets on each frame to obtain the VID results. To improve performance, we employ two state-of-the-art detectors for still-image object detection, namely the R-FCN detector and the SSD detector. We run the above steps for both detectors independently and combine the respective results into the final ones through NMS (a minimal fusion sketch follows this entry).

[1] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single Shot MultiBox Detector. arXiv 2016.
[3] K. Kang, W. Ouyang, H. Li, and X. Wang. Object Detection from Video Tubelets with Convolutional Neural Networks. CVPR 2016.
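A minimal, self-contained sketch of the class-wise NMS fusion described in the CIGIT_Media entry above; the IoU threshold and data layout are assumptions made for illustration, not the team's actual implementation.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """boxes: (N, 4) as x1, y1, x2, y2; returns indices of kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thr]   # drop boxes overlapping the kept one
    return keep

def fuse_detectors(dets_a, dets_b, iou_thr=0.5):
    """Each dets_* is a dict: class_id -> (boxes (N, 4), scores (N,))."""
    fused = {}
    for cls in set(dets_a) | set(dets_b):
        boxes = np.concatenate([d[cls][0] for d in (dets_a, dets_b) if cls in d])
        scores = np.concatenate([d[cls][1] for d in (dets_a, dets_b) if cls in d])
        keep = nms(boxes, scores, iou_thr)
        fused[cls] = (boxes[keep], scores[keep])
    return fused
```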
CIL Seongmin Kang
Seonghoon Kim
Yusun Lim
Kibum Bae
Heungwoo Han
Our model is based on Faster R-CNN [1].
A pre-activation residual network [2] trained on the ILSVRC 2016 dataset is modified for the detection task.
Heavy data augmentation is applied, along with OHEM [3] and atrous convolution (see the sketch after this entry).
All of this is implemented in TensorFlow with multi-GPU training [4].

To meet the deadline, the detection model was trained for only 1/3 of the training epochs we had planned.

[1] Shaoqing Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," NIPS 2015.
[2] Kaiming He et al., "Identity Mappings in Deep Residual Networks," ECCV 2016.
[3] Abhinav Shrivastava et al., "Training Region-based Object Detectors with Online Hard Example Mining," CVPR 2016.
[4] Martín Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems," 2015. Software available from tensorflow.org.
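A short sketch of the OHEM idea cited above [3], included for illustration under the assumption of a plain classification loss over RoIs: only the highest-loss examples contribute to the gradient. This is not CIL's code.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, targets, keep=128):
    """Online hard example mining: average the loss over the hardest RoIs only."""
    losses = F.cross_entropy(logits, targets, reduction="none")
    k = min(keep, losses.numel())
    hard = losses.topk(k).values  # the k highest-loss (hardest) examples
    return hard.mean()
```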

CU-DeepLink Major team members
-------------------

Xingcheng Zhang ^1
Zhizhong Li ^1
Yang Shuo ^1
Yuanjun Xiong ^1
Yubin Deng ^1
Xiaoxiao Li ^1
Kai Chen ^1
Yingrui Wang ^2
Chen Huang ^1
Tong Xiao ^1
Wanshen Feng ^2
Xinyu Pan ^1
Yunxiang Ge ^1
Hang Song ^1
Yujun Shen ^1
Boyang Deng ^1
Ruohui Wang ^1

Supervisors
------------

Dahua Lin ^1
Chen Change Loy ^1
Wenzhi Liu ^2
Shengen Yan ^2

1 - Multimedia Lab, The Chinese University of Hong Kong.
2 - SenseTime Inc.
Our efforts are divided into two relatively independent directions, namely classification and localization. Specifically, the classification framework would predict five distinct class labels for each image, while the localization framework would produce bounding boxes, one for each predicted class label.

Classification
------------------

Our classification framework is built on top of Google's Inception-ResNet-v2 (IR-v2) [1]. We combined several important techniques, which together lead to a substantial performance gain.

1. We developed a novel building block called the "PolyInception". Each PolyInception can be considered a meta-module that integrates multiple inception modules via K-way polynomial composition, which substantially improves a module's expressive power. To facilitate the propagation of gradients across a very deep network, we retain an identity path [2] for each PolyInception (a schematic sketch follows this entry).
2. At the core of our framework are the Grand Models. Each grand model comprises three sections operating on different spatial resolutions, each section being a stack of multiple PolyInception modules. To achieve optimal overall performance within a given computational budget, we rebalance the number of modules across the sections.
3. Most of our grand models contain over 500 layers. While they demonstrate remarkable model capacity, we observed notable overfitting at later stages of the training process. To overcome this difficulty, we adopted Stochastic Depth [3] for regularization.
4. We trained 20+ Grand Models, some deeper and others wider. These models constitute a performant yet diverse ensemble. The single most powerful Grand Model reached a top-5 classification error of 4.27% (single crop) on the validation set.
5. For each image, the class label predictions are produced in two steps. First, multiple crops at 8 scales are generated; predictions are made on these crops and combined via a novel scheme called selective pooling. The multi-crop predictions generated by the individual models are then integrated to reach the final prediction. In particular, we explored two integration strategies, namely the ensemble-net (a two-layer neural network designed to integrate predictions) and class-dependent model reweighting. With these ensemble techniques, we reached a top-5 classification error below 2.8% on the validation set.

Localization
-----------------

Our localization framework is a pipeline comprised of Region Proposal Networks (RPN) and R-CNN models.

1. We trained two RPNs with different design parameters based on ResNet.
2. Given an image, 300 bounding box proposals are derived based on the RPNs, using multi-scale NMS pooling.
3. We also trained four R-CNN models, respectively based on ResNet-101, ResNet-269, Extended IR-v2, and one of our Grand Models. These R-CNNs are used to predict how likely a bounding box belongs to each class as well as to refine the bounding box (via bounding box regression).
4. The four RCNN models form an ensemble. Their predictions (on both class scores and refined bounding boxes) are integrated via average pooling. Given a class label, the refined bounding box with highest score corresponding to that class is used as the result.

Deep Learning Framework
-----------------

Both our classification and localization frameworks are implemented using Parrots, a new deep learning framework developed internally by ourselves from scratch. Parrots features a highly scalable distributed training scheme, a memory manager that supports dynamic memory reuse, and a parallel preprocessing pipeline. With this framework, the training time is substantially reduced, and, within the same GPU memory capacity, much larger networks can be accommodated.

References
-----------------

[1] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv:1602.07261. 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv:1603.05027. 2016.
[3] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Weinberger. "Deep Networks with Stochastic Depth". arXiv:1603.09382. 2016.
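For illustration, here is a highly simplified sketch of a PolyInception-style block as described in the CU-DeepLink entry above: an identity path plus first- and second-order compositions of a shared inner unit F, i.e. x + F(x) + F(F(x)). The inner unit below is a placeholder conv stack rather than an actual Inception module, and the composition order and sharing of F are assumptions made for the sketch.

```python
import torch.nn as nn

class PolyBlock(nn.Module):
    """Identity path plus polynomial compositions of a shared unit F."""
    def __init__(self, channels, order=2):
        super().__init__()
        self.order = order
        self.f = nn.Sequential(               # placeholder for an Inception unit
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        out, term = x, x
        for _ in range(self.order):
            term = self.f(term)               # F(x), then F(F(x)), ...
            out = out + term                  # identity + polynomial terms
        return out
```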
CUImage Wanli Ouyang, Junjie Yan, Xingyu Zeng, Hongsheng Li, Tong Xiao, Kun Wang, Xin Zhu, Yucong Zhou, Yu Liu, Buyu Li, Zhiwei Fang, Changbao Wang, Zhe Wang, Hui Zhou, Liping Zhang, Xingcheng Zhang, Zhizhong Li, Hongyang Li, Ruohui Wang, Shengen Yan, Dahua Lin, Xiaogang Wang Compared with CUImage submission in ILSVRC 2015, the new components are as follows. 
(1) The models are pretrained for 1000-class object detection task using the approach in [a] but adapted to the fast-RCNN for faster detection speed.
(2) The region proposal is obtained using the improved version of CRAFT in [b].
(3) A GBD network [c] with 269 layers is fine-tuned on 200 detection classes with the gated bidirectional network (GBD-Net), which passes messages between features from different support regions during both feature learning and feature extraction. The GBD-Net is found to bring ~3% mAP improvement on the baseline 269 model and ~5% mAP improvement on the Batch normalized GoogleNet. 
(4) To handle the long-tail distribution problem, the 200 classes are clustered. Different from the original implementation in [d], which learns several models, a single model is learned in which different clusters have both shared and distinct feature representations.
(5) An ensemble of the models using the approaches mentioned above leads to the final result in the provided data track.
(6) For the external data track, we propose object detection with landmarks. Compared to the standard bounding-box-centric approach, our landmark-centric approach provides more structural information and can be used to improve both the localization and classification steps in object detection. Based on the landmark annotations provided in [e], we annotate 862 landmarks from 200 categories on the training set. We then use them to train a CNN regressor to predict the landmark positions and visibility for each proposal in testing images. In the classification step, we use landmark pooling on top of the fully convolutional network, where features around each landmark are mapped to a confidence score for the corresponding category. The landmark-level classification can be naturally combined with standard bounding-box-level classification to get the final detection result.
(7) An ensemble of the models using the approaches mentioned above leads to the final result in the external data track.


Our training is supported by the fastest publicly available multi-GPU Caffe code [f].


[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015. 
[b] Yang, B., Yan, J., Lei, Z., Li, S. Z. "Craft objects from images." CVPR 2016.
[c] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, “Gated Bi-directional CNN for Object Detection,” ECCV 2016.
[d] Ouyang, W., Wang, X., Zhang, C., Yang, X. Factors in Finetuning Deep Model for Object Detection with Long-tail Distribution. CVPR 2016.
[e] Wanli Ouyang, Hongyang Li, Xingyu Zeng, and Xiaogang Wang, "Learning Deep Representation with Large-scale Attributes", In Proc. ICCV 2015. 
[f] https://github.com/yjxiong/caffe 
CUVideo Hongsheng Li*, Kai Kang* (* indicates equal contribution), Wanli Ouyang, Junjie Yan, Tong Xiao, Xingyu Zeng, Kun Wang, Xihui Liu, Qi Chu, Junming Fan, Yucong Zhou, Yu Liu, Ruohui Wang, Shengen Yan, Dahua Lin, Xiaogang Wang

The Chinese University of Hong Kong, SenseTime Group Limited
We utilize several deep neural networks with different structures for the VID task.

(1) The models are pretrained for 200-class detection task using the approach in [a] but adapted to the fast-RCNN for faster detection speed.
(2) The region proposal is obtained by a separately-trained ResNet-269 model.
(3) A GBD network [b] with 269 layers is fine-tuned on 200 detection classes of the DET task and then on the 30 classes of the VID task. It passes messages between features from different support regions during both feature learning and feature extraction. The GBD-Net is found to bring ~3% mAP improvement on the baseline 269 model.
(4) Based on detection boxes of individual frames, tracklet proposals are efficiently generated by trained bounding box regressors. An LSTM network is integrated into the network to learn temporal appearance variation.
(5) Multi-context suppression and motion-guided propagation in [c] are utilized to post-process the per-frame detection results. They result in a ~3.5% mAP improvement on the validation set.
(6) An ensemble of the models using the approaches mentioned above leads to the final result in the provided data track.
(7) For the VID with tracking task, we modified an online multiple object tracking algorithm [d]. The tracking-by-detection algorithm utilizes our per-frame detection results and generates tracklets for different objects.

Our training is supported by the fastest publicly available multi-GPU Caffe code [e].


[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015.
[b] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, “Gated Bi-directional CNN for Object Detection,” ECCV 2016.
[c] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, W. Ouyang, “T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos”, arXiv:1604.02532
[d] J. H. Yoon, C.-R. Lee, M.-H. Yang, K.-J. Yoon, “Online Multi-Object Tracking via Structural Constraint Event Aggregation”, CVPR 2016
[e] https://github.com/yjxiong/caffe
Deep Cognition Labs Mandeep Kumar, Deep Cognition Labs
Krishna Kishore, Deep Cognition Labs
Rajendra Singh, Deep Cognition Labs
We present results for the scene parsing task, acquired using a modified DeepLab VGG-16 network along with a CRF.
DEEPimagine Sung-soo Park(DEEPimagine corp.)
Hyoung-jin Moon(DEEPimagine corp.)

Contact email : sspark@deepimagine.com
1.Model design
- Wide Residual SWAPOUT network
- Inception Residual SWAPOUT network
- We focused on the model multiplicity with many shallow networks
- We adopted a SWAPOUT architecture

2.Ensemble
- Fully convolutional dense crop
- Variant parameter model ensemble


[1] " Swapout: Learning an ensemble of deep architectures"
Saurabh Singh, Derek Hoiem, David Forsyth

[2] " Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning"
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi

[3] " Deep Residual Learning for Image Recognition "
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
DeepIST Heechul Jung*(DGIST/KAIST), Youngsoo Kim*(KAIST), Byungju Kim(KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Junho Yim(KAIST), Min-Kook Choi(DGIST), Yeakang Lee(KAIST), Soon Kwon(DGIST), Woo Young Jung(DGIST), Junmo Kim(KAIST)
* indicates equal contribution.
We basically use nine networks. Networks consist of one 200-layer ResNet, one Inception-ResNet v2, one Inception v3 Net, two 212-layer ResNets and four Branched-ResNets.
Networks are trained for 95 epochs except Inception-ResNet v2 and Inception v3.
Ensemble A takes an average of one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet v2.
Ensemble B takes a weighted sum over one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet v2.
Ensemble C takes an average of one 200-layer ResNet, two 212-layer ResNets, two Branched-ResNets, one Inception v3 and one Inception-ResNet v2. It achieves a top-5 error rate of 3.16% for 20000 validation images.
Ensemble D takes an averaged result on all nine networks.

We submit only classification results.

References:
[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[2] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
[3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
[4] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).
[5] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).
[6] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

Acknowledgement
- DGIST was funded by the Ministry of Science, ICT and Future Planning.
- KAIST was funded by Hanwha Techwin CO., LTD.
DGIST-KAIST Heechul Jung(DGIST/KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Min-Kook Choi(DGIST), Soon Kwon(DGIST), Junmo Kim(KAIST), Woo Young Jung(DGIST) We basically use an ensemble of state-of-the-art architectures [1,2,3,4], as follows:
[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[2] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
[3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
[4] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).

We train five deep neural networks: two 212-layer ResNets, a 224-layer ResNet, an Inception-v3, and an Inception-ResNet v2. The models are linearly combined by a weighted sum of class probabilities, with the weights chosen on the validation set to obtain an appropriate contribution for each model (a minimal sketch of this weighted combination follows).
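A minimal sketch of this weighted combination; the weights shown are hypothetical placeholders, not the tuned values.

import numpy as np

def weighted_ensemble(prob_list, weights):
    """Linearly combine class-probability vectors from several networks.

    prob_list: list of arrays, each of shape (num_classes,), the softmax
               outputs of the individual models for one image.
    weights:   per-model weights, e.g. chosen on the validation set so that
               better models contribute more; assumed to sum to 1 here.
    """
    stacked = np.stack(prob_list)                          # (num_models, num_classes)
    return (np.asarray(weights)[:, None] * stacked).sum(axis=0)

# hypothetical weights for five models
weights = [0.25, 0.25, 0.20, 0.15, 0.15]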

- This work was funded by the Ministry of Science, ICT and Future Planning.
DPAI Vison Object detection: Chris Li, Savion Zhao, Bin Liu, Yuhang He, Lu Yang, Cena Liu
Scene classification: Lu Yang, Yuhang He, Cena Liu, Bin Liu, Bo Yu
Scene parsing: Bin Liu, Lu Yang, Yuhang He, Cena Liu, Bo Yu, Chris Li, Xiongwei Xia
Object detection from video: Bin Liu, Cena Liu, Savion Zhao, Yuhang He, Chris Li
Object detection: Our method is based on Faster R-CNN and an extra classifier. (1) Data processing: data equalization by deleting many examples in the three dominating classes (person, dog, and bird) and adding extra data for classes with fewer than 1000 training images; (2) COCO pre-training; (3) iterative bounding box regression + multi-scale (train/test) + random image flipping (train/test); (4) multi-model ensemble: ResNet-101 and Inception-v3; (5) an extra classifier over the 200 classes, which helps to improve recall and refine the detection scores of the final boxes.
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.
[2] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.

Scene classification: We trained the models with Caffe [1] and use an ensemble of Inception-V3 [2] and Inception-V4 [3]; four models are integrated in total. The top-1 error on validation is 0.431 and the top-5 error is 0.129. The single model is a modified Inception-V3 [2], with a top-1 error of 0.434 and a top-5 error of 0.133 on validation.
[1] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093. 2014.
[2]C.Szegedy,V.Vanhoucke,S.Ioffe,J.Shlens,andZ.Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[3] C.Szegedy,S.Ioffe,V.Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261, 2016.

Scene parsing: We trained 3 models on a modified DeepLab [1] (Inception-v3, ResNet-101, ResNet-152) and only used the ADEChallengeData2016 [2] data. Multi-scale processing, image cropping, image flipping and contrast transformations are used for data augmentation, and a dense CRF is used as post-processing to refine object boundaries. Combining the 3 models achieves 0.3966 mIoU and 0.7924 pixel accuracy on validation.
[1] L. Chen, G. Papandreou, I. K.; Murphy, K.; and Yuille, A. L. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In arXiv preprint arXiv:1606.00915.
[2] B. Zhou, H. Zhao, X. P. S. F. A. B., and Torralba, A. 2016. Semantic understanding of scenes through the ade20k dataset. In arXiv preprint arXiv:1608.05442.

Object detection from video: Our method is based on Faster R-CNN and an extra classifier. We train Faster R-CNN based on ResNet-101 with the provided training data. We also train an extra classifier over the 30 classes, which helps to improve recall and refine the detection scores of the final boxes.
DPFly Savion DP.co We fine-tune the detection models using the DET training set and the val1 set. The val2 set is used for validation.
Data processing: since some categories have many more images than others, we process the initial data so that the number of images per category is nearly equal.
We use a ResNet-101 model with Faster R-CNN. The networks are pre-trained on the 1000-class ImageNet classification set and fine-tuned on the DET data.
Box refinement: in Faster R-CNN, the final output is a regressed box that differs from its proposal box. For inference, we therefore pool a new feature from the regressed box and obtain a new classification score and a new regressed box. We combine these 300 new predictions with the original 300 predictions. Non-maximum suppression (NMS) is applied to the union set of predicted boxes using an IoU threshold of 0.3 (a sketch of this NMS step appears at the end of this entry).
Multi-scale testing: in our current implementation, we compute convolutional feature maps on an image pyramid where the image's shorter side is 300, 450 or 600.
Multi-scale anchors: we add two anchor scales to the original anchor scales of Faster R-CNN.
Flipped testing: we flip the image and combine the results with those from the original image.
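A minimal sketch of the greedy NMS step referred to above (standard NMS over the union of original and regressed boxes; this is a generic illustration, not the authors' exact code).

import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of (x1, y1, x2, y2) boxes.
    scores: (N,) detection scores.
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        order = order[1:][iou <= iou_threshold]   # drop heavily overlapping boxes
    return keep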
Everphoto Yitong Wang, Zhonggan Ding, Zhengping Wei, Linfu Wen

Everphoto
Our method is based on DCNN approaches.

We use 5 models with different input scales and different network structures as basic models. They are derived from GoogleNet, VGGNet and ResNet.

We also utilize the idea of dark knowledge [1] to train several specialist models, and use these specialist models to reassign probability scores and refine the basic outputs.

Our final results are based on the ensemble of refined outputs.
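The exact specialist-training recipe is not given above, so the following is only an assumed illustration of the dark-knowledge idea in [1]: a soft-target loss that trains a model against the temperature-softened outputs of another model. The temperature and mixing weight are hypothetical values.

import torch
import torch.nn.functional as F

def dark_knowledge_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Soft-target ("dark knowledge") loss in the style of [1].

    student_logits, teacher_logits: (batch, num_classes) raw scores.
    targets: (batch,) ground-truth class indices.
    T: softmax temperature; alpha: weight between soft and hard terms
       (both hypothetical values here).
    """
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # the KL term is scaled by T^2, as recommended in [1]
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss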

[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
F205_CV Cheng Zhou
Li Jiancheng
Lin Zhihui
Lin Zhiguan
Yang Dali
All came from Tsinghua university, Graduate School at ShenZhen Lab F205,China
Our team has five student members from Tsinghua University, Graduate School at Shenzhen, Lab F205, China. We participated in two sub-tasks of the ILSVRC2016 & COCO challenge, namely Scene Parsing and Object Detection from Video. This is the first time we have attended this competition.
Two of the members focused on Scene Parsing. They mainly applied several model-fusion algorithms to well-known and effective CNN models such as ResNet [1], FCN [2] and DilatedNet [3, 4], and used a CRF to capture more context and improve the classification accuracy and mean IoU. Since the images are large, they are downsampled before being fed to the network. In addition, we used vertical mirroring for data augmentation. The Places2 scene classification 2016 pretrained model was used to fine-tune ResNet-101 and FCN, while DilatedNet was fine-tuned from the Places2 scene parsing 2016 pretrained model [5]. Late fusion and the CRF were added at the end.
For object detection from video, the biggest challenge is that there are more than 2 million very-high-resolution images in total. We did not consider Fast R-CNN [6]-style models, as they require much more training and testing time, so we chose SSD [7], an effective and efficient framework for object detection. We used ResNet-101 as the base model, although it is slower than VGGNet [8]; at test time it achieves about 10 FPS on a single GTX TITAN X GPU. However, with more than 700 thousand images in the test set, testing still took a long time. For the tracking task we have a dynamic adjustment algorithm, but it needs a ResNet-101 model for scoring each patch and runs at less than 1 FPS, so we could not apply it to the test set. For the submission, we used a simple method to filter noisy proposals and track the objects.

References:
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.
[2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3431-3440.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.
[4] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[5] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442
[6] Girshick R. Fast r-cnn[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440-1448.
[7] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
Faceall-BUPT Xuankun HUANG, BUPT, CHINA
Jiangqi ZHANG, BUPT, CHINA
Zhiqun HE, BUPT, CHINA
Junfei ZHUANG, BUPT, CHINA
Zesang HUANG, BUPT, CHINA
Yongqiang Yao, BUPT, CHINA
Kun HU, BUPT, CHINA
Fengye XIONG, BUPT, CHINA
Hongliang BAI, Beijing Faceall co., LTD
Wenjian FENG, Beijing Faceall co., LTD
Yuan DONG, BUPT, CHINA
# Classification/Localization
We trained ResNet-101, ResNet-152 and Inception-v3 for object classification. Multi-view testing and model ensembling are utilized to generate the final classification results.
For the localization task, we trained a Region Proposal Network to generate proposals for each image, and we fine-tuned two models with object-level annotations of the 1,000 classes. Moreover, a background class is added to the network. Test images are then segmented into 300 regions by the RPN, and these regions are classified by the fine-tuned model into one of 1,001 classes. The final bounding box is generated by merging the bounding rectangles of three regions.

# Object detection
We utilize Faster R-CNN with the publicly available ResNet-101. On top of the baseline, we adopt multi-scale RoIs to obtain features containing richer context information. For testing, we use 3 scales and merge the results using the simple strategy introduced last year.

No validation data is used for training, and flipped images are used in only a third of the training epochs.

# Object detection from video
We use Faster R-CNN with Resnet-101 to do this as in the object detection task. One fifth of the images are tested with 2 scales. No tracking techniques are used because of some mishaps.

# Scene classification
We trained a single Inception-v3 network with multi-scale and tested with multi-view of 150 crops.
On validation the top-5 error is about 14.56%.

# Scene parsing
We trained 6 models with network structures inspired by FCN-8s and DilatedNet at 3 scales (256, 384, 512). We then test with flipped images using the pre-trained FCN-8s and DilatedNet. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552.
fusionf Nina Narodytska (Samsung Research America)
Shiva Kasiviswanathan (Samsung Research America)
Hamid Maei (Samsung Research America)
We used several modifications of modern CNNs, including VGG[1], GoogleNet[2,4], and ResNet[3]. We used several fusion strategies,
including a standard averaging and scoring scheme. We also used different subsets of models in different submissions. Training was
performed on the low-resolution dataset. We used balanced loading to take into account the different numbers of images in each class.

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for
large-scale image recognition.

[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D.
Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Deep Residual Learning for Image Recognition

[4]Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi
Inception-v4, Inception-ResNet and the Impact of Residual Connections
on Learning

Future Vision Gautam Kumar Singh(independent)
Kunal Kumar Singh(independent)
Priyanka Singh(independent)
Future Vision Project (based on MatConvNet)
========================================


This is an extremely simple CNN model which may not be competitive in the ILSVRC competition. Our major goal was to obtain a working CNN model that can later be enhanced to work efficiently at ILSVRC standards.

This project ran on following configurations :

processor : Intel core i3-4005U CPU @ 1.70GHz (4cpus)
RAM: 4GB

As we had no advanced hardware resources such as GPUs or high-speed CPUs, cuDNN could not be used either. So we could not train on this vast dataset and were forced to use this simple shallow model. We also discarded about 90% of the data; only 10% of the data was used in this project, equally divided into training and test data.

Places2 validation data : NOT USED
Places2 training data : 90% discarded

This 10% of the training data was further divided equally into two parts ('train' and 'test'), which were used for training and testing in this project.


The output text file on the Places2 test images could not be produced, as we were facing some technical difficulties and were running out of time.


Reference: we referenced this project: http://www.cc.gatech.edu/~hays/compvision/proj6


Future Vision Team :

Gautam Kumar Singh
Kunal Kumar Singh
Priyanka Singh
Hikvision Qiaoyong Zhong*, Chao Li, Yingying Zhang(#), Haiming Sun*, Shicai Yang*, Di Xie, Shiliang Pu (* indicates equal contribution)

Hikvision Research Institute
(#)ShanghaiTech University, work is done at HRI
[DET]
Our work on object detection is based on Faster R-CNN. We design and validate the following improvements:
* Better network. We find that the identity-mapping variant of ResNet-101 is superior for object detection over the original version.
* Better RPN proposals. A novel cascade RPN is proposed to refine proposals' scores and location. A constrained neg/pos anchor ratio further increases proposal recall dramatically.
* Pretraining matters. We find that a pretrained global context branch increases mAP by over 3 points. Pretraining on the 1000-class LOC dataset further increases mAP by ~0.5 point.
* Training strategies. To attack the imbalance problem, we design a balanced sampling strategy over different classes. With balanced sampling, the provided negative training data can be safely added for training. Other training strategies, like multi-scale training and online hard example mining are also applied.
* Testing strategies. During inference, multi-scale testing, horizontal flipping and weighted box voting are applied (a sketch of the box-voting step follows below).
The final mAP is 65.1 (single model) and 67 (ensemble of 6 models) on val2.
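The exact box-voting variant is not specified above; the following Python sketch illustrates one common form of weighted box voting, in which each box that survives NMS is refined by the score-weighted average of all detections overlapping it above an assumed IoU threshold.

import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between an (N,4) and an (M,4) set of (x1,y1,x2,y2) boxes."""
    area_a = (a[:, 2] - a[:, 0] + 1) * (a[:, 3] - a[:, 1] + 1)
    area_b = (b[:, 2] - b[:, 0] + 1) * (b[:, 3] - b[:, 1] + 1)
    xx1 = np.maximum(a[:, None, 0], b[None, :, 0])
    yy1 = np.maximum(a[:, None, 1], b[None, :, 1])
    xx2 = np.minimum(a[:, None, 2], b[None, :, 2])
    yy2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.maximum(0, xx2 - xx1 + 1) * np.maximum(0, yy2 - yy1 + 1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def weighted_box_voting(kept_boxes, all_boxes, all_scores, iou_thresh=0.5):
    """Refine each NMS-surviving box by the score-weighted average of all
    detections that overlap it by at least iou_thresh (threshold is assumed)."""
    ious = iou_matrix(kept_boxes, all_boxes)
    voted = np.empty_like(kept_boxes, dtype=float)
    for i in range(len(kept_boxes)):
        members = ious[i] >= iou_thresh          # detections voting for this box
        w = all_scores[members]
        voted[i] = (w[:, None] * all_boxes[members]).sum(axis=0) / w.sum()
    return voted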

[CLS-LOC]
A combination of 3 Inception networks and 3 residual networks is used to make the class prediction. For localization, the same Faster R-CNN configuration described above for DET is applied. The top5 classification error rate is 3.46%, and localization error is 8.8% on the validation set.

[Scene]
For the scene classification task, drawing support from our newly built M40-equipped GPU clusters, we have trained more than 20 models with various architectures, such as VGG, Inception, ResNet and their variants, over the past two months. Fine-tuning very deep residual networks from pre-trained ImageNet models, like ResNet-101/152/200, did not perform as well as we expected; according to our experiments, Inception-style networks achieve better performance in considerably less training time. Based on this observation, deep Inception-style networks and not-so-deep residual networks have been used. Besides, we have made several improvements for training and testing. First, a new data augmentation technique is proposed to better utilize the information in the original images. Second, a new learning rate schedule is adopted. Third, label shuffling and label smoothing are used to tackle the class imbalance problem (a sketch of the label-smoothing loss is given below). Fourth, some small tricks are used to improve performance in the test phase. Finally, we achieved a very good top-5 error rate, below 9% on the validation set.
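The label-smoothing trick mentioned above can be written as a modified cross-entropy; below is a minimal PyTorch sketch with a hypothetical smoothing factor, not necessarily the value used by the authors.

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy with smoothed labels: the target distribution puts
    1 - epsilon on the true class and spreads epsilon uniformly over the
    remaining classes. epsilon = 0.1 is an assumed value."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, epsilon / (num_classes - 1))
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon)
    return -(smooth * log_probs).sum(dim=1).mean()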

[Scene Parsing]
We utilize a fully convolutional network transferred from VGG-16 net, with a module, called mixed context network, and a refinement module appended to the end of the net. The mixed context network is constructed by a stack of dilated convolutions and skip connections. The refinement module generates predictions by making use of output of the mixed context network and feature maps from early layers of FCN. The predictions are then fed into a sub-network, which is designed to simulate message-passing process. Compared with baseline, our first major improvement is that, we construct the mixed context network, and find that it provides better features for dealing with stuff, big objects and small objects all at once. The second improvement is that, we propose a memory-efficient sub-network to simulate message-passing process. The proposed system can be trained end-to-end. On validation set, the mean iou of our system is 0.4099 (single model) and 0.4156 (ensemble of 3 models), and the pixel accuracy is 79.80% (single model) and 80.01% (ensemble of 3 models).

References
[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[2] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." arXiv preprint arXiv:1604.03540 (2016).
[3] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[4] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
[5] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
[6] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).
[7] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
[8] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in ICLR, 2016.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[10] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," in ICCV, 2015.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", arXiv:1606.00915, 2016.
[12] P. O. Pinheiro, T. Lin, R. Collobert, P. Dollar, "Learning to Refine Object Segments", arXiv:1603.08695, 2016.
Hitsz_BCC Qili Deng,Yifan Gu,Mengdie Chu,Shuai Wu,Yong Xu
Harbin Institute of Technology,Shenzhen
We combined a residual learning framework with the Single Shot MultiBox Detector for object detection. When using ResNet-152, we fixed all batch-normalization layers as well as conv1 and conv2_x in the ResNet. Inspired by HyperNet, we exploit multi-layer features to detect objects.
Reference:
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.
[2] Kong T, Yao A, Chen Y, et al. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection[J]. arXiv preprint arXiv:1604.00600, 2016.
[3] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015.
hustvision Xinggang Wang, Huazhong University of Science and Technology
Kaibin Chen, Huazhong University of Science and Technology
We propose a very fast and accurate object detection method based on deep neural networks. The core of this method is an object detection loss layer named ConvBox, which directly regresses object bounding boxes. The ConvBox loss layer can be plugged into any deep neural network. In this competition, we choose GoogLeNet as the base network. Running on an Nvidia GTX 1080, training on ILSVRC 2016 takes about one day, and testing speed is about 60 fps. In the training images of this competition, many positive instances are not labelled. To deal with this problem, the proposed ConvBox loss tolerates hard negatives, which improves detection performance to some extent.
iMCB *Yucheng Xing,
*Yufeng Zhang,
Zhiqin Chen,
Weichen Xue,
Haohua Zhao,
Liqing Zhang
@Shanghai Jiao Tong University (SJTU)

(* indicates equal contribution)
In this competition, we submit five entries.

The first model is a single model, which achieved 15.24% top-5 error on the validation dataset. It is an Inception-V3 [1] model that is modified and trained on both the challenge and standard datasets [2]. At test time, images are resized to 337x337 and a 12-crop scheme is used to obtain the 299x299 inputs to the model, which improves performance.

The second model is a fusion-feature model (FeatureFusion_2L), which achieved 13.74% top-5 error on the validation dataset. It is a two-layer fusion-feature network whose input is the combination of fully-connected-layer features extracted from several well-performing CNNs (i.e. pretrained models [3], such as ResNet, VGG and GoogLeNet). It turns out to be effective in reducing the error rate.

The third model is also a fusion-feature network (FeatureFusion_3L), which achieved 13.95% top-5 error on the validation dataset. Compared with the second model, it is a three-layer fusion-feature network containing two fully-connected layers (a sketch of the fusion-feature idea follows).
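The fusion-feature idea in the second and third entries can be sketched as follows; this is only an illustrative PyTorch module with hypothetical feature dimensions, not the authors' implementation.

import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    """Minimal sketch of a two-layer fusion-feature classifier: it takes the
    concatenated fully-connected features of several pretrained CNNs and maps
    them to the 365 scene classes. All dimensions are hypothetical."""

    def __init__(self, feature_dims=(2048, 4096, 1024), num_classes=365, hidden=1024):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(sum(feature_dims), hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feature_list):
        # feature_list: one (batch, dim) tensor per backbone CNN
        fused = torch.cat(feature_list, dim=1)
        return self.classifier(fused)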

The fourth is a combination of CNN models with a strategy based on validation accuracy, which achieved 13% top-5 error on the validation dataset. It combines the probabilities produced by the softmax layers of three CNNs, where the influence factor of each CNN is determined by its validation accuracy.

The fifth is a combination of CNN models based on researched influence factors, which achieved 12.65% top-5 error on the validation dataset. Six CNNs are taken into consideration: four of them (Inception-V2, Inception-V3, FeatureFusion_2L and FeatureFusion_3L) are trained by us and the other two are pretrained. The influence factors of these models are optimized based on extensive experiments.


[1] Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." arXiv preprint arXiv:1512.00567 (2015).

[2]B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. "Places: An Image Database for Deep Scene Understanding." Arxiv, 2016.

[3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database."Advances in Neural Information Processing Systems 27 (NIPS), 2014.
isia_ICT Xinhan Song, Institute of Computing Technology
Chengpeng Chen, Institute of Computing Technology
Shuqiang jiang, Institute of Computing Technology
For convenience, we use the 4 provided models as our basic models for the following fine-tuning and network adaptation. Besides, considering the non-uniform distribution and the enormous number of images in the Challenge Dataset, we only use the Standard Dataset for all the following steps.
First, we fuse these models with an averaging strategy as the baseline. Then, we add an SPP layer to VGG16 and ResNet152 respectively so that the models can be fed with larger-scale images. After fine-tuning the models, we again fuse them with the averaging strategy, and we only submit the result for size 288.
We also perform spectral clustering on the confusion matrix computed on the validation data to obtain 20 clusters, i.e. the 365 classes are separated into 20 clusters mainly according to how often they are confused with each other (see the sketch below). To classify the classes within the same cluster more precisely, we train an extra classifier per cluster, implemented by fine-tuning the networks with all layers fixed except the fc8 layer and finally combining them into a single network.
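As an illustration of the clustering step (the exact settings are not given above), a minimal sketch using scikit-learn's spectral clustering on a symmetrised confusion matrix:

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_classes(confusion, n_clusters=20):
    """Group classes that are frequently confused with each other.

    confusion: (num_classes, num_classes) confusion matrix counted on the
               validation set; it is symmetrised and used as a precomputed
               affinity for spectral clustering.
    Returns an array assigning each class to one of n_clusters clusters.
    """
    affinity = (confusion + confusion.T) / 2.0    # symmetrise the counts
    np.fill_diagonal(affinity, 0.0)               # ignore self-confusion
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(affinity)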

D. Yoo, S. Park, J. Lee and I. Kweon. “Multi-scale pyramid pooling for deep convolutional representation”. In CVPR Workshop 2015
ITLab-Inha Byungjae Lee, Inha University,
Songguo Jin, Inha University,
Enkhbayar Erdenee, Inha University,
Mi Young Nam, NaeulTech,
Young Giu Jung, NaeulTech,
Phill Kyu Rhee, Inha University.
We propose a robust multi-class multi-object tracking (MCMOT) method formulated in a Bayesian framework [1]. Multi-object tracking for unlimited object classes is conducted by combining detection responses with a changing point detection (CPD) algorithm. The CPD model is used to observe abrupt or abnormal changes due to drift and occlusion, based on the spatiotemporal characteristics of the track states.

The ensemble of object detectors is based on Faster R-CNN [2] using VGG16 [3] and ResNet [4] adaptively. For parameter optimization, a POMDP-based parameter learning approach is adopted, as described in our previous work [5].

[1] “Multi-Class Multi-Object Tracking using Changing Point Detection”, Byungjae Lee, Enkhbayar Erdenee, Songguo Jin, Phill Kyu Rhee. arXiv 2016.
[2] “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. TPAMI 2016.
[3] “Very Deep Convolutional Networks for Large-Scale Image Recognition”, Karen Simonyan, Andrew Zisserman. arXiv 2015.
[4] “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.
[5] “Adaptive Visual Tracking using the Prioritized Q-learning Algorithm: MDP-based Parameter Learning Approach”, Sarang Khim, Sungjin Hong, Yoonyoung Kim, Phill Kyu Rhee. Image and Vision Computing 2014.
KAIST-SLSP Sunghun Kang*(KAIST)
Jae Hyun Lim*(ETRI)
Houjeung Han(KAIST)
Donghoon Lee(KAIST)
Junyeong Kim(KAIST)
Chang D. Yoo(KAIST)
(* indicates equal contribution)
For both image and video object detection, the Faster R-CNN detection algorithm proposed by Shaoqing Ren et al. is integrated in Torch with the various other state-of-the-art techniques described below. For both image and video we apply post-processing techniques such as box refinement and classification rescoring via a global context feature; classification rescoring, which combines the global context feature with the feature outputs, is conducted per prediction. To further enhance detection performance for video, classification probabilities within tracklets obtained by multiple object tracking were rescored by combining feature responses weighted over various combinations of tracklet lengths. Our architecture is an ensemble of several independently trained architectures. The Faster R-CNN based on a deep residual net is trained in an end-to-end manner, and for inference, model ensembling and box refinement are integrated into the two Faster R-CNN architectures.

For both image and video object detection, the following three key components (1-3) that include three post-processing techniques (pp1-3) are integrated in torch for end-to-end learning and inferencing:
(1) Deep residual net[1]
(2) Faster-R-CNN[2] with end2end training
(3) post-processing
(pp1) box refinement[3]
(pp2) model ensemble
(pp3) classification re-scoring via SVM using global context features


For only video object detection, the following post-processing techniques (pp4-5) are additionally included in conjunction with the above three post-processing techniques:
(3) post-processing
(pp4) multiple object tracking[4]
(pp5) tracklets re-scoring

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster {R-CNN}: Towards Real-Time Object Detection with Region Proposal Networks", Advances in Neural Information Processing Systems (NIPS), 2015


[3] Spyros Gidaris and Nikos Komodakis, "Object detection via a multi-region & semantic segmentation-aware CNN model", International Conference on Computer Vision (ICCV), 2015

[4] Hamed Pirsiavash, Deva Ramanan, and Charless C.Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011
KAISTNIA_ETRI Keun Dong Lee(ETRI)
Seungjae Lee(ETRI)
Yunhun Jang(KAIST)
Hankook Lee(KAIST)
Hyung Kwan Son(ETRI)
Jinwoo Shin(KAIST)



For the localization task, we use a variant of Faster R-CNN with ResNet, where the overall training procedure is similar to that in [1]. For the classification task, we used an ensemble of ResNet and GoogLeNet [2] with various data augmentations. We then recursively obtain attention regions in the input images to adjust the localization outputs, which are further tuned by class-dependent regression models.

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.


[2] Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv:1602.07261, 2016.
KPST_VB Nguyen Hong Hanh
Seungjae Lee
Junhyeok Lee
In this work, we used a pre-trained ResNet-200 (ImageNet) [1] and retrained the network on the Places365 Challenge data (256 by 256). We also estimated a scene probability using the output of the pretrained ResNet-200 and the scene vs. object (ImageNet 1000-class) distribution on the training data. For classification, we used an ensemble of the two networks with multiple crops, adjusted by the scene probability.

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

*Our work is performed by deep learning analysis tool(Deep SDK by KPST).
Lean-T Yuechao Gao, Nianhong Liu, Sen Li @ Tsinghua University For the object detection task, our detector is based on Faster R-CNN [1]. We used a pre-trained VGG16 [2] to initialize the net. We used Caffe [3] to train our model, and only 230K iterations were conducted. The DET images provided as negative training data were not used.
[1]Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).
[2]K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[3]Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014.
LZDTX Liu Yinan (Independent Individual);
Zhou Yusong (Beijing University of Posts and Telecommunications);
Deng Guanghui (BeijingUniversity of Technology);
Tuqiang (Beijing Insititute of Technology);
Xing Zongheng (University of Science & Technology Beijing);
This year, we focused on the Object Detection task because it is widely used in our projects and in other areas such as self-driving, robotics and image analysis. Researchers have long been trying to find a real-time object detection algorithm with relatively high accuracy. However, most proposed algorithms are proposal-based and turn the detection task into a classification task by classifying proposals selected from the image. Sliding windows is a widely used approach, but it produces too many proposals. In recent years some methods have combined traditional proposal selection with deep learning, such as R-CNN, and others have accelerated the feature extraction process, such as Fast R-CNN and Faster R-CNN, but these are still too slow for most real-time applications. Recently, proposal-free object detection methods such as YOLO and SSD have been proposed, and they are much faster than proposal-based methods. The drawback of YOLO and SSD is that they perform poorly on small objects, because both directly map a box from the image to the target object. To overcome this drawback, we add a deconvolutional structure to the SSD network. The basic idea of our network structure is to enlarge the output feature map: a larger feature map can provide more detailed prediction information and cover small target objects. Deconvolutional layers are the way we enlarge the SSD feature maps: we add 3 deconvolutional layers to the basic SSD network, and these layers output prediction results like the other SSD extra layers (a sketch of such a deconvolution branch follows below). Experimental results show that our deconv-SSD network improves on the 300x300 SSD baseline on the validation sets of ILSVRC 2016 and PASCAL VOC. We submit one model with input size 300x300, trained on an Nvidia Titan GPU with a batch size of 32.
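The deconvolution branch described above is sketched below as an illustrative PyTorch module; channel counts, anchor count and upsampling factor are assumptions, not the authors' settings.

import torch
import torch.nn as nn

class DeconvPredictionBranch(nn.Module):
    """Illustrative sketch of the idea above: a deconvolution (transposed
    convolution) enlarges an SSD feature map before the usual per-location
    class/box predictors are applied. All hyper-parameters are hypothetical."""

    def __init__(self, in_channels=512, num_classes=201, num_anchors=6):
        super().__init__()
        # 2x spatial upsampling of the incoming feature map
        self.deconv = nn.ConvTranspose2d(in_channels, 256, kernel_size=2, stride=2)
        self.cls_head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)
        self.loc_head = nn.Conv2d(256, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, feature_map):
        x = torch.relu(self.deconv(feature_map))
        return self.cls_head(x), self.loc_head(x)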

[1] Uijlings J R R, Sande K E A V D, Gevers T, et al. Selective Search for Object Recognition[J]. International Journal of Computer Vision, 2013, 104(2):154-171.
[2] Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge[J]. International Journal of Computer Vision, 2015, 115(3):211-252.
[3] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. Computer Science, 2015.
[4] Sermanet P, Eigen D, Zhang X, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks[J]. Eprint Arxiv, 2013.
[5] Girshick R. Fast R-CNN[J]. Computer Science, 2015.
[6] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015:1-1.
[7] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[J]. Computer Science, 2015.
[8] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. Computer Science, 2015.
MCC Lei You, Harbin Institute of Technology Shenzhen Graduate School
Yang Zhang, Harbin Institute of Technology Shenzhen Graduate School
Lingzhi Fu, Harbin Institute of Technology Shenzhen Graduate School
Tianyu Wang, Harbin Institute of Technology Shenzhen Graduate School
Huamen He, Harbin Institute of Technology Shenzhen Graduate School
Yuan Wang, Harbin Institute of Technology Shenzhen Graduate School
We combined and modified ResNet and Faster R-CNN for image classification, then constructed several detection models for target localization according to the classification results, and finally integrated the two steps to obtain the final results.

Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
Neural Information Processing Systems (NIPS), 2015
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
Russell Stewart, Mykhaylo Andriluka. End-to-end people detection in crowded scenes. CVPR, 2016.

MCG-ICT-CAS Sheng Tang (Corresponding email: ts@ict.ac.cn),
Bin Wang,
JunBin Xiao,
Yu Li,
YongDong Zhang,
JinTao Li

Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China

Technique Details for the Object Detection from Video (VID) Task:

For this year’s VID task, our primary contribution is that we propose a novel tracking framework based on two complementary kinds of tubelet generation methods which focus on precision and recall respectively, followed by a novel tubelet merging method. Under this framework, our main contributions are two-fold:
(1) Tubelet generation based on detection and tracking: We propose to sequentialize the detection bounding boxes of same object with different tracking methods to form two complementary kinds of tubelets. One is to use the detection bounding boxes to refine the optical-flow based tracking for precise tubelet generation. The other is to integrate the detection bounding boxes with multi-target tracking based on MDNet to recall missing tubelets.
(2) Overlapping and successive tubelet fusion: Based on the above two complementary tubelet generation methods, we propose a novel effective union method to merge two overlapping tubelets, and a concatenation method to merge two successive tubelets, which improves the final AP by a substantial margin.

Also three other tricks are used as below:
(1) Non-co-occurrence filtration: based on the co-occurrence relationships mined from the training dataset, we filter out detections that have lower scores and whose categories do not co-occur with those of the objects with the highest detection scores.
(2) Coherent reclassification: After generating the object tubelets based on detection results and optical flow, we propose a coherent reclassification method to get coherent categories throughout a tubelet.
(3) Efficient multi-target tracking with MDNet: we first choose anchor frames and exploit adjacent-frame information to determine reliable anchor targets for efficient tracking. Then, we track each anchor target with an MDNet tracker in parallel. Finally, we use still-image detection results to recall missing tubelets.

In our implementation, we use Faster R-CNN [1] with ResNet [2] for still-image detection, optical flow [3] and MDNet [4] for tracking.

References:
[1] Ren S, He K, Girshick R, Sun J. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NIPS 2015: 91-99.
[2] He K, Zhang X, Ren S, Sun J. “Deep residual learning for image recognition”, CVPR 2016.
[3] Kang K, Ouyang W, Li H, Wang X. “Object Detection from Video Tubelets with Convolutional Neural Networks”, CVPR 2016.
[4] Nam H, Han B. “Learning multi-domain convolutional neural networks for visual tracking”, CVPR 2016.
MIL_UT Kuniaki Saito
Shohei Yamamoto
Masataka Yamaguchi
Yoshitaka Ushiku
Tatsuya Harada

all of members are from University of Tokyo
We used Faster RCNN[1] as a basic detection system.

We implemented Faster R-CNN based on ResNet-152 and ResNet-101 [2], using ResNet models pretrained on the 1000-class classification task.

We placed the Region Proposal Network after conv4 in both models and froze the weights before conv3 during training. The models were trained end-to-end with Online Hard Example Mining [3]: from 128 proposals, the 64 with the largest loss are chosen for computing the loss (see the sketch below).
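A minimal sketch of the OHEM selection step described above (the surrounding Faster R-CNN machinery is omitted; this is an illustration, not the authors' code):

import torch

def ohem_select(per_proposal_losses, num_hard=64):
    """Online hard example mining: from the per-proposal losses of a
    minibatch (e.g. 128 proposals), keep the num_hard with the largest loss
    and back-propagate only their average."""
    losses = torch.as_tensor(per_proposal_losses)
    hard_losses, hard_idx = torch.topk(losses, k=min(num_hard, losses.numel()))
    return hard_losses.mean(), hard_idx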

Our submission is an ensemble of the Faster R-CNN models on ResNet-152 and ResNet-101. To ensemble them, the region proposals of the two models are shared: we merge the proposals and then merge the separately computed scores.

Our result scored 54.3 mAP on the validation dataset.


[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015.
[2] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.
[3] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." CVPR 2016.
MIPAL_SNU Sungheon Park and Nojun Kwak (Graduate School of Convergence Science and Technology, Seoul National University) We trained two ResNet-50 [1] networks. One network used 7x7 mean pooling, and the other used multiple mean poolings with various sizes and positions. We also used a balanced sampling strategy similar to [2] to deal with the imbalanced training set (see the sketch below).
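The exact sampling scheme is not given above; one simple way to implement class-balanced sampling is sketched below (an assumed illustration, not the authors' code).

import numpy as np

def balanced_sampling_probs(labels):
    """Per-image sampling probabilities that equalise the chance of drawing
    each class, so rare classes are seen as often as frequent ones."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    class_weight = {c: 1.0 / (len(classes) * n) for c, n in zip(classes, counts)}
    probs = np.array([class_weight[l] for l in labels])
    return probs / probs.sum()

# usage: np.random.choice(len(labels), size=batch_size, p=balanced_sampling_probs(labels))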

[1] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR, 2016.

[2] Shen, Li, et al. "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks." arXiv, 2015.
mmap-o Qi Zheng, Wuhan University
Cheng Tong, Wuhan University
Xiang Li, Wuhan University
We use Fully Convolutional Networks [1] with the VGG 16-layer net to parse the scene images. The 8-pixel-stride variant of the model is adopted.

Initial results contain some labels irrelevant to the scene. High-confidence labels are exploited to group the images into different scenes and remove irrelevant labels; we use this data-driven classification strategy to refine the results.

[1] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2015:1337-1342.
Multiscale-FCN-CRFRNN Shuai Zheng, Oxford
Anurag Arnab, Oxford
Philip Torr, Oxford
This submission is trained based on Conditional Random Fields as Recurrent Neural Networks (described in Zheng et al., ICCV 2015), with a multi-scale training pipeline. Our base model is built on ResNet-101, which is pre-trained only on ImageNet. The model is then built within a Fully Convolutional Network (FCN) structure and fine-tuned only on the MIT Scene Parsing dataset, using a multi-scale training pipeline similar to Farabet et al. 2013. In the end, this FCN-ResNet101 model is combined with CRF-RNN and trained in an end-to-end pipeline.
MW Gang Sun (Institute of Software, Chinese Academy of Sciences)
Jie Hu (Peking University)
We leverage the theory named CNA [1] (capacity and necessity analysis) to guide the design of CNNs. We add more layers on the larger feature maps (e.g., 56x56) to increase capacity, and remove some layers on the smaller feature maps (e.g., 14x14) to avoid ineffective architectures. We have verified the effectiveness on the models in [2], ResNet-like models [3], and Inception-ResNet-like models [4]. In addition, we also use cropped patches from the original images as training samples by selecting a random area and aspect ratio (a sketch of this crop selection is given below). To increase generalization ability, we prune the model weights periodically. Moreover, we utilize a balanced sampling strategy [2] and label smoothing regularization [5] during training, to alleviate the bias from the non-uniform sample distribution among categories and from partially incorrect training labels. We use only the provided data (Places365) for training, do not use any additional data, and train all models from scratch. The algorithm and architecture details will be described in our arXiv paper (available online shortly).
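The random area/aspect-ratio cropping mentioned above is sketched below; the area and aspect-ratio ranges are hypothetical (Inception-style values), not necessarily those used by the authors.

import random

def random_area_aspect_crop(width, height, area_range=(0.08, 1.0), ratio_range=(3/4, 4/3)):
    """Pick a crop window with a random area fraction and aspect ratio.
    Returns (x, y, w, h) in pixel coordinates, or a centre crop as fallback."""
    for _ in range(10):
        target_area = random.uniform(*area_range) * width * height
        ratio = random.uniform(*ratio_range)
        w = int(round((target_area * ratio) ** 0.5))
        h = int(round((target_area / ratio) ** 0.5))
        if w <= width and h <= height:
            x = random.randint(0, width - w)
            y = random.randint(0, height - h)
            return x, y, w, h
    side = min(width, height)                      # fallback: centre square crop
    return (width - side) // 2, (height - side) // 2, side, side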

[1] Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished)
[2] Li Shen, Zhouchen Lin, Qingming Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. In ECCV 2016.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. In CVPR 2016.
[4] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. ArXiv:1602.07261,2016.
[5] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv:1512.00567,2016.
NEU_SMILELAB YUE WU, YU KONG, JUN LI, LANCE BARCELONA, RAMI SALEH, SHANGQIAN GAO, RYAN BIRKE, HONGFU LIU, JOSEPH ROBINSON, TALEB ALASHKAR, YUN FU

Northeastern University, MA, USA
We focus on the object classification problem. The 1000 classes are split into 2 parts based on an analysis of the WordNet structure and a visualization of features from a ResNet-200 model [1]. The first part has 417 classes and is annotated as “LIVING THINGS”; the second part contains the remaining 583 classes and is annotated as “ARTIFACTS and OTHERS”. Two ResNet-200 models are trained, one for each part. With center-crop testing only, the model for “LIVING THINGS” has a top-5 error of 3.174% on the validation set and the model for “ARTIFACTS and OTHERS” has a top-5 error of 7.874%. However, we could not find a proper way to combine these two models into a good result over all 1000 classes; our combination of the two models [2] gets a top-5 error of 7.62% for 1000 classes with 144-crop testing. We also train several ResNet models with different depths, and our submission is based on an ensemble of these models. Our best result achieves a top-5 error of 3.92% on the validation set. For localization, we simply take the center of the image as the object bounding box.

[1] Identity Mappings in Deep Residual Networks, ECCV, 2016
[2] Deep Convolutional Neural Network with Independent Softmax for Large Scale Face Recognition, ACM Multimedia (MM), 2016
NQSCENE Chen Yunpeng ( NUS )
Jin Xiaojie ( NUS )
Zhang Rui ( CAS )
Li Yu ( CAS )
Yan Shuicheng ( Qihoo/NUS )
Technique Details for the Scene Classification:

For the scene classification task, we propose the following methods to address the data imbalance issues (aka the long tail distribution issue) which benefit and boost the final performance:

1) Category-wise Data Augmentation:
We applied a category-wise data augmentation strategy, which associates each category with an adaptive augmentation level. The augmentation level is updated iteratively during training.

2) Multi-task Learning:
We proposed a multi-path learning architecture to jointly learn feature representations from the ImageNet-1000 dataset and the Places-365 dataset.

Vanilla ResNet-200 [1] is adopted with the following elementary tricks: scale and aspect-ratio augmentation, over-sampling, and multi-scale (224, 256, 288, 320) dense testing (see the sketch below). In total, we trained four models and fused them by averaging their scores. Training each model takes about three days using MXNet [2] on a cluster with forty NVIDIA M40 (12 GB) GPUs.
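A minimal sketch of the multi-scale dense-testing fusion described above, assuming each fully-convolutional network run produces a dense class-score map per test scale; the shapes are illustrative only, not the authors' implementation.

import numpy as np

def multi_scale_dense_score(score_maps):
    """Fuse dense prediction maps from several test scales (and models).

    score_maps: list of arrays of shape (num_classes, H_i, W_i), each the
                class-score map a network produces when run convolutionally
                on one resized copy of the image.
    Each map is average-pooled over its spatial positions, then the per-scale
    vectors are averaged into a single class-score vector.
    """
    per_scale = [m.reshape(m.shape[0], -1).mean(axis=1) for m in score_maps]
    return np.mean(per_scale, axis=0)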

------------------------------
[1] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
[2] Chen, Tianqi, et al. "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems." arXiv preprint arXiv:1512.01274(2015).
NTU-SC Jason Kuen, Xingxing Wang, Bing Shuai, Xiangfei Kong, Jianxiong Yin, Gang Wang*, Alex C Kot


Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore.
All of our scene classification models are built upon pre-activation ResNets [1]. For scene classification using the provided RGB images, we train from scratch a ResNet-200, as well as a relatively shallow Wide-ResNet [2]. In addition to RGB images, we make use of class activation maps [3] and (scene) semantic segmentation masks [4] as complementary cues, obtained from models pre-trained for ILSVRC image classification [5] and scene parsing [6] tasks respectively. Our final submissions consist of ensembles of multiple models.

References
[1] He, K., Zhang, X., Ren, S., & Sun, J. “Identity Mappings in Deep Residual Networks”. ECCV 2016.
[2] Zagoruyko, S., & Komodakis, N. “Wide Residual Networks”. BMVC 2016.
[3] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. “Learning Deep Features for Discriminative Localization”. CVPR 2016.
[4] Shuai, B., Zuo, Z., Wang, G., & Wang, B. "Dag-Recurrent Neural Networks for Scene Labeling". CVPR 2016.
[5] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. “Imagenet large scale visual recognition challenge”. International Journal of Computer Vision, 115(3), 211-252.
[6] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. “Semantic Understanding of Scenes through the ADE20K Dataset”. arXiv preprint arXiv:1608.05442.
NTU-SP Bing Shuai (Nanyang Technological University)
Xiangfei Kong (Nanyang Technological University)
Jason Kuen (Nanyang Technological University)
Xingxing Wang (Nanyang Technological University)
Jianxiong Yin (Nanyang Technological University)
Gang Wang* (Nanyang Technological University)
Alex Kot (Nanyang Technological University)
We train our improved fully convolutional networks (IFCN) for the scene parsing task. More specifically, we use a Convolutional Neural Network pre-trained on the ILSVRC CLS-LOC task as the encoder, and then add a multi-branch deep convolutional network to perform multi-scale context aggregation. Finally, a simple deconvolution network (without unpooling layers) is used as the decoder to generate high-resolution label prediction maps. IFCN subsumes these three network components. The network (IFCN) is trained with the class-weighted loss proposed in [Shuai et al., 2016].

[Shuai et al, 2016] Bing Shuai, Zhen Zuo, Bing Wang, Gang Wang. DAG-Recurrent Neural Network for Scene Labeling
NUIST Jing Yang, Hui Shuai, Zhengbo Yu, Rongrong Fan, Qiang Ma, Qingshan Liu, Jiankang Deng 1. Inception-v2 [1] is used in the VID task, which runs almost in real time on a GPU.
2. Cascaded region regression is used to detect and track different instances.
3. Context inference is performed between instances within each video.
4. The detector and tracker are updated online to improve recall.

[1]Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[2]Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[3]Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).
NuistParsing Feng Wang:B-DAT Lab, Nanjing University of Information Science and Technology, China
Zhi Li:B-DAT Lab, Nanjing University of Information Science and Technology, China
Qingshan Liu:B-DAT Lab, Nanjing University of Information Science and Technology, China
The scene parsing problem is extremely challenging due to the diversity of appearance and the complexity of configuration, layout, and occlusion. We mainly adopt the SegNet architecture for our scene parsing work. We first extract edge information from the ground truth and treat the edges as a new class. We then re-compute the weights of all classes to overcome the class imbalance, and use the new ground truth and new weights to train the model. In addition, we employ super-pixel smoothing to refine the results.
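
A small NumPy sketch of how the "edge as a new class" ground truth might be constructed, assuming edges are taken wherever the label changes between neighbouring pixels:

import numpy as np

def add_edge_class(label_map, edge_class):
    """Relabel ground-truth boundary pixels as an extra 'edge' class.

    label_map  : (H, W) integer class indices
    edge_class : index reserved for the new edge class
    """
    up = np.zeros_like(label_map, dtype=bool)
    left = np.zeros_like(label_map, dtype=bool)
    up[1:, :] = label_map[1:, :] != label_map[:-1, :]     # vertical label changes
    left[:, 1:] = label_map[:, 1:] != label_map[:, :-1]   # horizontal label changes
    edges = up | left
    out = label_map.copy()
    out[edges] = edge_class
    return out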
[1] V. Badrinarayanan, A. Handa, and R. Cipolla. SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.
[2] Wang F, Li Z, Liu Q. Coarse-to-fine human parsing with Fast R-CNN and over-segment retrieval. 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016: 1938-1942.
NUS-AIPARSE XIAOJIE JIN (NUS)
YUNPENG CHEN (NUS)
XIN LI (NUS)
JIASHI FENG (NUS)
SHUICHENG YAN (360 AI INSTITUTE, NUS)
The submissions are based on our proposed Multi-Path Feedback recurrent neural network (MPF-RNN) [1]. MPF-RNN aims to enhance the capability of RNNs in modeling long-range context information at multiple levels and to better distinguish pixels that are easy to confuse in pixel-wise classification. In contrast to CNNs without feedback and RNNs with only a single feedback path, MPF-RNN propagates the contextual features learned at top layers through weighted recurrent connections to multiple bottom layers to help them learn better features with such "hindsight". In addition, we propose a new training strategy which considers the loss accumulated at multiple recurrent steps to improve the performance of MPF-RNN on parsing small objects and to stabilize the training procedure.

In this contest, ResNet-101 is used as the baseline model. Multi-scale input data augmentation and multi-scale testing are used.

[1] Jin, Xiaojie, Yunpeng Chen, Jiashi Feng, Zequn Jie, and Shuicheng Yan. "Multi-Path Feedback Recurrent Neural Network for Scene Parsing." arXiv preprint arXiv:1608.07706 (2016).
NUS_FCRN Li Xin, Tsinghua University;
Jin Xiaojie, National University of Singapore;
Jiashi Feng, National University of Singapore.
We trained a single fully convolutional neural network with ResNet-101 as frontend model.

We did not use any multi-scale data augmentation in either training or testing.
NUS_VISENZE Kyaw Zaw Lin(dcskzl@nus.edu.sg)
Shangxuan Tian(shangxuan@visenze.com)
JingYuan Chen(a0117039@u.nus.edu)
Fusion of three models: SSD (with VGG and ResNet backbones) [2] and Faster R-CNN [4] with ResNet [3]. Context suppression is applied, and tracking is then performed according to [1]. Tracklets are greedily merged after tracking, as sketched below.
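
A hedged sketch of greedy tracklet merging; the merging criterion shown (small temporal gap plus IoU of the closest boxes) is an assumption, since the entry does not spell it out:

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge_tracklets(tracklets, iou_thresh=0.5, max_gap=5):
    """Greedily chain tracklets of the same class.

    Each tracklet is a dict with 'class', 'start', 'end' frame indices and
    'boxes' (one box per frame). Two tracklets are merged when the first
    ends shortly before the second starts and their closest boxes overlap.
    """
    tracklets = sorted(tracklets, key=lambda t: t["start"])
    merged = []
    for t in tracklets:
        for m in merged:
            if (m["class"] == t["class"]
                    and 0 <= t["start"] - m["end"] <= max_gap
                    and iou(m["boxes"][-1], t["boxes"][0]) >= iou_thresh):
                m["boxes"].extend(t["boxes"])
                m["end"] = t["end"]
                break
        else:
            merged.append(dict(t, boxes=list(t["boxes"])))
    return merged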

[1]Danelljan, Martin, et al. "Accurate scale estimation for robust visual tracking." Proceedings of the British Machine Vision Conference BMVC. 2014.
[2]Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015).
[3]He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[4]Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
OceanVision Zhibin Yu Ocean University of China
Chao Wang Ocean University of China
ZiQiang Zheng Ocean University of China
Haiyong Zheng Ocean University of China
Our homepage: http://vision.ouc.edu.cn/~zhenghaiyong/

We are interested in scene classification and aim to build a network for this problem.
OutOfMemory Shaohua Wan, UT Austin
Jiapeng Zhu, BIT, Beijing
The Faster R-CNN [1] object detection framework with a ResNet-152 [2] network configuration is used in our object detection algorithm. Much effort is made to optimize the network so that it consumes far less GPU memory.

[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
Rangers Y. Q. Gao,
W. H. Luo,
X. J. Deng,
H. Wang,
W. D. Chen,
---
ResNeXt Saining Xie, UCSD
Ross Girshick, FAIR
Piotr Dollar, FAIR
Kaiming He, FAIR


We present a simple, modularized multi-way extension of ResNet for ImageNet classification. In our network, each residual block consists of multiple ways that share the same architectural shape, and the network is a simple stack of such residual blocks built from the same template, following the design of the original ResNet. Our model is highly modularized and thus reduces the burden of exploring the design space. We carefully conducted ablation experiments showing the improvements of this architecture; more details will be available in a technical report. In the submissions we exploited multi-way ResNet-101 models. We submitted no localization result.
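
The multiple ways of identical shape can be realised compactly with a grouped convolution; a minimal PyTorch sketch of one such residual block (the widths and cardinality below are illustrative, not the submitted configuration):

import torch
import torch.nn as nn

class MultiWayBlock(nn.Module):
    """Residual block whose middle 3x3 conv is split into `cardinality`
    identical ways, implemented here as a grouped convolution."""

    def __init__(self, channels, bottleneck, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

# e.g. a 256-channel block with 32 ways (bottleneck width 128 in total)
block = MultiWayBlock(channels=256, bottleneck=128, cardinality=32)
out = block(torch.randn(1, 256, 56, 56))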
RUC_BDAI Peng Han, Renmin University of China
An Zhao, Renmin University of China
Wenwu Yuan, Renmin University of China
Zhiwu Lu, Renmin University of China
Jirong Wen, Renmin University of China
Lidan Yang, Renmin University of China
Aoxue Li, Peking University
We use a well-trained Faster R-CNN [1] to generate bounding boxes for every frame of the video, training the model on only a few frames from each video. To reduce the effect of class imbalance, we keep the number of training frames per category roughly the same. We then use the contextual information of the video to suppress noisy detections and recover missing ones.

[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
S-LAB-IIE-CAS Ou Xinyu [1,2]
Ling Hefei [2]
Liu Si [1]

1. Chinese Academy of Sciences, Institute of Information Engineering;
2. Huazhong University of Science and Technology
(This work was done when the first author worked as an intern at S-Lab of CASIIE.)
We exploit object-based contextual enhancement strategies to improve the performance of deep convolutional neural networks on the scene parsing task. Increasing the weights of objects in local proposal regions can enhance the structural characteristics of the object and correct ambiguous areas that are wrongly judged as stuff. We have verified its effectiveness on a ResNet101-like architecture [1], designed with multi-scale, CRF, and atrous convolution [2] technologies. We also apply various other techniques (such as RPN [3], black hole padding, visual attention, and iterative training) to this ResNet101-like architecture. The algorithm and architecture details will be described in our paper (available online shortly).
In this competition, we submit five entries. The first (model A) is a multi-scale ResNet101-like model with a fully connected CRF and atrous convolutions, which achieved 0.3486 mIOU and 75.39% pixel-wise accuracy on the validation dataset. The second (model B) is a multi-scale deep CNN modified by object proposals, which achieved 0.3809 mIOU and 75.69% pixel-wise accuracy. A black hole restoration strategy is attached to model B to generate model C. Model D adds attention strategies to the deep CNN model, and model E combines the results of the other four models.

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition.IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille:
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. CoRR abs/1606.00915 (2016)
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Conference on Neural Information Processing Systems (NIPS), 2015
SamExynos Qian Zhang(Beijing Samsung Telecom R&D Center)
Peng Liu(Beijing Samsung Telecom R&D Center)
Jinbin Lin(Beijing Samsung Telecom R&D Center)
Junjun Xiong(Beijing Samsung Telecom R&D Center)
Object localization:

The submission is based on [1] and [2], but we modified the model; the network has 205 layers. Due to limited time and GPUs, we trained only three CNN models for classification. The top-5 accuracy on the validation set with dense crops (scales: 224, 256, 288, 320, 352, 384, 448, 480) is 96.44% for the best single model, and 96.88% for the three-model ensemble.

places365 classification:

The submission is based on [3] and [4]; we add 5 layers to ResNet-50 and modify the network. Due to limited time and GPUs, we trained only three CNN models for the scene classification task. The top-5 accuracy on the validation set with 72 crops is 87.79% for the best single model, and 88.70% with multiple crops for the three-model ensemble.

[1]Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun,Identity Mappings in Deep Residual Networks. ECCV 2016.
[2]Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv preprint arXiv:1602.07261 (2016)
[3]Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision". arXiv preprint arXiv:1512.00567 (2015)
Samsung Research America: General Purpose Acceleration Group Dr. S. Eliuk (Samsung), C. Upright (Samsung), Dr. H. Vardhan (Samsung), T. Gale (Intern, Northeastern University), S. Walsh (Intern, University of Alberta). The General Purpose Acceleration Group is focused on accelerating training via HPC & distributed computing. We present Distributed Training Done Right (DTDR) where standard open-source models are trained in an effective manner via a multitude of techniques involving strong / weak scaling and strict distributed training modes. Several different models are used from standard Inception v3, to Inception v4 res2, and ensembles of such techniques. The training environment is unique as we can explore extremely deep models given the model-parallel nature of our partitioning of data.
scnu407 Li Shiqi South China Normal University
Zheng Weiping South China Normal University
Wu Jinhui South China Normal University
We believe that the spatial relationships between objects in an image are a kind of sequential (time-series) data. Therefore, we first use VGG16 to extract image features, then append four LSTM layers, each scanning the feature map in one of the four directions.
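
A rough PyTorch sketch of the four-direction LSTM scan over the VGG16 feature map (the hidden size and tensor layout are assumptions; the exact configuration is not given above):

import torch
import torch.nn as nn

class FourWayLSTM(nn.Module):
    """Scan a conv feature map with LSTMs in four directions and
    concatenate the resulting per-pixel hidden states."""

    def __init__(self, in_channels, hidden):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(in_channels, hidden, batch_first=True) for _ in range(4)]
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # row sequences
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)   # column sequences
        lr, _ = self.lstms[0](rows)                         # left -> right
        rl, _ = self.lstms[1](rows.flip(1))                 # right -> left
        tb, _ = self.lstms[2](cols)                         # top -> bottom
        bt, _ = self.lstms[3](cols.flip(1))                 # bottom -> top
        lr = lr.reshape(b, h, w, -1)
        rl = rl.flip(1).reshape(b, h, w, -1)
        tb = tb.reshape(b, w, h, -1).permute(0, 2, 1, 3)
        bt = bt.flip(1).reshape(b, w, h, -1).permute(0, 2, 1, 3)
        return torch.cat([lr, rl, tb, bt], dim=-1)          # (B, H, W, 4*hidden)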
SegModel Falong Shen, Peking University
Rui Gan, Peking University
Gang Zeng, Peking University
Abstract
Our models are finetuned from resnet152[1] and follow the methods introduced in [2].





References
[1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition.
[2] F. Shen, G. Zeng. Fast Semantic Image Segmentation with High Order Context and Guided Filtering.
SenseCUSceneParsing Hengshuang Zhao* (SenseTime, CUHK), Jianping Shi* (SenseTime), Xiaojuan Qi (CUHK), Xiaogang Wang (CUHK), Tong Xiao (CUHK), Jiaya Jia (CUHK) [* equal contribution] We employ FCN-based semantic segmentation for scene parsing and propose a context-aware semantic segmentation framework. The additional image-level information significantly improves performance on complex scenes in the natural distribution. Moreover, we find that deeper pre-trained models are better. Our pre-trained models include ResNet-269 and ResNet-101 from the ImageNet dataset, and ResNet-152 from the Places2 dataset. Finally, we utilize a deeply supervised structure to assist training of the deeper models. Our best single model reaches 44.65 mIOU and 81.58% pixel accuracy on the validation set.
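
The deeply supervised structure adds an auxiliary loss from an intermediate prediction head to the main loss; a hedged PyTorch sketch (the 0.4 auxiliary weight and the ignore label 255 are common conventions assumed here, not the submitted values):

import torch.nn.functional as F

def deeply_supervised_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Sum of the main segmentation loss and an auxiliary loss computed
    from an intermediate prediction head.

    main_logits, aux_logits : (B, num_classes, H, W), already upsampled
    target                  : (B, H, W) ground-truth class indices
    """
    main = F.cross_entropy(main_logits, target, ignore_index=255)
    aux = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main + aux_weight * aux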

[1]. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015.
[2]. He, Kaiming, et al. "Deep residual learning for image recognition." arXiv:1512.03385, 2015.
[3]. Lee, Chen-Yu, et al. "Deeply-Supervised Nets." AISTATS, 2015.
SIAT_MMLAB Sheng Guo, Linjie Xing,
Shenzhen Institutes of Advanced Technology, CAS.
Limin Wang,
Computer Vision Lab, ETH Zurich.
Yuanjun Xiong,
Chinese University of Hong Kong.
Jiaming Liu and Yu Qiao,
Shenzhen Institutes of Advanced Technology, CAS.
We propose a modular framework for large-scale scene recognition, called multi-resolution CNN (MR-CNN) [1]. This framework addresses the difficulty of characterizing scene concepts, which may rely on multi-level visual information, including local objects, spatial layout, and global context. Specifically, in this challenge submission, we utilize four resolutions (224, 299, 336, 448) as the input sizes of the MR-CNN architectures. For the coarse resolutions (224, 299), we exploit the existing powerful Inception architectures (Inception v2 [2], Inception v4 [3], and Inception-ResNet [3]), while for the fine resolutions (336, 448), we propose new Inception architectures by making the original Inception network deeper and wider. Our final submission is the prediction of the MR-CNNs, obtained by fusing the outputs of CNNs at different resolutions.

In addition, we propose several principled techniques to reduce the over-fitting risk of MR-CNNs, including class balancing and hard sample mining. These simple yet effective training techniques enable us to further improve the generalization performance of MR-CNNs on the validation dataset. Meanwhile, we use an efficient parallel version of Caffe toolbox [4] to allow for the fast training of our proposed deeper and wider Inception networks.


[1] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, Knowledge guided disambiguation for large-scale scene classification with Multi-Resolution CNNs, in arXiv, 2016.

[2] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in ICML, 2015.

[3] C. Szegedy, S. Ioffe, and V. Vanhoucke, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in arXiv, 2016.

[4] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal segment networks: towards good practices for deep action recognition, in ECCV, 2016.
SIIT_KAIST Sihyeon Seong (KAIST)
Byungju Kim (KAIST)
Junmo Kim (KAIST)
We used ResNet [1] (101 layers / 4 GPUs) as our baseline model. Starting from a model pre-trained on the ImageNet classification dataset (provided by [2]), we re-tuned it on the Places365 dataset (the 256-resized small dataset). Then, we further fine-tuned the model based on the following ideas:

i) Analyzing correlations between labels: We calculate correlations between each pair of predictions p(i), p(j), where i, j are classes. Highly correlated label pairs are then extracted by thresholding the correlation coefficients (see the sketch after this list).

ii) Additional semantic label generation: Using the correlation table from i), we further generate super/subclass labels by clustering. Additionally, we generate 170 binary labels for separating confusing classes, which maximize the margins between highly correlated label pairs.

iii) Boosting-like multi-loss terms: A large number of loss terms are combined for classifying the labels generated in ii).
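
A small NumPy sketch of step i), correlating per-class prediction scores over a held-out set and thresholding to find confusable pairs (the threshold value is illustrative):

import numpy as np

def confusable_pairs(pred_scores, threshold=0.3):
    """Find highly correlated class pairs from prediction scores.

    pred_scores : (num_images, num_classes) softmax outputs
    Returns (i, j) class index pairs, i < j, whose prediction scores are
    correlated above `threshold`.
    """
    corr = np.corrcoef(pred_scores, rowvar=False)   # (C, C)
    num_classes = corr.shape[0]
    pairs = []
    for i in range(num_classes):
        for j in range(i + 1, num_classes):
            if corr[i, j] > threshold:
                pairs.append((i, j))
    return pairs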

[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[2] https://github.com/facebook/fb.resnet.torch
SIIT_KAIST-TECHWIN Byungju Kim (KAIST),
Youngsoo Kim (KAIST),
Yeakang Lee (KAIST),
Junho Yim (KAIST),
Sangji Park (Techwin),
Jaeho Jang (Techwin),
Shimin Yin (Techwin),
Soonmin Bae (Techwin),
Junmo Kim (KAIST)
Our methods for classification and localization are based on ResNet[1].

We used Branched-200-layer ResNets, based on the original 200-layer ResNet and Label Smoothing method [2].

The networks are trained from scratch on the ILSVRC 2016 localization dataset.

For testing, the dense sliding-window method [3] was used at six scales and with horizontal flips.

'Single model' is one Branched-ResNet with Label Smoothing method. Validation top-5 classification error rate is 3.7240%

'Ensemble A' consists of one 200-layer ResNet, one Branched-ResNet without label smoothing and the 'Single model', which uses label smoothing.

'Ensemble B' consists of three Branched-ResNets without label smoothing and 'Single model', which is with label smoothing.

'Ensemble C' consists of 'Ensemble B' and an original 200-layer ResNet.

Ensembles A and B are averaged over a soft set of targets distilled at a high temperature, similar to the method in [4].

Ensemble C is averaged on softmax outputs.
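
A NumPy sketch of averaging ensemble members over temperature-softened targets in the spirit of [4] (the temperature value is illustrative):

import numpy as np

def soften(logits, temperature):
    """Softmax with a temperature, as in knowledge distillation [4]."""
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_soft_targets(member_logits, temperature=4.0):
    """Average temperature-softened distributions over ensemble members.

    member_logits : list of (num_images, num_classes) logit arrays
    """
    return np.mean([soften(l, temperature) for l in member_logits], axis=0)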

This work was supported by Hanwha Techwin CO., LTD.


[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[2] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).
[3] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).
[4] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).
SIS ITMO University --- Single-Shot Detector
SJTU-ReadSense Qinchuan Zhang, Shanghai Jiao Tong University
Junxuan Chen, Shanghai Jiao Tong University
Thomas Tong, ReadSense
Leon Ding, ReadSense
Hongtao Lu, Shanghai Jiao Tong University
We train two CNN models from scratch. Model A, based on Inception-BN [1] with one auxiliary classifier, is trained on the Places365-Challenge dataset [2] and achieves 15.03% top-5 error on the validation set. Model B, based on a 50-layer ResNet [3], is trained on the Places365-Standard dataset and, due to time limits, fine-tuned for 2 epochs on the Places365-Challenge dataset, achieving 16.3% top-5 error on the validation set. We also fuse features extracted from 3 baseline models [2] on the Places365-Challenge dataset and train two fully connected layers with a softmax classifier. Moreover, we adopt the "class-aware" sampling strategy proposed by [4] for models trained on the Places365-Challenge dataset to tackle the non-uniform distribution of images over the 365 categories. We implement model A using Caffe [5] and conduct all other experiments using MXNet [6] to allow a larger batch size per GPU.

We train all models with 224x224 crops randomly sampled from a 256x256 image or its horizontal flip, with the per-pixel mean subtracted. We apply 12-crop evaluation [7] on the validation and test sets.

We ensemble multiple models with weights (either learned on the validation set or given by top-5 validation accuracies), and achieve 12.79% (4 models), 12.69% (5 models), and 12.57% (6 models) top-5 error on the validation set.
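
A minimal sketch of the class-aware sampling strategy of [4]: pick a class uniformly at random, then pick an image from that class, so rare classes are visited as often as frequent ones (data structures are hypothetical):

import random
from collections import defaultdict

def build_class_index(samples):
    """Group sample indices by label; `samples` is a list of (path, label)."""
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(samples):
        by_class[label].append(idx)
    return by_class

def class_aware_batch(by_class, batch_size):
    """Draw a batch by first sampling a class uniformly, then an image."""
    classes = list(by_class.keys())
    return [random.choice(by_class[random.choice(classes)])
            for _ in range(batch_size)]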

[1] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[2] Places: An Image Database for Deep Scene Understanding. B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. Arxiv, 2016.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[4] L. Shen, Z. Lin , Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. arXiv:1512.05830, 2015.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C.n Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS, 2015.
[7] C. Szegedy,W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
SRA Hojjat Seyed Mousavi, The Pennsylvania State University, Samsung Research America

Da Zhang, University of California at Santa Barbara, Samsung Research America

Nina Narodytska, Samsung Research America

Hamid Maei, Samsung Research America

Shiva Kasiviswanathan, Samsung Research America
Object detection from video is a challenging task in computer vision. These challenges sometimes come from the temporal aspects of videos or from the nature of the objects present. For example, detecting objects that disappear and reappear in the camera's field of view, or detecting non-rigid objects that change appearance, are common challenges in video object detection. In this work, we specifically focus on incorporating temporal and contextual information to address some of these challenges. In our proposed method, initial object candidates are first detected in each frame of the video sequence. Then, based on information from adjacent frames and contextual information from the whole video sequence, object detections and categories are recalculated for each video. We submitted two entries to this year's competition. One corresponds to our algorithm using information from still video frames, temporal information from adjacent frames, and contextual information from the whole video sequence. The other does not use the contextual information present in the video.
SunNMoon Moon Hyoung-jin.
Park Sung-soo.
We ensembled two object detectors, Faster R-CNN and the Single Shot Detector (SSD).
We used a pre-trained ResNet-101 classification model.

Faster R-CNN is combined with an RPN and the SPOPnet (Scale-aware Pixel-wise Object Proposal network) algorithm to find better RoIs. We trained Faster R-CNN and SSD; the final result is the combination of multi-scale Faster R-CNN and SSD 300x300.

The result of the single Faster R-CNN is 42.8% mAP.
The result of SSD 300x300 is 43.7% mAP.
The ensembled result of the multi-scale Faster R-CNN and SSD is 47.6% mAP.

References:
[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", CVPR 2015
[2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg "SSD: Single Shot MultiBox Detector"
[3] Zequn Jie, Xiaodan Liang, Jiashi Feng, Wen Feng Lu, Eng Hock Francis Tay, Shuicheng Yan "Scale-aware Pixel-wise Object Proposal Networks"
SUXL Xu Lin SZ UVI Technology Co., Ltd The proposed model is a combination of several convolutional neural network frameworks for scene parsing, implemented in Caffe. We initialise ResNet-50 and ResNet-101 [1] trained on the ImageNet classification dataset, then train these two networks on the Places2 scene classification 2016 data. With some modifications for the scene parsing task, we train a multi-scale dilated network [2] initialised with the trained parameters of ResNet-101, and FCN-8x and FCN-16x networks [3] initialised with the trained parameters of ResNet-50. Considering the additional models provided by the scene parsing challenge 2016, we combine these models via a post network. The proposed model is also refined by a fully connected CRF for semantic segmentation [4].

[1]. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[2].L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille. Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016
[3].J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015.
[4].Krahenbuhl, P. and Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
SYSU_HCP-I2_Lab Liang Lin (Sun Yat-sen University),
Lingbo Liu (Sun Yat-sen University),
Guangrun Wang (Sun Yat-sen University),
Junfan Lin (Sun Yat-sen University),
Ziyang Tang (Sun Yat-sen University),
Qixian Zhou (Sun Yat-sen University),
Tianshui Chen (Sun Yat-sen University)
We design our scene parsing model based on DeepLab2-CRF and improve it in the following two aspects. First, we incorporate deep, semantic information and shallow, appearance information with skip layers to produce refined, detailed segmentations. Specifically, we combine the features of the 'fusion' layer (after up-sampling via bilinear interpolation), the 'res2b' layer and the 'res2c' layer. Second, we develop cascade nets, in which the second network utilizes the output of the first network to generate a more accurate parsing map. Our ResNet-101 was pre-trained on the standard 1.2M ImageNet data and fine-tuned on the ADE20K dataset.
TEAM1 Sung-Bae Cho (Yonsei University)
Sangmuk Jo (Yonsei University)
Seung Ha Kim (Yonsei University)
Hyun-Tae Hwang (Yonsei University)
Youngsu Park (LGE)
Hyungseok Ohk (LGE)
We use the Faster R-CNN framework [1] and fine-tune the network with the provided VID training data and the additional DET training data.
(We modified the DET training data into 30 classes from 200 classes.)
For our baseline model, we use VGG-16 [2].

[1] REN, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. 2015. p. 91-99.
[2] WANG, Limin, et al. Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159, 2015.
ToConcoctPellucid Digvijay Singh
MS by Research in Computer Science, IIIT Hyderabad
This submission for the task of object detection follows the algorithmic guidelines provided by Ren et al. [2], also known as the Faster-RCNN algorithm. This deep neural network based method removes the dependency of detection algorithms on older object proposal techniques. The Faster-RCNN framework consists of two networks: a region proposal network (RPN) for object proposals and Fast-RCNN for object detection. The authors show that the framework is independent of the classification network architecture chosen. In particular, Residual Networks (ResNet) [1] trained for the ImageNet classification task are picked because of their recent success in many domains. It has been observed that the improvement in the quality of object proposals is drastic compared to older methods like Selective Search, MCG, etc. The Faster-RCNN layers not borrowed from ResNet, such as the RPN layers and the final fully connected layers for RoI classification and bounding-box offset prediction, are randomly initialized from a Gaussian distribution. Further, to train/fine-tune our baseline single model we use the train + val1 set of the ImageNet DET dataset, keeping the val2 half for validation. This split is maintained for all three entries. Three entries have been submitted for the object detection challenge:

Entry 1: ResNet-101 + Faster-RCNN. The pretrained ResNet-101 network provided by the authors of ResNet is used to initialize most of the layers in Faster-RCNN. The extra layers belonging to the RPN network and the final fully connected layers are randomly initialized. The ROI-Pooling is placed between the conv4 and conv5 blocks of ResNet-101. Hence, ResNet layers conv1-conv4 act as the feature extractor, and these features are used for object proposal generation as well as for the object detection task. Object regions of interest are generated by the RPN network and passed on to ROI pooling, which scales all RoI-bounded activations to a fixed size of 7x7 [WxH]. The layers in conv5 act as the object classifier.
Entry 2: Taking hints from last year's winner's recommendations, this entry is an ensemble of two Residual Networks. In addition, the more semantically meaningful box-voting technique [3] is used (a sketch follows after Entry 3). For our ensemble, ResNet-101 and ResNet-50 networks are picked because of the availability of pretrained models. While testing, both networks generate proposals. Proposals from both networks are combined and non-maximum suppression is applied to remove redundancy. The final set of RoIs is given to both networks and detections from each of them are collected. To obtain the final set of detections, the box-voting technique [3] is used to refine the detections from both. Box voting can be seen as weighted averaging of instances belonging to a certain niche (decided by an IoU criterion).
Entry 3: In this entry, the ResNet layers are altered, removed and modified to obtain a topology different from the networks used in Entry 2. For example, the batch-normalization and scaling layers that are followed by the Eltwise addition in the standard ResNet are removed and affixed after the Eltwise operation. The rest of the training/testing settings remain the same as Entry 2.
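
A hedged NumPy sketch of the box-voting refinement [3] used in Entry 2: each detection kept after NMS is replaced by a score-weighted average of all same-class candidates that overlap it above an IoU threshold (the 0.5 threshold is an assumed common choice):

import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between (K, 4) and (N, 4) arrays of x1y1x2y2 boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def box_voting(kept_boxes, all_boxes, all_scores, iou_thresh=0.5):
    """Refine NMS survivors by score-weighted averaging of overlapping boxes."""
    overlaps = iou_matrix(kept_boxes, all_boxes)          # (K, N)
    refined = kept_boxes.astype(float).copy()
    for k in range(len(kept_boxes)):
        voters = overlaps[k] >= iou_thresh
        w = all_scores[voters]
        refined[k] = (all_boxes[voters] * w[:, None]).sum(0) / (w.sum() + 1e-9)
    return refined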

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016
[2] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015
[3] S. Gidaris and N. Komodakis. Object Detection via a multi-region and semantic segmentation-aware cnn model. ICCV 2015
Trimps-Soushen Jie Shao, Xiaoteng Zhang, Zhengyan Ding, Yixin Zhao, Yanjun Chen, Jianying Zhou, Wenfei Wang, Lin Mei, Chuanping Hu

The Third Research Institute of the Ministry of Public Security, P.R. China.
Object detection (DET)
We use several pre-trained models, including ResNet, Inception, Inception-ResNet, etc. Taking the predicted boxes from our best model as region proposals, we average the softmax scores and the box regression outputs across all models. Other improvements include annotation refinement, box voting and feature maxout.

Object classification/localization (CLS-LOC)
Based on image classification models like Inception, Inception-Resnet, ResNet and Wide Residual Network (WRN), we predict the class labels of the image. Then we refer to the framework of "Faster R-CNN" to predict bounding boxes based on the labels. Results from multiple models are fused in different ways, using the model accuracy as weights.

Scene classification (Scene)
We adopt different kinds of CNN models such as ResNet, Inception and WRN. To improve the performance of features from multiple scales and models, we implement a cascade softmax classifier after the extraction stage.

Object detection from video (VID)
Same methods as DET task were applied to each frame. Optical flow guided motion prediction helped to reduce the false negative detections.


[1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. NIPS 2015

[2] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi.

[3] Zagoruyko S, Komodakis N. Wide Residual Networks[J]. arXiv preprint arXiv:1605.07146, 2016.
VB JongGook Ko
Seungjae Lee
KeunDong Lee
DaUn Jeong
WeonGeun Oh
In this work, we use a variant of SSD [1] with ResNet [2] for the detection task. The overall training of the detection network follows a procedure similar to [1]. For detection, we design the detection network from ResNet and select multiple object candidates from different layers with various aspect ratios and scales. For the ensemble models, we also train Faster R-CNN [3] using ResNet and a variant of SSD with the VGG network [1].

[1]"SSD:Single Shot MultiBox Detector", Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. arXiv preprint arXiv:1512.02325(2015).
[2]"Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Tech Report 2015.
[3]"Faster r-cnn: Towards real-time object detection with region proposal networks[J]". Ren S, He K, Girshick R, et al. arXiv preprint arXiv:1506.01497, 2015.
VikyNet K. Vikraman, Independent researcher. Graduate from IIT Roorkee Semantic segmentation requires careful adjustment of parameters. Due to max pooling, a lot of useful information about edges is lost. Algorithms like deconvolutional networks try to recapture it, but at the cost of increased computation time.
FCNs have an advantage over them in terms of processing time, and ParseNet improves performance further. I have fine-tuned the model to perform better.


References:
1)ParseNet: Looking Wider to See Better
2)Fully Convolutional Networks for Semantic Segmentation
VIST Jongyoul Park(ETRI)
Joongsoo Lee(ETRI)
Joongwon Hwang(ETRI)
Seung-Hwan Bae(ETRI)
Young-Suk Yoon(ETRI)
Yuseok Bae(ETRI)
In this work, we use ResNet models as the classification network and adapt Faster R-CNN as the region proposal network.
Viz Insight Biplab Ch Das, Samsung R&D Institute Bangalore
Shreyash Pandey, Samsung R&D Institute Bangalore
Ensembling approaches are known to outperform individual classifiers on standard classification tasks [No Free Lunch Theorem :)]

In our approach we trained state of the art classifiers including variations of:
1.ResNet
2.VGGNet
3.AlexNet
4.SqueezeNet
5.GoogleNet

Each of these classifiers was trained on a different view of the provided Places2 challenge data.

Multiple deep metaclassifiers were trained on the confidences of the labels predicted by the above classifiers, forming a non-linear ensemble in which the weights of the neural network are learned to maximize scene recognition accuracy.

To impose further consistency between objects and scenes, a state-of-the-art classifier trained on ImageNet was adapted to Places via a zero-shot learning approach.

We did not use any external data for training the classifiers. However, we balanced the data so that the classifiers produce unbiased results, so some of the data remained unused.
Vladimir Iglovikov Vladimir Iglovikov

Hardware: Nvidia Titan X
Software: Keras with Theano backend
Time spent: 5 days

All models were trained on 128x128 images resized from the "small 256x256" dataset.

[1] Modified VGG16 => validation top5 error => 0.36
[2] Modified VGG19 => validation top5 error => 0.36
[3] Modified Resnet 50 => validation top5 error => 0.46
[4] Average of [1] and [2] => validation top5 error 0.35


Main changes:
ReLU => ELU
Optimizer => Adam
Batch Normalization added to VGG16 and VGG19
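
A rough sketch of the listed modifications written against the present-day tf.keras API (the original submission used Keras 1.x with a Theano backend; the architecture is truncated for brevity and is not the exact submitted model):

from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, filters, n_convs):
    """VGG-style block with BatchNorm and ELU instead of plain ReLU."""
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ELU()(x)
    return layers.MaxPooling2D()(x)

inputs = keras.Input(shape=(128, 128, 3))       # models trained at 128x128
x = conv_block(inputs, 64, 2)
x = conv_block(x, 128, 2)
x = conv_block(x, 256, 3)
x = conv_block(x, 512, 3)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(365, activation="softmax")(x)   # Places365 classes

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(),       # Adam instead of SGD
              loss="categorical_crossentropy",
              metrics=["top_k_categorical_accuracy"])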
WQF_BTPZ Weichen Sun
Yuanyuan Li
Jiangfan Deng
We participate in the classification and localization task. Our framework is mainly based on deep residual network (ResNet) [1] and faster RCNN [2].
We make the following improvements.
(1) For the classification part, we train our models on the 1000-category classification dataset with multi-scale inputs. We pre-train our models with 224x224 crops from images resized to 256x256. Then we fine-tune the pre-trained ResNet (50-layer and 101-layer) classification models with 299x299 crops from images resized to 328x328.
(2) In order to improve the optimization ability of our models, we also replace ReLU layers with PReLU or RReLU layers. By introducing different activation functions, our models achieve better classification performance.
(3) For the localization part, we use the Faster R-CNN framework with a 50-layer ResNet as the classifier. Starting from the pre-trained model in step (1), we train the model on the localization dataset.

[1] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
[2] "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun.
[3] “Empirical Evaluation of Rectified Activations in Convolution Network”, Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li.
XKA Zengming Shen, Yifei Liu, Lengyue Chen, Honghui Shi, Thomas Huang

University of Illinois at Urbana-Champaign
SegNet is trained only on the ADE20K dataset and post-processed with a CRF.

1.SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla
2.Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, Philipp Krähenbühl and Vladlen Koltun, NIPS 2011
YoutuLab Xiaowei Guo, YoutuLab
Ruixin Zhang, YoutuLab
Yushi Yao, YoutuLab
Pai Peng, YoutuLab
Ke Li, YoutuLab
We build a scene recognition system using deep CNN models. These CNN models are inspired by the original ResNet [1] and Inception [2] network architectures. We train these models on the challenge dataset and apply a balanced sampling strategy [3] to handle the unbalanced challenge dataset. Moreover, the DSD [4] process is applied to further improve model performance.
In this competition, we submit five entries. The first and second are combinations of single-scale results using a weighted arithmetic average whose weights are searched by a greedy strategy. The third is a combination of single-model results using the same strategy as the first entry. The fourth and fifth are combinations of single-model results using a simple averaging strategy.
[1] K. He, X. Zhang, S. Ren, J. Sun. Identity Mappings in Deep Residual Networks. In ECCV 2016. abs/1603.05027
[2] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In ICLR 2016. abs/1602.07261
[3] L. Shen, Z. Lin, Q. Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. abs/1512.05830
[4] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, W. J. Dally. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow. abs/1607.04381

[top]