
News

History

2015, 2014, 2013, 2012, 2011, 2010

Tentative Timetable

Introduction

This challenge evaluates algorithms for object localization/detection from images/videos and scene classification/parsing at scale.
  1. Object localization for 1000 categories.
  2. Object detection for 200 fully labeled categories.
  3. Object detection from video for 30 fully labeled categories.
  4. Scene classification for 365 scene categories (Joint with MIT Places team) on Places2 Database http://places2.csail.mit.edu.
  5. Scene parsing (new) for 150 stuff and discrete object categories (Joint with MIT Places team).

Challenges

I: Object localization

The data for the classification and localization tasks will remain unchanged from ILSVRC 2012. The validation and test data will consist of 150,000 photographs, collected from Flickr and other search engines, hand labeled with the presence or absence of 1000 object categories. The 1000 object categories contain both internal nodes and leaf nodes of ImageNet, but do not overlap with each other. A random subset of 50,000 of the images with labels will be released as validation data included in the development kit along with a list of the 1000 categories. The remaining images will be used for evaluation and will be released without labels at test time. The training data, the subset of ImageNet containing the 1000 categories and 1.2 million images, will be packaged for easy downloading. The validation and test data for this competition are not contained in the ImageNet training data.

In this task, given an image, an algorithm will produce 5 class labels $c_i, i=1,\dots,5$ in decreasing order of confidence and 5 bounding boxes $b_i, i=1,\dots,5$, one for each class label. The quality of a localization labeling will be evaluated based on the label that best matches the ground truth label for the image and also the bounding box that overlaps with the ground truth. The idea is to allow an algorithm to identify multiple objects in an image and not be penalized if one of the objects identified was in fact present, but not included in the ground truth.

The ground truth labels for the image are $C_k, k=1,\dots n$ with $n$ class labels. For each ground truth class label $C_k$, the ground truth bounding boxes are $B_{km},m=1\dots M_k$, where $M_k$ is the number of instances of the $k^\text{th}$ object in the current image.

Let $d(c_i,C_k) = 0$ if $c_i = C_k$ and 1 otherwise. Let $f(b_i,B_{km}) = 0$ if $b_i$ and $B_{km}$ have more than $50\%$ overlap, and 1 otherwise. The error of the algorithm on an individual image will be computed using:

\[ e = \frac{1}{n} \cdot \sum_k \min_i \min_m \max \{ d(c_i,C_k), f(b_i,B_{km}) \} \]

The winner of the object localization challenge will be the team which achieves the minimum average error across all test images.
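The per-image error above can be sketched in a few lines of Python. This is an illustrative sketch, not the official evaluation code: the function names, the `(xmin, ymin, xmax, ymax)` box convention, and the reading of "overlap" as intersection over union are all assumptions made here.

```python
from typing import Dict, List, Tuple

# Assumed box convention: (xmin, ymin, xmax, ymax)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def localization_error(preds: List[Tuple[str, Box]],
                       truth: Dict[str, List[Box]]) -> float:
    """Per-image error e.

    preds: up to 5 (c_i, b_i) pairs.
    truth: maps each ground truth label C_k to its boxes B_km.
    """
    total = 0
    for label, gt_boxes in truth.items():
        # max{d, f} is 1 unless one prediction matches both the label and a box
        best = 1
        for c, b in preds:
            if c == label and any(iou(b, g) > 0.5 for g in gt_boxes):
                best = 0
                break
        total += best
    return total / len(truth)
```

For example, a prediction `("dog", (0, 0, 10, 10))` against a single ground truth dog box `(1, 1, 10, 10)` has an IoU of 0.81, so the image error is 0.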

II: Object detection

The training and validation data for the object detection task will remain unchanged from ILSVRC 2014. The test data will be partially refreshed with new images for this year's competition. There are 200 basic-level categories for this task which are fully annotated on the test data, i.e. bounding boxes for all categories in the image have been labeled. The categories were carefully chosen considering different factors such as object scale, level of image clutter, average number of object instances, and several others. Some of the test images will contain none of the 200 categories. Browse all annotated detection images here.

For each image, algorithms will produce a set of annotations $(c_i, s_i, b_i)$ of class labels $c_i$, confidence scores $s_i$ and bounding boxes $b_i$. This set is expected to contain each instance of each of the 200 object categories. Objects that an algorithm fails to annotate will be penalized, as will duplicate detections (two annotations for the same object instance). The winner of the detection challenge will be the team which achieves first place accuracy on the most object categories.
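To make the penalties concrete, here is a minimal sketch of one common greedy matching scheme for a single image and a single category. It is an assumption for illustration, not the challenge's official procedure: the function names, the box convention, and the fixed 0.5 overlap threshold are all hypothetical, and the development kit defines the actual rules.

```python
from typing import List, Tuple

# Assumed box convention: (xmin, ymin, xmax, ymax)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_detections(dets: List[Tuple[float, Box]],
                     gts: List[Box],
                     thr: float = 0.5) -> Tuple[List[bool], int]:
    """Greedily match detections (s_i, b_i) to ground truth boxes.

    Returns (flags, missed): flags follow descending-score order and are
    True for a true positive; missed counts unmatched ground truth boxes.
    """
    used = [False] * len(gts)
    flags = []
    for _score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            overlap = box_iou(box, g)
            if not used[j] and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            used[best_j] = True   # each object can be claimed only once
            flags.append(True)
        else:
            flags.append(False)   # duplicate or spurious detection
    return flags, used.count(False)
```

Under this scheme, a second detection of an already matched object is counted as a false positive, and every ground truth box left unmatched counts as a miss.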

III: Object detection from video

This is similar in style to the object detection task. We will partially refresh the validation and test data for this year's competition. There are 30 basic-level categories for this task, which are a subset of the 200 basic-level categories of the object detection task. The categories were carefully chosen considering different factors such as movement type, level of video clutter, average number of object instances, and several others. All classes are fully labeled for each clip. Browse all annotated train/val snippets here.

For each video clip, algorithms will produce a set of annotations $(f_i, c_i, s_i, b_i)$ of frame number $f_i$, class labels $c_i$, confidence scores $s_i$ and bounding boxes $b_i$. This set is expected to contain each instance of each of the 30 object categories at each frame. The evaluation metric is the same as for the object detection task, meaning objects that an algorithm fails to annotate will be penalized, as will duplicate detections (two annotations for the same object instance). The winner of the detection from video challenge will be the team which achieves best accuracy on the most object categories.

IV: Scene classification

This challenge is being organized by the MIT Places team, namely Bolei Zhou, Aditya Khosla, Antonio Torralba and Aude Oliva. Please feel free to send any questions or comments to Bolei Zhou (bzhou@csail.mit.edu).

If you are reporting results of the taster challenge or using the Places2 dataset, please cite:

The goal of this challenge is to identify the scene category depicted in a photograph. The data for this task comes from the Places2 Database which contains 10+ million images belonging to 400+ unique scene categories. Specifically, the challenge data will be divided into 8M images for training, 36K images for validation and 328K images for testing coming from 365 scene categories. Note that there is a non-uniform distribution of images per category for training, ranging from 3,000 to 40,000, mimicking a more natural frequency of occurrence of the scene.

For each image, algorithms will produce a list of at most 5 scene categories in descending order of confidence. The quality of a labeling will be evaluated based on the label that best matches the ground truth label for the image. The idea is to allow an algorithm to identify multiple scene categories in an image given that many environments have multi-labels (e.g. a bar can also be a restaurant) and that humans often describe a place using different words (e.g. forest path, forest, woods).

For each image, an algorithm will produce 5 labels \( l_j, j=1,...,5 \). The ground truth labels for the image are \( g_k, k=1,...,n \) with n classes of scenes labeled. The error of the algorithm for that image would be

\[ e= \frac{1}{n} \cdot \sum_k \min_j d(l_j,g_k). \]

\( d(x,y)=0 \) if \( x=y \) and 1 otherwise. The overall error score for an algorithm is the average error over all test images. Note that for this version of the competition, n=1, that is, one ground truth label per image.
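As a small illustration of this metric, the following Python sketch computes the per-image error and the average over a set of images. The function names and the use of plain strings for scene labels are assumptions made here, not part of the development kit.

```python
from typing import List, Sequence


def image_error(labels: Sequence[str], truth: Sequence[str]) -> float:
    """Error e for one image: labels are l_j (at most 5 are used), truth are g_k."""
    top5 = labels[:5]
    # min_j d(l_j, g_k) is 0 exactly when g_k appears among the predicted labels
    return sum(0 if g in top5 else 1 for g in truth) / len(truth)


def overall_error(predictions: List[Sequence[str]],
                  ground_truths: List[Sequence[str]]) -> float:
    """Average per-image error over all images."""
    errors = [image_error(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(errors) / len(errors)
```

With n=1, an image contributes 0 when its single ground truth label appears anywhere in the five predictions and 1 otherwise, so the overall score is the familiar top-5 error rate.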

V: Scene parsing

This challenge is being organized by the MIT CSAIL Vision Group. The goal of this challenge is to segment and parse an image into different image regions associated with semantic categories, such as sky, road, person, and bed. The data for this challenge comes from the ADE20K Dataset (the full dataset will be released after the challenge), which contains more than 20K scene-centric images exhaustively annotated with objects and object parts. Specifically, the challenge data is divided into 20K images for training, 2K images for validation, and another batch of held-out images for testing. In total, 150 semantic categories are included in the challenge for evaluation, covering stuff such as sky, road, and grass, and discrete objects such as person, car, and bed. Note that there is a non-uniform distribution of objects occurring in the images, mimicking a more natural object occurrence in daily scenes.

To evaluate the segmentation algorithms, we will take the mean of the pixel-wise accuracy and class-wise IoU as the final score. Pixel-wise accuracy indicates the ratio of pixels which are correctly predicted, while class-wise IoU indicates the Intersection over Union of pixels averaged over all the 150 semantic categories. Refer to the development kit for details.
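A rough sketch of this score for one prediction is shown below, using flat lists of per-pixel class indices. The function name, the input layout, and the choice to skip classes that appear in neither map are assumptions made here; the development kit defines the actual handling, including how statistics are aggregated across images.

```python
from typing import Sequence


def parsing_score(pred: Sequence[int], truth: Sequence[int], num_classes: int) -> float:
    """Mean of pixel-wise accuracy and class-wise IoU for one label map."""
    pairs = list(zip(pred, truth))
    # Pixel-wise accuracy: fraction of pixels whose predicted class is correct
    pixel_acc = sum(1 for p, t in pairs if p == t) / len(pairs)

    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # assumption: classes absent from both maps are skipped
            ious.append(inter / union)
    mean_iou = sum(ious) / len(ious)

    return (pixel_acc + mean_iou) / 2
```

For a four-pixel example with `pred = [0, 0, 1, 1]` and `truth = [0, 1, 1, 1]`, pixel accuracy is 3/4 and the two class IoUs are 1/2 and 2/3, giving a score of 2/3.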

The data and the development kit are located at http://sceneparsing.csail.mit.edu. Please feel free to send any questions or comments about this scene parsing task to Bolei Zhou (bzhou@csail.mit.edu).

FAQ

1. Are challenge participants required to reveal all details of their methods?

Entries to ILSVRC2016 can be either "open" or "closed." Teams submitting "open" entries will be expected to reveal most details of their method (special exceptions may be made for pending publications). Teams may choose to submit a "closed" entry, and are then not required to provide any details beyond an abstract. The motivation for introducing this division is to allow greater participation from industrial teams that may be unable to reveal algorithmic details while also allocating more time at the 2nd ImageNet and COCO Visual Recognition Challenges Joint Workshop to teams that are able to give more detailed presentations. Participants are strongly encouraged to submit "open" entries if possible.

2. Can additional images or annotations be used in the competition?

Entries submitted to ILSVRC2016 will be divided into two tracks: the "provided data" track (entries only using ILSVRC2016 images and annotations from any aforementioned tasks) and the "external data" track (entries using any outside images or annotations). Any team that is unsure which track their entry belongs to should contact the organizers ASAP. Additional clarifications will be posted here as needed.

3. How many entries can each team submit per competition?

Participants who have investigated several algorithms may submit one result per algorithm (up to 5 algorithms). Changes in algorithm parameters do not constitute a different algorithm (following the procedure used in PASCAL VOC).

Citation

If you are reporting results of the challenge or using the dataset, please cite:

Organizers

Contact

Please feel free to send any questions or comments to