

When using the dataset, please cite:

Ground truth for test data

Patch for images and precomputed features (updated on 7/17/2010)

Due to an error in our image collection process, a very small portion of the packaged images are blank images returned by websites where the original images had become unavailable. We found 970 such images (out of 1.2M) in training, 9 (out of 50K) in validation, and 19 (out of 150K) in test. Although this should not noticeably impact training and testing, we have released a patch that contains the correct images (6MB) and the correct precomputed features (80MB). Please go to the "images" and "features" download sections to download the patches.

To apply the patch to the images, simply replace the old images with the new ones from the patch. For the precomputed features, we provide a MATLAB program that modifies your old feature files. Please consult the readme files for details.
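For the image part, "replace the old images with the new ones" is just a file copy matched by relative path. The sketch below illustrates that step in Python; the directory layout and the `.JPEG` extension are assumptions (the patch readme is authoritative), and this does not replace the provided MATLAB program for the feature files.

```python
import shutil
from pathlib import Path

def apply_image_patch(patch_dir: str, dataset_dir: str) -> int:
    """Overwrite blank images in dataset_dir with the corrected files
    from patch_dir, matched by relative path. Returns the number of
    files replaced. Assumes the patch mirrors the dataset's layout."""
    patch_root = Path(patch_dir)
    data_root = Path(dataset_dir)
    replaced = 0
    for src in patch_root.rglob("*.JPEG"):
        dst = data_root / src.relative_to(patch_root)
        if dst.exists():           # only replace images already present
            shutil.copy2(src, dst)
            replaced += 1
    return replaced
```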

Development Kit

The development kit includes

Please be sure to consult the readme file included in the development kit.


The training images are the same images in the ImageNet 2010 Spring Release. There are a total of 1,261,406 images for training. The number of images for each synset (category) ranges from 668 to 3047.

There are 50,000 validation images, with 50 images per synset.

All images are in JPEG format.

To download the images, please register first, even if you are not entering the competition. Download links will be sent to you via email.


We have computed dense SIFT [1] features for all images -- training, validation, and test. They are available for download (features for the test data will be made available later).

Each image is resized to have a maximum side length of 300 pixels (smaller images are not enlarged). SIFT descriptors are computed on 20x20 overlapping patches with a spacing of 10 pixels. Images are further downsized (to 1/2 the side length and then 1/4 of the side length) and more descriptors are computed. We use the VLFeat [2] implementation of dense SIFT (version
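The resizing rule and the dense sampling grid above can be sketched as follows. This is only an illustration of the stated geometry (cap the longer side at 300, then sample 20x20 patches every 10 pixels), not the actual feature extraction code, which uses VLFeat.

```python
def resize_dims(w, h, max_side=300):
    """Image size after the challenge's resizing rule: the longer
    side is capped at max_side; smaller images are left unchanged."""
    longest = max(w, h)
    if longest <= max_side:
        return w, h
    scale = max_side / longest
    return round(w * scale), round(h * scale)

def patch_grid(side, patch=20, step=10):
    """Top-left coordinates of the 20x20 patches sampled every
    10 pixels along one image dimension of length `side`."""
    return list(range(0, side - patch + 1, step))
```

For a 300-pixel side this yields 29 patch positions per dimension; the same grid is then applied to the 1/2- and 1/4-scale versions of the image.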

We then perform k-means clustering of a random subset of 10 million SIFT descriptors to form a visual vocabulary of 1000 visual words. Each SIFT descriptor is quantized into a visual word using the nearest cluster center.
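The quantization step described above (assign each descriptor to its nearest of the 1000 cluster centers) can be written in a few lines of NumPy; this is a minimal sketch of that assignment, with toy dimensions rather than real 128-d SIFT descriptors.

```python
import numpy as np

def quantize(descriptors, centers):
    """Assign each descriptor (a row of `descriptors`) to the nearest
    cluster center under Euclidean distance, i.e. its visual word."""
    # Squared distances via the expansion ||d||^2 - 2 d.c + ||c||^2
    d2 = (
        (descriptors ** 2).sum(axis=1)[:, None]
        - 2.0 * descriptors @ centers.T
        + (centers ** 2).sum(axis=1)[None, :]
    )
    return d2.argmin(axis=1)
```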

We provide both the raw SIFT features (vldsift) and the visual codewords (sbow). The spatial coordinates of each descriptor/codeword are also included.
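An image's sbow representation is, in the usual bag-of-visual-words form, a histogram over the 1000-word vocabulary; a minimal sketch (the exact file format is described in the development kit's readme):

```python
import numpy as np

def sbow_histogram(words, vocab_size=1000):
    """Bag of visual words: count how often each codeword index
    occurs among one image's quantized descriptors."""
    return np.bincount(np.asarray(words), minlength=vocab_size)
```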

To run the demo system included in the development kit, you need to download the visual word features (for train and validation). Note that the raw SIFT features are not needed to run the demo code.

Please consult the readme file in the development kit for more details.


  1. D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004.
  2. A. Vedaldi and B. Fulkerson. VLFeat: An Open and Portable Library of Computer Vision Algorithms. 2008.
  3. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9 (2008), 1871-1874.