Image Representation with Bag-of-Words
Abstract Image classification, the task of assigning one or more category labels to an image, is an active research topic in computer vision and pattern recognition. It has applications in video surveillance, remote sensing, web content analysis, biometrics, and other areas. Many successful models transform low-level descriptors into richer mid-level representations. Extracting mid-level features involves a sequence of interchangeable modules, but these generally fall into two major parts: Bag-of-Words (BoW) and Spatial Pyramid Matching (SPM). The goal is to embed low-level descriptors in a representative codebook space. First, low-level descriptors are extracted at interest points or on dense grids. Then, a pre-defined codebook is applied to encode each descriptor using a specific coding scheme. The resulting code is normally a vector with binary or continuous elements, depending on the coding scheme, and can be referred to as a mid-level descriptor. Next, the image is divided into increasingly finer spatial subregions. The codes from each subregion are pooled together, by averaging or normalizing, into a histogram. Finally, the image representation is generated by concatenating the histograms from all subregions. In this chapter, we introduce the key techniques employed in the BoW framework, including SPM, namely the coding process and the pooling process.
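The pipeline described above (coding, spatial subdivision, pooling, concatenation) can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes hard assignment to the nearest codeword as the coding scheme, average pooling, point coordinates normalized to [0, 1), and no per-level weighting of the pyramid. The function names (hard_assign, average_pool, spm_representation) and the toy data are hypothetical, chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def hard_assign(descriptor, codebook):
    """Binary code: 1 at the nearest codeword, 0 elsewhere (assumed coding scheme)."""
    nearest = min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))
    return [1.0 if k == nearest else 0.0 for k in range(len(codebook))]


def average_pool(codes, size):
    """Average a list of codes into one histogram; all zeros for an empty subregion."""
    if not codes:
        return [0.0] * size
    return [sum(code[k] for code in codes) / len(codes) for k in range(size)]


def cell_index(coord, cells):
    """Map a coordinate in [0, 1) to a cell index, clamping values at the border."""
    return min(int(coord * cells), cells - 1)


def spm_representation(points, descriptors, codebook, levels=2):
    """Concatenate pooled histograms over a pyramid of 1x1, 2x2, ... grids.

    points      -- (x, y) location of each descriptor, assumed normalized to [0, 1)
    descriptors -- low-level descriptors, one per point
    codebook    -- list of codewords (assumed to be learned beforehand)
    levels      -- number of pyramid levels; level l splits the image into 2**l cells per axis
    """
    size = len(codebook)
    codes = [hard_assign(d, codebook) for d in descriptors]
    representation = []
    for level in range(levels):
        cells = 2 ** level
        for row in range(cells):
            for col in range(cells):
                in_cell = [
                    code
                    for (x, y), code in zip(points, codes)
                    if cell_index(x, cells) == col and cell_index(y, cells) == row
                ]
                representation.extend(average_pool(in_cell, size))
    return representation


# Toy example: two codewords, three descriptors at three image locations.
codebook = [[0.0, 0.0], [1.0, 1.0]]
points = [(0.1, 0.1), (0.9, 0.1), (0.2, 0.8)]
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]]
rep = spm_representation(points, descriptors, codebook, levels=2)
print(len(rep))  # 2 codewords x (1 + 4) subregions
```

With two pyramid levels, the representation has one histogram for the whole image followed by four histograms for the 2x2 grid, each with one bin per codeword. Replacing hard_assign with a soft or sparse coding function, or average_pool with max pooling, changes the coding or pooling step without touching the rest of the sketch.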