The pooling procedure converts the mid-level descriptors into the final image representation by aggregating the occurrences of visual words in the input image. To capture the shape or location of an object, SPM  is proposed, which divides the image into increasingly finer spatial subregions and computes a histogram of mid-level descriptors (the codes for local descriptors) from each subregion. Let $\ell = 1, 2, \ldots, L$ denote the level of subpartition, such that there are $2^{\ell-1} \times 2^{\ell-1}$ subregions at level $\ell$. The pooling strategy then aggregates the occurrences of visual words within each subregion. The final image representation is generated by concatenating all the pooled features.
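The subdivision described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `subregion_index` and `num_subregions` are hypothetical, and it assumes each local descriptor comes with pixel coordinates (x, y) inside an image of known width and height, with subregions numbered in row-major order.

```python
import numpy as np

def subregion_index(xy, width, height, level):
    """Return the flat subregion index of each point at a pyramid level.

    `level` is 1-based, so level 1 is the whole image and level l has
    2**(l-1) x 2**(l-1) subregions (row-major numbering is an assumption).
    """
    cells = 2 ** (level - 1)                      # subregions per side
    xy = np.asarray(xy, dtype=float)
    # Scale coordinates to [0, cells) and clip points on the far border.
    col = np.minimum((xy[:, 0] / width * cells).astype(int), cells - 1)
    row = np.minimum((xy[:, 1] / height * cells).astype(int), cells - 1)
    return row * cells + col

def num_subregions(num_levels):
    """Total number of subregions over levels 1..num_levels."""
    return sum(4 ** (level - 1) for level in range(1, num_levels + 1))
```

With three levels this gives 1 + 4 + 16 = 21 subregions, so the concatenated representation is 21 times the length of a single pooled vector.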
We introduce two simple yet efficient pooling schemes, average pooling and max pooling, which are respectively defined as follows:

$$\text{average pooling:}\quad \frac{1}{N}\sum_{y_i \in I_s^{\ell}} y_i, \qquad \text{max pooling:}\quad \max_{y_i \in I_s^{\ell}} y_i,$$

where $I_s^{\ell}$ is the $s$th subregion at level $\ell$, $y_i \in I_s^{\ell}$ denotes the encoded features within $I_s^{\ell}$, and $N$ is the number of features within $I_s^{\ell}$. The sum and max functions are applied row-wise. The max pooling and average pooling strategies are analyzed theoretically in [19, 20]. In particular, max pooling always performs better than average pooling when a linear SVM is used.
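As a rough sketch of the two pooling operators, the following NumPy code pools the codes of one subregion and concatenates the results over several subregions. It rests on stated assumptions rather than on the original formulation: the codes of a subregion are stored as the columns of a K x N matrix (so "row-wise" means reducing along each row), an empty subregion pools to a zero vector, and the names `pool` and `concatenate_pooled` are hypothetical.

```python
import numpy as np

def pool(Y, method="max"):
    """Pool a (K, N) matrix whose columns are the N codes of one subregion."""
    Y = np.asarray(Y, dtype=float)
    if Y.shape[1] == 0:
        # Assumption: a subregion without features contributes zeros.
        return np.zeros(Y.shape[0])
    if method == "average":
        return Y.sum(axis=1) / Y.shape[1]   # row-wise sum divided by N
    if method == "max":
        return Y.max(axis=1)                # row-wise maximum
    raise ValueError(f"unknown pooling method: {method}")

def concatenate_pooled(subregion_codes, method="max"):
    """Concatenate the pooled vectors of all subregions into one vector."""
    return np.concatenate([pool(Y, method) for Y in subregion_codes])
```

For a subregion with codes (0, 0.2), (0.5, 0) and (1, 0.4), average pooling yields (0.5, 0.2) and max pooling yields (1, 0.4); the final descriptor is the concatenation of such vectors over every subregion at every level.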