3. Feature Pyramid Network (Top-Down Traversal): The Conv5 (convolution stage 5) output is used directly as feature map M5. Each successive feature map is generated by upsampling the preceding top-down feature map by a factor of 2 and combining it with the corresponding bottom-up convolution stage output via a lateral connection. The bottom-up layers are passed through a 1 × 1 convolution layer so that their depth is reduced to that of the top-down layer, allowing element-wise addition to take place. Feature maps M4 and M3 are generated in this manner (see the sketch after this list).
4. Each feature map (M3-M5) is passed through a 3 × 3 convolution to generate the pyramid feature maps (P3-P5). P5 is passed through a max-pooling layer to generate the additional pyramid feature map P6.
5. The RPN Objectness sub-net consists of three 1 × 1 convolutional layers with a depth of 18, followed by a sigmoid activation function. It predicts whether or not an object is present at each anchor position of every pyramid feature map fed to it.
6. The RPN box detection sub-net performs regression on bounding boxes and consists of three 1 × 1 convolutional layers with a depth of 36.
7. Region Proposal sub-net: The region proposal sub-net takes the anchors and the outputs of both the RPN Objectness and box detection sub-nets to generate region proposals, from which it selects the best 2,000.
8. Box Head: FPN-RoI mapping is performed, followed by RoI Align to generate a 7 × 7 feature map for each RoI. The input is reshaped and fed through two fully connected layers with 1,024 nodes each, yielding a vector of length 1,024 for every RoI.
9. Classifier Sub-network: The classifier is a fully connected layer that predicts the object class. It has as many nodes as there are classes and uses softmax activation.
10. Bounding Box Regressor: The regressor is a fully connected layer that gives the delta values for the bounding box coordinates. It has four times as many nodes as there are classes and uses linear activation.
11. Mask Head: The mask head runs parallel to the box head. The RoIs from the RoI Align operation are fed through four convolutional layers, each with filters of dimension 3 × 3 × 256. The resulting output is passed through a 2 × 2 × 256 transposed convolution layer and then a 1 × 1 convolutional layer whose number of output channels equals the number of classes, giving one mask per class for each detection. Masks are rescaled to the input image size using bilinear interpolation and applied to the input image.
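The top-down construction in steps 3 and 4 can be sketched as follows. This is a minimal illustration, assuming the standard 256-channel pyramid depth; the module and variable names (FPNTopDown, c3, c4, c5) are ours, with c3-c5 standing in for the bottom-up stage outputs C3-C5.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Minimal sketch of the FPN top-down pathway (steps 3-4 above)."""

    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, out_ch=256):
        super().__init__()
        # 1x1 lateral convolutions reduce bottom-up depths to the top-down depth
        self.lateral5 = nn.Conv2d(c5_ch, out_ch, kernel_size=1)
        self.lateral4 = nn.Conv2d(c4_ch, out_ch, kernel_size=1)
        self.lateral3 = nn.Conv2d(c3_ch, out_ch, kernel_size=1)
        # 3x3 convolutions turn the merged maps M3-M5 into pyramid features P3-P5
        self.smooth5 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.smooth4 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.smooth3 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, c3, c4, c5):
        m5 = self.lateral5(c5)  # M5 comes directly from C5
        # Upsample the top-down map by 2 and add the lateral connection
        m4 = self.lateral4(c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral3(c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        p5, p4, p3 = self.smooth5(m5), self.smooth4(m4), self.smooth3(m3)
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)  # extra level P6 from P5
        return p3, p4, p5, p6
```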
Table 3.1 Layers involved in the ResNet-101 architecture.
Layer type | Number of iterations | Kernel size (h × w × d) | Number of filters | Stride |
Conv1 (C1) | 1 | 7 × 7 × 3 | 64 | 2 |
Max pool | 1 | 3 × 3 | 1 | 2 |
Conv2 (C2) | 3 | 1 × 1 × 64 | 64 | 1 |
 | | 3 × 3 × 64 | 64 | 1 |
 | | 1 × 1 × 64 | 256 | 1 |
Conv3 (C3) | 4 | 1 × 1 × 256 | 128 | 1 |
 | | 3 × 3 × 128 | 128 | 1 |
 | | 1 × 1 × 128 | 512 | 1 |
Conv4 (C4) | 23 | 1 × 1 × 512 | 256 | 1 |
 | | 3 × 3 × 256 | 256 | 1 |
 | | 1 × 1 × 256 | 1,024 | 1 |
Conv5 (C5) | 3 | 1 × 1 × 1,024 | 512 | 1 |
 | | 3 × 3 × 512 | 512 | 1 |
 | | 1 × 1 × 512 | 2,048 | 1 |
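Each Conv2-Conv5 row of Table 3.1 describes a bottleneck block repeated the stated number of times. A minimal sketch of one Conv2 block, using the kernel sizes and filter counts from the table (batch normalization, ReLU, and ResNet's residual shortcut are omitted here for brevity):

```python
import torch.nn as nn

def conv2_bottleneck():
    """One Conv2 (C2) block from Table 3.1: 1x1x64 -> 3x3x64 -> 1x1 to 256 filters.
    Sketch only; the full ResNet block adds BatchNorm, ReLU, and a skip connection."""
    return nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=1, stride=1),
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
        nn.Conv2d(64, 256, kernel_size=1, stride=1),
    )
```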
Mask R-CNN is used to segment and construct pixel-wise masks for each customer in a given video frame. The output of this step is a dictionary of masks and bounding box coordinates that enclose the detected customers. These person detections are also used later, in Stage 3 of the framework.
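As an illustration of this detection step, the following sketch uses torchvision's pretrained Mask R-CNN. Note the assumptions: the off-the-shelf torchvision model has a ResNet-50-FPN backbone rather than the ResNet-101 described above, and the function name detect_customers and the 0.7 score threshold are ours.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stand-in model: torchvision's pretrained Mask R-CNN (ResNet-50-FPN backbone);
# the structure of its outputs matches the description above.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_customers(frame_rgb, score_thresh=0.7):
    """Return a dict of per-customer binary masks and bounding boxes for one frame."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    persons = {}
    for i, (label, score) in enumerate(zip(out["labels"], out["scores"])):
        if label.item() == 1 and score.item() >= score_thresh:  # COCO class 1 = person
            persons[i] = {
                "mask": (out["masks"][i, 0] > 0.5).numpy(),  # pixel-wise mask
                "box": out["boxes"][i].numpy(),              # [x1, y1, x2, y2]
            }
    return persons
```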
To obtain the foreground information, we take the aggregate foreground mask produced by the background subtraction model and remove from it the regions it shares with the Mask R-CNN masks. This step ensures that the clothing worn by customers is excluded from the foreground.
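A minimal sketch of this set-difference step, assuming boolean NumPy masks; the names fg_bgsub, person_masks, and refine_foreground are ours:

```python
import numpy as np

def refine_foreground(fg_bgsub, person_masks):
    """Remove customer pixels from the background-subtraction foreground.

    fg_bgsub: HxW boolean foreground mask from the background subtraction model.
    person_masks: iterable of HxW boolean masks from Mask R-CNN, one per customer.
    Returns the foreground with the regions common to both models removed, so
    that clothing currently worn by customers is excluded.
    """
    persons = np.zeros_like(fg_bgsub)
    for m in person_masks:
        persons |= m                 # union of all customer masks
    return fg_bgsub & ~persons       # keep foreground pixels not covered by a person
```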
3.3.2 Detection of Active Garments
We define an active garment as a garment present in the foreground frame obtained from the preceding stage. Individual color masks corresponding to the dominant colors, such as red, blue, green, and yellow, are applied to this foreground frame. As our data pre-processing converts the video frames to the HSV color model, the color masks used in this step cover the entire range of HSV values for a given color (i.e., all possible shades of that color) rather than a single predetermined value. Applying each color mask to the foreground frame yields a corresponding image for that color. The resulting images are converted to grayscale to reduce space and computational complexity. Morphological image processing, most notably the closing operation, is then performed to fill the small holes that arise from noise.
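A sketch of the color masking, grayscale conversion, and closing steps using OpenCV. The HSV bounds below are illustrative stand-ins for the full per-color ranges (note that red wraps around the hue axis and needs two intervals), and the 5 × 5 structuring element is an assumption.

```python
import cv2
import numpy as np

# Illustrative HSV ranges; the exact per-color bounds are a design choice.
HSV_RANGES = {
    "red":    [((0, 70, 50), (10, 255, 255)), ((170, 70, 50), (180, 255, 255))],
    "green":  [((36, 70, 50), (85, 255, 255))],
    "blue":   [((90, 70, 50), (130, 255, 255))],
    "yellow": [((20, 70, 50), (35, 255, 255))],
}

def color_layers(foreground_bgr):
    """Apply per-color HSV masks, convert to grayscale, and close small holes."""
    hsv = cv2.cvtColor(foreground_bgr, cv2.COLOR_BGR2HSV)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    layers = {}
    for color, ranges in HSV_RANGES.items():
        mask = np.zeros(hsv.shape[:2], dtype=np.uint8)
        for lo, hi in ranges:
            mask |= cv2.inRange(hsv, np.array(lo), np.array(hi))
        masked = cv2.bitwise_and(foreground_bgr, foreground_bgr, mask=mask)
        gray = cv2.cvtColor(masked, cv2.COLOR_BGR2GRAY)
        # Morphological closing fills the small noise-induced holes
        layers[color] = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
    return layers
```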
Edge detection is used to find the contours and edges present in each of the preceding frames. A given contour may represent either an entire active garment or only a region of one present in the foreground. Hence, once the garment regions are obtained, an imperative step of identifying missing garment regions is performed. These missing regions arise when a customer occludes the active garment under consideration, obstructing a portion of it. They are marked as garment regions before the linking process. Linking identifies adjacent contours that belong to the same active garment and associates them based on spatial distance. This stage thus yields all active garments present in a given video frame, which are used by the subsequent stages.
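The linking criterion is specified only as spatial distance, so the following is a rough greedy sketch; the Canny thresholds, the max_gap value, and the bounding-box merging strategy are all our assumptions.

```python
import cv2

def _gap(a0, a1, b0, b1):
    """Distance between 1-D intervals [a0, a1] and [b0, b1]; 0 if they overlap."""
    return max(0, max(a0, b0) - min(a1, b1))

def link_contours(gray_layer, max_gap=30):
    """Detect garment contours and greedily link nearby ones into single garments.

    max_gap is an assumed pixel threshold for associating contours that belong
    to the same (possibly occluded) active garment; the value is illustrative.
    """
    edges = cv2.Canny(gray_layer, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    merged = []
    for x, y, w, h in (cv2.boundingRect(c) for c in contours):
        for i, (mx, my, mw, mh) in enumerate(merged):
            # Link contours whose bounding boxes are within max_gap on both axes
            if _gap(x, x + w, mx, mx + mw) < max_gap and \
               _gap(y, y + h, my, my + mh) < max_gap:
                nx, ny = min(x, mx), min(y, my)
                merged[i] = (nx, ny,
                             max(x + w, mx + mw) - nx,
                             max(y + h, my + mh) - ny)
                break
        else:
            merged.append((x, y, w, h))
    return merged  # one bounding box per linked active garment
```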
3.3.3 Identification of Garments of Interest
Once the active garments are determined, we introduce a novel approach to determine the garments of interest. We define a garment of interest