considering its 3D location and posture. This 3D data of a garment is obtained from the deformable model by comparing the observed state of garments with predicted candidate shapes. Sutoyo et al. [16] proposed a methodology for hand detection by collecting an image dataset comprising positive (with hands) and negative (without hands) images. A Haar cascade classifier was trained on these images to build a hand detection model. The key disadvantage of this model is that it requires an up-close image of a hand to classify it accurately, a scenario that is unattainable from surveillance footage data.
Modanwal et al. [17] developed a model for the detection of the human wrist point using geometric features. After obtaining a binary image of the hand mask, circular and elliptical shapes are used to approximate the palm region. The authors observed that the palm can be approximated by the largest circle inscribed in the hand mask, and that the wrist point lies approximately on the boundary of this circle. Moreover, the wrist landmark is a fixed point at the center of the forearm-palm joint, irrespective of hand rotation. The authors locate this point by applying a distance transform to the binary hand mask, thereby obtaining the largest circle inscribed within the mask. By locating the pixel with the largest value in the distance transform and bounding the maximum angles of tilt that a human hand can attain, mathematical operations yield the wrist landmark. Although this intricate process achieves high precision and recall, it is not suitable for video datasets, which comprise noisy images and frequently occluded hand regions. The ability to run on video data is vital for obtaining the hand mask needed to detect the wrist point.
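For concreteness, the largest-inscribed-circle idea can be sketched with OpenCV's distance transform as shown below. This is only an illustration of the principle just described, not the full method of Modanwal et al.; the function name and mask conventions are our own.

```python
# Illustrative sketch: the pixel with the maximum distance-transform value
# inside a binary hand mask approximates the palm centre, and that maximum
# value is the radius of the largest inscribed circle.
import cv2
import numpy as np

def palm_circle(hand_mask: np.ndarray):
    """hand_mask: binary uint8 image (255 = hand, 0 = background)."""
    # Distance of every foreground pixel to the nearest background pixel
    dist = cv2.distanceTransform(hand_mask, cv2.DIST_L2, 5)
    # The maximum of the distance transform defines the largest inscribed circle
    radius = float(dist.max())
    cy, cx = np.unravel_index(np.argmax(dist), dist.shape)
    return (int(cx), int(cy)), radius
```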
As discussed previously, the existing works suffer from one or more of the following limitations: some could not detect complex garments accurately, some faced issues in detecting occluded garments in video surveillance footage, some performed inadequately on uncommon garments such as Indian sarees, some required close-up images of hands for reliable identification, and some relied on a dated object detection framework such as R-CNN. Our proposed approach attempts to address a majority of these problems. Color masks are applied to detect regions of garments, and these regions are linked to obtain the entire garment. Missing regions of partially occluded garments are also identified before linking. The OpenPose framework is used for pose estimation as it does not require close-up images of wrists, and Mask R-CNN is used as it outperforms R-CNN.
3.3 Methodology
In this section, we elucidate the key stages in our proposed framework, which aims at identifying the garments of interest to customers as they browse through the collection of garments available at a garment store. The framework comprises three integral stages, namely:
1. Stage 1: Obtaining the foreground information
2. Stage 2: Detection of active garments
3. Stage 3: Identification of garments of interest
The overall framework is illustrated in Figure 3.1. The stages of the framework are delineated in the forthcoming subsections.
3.3.1 Obtaining the Foreground Information
The proposed approach processes an input video from the dataset on a per-frame basis. Before a frame is processed for garment identification, it is converted from the RGB color space to the HSV color space so that pixel intensity can be distinguished from color information. To obtain the foreground information, we use a background subtraction model in combination with an object detection algorithm, Mask R-CNN. The background subtraction model identifies the pixels associated with non-static objects in a given frame, for instance when a customer picks up a garment that interests them. As the garments worn by the customers are also included in this foreground, the Mask R-CNN model is used to identify customers and obtain the pixels associated with the customers alone. These pixels are then excluded from the foreground obtained by the background subtraction algorithm, thereby ensuring that only pixels associated with the garments at the store are considered by the subsequent stages of the proposed framework.
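A minimal sketch of this stage is given below, assuming the GMG foreground mask (Section 3.3.1.1) and the per-person Mask R-CNN masks (Section 3.3.1.2) are already available as binary arrays; the function and variable names are illustrative, not part of the original implementation.

```python
# Stage 1 sketch: HSV conversion, then removal of customer pixels from the
# dynamic foreground so that only garment pixels remain.
import cv2
import numpy as np

def garment_foreground(frame_bgr, gmg_foreground_mask, person_masks):
    """Return the HSV frame and the foreground pixels not belonging to customers."""
    # Separate pixel intensity (V) from colour information (H, S)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Union of all per-person masks produced by Mask R-CNN
    people = np.zeros(gmg_foreground_mask.shape, dtype=bool)
    for m in person_masks:
        people |= m.astype(bool)
    # Keep only moving pixels that do not belong to a detected customer
    garment_mask = gmg_foreground_mask.astype(bool) & ~people
    return hsv, garment_mask.astype(np.uint8) * 255
```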
Figure 3.1 Architecture of proposed framework.
3.3.1.1 GMG Background Subtraction
The GMG background subtraction model, proposed by Godbehere et al. [18], was used to obtain the dynamic foreground from a given input frame. We briefly describe this model as illustrated in Figure 3.2.
An input frame I(k) is quantized in color space and compared against the static background image model, Ĥ(k), to generate a posterior probability image. The resulting image is filtered using morphological operations. The filtered image is then segmented into a set of bounding boxes around the foreground objects.
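For illustration, OpenCV's contrib module ships an implementation of this subtractor. The sketch below shows one possible usage; the parameter values and the video path are illustrative and are not taken from the original work.

```python
# Hedged usage sketch of GMG background subtraction via opencv-contrib-python.
import cv2

subtractor = cv2.bgsegm.createBackgroundSubtractorGMG(
    initializationFrames=120,   # frames used to build the background model
    decisionThreshold=0.8)      # posterior threshold for labelling foreground
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

cap = cv2.VideoCapture("store_footage.mp4")   # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = subtractor.apply(frame)                       # per-pixel foreground estimate
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)  # morphological noise filtering
cap.release()
```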
3.3.1.2 Person Detection
We utilize the Mask R-CNN [9] object detection model to obtain masks for each person in a given video frame, since the customers must be identified before proceeding with further steps. Mask R-CNN is a state-of-the-art deep learning framework for instance segmentation. It improves upon Faster R-CNN [19] by replacing RoI Pooling with a new layer named RoIAlign, which yields 10% to 50% more accurate masks [9]. RoIAlign overcomes the location misalignment introduced when RoI Pooling quantizes the regions of the input feature map into fixed blocks. Its key steps are explained below.
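As an illustration of how person masks can be extracted, the sketch below uses an off-the-shelf pre-trained Mask R-CNN from torchvision. Note that this stand-in uses a ResNet-50-FPN backbone rather than the ResNet-101 backbone described below, and the score and mask thresholds are illustrative choices.

```python
# Illustrative person-mask extraction with a pre-trained torchvision Mask R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

PERSON_CLASS_ID = 1  # COCO label index for "person"

def person_masks(frame_rgb, score_thresh=0.7, mask_thresh=0.5):
    """Return a list of binary masks, one per detected person in the frame."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == PERSON_CLASS_ID) & (out["scores"] > score_thresh)
    # Masks are soft [N, 1, H, W] probabilities; threshold them to binary masks
    return [(m[0] > mask_thresh).numpy() for m in out["masks"][keep]]
```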
Figure 3.2 GMG background subtraction model [18].
1. Image Preprocessing: The input image is pre-processed by centering, rescaling, and padding. Pixels are channel-wise centered by taking the mean pixel value across all training and test examples for each of the three color channels and subtracting it from the corresponding channel of the input image, centering the values around 0. The image is then scaled to a side length between 800 px and 1,333 px and padded so that its sides become multiples of 32. All images are resized to 1,024 × 1,024 × 3 to allow for batch training.
2. ResNet-101 backbone (Bottom-Up Traversal): Table 3.1 shows the series of network layers through which the input image is passed. Multiple layers are grouped together into stages Conv1 to Conv5. Each convolution layer is followed by a batch normalization layer and a ReLU activation