Abstract

Customization is an increasing trend in the fashion product industry to reflect individual lifestyles. Previous studies have examined the idea of virtual footwear try-on in augmented reality (AR) using a depth camera. However, the depth camera restricts the deployment of this technology in practice. To solve this problem, this research estimates the six degrees-of-freedom pose of a human foot from a color image using deep learning models. We construct a training dataset consisting of synthetic and real foot images that are automatically annotated. Three convolutional neural network models (deep object pose estimation (DOPE), DOPE2, and You Only Look Once (YOLO)-6D) are trained with the dataset to predict the foot pose in real-time. The model performances are evaluated using metrics for accuracy, computational efficiency, and training time. A prototyping system implementing the best model demonstrates the feasibility of virtual footwear try-on using a red–green–blue camera. Test results also indicate the necessity of real training data to bridge the reality gap in estimating the human foot pose.

1 Introduction

Modern consumers tend to choose product designs that distinguish them from peers by reflecting individual styles and tastes. It is therefore important to quickly evaluate whether a design fits the user. Fulfilling this need through real try-on of fashion commodities would be laborious and time-consuming. Recent progress in virtual prototyping has led to the implementation of virtual try-on technologies for the retail industry [1]. Such technologies allow users to virtually evaluate garments, footwear, and other accessories on their body in an interactive and immersive manner. This novel user experience might enhance customer satisfaction in both online and offline shopping [2].

Commercial try-on solutions can be classified into two modes based on the display environment: augmented reality (AR) and virtual reality (VR). An AR-based application creates the try-on process by overlaying a virtual product model on the user’s body in a real image, whereas a VR-based application simulates the process by rendering the model on an avatar that represents the user in a virtual world. Experimental findings [2] have shown that overlaying the user’s body is more appealing than using a virtual human avatar, owing to the positive affective responses of self-identification. Perhaps for this reason, the fashion retail industry has increasingly adopted AR-based virtual try-on to enhance purchase intention in real stores and in e-commerce environments.

Understanding what users perceive and expect from virtual try-on of apparel, footwear, and other wearable items is critical to the development of its technical solutions. This requires systematic user-centric studies. Shin and Baytar [3] investigated whether images of female bodies shown on a website influence female consumers’ level of body satisfaction and analyzed how these variables influence their intentions to use virtual apparel try-on technology. Despite the advantages offered by virtual try-on in VR/AR, its rate of adoption in retail channels and online webstores is slower than expected [4]. Inaccuracy of the virtual try-on process may cause consumer dissatisfaction with the fitting results. The quality of user experience is thus not satisfactory, as poor representation of the human body, face, or motion continues to produce unrealistic perceptions. Our previous work [5] conducted an experimental study to investigate differences in presence, usability, and user experience between actual and virtual try-on using both psychological and physiological measures. Analysis of the experimental data showed that the user experience produced by VR and AR try-on was not comparable to that of real try-on. The results also revealed the factors that negatively affect the quality of the user’s interaction with the try-on processes.

Fewer studies have focused on technical solutions for virtual footwear try-on than on those for garments and facial accessories [4]. Earlier try-on systems tracked the user’s feet in a controlled environment using specialized devices and visually simulated the try-on process based on the tracking results. These devices included green screens, wearable trackers, 3D glasses, and depth cameras, whose limited accessibility restricts their deployment in practice. The device to which virtual try-on is deployed should be as simple as a smartphone or desktop computer. However, precisely overlaying a shoe design onto a moving human foot in a real scene using such devices remains a challenging task. A critical problem is that estimation of the human foot pose in 3D space still lacks effective solutions. In recent years, machine learning has been successfully applied to object detection, classification, and tracking from red–green–blue (RGB) images in the fields of robotics, manufacturing, and autonomous driving, supported by enhanced computing capabilities. Those applications have validated the effectiveness of various learning models, such as deep neural networks, convolutional neural networks (CNNs), and recurrent neural networks. Deep learning-based algorithms have been able to precisely estimate the six degrees-of-freedom (6-DoF) pose of rigid objects [5] from a single RGB image. They may provide a solution to the critical problem in virtual footwear try-on.

Therefore, this paper presents a deep learning-based approach that estimates the 6-DoF pose of a human foot in a single RGB image. The objective is to support virtual footwear try-on using simple deployment devices. First, the virtual try-on function in AR is formulated as a 3D localization problem of a shoe model with respect to a user’s bare foot. A dataset comprising both synthetic and real foot images is constructed for training deep learning models to solve the problem. Operational procedures are developed to automatically annotate the data with minimal human involvement. Next, we compare the performance of three trained models, namely deep object pose estimation (DOPE), modified DOPE, and You Only Look Once (YOLO)-6D, in terms of accuracy, computational efficiency, and training time. Different approaches for constructing these models from the dataset are evaluated to determine an optimal training scheme. The results reveal insights into whether a reality gap exists in estimating the foot pose. Lastly, a prototyping system implementing the optimal scheme demonstrates the feasibility of virtual footwear try-on from color images captured by simple devices. This study contributes to the technical advancement of AR applications in human-centric product design and evaluation.

2 Related Works

Virtual footwear try-on in AR aims to recognize a user’s foot in a series of real-world images and then overlay a product model onto the foot accordingly. Technologically, this can be realized using different methods. For example, the images can be captured using an RGB or RGB-D camera, which involves different foot-tracking algorithms. Simulation of the try-on process can be created either by combining a shoe image with the recognized foot via image processing or by projecting a 3D virtual shoe model according to the foot pose. With regard to timeliness, virtual try-on can be performed in real-time or involve a longer processing period. Moreover, the environmental conditions during the try-on process are either cluttered or strictly controlled.

Mottura et al. [6] implemented the idea of a “magic mirror” that supports virtual footwear evaluation and customization in a real store. A user must wear a pair of socks with predefined patterns and stand on a special carpet, which simplifies tracking the position and orientation of the user’s foot in a controlled environment. Eisert et al. [7] segmented the foot contour from an image obtained by an RGB camera using image processing and camera calibration techniques. Their approach was developed under several use restrictions, such as a fixed foot-to-camera distance and a green floor background without the presence of other objects. Greci et al. [8] simulated the internal volume of a shoe using specialized mechatronic sensors based on haptic technology. These sensors deliver instant tactile feedback that helps find the best size fit for individual users.

Yang et al. [9] developed a prototyping system for virtual footwear try-on in AR using a Kinect sensor. A two-stage algorithm was proposed to track the human foot from the depth and color images captured by the sensor. The algorithm relies on external markers to initialize tracking and then performs markerless tracking in subsequent frames using the trimmed iterative closest point method. A free-form deformation mechanism was introduced to allow users to adjust the shoe appearance based on personal preferences. Chu et al. [10] proposed a new concept of AR as a service in a cloud-based system. Users upload a video of foot motion captured by a depth camera to the system and subsequently receive the try-on video of a chosen shoe design on a handheld device. Alternatively, other studies created visualizations of the try-on result using image processing techniques. Chou et al. [11] proposed a pose-invariant virtual try-on method that utilizes an end-to-end generative adversarial network comprising 2D feature extractors. The network, trained with a commercial shoe database, could adjust a shoe image rendered from an original view to visually fit another view. The try-on result was thus synthesized from given shoe and foot images in different views. However, the proposed method could not simulate continuous try-on motion or augment images containing other objects. An et al. [12] proposed a multi-branch CNN that simultaneously performs key point detection, 2D pose estimation, and segmentation of the human foot and leg from a color image. A new stabilization method was developed to smoothen the movement of overlaid shoe models and eliminate jitter. They also constructed a large-scale dataset containing annotation information for all labels related to virtual shoe try-on. However, the try-on environment does not allow the presence of other objects, and the training dataset only contains real images captured by a monocular camera.

Existing commercial tools such as Wanna Kicks and Vyking simulate the try-on process from the egocentric perspective. In particular, the camera normally shoots the human foot from a top view. This setting is different from real try-on in the physical world, thus inducing an unnatural user experience. To the authors’ knowledge, Ref. [12] is the only study that describes the implementation of virtual shoe try-on using deep learning methods. Unfortunately, it imposed the same restriction on the viewing perspective to simplify human foot tracking and the preparation of training data. Deep learning-based methods for AR virtual footwear try-on from a third-person view are still lacking. To fill this research gap, the main motivation of this study is to demonstrate the feasibility of such methods. The remainder of this paper is organized as follows. Section 3 describes virtual footwear try-on as a mathematical problem and the assumptions made for simplification. Section 4 presents the generation process for the synthetic and real training datasets. Section 5 discusses the test results of the three deep learning models constructed using different training schemes and data sources. The failure analysis of the models is also discussed. The last section presents concluding remarks and suggestions for future research.

3 Problem Description

This study realizes the idea of virtual try-on using 3D model rendering rather than 2D image processing. Users can see themselves wearing virtual shoe models in a video stream of the real environment. The models should remain precisely on the user’s moving feet during the try-on process, thus visually mimicking real try-on. The corresponding problem is described as follows. In each image containing a human foot, a shoe model must be accurately superimposed on the foot through a 3D transformation determined by the foot pose and a subsequent projection onto the image. Because the foot shape differs from the shoe model, a reference foot model is introduced to avoid aligning those two shapes directly, as proposed by Ref. [10]. As shown in Fig. 1, an affine transformation matrix M, which is predefined prior to the try-on, describes the relative position between the reference model R and the shoe model S. It controls the allowance between R and S, which accommodates free movement of the foot within the shoe. This allowance also tolerates errors in the pose estimation to a certain degree. As shown in Fig. 2, P* and R* represent the foot pose in the image and the initial pose of the reference model, respectively. A transformation matrix Ψ exists to properly align R* and P*, which are originally described in different coordinate systems. R* is determined by the modeling process. In this study, P* is estimated using deep learning models from a color image.

Fig. 1  Predefined relationship between the reference foot and shoe models [10]
Fig. 2  Problem description of virtual shoe try-on
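
To make the transformation chain concrete, the following minimal sketch (Python with NumPy; the matrices and numeric values are illustrative placeholders, not taken from the original implementation) composes an estimated foot pose P* with the predefined offset M to obtain the pose at which the shoe model S is rendered.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3 x 3 rotation and a translation vector into a 4 x 4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Estimated foot pose P* in camera coordinates (placeholder values): the foot
# is upright and about 0.8 m in front of the camera.
P_star = to_homogeneous(np.eye(3), np.array([0.0, 0.0, 0.8]))

# Predefined offset M between the reference foot model R and the shoe model S,
# set once before the try-on session (small allowance, placeholder values).
M = to_homogeneous(np.eye(3), np.array([0.0, -0.01, 0.02]))

# Pose at which the shoe model is rendered: apply M in the reference model's
# frame and then the estimated foot pose.
shoe_pose = P_star @ M
print(shoe_pose)
```

During try-on only P* changes from frame to frame, while M stays fixed, so each frame requires a single additional matrix product.
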
Deep learning models generally do not directly compute a foot pose in space from a given image. Instead, they estimate the 3D bounding box of the foot by specifying its eight vertices projected onto the image. Thus, it is necessary to restore the actual pose from the projected coordinates using the perspective-n-point algorithm [13]. Given the camera’s intrinsic parameters, the pose and projected vertices are correlated as
$$s\,\mathbf{p}_c = K\,[R\,|\,T]\,\mathbf{p}_w$$
(1)
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & \gamma & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$
(2)
where $\mathbf{p}_w = [x\;y\;z\;1]^T$ represents the world coordinates in 3D space, and $\mathbf{p}_c = [u\;v\;1]^T$ corresponds to the projected coordinates in the 2D image. $K$ refers to the intrinsic parameter matrix of the camera, $f_x$ and $f_y$ are the focal lengths, and $\gamma$ is the skew parameter. $(u_0, v_0)$ refers to the theoretical image center, and $s$ is the scaling factor. $R$ and $T$ are the camera rotation and translation matrices to be calculated, respectively. Equation (1) can be rewritten as a direct linear transform problem [14]:
$$\mathbf{w}_i = P\,\mathbf{W}_i, \quad i = 1, 2, \ldots, n$$
(3)
where $\mathbf{W}_i$ is a point in 3D space, $\mathbf{w}_i$ is its projection onto the 2D image, and $P$ is the projection matrix determined by $R$, $T$, and the camera parameters. Expanding the above equation leads to
$$x_i = \frac{P_{11}X_i + P_{12}Y_i + P_{13}Z_i + P_{14}}{P_{31}X_i + P_{32}Y_i + P_{33}Z_i + P_{34}}, \qquad y_i = \frac{P_{21}X_i + P_{22}Y_i + P_{23}Z_i + P_{24}}{P_{31}X_i + P_{32}Y_i + P_{33}Z_i + P_{34}}$$
(4)
and
$$\begin{aligned} x_i\,(P_{31}X_i + P_{32}Y_i + P_{33}Z_i + P_{34}) &= P_{11}X_i + P_{12}Y_i + P_{13}Z_i + P_{14} \\ y_i\,(P_{31}X_i + P_{32}Y_i + P_{33}Z_i + P_{34}) &= P_{21}X_i + P_{22}Y_i + P_{23}Z_i + P_{24} \end{aligned}$$
(5)
The above equation can be written in matrix form:
$$AP = 0$$
(6)
which can be solved by minimizing $\|AP\|$. The projection matrix corresponds to the eigenvector of $A^{T}A$ associated with the minimum eigenvalue. $R$ and $T$ can then be estimated from $P$ using the QR decomposition method [14].
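
In practice, the recovery of R and T from the eight projected bounding-box vertices can be delegated to an off-the-shelf perspective-n-point solver instead of hand-coding the direct linear transform. The sketch below (Python with OpenCV; the vertex coordinates, intrinsic matrix, and box dimensions are placeholders) illustrates this step under the assumptions stated above.

```python
import numpy as np
import cv2

# Eight vertices of the foot's 3D bounding box in the object frame (meters);
# the box dimensions are placeholders.
L, W, H = 0.24, 0.10, 0.08
object_points = np.array([[x, y, z] for x in (0, L) for y in (0, W) for z in (0, H)],
                         dtype=np.float32)

# Corresponding 2D projections predicted by the network (pixel placeholders).
image_points = np.array([[320, 240], [400, 238], [318, 300], [402, 298],
                         [330, 230], [392, 228], [328, 290], [394, 288]],
                        dtype=np.float32)

# Intrinsic matrix K from camera calibration (placeholder focal lengths and center).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)  # lens distortion assumed negligible

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)   # rotation matrix and translation vector of the foot pose
print(ok, R, tvec.ravel())
```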

Deep learning models for object recognition and detection often involve large-scale training data and high-performance computations. This study aims to evaluate the feasibility of 3D pose estimation for human feet using deep learning models in terms of the accuracy and computational speed required for virtual try-on. Our focus is not to develop a fully trained model or to recognize the foot pose under all conditions. Therefore, to reduce the problem complexity, foot recognition from a color image is conducted in a context subject to the following limitations:

  • Only bare human feet are detected.

  • The distance between the foot to be detected and the camera falls within a specific range.

  • The foot is not occluded or truncated.

  • The foot must not be crossed or covered.

  • The recognition should be conducted in an environment with a stable light source.

Occlusion inevitably occurs between the shoe model and the real foot while simulating the try-on process. Correct handling of the occlusion yields high-quality visualization that resembles a human’s natural perception and enhances the user experience. The reference model in Fig. 2 can be used to approximate the depth information of the actual human foot during occlusion processing by transforming the model to the location corresponding to the estimated foot pose. The portion of the shoe model blocked by the transformed model in space is hidden using the Z-buffer technique in computer graphics. The reference model itself, however, does not appear in the rendering result during the try-on process.
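
The occlusion handling described above is the standard “phantom object” technique in AR rendering: the reference foot model is drawn into the depth buffer only, so it hides the occluded parts of the shoe without being visible itself. A minimal sketch of the draw order is given below (Python with PyOpenGL; draw_background and draw_mesh are hypothetical helpers, and a valid OpenGL context is assumed).

```python
from OpenGL.GL import (
    glEnable, glDisable, glColorMask, glClear, GL_DEPTH_TEST,
    GL_TRUE, GL_FALSE, GL_COLOR_BUFFER_BIT, GL_DEPTH_BUFFER_BIT,
)

def render_frame(camera_image, reference_foot, shoe_model, foot_pose, offset_M):
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)

    # 1. Draw the camera image as background without touching the depth buffer.
    glDisable(GL_DEPTH_TEST)
    draw_background(camera_image)          # hypothetical helper

    # 2. Draw the reference foot model into the depth buffer only:
    #    color writes are disabled, so the phantom stays invisible.
    glEnable(GL_DEPTH_TEST)
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE)
    draw_mesh(reference_foot, foot_pose)   # hypothetical helper

    # 3. Re-enable color writes and draw the shoe model with normal depth
    #    testing; fragments behind the phantom foot are discarded.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE)
    draw_mesh(shoe_model, foot_pose @ offset_M)
```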

4 Training Data Generation

Obtaining high-fidelity training data with correct labels and sufficient diversity is a critical issue in most applications based on deep learning models. Manual data annotation is not only time and resource consuming, but the subjective nature of human judgment may also lead to inconsistent data quality and poor results after model training. Open training datasets have been constructed for pose estimation of common objects, such as LINEMOD [15] and OCCLUSION [16]. However, for shoe try-on applications, no open data source provides foot annotation data for 6D pose estimation. To overcome this deficiency, we propose a systematic approach to generating pre-labeled synthetic data and semi-automatically annotated real data that together form a proprietary foot training dataset. The following subsections describe the data generation methods and their corresponding annotation procedures in detail.

4.1 Synthetic Training Data.

Synthetic data have become increasingly popular for meeting the needs of training deep neural networks [17], which normally require rapid mass production of correctly annotated samples. Tobin et al. [18] proposed a domain randomization method that generates large amounts of training data in an unrealistic and highly randomized manner, while treating real data as one of the variables when testing a neural network model to avoid underfitting or overfitting. The present study adopts a similar strategy for the mass generation of synthetic foot images. In a virtual try-on application, a deep learning model must precisely estimate the 6D pose of a user's moving foot from a given color image. To train such a model, each training image must be annotated with the 3D bounding box of the foot, either manually or automatically. Compared with 2D bounding box annotations, manually specifying a 3D bounding box is challenging and prone to error. The annotation process therefore often requires the assistance of software tools to ensure data quality and efficiency. A common approach is to generate pre-labeled synthetic images from a constructed virtual environment using a 3D rendering engine. To ensure sufficient information diversity, the generated data require a proper preparation plan. Figure 3 shows the data generation procedure proposed in this study.

Fig. 3  Synthetic foot data generation pipeline

The first step is to create 3D geometric models of the human foot corresponding to the demographic group of the target users. The creation process starts with the construction of a full-scale human model based on the anthropometric parameters of height, weight, and gender using a 3D human modeling tool (MakeHuman). The parameter values match the body features of the user group in this study: Taiwanese females aged 20–30. The foot geometry was adjusted according to the average and standard deviation of the foot length and width suggested by Ref. [19] to correctly cover the group. The foot portion was then segmented from each constructed model and exported to the next step.

Next, a virtual scene is constructed using a 3D game engine (Unreal) integrated with the NVIDIA deep learning dataset synthesizer [20]. The foot model is placed in the scene, which is generated by systematically varying several factors, including the foot geometry, skin texture, ground texture, background, camera position, and ambient light source. In total, 50,000 synthetic images of the scene were generated as training data by varying those factors according to the following plan (a parameter-sampling sketch is given after the list). Figure 4 shows examples of the synthetic training data thus generated.

  • Foot geometry: five geometries were constructed: the average foot dimensions (i.e., length and width) and the average dimensions plus and minus half a standard deviation and one standard deviation.

  • Skin texture: eight types of skin texture with different shades were used to create a variety of foot appearances.

  • Ground texture: ten types of ground texture with different shaders were used to create a variety of floor appearances.

  • Background: a total of 5000 images containing various objects were randomly selected from the open image dataset [21]. They sequentially served as the background of the virtual scene to establish the prediction capability of the trained models in cluttered environments.

  • Camera position: the virtual camera used for rendering references the foot model as the center point while capturing images from different view angles evenly distributed in the scene. The azimuth angle of the camera movement was chosen as 0–180 deg, the elevation angle was 5–85 deg, and the distance from the foot was 50–150 cm.

  • Ambient light source: the 3D game engine adjusts the lighting condition by setting the light source type, position, and illumination to create different rendering effects. For the training data, we employed three types of light sources: directional, point, and spot, with illumination ranging from 10 to 3000 Lux. For each rendering, the light source was randomly located 150–350 cm from the foot along the x-, y-, and z-axes to produce varying lighting conditions.
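
The randomization plan above can be condensed into a simple parameter sampler. The sketch below (Python; the asset identifiers are placeholders, while the counts and value ranges follow the plan) draws one scene configuration per synthetic image.

```python
import random

FOOT_GEOMETRIES = ["mean", "mean-0.5sd", "mean+0.5sd", "mean-1sd", "mean+1sd"]
SKIN_TEXTURES = [f"skin_{i:02d}" for i in range(8)]       # eight skin shades
GROUND_TEXTURES = [f"ground_{i:02d}" for i in range(10)]  # ten floor shaders
BACKGROUNDS = [f"bg_{i:04d}.jpg" for i in range(5000)]    # images from the open dataset [21]
LIGHT_TYPES = ["directional", "point", "spot"]

def sample_scene():
    """Draw one randomized scene configuration for synthetic data generation."""
    return {
        "foot_geometry": random.choice(FOOT_GEOMETRIES),
        "skin_texture": random.choice(SKIN_TEXTURES),
        "ground_texture": random.choice(GROUND_TEXTURES),
        "background": random.choice(BACKGROUNDS),
        # Camera placement relative to the foot model (deg, deg, cm).
        "cam_azimuth_deg": random.uniform(0, 180),
        "cam_elevation_deg": random.uniform(5, 85),
        "cam_distance_cm": random.uniform(50, 150),
        # Ambient light source settings.
        "light_type": random.choice(LIGHT_TYPES),
        "light_lux": random.uniform(10, 3000),
        "light_offset_cm": [random.uniform(150, 350) for _ in "xyz"],
    }

configs = [sample_scene() for _ in range(50000)]  # 50,000 synthetic images
```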

Fig. 4  Examples of synthetic training data

Variations in the foot posture are generated by changing the viewing angle and position of the camera in the game engine. The eight corners and the center point of the 3D bounding box of the foot can be determined from the rendering process of a virtual scene. They are automatically labeled in the corresponding image, as shown in Fig. 5. Projecting the bounding box onto the image plane establishes a one-to-one mapping relationship from 3D to 2D space. Each training image contains the following information: the camera’s rotation and translation matrices, the coordinates of the bounding box corners, the center point in 3D space, and their projected 2D coordinates in the image.

Fig. 5  Data labeling schematics: mapping the relationship from 2D to 3D space
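
The automatic labeling step amounts to projecting the nine keypoints (eight corners plus the center point) with the camera matrices used for rendering, following Eq. (1). A minimal sketch (Python with NumPy; the intrinsics, extrinsics, and box dimensions are placeholder values that the renderer exports in the actual pipeline):

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project N x 3 world points to N x 2 pixel coordinates using s*p_c = K[R|T]p_w."""
    cam = R @ points_3d.T + t.reshape(3, 1)   # world -> camera coordinates
    uvw = K @ cam                             # camera -> homogeneous image coordinates
    return (uvw[:2] / uvw[2]).T               # perspective division

# Placeholder intrinsics/extrinsics and a 24 x 10 x 8 cm box (corners plus centroid).
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
corners = np.array([[x, y, z] for x in (0, 0.24) for y in (0, 0.10) for z in (0, 0.08)])
keypoints = np.vstack([corners, corners.mean(axis=0)])

labels_2d = project_points(keypoints, K, R, t)   # 2D coordinates written to the annotation
```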

4.2 Real Training Data Generation.

Although synthetic pre-labeled training data can be generated quickly in large quantities at minimal cost, discrepancies may still exist between synthetic and real data. Validation using real ground truth data should provide the trained model with better prediction accuracy in real applications. However, labeling real data, particularly in 3D space, can be problematic and time-consuming. To overcome these difficulties, we propose a semi-automatic method for creating real foot data that avoids manual labeling and minimizes the error rate during the creation process.

The proposed method involves only off-the-shelf hardware and software components. The hardware consists of a three-axis controller, a linear slider, an RGB camera, and a calibration checkerboard. The camera was mounted on a sliding rail integrated with a three-axis controller (see Fig. 6). The camera’s moving path can be programmed using a smartphone for accurate rotation and translation control while capturing images. This function allows image shooting to be triggered remotely from the smartphone. Camera calibration is a common procedure in computer vision that precisely defines the transformation relationships between different coordinate systems. This study establishes a world coordinate system based on the checkerboard through camera calibration; the foot pose is then defined by the eight corner points of the corresponding bounding box in the real scene. The process of setting up the control environment and capturing the real data is described as follows:

  1. The camera, three-axis controller, linear slider, and checkerboard are set up in the environment. The camera is calibrated to retrieve the intrinsic and extrinsic camera parameters. A human subject stands in the scene with only one foot visible in the image. Foreign objects, such as bottles, books, containers, jars, toys, and plastic bags, are randomly chosen and placed in the scene.

  2. The image resolution is set to 1280 × 720 pixels so that the captured frames always include the checkerboard and foot in the scene.

  3. The human foot dimensions are measured to obtain the length (L), width (W), and height (H). The foot must be sufficiently far from the checkerboard so that it is not removed from the image by the subsequent cropping that eliminates the checkerboard. Therefore, the foot is placed 40 cm away along the x-axis relative to the origin of the checkerboard. Taking this as the reference point of the foot, we determine the remaining seven vertices based on the measured foot dimensions and ankle height, as shown in Fig. 7.

  4. The camera is moved relative to the checkerboard to predefined angles and positions using the three-axis controller and linear slider. The image captured at each setting is recorded.

  5. The corner positions of the 3D bounding box are estimated in each image relative to the checkerboard, and the 2D coordinates of the corresponding projected points, together with the camera’s position and orientation, are recorded to form the raw data.

  6. The image is cropped to 640 × 480 pixels to remove the checkerboard.

  7. Based on the cropped image size, the 2D point coordinates are recalculated, the data are overwritten with the updated coordinates, and the annotation is completed (a sketch of steps (3)–(7) is given after this list).

  8. The checkerboard position, foot size, camera settings, and objects in the scene are recorded according to the data generation plan. Steps (3)–(7) are repeated to collect sufficient data with diversity. Figure 8 shows examples of the real data obtained.
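
The geometric core of steps (3)–(7) is to express the bounding-box corners in the checkerboard frame, project them with the calibrated camera, and shift the projections into the cropped image. A hedged sketch with OpenCV is shown below; the board size, square size, crop origin, and helper names are assumptions rather than the exact settings used in this study.

```python
import numpy as np
import cv2

def checkerboard_pose(image, K, dist, board_size=(9, 6), square=0.025):
    """Camera pose relative to the checkerboard (board size and square size are placeholders)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, board_size)
    assert found, "checkerboard not visible in the frame"
    obj = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    obj[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    _, rvec, tvec = cv2.solvePnP(obj, corners, K, dist)
    return rvec, tvec

def foot_box_corners(L, W, H, x_offset=0.40):
    """Eight bounding-box corners in the checkerboard frame; reference point 40 cm along x."""
    return np.array([[x_offset + dx, dy, dz]
                     for dx in (0, L) for dy in (0, W) for dz in (0, H)], np.float32)

def annotate_frame(image, L, W, H, K, dist, crop_origin=(320, 120)):
    """Project the box corners into the image and shift them into the cropped frame."""
    rvec, tvec = checkerboard_pose(image, K, dist)
    pts, _ = cv2.projectPoints(foot_box_corners(L, W, H), rvec, tvec, K, dist)
    # Step (7): recompute the 2D labels relative to the 640 x 480 crop (origin is a placeholder).
    return pts.reshape(-1, 2) - np.array(crop_origin)
```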

Fig. 6  (a) Camera mounted on a linear slider integrated with a three-axis controller and (b) environmental settings for capturing real foot images
Fig. 7  Labeling the 3D bounding box in relation to the position of the checkerboard
Fig. 8  Examples of real data after image cropping

4.3 Deep Learning Models.

CNN models have been widely used in various deep learning applications related to image recognition. Peng et al. [22] proposed a pixel-wise voting network that predicts a unit vector from each pixel to a key point in the image. This approach applies a random sample consensus algorithm to remove predicted outliers and obtain the probability distribution of each key point in 3D space. Considering the uncertainty of the key point location improves the robustness of 6D pose estimation when a target object is heavily occluded or truncated. Tekin et al. [23] designed a single-shot network architecture based on YOLOv2 to improve the computation speed of the original model. Test results on an NVIDIA Titan X Pascal GPU showed that the performance satisfies the requirement of real-time 6D pose estimation.

The DOPE architecture proposed by Tremblay et al. [24] has been successfully applied to intelligent operations, such as picking up and placing objects using robots. DOPE is based on the convolutional pose machine (CPM) architecture [25], which uses a one-shot fully convolutional deep neural network in combination with a multistage architecture to detect key points (see Fig. 9). Its forward propagation takes a 640 × 480 × 3 color image as input. The first ten layers of the pre-trained VGG-19 convolutional neural network serve as a feature extractor for the input image. DOPE feeds the extracted features into each stage of the CPM, where each stage considers not only the image features but also the output of the previous stage.

Fig. 9  DOPE network architecture [24]
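
As a rough structural sketch of this backbone-plus-stages pattern (PyTorch; the VGG-19 cut index, channel counts, kernel sizes, and the single returned output are simplifying assumptions, and the per-stage supervision used by DOPE is omitted), each stage consumes the image features concatenated with the previous stage's output maps.

```python
import torch
import torch.nn as nn
import torchvision

class Stage(nn.Module):
    """One CPM-style stage mapping features to 25 output maps: 9 belief maps
    (eight box corners plus centroid) and 16 vector-field channels."""
    def __init__(self, in_ch, mid_ch=128, out_ch=25, kernel=7):
        super().__init__()
        pad = kernel // 2
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel, padding=pad), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel, padding=pad), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1),
        )

    def forward(self, x):
        return self.net(x)

class DopeLikeNet(nn.Module):
    def __init__(self, num_stages=6):
        super().__init__()
        vgg = torchvision.models.vgg19(weights=None).features
        # Early VGG-19 layers as the feature extractor (cut index is an assumption);
        # for a 480 x 640 input this yields 512 x 60 x 80 features.
        self.backbone = nn.Sequential(*list(vgg.children())[:23])
        self.stages = nn.ModuleList(
            [Stage(512)] + [Stage(512 + 25) for _ in range(num_stages - 1)]
        )

    def forward(self, img):                # img: B x 3 x 480 x 640
        feats = self.backbone(img)
        maps = self.stages[0](feats)
        for stage in self.stages[1:]:
            # Each later stage refines the previous stage's maps together with
            # the shared image features.
            maps = stage(torch.cat([feats, maps], dim=1))
        return maps                        # B x 25 x 60 x 80
```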

To reduce the number of convolutional layers in the original network, we modified the DOPE network by combining the six CPM stages for the belief maps into a single stage. The modified model is referred to as “DOPE2” in this study, and the corresponding network structure is shown in Fig. 10. The output of each stage is 80 × 60 × 25 after the modification. The original 7 × 7 kernels in the convolutional layers were replaced with multiple 3 × 3 kernels to reduce the number of network parameters. A large convolutional kernel produces a larger receptive field, but it also requires more parameters to be determined in the model. Szegedy et al. [26] proposed that a single large convolutional layer can be replaced by several successive small convolutional layers, which maintains the receptive field range while reducing the number of parameters. The additional network layers also introduce more nonlinear functions, which may improve the probability of correct prediction in complex tasks.

Fig. 10  Modified DOPE network architecture
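
The kernel substitution in DOPE2 can be illustrated with a short PyTorch comparison (channel counts are illustrative): a stack of three 3 × 3 convolutions covers the same 7 × 7 receptive field with roughly half the weights while adding nonlinearities between the layers.

```python
import torch.nn as nn

channels = 128  # illustrative channel count

# One large-kernel layer: 7 x 7 receptive field.
large_kernel = nn.Conv2d(channels, channels, kernel_size=7, padding=3)

# Three stacked 3 x 3 layers: the same 7 x 7 receptive field with extra
# nonlinearities and roughly 3*(3*3)/(7*7) of the weights.
stacked_small = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
)

n_large = sum(p.numel() for p in large_kernel.parameters())
n_small = sum(p.numel() for p in stacked_small.parameters())
print(n_large, n_small)  # about 803k vs 443k parameters at 128 channels
```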

5 Test Results

5.1 Implementation Details.

Our implementation is based on the Ubuntu 16.04 operating system and the PyTorch deep learning framework. We employed the adaptive moment estimation [27] optimizer with an initial learning rate of 0.0001 for DOPE/DOPE2 and a stochastic gradient descent optimizer with an initial learning rate of 0.001 for YOLO-6D. All experiments were conducted on an Intel® Core i7-9700 CPU with two NVIDIA GeForce RTX 2080 SUPER graphics cards. To reduce overfitting in model training, data augmentation was implemented by randomly changing the brightness (μ = 1, σ = 0.1), contrast (μ = 1, σ = 0.3), and noise (μ = 0, σ = 0.1) when preparing the synthetic images for training. Other techniques, such as random rotation and translation, were also applied to modify these images. The batch size was set to 16, and each model was trained for 100 epochs. A Logitech C922 webcam captured the user’s bare-foot motion as the input video. The visualization tool RViz running in the robot operating system created the try-on images frame-by-frame and played back the final result as a video stream.
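
The augmentation settings above can be reproduced with a small sampling routine. A minimal sketch (Python with NumPy; the clipping range and the contrast formulation around the image mean are assumptions) draws brightness and contrast factors from N(1, 0.1) and N(1, 0.3) and adds zero-mean Gaussian noise with σ = 0.1.

```python
import numpy as np

rng = np.random.default_rng()

def augment(image):
    """image: H x W x 3 float array scaled to [0, 1]."""
    brightness = rng.normal(1.0, 0.1)           # mu = 1, sigma = 0.1
    contrast = rng.normal(1.0, 0.3)             # mu = 1, sigma = 0.3
    noise = rng.normal(0.0, 0.1, image.shape)   # mu = 0, sigma = 0.1

    out = image * brightness                           # intensity scaling
    out = (out - out.mean()) * contrast + out.mean()   # contrast stretch about the mean
    out = out + noise                                  # additive pixel noise
    return np.clip(out, 0.0, 1.0)                      # assumed valid range
```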

5.2 Evaluation Experiments.

The current standard datasets for object detection and pose estimation do not contain human foot data. Therefore, evaluation experiments were conducted on the proprietary foot dataset consisting of both synthetic and real data (see Sec. 4), all in cluttered environments. The training versus validation data ratio was 4:1. Three network models (i.e., DOPE, DOPE2, and YOLO-6D) trained with the same dataset were compared using various measures.

The prediction accuracy of the trained models was evaluated using two quantitative metrics. First, the 2D reprojection error indicates the discrepancy between the ground truth and the estimated pose in the 2D image plane. In this case, a pose estimate is considered acceptable when the average deviation between the projected vertices of its bounding box and the ground truth is less than five pixels [15]. This measure has been adopted by other AR applications for validating object tracking [23]. In contrast, the average 3D Euclidean distance of vertices (ADD) is commonly used in tasks related to pose estimation. It calculates the average deviation between the vertices of the bounding box under the ground-truth pose and under the predicted pose in 3D space:
$$\mathrm{ADD} = \frac{1}{m} \sum_{\mathbf{x} \in M} \left\| (R\mathbf{x} + T) - (\tilde{R}\mathbf{x} + \tilde{T}) \right\|$$
(7)
where $\tilde{R}$ and $\tilde{T}$ indicate the rotation and translation matrices of the estimated pose, respectively, and $R$ and $T$ are those of the ground truth. $M$ denotes the set of 3D vertices, and $m$ is the number of points in $M$. The pose estimation by the ADD measure is considered acceptable when the value is smaller than 10% of the diameter of the circumscribed sphere of the object [26]. Because virtual try-on is normally conducted in a video stream, a measure accumulated over a period of time is more appropriate to characterize the accuracy of trained models in practice; this relies on the percentage of correctly estimated poses in a series of continuous images (a small sketch of both accuracy metrics is given after the list of schemes below). Moreover, it is advantageous to compare the training time of the three models and their computational performance in a real deployment environment. Finally, the performance of a deep learning model highly depends on the training scheme employed to construct its prediction capability. Thus, we examined three different schemes to understand how they influence pose estimation:
  • Scheme 1: Pretraining with the ImageNet dataset; successive training using the proprietary dataset (20,000 synthetic and 5000 real data items).

  • Scheme 2: Training using the same proprietary dataset.

  • Scheme 3: Pretraining with 30,000 synthetic data items; successive training using the same proprietary dataset.
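
As referenced above, the two accuracy metrics can be sketched as follows (Python with NumPy; the projection helper repeats Eq. (1), and the acceptance thresholds follow the text).

```python
import numpy as np

def add_metric(vertices, R_gt, t_gt, R_est, t_est):
    """Average 3D distance between vertex positions under the two poses (Eq. (7))."""
    gt = vertices @ R_gt.T + t_gt
    est = vertices @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def reprojection_error(vertices, K, R_gt, t_gt, R_est, t_est):
    """Average 2D pixel distance between the projections of the two poses."""
    def project(R, t):
        cam = vertices @ R.T + t
        uvw = cam @ K.T
        return uvw[:, :2] / uvw[:, 2:3]
    return np.linalg.norm(project(R_gt, t_gt) - project(R_est, t_est), axis=1).mean()

def is_correct(vertices, K, R_gt, t_gt, R_est, t_est, diameter):
    """Accept the pose if ADD < 10% of the object diameter and/or reprojection < 5 px."""
    ok_add = add_metric(vertices, R_gt, t_gt, R_est, t_est) < 0.1 * diameter
    ok_2d = reprojection_error(vertices, K, R_gt, t_gt, R_est, t_est) < 5.0
    return ok_add, ok_2d
```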

Table 1 presents the test results of the three models trained using our proprietary foot dataset. In the table, the accuracy of the pose estimation is reported as the percentage of estimates that satisfy the common thresholds for the ADD and 2D reprojection errors. The following observations can be made from the results.

  • For all three training schemes, YOLO-6D outperformed the two DOPE models for both measures.

  • For all three models, scheme 3 yielded the best accuracy in both measures.

  • DOPE has a higher accuracy than that of DOPE2.

  • No significant difference exists in the accuracy of YOLO-6D trained using schemes 1 and 3.

Table 1  Test results of three models over different training schemes

              Scheme 1    Scheme 2    Scheme 3
Probability of 10% average three-dimensional Euclidean distance of vertices (ADD) threshold (%)
DOPE            59.38       62.50       68.75
DOPE2           50.00       53.13       56.30
YOLO-6D         72.15       69.38       71.84
Probability of five-pixel two-dimensional reprojection threshold (%)
DOPE            81.23       82.75       84.25
DOPE2           68.40       69.70       69.66
YOLO-6D         90.81       84.92       92.76

As shown in Table 1, the best scheme (scheme 3) pretrains a model with synthetic foot data and then successively trains it with the hybrid dataset containing synthetic and real foot images. For DOPE and DOPE2, pretraining with the ImageNet dataset performed worse than pretraining with our synthetic foot data. For all three models, pretraining followed by successive training (scheme 3) produced better foot pose estimation accuracy than training with the hybrid dataset in a single stage (scheme 2).

Table 2 shows the training time and computational efficiency of the three trained models in real use. YOLO-6D required a shorter training time than did the two DOPE models. It also had the fastest running speed of 53 fps, which was approximately double that of DOPE2 and ten times faster than DOPE. This outcome is similar to the test reported by Ref. [22], which recommended YOLO-6D for real-time applications related to pose estimation in AR. Thus, this study implements a virtual shoe try-on based on the YOLO-6D model trained by scheme 3 because of its superior performance compared with other models and training schemes.

Table 2  Test results of foot pose estimation

            Training time (h)    Running speed (fps)
DOPE              181.31                 5.3
DOPE2             109.42                28.7
YOLO-6D            48.31                53

5.3 Implementation of Virtual Shoe Try-On.

The trained YOLO-6D model was implemented as a kernel in a prototyping system for real-time virtual shoe try-on. The following try-on tests were conducted in an indoor environment with stable ambient light. A female user freely moved one foot with moderate acceleration within 0.4–1.2 m from the camera in both cluttered and uncluttered environments. The foot was visible from the ankle joint and below during the try-on process. Figure 11 shows the bounding box of the bare foot at different postures predicted by the YOLO-6D model. The results visually matched well with the foot in both environments. Practical virtual try-on required pose estimation for both feet, which involved training data specific to the right and left feet, respectively. The bounding boxes of both feet and the corresponding try-on images in AR are shown in Fig. 12. The test results were natural and realistic, thus validating the effectiveness of the research idea proposed in this study.

Fig. 11  Pose estimation results (top: cluttered environment; bottom: uncluttered environment)
Fig. 12  Real-time pose estimation and try-on results

The pose estimation tended to produce excessive errors under certain circumstances, which caused the virtual try-on function to fail. Understanding when and how failures occur provides guidelines for limiting user movements in practice. It was observed that the trained models became erroneous when the user’s ankle joint was drastically rotated during the try-on process. Such movements can be produced by the four degrees of freedom (i.e., plantar flexion, dorsiflexion, pronation, and supination) of the human foot [27]. Another condition causing prediction failure occurs when the toes are bent or spread apart (see Fig. 13). A possible explanation for these failures is that the foot is treated as a rigid body in both data generation and model training. All 3D foot models imported into the virtual scene had the foot resting on the floor and posed as a rigid body. The trained models acquire some generalization ability to predict the pose of the foot in the image under minor swinging and rotation. However, excessive foot twisting or bending can cause failure and should be avoided in real use.

Fig. 13  Failures caused by various foot movements

6 Conclusion

Most studies on virtual AR footwear try-on require a depth sensor or other special hardware to capture the user’s foot movements in a real environment. However, smartphones and personal computers may not support these devices, thus limiting the deployment of virtual try-on technology in practice. Virtual shoe try-on using a monocular RGB camera is highly desirable in this regard because it enables quick online and offline product evaluation. Moreover, existing commercial tools simulate the try-on process from the egocentric perspective with restricted viewing angles, and the corresponding user experience is unnatural and unsatisfying. To overcome these problems, this study applied deep learning models to estimate the 3D pose of a human foot from a color image captured from a third-person view. Synthetic annotated training data were generated by systematically adjusting the light source, foot geometry, background, camera position, and skin and ground textures to ensure sufficient information variation. A semi-automatic approach was proposed to capture real foot images and facilitate data labeling using off-the-shelf components. Three CNN models (i.e., DOPE, DOPE2, and YOLO-6D) were trained with a proprietary dataset containing both synthetic and real foot images according to different training plans. The pose estimation accuracy of the three trained models was evaluated by the ADD and 2D reprojection errors, with one-fifth of the dataset used for validation. Test results showed that a CNN model for foot pose estimation performed better when pretraining was combined with successive training on the proprietary dataset. Moreover, YOLO-6D outperformed both the DOPE and simplified DOPE2 models in terms of prediction accuracy and computational speed in real use. A prototyping system verified the effectiveness of real-time virtual shoe try-on based on deep learning by showing natural and realistic visualization results.

This study can be improved in various aspects. Large twisting movements cause 3D pose estimation failures during try-on as the current training data were created by treating the foot as a rigid body with the foot resting on the floor. Training data generation should further be developed by combining biomechanical modeling to provide training information in line with real foot movements. Deformable models of shoes can be created to more accurately represent bending dynamics during standing and walking, which supports highly realistic simulation of footwear try-on. From a practical perspective, virtual AR footwear try-on should be integrated with a lightweight neural network architecture to support its deployment in handheld devices.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

References

1. Fincato, M., Cornia, M., Landi, F., Cesari, F., and Cucchiara, R., 2022, “Transform, Warp, and Dress: A New Transformation-Guided Model for Virtual Try-On,” ACM Trans. Multimedia Comput. Commun. Appl., 18(2), pp. 1–24.
2. Hu, P., Nourbakhsh, N., Tian, J., Sturges, S., Dadarlat, V., and Munteanu, A., 2020, “A Generic Method of Wearable Items Virtual Try-On,” Text. Res. J., 90(19–20), pp. 2161–2174.
3. Shin, E., and Baytar, F., 2014, “Apparel Fit and Size Concerns and Intentions to Use Virtual Try-On: Impacts of Body Satisfaction and Images of Models’ Bodies,” Cloth. Text. Res. J., 32(1), pp. 20–33.
4. Plotkina, D., and Saurel, H., 2019, “Me or Just Like Me? The Role of Virtual Try-On and Physical Appearance in Apparel M-Retailing,” Retail. Consum. Serv., 51, pp. 362–377.
5. Chu, C. H., Chen, Y. A., Huang, Y. Y., and Lee, Y. J., 2022, “A Comparative Study of Virtual Footwear Try-On Applications in Virtual and Augmented Reality,” ASME J. Comput. Inf. Sci. Eng., 22(4), p. 041004.
6. Mottura, S., Greci, L., Sacco, M., and Boër, C. R., 2003, “An Augmented Reality System for the Customized Shoe Shop,” Second Interdisciplinary World Congress on Mass Customization and Personalization, Munich, Germany.
7. Eisert, P., Fechteler, P., and Rurainsky, J., 2008, “3-D Tracking of Shoes for Virtual Mirror Applications,” IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, June 23–28, pp. 1–6.
8. Greci, L., Sacco, M., Cau, N., and Buonanno, F., 2012, “FootGlove: A Haptic Device Supporting the Customer in the Choice of the Best Fitting Shoes,” Haptics: Perception, Devices, Mobility, and Communication, pp. 148–159.
9. Yang, Y. I., Yang, C. K., and Chu, C. H., 2014, “A Virtual Try-On System in Augmented Reality Using RGB-D Cameras for Footwear Personalization,” J. Manuf. Syst., 33(4), pp. 690–698.
10. Chu, C. H., Cheng, C. H., Wu, H. S., and Kuo, C. C., 2019, “A Cloud Service Framework for Virtual Try-On of Footwear in Augmented Reality,” ASME J. Comput. Inf. Sci. Eng., 19(2), p. 021002.
11. Chou, C. T., Lee, C. H., Zhang, K., Lee, H. C., and Hsu, W. H., 2018, “PIVTONS: Pose Invariant Virtual Try-On Shoe With Conditional Image Completion,” Asian Conference on Computer Vision, pp. 654–668.
12. An, S., Che, G., Guo, J., Zhu, H., Ye, J., Zhou, F., Zhu, Z., Wei, D., Liu, A., and Zhang, W., 2021, “ARShoe: Real-Time Augmented Reality Shoe Try-On System on Smartphones,” The 29th ACM International Conference on Multimedia, Virtual, Oct. 20–24, pp. 1111–1119.
13. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z., 2016, “Rethinking the Inception Architecture for Computer Vision,” IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, June 27–30, pp. 2818–2826.
14. Trefethen, L. N., and Bau, D., III, 1997, Numerical Linear Algebra, SIAM, Philadelphia, PA.
15. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., and Navab, N., 2012, “Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes,” Computer Vision—ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, South Korea, Nov. 5–9, Springer, Berlin/Heidelberg, pp. 548–562.
16. Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., and Rother, C., 2014, “Learning 6D Object Pose Estimation Using 3D Object Coordinates,” European Conference on Computer Vision, pp. 536–551.
17. Nikolenko, S. I., 2019, “Synthetic Data for Deep Learning,” preprint arXiv:1909.11512.
18. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., 2017, “Domain Randomization for Transferring Deep Neural Networks From Simulation to the Real World,” IEEE/RSJ International Conference on Intelligent Robots and Systems, Vancouver, Canada, Sept. 24–28, pp. 23–30.
19. Lee, Y. C., and Wang, M. J., 2015, “Taiwanese Adult Foot Shape Classification Using 3D Scanning Data,” Ergonomics, 58(3), pp. 513–523.
20. To, T., Tremblay, J., McKay, D., Yamaguchi, Y., Leung, K., Balanon, A., Cheng, J., Hodge, W., and Birchfield, S., 2018, “NDDS: NVIDIA Deep Learning Dataset Synthesizer,” CVPR 2018 Workshop on Real World Challenges and New Benchmarks for Deep Learning in Robotic Vision, Salt Lake City, UT, June, Vol. 22.
21. Quattoni, A., and Torralba, A., 2009, “Recognizing Indoor Scenes,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 413–420.
22. Peng, S., Liu, Y., Huang, Q., Zhou, X., and Bao, H., 2019, “PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation,” IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, June 15–20, pp. 4561–4570.
23. Tekin, B., Sinha, S. N., and Fua, P., 2018, “Real-Time Seamless Single Shot 6D Object Pose Prediction,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, June 18–23, pp. 292–301.
24. Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., and Birchfield, S., 2018, “Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects,” preprint arXiv:1809.10790.
25. Wei, S. E., Ramakrishna, V., Kanade, T., and Sheikh, Y., 2016, “Convolutional Pose Machines,” IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732.
26. Hartley, R., and Zisserman, A., 2003, Multiple View Geometry in Computer Vision, Cambridge University Press, Cambridge, UK.
27. Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D., 2017, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” preprint arXiv:1711.00199.