Myntra’s Fashion Recommendation Pipeline

Gaurav Kavhar
7 min read · May 19, 2021

In this blog I’ll talk about the recommendation pipeline (ref: https://arxiv.org/pdf/2008.11638.pdf) that Myntra uses for its fashion products. For those who don’t know, Myntra is an e-commerce company known for its fashion apparel.

Index:-

  1. Business Problem
  2. Use of Deep Learning to solve the problem
  3. Business Constraints
  4. Performance Metrics
  5. Source of data
  6. Research Section
  7. First cut approach
  8. Experiments performed and their results
  9. Future work
  10. References
  11. Github and LinkedIn Profile

1. Business problem:

  • The business problem we are trying to solve is recommending similar fashion items for multiple fashion articles at once.
  • While most existing work in this domain retrieves similar products for a single item in a query, Myntra’s recommendation pipeline retrieves similar products for multiple fashion items at once.
  • This matters because, although a user may have searched for a particular primary article type (e.g., men’s shorts), the human model in the full-shot look image is usually wearing secondary fashion items as well (e.g., t-shirts, shoes).
  • If we can recommend similar items for both the primary and the secondary fashion articles, it could boost revenue and improve customer experience and engagement.

2. Use of Deep Learning to solve the problem

The whole Recommendation pipeline is divided into 3 stages.

High level overview of the Recommendation pipeline

i) Front-facing full-shot image detection: This module detects front-facing full-shot images using human pose estimation. It is treated as a binary classification problem: an image either is a front-facing full-shot look or it is not.

ii) Fashion article detection and localisation: This module extracts the fashion items from the full-shot images found in stage 1, using object detection and segmentation.

iii) Similar image retrieval: This module finds products similar to each fashion item extracted in stage 2, using a similarity metric.
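To make the flow between the three stages concrete, here is a minimal sketch of how they could be chained for one catalog image. The function name and the three callables are hypothetical placeholders for the real models; nothing here comes from Myntra’s code.

```python
def recommend_for_look(image, is_front_facing_full_shot, detect_articles,
                       retrieve_similar, top_k=5):
    """Chain the three stages for a single image.

    All three callables are hypothetical stand-ins:
      is_front_facing_full_shot(image) -> bool            (stage 1)
      detect_articles(image) -> list of (label, crop)      (stage 2)
      retrieve_similar(crop, top_k) -> list of products    (stage 3)
    """
    # Stage 1: skip images that are not front-facing full-shot looks.
    if not is_front_facing_full_shot(image):
        return []

    recommendations = []
    # Stage 2: one (label, crop) pair per detected fashion article.
    for label, crop in detect_articles(image):
        # Stage 3: similar products for each detected article.
        recommendations.append((label, retrieve_similar(crop, top_k)))
    return recommendations
```

Returning a list of pairs rather than a dictionary keeps both results when the same article type is detected twice in one image.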

3. Business Constraints

  • Scalable: The recommendation pipeline should be scalable because Myntra has millions of images in their catalog and the image count keeps increasing.
  • Low Latency: Latency is one of the most important constraints for an e-commerce company. For a good customer experience the latency should be as low as possible, so the pipeline should be able to retrieve recommendations within a fraction of a second.

4. Performance Metrics

Module 1 is a binary classification task, so we can use precision, recall and F1 score as metrics. Module 2 is an object detection task, for which we use mean Average Precision (mAP). For Module 3 we again use precision and recall to evaluate our method quantitatively.
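As a reminder of what these metrics compute, here is a small self-contained sketch. The first function gives precision, recall and F1 for binary labels. The second gives the intersection over union (IoU) of two boxes, which is the usual test for whether a detection matches a ground-truth box when computing average precision. The function names and the toy labels are mine, not from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy example: 1 = "front-facing full-shot", 0 = anything else.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Average precision for one class is then the area under the precision-recall curve obtained by sorting detections by score, and mAP is the mean of that value over the classes.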

5. Source of Data

Myntra has a huge catalog of annotated data which they used for their system. Since we can’t get access to it, I used the publicly available DeepFashion dataset.

6. Research Section

i) Original Myntra paper (https://arxiv.org/pdf/2008.11638.pdf)

An illustration of the pipeline of the proposed framework.
  • This paper is the main idea behind the case study. The authors propose a framework for recommending multiple fashion items simultaneously. The framework combines several modules to achieve this: a human pose detection module, a pose classifier, an object detection module, and triplet-network-based embedding learning for finding similar products.

ii) Mask R-CNN for object detection (https://arxiv.org/pdf/1703.06870.pdf)

  • This paper introduces Mask R-CNN, a general framework for object instance segmentation. Mask R-CNN efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

iii) Human Pose estimation paper (https://arxiv.org/pdf/1804.06208.pdf)

  • This paper contains the state-of-the-art human pose estimation approach introduced by Xiao et al. I chose this method for key-point detection because it is surprisingly simple and effective. It adds a few deconvolutional layers on top of a ResNet backbone, which is probably the simplest way to estimate heat maps from deep, low-resolution feature maps. The authors have released their code.

7. First cut approach

  • Perform some basic EDA on the DeepFashion dataset to see how many images we have in each category.
  • Train a pose estimation model (Simple Baselines for Human Pose Estimation and Tracking) on our dataset and use it to pick out the full-shot images.
  • From the full-shot images we need to find the front-facing ones; for that we use a pose classifier (ResNet18).
  • Train the Mask R-CNN model on our dataset. The Mask R-CNN model extracts the relevant fashion articles from the full-shot images.
  • To find products similar to the extracted fashion articles, we need to represent them in a common embedding space where similar products are close to each other and dissimilar ones are far apart.
  • Use a pretrained DenseNet model to get the image embeddings. Then, to find products similar to those in the query image, use cosine similarity between the embeddings.
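The last step above, ranking catalog products by cosine similarity to a query embedding, can be sketched in a few lines. This is a brute-force illustration in plain Python; the product ids and the three-dimensional vectors are made up, and real DenseNet embeddings would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query, catalog, top_k=3):
    """Return the top_k (product_id, similarity) pairs, best first.

    catalog maps a product id to its embedding vector.
    """
    scored = [(pid, cosine_similarity(query, emb)) for pid, emb in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up embeddings, for illustration only.
catalog = {
    "tee_red": [1.0, 0.0, 0.2],
    "tee_blue": [0.9, 0.1, 0.3],
    "jeans": [0.0, 1.0, 0.1],
}
query = [1.0, 0.05, 0.25]
results = most_similar(query, catalog, top_k=2)
```

Scoring every catalog item like this is linear in the catalog size, which is fine for a first cut but is the part that a dedicated similarity-search library would replace at scale.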

8. Experiments performed and their results

i) EDA:-

  • Women’s clothing images (around 45k) far outnumber men’s (around 8k).
  • Upper-body images are the most common, with a count of about 33k.
  • Full-body and lower-body images number about 9k and 10k respectively.
  • Tees tanks is the category with the largest count (around 14k).
  • There are only 39 images of the Suits category.
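Counts like these can be produced with a short tally over the image paths. The sketch below assumes a directory layout of the form `img/<gender>/<category>/<item id>/<file>`; that layout and the example paths are assumptions for illustration, so adjust the indices to match your copy of the dataset.

```python
from collections import Counter


def summarize(paths):
    """Count images per gender and per category.

    Assumes each path looks like 'img/<gender>/<category>/<item id>/<file>'
    (an assumed layout, not verified against the dataset).
    """
    genders, categories = Counter(), Counter()
    for path in paths:
        parts = path.split("/")
        genders[parts[1]] += 1
        categories[parts[2]] += 1
    return genders, categories


# Made-up paths, for illustration only.
example_paths = [
    "img/WOMEN/Tees_Tanks/id_0001/01_front.jpg",
    "img/WOMEN/Tees_Tanks/id_0001/01_side.jpg",
    "img/WOMEN/Dresses/id_0002/01_front.jpg",
    "img/MEN/Suits/id_0003/01_front.jpg",
]
genders, categories = summarize(example_paths)
largest_category, largest_count = categories.most_common(1)[0]
```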

ii) Module 1:-

In this module we perform human pose estimation to pick out the front-facing full-shot images. I used a pretrained HRNet model, which performed very well on our dataset.

Some examples of the pretrained HRNet model on our dataset
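Once a pose model has produced key-points, deciding whether an image is a front-facing full-shot can be done with simple rules. The sketch below is a rule-based stand-in, not the ResNet18 pose classifier mentioned in the first cut approach. It assumes COCO-style key-point names, and the required subset and the confidence threshold are my own choices.

```python
# COCO-style key-point names; which ones count as "required" is an assumption.
REQUIRED_KEYPOINTS = (
    "nose", "left_shoulder", "right_shoulder",
    "left_hip", "right_hip", "left_knee", "right_knee",
    "left_ankle", "right_ankle",
)


def is_full_shot(keypoints, min_conf=0.5):
    """True if every required key-point was detected with enough confidence.

    keypoints maps a key-point name to an (x, y, confidence) tuple.
    """
    return all(
        name in keypoints and keypoints[name][2] >= min_conf
        for name in REQUIRED_KEYPOINTS
    )


def looks_front_facing(keypoints):
    """Heuristic: a person facing the camera has their left shoulder at a
    larger x (further right in the image) than their right shoulder."""
    return keypoints["left_shoulder"][0] > keypoints["right_shoulder"][0]


def is_front_facing_full_shot(keypoints, min_conf=0.5):
    # The shoulder check only runs when all required key-points are present.
    return is_full_shot(keypoints, min_conf) and looks_front_facing(keypoints)
```

A learned classifier is more robust than the shoulder rule for side-on or twisted poses, so treat this as a baseline for sanity-checking the filter.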

iii) Module 2:-

In this module we train a Mask R-CNN model on our data using transfer learning: we start from a model pretrained on the COCO dataset, freeze all layers except the final layer, and train only that layer on our data.

  • Model performance:-
Model performance after Transfer learning
  • Results on our dataset:-
  • Extracting the fashion items:-
Fashion items extracted after detection using the bounding boxes.
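Cutting the detected items out of the image is a matter of filtering detections by score and slicing the pixel grid with each bounding box. The sketch below uses a nested list as the image and a hypothetical per-detection dictionary with `box`, `label` and `score` keys; a real detector’s output would need to be converted into that shape first, and the 0.7 threshold is an arbitrary choice.

```python
def crop_detections(image, detections, min_score=0.7):
    """Return a (label, crop) pair for each confident detection.

    image: list of rows, each row a list of pixel values (H x W).
    detections: list of dicts with 'box' = (x1, y1, x2, y2) in pixels,
                'label' and 'score' (a hypothetical layout).
    """
    height, width = len(image), len(image[0])
    crops = []
    for det in detections:
        if det["score"] < min_score:
            continue  # drop low-confidence detections
        x1, y1, x2, y2 = det["box"]
        # Clamp the box to the image and convert to integer pixel indices.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 <= x1 or y2 <= y1:
            continue  # empty box after clamping
        crop = [row[x1:x2] for row in image[y1:y2]]
        crops.append((det["label"], crop))
    return crops
```

Each crop can then be passed to the embedding model in Module 3 on its own, so that a pair of shorts and a t-shirt from the same look are searched separately.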

iv) Module 3:-

  • In this module we use a pretrained DenseNet model and extract the image embeddings from its final layer. I also tried ResNet models, but their final embeddings were highly sparse. The DenseNet model worked better on our dataset.
  • To find similar images I used Facebook’s FAISS library, which is built for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM, and its search time is very low.
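The sparsity remark above is easy to check for any candidate backbone: count how many dimensions of each embedding are (near) zero. The helper names and the tolerance below are my own, and the snippet works on plain Python lists so it does not depend on any particular model library.

```python
def sparsity(embedding, eps=1e-6):
    """Fraction of dimensions whose magnitude is at most eps."""
    if not embedding:
        return 0.0
    zeros = sum(1 for x in embedding if abs(x) <= eps)
    return zeros / len(embedding)


def mean_sparsity(embeddings, eps=1e-6):
    """Average sparsity over a non-empty collection of embeddings."""
    return sum(sparsity(e, eps) for e in embeddings) / len(embeddings)
```

Comparing `mean_sparsity` across backbones on the same sample of crops gives a quick, quantitative version of the ResNet versus DenseNet comparison described above. When most dimensions are zero, few dimensions are left to tell products apart, which is one plausible reason the sparse embeddings worked less well here.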

v) End to End Pipeline:-

With this first cut approach we were able to extract fashion items from a look image and find similar items to recommend to the user.

Fashion item detection on original image.
  • Recommending similar products:-

Gaurav Kavhar

Machine Learning enthusiast. Computer Science engineer.