Fashion Style in 128 Floats

In this work we present an approach for learning features from large amounts of weakly-labelled data. Our approach consists of training a convolutional neural network jointly with a ranking loss and a classification loss. We do this by exploiting user-provided metadata of images on the web. From this metadata we define a rough notion of similarity between images, which in turn lets us define a ranking loss. By combining this ranking loss with a standard classification loss, we learn a compact 128-float representation of fashion style from noisy user-provided tags alone, and this representation outperforms standard features. Furthermore, qualitative analysis shows that our model automatically learns nuances in style.
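As a rough illustration of how a similarity between images could be derived from noisy tags, the sketch below scores two images by the overlap of their tag sets. The choice of Jaccard similarity (intersection over union) and the name `tag_similarity` are assumptions made for this example; consult the full paper for the metric actually used.

```python
def tag_similarity(tags_a, tags_b):
    """Jaccard similarity between two tag collections.

    Assumed metric for illustration: |A & B| / |A | B|.
    Returns a value in [0, 1]; 0.0 when neither image has tags.
    """
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # no metadata at all: treat the pair as not similar
    return len(a & b) / len(a | b)


# Hypothetical tags for three images.
anchor = {"red", "dress", "summer"}
close = {"red", "dress", "party"}
far = {"black", "boots", "winter"}

print(tag_similarity(anchor, close))  # two shared tags out of four distinct
print(tag_similarity(anchor, far))    # no shared tags
```

With noisy tags such a score is only a coarse signal, which is why it is used to pick clearly similar and clearly dissimilar images rather than as a regression target.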

For related work, see our work on parsing clothing in images and predicting fashionability.

Method

Overview of our Approach

Our approach combines a feature network with a classification network, learnt jointly with a ranking loss and a classification loss. We first define a similarity metric on the noisy user-provided tags, which lets us roughly determine which images are semantically similar and which are dissimilar. Given an anchor (reference) image, we form a triplet by choosing one very similar and one very dissimilar image. We then define a ranking loss that encourages the L2 distance between the features of the anchor and the similar image to be small, and the L2 distance between the features of the anchor and the dissimilar image to be large. Although this already gives good performance, results improve further when we add a small classification network and a classification loss on the dissimilar image. Unlike features taken directly from classification networks, our features are optimized as an embedding under the L2 norm, so the Euclidean distance can be used directly for t-SNE visualizations and for similarity queries using KD-trees.
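The joint objective described above can be sketched in a few lines. This is not the exact loss from the paper: the ranking term below is the common margin-based (hinge) triplet loss, swapped in for illustration, and the `margin` and `alpha` values as well as all function names are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ranking_loss(anchor, similar, dissimilar, margin=1.0):
    """Hinge triplet loss (assumed form, not the paper's exact loss).

    Zero once the dissimilar image is at least `margin` farther
    from the anchor than the similar image is.
    """
    d_pos = l2_distance(anchor, similar)
    d_neg = l2_distance(anchor, dissimilar)
    return max(0.0, margin + d_pos - d_neg)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def joint_loss(anchor, similar, dissimilar, logits, label, alpha=0.5):
    """Ranking loss plus a weighted classification loss.

    `logits` and `label` belong to the dissimilar image, matching the
    description above; `alpha` is a hypothetical weighting.
    """
    rank = ranking_loss(anchor, similar, dissimilar)
    return rank + alpha * cross_entropy(logits, label)


# Toy 2-D features standing in for the 128-float embedding.
f_anchor, f_similar, f_dissimilar = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]
print(ranking_loss(f_anchor, f_similar, f_dissimilar))  # well-ordered triplet
print(ranking_loss(f_anchor, f_dissimilar, f_similar))  # badly ordered triplet
print(joint_loss(f_anchor, f_similar, f_dissimilar, [0.0, 0.0], 0))
```

Because the ranking term is defined on Euclidean distances, the same `l2_distance` can be reused at query time to find the nearest neighbours of an image in the learnt embedding.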

Results

t-SNE Visualization

We train our model on the Fashion144k dataset and evaluate our features both qualitatively and quantitatively on the Hipster Wars dataset [1]. Above we show a t-SNE [2] visualization of the Pinup class of the Hipster Wars dataset. Our approach groups similar outfits together while ignoring the background and the wearer. For the full visualization, click here.

For full results and details please consult the full paper.

This research was partially funded by JST CREST.

Publications

2016

  • Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction
    • Edgar Simo-Serra and Hiroshi Ishikawa
    • Conference on Computer Vision and Pattern Recognition (CVPR), 2016

Source Code

  • StyleNet
  1. M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, Tamara L. Berg. Hipster Wars: Discovering Elements of Fashion Styles. In ECCV, 2014. 

  2. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.