Dittory: Discovery with Deep Learning

Sai Gaddam
8 min read · Apr 8, 2017

Shopping online for clothes just got way easier. With Dittory you can now effortlessly discover identical or visually similar products across all the major Indian e-commerce stores.

Dittory results as seen on various e-commerce sites

Dittory is currently available as a stand-alone Chrome extension. We’ll be making it available on other browsers and as an Android app in the coming weeks.

What’s so great about discovering similar products?

Boring. I only need one glance to figure out if two things look similar

— you might say, channeling your dismissive inner Tweeter-In-Chief. It turns out that what’s effortlessly easy for us humans is an extraordinarily difficult problem for machines. Currently, there is no other product or app on the market that can do this. Pretty good, right? Here’s what went into building this tech.

Over the last couple of months, we have built custom neural networks and taught them using deep learning algorithms to automatically label and index more than 30 million apparel product images across all major e-commerce sites in India. (We’ll explain the jargon in a bit.)

These machine-readable labels are far more fine-grained than the kind you might typically encounter on an e-commerce site. They are also more nuanced than what a human (that is, you or I) might be able to consciously articulate. How, for instance, would you describe the color, pattern, and style of the clothing in these two catalog images?

credit: myntra

Shopping is hard. Let’s do some math

The text labels offered along with both catalog images (kurta, anarkali, red, long-sleeve, flared hemline) only capture the broad category and attributes, which happen to be the same for both. However, these Anarkali kurtas are obviously, to us, very different in style. The neck style, gathers, contrast piping, and placket are all different. And these are stylistic differences that we can articulate, if we happen to know the vocabulary. There are many more differences that even clothing designers would find difficult to put into words. How would you describe the fabric pattern on the right? Or how closely the sleeves hug the arms?

Our neural networks, untrained in English, learn to automatically parse the product images to generate hundreds of rich labels that capture the fine details of each product. Just as important is the learned ability of these networks to ignore what is irrelevant. They learn, after seeing tens of millions of product images, to ignore features like the background, the position of the model’s hands and feet, their smiles, the color and length of their curls. We then store these hundreds of labels for each of the tens of millions of products. When a user visits a product on any of the e-commerce sites we currently index, we scour this database of indexed products and retrieve products that have identical or near-identical sets of these rich labels. Some nifty optimization layered on top of hefty infrastructure allows us to do this search across 30 million indexed images in about 250 milliseconds.
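To make the retrieval step a little more concrete, here is a minimal sketch of how a label-based lookup could work. Everything in it is illustrative rather than Dittory’s actual code: the random label vectors, the small in-memory catalog, and the brute-force cosine similarity are stand-ins for the real labels, index, and optimizations described above.

```python
import numpy as np

# Hypothetical representation: each indexed product is reduced to a dense
# vector of machine-generated labels (512 made-up dimensions here).
NUM_LABELS = 512
rng = np.random.default_rng(0)

# Stand-in for a small slice of the catalog; the real index holds tens of
# millions of products and lives behind purpose-built infrastructure.
catalog = rng.random((10_000, NUM_LABELS), dtype=np.float32)
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)  # unit-normalize rows

def most_similar(query_labels: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return indices of products whose label vectors are closest to the query."""
    q = query_labels / np.linalg.norm(query_labels)
    scores = catalog @ q                # cosine similarity against every product
    return np.argsort(-scores)[:top_k]  # best matches first

# Labels of the product the user is currently viewing (also made up).
query = rng.random(NUM_LABELS, dtype=np.float32)
print(most_similar(query))
```

A brute-force scan like this would not come anywhere near 250 milliseconds over 30 million products on its own; that is where the indexing and infrastructure mentioned above come in.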

The very satisfying end result is that you’ll automatically know if there’s a better cousin of that Skater dress or Kaftan you liked or, even better, if it is available elsewhere at a lower price.

A Brief Pre-history Of Deep Learning

This sounds interesting, but what’s deep learning, what are neural networks and why do you need to bring out the big guns for this problem? To understand what deep learning is and why detecting similar products is such a tough problem, we’ll need a short detour into neuroscience. (Wait what?? Trust us, this is interesting.)

What is the meaning of similar? When you and I are looking at an object in real life, our eyes and brains are not seeing and perceiving pixels. We, or rather the many billions of neurons in our brains, are translating the light reflecting off the object and landing on our retinas into labels. We aren’t consciously aware of this process. All we know is that a few hundred milliseconds after seeing a red, vaguely oval edible object on a table, we see an apple.

credit: pixabay

It just seems so easy and natural.

It isn’t.

This instantaneous perception is the result of billions of neurons in the brain, which have evolved over tens of millions of years, working in beautiful stochastic synchrony. Do keep that in mind the next time you stop to smell and see the flowers. What does that have to do with similarity, though? Part of the sophisticated computational trickery our brains have picked up over the last many millennia is the ability to ignore irrelevant stuff in the world around us. An apple is an apple, whether it is placed on a table or a book or a person’s head. An apple is an apple whether it is two feet away in dim light, or twenty feet away at high noon. And astonishingly, our brains are able to figure this out. To put it in technical terms, our brains allow us to discount changes in scale, position, perspective, orientation, luminance, and distortion, and give us a meaningful representation of the world we inhabit.

credit: mathworks

Now, consider how a computer “sees” an apple. It is actually receiving a bunch of RGB pixel values. The individual values of these pixels will change completely even if the apple’s location is shifted by a single pixel. Any change in the ambient light in the room will also change the pixel values drastically. Our computer algorithms have to intelligently learn to discount all this. Not just that: for the computer to even begin recognizing the apple, it has to understand where the apple ends and the table begins, that an apple could be partially hidden behind something else, that it could be over-ripe or green. Simply enumerating all the visual scenarios an apple could appear in would take a long time.
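Here is a quick numerical sketch of that fragility, using a synthetic random image in place of a real photo: shift the image sideways by a single pixel, or brighten it a little, and the raw pixel values a naive comparison sees are suddenly very different.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a photo of an apple

shifted = np.roll(image, shift=1, axis=1)   # same scene, nudged right by one pixel
brighter = np.clip(image * 1.2, 0.0, 1.0)   # same scene, with slightly more light

def fraction_changed(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of raw pixel values that no longer match between two images."""
    return float(np.mean(~np.isclose(a, b, atol=0.01)))

print(fraction_changed(image, shifted))   # ~1.0: nearly every value is different
print(fraction_changed(image, brighter))  # a large fraction changes here too
```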

In 1955, a group of pioneering researchers, Marvin Minsky and John McCarthy among them, famously thought that a handful of “carefully selected” scientists could, over a few summer months, make a “significant advance” in understanding the nature of intelligence and learning itself. It turns out that even the computational capabilities of an opossum, a mammalian cousin that has remained largely unchanged for some 70 million years, are beyond the grasp of our most powerful computers.

credit: wikipedia, flickr

It’s a daunting task, but if we had to try, how would we go from pixels to labels? The common-sense approach is to identify different, commonly occurring patterns or features in the images. For instance, if you were trying to identify faces in an image, it would make sense to look for groups of pixels that looked like white ovals with dark spots (eyes), a triangular object vertically aligned under those ovals (nose), and so on. This is indeed where much of the early computer vision research was focused. The problem with this approach is that for each class of labeling problems (face detection, bird detection, car detection, shoe detection, and so forth) you would need to manually hand-craft a set of features. This approach, already very brittle, quickly becomes unwieldy when you want something capable of recognizing many different things at the same time.
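As a toy illustration of what hand-crafting features feels like (this is not a real face detector), here is the kind of rule one ends up writing: score a grayscale patch by checking for a dark horizontal band roughly where the eyes should be, above a brighter cheek region. Every region and threshold below is invented, which is precisely the point.

```python
import numpy as np

def looks_like_a_face(patch: np.ndarray) -> bool:
    """Crude hand-crafted rule: a dark 'eye band' above a brighter 'cheek' region.

    `patch` is a 2-D grayscale array with values in [0, 1]. The row ranges and
    thresholds below are arbitrary guesses, which is exactly the problem.
    """
    h = patch.shape[0]
    eye_band = patch[int(0.25 * h):int(0.40 * h), :]    # where eyes "should" be
    cheek_band = patch[int(0.45 * h):int(0.70 * h), :]  # where cheeks "should" be
    return bool(eye_band.mean() < 0.4 and cheek_band.mean() > 0.6)

# Tuned for one kind of image; it breaks on a tilted head, glasses, harsh
# shadows, a different complexion, or a face seen in profile.
```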

If only we could just show a computer a whole bunch of images and get it to learn these features auto-magically.

In 2012, a mere 57 years after Marvin Minsky’s “we’ll do it this summer” proclamation, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton showed the world that this was indeed possible. Neural networks have been around for a long time, but it was Krizhevsky and colleagues who demonstrated that, with the right training technique and a couple of powerful GPUs, you could make them learn fundamental features common to the visual world, features strikingly similar to what our visual cortex neurons encode. The deep learning techniques they pioneered have led to astonishing advances in a slew of domains. Deep learning is in Google Search, in self-driving cars, in Alexa. It’s everywhere!
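To give a flavour of what learned features look like in practice (this is the standard open-source recipe, not a description of Dittory’s own models), a common trick is to take a network pretrained on a large image collection, remove its final classification layer, and use the remaining activations as a feature vector that can be compared across images.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# A network pretrained on a large, generic image collection; dropping its final
# classification layer leaves a general-purpose visual feature extractor.
backbone = models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return a 2048-dimensional feature vector for the image at `path`."""
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return feature_extractor(image).flatten()

# Visually similar products should land close together in this feature space:
# torch.nn.functional.cosine_similarity(embed("a.jpg"), embed("b.jpg"), dim=0)
```

Comparing feature vectors like these against an indexed catalog is one common way to get “visually similar” without hand-crafting anything.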

End of detour. Whew.

This brings us back to the futuristic present. We are currently powering similar product search using deep learning across 8 clothing categories and are rapidly expanding our coverage. We’ve also found similarity-based retrieval handy in other non-clothing categories.

These watches are identical, but sold under different brand names, making text search difficult

What Next?

There’s a lot more interesting stuff in our pipeline. We are just getting started! It’s day one! We haven’t even scratc- Yeah, those slogans are really annoying. But it’s true: it’s incredibly exciting to see what the latest advances in deep learning, or AI as the buzzword buzzards refer to it, are making possible.

One of the interesting consequences of generating rich labels for product images is that we can see patterns in the kinds of products users like and click on. This in turn allows us to understand the stylistic sensibilities of users and show them interesting niche products they would really like. What a user likes on Myntra or Jaypore will allow us to discover what they will like on stalkbuylove or itokri. To help indie boutiques get their products discovered, we are currently working on machine learning algorithms that will automatically enhance the quality of catalog images one might have shot in poor lighting on a mobile camera.

If this sounds exciting and you’d like to learn more, please write to us. And if this sounds really, really exciting and you want to build and play with your own neural networks, try and find the hidden message here.

Back to disrup… Bye!

