At Bold Metrics we have the nice challenge of making software—and making machine learning models—that accurately match shoppers with the garments that will fit them best. It sounds easy, doesn’t it? Bigger shoppers will fit best in bigger sizes, smaller shoppers will fit best in smaller sizes. All done?
Unfortunately, the problem is not that easy. Or, I guess it is fortunate for us that it is not, since it keeps me employed as a data scientist, and it keeps our clients interested in our Virtual Tailor, Virtual Sizer, and Apparel Insights products (and our custom work), which are far from trivial for them to build on their own. The basic approach of Bold Metrics products is to take just a few measurements from shoppers, things that shoppers generally know about their own body size and that are usually collected as part of a loyalty program or brand app that clothing retailers use, and then make some staged inferences.
At the first level, our body regressors make intelligent guesses about the body measurements that shoppers usually do not know. For example, most shoppers know their own height, weight, waist size, and bra size; but it is far less common for them to be able to report, or even estimate or understand, their thigh_circum_proximal or neck_circum_larynx. We make good guesses because these models are based on detailed historical observations of actual human bodies.
At a second level, we want to find out how the survey features the shoppers actually report match up with the garments they are happy with buying. For that we have the great power of history, many past purchases by customers, and sometimes returns when they are not satisfied with a size. It is this last event that our clients want to minimize, to save the concrete costs of returns and the amount of clothes that end up in a landfill, and to promote enthusiasm and customer satisfaction with their clothing. Of course, in the world of data science, the first axiom of our practice is “data is dirty.” Sometimes cleaning data is at least possible in a rigorous—even if laborious—way; but often we simply do not know which data are flawed in one way or another. As a simple but pertinent example, clothing shoppers sometimes buy gifts rather than garments intended for themselves (and those might be sizes that don’t match their particular bodies).
At Bold Metrics, we have explored a number of algorithms for models. We also utilize a variety of data cleanup and feature engineering steps. These kinds of processes that fall outside the ultimate stage of “feed the data to scikit-learn, or XGBoost, or PyTorch” can hugely affect the quality of models ultimately produced. That said, I recently taught a webinar about PyTorch and have made all of the materials I taught publicly available. For part of that teaching material, I decided to utilize some real world data, albeit slightly simplified for pedagogical purposes.
In the end, the model I developed for the webinar, using PyTorch, is not particularly accurate. I think it is a good example for teaching purposes, but it is far worse than anything we would put into production. The model consists of one hidden layer large enough to extract second-order polynomial combinations of input features, followed by three fully connected “inference” layers, each with twice as many neurons as there are distinct sizes. The first “inference” layer includes dropout to reduce co-adaptation of neurons. We can visualize the network from the summary torchsummary gives:
This network could be given many more parameters easily enough. Popular classifiers like Inception v3 or ResNet-152 have tens of millions of parameters, for example. However, it is not obvious where to add more, or with what behaviors, to make this particular approach improve in quality.
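To make the shape of the network described above concrete, here is a minimal PyTorch sketch. It is not the webinar code. The feature count, the number of sizes, the dropout rate, the ReLU activations, and the final output layer are all assumptions made for illustration, as is sizing the first hidden layer to the number of degree-1 and degree-2 monomials of the inputs.

```python
import torch
from torch import nn

N_FEATURES = 4   # hypothetical: e.g. height, weight, waist, bra size
N_SIZES = 12     # hypothetical number of distinct garment sizes

# Assumption: "large enough for 2nd-order polynomial combinations" is read
# here as one neuron per monomial of degree 1 or 2 in the input features.
n_poly = N_FEATURES + N_FEATURES * (N_FEATURES + 1) // 2
width = 2 * N_SIZES  # each "inference" layer is twice the number of sizes

model = nn.Sequential(
    # Hidden layer sized for the polynomial combinations
    nn.Linear(N_FEATURES, n_poly), nn.ReLU(),
    # First "inference" layer, with dropout (rate is an assumption)
    nn.Linear(n_poly, width), nn.ReLU(), nn.Dropout(p=0.2),
    # Second and third "inference" layers
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    # Assumed output layer: one score per size
    nn.Linear(width, N_SIZES),
)

# Count trainable parameters, the figure a model summary would report
n_params = sum(p.numel() for p in model.parameters())

# Run a dummy batch through the network in evaluation mode (dropout off)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(8, N_FEATURES))

print(n_params, tuple(logits.shape))
```

With these made-up dimensions the whole network has fewer than two thousand parameters, which gives a sense of how small it is next to the large image classifiers.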
The accompanying graph shows some predictions made about best sizes for shoppers whose purchase history indicates happiness with particular sizes. Shown are marks (blue circles) for 10,000 randomly selected purchases from the test set (i.e., purchases held out from training by the train/test split). The X-axis shows the size they were actually happy with; the Y-axis shows the prediction of the PyTorch model. In a perfect model, all the blue circles would fall on top of the red line.
Clearly the model shows some tendency to predict the right sizes. However, the typical deviation from that ideal prediction is quite wide. Moreover, this PyTorch model shows a problem that we needed to work on in the models we actually use in production: many types of models tend to skew towards “typical” values rather than predict the extremes. This tendency is exacerbated by highly noisy data, which actual consumer purchases inevitably are.
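One way to see why noise pulls predictions toward the middle is a small simulation. The sketch below uses purely synthetic numbers, not Bold Metrics data, and treats size as a continuous number for simplicity. Each simulated shopper has an ideal size, and the size actually bought adds random noise. The "predictor" here is a hypothetical one that recovers the ideal size exactly.

```python
import random
import statistics

random.seed(0)

# Synthetic data (an assumption for illustration only):
# 'ideal' is the size a shopper's body calls for,
# 'bought' adds noise from fit preference, gifts, and so on.
ideal = [random.gauss(10, 3) for _ in range(10_000)]
bought = [size + random.gauss(0, 3) for size in ideal]

# A hypothetical predictor that recovers the ideal size exactly.
predicted = ideal

# Compare how widely purchases and predictions are spread out.
spread_bought = statistics.pstdev(bought)
spread_pred = statistics.pstdev(predicted)

# Look only at the largest 10% of purchases (the extreme end).
cutoff = sorted(bought)[int(0.9 * len(bought))]
extreme = [(b, p) for b, p in zip(bought, predicted) if b >= cutoff]
mean_bought_extreme = statistics.fmean(b for b, _ in extreme)
mean_pred_extreme = statistics.fmean(p for _, p in extreme)

print(f"spread of purchases:   {spread_bought:.2f}")
print(f"spread of predictions: {spread_pred:.2f}")
print(f"largest purchases, mean bought:    {mean_bought_extreme:.2f}")
print(f"largest purchases, mean predicted: {mean_pred_extreme:.2f}")
```

In this toy setup the predictions are less spread out than the purchases, and for the largest purchases the predicted size sits closer to the center than the size bought, even though the predictor makes no mistake about the shopper's body.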
Even the PyTorch model pictured is somewhat better than it might initially look. I wrote that the perfect model would put all the blue circles directly on top of the red line. But that is not actually true either. A given shopper—or for this purpose, equally a different shopper who has exactly the same surveyed body size—might buy a size 8 on a Monday, then a size 10 on Tuesday, then a size 00 on Thursday, and a size 18 on Friday. That seemingly odd behavior could simply be because she falls between 8 and 10 in her own body, and perhaps prefers a slightly different fit for different garments. Then at the end of the week, she buys gifts for her smaller and larger friends. All of these are legitimate purchases the shopper is happy with. However, the model has no way in principle to discern some of that intent. Moreover, the data used for training is just as noisy as that used for testing, so best guesses by machine-learning models will always remain guesses.
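The week of purchases described above can be turned into a small calculation. The sketch below assumes a hypothetical size run from 00 to 18 in even steps and measures error in steps along that run. It then searches for the single size that best fits all four purchases. The names SIZE_RUN and mean_steps_off are made up for this illustration.

```python
# Hypothetical size run (an assumption, not from any retailer's chart).
SIZE_RUN = ["00", "0", "2", "4", "6", "8", "10", "12", "14", "16", "18"]

# The four purchases from the example week in the text.
purchases = ["8", "10", "00", "18"]
bought_idx = [SIZE_RUN.index(size) for size in purchases]


def mean_steps_off(predicted_idx):
    """Average distance, in steps along the size run, from each purchase."""
    return sum(abs(predicted_idx - b) for b in bought_idx) / len(bought_idx)


# Try every size as the one fixed prediction for this shopper.
errors = {size: mean_steps_off(i) for i, size in enumerate(SIZE_RUN)}
best_size = min(errors, key=errors.get)
best_error = errors[best_size]

print(best_size, best_error)
```

No single prediction brings the average error to zero for this shopper. Sizes 8 and 10 tie for the best fit, and both still miss the gift purchases by a wide margin.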
The point is that there is no silver bullet. We data scientists will remain employed for a good while longer. While there are some impressive and useful tools in the world, both commercial and open source, that attempt to “automate data science,” in the end they do not remove the need for a human data scientist who can get a good feel for the data… and follow that up with lots and lots of trial and error. What feature engineering? What data cleanup? What choice of model algorithms? What hyperparameters? There are, and will remain, a great many choices for us to make.
For anyone trying to solve similar problems, I encourage you to be brave: get out there and try things. Don’t rely on other people’s ideas; generate some of your own. You never know what you might come up with!
David Mertz, Ph.D.
Chief Technology Officer, BoldMetrics Inc.