A popular dataset for training machine-learning systems – one that’s used by thousands of students to build an open-source self-driving car – contains critical errors and omissions, including missing labels for hundreds of images of bicyclists and pedestrians.
Copyright by nakedsecurity.sophos.com
Machine learning models are only as good as the data on which they’re trained. But when researchers at Roboflow, a firm that writes boilerplate computer vision code, hand-checked the 15,000 images in Udacity Dataset 2, they found problems with 4,986 – that’s 33% – of those images.
Amongst these [problematic data] were thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists. We also found many instances of phantom annotations, duplicated bounding boxes, and drastically oversized bounding boxes.
Perhaps most egregiously, 217 (1.4%) of the images were completely unlabeled but actually contained cars, trucks, street lights, and/or pedestrians.
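Checks like the ones Roboflow ran by hand can also be sketched in code. The snippet below is a minimal, hypothetical audit – it assumes a simple list-of-dicts annotation format (Roboflow’s actual tooling and the dataset’s real schema may differ) – that flags two of the problems described above: images with no labels at all, and exact-duplicate bounding boxes.

```python
# Hypothetical sketch: flag unlabeled images and duplicated bounding boxes.
# The annotation schema here is illustrative, not the dataset's real format.
from collections import defaultdict

def audit_annotations(images, annotations):
    """Return (ids of images with zero labels, count of duplicated boxes)."""
    boxes_per_image = defaultdict(list)
    for ann in annotations:
        boxes_per_image[ann["image_id"]].append(tuple(ann["bbox"]))
    unlabeled = [img["id"] for img in images if not boxes_per_image[img["id"]]]
    duplicates = sum(
        len(boxes) - len(set(boxes)) for boxes in boxes_per_image.values()
    )
    return unlabeled, duplicates

# Toy data: image 2 has no labels at all; image 1 has a duplicated box.
images = [{"id": 1}, {"id": 2}]
annotations = [
    {"image_id": 1, "bbox": [10, 10, 50, 80], "label": "pedestrian"},
    {"image_id": 1, "bbox": [10, 10, 50, 80], "label": "pedestrian"},
]
unlabeled, dup_count = audit_annotations(images, annotations)
```

A real audit would still need human review – a script can catch an image with zero boxes, but not a cyclist whose box was simply never drawn.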
Junk in, junk out. In the case of the models behind self-driving cars, junk data could literally lead to deaths. This is how Roboflow founder Brad Dwyer describes how bad or unlabeled data propagates through a system:
Generally speaking, models learn by example. You give it a photo, it makes a prediction, and then you nudge it a little bit in the direction that would have made its prediction more ‘right’ – where ‘right’ is defined by the ‘ground truth’, which is what your training data is.
If your training data’s ground truth is wrong, your model still happily learns from it; it’s just learning the wrong things (eg ‘that blob of pixels is *not* a cyclist’ vs ‘that blob of pixels *is* a cyclist’). Neural networks do an OK job of performing well despite *some* errors in their training data, but when 1/3 of the ground truth images have issues it’s definitely going to degrade performance.
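The “nudge” Dwyer describes can be made concrete with a toy example. The sketch below – a single gradient step of a one-weight logistic classifier, with all names and numbers purely illustrative – shows how the same photo pushes the model in opposite directions depending on whether its label says “cyclist” or “not a cyclist”:

```python
# Toy illustration of one training "nudge": a single SGD step on a
# one-weight logistic classifier. All values are illustrative.
import math

def nudge(weight, feature, label, lr=0.1):
    """One gradient step on binary cross-entropy; `label` is the ground truth."""
    pred = 1.0 / (1.0 + math.exp(-weight * feature))  # P(this blob is a cyclist)
    grad = (pred - label) * feature                   # dLoss/dweight
    return weight - lr * grad

w = 0.0
# A correctly labeled cyclist (label=1) nudges the weight up...
w_good = nudge(w, feature=1.0, label=1.0)   # → 0.05
# ...while the same photo mislabeled (label=0) nudges it down.
w_bad = nudge(w, feature=1.0, label=0.0)    # → -0.05
```

The model applies the same update rule either way; it has no way to know the label was wrong, which is why a third of the ground truth being bad degrades the result rather than just being ignored.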
Self-driving car engineers, please use the fixed dataset
Thanks to the permissive licensing terms of the open-source data, Roboflow has fixed and re-released the Udacity dataset in a number of formats. Dwyer is asking those who were training a model on the original dataset to please consider switching to the updated dataset.
Dwyer hasn’t looked into any other self-driving datasets, so he’s not sure how much bad data underlies the models being trained in this nascent industry. But he has looked at datasets in other domains and found Udacity’s Dataset 2 particularly bad in comparison, he told me:
Of the datasets I’ve looked at in other domains (eg medicine, animals, games), this one stood out as being of particularly poor quality.
Could crappy data quality like this have led to the death of 49-year-old Elaine Herzberg? She was killed by a self-driving Uber test vehicle as she walked her bicycle across a street in Tempe, Arizona in March 2018. Uber said that her death was likely caused by a software bug in its technology. […]