Allegro AI: Insights on The Data Challenge in Deep Learning Projects

Data is the most precious resource in deep learning research. As such, it should be handled carefully at every stage, from data gathering and annotation to data QA and versioning. However, even if you perform all of these tasks in the best possible way, data still poses challenges that can dramatically affect your model's performance.

In this talk, we discuss the fact that your data is most likely biased and that this bias affects the performance of your model. We will show how to identify data bias and what can be done to address it, focusing in particular on class imbalance. We provide illustrative experiments to accompany these ideas. Our experiments focus on an object detection task, which has additional complexities beyond vanilla classification tasks. We explore how different data balancing methods (data resampling and loss reweighting) affect the performance of minority and majority classes in such settings.
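As a rough illustration of the two balancing methods named above, the sketch below derives inverse-frequency class weights (usable for loss reweighting) and per-sample sampling probabilities (usable for resampling). It is a minimal sketch under stated assumptions, not the method used in the talk's experiments: the function names, the class names "car" and "bicycle", and the 90/10 split are all hypothetical, and each annotated instance is treated as a single label, which ignores the fact that a detection image can contain several classes at once.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, inversely proportional to class frequency.

    Weights are scaled so that the average weight over all samples is 1.
    (Hypothetical helper, written for illustration only.)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def sampling_probabilities(labels):
    """Return per-sample probabilities that balance classes in expectation."""
    class_weights = inverse_frequency_weights(labels)
    raw = [class_weights[label] for label in labels]
    total = sum(raw)
    return [w / total for w in raw]


# Illustrative, made-up label distribution: 90 majority vs. 10 minority instances.
labels = ["car"] * 90 + ["bicycle"] * 10

weights = inverse_frequency_weights(labels)  # for loss reweighting
probs = sampling_probabilities(labels)       # for data resampling
```

With these made-up labels, the minority class receives a weight ten times larger than the majority class, and drawing samples according to `probs` selects each class half of the time in expectation. In a real detection pipeline the weights would typically be applied to the classification term of the loss, and resampling would have to operate on whole images rather than on individual instances.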

In addition, we will look at the diminishing returns of annotated data. Deep learning models are notorious for their endless appetite for training data, and acquiring high-quality annotated data consumes a relatively large amount of resources. Monitoring how quickly the benefit of additional data diminishes provides a way to assess how much data is needed at the different stages of the project lifecycle, and even to predict whether the current model architecture will be able to achieve the target metric. This knowledge effectively provides a tool for optimal management of time, manpower, and computing resources.
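One common way to monitor this effect is to train on increasing subsets of the data, record the validation error for each subset size, and fit a simple curve to extrapolate. The sketch below assumes the error follows a power law, error = a * n^(-b), which is an assumption made here for illustration and not a claim from the talk. The helper names are hypothetical, and the data points are synthetic, generated from a known power law rather than measured.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * size**(-b) in log-log space.

    Returns the pair (a, b). Hypothetical helper, written for illustration only.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def samples_needed(a, b, target_error):
    """Invert error = a * n**(-b) to estimate the n that reaches target_error."""
    return (a / target_error) ** (1.0 / b)


# Synthetic points generated from error = 2.0 * n**(-0.3); not real measurements.
sizes = [1000, 2000, 4000, 8000]
errors = [2.0 * n ** -0.3 for n in sizes]

a, b = fit_power_law(sizes, errors)
n_target = samples_needed(a, b, target_error=0.1)
```

A pure power law has no error floor, so it always predicts that the target is reachable with enough data. To judge whether an architecture can reach the target at all, a variant with an added constant term (an irreducible error) would be needed, which requires a nonlinear fit and is left out of this sketch.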

Finally, we will discuss the features needed for a dataset management tool that can help identify and tackle the data challenge in your deep learning projects. We will demonstrate the effectiveness of using such a tool on popular computer vision tasks.