Reproducibility in ML: why it matters and how to achieve it

Reproducing results across machine learning experiments is painstaking work and, in some cases, impossible. In this post, we detail why reproducibility matters, what exactly makes it so hard, and what we at Determined AI are doing about it.

Reproducibility is critical to industrial-grade model development. Without it, data scientists risk claiming gains from changing one parameter without realizing that hidden sources of randomness are the real source of improvement. Reproducibility reduces or eliminates variations when rerunning failed jobs or prior experiments, making it essential in the context of fault tolerance and iterative refinement of models. This capability becomes increasingly important as sophisticated models and real-time data streams push us towards distributed training across clusters of GPUs. This shift not only multiplies the sources of non-determinism but also increases the need for both fault tolerance and iterative model development.

Numerous experts in the deep learning community have already begun to draw attention to the importance of reproducibility, for example in an excellent post by Pete Warden at Google. However, reproducibility in ML remains elusive, as the example below illustrates.

A Day in the Life of a New Data Scientist

You’ve been handed your first project at your new job. The inference time on an existing ML model is too slow, so the team wants you to analyze the performance tradeoffs of a few different architectures. Can you shrink the network and still maintain acceptable accuracy?

The engineer who developed the original model is on leave for a few months, but not to worry, you’ve got the model source code and a pointer to the dataset. You’ve been told the model currently reports 30.3% error on the validation set and that the company isn’t willing to let that number creep above 33.0%.

You start by training a model from the existing architecture so you’ll have a baseline to compare against. After reading through the source, you launch your coworker’s training script and head home for the day, leaving it to run overnight.

The next day you return to a bizarre surprise: the model is reporting 52.8% validation error after 10,000 batches of training. Looking at the plot of your model’s validation error alongside that of your teammate leaves you scratching your head. How did the error rate increase before you even made any changes?

Initial validation error curve.

After some debugging, you find two glaring issues explaining the divergent model performance:

  1. Additional training data: The team recently added several thousand images to the database. Since the root path to the data remained unchanged, the training script threw no warnings and included the new images in the dataset.
  2. Inconsistent hyperparameters: The training script included default hyperparameter values (e.g., learning rate, dropout probabilities), but it also allowed users to override them at runtime. Digging through your coworker’s results, you find that the hyperparameters of the best-performing model don’t match the defaults.
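
Both problems can be caught early by recording exactly what a run consumed. The following is a minimal sketch, not the script from the story: the helper names (`fingerprint_dataset`, `resolve_hyperparameters`, `write_manifest`) are hypothetical, and it assumes the dataset is a directory of files small enough to read into memory one at a time.

```python
import hashlib
import json
from pathlib import Path


def fingerprint_dataset(root):
    """Return a SHA-256 digest over every file's relative path and contents."""
    digest = hashlib.sha256()
    root = Path(root)
    # Sort so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def resolve_hyperparameters(defaults, overrides):
    """Merge runtime overrides onto defaults so the values actually used are explicit."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**defaults, **overrides}


def write_manifest(path, dataset_root, hyperparameters):
    """Store the dataset fingerprint and resolved hyperparameters as JSON."""
    manifest = {
        "dataset_sha256": fingerprint_dataset(dataset_root),
        "hyperparameters": hyperparameters,
    }
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

If the manifest is written next to each run's results (and outside the dataset directory, so it does not alter the fingerprint), a later run can compare its own fingerprint and hyperparameters against the baseline's before training starts, rather than discovering the mismatch the next morning.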

Having fixed these problems, you are confident your model, hyperparameters, and dataset now match exactly. You restart the training script, expecting to match the statistical performance of the baseline model. Instead, you see this:

Validation error has improved to 37.3%, but still doesn't match the baseline.

You resist the urge to shout at your computer. Though the difference has narrowed, you’re still seeing a gap of 7 percentage points in classification error!

Root Causes of Non-Determinism

Unfortunately, several sources contribute to run-to-run variation even when working with identical model code and training data. Here are some of the common causes:

  1. Random initialization of layer weights: Many ML models set initial weight values by sampling from a particular distribution. This has been shown to increase the speed of convergence over initializing all weights to zeros [1, 2].
  2. Shuffling of datasets: The dataset is often randomly shuffled at initialization. If the model is written to use a fixed range of the dataset for validation (e.g. the last 10%), the contents of this set will not be consistent across runs. Even if the validation set is fixed, shuffling within the training dataset affects the order in which the samples are iterated over, and consequently, how the model learns.
  3. Noisy hidden layers: Certain NN architectures include layers with inherent randomness during training. Dropout layers, for example, exclude the contribution of a particular input node with probability p. While this may help prevent overfitting, it means the same input sample will produce different layer activations on any given iteration.
  4. Changes in ML frameworks: Updates to ML libraries can lead to subtly different behavior across versions, while migrating a model from one framework to another can cause even bigger discrepancies. For example, TensorFlow warns its users to “rely on approximate accuracy, not on the specific bits computed” across versions. Keras can behave differently when switching between the Theano and TensorFlow backends unless the appropriate steps are taken.
  5. Non-deterministic GPU floating point calculations: Certain functions in cuDNN, the NVIDIA Deep Neural Network library for GPUs, do not guarantee reproducibility across runs by default, including several convolutional operations. Furthermore, reproducibility is not guaranteed across different GPU architectures unless these operations are forcibly disabled by your ML library.
  6. CPU multi-threading: For CPU training, TensorFlow by default configures thread pools with one thread per CPU core to parallelize computation. This parallelization happens both within the execution of certain individual ops (intra_op_parallelism) and between graph operations deemed independent (inter_op_parallelism). While this speeds up training, the existing implementation introduces non-determinism.
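
To make the first three causes concrete, here is a small framework-free sketch using only Python's standard library. The function `simulate_run` is hypothetical and stands in for a real training setup: it draws initial weights, a shuffle order, and a dropout mask from a single seeded generator. The last lines illustrate one common reason parallel execution (items 5 and 6) can change results: floating-point addition is not associative, so summing the same numbers in a different order can give a different answer.

```python
import random


def simulate_run(seed, num_samples=8, num_weights=4, dropout_p=0.5):
    """Draw the first three kinds of randomness from one seeded generator."""
    rng = random.Random(seed)  # private generator, isolated from global state

    # 1. Random weight initialization (Gaussian, as a stand-in for a real initializer).
    weights = [rng.gauss(0.0, 0.1) for _ in range(num_weights)]

    # 2. Dataset shuffling: the order determines both the split and the batch order.
    order = list(range(num_samples))
    rng.shuffle(order)

    # 3. Dropout: each unit is dropped with probability dropout_p.
    keep_mask = [rng.random() >= dropout_p for _ in range(num_weights)]

    return weights, order, keep_mask


# With the same seed, every random choice repeats exactly.
assert simulate_run(seed=42) == simulate_run(seed=42)

# Floating-point addition is not associative, so the order in which
# parallel workers accumulate partial sums can change the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False
```

Real frameworks expose their own seeding calls, and seeding alone does not address items 4 through 6, which depend on library versions, GPU kernels, and thread scheduling.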

The Path to Reproducibility

To alleviate the frustration of our fictional data scientist, we must invest in making machine learning experiments reproducible. How might we achieve this? Well, we can start by capturing all the metadata associated with an experiment, and systematically addressing the common causes listed above. Determined automatically handles many of these challenges. Furthermore, using our explicit “reproducibility” flag, users can control the randomness affecting batch creation, weights, and noise layers across experiments. Table 1 provides more details as to how Determined tackles the problem of reproducibility.

Features that Determined supports to enable reproducible model training.

Table 1: Support for Reproducibility in Determined
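
As an illustration of what capturing experiment metadata can look like outside of any particular platform, the sketch below records the interpreter, operating system, and installed package versions. It is not how Determined implements this; `capture_environment` is a hypothetical helper, and the package names passed to it are only examples.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Collect interpreter, OS, and package versions for an experiment record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "argv": sys.argv,
        "packages": versions,
    }


print(json.dumps(capture_environment(["tensorflow", "numpy"]), indent=2))
```

Storing this record with each run makes framework changes (item 4 above) visible when two runs disagree.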

Rerunning our experiment in Determined, here’s what we see:

Validation error is reduced using Determined.

As we hoped, the resulting validation error is now almost identical to our baseline. The remaining variation is due to the inherent non-determinism in the underlying cuDNN library used during GPU training (see item 5 above). Switching to CPU-only training and disabling multithreading, we see that Determined allows us to duplicate runs exactly, with training loss and validation error matching at each step.

By performing CPU-only training, we can achieve perfect reproducibility.
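
For readers who want to try CPU-only, single-threaded training in their own scripts, the following sketch shows one way to set it up. It is an assumption-laden example rather than Determined's implementation: `configure_deterministic_cpu` is a hypothetical helper, the TensorFlow calls are the TensorFlow 2.x API names, and the function must run before TensorFlow initializes its runtime. If TensorFlow is not installed, only the generic settings are applied.

```python
import os
import random


def configure_deterministic_cpu(seed):
    """Hide GPUs, force single-threaded execution, and seed the RNGs.

    Returns True if TensorFlow was found and configured, False otherwise.
    """
    # Hide all CUDA devices so training falls back to the CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    random.seed(seed)

    try:
        import tensorflow as tf
    except ImportError:
        return False  # TensorFlow not installed; only the generic settings apply

    # One thread inside each op and one op at a time, so results do not
    # depend on how work is scheduled across threads.
    tf.config.threading.set_intra_op_parallelism_threads(1)
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.random.set_seed(seed)
    return True
```

Calling `configure_deterministic_cpu(42)` at the very top of a training script trades speed for repeatability, which is usually only worthwhile while debugging or validating a baseline.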

At Determined AI we are passionate about supporting reproducible machine learning workflows. By building first-class support for it into our tools, we hope to further increase the visibility of this issue. Given its complexity, fully enabling reproducibility will require effort across the stack, from infrastructure-level developers all the way up to ML framework authors. Ultimately, this will be well worth the investment if we want not just to reproduce but to extend the advances in machine learning to date.

Support for reproducibility is just one of the ways Determined makes it easier to build high-performance ML models. If you’re interested in making your company’s data science team more productive, contact us to learn more.


