
Storage for AI training: MLPerf Storage on the Micron® 9400 NVMe™ SSD

John Mazzie and Wes Vaske | August 2023

Analysis & characterization: AI workloads and MLPerf Storage

Testing storage for AI workloads is challenging because running actual training can require specialty hardware that is expensive and changes quickly. This is where MLPerf comes in to help test storage for AI workloads.

Why MLPerf?

MLCommons produces many AI workload benchmarks focused on scaling the performance of AI accelerators. It has recently applied this expertise to storage for AI and has built a benchmark for stressing storage during AI training. The goal of this benchmark is to perform I/O in the same way as a real AI training process, while providing larger datasets to limit the effects of filesystem caching and decoupling the training hardware (GPUs and other accelerators) from the storage under test.1

MLPerf Storage utilizes the Deep Learning I/O (DLIO) benchmark, which uses the same data loaders as real AI training workloads (PyTorch, TensorFlow, etc.) to move data from storage into CPU memory. In DLIO, an accelerator is defined by a sleep time and a batch size, where the sleep time is computed from running the real workload on the accelerator being emulated. The workload can be scaled up or out by adding clients running DLIO and by using the Message Passing Interface (MPI) to run multiple emulated accelerators per client.
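As a rough sketch of how this emulation works (not DLIO's actual code; the class and values below are hypothetical), an emulated accelerator is just a loop that performs real reads and then sleeps for the measured compute time:

import time

class Dataset:
    """Stub standing in for a real framework data loader that reads
    training samples from storage."""
    def read_sample(self):
        # In a real run, this is a large read from the training dataset.
        return bytes(1024 * 1024)

# Assumed values: in DLIO, these are derived from profiling the real
# accelerator being emulated.
SLEEP_TIME_S = 0.3   # emulated compute time per batch
BATCH_SIZE = 4       # samples consumed per training step

def emulated_accelerator(dataset, steps):
    for _ in range(steps):
        # The I/O is real; only the compute is replaced by a sleep.
        batch = [dataset.read_sample() for _ in range(BATCH_SIZE)]
        time.sleep(SLEEP_TIME_S)

emulated_accelerator(Dataset(), steps=3)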

MLPerf Storage works by defining a set of configurations that represent results submitted to MLPerf Training. Currently, the implemented models are BERT (natural language processing) and Unet3D (3D medical imaging), and results are reported as samples per second and the number of supported accelerators. To pass the test, a minimum of 90% accelerator utilization must be maintained.
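For intuition, the utilization check can be thought of as compute time divided by total wall time; the benchmark's exact accounting may differ, so the function below is only an illustration:

def accelerator_utilization(compute_time_s, total_time_s):
    # AU = fraction of wall time the emulated accelerator spent "computing"
    # (sleeping); the remainder is time spent stalled waiting on I/O.
    return compute_time_s / total_time_s

# Example: 540 s of emulated compute in a 600 s run is exactly 90% AU.
au = accelerator_utilization(540.0, 600.0)
print(f"AU = {au:.0%}, passing = {au >= 0.90}")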

Unet3D analysis

While MLPerf Storage implements both BERT and Unet3D, our analysis focuses on Unet3D, as the BERT benchmark does not stress storage I/O heavily. Unet3D is a 3D medical imaging model that reads large, manually annotated image files into accelerator memory and generates dense volumetric segmentations. From a storage perspective, this looks like randomly reading large files from the training dataset. Our testing compares the results of one accelerator versus 15 accelerators using a 7.68TB Micron 9400 PRO NVMe SSD.
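From the storage side, that pattern is random at file granularity but sequential within each file; a minimal sketch of the access pattern (directory layout and sizes are illustrative, not taken from the benchmark):

import os
import random

def read_random_samples(data_dir, n_samples):
    # Unet3D-style access: pick a large sample file at random, then read
    # it in full before handing it to the (emulated) accelerator.
    files = os.listdir(data_dir)
    for _ in range(n_samples):
        path = os.path.join(data_dir, random.choice(files))
        with open(path, "rb") as f:
            yield f.read()  # one large read per training sample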

First, we will examine the throughput over time on the device. In Figure 1, results for one accelerator mostly fall between 0 and 600 MB/s, with some peaks at 1,600 MB/s. These peaks correspond to the prefetch buffer being filled at the start of an epoch, before compute begins. In Figure 2, with 15 accelerators, the workload still bursts but reaches the maximum supported throughput of the device. However, due to the bursty nature of the workload, the total average throughput is 15-20% less than the max.

Figure 1: Throughput (MiB/s) vs. time in seconds, one accelerator.
Figure 2: Throughput (MiB/s) vs. time in seconds for device nvme1n1, read operations.
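As a back-of-the-envelope illustration of that gap (the numbers below are assumed, not measured):

# If the device transfers at its maximum rate during bursts but sits
# idle while the emulated accelerators "compute", the average falls
# short of the peak:
max_mbps = 6800.0      # assumed device max read throughput
busy_fraction = 0.83   # assumed fraction of the run spent transferring
avg_mbps = max_mbps * busy_fraction
print(f"average = {avg_mbps:.0f} MB/s, {1 - busy_fraction:.0%} below max")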

Next, we will look at the queue depth (QD) for the same workload. With only one accelerator, the QD never goes above 10 (Figure 3), while with fifteen accelerators, QD peaks at around 145 early on but stabilizes at roughly 120 and below for the remainder of the test (Figure 4). However, these time-series charts don't show us the entire picture.
 

Figure 3: Queue depth vs. time by operation, one accelerator.
Figure 4: Queue depth vs. time by operation for device nvme1n1, 15 accelerators.

When looking at the percentage of I/Os issued at a given QD, we see that with a single accelerator, almost 50% of I/Os were the first transaction in the queue (QD 0) and almost 50% were the second transaction (QD 1), as shown in Figure 5.

Figure 5: Queue depth vs. percentage of operations for device nvme1n1, one accelerator.

With 15 accelerators, most transactions occur between QD 80 and 110, but a significant portion occur below QD 10 (Figure 6). This behavior shows that there are idle periods in a workload that was expected to show consistently high throughput.
 

Figure 6: Queue depth vs. percentage of operations for device nvme1n1, 15 accelerators.
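Distributions like those in Figures 5 and 6 can be computed from a block trace by counting how many I/Os are already in flight when each I/O is submitted; a minimal sketch, assuming (submit_time, complete_time) pairs already parsed from a tracing tool such as blktrace (the quadratic scan is for clarity, not speed):

from collections import Counter

def qd_histogram(ios):
    # ios: list of (submit_time, complete_time) pairs, one per I/O.
    # Returns {queue depth seen at submit: percentage of I/Os}.
    counts = Counter()
    for submit, _ in ios:
        in_flight = sum(1 for s, c in ios if s <= submit < c)
        counts[in_flight - 1] += 1  # exclude the I/O itself
    return {qd: 100.0 * n / len(ios) for qd, n in sorted(counts.items())}

# Example: two overlapping I/Os -> 50% at QD 0 and 50% at QD 1.
print(qd_histogram([(0.0, 2.0), (1.0, 3.0)]))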

From these results, we see that these workloads are non-trivial from a storage viewpoint: random large-block transfers and idle time mixed with large bursts of transfers. MLPerf Storage is a tool that will be extremely helpful in benchmarking storage for various models by reproducing these realistic workloads.

Principal Storage Solutions Engineer

John Mazzie

John is a Member of the Technical Staff in the Data Center Workload Engineering group in Austin, TX. He graduated from West Virginia University in 2008 with his MSEE, with an emphasis in wireless communications. John worked for Dell on its MD3 series of storage arrays on both the development and sustaining sides. John joined Micron in 2016, where he has worked on Cassandra, MongoDB, and Ceph, as well as other advanced storage workloads.

SMTS Systems Performance Engineer

Wes Vaske

Wes Vaske is a principal storage solutions engineer at Micron.