Latency of accessing main memory is almost the same order as sending a packet
Question
Looking at Jeff Dean's famous latency guides:
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
One thing that looks somewhat uncanny to me is that reading 1 MB sequentially from disk is only about 10 times faster than sending a round-trip packet across the Atlantic. Can anyone give me more intuition for why this feels right?
Answer
Q : 1 MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why does this feel right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - CPU own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1 KB with Zippy PROCESS (+GHz,+SIMD,+multicore tricks)
20,000 ns - Send 2 KB over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
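Taking the two rows in question from the listing above, the ratio itself is easy to check ( the values are copied from the table; this is plain arithmetic, not a measurement ):

```python
# Values copied from the latency listing above, in nanoseconds.
read_1mb_hdd_ns = 30_000_000        # Read 1 MB sequentially from DISK  (~30 ms)
transatlantic_rtt_ns = 150_000_000  # CA -> Netherlands round trip      (~150 ms)

ratio = transatlantic_rtt_ns / read_1mb_hdd_ns
print(ratio)  # 5.0 -> the two operations are within one order of magnitude
```

With the ~2012 numbers from the question ( 20 ms per 1 MB HDD read ), the same ratio comes out at 7.5x, i.e. the "only ~10x faster" the question found uncanny.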
- Global optical networks work roughly at the speed of light ( 300,000,000 m/s ) - the LA(CA)-AMS(NL) packet has to travel not the geodesic "distance", but over a set of continental and trans-atlantic "submarine" cables, the length of which is way longer ( see the map )
- HDD-s can have a very fast and very short transport path for moving the data, but the READ-ops have to wait for the physical / mechanical operations of the media-reading heads ( that is what takes most of the time here, not the actual data transfer to the host RAM ) - HDD-s are rotational devices, so the disk has to "align" where the read starts, which costs about the first 10 [ms]
- HDD-s store data in a static structure of heads ( 2+, reading physical signals from the magnetic platters' surfaces ) : cylinders ( concentric circular zones on a platter, onto which a cyl-aligned reading head is settled by the disk-head micro-controller ) : sectors ( angular sections of a cylinder, each carrying a block of the same-sized data ~ 4KB, 8KB, ... )
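The mechanical costs described above can be turned into a minimal access-time model. The ~10 ms seek comes from the listing; the 7200 RPM spindle speed and the ~100 MB/s sustained transfer rate are illustrative assumptions, not figures from the answer:

```python
# Minimal HDD access-time sketch for one 1 MB sequential read.
seek_ms = 10.0                               # mechanical seek, per the listing above
rpm = 7200                                   # assumed commodity spindle speed
rot_latency_ms = 0.5 * 1_000 / (rpm / 60.0)  # avg rotational latency = half a spin
transfer_mb_s = 100.0                        # assumed sustained media transfer rate
transfer_ms = 1.0 / transfer_mb_s * 1_000    # moving the 1 MB itself

total_ms = seek_ms + rot_latency_ms + transfer_ms
print(round(total_ms, 1))  # 24.2 -> the mechanics cost more than the transfer
```

Note that the seek plus rotational latency ( ~14 ms here ) outweighs the actual 1 MB data transfer, exactly as the bullet above claims.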
Trans-Atlantic Network RTT :
These factors do not "improve" - only the transport capacity is growing, while the add-on latencies introduced by light-amplifiers, re-timing units and other L1-PHY / L2- / L3-networking technologies are kept under control, as small as possible.
So, using this technology, the LA(CA)-AMS(NL) RTT will remain about the same ~ 150 ms.
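The ~ 150 ms figure is consistent with simple propagation arithmetic. A sketch, assuming light in glass travels at roughly c / 1.47 ( a typical fibre refractive index ) over a cable route of ~ 12,000 km - both assumptions for illustration, not measured values:

```python
# Propagation-only RTT estimate over a trans-atlantic fibre route.
C_VACUUM_KM_S = 300_000            # speed of light in vacuum, in km/s
refractive_index = 1.47            # assumed for silica fibre
c_fibre_km_s = C_VACUUM_KM_S / refractive_index
route_km = 12_000                  # assumed cable route, longer than the geodesic

rtt_ms = 2 * route_km / c_fibre_km_s * 1_000
print(round(rtt_ms))  # 118 -> propagation alone; amplifiers, re-timing and
                      # routing overheads make up the rest of the ~ 150 ms
```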
Using another technology - LEO-Sat cubes, as an example - the "distance" will only grow from the ~ 9000 km P2P, by a pair of additional GND/LEO segments, plus by a few additional LEO/LEO hops, which introduce a "longer" distance and add-on hop-by-hop re-processing latencies, and the capacity will not get anywhere close to the currently available optical transports, so no magic jump "back to the future" is to be expected ( we still miss the DeLorean ).
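For scale, the LEO detour can be sketched too - the ~ 9000 km P2P distance is from the answer above, while the ~ 550 km orbital altitude and the neglect of LEO/LEO hop geometry are assumptions for illustration only:

```python
# Path-length sketch for a single-satellite LEO relay (illustrative only).
geodesic_km = 9_000                     # LA -> AMS point-to-point, per the answer
leo_altitude_km = 550                   # assumed constellation altitude
up_and_down_km = 2 * leo_altitude_km    # GND->LEO plus LEO->GND segments
path_km = geodesic_km + up_and_down_km  # LEO/LEO hops would add still more

C_VACUUM_KM_S = 300_000
one_way_ms = path_km / C_VACUUM_KM_S * 1_000
print(round(one_way_ms, 1))  # 33.7 -> propagation looks fine; it is the per-hop
                             # re-processing and capacity limits that dominate
```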
On the HDD side, these factors do not "improve" either - all commodity-produced drives remain at the industry-selected angular speeds of { 5k4 | 7k2 | 10k | 15k | 18k } spins-per-minute (RPM). This means that, if a well-compacted data layout is maintained on such a disk, one continuous head:cylinder-aligned read around the whole cylinder will take:
>>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
11.1 ms per CYL @ 5k4 RPM disk,
8.3 ms per CYL @ 7k2 RPM disk,
6.0 ms per CYL @ 10k RPM disk,
4.0 ms per CYL @ 15k RPM disk,
3.3 ms per CYL @ 18k RPM disk.
Data density is also limited by the magnetic media properties. Spintronics R&D will bring some more densely stored data, yet the last 30 years have been well inside the limits of reliable magnetic storage.
More could be expected from a trick to co-parallel-read from several heads at once, yet this goes against the design of the embedded micro-controllers, so most of the reading goes but sequentially, from one head after another, into the HDD-controller's onboard buffers, best if no cyl-to-cyl mechanical re-alignment of the heads takes place ( technically this depends on the prior data-to-disc layout, maintained by the O/S and the possible care of disk-optimisers ( originally called disk-"defragmenters", which just tried to re-align the known sequences of FAT-described data-blocks, so as to follow the most optimal trajectory of head:cyl:sector transitions, depending most on the actual device's head:head and cyl:cyl latencies ) ). So even the most optimistic data layout takes ~ 13..21 [ms] to seek-and-read but one head:cyl-path.
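The quoted ~ 13..21 [ms] bracket follows directly from the numbers already derived above - one average seek plus one full cylinder revolution at the slowest and fastest commodity spindle speeds:

```python
# Seek-and-read-one-cylinder bracket, using the figures from above.
seek_ms = 10.0                         # average mechanical seek
cyl_ms = {5_400: 11.1, 18_000: 3.3}    # full-revolution times per the listing

best_ms = seek_ms + cyl_ms[18_000]
worst_ms = seek_ms + cyl_ms[5_400]
print(round(best_ms, 1), round(worst_ms, 1))  # 13.3 21.1 -> the ~ 13..21 [ms]
```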
The Laws of Physics decide.