Latency of accessing main memory is almost the same order as sending a packet


Problem Description

Looking at Jeff Dean's famous latency guides

Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

One thing which looks somewhat uncanny to me is that the time taken to read 1 MB sequentially from disk is only about 10 times faster than sending a round-trip packet across the Atlantic. Can anyone give me more intuition for why this feels right?
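
For a quick sanity check, the ratio the question refers to can be computed straight from the table above ( a minimal Python sketch, using only the 2012 numbers, nothing else assumed ):

# Ratio implied by the 2012 numbers above:
# 1 MB sequential disk read vs. a CA -> Netherlands -> CA round trip
disk_read_1mb_ns     = 20_000_000     # 20 ms
transatlantic_rtt_ns = 150_000_000    # 150 ms
print( transatlantic_rtt_ns / disk_read_1mb_ns )    # 7.5, i.e. "only ~ 10x faster"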

Recommended Answer

Q : 1MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why this feels right?

Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - CPU own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1 KB with Zippy PROCESS (+GHz,+SIMD,+multicore tricks)
      20,000   ns - Send 2 KB over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

  • Trans-Atlantic Network RTT :

    • Global optical networks work roughly at the speed of light ( 300.000.000 m/s )
    • A LA(CA)-AMS(NL) packet has to travel not the geodesic "distance", but over a set of continental and trans-atlantic "submarine" cables, the length of which is way longer ( see the map )
      These factors do not "improve": only the transport capacity keeps growing, while the add-on latencies introduced by light-amplifiers, retiming units and other L1-PHY / L2- / L3-networking technologies are kept under control and as small as possible.

      So, using this technology, the LA(CA)-AMS(NL) RTT will remain about the same ~ 150 ms ( a rough lower-bound estimate is sketched below, after this block ).

      Using other technology, LEO-Sat Cubes for example, the "distance" will only grow: from the ~ 9000 km P2P path to a pair of additional GND/LEO segments plus a few additional LEO/LEO hops, which introduce a "longer" path and add-on per-hop re-processing latencies, while the capacity will not get anywhere close to the currently available optical transports, so no magic jump "back to the future" is to be expected ( we still miss the DeLorean ).
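
      A back-of-the-envelope check of that ~ 150 ms figure, as a minimal Python sketch: the ~ 9000 km point-to-point distance comes from the paragraph above, while the ~ 1.5x cable-route detour factor and the ~ 2/3 c propagation speed in fibre are illustrative assumptions only:

      # Rough lower bound for the LA(CA)-AMS(NL) RTT over optical fibre.
      C_FIBRE_M_S  = 2.0e8    # light in glass fibre ~ 2/3 c          ( assumption )
      GEODESIC_M   = 9.0e6    # ~ 9000 km point-to-point              ( from the text )
      ROUTE_FACTOR = 1.5      # continental + submarine cable detours ( assumption )

      one_way_s = GEODESIC_M * ROUTE_FACTOR / C_FIBRE_M_S
      print( round( 2 * one_way_s * 1e3 ), "ms RTT lower bound" )     # -> 135 ms

      That already lands in the same ballpark as the observed ~ 150 ms, before any routing, queueing or retiming overheads are added.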

      • HDD-s can have a very fast and very short transport path for moving the data, but the READ-ops have to wait for the physical / mechanical operations of the media-reading heads ( that is where most of the time goes, not the actual data transfer to the host RAM )
      • HDD-s are rotational devices, so the disk has to "align" itself to where the read starts, which costs about the first 10 [ms]
      • HDD-s store data in a static structure of heads ( 2+, reading physical signals from the magnetic platters' surfaces ) : cylinders ( concentric circular zones on the platter, onto which a cyl-aligned reading head gets settled by the disk-head micro-controller ) : sectors ( angular sections of the cylinder, each carrying a block of the same-sized data ~ 4 KB, 8 KB, ... )

      These factors do not "improve" - all commodity-produced drives remain at the industry-selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k } spins/min (RPM). This means that, if a well-compacted data-layout is maintained on such a disk, one continuous head:cylinder-aligned read around the whole cylinder will take:

      >>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
      
      11.1 ms per CYL @  5k4 RPM disk,
       8.3 ms per CYL @  7k2 RPM disk,
       6.0 ms per CYL @ 10k  RPM disk,
       4.0 ms per CYL @ 15k  RPM disk,
       3.3 ms per CYL @ 18k  RPM disk.
      

      Data density is also limited by the properties of the magnetic media. Spintronics R&D will bring somewhat more densely stored data, yet the last 30 years have stayed well inside the limits of reliable magnetic storage.

      More might be expected from a trick of co-parallel reads from several heads at once, yet this goes against the design of the embedded micro-controllers, so most of the reading happens sequentially, one head after another, into the HDD-controller's onboard buffers, ideally with no cyl-to-cyl mechanical re-alignment of the heads ( technically this depends on the prior data-to-disk layout, maintained by the O/S and possibly by disk-optimisers, originally called disk-"compression" tools, which simply tried to re-align the known sequences of FAT-described data-blocks so as to follow the most optimal trajectory of head:cyl:sector transitions, depending mostly on the actual device's head:head and cyl:cyl latencies ). So even the most optimistic data-layout takes ~ 13..21 [ms] to seek-and-read just one head:cyl path ( a rough end-to-end model is sketched below ).
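
      Putting the pieces together, a minimal Python model of one 1 MB sequential HDD read; the seek and rotation figures come from the table and list above, while the ~ 100 MB/s sustained media-transfer rate is an illustrative assumption only:

      # Minimal model: why one 1 MB sequential HDD read sits within a single
      # order of magnitude of a trans-atlantic RTT.
      seek_ms     = 10.0               # average seek            ( from the table above )
      rotation_ms = 8.3 / 2            # ~ half a spin @ 7k2 RPM ( from the list above )
      transfer_ms = 1.0 / 100 * 1e3    # 1 MB at ~ 100 MB/s      ( assumption )

      read_1mb_ms = seek_ms + rotation_ms + transfer_ms
      rtt_ms      = 150.0              # CA -> NL -> CA          ( from the table above )

      print( round( read_1mb_ms ), "ms per 1 MB read" )            # -> ~ 24 ms
      print( round( rtt_ms / read_1mb_ms, 1 ), "x RTT / read" )    # -> ~ 6x, "about 10x"

      The mechanical positioning and the sustained transfer each sit in the ~ 10 ms range, so the whole read lands within one order of magnitude of the ~ 150 ms round trip.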

      The laws of physics.

