I’ve been doing some testing to clarify what a vCPU in Amazon Web Services actually is. Over the course of the testing, I experienced inconsistent results on a 2-thread test on a 4-vCPU m3.xlarge system, due to the mislabeling of the vCPUs as independent single-core processors by the Linux kernel. This issue manifests itself in a CPU-bound, multithreaded workload where there is idle CPU time.
My test environment used a paravirtualized (PV) kernel, which moves some of the virtualization logic into the Linux kernel itself, reducing the need for high-overhead hardware emulation. One drawback is that we can't swap in a modified kernel to, for example, resolve the CPU mislabeling. But there is an alternative: an HVM (hardware virtual machine) system, which relies on virtualization extensions in the CPU hardware and can therefore run custom kernels or even non-Linux operating systems. Historically the drawback has been a performance hit, but a very interesting post on Brendan Gregg's blog indicates that what's called HVM in Amazon EC2 is actually a hybrid, combining aspects of both PV and HVM. A test run by Phoronix on EC2 showed HVM performance on par with PV, and in some cases even better. So it definitely seems worth repeating my earlier tests on an HVM instance.
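As an aside, the VirtualizationType field in the EC2 API output below is the authoritative way to see which type an instance uses, but from inside a Xen-based guest there's also a rough cross-check. A quick sketch, not part of the tests below; exact paths and boot messages vary by kernel version:

# "xen" appears here for both PV and HVM guests on EC2
cat /sys/hypervisor/type
# Xen-related boot messages differ between PV and HVM setups
dmesg | grep -i xen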
As before, I fire up an instance, but this time using the latest HVM Amazon Linux image:
$ ec2-describe-images ami-76817c1e -H
Type    ImageID       Name                                      Owner   State      Accessibility  ProductCodes  Architecture  ImageType  KernelId  RamdiskId  Platform  RootDeviceType  VirtualizationType  Hypervisor
IMAGE   ami-76817c1e  amazon/amzn-ami-hvm-2014.03.2.x86_64-ebs  amazon  available  public  x86_64  machine  ebs  hvm  xen
BLOCKDEVICEMAPPING    /dev/xvda    snap-810ffc56    8
$ ec2-run-instances ami-76817c1e -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION   r-a4f480da   462281317311   default
INSTANCE      i-c5d5b6ef   ami-76817c1e   pending   marc-aws   0   m3.xlarge   2014-06-23T19:02:18+0000   us-east-1d   monitoring-disabled   ebs   hvm   xen   sg-5fc61437   default
Checking in on CPUs:
[ec2-user@ip-10-x-x-x ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
It's the same 2.6GHz E5-2670 processor, but this time it's reported as a single-socket, non-hyperthreaded, quad-core processor. That's still not the dual-core hyperthreaded configuration we're actually getting, though.
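One way to double-check what the kernel believes about the topology is to read the sysfs topology files directly. A minimal sketch (the kernel can only report whatever the hypervisor exposes, so expect it to match /proc/cpuinfo above):

# Print each CPU's core id and its hyperthread siblings; on this instance
# every CPU should claim its own core and list only itself as a sibling.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  echo "$cpu: core $(cat $cpu/topology/core_id), siblings $(cat $cpu/topology/thread_siblings_list)"
done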
Time to run a few tests. As before, the workload is a CPU-bound pipeline (dd streaming zeroes into gzip), pinned to specific vCPUs with taskset:
[ec2-user@ip-10-x-x-x ~]$ taskset -pc 0 $$
pid 1768's current affinity list: 0-3
pid 1768's new affinity list: 0
[ec2-user@ip-10-x-x-x ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 18.1955 s, 119 MB/s
[ec2-user@ip-10-x-x-x ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 36.4968 s, 59.5 MB/s
2170552320 bytes (2.2 GB) copied, 36.506 s, 59.5 MB/s
In the same range as with PV, but 1-2% slower, so we're seeing a small amount of HVM overhead. Let's try across processors:
[ec2-user@ip-10-x-x-x ~]$ taskset -pc 0,1 $$
pid 1768's current affinity list: 0
pid 1768's new affinity list: 0,1
[ec2-user@ip-10-x-x-x ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 27.8401 s, 78.0 MB/s
2170552320 bytes (2.2 GB) copied, 27.8398 s, 78.0 MB/s
[ec2-user@ip-10-x-x-x ~]$ taskset -pc 0,2 $$
pid 1768's current affinity list: 0,1
pid 1768's new affinity list: 0,2
[ec2-user@ip-10-x-x-x ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 18.1849 s, 119 MB/s
2170552320 bytes (2.2 GB) copied, 18.2014 s, 119 MB/s
Again, a tiny bit slower than with PV. To test variability, I'll kick off 20 consecutive runs and print a histogram of the results:
[ec2-user@ip-10-x-x-x ~]$ taskset -pc 0-3 $$
pid 1768's current affinity list: 0,2
pid 1768's new affinity list: 0-3
[ec2-user@ip-10-x-x-x ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
>  wait
> done
...
[ec2-user@ip-10-x-x-x ~]$ cat output | awk '/bytes/ {print $8,$9}' | sort -n | uniq -c
      1 113 MB/s
      3 114 MB/s
      4 115 MB/s
      6 116 MB/s
     10 117 MB/s
     10 118 MB/s
      6 119 MB/s
Running between 113 and 119 MB/s per thread: much less variability than before.
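For a rough visual, the same data can be re-rendered as an ASCII bar chart, one # per run. A quick sketch using the output file from the run above:

awk '/bytes/ {print $8}' output | sort -n | uniq -c |
  awk '{printf "%s MB/s %s\n", $2, substr("####################", 1, $1)}'

Applied to the histogram above, that renders as:

113 MB/s #
114 MB/s ###
115 MB/s ####
116 MB/s ######
117 MB/s ##########
118 MB/s ##########
119 MB/s ######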
Looking at top, using a custom .toprc that shows per-CPU utilization and limits the display to the top two tasks:
[ec2-user@ip-10-x-x-x ~]$ cat > ~/.toprc <<-EOF
> RCfile for "top with windows"           # shameless braggin'
> Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
> Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
>         winflags=25913, sortindx=10, maxtasks=2
>         summclr=1, msgsclr=1, headclr=3, taskclr=1
> Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
>         winflags=62777, sortindx=0, maxtasks=0
>         summclr=6, msgsclr=6, headclr=7, taskclr=6
> Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
>         winflags=62777, sortindx=13, maxtasks=0
>         summclr=5, msgsclr=5, headclr=4, taskclr=5
> Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
>         winflags=62777, sortindx=4, maxtasks=0
>         summclr=3, msgsclr=3, headclr=2, taskclr=3
> EOF
[ec2-user@ip-10-x-x-x ~]$ top -b -n20 -U ec2-user

top - 20:31:51 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 22.9%us,  0.3%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 74.0%us,  3.0%sy,  0.0%ni, 23.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 59.7%us,  4.0%sy,  0.0%ni, 36.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 33.7%us,  2.7%sy,  0.0%ni, 63.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip

top - 20:31:54 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 72.3%us,  4.3%sy,  0.0%ni, 23.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 21.3%us,  2.0%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:11.84 gzip
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:11.83 gzip

top - 20:31:57 up 28 min,  2 users,  load average: 1.34, 1.17, 0.64
Tasks:  82 total,   3 running,  79 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.3%us,  4.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 92.4%us,  7.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.74 gzip
 1953 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.75 gzip

top - 20:32:00 up 28 min,  2 users,  load average: 1.32, 1.17, 0.64
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 29.9%us,  1.7%sy,  0.0%ni, 68.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 63.0%us,  3.7%sy,  0.0%ni, 33.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 40.5%us,  2.3%sy,  0.0%ni, 57.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 55.3%us,  3.7%sy,  0.0%ni, 41.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.66 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.67 gzip
We see that the scheduler migrates the two gzip processes between CPUs from sample to sample, but it does a good job of giving each process nearly a full CPU's worth of time: both gzips hold steady around 97% CPU, and each pair of CPUs sharing a process adds up to nearly 100% usage between them.
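For anyone repeating this without a custom .toprc, mpstat from the sysstat package gives a similar per-CPU view. A sketch, not part of the original run:

# Sample all CPUs every 3 seconds, 20 times, while the test runs; the busy
# CPUs move around, but total non-idle time should hover near 200%.
mpstat -P ALL 3 20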
So based on these tests, it looks like even though the CPU topology is still mislabeled, HVM almost entirely avoids the variability caused by shared-core scheduling, at the cost of a small reduction in overall throughput.