Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?
I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running on an E5-2670 v2 processor at 2.5GHz, on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).
Investigating for ourselves
I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:
$ ec2-describe-images ami-fb8e9292 -H
Type  ImageID  Name  Owner  State  Accessibility  ProductCodes  Architecture  ImageType  KernelId  RamdiskId  Platform  RootDeviceType  VirtualizationType  Hypervisor
IMAGE  ami-fb8e9292  amazon/amzn-ami-pv-2014.03.1.x86_64-ebs  amazon  available  public  x86_64  machine  aki-919dcaf8  ebs  paravirtual  xen
BLOCKDEVICEMAPPING  /dev/sda1  snap-b047276d  8
Launching the instance:
$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION  r-cde66bb3  462281317311  default
INSTANCE  i-b5f5a2e6  ami-fb8e9292  pending  marc-aws  0  m3.xlarge  2014-06-16T20:23:48+0000  us-east-1d  aki-919dcaf8  monitoring-disabled  ebs  paravirtual  xen  sg-5fc61437  default
The instance is up and running within a few minutes:
$ ec2-describe-instances i-b5f5a2e6 -H
Type  ReservationID  Owner  Groups  Platform
RESERVATION  r-cde66bb3  462281317311  default
INSTANCE  i-b5f5a2e6  ami-fb8e9292  ec2-54-242-182-88.compute-1.amazonaws.com  ip-10-145-209-67.ec2.internal  running  marc-aws  0  m3.xlarge  2014-06-16T20:23:48+0000  us-east-1d  aki-919dcaf8  monitoring-disabled  54.242.182.88  10.145.209.67  ebs  paravirtual  xen  sg-5fc61437  default
BLOCKDEVICE  /dev/sda1  vol-1633ed53  2014-06-16T20:23:52.000Z  true
Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:
[ec2-user@ip-10-145-209-67 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2599.998
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 1
Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0. In other words, the OS sees a single single-core processor with 4 threads. We know that the E5-2670 v2 is actually a 10-core processor, so the topology reported at the OS level doesn't correspond to the real hardware.
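As a quick cross-check, lscpu summarizes the same topology (a minimal sketch; it only restates what the hypervisor presents, and the exact field names vary between lscpu versions):

# Summarize the CPU topology as the guest kernel sees it; expect it to
# match /proc/cpuinfo rather than the underlying physical hardware.
lscpu | egrep 'Model name|Socket|Core|Thread'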
Nevertheless, we'll proceed with a few simple tests. I'm going to run gzip, an integer-compute-intensive compression workload, on 2.2GB of zeroes read from /dev/zero. By using synthetic input and discarding the output, we avoid any effects of disk I/O. I'm going to combine this test with taskset commands to impose processor affinity on the process.
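For reference, the command repeated throughout the tests below can be wrapped into a small helper (a sketch; gzip_bench and its argument are illustrative names, not part of the original commands):

# Run N parallel gzip streams over synthetic input and print each stream's
# throughput. dd reports its summary on stderr; grep keeps only that line.
# Output is discarded so disk I/O is not a factor.
gzip_bench() {
    local nprocs=$1
    for i in $(seq 1 "$nprocs"); do
        dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null &
    done
    wait
}

# Example: pin the current shell (and its children) to processor 0,
# then run a single stream.
taskset -pc 0 $$
gzip_bench 1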
A simple test
The simplest case: a single thread, on processor 0:
[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-145-209-67 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s
With the single processor, we can process 121 MB/sec. Let’s try running two gzips at once. Sharing a single processor, we should see half the throughput.
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s
Sharing those cores
Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.
[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s
We have our answer: throughput has dropped by a third, to 79.9 MB/s, showing that processors 0 and 1 are threads sharing a single core. (Note that hyperthreading is still giving a benefit here: 79.9 MB/s each on a shared core is better than the 60.5 MB/s we saw when both processes shared a single hyperthread.)
Trying the exact same test, but this time, non-adjacent processors 0 and 2:
[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s
Both jobs run all the way up to full speed, showing that processors 0 and 2 sit on dedicated cores.
What does this all mean? Let's go back to Amazon's vCPU definition:
Each vCPU is a hyperthreaded core
As our tests have shown, a vCPU is most definitely not a full core. It's half of a shared core, or one hyperthread.
A side effect: inconsistent performance
There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:
[ec2-user@ip-10-145-209-67 ~]$ grep 'core id' /proc/cpuinfo
core id         : 0
core id         : 0
core id         : 0
core id         : 0
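For comparison, the kernel also exposes sibling information under sysfs; a minimal check looks like this (a sketch; on bare metal this reveals which logical CPUs share a physical core, while here it can only mirror the misleading physical id and core id values above):

# Which logical CPUs does the kernel believe share a core with each CPU?
grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list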
This means that the OS scheduler has no way of knowing which processors share a core, and cannot schedule tasks around it. Let's go back to our two-thread test, but instead of restricting it to two specific processors, we'll let it run on any of them.
[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s
We see throughput varying between 82.8 MB/s and 120 MB/s for the exact same workload. To get some more performance information, we'll configure top to take ten samples at 3-second intervals, with per-processor usage information:
[ec2-user@ip-10-145-209-67 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-145-209-67 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.7%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  1.4%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu2  : 96.0%us,  4.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.0%sy,  0.0%ni, 97.9%id,  0.0%wa,  0.7%hi,  0.0%si,  0.3%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:06.08 gzip
Here two non-adjacent CPUs are in use. But 3 seconds later, the processes are running on adjacent CPUs:
top - 21:07:53 up 43 min,  2 users,  load average: 0.55, 0.45, 0.36
Tasks:  86 total,   4 running,  82 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.3%us,  3.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 96.0%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.0%si,  0.3%st
Cpu3  :  0.3%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.3%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1766 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip
 1768 ec2-user  20   0  4444  608  400 R 99.7  0.0   0:09.08 gzip
Although the usage percentages look the same in both samples, we saw earlier that throughput drops by a third when a core is shared, so overall throughput varies as the processes are context-switched between processors.
This type of situation arises with compute-intensive workloads when there are fewer runnable processes than CPU threads. If AWS reported the correct core IDs to the guest, this problem wouldn't arise: the OS scheduler would avoid putting two processes on the same core unless it had no other choice.
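Until then, the workaround is the same affinity trick used above: pin each compute-heavy process to a processor that testing showed sits on its own core. A sketch, assuming the 0/2 pairing found earlier:

# Run each gzip stream on a processor known (from the tests above) to be
# on a separate physical core, so the scheduler cannot co-locate them.
taskset -c 0 bash -c 'dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null' &
taskset -c 2 bash -c 'dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null' &
wait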
Here's a summary of the results:

Test                                       Throughput per process
One gzip, pinned to processor 0            121 MB/s
Two gzips, both pinned to processor 0      60.5 MB/s each
Two gzips, pinned to processors 0 and 1    79.9 MB/s each
Two gzips, pinned to processors 0 and 2    121 MB/s each
Two gzips, free to run on any processor    82.8 to 120 MB/s each
Summing up
Over the course of the testing I’ve learned two things:
- A vCPU in an AWS environment actually represents only half a physical core. So if you're looking for compute capacity equivalent to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. Take this into account in your costing models!
- The mislabeling of CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.
Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.
Comments
Good one Marc!!! I tested this on Oracle Cloud with the same commands; all their virtual CPUs seem to be full cores, not hyperthreads :)
Hi Vasu,
Interesting; the CPU specs do matter though. I’d take a shared core if it had twice the compute throughput, for example. Are you able to share the CPU specs and test throughput?
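For anyone who wants to post comparable numbers, the two useful data points are the CPU model string and the single-stream throughput, for example (same commands as in the post):

# CPU model as reported to the guest
grep -m1 'model name' /proc/cpuinfo
# Single-stream gzip throughput over 2.2GB of zeroes
dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null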
Marc
Great post, Marc!
A quick question: you mentioned that “the OS scheduler would make sure processes did not share cores unless necessary.” Is that guaranteed? For example, in the example where you tested two concurrent gzips with a 2-cpu affinity; is it guaranteed that the OS will never allocate CPU cycles in the same CPU for the two processes?
Hi Andre,
The Linux scheduler considers multiple execution threads on a single core to be a single “scheduler domain”, and attempts to schedule workload for each such domain evenly. Or, in other words, it’s SMT-aware.
Tons more detail: https://www.kernel.org/doc/ols/2005/ols2005v2-pages-201-212.pdf
Marc
Good post Marc, exactly what I was looking for!!
Hi
1. Thanks for that; I was looking for such a test before making a decision. It's quite strange that the OS doesn't know which is which. Do you think it would be different with their Windows clusters (I guess not)?
2. Does it mean that your CPU credits can sometimes be worth more, or does Amazon take these things into account?
Thanks
Hi glj,
With virtualization, the OS isn't talking to the hardware: it's talking to the hypervisor. So it just sees whatever the hypervisor tells it. And based on my tests, neither AWS nor Azure presents anything close to the real hardware configuration. I haven't tested with Windows, but from a technical perspective the hypervisor should be able to present whatever CPU type it wants.
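You can see the Xen layer directly from inside the guest, for example (a sketch, assuming the standard /sys/hypervisor interface is available on the instance):

# On a Xen guest, the hypervisor identifies itself under /sys/hypervisor
cat /sys/hypervisor/type
# Boot messages also show paravirtualized devices rather than real hardware
dmesg | grep -i xen | head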
As far as performance per CPU “core” goes, I'm indeed seeing large differences. Although performance varies greatly based on workload, for the test case I ran, AWS did give more “bang for the buck” in terms of performance.
Cheers!
Marc
Marc, good article. I tried this on a t2.medium, which they say is 20% x 2 vCPUs. They call those vCPUs cores in the FAQ about T2s. And indeed, it did show a vCPU to be a core, not a hyperthread. Do you see any issue with this methodology on burstable instances?
$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu MHz : 2494.098
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
processor : 1
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu MHz : 2494.098
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
Now, force the OS to use just “processor” 0.
$ taskset -pc 0 $$
pid 1353's current affinity list: 0,1
pid 1353's new affinity list: 0
Do our CPU work
$ dd if=/dev/zero bs=1M count=500 2> >(grep bytes >&2 ) | gzip -c > /dev/null
524288000 bytes (524 MB) copied, 4.07044 s, 129 MB/s
Ok, got 129 MB/s
Now, allow it to use both “processors”:
$ taskset -pc 0,1 $$
pid 1353's current affinity list: 0
pid 1353's new affinity list: 0,1
Run two jobs at the same time:
$ for i in {1..2}; do dd if=/dev/zero bs=1M count=500 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
[1] 1385
[2] 1388
524288000 bytes (524 MB) copied, 4.04859 s, 129 MB/s
524288000 bytes (524 MB) copied, 4.05697 s, 129 MB/s
Shows each job getting the same performance. Two cores indeed here?
Hi Tony,
Looks like you're definitely getting two full cores of performance. I haven't explored T2 instances in detail, so the question here might be: how does throughput change over time if you run the same workload for hours or days?
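One way to check would be to rerun the same single-stream test in a loop and log throughput over time, watching for a drop when CPU credits run out (a sketch; the interval and log file name are arbitrary):

# Every 5 minutes, run the gzip test once and append a timestamped
# throughput line; a sustained drop suggests exhausted CPU credits.
while true; do
    result=$( { dd if=/dev/zero bs=1M count=500 | gzip -c > /dev/null; } 2>&1 | grep bytes )
    echo "$(date -u +%FT%TZ) $result" >> gzip_throughput.log
    sleep 300
done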
Marc
It's because the t2 instance family starts out with a credit balance given at launch, which allows you to use a full vCPU.
As documented at https://aws.amazon.com/ec2/instance-types/#intel – “Each vCPU is a hyperthread of an Intel Xeon core except for T2 and m3.medium.”
Marc Fielding: did you run this measurement on a real Intel CPU based computer? I mean, hardware in your own hands. 121 MB/s seems like low performance.
On my laptop with an Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz, I am seeing similar numbers: 138 MB/s instead of 121 MB/s.
It seems that your server actually has a Sandy Bridge E5-2670 processor (“v1”). An Ivy Bridge part would be tagged “v2” in /proc/cpuinfo.
Given the recent change of Oracle policy regarding authorised cloud providers, where Oracle has removed the core factor for AWS and Azure (https://www.theregister.co.uk/2017/01/30/oracle_effectively_doubles_licence_fees_to_run_in_aws/), I was wondering whether you have any data on AWS dedicated hosts, and whether the vCPUs of a dedicated host are fixed to the physical CPUs of the underlying machine. Is there a way to test this?
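One way to probe this, reusing the affinity test from the post, is to pin pairs of gzip streams to candidate vCPU pairs and compare throughput (a sketch; which vCPU numbers to try depends on the instance size):

# If a vCPU pair behaves like processors 0 and 1 in the post (per-process
# throughput drops by about a third), they are hyperthreads of one core.
for pair in "0 1" "0 2"; do
    echo "=== vCPUs $pair ==="
    for cpu in $pair; do
        taskset -c "$cpu" bash -c \
            'dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null' &
    done
    wait
done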
Brilliant analysis! Very helpful – thanks!
I wonder if the recent AWS switch to use kvm on the newer machines will change anything here?