Virtual CPUs with Amazon Web Services

Posted in: Technical Track

Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The description seems a bit confusing: is it a dedicated CPU core (which has two hyperthreads in the E5-2670 v2 CPU platform being used), or is it a half-core, single hyperthread?

I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is listed as having 4 vCPUs on a 2.5GHz Intel Xeon E5-2670 v2 (Ivy Bridge-EP) processor, though xlarge instances sometimes land on slightly faster 2.6GHz hardware.

Investigating for ourselves

I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:

$ ec2-describe-images ami-fb8e9292 -H
Type ImageID Name Owner State Accessibility ProductCodes Architecture ImageType KernelId RamdiskId Platform RootDeviceType VirtualizationType Hypervisor
IMAGE ami-fb8e9292 amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon available public x86_64 machine aki-919dcaf8 ebs paravirtual xen
BLOCKDEVICEMAPPING /dev/sda1 snap-b047276d 8

Launching the instance:

$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 pending marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled ebs paravirtual xen sg-5fc61437 default

The instance is up and running within a few minutes:

$ ec2-describe-instances i-b5f5a2e6 -H
Type ReservationID Owner Groups Platform
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 ec2-54-242-182-88.compute-1.amazonaws.com ip-10-145-209-67.ec2.internal running marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled 54.242.182.88 10.145.209.67 ebs paravirtual xen sg-5fc61437 default
BLOCKDEVICE /dev/sda1 vol-1633ed53 2014-06-16T20:23:52.000Z true

Logging in as ec2-user. First of all, let’s see what /proc/cpuinfo says:

[ec2-user@ip-10-145-209-67 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 1
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 2
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 3
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1

Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0; in other words, one single-core processor with 4 threads. We know that the E5-2670 v2 is actually a 10-core processor, so the information we see at the OS level doesn't quite correspond to the underlying hardware.

Nevertheless, we'll proceed with a few simple tests. I'm going to run "gzip", an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding the output, we avoid any effects of disk I/O. I'll combine this test with taskset commands to impose processor affinity on the processes; a small helper wrapping the whole thing is sketched below.
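
Here's that helper as a convenience sketch of mine (the gzip_bench name and defaults are not from the original session); it runs exactly the dd | gzip pipeline used step by step in the sections that follow:

# Convenience sketch: wrap the benchmark so it can be repeated with
# different CPU affinities and job counts.
# Usage: gzip_bench "<cpu-list>" <num-jobs>
gzip_bench() {
  local cpus=$1 jobs=${2:-1} i
  taskset -pc "$cpus" $$      # pin the current shell (and its children) to the given CPUs
  for i in $(seq 1 "$jobs"); do
    # compress 2.2GB of zeroes; dd reports throughput on stderr, gzip output is discarded
    dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null &
  done
  wait                        # wait for all background pipelines to finish
}

# Example: two concurrent gzips confined to processors 0 and 1
# gzip_bench 0,1 2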

A simple test

The simplest case: a single thread, on processor 0:

[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-145-209-67 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s

Pinned to a single processor, we get 121 MB/s. Let's try running two gzips at once: sharing a single processor, we should see half the throughput.

[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s

Sharing those cores

Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.

[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s

We have our answer: throughput has dropped by a third, to 79.9 MB/s, showing that processors 0 and 1 are threads sharing a single core. (Note that hyperthreading still gives a performance benefit here: 79.9 MB/s each on a shared core beats the 60.5 MB/s we saw when both processes shared a single hyperthread.)

Trying the exact same test, but this time, non-adjacent processors 0 and 2:

[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s

All the way back up to full speed, showing that these are dedicated cores.

What does this all mean? Let's go back to Amazon's vCPU definition:

Each vCPU is a hyperthreaded core

As our tests have shown, a vCPU is most definitely not a core. It's half of a shared core, or one hyperthread.

A side effect: inconsistent performance

There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:

[ec2-user@ip-10-145-209-67 ~]$ grep 'core id' /proc/cpuinfo
core id : 0
core id : 0
core id : 0
core id : 0
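
For comparison, on a host or hypervisor that exposes the real topology, the standard Linux views below would show distinct core ids for distinct physical cores; this is just a quick sketch using lscpu and sysfs (lscpu -e needs a reasonably recent util-linux), and on this instance it simply repeats the single core id 0 seen above:

# Show how logical CPUs map onto physical cores (when topology is exposed)
lscpu -e=CPU,CORE,SOCKET

# Raw sysfs view: which logical CPUs share a core with CPU 0
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list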

This means the OS scheduler has no way of knowing which processors share cores, and cannot schedule tasks around it. Let's go back to our two-thread test, but instead of restricting it to two specific processors, we'll let it run on any of them.

[ec2-user@ip-10-145-209-67 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-145-209-67 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s

We see throughput varying between 82.8 MB/s and 120 MB/s for the exact same workload. To get more detail, we'll configure top to take ten 3-second samples with per-processor usage information:

[ec2-user@ip-10-145-209-67 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows" # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
winflags=25913, sortindx=10, maxtasks=2
summclr=1, msgsclr=1, headclr=3, taskclr=1
Job fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
winflags=62777, sortindx=0, maxtasks=0
summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
winflags=62777, sortindx=13, maxtasks=0
summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr fieldscur=ABDECGfhijlopqrstuvyzMKNWX
winflags=62777, sortindx=4, maxtasks=0
summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-145-209-67 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.7%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 1.4%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu2 : 96.0%us, 4.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 1.0%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.7%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip

Here, two non-adjacent CPUs (0 and 2) are in use. But 3 seconds later, the processes are running on adjacent CPUs (0 and 1):

top - 21:07:53 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.3%us, 3.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 96.0%us, 3.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu3 : 0.3%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip

Although the usage percentages look the same, we saw earlier that throughput drops by a third when a core is shared, so throughput varies as the processes are context-switched between processors.

This type of situation arises when compute-intensive workloads are running and there are fewer runnable processes than CPU threads. If only AWS reported correct core IDs to the guest, this problem wouldn't happen: the OS scheduler would make sure processes did not share cores unless necessary.
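
Until that happens, one manual workaround is to impose the affinity yourself. Here's a rough sketch, assuming (as the tests above suggest) that vCPU pairs (0,1) and (2,3) each share a physical core; the input file names are just placeholders:

# Workaround sketch: pin two compute-heavy jobs to vCPUs 0 and 2 so that,
# under the assumed (0,1)/(2,3) core pairing, each job gets a whole core.
taskset -c 0 gzip -c < bigfile1 > bigfile1.gz &
taskset -c 2 gzip -c < bigfile2 > bigfile2.gz &
wait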

Here's a summary of the results:

Processors   Workload                           Throughput per process
0            one gzip                           121 MB/s
0            two gzips (shared hyperthread)     60.5 MB/s
0,1          two gzips (same core)              79.9 MB/s
0,2          two gzips (separate cores)         121 MB/s
0-3          two gzips (no affinity)            82.8-120 MB/s

Summing up

Over the course of the testing I’ve learned two things:

  • A vCPU in an AWS environment actually represents only half a physical core. So if you're looking for compute capacity equivalent to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. Take that into account in your costing models (see the sizing sketch after this list)!
  • The mislabeling of the CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.
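
To make the sizing rule from the first point concrete, here's a back-of-the-envelope helper (hypothetical, assuming the two-vCPUs-per-core mapping measured above):

# Hypothetical sizing helper: physical cores needed -> vCPUs to provision,
# assuming each vCPU is a single hyperthread (half a core) as measured above.
cores_to_vcpus() { echo $(( $1 * 2 )); }
cores_to_vcpus 8    # prints 16, i.e. roughly a 4xlarge instance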

Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.

About the Author

Marc is a passionate and creative problem solver, drawing on deep understanding of the full enterprise application stack to identify the root cause of problems and to deploy sustainable solutions. Marc has a strong background in performance tuning and high availability, developing many of the tools and processes used to monitor and manage critical production databases at Pythian. He is proud to be the very first DataStax Platinum Certified Administrator for Apache Cassandra.

17 Comments

Good one Marc !!! I tested this on Oracle Cloud with same commands, all their Virtual CPUs seem to be full Cores, not hyper-threads :)

Reply
Marc Fielding
June 25, 2014 12:39 pm

Hi Vasu,

Interesting; the CPU specs do matter though. I’d take a shared core if it had twice the compute throughput, for example. Are you able to share the CPU specs and test throughput?

Marc

Reply
André Araújo
July 20, 2014 3:17 am

Great post, Marc!
A quick question: you mentioned that “the OS scheduler would make sure processes did not share cores unless necessary.” Is that guaranteed? For example, in the test where you ran two concurrent gzips with a 2-CPU affinity, is it guaranteed that the OS will never allocate CPU cycles on the same CPU for the two processes?

Reply
Marc Fielding
July 20, 2014 7:00 pm

Hi Andre,

The Linux scheduler considers multiple execution threads on a single core to be a single “scheduler domain”, and attempts to schedule workload for each such domain evenly. Or, in other words, it’s SMT-aware.

Tons more detail: https://www.kernel.org/doc/ols/2005/ols2005v2-pages-201-212.pdf

Marc

Reply

Good post Marc, exactly what I was looking for!!

Reply

Hi

1. Thanks for that – I was looking for such a test before making a decision. It's quite strange that the OS doesn't know which is which – do you think it would be different with their Windows clusters (I guess not)?
2. Does it mean that your CPU credits can sometimes be worth more, or does Amazon take these things into account?

Thanks

Reply

Hi glj,

With virtualization, the OS isn't talking to the hardware: it's talking to the hypervisor, so it just sees whatever the hypervisor tells it. Based on my tests, neither AWS nor Azure presents anything close to the real hardware configuration. I haven't tested with Windows, but from a technical perspective the hypervisor should be able to present whatever CPU type it wants.

As far as performance per CPU “core” goes, I'm indeed seeing large differences. Although performance varies greatly based on workload, for the test case I ran, AWS did give more “bang for the buck” in terms of performance.

Cheers!

Marc

Reply

Marc, good article. I tried this on T2.medium which they say is 20% x 2 vCPUs. They called those vCPUs cores in FAQ about T2s. And indeed, it did show vCPU to be a core, not a hyper-thread. See any issue with this methodology on burstable instances?

$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu MHz : 2494.098
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
processor : 1
model name : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
cpu MHz : 2494.098
physical id : 0
siblings : 2
core id : 1
cpu cores : 2

Now, force OS to use just “processor” 0.

$ taskset -pc 0 $$
pid 1353's current affinity list: 0,1
pid 1353's new affinity list: 0

Do our CPU work

$ dd if=/dev/zero bs=1M count=500 2> >(grep bytes >&2 ) | gzip -c > /dev/null
524288000 bytes (524 MB) copied, 4.07044 s, 129 MB/s

Ok, got 129 MB/s

Now, allow it to use both “processors”:

$ taskset -pc 0,1 $$
pid 1353's current affinity list: 0
pid 1353's new affinity list: 0,1

Run two jobs at the same time:

$ for i in {1..2}; do dd if=/dev/zero bs=1M count=500 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
[1] 1385
[2] 1388
$ 524288000 bytes (524 MB) copied, 4.04859 s, 129 MB/s
524288000 bytes (524 MB) copied, 4.05697 s, 129 MB/s

Shows each job getting same performance. Two cores indeed here?

Reply

Hi Tony,

Looks like you’re definitely getting two full cores of performance. I haven’t explored T2 instances in detail, so the question here might be: how does throughput change over time if you run the same workload for hours or days.
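
For example, a throwaway loop along these lines (a sketch of mine, not something I've actually run on a T2) would log throughput over a long run, so you could watch for a drop as CPU credits run out:

# Hypothetical probe: repeat the gzip test once a minute and timestamp the
# throughput line, to see whether performance falls as T2 credits drain.
while true; do
  stats=$( { dd if=/dev/zero bs=1M count=500 | gzip -c > /dev/null; } 2>&1 | grep bytes )
  echo "$(date -u +%FT%TZ) $stats"
  sleep 60
done >> gzip-throughput.log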

Marc

Reply
Rafael Lopes
July 4, 2015 12:50 pm

It's because the t2 instance family comes with an initial CPU credit balance at launch, which allows you to use a full vCPU.

Reply
Andy Kilhoffer
January 4, 2017 7:35 pm

As documented at https://aws.amazon.com/ec2/instance-types/#intel – “Each vCPU is a hyperthread of an Intel Xeon core except for T2 and m3.medium.”

Reply

Marc Fielding: did you run the measurement on a real Intel CPU based computer? I mean, hardware in your own hands. 121 MB/sec seems like low performance.

Reply
Rune Philosof
October 10, 2016 7:43 am

On my laptop with an Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz, I am seeing similar numbers: 138 MB/s instead of 121 MB/s.

Reply

It seems that your server actually has a Sandy Bridge E5-2670 (“v1”) processor.
An Ivy Bridge part would be tagged “v2” in /proc/cpuinfo.

Reply

Given the recent change of Oracle policy regarding authorised cloud providers, where Oracle has removed the core factor for AWS and Azure (https://www.theregister.co.uk/2017/01/30/oracle_effectively_doubles_licence_fees_to_run_in_aws/), I was wondering whether you have any data on AWS dedicated hosts, and whether the vCPUs of a dedicated host are fixed to the physical CPUs of the underlying physical machine – is there a way to test this?

Reply
Abhimanyu Grover
May 29, 2017 12:19 am

Brilliant analysis! Very helpful – thanks!

Reply
Andrew Mackenzie
November 17, 2017 11:08 am

I wonder if the recent AWS switch to use kvm on the newer machines will change anything here?

Reply
