vCPU sharing in EC2: HVM to the rescue?

I’ve been doing some testing to clarify what a vCPU in Amazon Web Services actually is. Over the course of that testing, I saw inconsistent results in a 2-thread test on a 4-vCPU m3.xlarge system, caused by the Linux kernel mislabeling the vCPUs as independent single-core processors. The issue shows up in CPU-bound, multithreaded workloads where there is idle CPU time.

My test environment used a paravirtualized (PV) kernel, which moves some of the virtualization logic into the Linux kernel, reducing the need for high-overhead hardware emulation. One drawback is that the kernel cannot be modified to, for example, resolve the CPU mislabeling. But there is an alternative: an HVM system, which relies on virtualization extensions in the CPU hardware and allows custom kernels or even non-Linux operating systems to run. Historically the drawback has been a performance hit, though I read a very interesting post on Brendan Gregg’s blog indicating that what’s called HVM in Amazon EC2 is actually a hybrid, combining aspects of both PV and HVM. A test run by Phoronix on EC2 showed HVM performance on par with PV, and in some cases even better. So it definitely seems worth repeating my earlier tests on an HVM instance.
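
As an aside, one rough way to check which mode a running instance is using from the inside is the kernel’s boot messages. This is just a sketch, not an official interface, and the exact wording varies by kernel version:

# Rough check of the virtualization mode from inside a running instance.
# A PV guest typically logs "Booting paravirtualized kernel on Xen";
# an HVM guest typically reports bare hardware (or Xen HVM) instead.
[ec2-user@… ~]$ dmesg | grep -i 'booting paravirtualized kernel'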

As before, I fire up an instance, but this time using the latest HVM Amazon Linux image:

$ ec2-describe-images ami-76817c1e -H
Type    ImageID Name    Owner   State   Accessibility   ProductCodes    Architecture    ImageType       KernelId        RamdiskId Platform        RootDeviceType  VirtualizationType      Hypervisor
IMAGE   ami-76817c1e    amazon/amzn-ami-hvm-2014.03.2.x86_64-ebs        amazon  available       public          x86_64  machine                           ebs     hvm     xen
BLOCKDEVICEMAPPING      /dev/xvda               snap-810ffc56   8
$ ec2-run-instances ami-76817c1e -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION     r-a4f480da      462281317311    default
INSTANCE        i-c5d5b6ef      ami-76817c1e                    pending marc-aws        0               m3.xlarge       2014-06-23T19:02:18+0000  us-east-1d                              monitoring-disabled                                     ebs                                       hvm     xen             sg-5fc61437     default
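
The run-instances output above already shows hvm and xen in the last columns; if you want to confirm it again later, ec2-describe-instances on the instance ID should report the same virtualization-type field (a quick check using the same EC2 API tools):

# Re-check the instance's virtualization type after launch; the INSTANCE
# line should again show "hvm" alongside the xen hypervisor column.
$ ec2-describe-instances i-c5d5b6ef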

Checking in on CPUs:

[ec2-user@… ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz         : 2593.949
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

It’s the same 2.6GHz E5-2670 processor, but this time it’s reported as a single-socket, non-hyperthreaded quad-core processor. That’s still not the dual-core hyperthreaded slice we’re actually getting, though.
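
If you’d rather not eyeball /proc/cpuinfo, the same topology information can be pulled out with lscpu and from sysfs; a minimal sketch, assuming lscpu is present on the image (it ships with util-linux):

# Sockets, cores per socket and threads per core as the kernel sees them.
[ec2-user@… ~]$ lscpu | egrep 'Socket|Core|Thread|^CPU\(s\)'
# Per-vCPU view: the core each vCPU claims to belong to, and which vCPUs
# it reports as its hyperthread siblings.
[ec2-user@… ~]$ for c in /sys/devices/system/cpu/cpu[0-3]; do
>   echo "$(basename $c): core_id $(cat $c/topology/core_id), siblings $(cat $c/topology/thread_siblings_list)"
> done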

Time to run a few tests.

[ec2-user@… ~]$ taskset -pc 0 $$
pid 1768's current affinity list: 0-3
pid 1768's new affinity list: 0
[ec2-user@… ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 18.1955 s, 119 MB/s
[ec2-user@… ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 36.4968 s, 59.5 MB/s
2170552320 bytes (2.2 GB) copied, 36.506 s, 59.5 MB/s

This is in the same range as with PV, but 1-2% slower, so we’re seeing a small amount of HVM overhead. Let’s try across processors:

[ec2-user@… ~]$ taskset -pc 0,1 $$
pid 1768's current affinity list: 0
pid 1768's new affinity list: 0,1
[ec2-user@… ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 27.8401 s, 78.0 MB/s
2170552320 bytes (2.2 GB) copied, 27.8398 s, 78.0 MB/s
[ec2-user@… ~]$ taskset -pc 0,2 $$
pid 1768's current affinity list: 0,1
pid 1768's new affinity list: 0,2
[ec2-user@… ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 18.1849 s, 119 MB/s
2170552320 bytes (2.2 GB) copied, 18.2014 s, 119 MB/s

Again, a tiny bit slower than with PV. To test variability, I’ll kick off 20 consecutive runs and print a histogram of the results:

[ec2-user@… ~]$ taskset -pc 0-3 $$
pid 1768's current affinity list: 0,2
pid 1768's new affinity list: 0-3
[ec2-user@… ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
>  wait
> done
...
[ec2-user@… ~]$ cat output | awk '/bytes/ {print $8,$9}' | sort -n | uniq -c
      1 113 MB/s
      3 114 MB/s
      4 115 MB/s
      6 116 MB/s
     10 117 MB/s
     10 118 MB/s
      6 119 MB/s

Running between 113 and 119 MB/s per thread: much less variability than before. In chart form:
[Chart: aws-cpu-hvm, histogram of per-thread throughput across the 20 runs]
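
If you want a single number for the spread rather than a histogram, a quick awk pass over the same output file gives the mean and standard deviation of the per-thread rates:

# Mean and (population) standard deviation of the MB/s column in "output".
[ec2-user@… ~]$ awk '/bytes/ {n++; s+=$8; ss+=$8*$8} END {m=s/n; printf "runs=%d mean=%.1f MB/s stddev=%.2f MB/s\n", n, m, sqrt(ss/n - m*m)}' output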

Looking at “top”:

[ec2-user@… ~]$ cat > ~/.toprc <<-EOF
> RCfile for "top with windows"           # shameless braggin'
> Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
> Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
>         winflags=25913, sortindx=10, maxtasks=2
>         summclr=1, msgsclr=1, headclr=3, taskclr=1
> Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
>         winflags=62777, sortindx=0, maxtasks=0
>         summclr=6, msgsclr=6, headclr=7, taskclr=6
> Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
>         winflags=62777, sortindx=13, maxtasks=0
>         summclr=5, msgsclr=5, headclr=4, taskclr=5
> Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
>         winflags=62777, sortindx=4, maxtasks=0
>         summclr=3, msgsclr=3, headclr=2, taskclr=3
> EOF
[ec2-user@… ~]$ top -b -n20 -U ec2-user
top - 20:31:51 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 22.9%us,  0.3%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 74.0%us,  3.0%sy,  0.0%ni, 23.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 59.7%us,  4.0%sy,  0.0%ni, 36.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 33.7%us,  2.7%sy,  0.0%ni, 63.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:08.92 gzip                                                         
top - 20:31:54 up 28 min,  2 users,  load average: 1.37, 1.17, 0.63
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 72.3%us,  4.3%sy,  0.0%ni, 23.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 21.3%us,  2.0%sy,  0.0%ni, 76.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:11.84 gzip
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:11.83 gzip                                                         
top - 20:31:57 up 28 min,  2 users,  load average: 1.34, 1.17, 0.64
Tasks:  82 total,   3 running,  79 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.3%us,  4.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 92.4%us,  7.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.74 gzip
 1953 ec2-user  20   0  4444  608  400 R 96.8  0.0   0:14.75 gzip                                                         
top - 20:32:00 up 28 min,  2 users,  load average: 1.32, 1.17, 0.64
Tasks:  82 total,   4 running,  78 sleeping,   0 stopped,   0 zombie
Cpu0  : 29.9%us,  1.7%sy,  0.0%ni, 68.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 63.0%us,  3.7%sy,  0.0%ni, 33.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 40.5%us,  2.3%sy,  0.0%ni, 57.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 55.3%us,  3.7%sy,  0.0%ni, 41.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1951 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.66 gzip
 1953 ec2-user  20   0  4444  608  400 R 97.1  0.0   0:17.67 gzip

We see that the work is split between adjacent CPUs, but the scheduler is doing a good job of keeping each pair of adjacent CPUs near 100% usage between them.
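
If you’d rather not carry a custom .toprc around, mpstat gives a similar per-CPU breakdown; a sketch, assuming the sysstat package is available (it may need installing first on this image):

# Install sysstat if needed, then sample all CPUs every 3 seconds for
# 20 iterations, roughly matching the top invocation above.
[ec2-user@… ~]$ sudo yum -y install sysstat
[ec2-user@… ~]$ mpstat -P ALL 3 20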

So, based on these tests, it looks like HVM, even though the CPU topology is still mislabeled, has almost entirely avoided the variability caused by shared-core scheduling, at the cost of a small reduction in overall throughput.


