Virtual CPUs with Google Compute Engine

Pythian Marketing

February 25, 2015

Tags: Cloud

Continuing my series on virtual CPU behavior in Amazon Web Services, Amazon Web Services HVM instances, and Microsoft Azure, I’m taking a look at Google Compute Engine (GCE). GCE is a relative newcomer to the public cloud world, having become generally available in December 2013. It does have some interesting features, including transparent maintenance through live migration and automatic sustained-use discounts without upfront commitments.

Unlike Amazon or Microsoft, Google is very upfront about its vCPU definition:

For the n1 series of machine types, a virtual CPU is implemented as a single hyperthread on a 2.6GHz Intel Sandy Bridge Xeon or Intel Ivy Bridge Xeon (or newer) processor. This means that the n1-standard-2 machine type will see a whole physical core.

I still believe calling such a hyperthread a “virtual CPU” is misleading. When creating a virtual machine in a non-cloud VM platform, 1 virtual CPU = 1 physical core. Plain and simple. But when using a cloud platform, I need 2 virtual CPUs to get that same physical core.

[Figure: CPU comparison]

Anyways, off to run some CPU tests. n1-standard-4 is a close match to the m3.xlarge instances previously tested, so I’ll try that.

Getting set up on Google Compute Engine

I already signed up for Google Compute Engine’s free trial and created a project I’m calling marc-cputest. With the gcloud compute command-line tools installed, the next step is logging in and creating an instance:

[marc@quartz ~]$ gcloud auth login --no-launch-browser
Go to the following link in your browser:
    https://accounts.google.com/o/oauth2/auth?redirect_uri=urn%3Aietf%(redacted)&access_type=offline
Enter verification code: (redacted)
Saved Application Default Credentials.
You are now logged in as [fielding@pythian.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT
[marc@quartz ~]$ gcloud config set project marc-cputest
[marc@quartz ~]$ gcloud config set compute/zone us-central1-a
[marc@quartz ~]$ gcloud compute instances create cpu-test-n4 --image centos-6 --machine-type "n1-standard-4" --zone us-central1-a
Created [https://www.googleapis.com/compute/v1/projects/marc-cputest/zones/us-central1-a/instances/cpu-test-n4].
NAME        ZONE          MACHINE_TYPE  INTERNAL_IP    EXTERNAL_IP   STATUS
cpu-test-n4 us-central1-a n1-standard-4 10.240.222.194 104.154.75.96 RUNNING
[marc@quartz ~]$ gcloud compute ssh cpu-test-n4
WARNING: You do not have an SSH key for Google Compute Engine.
WARNING: [/usr/bin/ssh-keygen] will be executed to generate a key.
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/marc/.ssh/google_compute_engine.
Your public key has been saved in /home/marc/.ssh/google_compute_engine.pub.
The key fingerprint is:
(redacted)
Updated [https://www.googleapis.com/compute/v1/projects/marc-cputest].
Warning: Permanently added '104.154.75.96' (RSA) to the list of known hosts.
Warning: Permanently added '104.154.75.96' (RSA) to the list of known hosts.

OK, the instance is all set and connected. Because this is a CentOS 6 image, it doesn’t allow SSH root logins by default, so running the gcloud tools as the root user will get you “permission denied” errors on SSH. Serves me right for trying to run these tools as root in the first place :-).

Looking around

Checking what they gave us:

[marc@cpu-test-n4 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
processor       : 1
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
processor       : 2
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
processor       : 3
model name      : Intel(R) Xeon(R) CPU @ 2.60GHz
cpu MHz         : 2600.000
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2

Google has redacted the exact CPU model numbers, but has clearly marked this as a 2-core system with core IDs 0 and 1.
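The same conclusion can be reached without reading the listing by eye. The following is a minimal sketch that counts logical processors and distinct (physical id, core id) pairs in /proc/cpuinfo-style text; cpu_topology is a hypothetical helper name, and the result only reflects what the guest is told, not necessarily the real hardware layout.

```shell
# Summarise CPU topology from /proc/cpuinfo-style text on stdin.
# cpu_topology is a hypothetical helper, not a standard tool.
cpu_topology() {
    awk -F: '
        # Strip surrounding whitespace from the key and the value.
        { key = $1; val = $2
          gsub(/^[ \t]+|[ \t]+$/, "", key)
          gsub(/^[ \t]+|[ \t]+$/, "", val) }
        key == "processor"   { logical++ }
        key == "physical id" { phys = val }
        # A core is identified by its (physical id, core id) pair.
        key == "core id"     { cores[phys "/" val] = 1 }
        END {
            n = 0
            for (c in cores) n++
            printf "logical=%d cores=%d\n", logical, n
        }'
}
```

Running `cpu_topology < /proc/cpuinfo` on the instance above should report four logical processors on two cores, matching the listing.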

The single-CPU case

[marc@cpu-test-n4 ~]$ taskset -pc 0 $$
pid 1558's current affinity list: 0-3
pid 1558's new affinity list: 0
[marc@cpu-test-n4 ~]$  dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 14.3427 s, 151 MB/s
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 29.3081 s, 74.1 MB/s
2170552320 bytes (2.2 GB) copied, 29.3065 s, 74.1 MB/s

We get a nice boost in raw CPU numbers compared to the 120 MB/s I saw in AWS. With two processes sharing this CPU, we see a tiny bit less than half the throughput each.
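Since the same dd-into-gzip test is repeated with different job counts and CPU lists, it can be wrapped in a small function. This is only a sketch: run_gzip_jobs is a hypothetical name, and it pins each command with `taskset -c` rather than changing the affinity of the interactive shell as the transcripts here do.

```shell
# run_gzip_jobs JOBS MEGABYTES [CPU_LIST]
# Starts JOBS parallel "dd | gzip" pipelines, each compressing MEGABYTES of
# zeros, optionally pinned to CPU_LIST (e.g. "0" or "0,1"). dd's summary,
# which includes the throughput line, is printed on stdout.
run_gzip_jobs() {
    n=$1; mb=$2; cpus=${3:-}
    # Optional pinning prefix; empty means "let the scheduler decide".
    pin=""
    if [ -n "$cpus" ]; then
        pin="taskset -c $cpus"
    fi
    i=0
    while [ "$i" -lt "$n" ]; do
        # dd reports on stderr; route it to our stdout via fd 3,
        # while the compressed data itself is discarded.
        { $pin dd if=/dev/zero bs=1M count="$mb" 2>&3 | $pin gzip -c >/dev/null; } 3>&1 &
        i=$((i + 1))
    done
    wait
}
```

For example, `run_gzip_jobs 2 2070 0 | grep bytes` would repeat the two-process test on CPU 0.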

Sharing the cores (or trying to)

[marc@cpu-test-n4 ~]$ taskset -pc 0,1 $$
pid 1558's current affinity list: 0
pid 1558's new affinity list: 0,1
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
[1] 1803
[2] 1805
[marc@cpu-test-n4 ~]$ 2170552320 bytes (2.2 GB) copied, 14.6959 s, 148 MB/s
2170552320 bytes (2.2 GB) copied, 14.7139 s, 148 MB/s

This is interesting: we see almost full-speed throughput even though processors 0 and 1 are reported as sharing a core. With processors 0 and 2, the situation is the same.

[marc@cpu-test-n4 ~]$ taskset -pc 0,2 $$
pid 1558's current affinity list: 0,1
pid 1558's new affinity list: 0,2
[marc@cpu-test-n4 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
[1] 1830
[2] 1833
[marc@cpu-test-n4 ~]$ 2170552320 bytes (2.2 GB) copied, 14.6683 s, 148 MB/s
2170552320 bytes (2.2 GB) copied, 14.6692 s, 148 MB/s

Is the CPU scheduler ignoring my taskset commands? Running mpstat with 2-second samples during the tests shows actual CPU usage:

[marc@cpu-test-n4 ~]$ mpstat -P ALL 2
...
06:08:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:46 PM  all   46.31    0.00    3.75    0.00    0.00    0.00    0.00    0.00   49.94
06:08:46 PM    0   93.50    0.00    6.50    0.00    0.00    0.00    0.00    0.00    0.00
06:08:46 PM    1   92.00    0.00    8.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:46 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:46 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
...
06:08:52 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:54 PM  all   46.75    0.00    3.25    0.00    0.00    0.00    0.00    0.00   50.00
06:08:54 PM    0   93.47    0.00    6.53    0.00    0.00    0.00    0.00    0.00    0.00
06:08:54 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:08:54 PM    2   93.50    0.00    6.50    0.00    0.00    0.00    0.00    0.00    0.00
06:08:54 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

So the taskset commands are working: when we ask for CPUs 0 and 1, we get them, but throughput shows that cores aren’t being shared. This means that the CPUs in the virtual machine are not statically bound to hardware threads, as they were under AWS. I’d call it a win, as it gives more consistent performance even if the guest operating system is forced to make poor CPU scheduling decisions, as in this case.
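The check done by eye on the mpstat output can be sketched as a filter that lists which CPUs were busy in each sample. busy_cpus is a hypothetical helper; it assumes the CPU column is labelled "CPU" in the header and that %idle is the last column, as in the output above.

```shell
# busy_cpus [MAX_IDLE]
# Reads "mpstat -P ALL" output on stdin and prints the id of every CPU whose
# %idle is below MAX_IDLE (default 50). The "all" summary row is skipped.
busy_cpus() {
    awk -v max_idle="${1:-50}" '
        # Header line: remember which column holds the CPU id.
        { for (i = 1; i <= NF; i++) if ($i == "CPU") { col = i; next } }
        # Data line: numeric CPU id, last column (%idle) below the threshold.
        col && $col ~ /^[0-9]+$/ && $NF + 0 < max_idle { print $col }
    '
}
```

Piping the first sample above through `busy_cpus` lists CPUs 0 and 1, and the second sample lists CPUs 0 and 2, which is what confirms that the affinity settings were honored.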

[marc@cpu-test-n4 ~]$ taskset -pc 0-3 $$
pid 1558's current affinity list: 0,2
pid 1558's new affinity list: 0-3
[marc@cpu-test-n4 ~]$ for i in {1..4}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 22.9823 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 22.9914 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 22.9915 s, 94.4 MB/s
2170552320 bytes (2.2 GB) copied, 23.1333 s, 93.8 MB/s

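This is closer to the throughput we would expect with two shared cores: with all four virtual CPUs busy, each process gets about 94 MB/s, well short of the roughly 150 MB/s a process gets when it has a core to itself.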
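To put the runs side by side, it helps to add up the per-process rates and compare the total with the single-process baseline of 151 MB/s. The helper below is a small sketch of that arithmetic; scaling is a hypothetical name, and the rates are simply the figures dd reported in the runs above.

```shell
# scaling BASELINE RATE...
# Sums the per-process rates (MB/s) and prints the total along with its
# ratio to the single-process baseline.
scaling() {
    baseline=$1; shift
    printf '%s\n' "$@" | awk -v base="$baseline" '
        { total += $1 }
        END { printf "total=%.1f MB/s ratio=%.2f\n", total, total / base }'
}

# Two processes pinned to CPU 0:      scaling 151 74.1 74.1
# Two processes on CPUs 0,1 or 0,2:   scaling 151 148 148
# Four processes on CPUs 0-3:         scaling 151 94.4 94.4 94.4 93.8
```

With these figures, two processes on separate virtual CPUs deliver nearly twice the baseline, while four processes deliver about two and a half times, consistent with four hyperthreads sharing two physical cores.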

Lessons learned

Over the course of these tests, I’ve discovered a few things:

Although Google, like its competitors, counts a hyperthread rather than a full core as a virtual CPU, it is very upfront about this behavior.
Actual throughput for a simple gzip workload is excellent.
Google Compute Engine has an abstraction layer in front of CPUs that dynamically schedules tasks between CPU threads, in addition to the regular scheduler in the virtual machine. In my testing, it allocates tasks efficiently across CPU cores, even when the OS scheduler is configured suboptimally.
