Computer networks have traditionally played a secondary role in application design. The network matters because your services need to talk to each other. It needs to be as fast as possible, secure, and backed by the required SLAs, with availability and redundancy higher than your application's. The list of requirements grows quickly as the application scales. All of a sudden you are planning a CDN and working out whether an L7 or an L3 load balancer suits your case better. Add cloud concerns and the network doesn't seem secondary anymore.
Networking was transformed by the introduction of a variety of cloud services, and we need to start thinking about networks differently. Network planning and the way network engineers operate within cloud infrastructure are no longer the same. Monitoring and troubleshooting in an environment with limited network visibility require creativity.
Network planning used to be a long process that required a lot of resources before even starting to write code, but not anymore.
|Traditional networking|Cloud networking|
|---|---|
|Logical network planning|Logical planning|
|Capacity planning|N/A *|
|Power requirements|N/A **|
|Life-cycle planning|N/A ***|
*Capacity planning is required only in extreme cases of traffic-intensive applications. Even then, knowing the limits of the virtual instances is enough.
**The only time you need to think about power is when you are trying to better understand availability zones.
***Life-cycle planning is moving to the application and service level and no longer applies at the networking layer.
The table above shows how network planning changed.
- Network engineers need to think mostly from L3 and above when dealing with cloud infrastructure.
- No wires, no power, no cooling, no physical infrastructure.
- L2 matters only for interconnectivity, a few specific services, and edge-case troubleshooting.
- More smart services act above L3, such as caching, L7 load balancing, CDN, geo-based DNS, and serverless functions.
- Availability, scalability, and reliability became an architectural task and the cloud vendor's headache.
Some limitations come with the cloud and are tricky for applications that rely on network services:
- Dynamic routing is tricky and limited depending on the cloud provider you’re using.
- Watch for IP address conflicts, which can turn network interconnectivity (office to cloud, or one cloud account to another) into a nightmare.
- ACLs and security groups are limited in the number of entries they can hold.
- VPN connectivity with other cloud providers or with hardware solutions is not that straightforward.
- Services like split-horizon DNS may work only within the native cloud environment and are unsuitable for the parts of the infrastructure that sit outside the cloud ecosystem.
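The IP address conflicts mentioned in the list above can be caught before any interconnect is built. The following is a minimal sketch using Python's standard `ipaddress` module; the network names and CIDR ranges in `plan` are hypothetical and only illustrate the check.

```python
import ipaddress
from itertools import combinations


def find_overlaps(networks):
    """Return pairs of network names whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan: names and ranges are illustrative only.
plan = {
    "office": "10.0.0.0/16",
    "cloud-prod": "10.0.128.0/20",
    "cloud-staging": "10.1.0.0/16",
}

for name_a, name_b in find_overlaps(plan):
    print(f"conflict: {name_a} overlaps {name_b}")
```

Running a check like this against every office, VPC, and partner range before peering or setting up a VPN is far cheaper than renumbering a live network afterwards.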
A lot of things can be fixed on the go in cloud networking, but those changes sometimes lead to the same complexity as in traditional networking.
There are some common mistakes that can be avoided with proper planning:
- Thinking in VLANs instead of in IP networks and routing.
- Trying to replicate all traditional network elements, such as NAT and DPI boxes.
- Planning IP addressing without understanding that external IP addresses are no longer routed into the cloud infrastructure; they are attached to interfaces.
- Creating rudimentary backbone-like bottlenecks to gain visibility into or control over the traffic.
Maintaining Network Infrastructure
Maintaining cloud network infrastructure is, by comparison, paradise for network engineers. They don't need to worry about router firmware, fan noise, or server room temperature.
|Working with Traditional Networking|Working with Cloud Networking|
|---|---|
|Physical access|No physical access|
|Mainly manual, with simple scripted automation|Manual as well as automated using cloud SDK/API|
|Multiple devices to talk to deploy infrastructure|Single point of sending commands during the network deployment|
|Hard to code|Easy to code|
|A lot of visibility for troubleshooting|Limited, close to no visibility for troubleshooting|
Network engineers can work on their cloud network using either their own scripts or well-known tools, and they can use cloud-native network services or build their own. They are no longer tied to their data centers and, with cloud infrastructure, might never know where those data centers are. They are more comfortable setting up new infrastructure because it can be coded and doesn't require deploying new devices. The downside is that a mistake can destroy or misconfigure production infrastructure.
Keep in mind that the cost of maintaining a traditional network is much higher than in the cloud, and in most cases internal traffic costs are insignificant.
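One common way to reduce the risk of a coded mistake reaching production is to compute a plan first and refuse destructive changes unless they are explicitly confirmed. The sketch below is an illustration of that idea in plain Python, not any vendor's SDK; `Change`, `plan_changes`, `apply_plan`, and the resource names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str    # "create" or "delete"
    resource: str  # hypothetical resource name


def plan_changes(current, desired):
    """Diff two sets of resource names into an ordered list of changes."""
    creates = [Change("create", r) for r in sorted(desired - current)]
    deletes = [Change("delete", r) for r in sorted(current - desired)]
    return creates + deletes


def apply_plan(changes, allow_delete=False):
    """Return the changes to apply; refuse deletions unless explicitly allowed."""
    destructive = [c for c in changes if c.action == "delete"]
    if destructive and not allow_delete:
        names = ", ".join(c.resource for c in destructive)
        raise PermissionError(f"refusing to delete without confirmation: {names}")
    return list(changes)


# Hypothetical current and desired network state.
current = {"subnet-a", "subnet-b", "route-table-main"}
desired = {"subnet-a", "subnet-c", "route-table-main"}

changes = plan_changes(current, desired)
for change in changes:
    print(change.action, change.resource)
```

Here the plan contains one creation and one deletion, so `apply_plan(changes)` raises until someone reviews the plan and passes `allow_delete=True`.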
Network Monitoring and Troubleshooting
Network monitoring needs to be reconsidered in an infrastructure with limited or no access to network devices.
|Traditional Network Monitoring|Cloud Network Monitoring|
|---|---|
|Network device counters and logs|Application-generated alerts|
|Automatic failover in the case of network failure|Failover is not visible to the application|
|Visibility during light network degradation (e.g. 5% packet loss)|No visibility during light network degradation|
|Isolated issue, harder to detect and fix|Mostly spread across multiple customers, fixed by the cloud vendor|
Design the application with network monitoring components built in: timeouts, failover to a backup host, checksums, response codes and confirmations, statelessness, and autoscaling.
What does this all mean?
- Network monitoring is performed at the virtual OS level, but a noisy neighbor is hard to detect.
- Network monitoring of cloud-native services can be done by tracking endpoint response times at the application level.
- If there is a network-related issue in the cloud, your account won't be the only one affected, and the cloud vendor's support team will most likely know the ETA for the fix.
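Timeouts, failover to a backup host, and response-time tracking can live in one small wrapper inside the application. The following is a minimal sketch under stated assumptions: `call_with_failover`, the `send` callable, the host names, and port 443 are hypothetical, and a simulated sender stands in for a real request so the example runs without network access.

```python
import socket
import time


def call_with_failover(hosts, send, timeout_s=2.0):
    """Try each host in order; return (host, response, elapsed_seconds).

    `send(host, timeout_s)` is supplied by the caller and is expected to
    raise OSError (which covers timeouts and refused connections) on failure.
    """
    errors = []
    for host in hosts:
        started = time.monotonic()
        try:
            response = send(host, timeout_s)
        except OSError as exc:
            errors.append((host, exc))
            continue
        return host, response, time.monotonic() - started
    raise ConnectionError(f"all hosts failed: {errors}")


def tcp_probe(host, timeout_s):
    """One possible real sender: open and close a TCP connection on port 443."""
    with socket.create_connection((host, 443), timeout=timeout_s):
        return "connected"


def fake_send(host, timeout_s):
    """Simulated sender: the primary times out, the backup answers."""
    if host == "primary.internal.example":
        raise TimeoutError("simulated timeout")
    return "ok"


host, response, elapsed = call_with_failover(
    ["primary.internal.example", "backup.internal.example"], fake_send
)
print(host, response)
```

The elapsed time returned for each successful call is the application-level signal: logged over time, it shows degradation that device counters would have shown in a traditional network.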
All cloud vendors have service status pages. Remember that those pages are mostly updated by humans, so if you're the first customer to detect an issue, expect to see all green on the status page.
There are services on the internet where customers report issues with cloud vendors and SaaS applications before those issues are acknowledged by the vendors' customer support teams.
Troubleshooting is a critical skill, and identifying the root cause of a networking issue is not trivial. It's even more complicated when you are trying to troubleshoot networks you don't know. And it is more complicated still when the network isn't your own, you have no visibility into the devices supporting it, and you're remote.
How should network engineers troubleshoot cloud networking?
- Know the weak spots: signs of a noisy neighbor, signs of potential node degradation or failure, API rate limits, limits on the number of ACL entries, and node network limits.
- Know your application to be able to separate application issues from network issues.
- Call cloud vendor support, because they can run node and network health checks.
- Know what the normal performance of the network is (run load tests and compare the results with measurements taken during degradation).
- Run multipoint network connectivity tests within the cloud environment (some availability zones behave differently during peak hours).
- Expect different response times from cloud-native services at different times of day, and recommend application design changes accordingly.
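Comparing current measurements against a known baseline, per availability zone, can be as simple as comparing medians. The sketch below assumes latency samples have already been collected in milliseconds; the zone names, sample values, and the 1.5x tolerance are hypothetical and should be tuned to your own load-test results.

```python
import statistics


def is_degraded(baseline_ms, current_ms, tolerance=1.5):
    """True when the current median latency exceeds tolerance x baseline median."""
    return statistics.median(current_ms) > tolerance * statistics.median(baseline_ms)


def degraded_zones(baseline, current, tolerance=1.5):
    """Return the sorted zone names whose current latency looks degraded."""
    return sorted(
        zone
        for zone, samples in current.items()
        if zone in baseline and is_degraded(baseline[zone], samples, tolerance)
    )


# Hypothetical latency samples in milliseconds, keyed by availability zone.
baseline = {
    "zone-a": [12.0, 11.5, 13.1, 12.4, 11.9],
    "zone-b": [12.2, 11.8, 12.9, 12.5, 12.1],
}
peak_hours = {
    "zone-a": [12.6, 12.1, 13.4, 12.8, 12.3],
    "zone-b": [19.8, 21.2, 18.4, 25.0, 20.1],
}

print(degraded_zones(baseline, peak_hours))
```

The median is used here because a single slow sample should not raise an alarm; a percentile-based check is a reasonable alternative when tail latency matters more.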
Incidents caused by network issues are rare in cloud networking. In a properly planned, reliable, and redundant application, troubleshooting usually comes down to acknowledging that something happened and that the application recovered by itself. If a big outage happens, there is not much a network engineer can do except help execute the disaster recovery strategy.
Network planning, maintenance, monitoring, and troubleshooting changed with the growth of cloud and cloud-native services. There are a lot of other things to consider that are network related. Network and application security are tied together. Cloud-native network services can provide better use of engineering time and ease of working remotely. Network-based application performance is improving all the time by getting closer to the customer’s devices. If you’re not sure where your network is on its transformation journey, don’t hesitate to reach out to experts.