Network Basics for Development
This chapter is focused on explaining the basics and jargon used in computer networking. The idea is to build a good foundation to be used throughout the book.
If you are a network engineer or have experience in this field, you might want to skip it, or perhaps skim through it.
If you are a software developer with little network experience, this chapter is for you. It will help you build a solid base on network jargon that will be useful when writing code for network automation.
The following are the topics that we will cover in this chapter:
- Reviewing protocol layers, network device types, and network topologies
- Describing network architecture and its components
- Illustrating network management components, network bastions, and more
Reviewing protocol layers, network device types, and network topologies
We have lots to talk about here. But due to the size constraints of this book, I have organized a summary with the most important aspects of today’s network jargon and explained them briefly. I hope you can find some new information to help your automation work.
It’s important to note that there are several different standards for protocol layers. The most academic one is the OSI model, defined by the ISO, which has seven layers. In this book, however, we are going to consider only the five layers of the TCP/IP protocol stack, which is used on the internet. Here is a short summary of each layer:
- Physical layer: In this layer are the technologies involved in the physical connection itself where the bits and bytes are transformed into the physical medium, such as the light in fiber optics, electricity in a cable, and radio waves in antennas. At this layer, physical checks can be implemented on the node input, such as power levels, collision, noise, and signal distortion, among other types of checks.
- Data link layer: Here, the information is called a frame, and it has a limited size, known as the maximum transmission unit (MTU). The reason is that a frame is a data representation in bytes that has to move from one node to another reliably and without interruption. At this level, frame queues are present; the queues are used to place the frames on the physical layer in sequential order or in priority order. Some data link devices can prioritize certain types of frames, letting them jump to the front of the queue. At the data link layer, some checks are done, but within the frame itself, such as a cyclic redundancy check (CRC) or checksum. In addition, source and destination addresses can be added to the frame to differentiate destinations on a shared medium. The information in the frame is normally used locally within the same organization. Because Ethernet is the most common technology here, this layer is often referred to as the Ethernet layer.
- Network layer: This is also known as the IP layer, or the router layer. Here, the information is called a packet, and it carries the information that goes between nodes that are beyond the layer 2 domain (the previous Ethernet layer). This level is where the routing protocols are used, network address translation (NAT) does its job, some access control lists (ACLs) are applied, and control packets are exchanged, among other functions. The packet at this level has enough information to know where it came from and where it has to go. This layer is also responsible for fragmenting the packet into multiple packets if the frame MTU is smaller than the IP packet. The main information carried in the packet header is the pair of source and destination IP addresses.
- Transport layer: The transport layer deals with data that is called a segment. On today’s internet, two protocols dominate here: the User Datagram Protocol (UDP) and the Transmission Control Protocol (TCP). The idea is that one provides more confirmation and control than the other. TCP has traffic flow control, packet loss detection, and packet retransmission, among other functions. UDP, on the other hand, is just the IP packet plus a little more information. The idea behind having TCP is to enhance communication on the unreliable internet, so the application has a reliable transport method. TCP has more overhead, with additional header fields, and might be slower than UDP in some cases. The transport layer adds port numbers to the segment, which is carried inside the packet in the IP layer. The port number is used for two reasons: to designate which application is using the transport layer, such as port 80 for HTTP communication, and to associate it with a communication socket in the host. A port number is required for both the source and the destination, and together they designate the correct socket on each host.
- Application layer: This is the top of the layers, normally referred to by my professor as the cherry on the cake. An application layer is used to associate a socket on the host where data will be sent and received. The application normally handles the content of the data, such as page requests on HTTP. The software that we are producing in this book uses this layer to automate the network.
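To make the idea of a transport header and its port numbers more concrete, here is a minimal sketch that packs and unpacks a UDP header by hand. The port values and the payload are hypothetical, the checksum is simply left at zero, and in real code the operating system's socket API builds this header for you.

```python
import struct

# Hypothetical values chosen for illustration only.
SRC_PORT = 49152      # ephemeral source port picked by the sending host
DST_PORT = 53         # destination port (53 is the well-known DNS port)
payload = b"hello"    # application-layer data

# A UDP header has four 16-bit fields: source port, destination port,
# length (header plus payload, in bytes), and checksum.
# "!" selects network (big-endian) byte order; "H" is an unsigned 16-bit integer.
UDP_HEADER = struct.Struct("!HHHH")

length = UDP_HEADER.size + len(payload)
# The checksum is left at 0 in this sketch; a real stack would compute it.
header = UDP_HEADER.pack(SRC_PORT, DST_PORT, length, 0)
segment = header + payload

# The receiving side reads the same four fields back from the first 8 bytes.
src, dst, seg_len, checksum = UDP_HEADER.unpack(segment[:UDP_HEADER.size])
print(f"header bytes: {UDP_HEADER.size}, segment bytes: {seg_len}")
print(f"source port: {src}, destination port: {dst}")
```

The whole header is only 8 bytes, which illustrates why UDP was described above as the IP packet plus a little more information.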
LAN, WAN, internet, and intranet
LAN, or local area network, is used to refer to networks that are local. Nowadays, it means networks that use the data link layer as the main communication, such as Ethernet. The reason why the name is more related to the communication layer than the geography is that technology has evolved, allowing Ethernet switches to communicate over thousands of kilometers. So, a LAN normally designates a topology inside the same organization using Ethernet, but not necessarily geographically in the same location.
WAN, or wide area network, is used to refer to networks that are remotely connected, or technologies that allow nodes to be far apart, including now-extinct technologies such as X.25, Frame Relay, and Asynchronous Transfer Mode (ATM). Now, the term WAN is normally used to designate interfaces or networks that are connected to different networks, or in other words, networks that are not in the same organization, data link layer, or Ethernet domain.
For more information about ATM, please refer to the article Technology and Applications in SSRN Electronic Journal, June 1998, by Jeffrey Scott Ray.
The term intranet was used when corporations were using the internet protocols to communicate internally on their network. The reason is that other technologies were competing with the internet TCP/IP protocol at that time, such as SNA and IPX. So, when the term intranet was used, it was simply to state that the corporate network uses TCP/IP. Nowadays, intranet refers to a network that is within the same organization and not connected to external nodes. Therefore, the network is safe from external interference.
A point-to-point (P2P) connection is used to interconnect two nodes. A link between two nodes is normally a P2P connection (as shown in Figure 1.1), unless using media such as satellite or broadcast antennas. This connection can either be back to back or not. The term back to back is normally used to indicate that the nodes are connected directly without any other physical layer between them, such as repeaters. Therefore, back-to-back connections have limited distances due to the noise and distortion introduced in the connection as the wiring gets longer. Depending on the speed and the technology used, the distances are limited to within the same room or building.
Figure 1.1 – A P2P connection
Star or hub-spoke topologies
Star or hub-spoke topologies are used in small and medium companies, where one office is the main distributor and the other locations are consumers. The topology looks like a star, and network elements are smaller and simpler at the remote locations, while being larger and more complex at the main distributor (see the example in Figure 1.2).
Normally, these types of topologies can scale up to hundreds of nodes, but depending on the traffic, the requirements can scale to thousands. Let’s look at two examples that illustrate the scale of these topologies.
For instance, in a bank, the automated teller machines are distributed in remote locations, where the main computer is located in the main branch. This can scale to thousands of remote machines as the traffic requirements are small in terms of byte transfer on a teller machine.
On the other hand, if you have a supermarket chain using a star topology, it won’t scale to thousands of remote machines, as each supermarket requires a large amount of data transfer to handle all transactions and employees.
So, the use of star topologies is limited by the amount of traffic the central node can handle. In the star topology, a device has one of two functions: it sits either at a remote location or in the main office.
Network capacity planning is trivial when dealing with star topologies, as the main office node is upgraded as the network grows.
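The contrast between the bank and the supermarket examples can be sketched with a few lines of arithmetic. All the numbers below are assumptions made up for illustration (a 1 Gbps central link, 64 kbps per teller machine, and 20 Mbps per supermarket); they are not measurements.

```python
def max_remote_sites(central_capacity_bps: int, per_site_bps: int) -> int:
    """Return how many remote sites the central node can serve
    if every site sends at per_site_bps at the same time."""
    return central_capacity_bps // per_site_bps

# Hypothetical figures, for illustration only.
CENTRAL_LINK_BPS = 1_000_000_000   # 1 Gbps at the main office
TELLER_MACHINE_BPS = 64_000        # 64 kbps per automated teller machine
SUPERMARKET_BPS = 20_000_000       # 20 Mbps per supermarket

teller_sites = max_remote_sites(CENTRAL_LINK_BPS, TELLER_MACHINE_BPS)
supermarket_sites = max_remote_sites(CENTRAL_LINK_BPS, SUPERMARKET_BPS)

print(f"teller machines supported: {teller_sites}")
print(f"supermarkets supported: {supermarket_sites}")
```

Under these assumptions, the same central link serves thousands of teller machines but only a few dozen supermarkets, which is the scaling limit described above.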
Figure 1.2 – A star topology
Hierarchical or tree topologies
Hierarchical topologies are used to optimize traffic, where larger nodes are used to aggregate traffic from smaller nodes in a hierarchical manner (see the example in Figure 1.3). These topologies can scale to thousands of nodes; however, because of the number of nodes in the path, the topologies can cause undesirable latency and extra node costs.
There is no limit on the number of nodes on this type of topology, and it’s one of the foundations of the internet global infrastructure.
Depending on the size of this topology, it can introduce a longer path, which will add significant latency. For instance, in Figure 1.3, A1 has to cross five hosts to reach A7.
Figure 1.3 – A hierarchical or tree topology
A Clos topology, also known as a Clos network or fabric, is used to increase the number of ports without compromising latency and throughput and is often found in data centers. It is composed of at least three stages. Note that there is no oversubscription or aggregation like in the hierarchical topologies: the Clos topology provides the same amount of available bandwidth on the input and output. The nodes are normally called spines and leaves. The spines are always in the center and only have connections to other Clos nodes. Leaves are used to connect to external devices or networks.
Figure 1.4 – A Clos topology
Why are these topologies used? To increase the number of ports available without compromising throughput. This kind of topology is also used inside a router to provide connectivity between interface cards. Some companies use small devices to increase the number of ports that are offered without raising the cost as smaller devices are normally cheaper.
One additional characteristic of the Clos network is that it has the same distance between any two external ports (in terms of nodes in the path), therefore the latency in normal conditions is the same. For instance, in Figure 1.4, the latency between an external port on node L1 to an external port on L4 or E1 is the same.
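The equal-distance property can be checked with a small graph sketch. The fabric below is a hypothetical two-tier leaf-spine layout (a folded three-stage Clos) with two spines and four leaves; the node names are assumptions and are not meant to reproduce Figure 1.4.

```python
from collections import deque
from itertools import combinations

# Hypothetical fabric, for illustration only.
SPINES = ["S1", "S2"]
LEAVES = ["L1", "L2", "L3", "L4"]

def build_leaf_spine(spines, leaves):
    """Connect every leaf to every spine and return an adjacency map."""
    graph = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            graph[leaf].add(spine)
            graph[spine].add(leaf)
    return graph

def hops(graph, source, target):
    """Return the number of links on the shortest path, using breadth-first search."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None  # target is unreachable

graph = build_leaf_spine(SPINES, LEAVES)
distances = {pair: hops(graph, *pair) for pair in combinations(LEAVES, 2)}
for (a, b), distance in sorted(distances.items()):
    print(f"{a} -> {b}: {distance} hops")
```

Every pair of leaves comes out at the same distance, which is why latency under normal conditions does not depend on which two external ports are talking.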
More information on Clos networks can be found in an interesting paper from Google called Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network – ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4, October 2015.
A mixed topology is used in large corporations where latency and traffic are both important to take care of. Normally, star topologies and P2P are used to shorten paths and reduce latency, whereas hierarchical topologies are used to optimize and aggregate traffic, and finally, Clos networks to increase the number of ports.
Modern cloud service providers are migrating to a more complex topology, where there are connections between elements where latency matters and aggregate device functions where traffic matters.
Network capacity planning is normally harder because connections are not totally hierarchical and aggregation points are not necessarily part of all traffic paths. An example of this kind of mixed topology is shown in Figure 1.5:
Figure 1.5 – A mixed topology
A very important point that some engineers get confused about is the interface speed representation. 1 KB in memory representation is 2^10 or 1,024 bytes and 1 GB is 2^30, which is 1,073,741,824 bytes. For interface speeds, the same does not apply, and 1 Kbps is actually 1,000 bits/second, while 1 Gbps is 1,000,000,000 bits/second (more details can be found at https://en.wikipedia.org/wiki/Data-rate_units).
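A quick calculation shows why the distinction matters. The sketch below, with names chosen only for this example, computes how long it takes to move 1 GB as counted in memory (2^30 bytes) over a fully used 1 Gbps interface (10^9 bits per second), ignoring protocol overhead.

```python
# Memory sizes use powers of two; interface speeds use powers of ten.
KB_MEMORY = 2**10        # 1,024 bytes
GB_MEMORY = 2**30        # 1,073,741,824 bytes
KBPS = 10**3             # 1,000 bits per second
GBPS = 10**9             # 1,000,000,000 bits per second

def transfer_seconds(size_bytes: int, rate_bps: int) -> float:
    """Time to send size_bytes over a link running at rate_bps,
    ignoring headers, retransmissions, and any other overhead."""
    return size_bytes * 8 / rate_bps

seconds = transfer_seconds(GB_MEMORY, GBPS)
print(f"1 GB (2^30 bytes) over 1 Gbps takes {seconds:.2f} seconds")
```

The result is a little more than 8 seconds rather than exactly 8, because the two units use different bases.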
Device types and functions
Network devices used to have specific functions as CPU and memory were scarce and expensive. Nowadays, network devices can have multiple functions when required. In large networks, devices have fewer functions as they tend to get overloaded more easily when traffic demands increase. Here are some of the functions that a device can have:
- Hub: This is a very old term to designate a device that only repeats the physical signal.
- Switch: A device that works only on the data link layer. It is normally used in LANs, and it works by switching frames. The most common protocol used on these devices to control paths is the Spanning Tree Protocol (STP).
- Router: A device that works only on the network layer or IP Layer. It is used to interconnect multiple LANs or create long-haul remote connections. Internally, a router routes packets using a routing protocol to exchange route information with other routers. Some routers can also switch frames or work as a switch.
- NAT: NAT devices replace source and destination IP addresses to allow the use of private IP addresses or to isolate internal traffic from external traffic.
- Firewall: Normally, devices that control the traffic that passes through them by looking into the content of the frame or the packet. There are several different types of firewalls, and some might be super complex, which includes encrypting and decrypting traffic.
- Load balancers: When servers can’t handle too many clients because of hardware limitations, load balancers can be used to deal with the client demands by sharing the client request between several servers. Those devices also look into the packet content to determine which server would get the traffic.
- Network server: A computer used to provide some sort of service to the network, for instance, an authentication server, an NTP server, or a Syslog collector.
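Since NAT was described above in terms of private IP addresses, here is a minimal sketch that uses Python's standard ipaddress module to tell private addresses from public ones. The sample addresses are arbitrary picks for illustration. Note that is_private covers the RFC 1918 ranges as well as other special-purpose ranges, so treat it as a first check rather than a complete NAT policy.

```python
import ipaddress

# Arbitrary sample addresses, chosen for illustration only.
SAMPLES = ["10.0.0.1", "172.16.5.4", "192.168.1.10", "8.8.8.8"]

def needs_nat(address: str) -> bool:
    """Return True if the address is private and therefore would need
    translation before being used as a source on the public internet."""
    return ipaddress.ip_address(address).is_private

results = {address: needs_nat(address) for address in SAMPLES}
for address, private in results.items():
    label = "private, needs NAT" if private else "public"
    print(f"{address}: {label}")
```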
In network jargon, the term oversubscription is used to describe nodes or links in the network that aggregate traffic from other parts of the network and statistically use it to their advantage. For instance, an ISP might have a 1 Gbps interface to connect to the internet and 1,000 customers with 10 Mbps interfaces using the service, which is an oversubscription of 10 to 1. This practice is quite normal and is only possible because of the characteristics of the clients’ traffic, which allow such aggregation without degradation. There are lots of mathematical models and papers on the internet describing this behavior and how to use it in your favor.
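The ratio in this example is simple to compute. The following sketch uses the same numbers as the paragraph above; the function names are made up for this example.

```python
def oversubscription_ratio(customers: int, customer_bps: int, uplink_bps: int) -> float:
    """Total capacity sold to customers divided by the uplink capacity."""
    return customers * customer_bps / uplink_bps

def worst_case_share_bps(customers: int, uplink_bps: int) -> float:
    """Bandwidth each customer gets if all of them transmit at once."""
    return uplink_bps / customers

CUSTOMERS = 1_000
CUSTOMER_BPS = 10_000_000      # 10 Mbps per customer
UPLINK_BPS = 1_000_000_000     # 1 Gbps toward the internet

ratio = oversubscription_ratio(CUSTOMERS, CUSTOMER_BPS, UPLINK_BPS)
share = worst_case_share_bps(CUSTOMERS, UPLINK_BPS)

print(f"oversubscription: {ratio:.0f} to 1")
print(f"worst-case share per customer: {share / 1_000_000:.0f} Mbps")
```

The worst-case share only matters if every customer transmits at full rate at the same time, which is exactly what the statistical behavior of the traffic makes unlikely.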
But some traffic can’t be aggregated without being degraded. In a data center, the traffic that can’t be oversubscribed is the traffic between servers, such as remote disk, data transfers, and database replicas. In this scenario, the best solution is to interconnect them without oversubscription using a solution such as non-blocking Clos topologies.
Browsing web pages, watching videos, and receiving messages form most of the traffic on the internet, which easily allows the aggregation technique without degradation.
More information on oversubscription can be found in the paper Evaluating Impacts of Oversubscription on Future Internet Business Models by A. Raju, V. Gonçalves, and P. Ballon – Published in Networking Workshops, 25 May 2012 – Computer Science.
In this section, we went over the basic components of computer networks, including protocols, topology types, interface speeds, and device types. By now, you should be able to identify these terms more easily and will be familiar with their meanings, because we are going to use these terms throughout this book. Moving on, we are going to review more terms related to network architecture.
Describing network architecture and its components
The term network architecture was introduced in the early 2000s, mimicking roles in the construction industry, where architects design and civil engineers build. Different companies use the term differently, but in this book, network architecture will be used to refer to the design of the network and its functions.
For a good network architecture, it is desirable to have a document describing in detail the first three layers of the network, from the physical layer to the routing layer. With this documentation, it is easy for the engineers to understand the physical connections, the Ethernet domains, and the routing protocols used.
A network diagram is much like a map, where the cities are the nodes and the roads are the links that connect them. For a network engineer, diagrams are crucial to describe how nodes are connected, and they also can group and demarcate important areas. A good diagram is easy to interpret and makes it easy to follow how data flows.
There are up to three types of diagrams; they can be integrated on the same page and graph, or they can be separated onto different pages. The main diagrams are one to show the physical connections, which can include the technology involved in the data link layer, and the switching and routing diagrams.
In Figure 1.6, we can see an example of a network diagram:
Figure 1.6 – Example of a network diagram
Figure 1.7 – Network diagram symbols
Network node names
A network node is a device that is essentially used to interconnect and serve as a transport of the data in the network. It can be either a hub, a switch, or a router. To help network engineers identify the node function, names are used to describe their main function. Here are some of them:
- Transit router: These are routers that have interfaces with other service providers. These links are normally used as a service to access other networks, therefore they have a cost because they are normally connected to other big carriers.
- Peer router: These routers have interfaces with other networks in a peer configuration, meaning neither party pays to use the link. On these links, only the traffic between the peer companies is exchanged, and traffic destined for outside networks is not allowed. Accessing external networks is what transit routers are used for.
- Core router: These are nodes that are in the center of the network. They normally handle a large amount of traffic and have high-speed interfaces. Their throughput capacity is the highest in the network, but they have fewer interfaces as they concentrate the traffic of the network.
- Distribution router: These are nodes that normally connect to the core and aggregation routers. They normally interconnect different locations of the network. They don’t have many interfaces and their throughput capacity is high, but not as high as the core router.
- Aggregation router: These routers normally aggregate the traffic from the access routers. They are normally located in the same area or location as the access routers, and they have fewer interfaces compared to the access routers.
- Access router: Some architects add a node that connects all last-mile networks or CPE nodes. These routers are located closer to the customer and have more interfaces than any other router.
- Top of the rack (TOR): TOR refers to nodes that can be either a switch or a router, depending on the architecture. They are responsible for connecting the servers in the rack to the rest of the network.
- Clos rack: A Clos network, as described before, is a technique to add connectivity to multiple servers using small devices. A Clos rack is seen by the rest of the network as a single unique block, and in terms of architecture, it acts as a single node, normally used as a single router with a large number of interfaces.
- CPE: CPE is the node that is installed at the customer’s location. It normally has one interface connecting to the last-mile network and one local interface that can be an Ethernet or a wireless Ethernet. These devices can also implement NAT, firewall and, in some cases, they have multiple local interfaces, which can act as a switch and a router. These nodes are cheap and small with very low throughput capacity compared to the other nodes.
The last-mile network
This term is used to describe the architecture used to connect the customer to the network. Normally, this term is only used for ISPs, but some corporations also use it to interconnect their branches.
The last-mile network has a range of coverage and normally doesn’t cross the 1 km mark but depends on the type of technology used. Here are some of the most common last-mile networks:
- Cable TV: There are several technologies used here to provide data communication using the cable TV that the customer has installed. The most used one is DOCSIS, which in 2017 was upgraded to version 4. This solution uses a single cable that is shared among several premises.
- Digital subscriber line (DSL): DSL uses the old telephone line to pass data communication. For that, there are lots of standards, and the most common ones are VDSL and ADSL. The DSL solutions don’t share the same media as cable TV does, and there is one cable for each customer.
- Fiber to the premises (FTTP): FTTP is when an optical cable arrives at a customer’s premises. Like cable TV, the most common implementation is a single fiber that serves several different customers in a shared manner. The most common technology is a passive optical network (PON) or, more specifically, the Gigabit-capable PON (GPON), defined in ITU-T G.984.
Further details on GPON networks can be found in the paper GPON in Telecommunication Network – November 2010 – Paper from the International Congress on Ultra Modern Telecommunications and Control Systems (ICUMT) conference, 2010.
- Wi-Fi: Normally, this technology is used privately inside a company or a home, but some ISPs use the Wireless Ethernet standards (IEEE 802.11 family) to provide the last mile to customers using omnidirectional antennas. This particular use varies from country to country, as it depends on the government’s legislation. These services are normally advertised as Wi-Fi hotspots (https://en.wikipedia.org/wiki/Hotspot_(Wi-Fi)).
- Satellite: For data communication using satellites, there are two methods: one using geostationary satellites and the other using constellation satellites. The difference between them is the latency, as geostationary satellites orbit very far from Earth. The constellation method has low latency but has handover challenges as the satellites keep moving, and it has historically offered very low data throughput. The most famous technology using geostationary satellites is VSAT. Internet using VSAT adds around 250 ms each time the data travels from Earth up to the satellite and back down, so a round trip takes around 500 ms. But the dark ages of high latency might be over as SpaceX has announced they have finally solved the handover problem using the constellation method. This new service is called Starlink and has promised to have high capacity, low latency, and high availability using low orbit satellites.
A good discussion on the Starlink network can be found in the paper Starlink Analysis – July 15, 2021 – Research group ROADMAP-5G at the Carinthia University of Applied Sciences.
- Power line communication (PLC) or HomePlug: PLC, or broadband over power lines (BoPL), uses the power cables to communicate data. This is achieved by modulating high frequencies on the wire. Most transformers won’t let the information pass through, as they act as low-pass filters that block the high frequencies, so the communication has to be contained within a house or between posts without a transformer. The most common technologies here are HomePlug AV2 and IEEE 1901-2010 (https://ieeexplore.ieee.org/document/5678772).
- Mobile: Definitely the most popular network is the mobile last mile. Today, they use 5G technology, but other old networks are still in use, such as 4G (LTE), 3G, and GPRS.
More information on mobile technologies can be found at Evolution of Mobile Communication Technology towards 5G Networks and Challenges by A. Agarwal, K. Agarwal, S. Agarwal, and G. Misra – American Journal of Electrical and Electronic Engineering, 2019, Vol. 7, No. 2, pp. 34-37.
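The latency quoted earlier for geostationary satellites follows from the distance alone. The sketch below assumes a satellite directly overhead at the geostationary altitude of about 35,786 km and a signal traveling at the speed of light; it ignores the longer slant path at other latitudes and any processing delay, which is why it lands a little under the 250 ms and 500 ms figures.

```python
# Physical constants (approximate altitude; exact speed of light in vacuum).
GEO_ALTITUDE_KM = 35_786
SPEED_OF_LIGHT_KM_S = 299_792.458

def one_leg_ms(distance_km: float) -> float:
    """Propagation delay in milliseconds for a single ground-to-satellite leg."""
    return distance_km / SPEED_OF_LIGHT_KM_S * 1000

leg = one_leg_ms(GEO_ALTITUDE_KM)   # ground -> satellite
one_way = 2 * leg                   # ground -> satellite -> ground
round_trip = 2 * one_way            # request and its response

print(f"single leg: {leg:.1f} ms")
print(f"one way via satellite: {one_way:.1f} ms")
print(f"round trip: {round_trip:.1f} ms")
```

A low orbit constellation shortens the distance by roughly two orders of magnitude, which is where its latency advantage comes from.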
The physical architecture
The physical architecture is sometimes not necessarily the description of the cables or the fibers that will connect the devices but the infrastructure used by the network as a physical layer defined in the TCP/IP stack. This means we can reuse other foreign networks as a physical layer even though they have their own protocol stacks. Here are some of the possible physical technologies used in the architecture:
- Dark fiber: When connecting nodes, the term dark fiber means the nodes that are connected will be using a fiber that does not contain a repeater or underlying infrastructure. In the case of a connection between two nodes using dark fiber, if one node loses power, the other will not receive any light from the fiber. In this scenario, a fiber cut is perceived in both ends immediately, and interfaces go down instantaneously with a fiber cut. Only the packets in the output interface queue are discarded when a failure occurs.
- Synchronous Transport Module (STM): STM was initially created to multiplex digital phone lines, but later started to be used for data communication. The most common one was STM-1, which was 155 Mbps. Routers used to have an interface that could encapsulate STM frames toward an STM network. The STM network would just switch the frames from one end to the other. A cut in the fiber using this technology might not be perceived quickly enough, causing a huge amount of packet loss. As we will describe later, bidirectional forwarding detection (BFD) needs to be used here to avoid drastic problems.
- Dense wavelength-division multiplexing (DWDM): DWDM is an evolution of STM. The DWDM network is a switch network that also has a frame and time and wave division for each of the packets of data carried, similar to STM but enhanced. Similarly, BFD is necessary because a cut in the fiber here would not be perceived quickly enough, causing a huge amount of packet loss.
- Back to back: As explained before, the term back to back is normally used to designate the nodes that are connected directly without any other physical layer in between, such as repeaters.
- Network tunnels: Network tunnels are points of the network that are used to encapsulate the traffic and travel in a different network. Tunnels can be either Layer 2 or Layer 3 and are implemented to abstract the network that is being carried. In some network architectures, they are meant to reach a distant part of the network using a foreign infrastructure.
- VPN tunnels: These are like network tunnels. VPN tunnels normally add encryption.
The routing architecture
It’s important to define how the traffic will flow in the network. For that, we need to have a proper design in terms of routing distribution. This is necessary so failure remediation, redundant paths, load balancing, routing policies, and traffic agreements can be implemented. The architecture would have to include an internal routing protocol and an external routing protocol if connected outside. Here is a summary:
- Interior gateway protocol (IGP): IGP is a routing protocol that runs in a delimited area or location, normally internally within the same organization, as the name says. In the IGP domain, routers exchange path information by announcing and receiving topology updates. The most common IGPs use link state information to build the routing path topology. If an interface goes down, the update has to be propagated to the entire IGP domain. Isolated areas are used to avoid updating a topology that is too large, which could cause instability. Historically, the popular IGPs were RIP and EIGRP, but today, Open Shortest Path First (OSPF) and Intermediate System-to-Intermediate System (IS-IS) dominate.
- Exterior gateway protocol (EGP): EGP is a routing protocol used to exchange routing information between organizations. It normally does not contain link state information, only the path distance. The most common EGP protocol is Border Gateway Protocol (BGP).
- IS-IS: IS-IS is an IGP protocol designed by ISO, registered as ISO 10589. It is a link state protocol based on the shortest path algorithm called Dijkstra’s algorithm. It’s the second most used IGP.
- OSPF: OSPF is an IGP protocol designed by IETF, registered originally in 1989 by RFC1131 and updated a few times later. Version 3 is the latest version, described in RFC5340. OSPF also uses Dijkstra’s algorithm to calculate paths and is the most popular and used IGP. OSPF uses areas to scale and improve stability during routing database updates.
- BGP: BGP is a unique protocol used to exchange routing information between organizations. It was first introduced in 1989 in RFC1105. It is also one of the protocols with the most updates and extensions at the IETF and can be used for different purposes, such as internal BGP (iBGP), Multiprotocol BGP (MP-BGP), defined in RFC4760 and also used to carry MPLS services, and, more recently, BGPsec, defined in 2017 in RFC8205. BGP is a path vector protocol, a variant of the distance vector approach, and it does not use link state information like OSPF.
- Autonomous system number (ASN): Like the IP range, ASN is a unique number that is associated with an organization when it starts using BGP to exchange routing tables. It is controlled by the five regional internet registries: ARIN in North America, LACNIC in Latin America, APNIC in Asia-Pacific, RIPE NCC in Europe, and AFRINIC in Africa. When routing tables are exchanged using BGP, the ASN is carried on the path. For instance, Amazon.com uses ASN 16509 (https://whois.arin.net/rest/asn/AS16509).
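Since both IS-IS and OSPF were described above as relying on Dijkstra's algorithm, a minimal sketch of it may help. The four-router topology and its link costs are hypothetical, and a real IGP implementation does much more (flooding, areas, next-hop calculation); this only shows the shortest path computation over a link state database that every router already holds.

```python
import heapq

def shortest_path_costs(topology, source):
    """Dijkstra's algorithm: return the lowest total cost from source
    to every reachable router in the topology."""
    costs = {source: 0}
    heap = [(0, source)]  # (cost so far, router)
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > costs.get(node, float("inf")):
            continue  # stale entry; a cheaper path was already found
        for neighbor, link_cost in topology[node].items():
            new_cost = cost + link_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return costs

# Hypothetical link state database: router -> {neighbor: link cost}.
TOPOLOGY = {
    "R1": {"R2": 10, "R3": 1},
    "R2": {"R1": 10, "R4": 1},
    "R3": {"R1": 1, "R4": 5},
    "R4": {"R2": 1, "R3": 5},
}

costs = shortest_path_costs(TOPOLOGY, "R1")
for router, cost in sorted(costs.items()):
    print(f"R1 -> {router}: cost {cost}")
```

In this made-up topology, R1 reaches R2 more cheaply through R3 and R4 (cost 7) than over the direct link (cost 10), which is the kind of decision a link state protocol makes from the full topology and a path vector protocol such as BGP does not.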
Let’s explore how a network works in terms of its state.
Types of failure
In computer networks, a major problem is the instability caused by failures in routing tables, links, or nodes. If a node goes numb (for example, because the CPU freezes), the other nodes have to detect it quickly so they can divert the traffic through a different path. But how can a failure be detected to reroute quickly enough? Let’s explore the types of failures first:
- Link failure: A link failure is when a connection between two nodes stops receiving or sending data because there is an interruption on the path. The failure can be caused by a physical problem, such as a fiber cut, environmental conditions, such as heavy rain, or because of middleware equipment failure. Nodes normally detect whether a link is down by the lack of signal on the input, but in some cases, such as when using repeaters or underlying networks (such as DWDM), the signal is present on the input but data can’t be delivered. So, it requires a higher-level protocol to monitor and detect the communication breakdown instead of the interface input signal alone; otherwise, data will be discarded continuously until a node decides to reroute the traffic, which can take several seconds in some cases.
- Node failure: A node can fail in several different ways; the most common ones are power loss and OS freeze. A software glitch can cause a router to freeze for minutes or even hours, causing packet loss or not, depending on where the freeze occurs, in either the forwarding plane or the control plane. Detecting this failure quickly is a bit harder because all interface signals are still present, and the forwarding plane might be still working.
- Flapping: Interface flapping is when the interface keeps going down for short periods without being detected. Flapping causes data loss without detection and is normally hard to discover without specific equipment, usually connected on both ends, to measure the medium. The term flapping is also used when a route keeps appearing and disappearing on the routing table, which is called route flapping.
Failure detection techniques
- Signal off: Interfaces have a very simple way of detecting failure: the absence of the main signal or light. In the case of fiber, if the intensity of the light received is too low, the interface is considered down. Note that this detection is made on the input side of the interface.
- Protocol keep alive and hello packets: Some routing protocols have keep alive (or hello) messages to check whether their neighbors are still alive. In OSPF, the default period for hello packets is commonly 10 seconds on broadcast (LAN) and point-to-point interfaces, and 30 seconds on non-broadcast multi-access interfaces. BGP keep alive messages are commonly sent every 30 or 60 seconds, depending on the implementation. At today’s network speeds, 30 seconds means a lot of lost data: a fully loaded 10 Gbps interface would discard about 37 GB in that time. In today’s protocol implementations, the period for sending these messages normally can’t be shorter than a few seconds, which is still a long time to be losing data.
- Link BFD: In 2010, the IETF published RFC5880, which describes the Bidirectional Forwarding Detection (BFD) protocol, intended to allow routers to detect failures on their interfaces very quickly. BFD messages express their timer intervals in microseconds, so the smallest interval the message format can carry is 1 μs. BFD is normally implemented on the interface hardware, which allows it to respond without interrupting the main CPU.
- Routing protocol BFD: Link BFD is normally enabled on all interfaces of the network to detect failures quickly, but it does not help in the case of a router OS freeze or a control plane failure. To avoid packet loss in these cases, all major routing protocols have BFD capability, including OSPF, IS-IS, and BGP. Although the BFD protocol message supports microsecond intervals, the implementation used with routing protocols normally works in the order of milliseconds and is limited in the number of peers it can monitor. The reason is that these messages need to be handled by the main CPU, and too many might cause performance degradation.
- Route flapping detection: The routing protocol can detect persistent route flapping and suppress it for a period. This is useful to avoid recalculating paths when a route is not actually stable. When suppression is in place, normally, the default route is taken.
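To get a feel for why detection time matters, the arithmetic behind the data loss figure for a fully loaded link can be written as a short Python sketch. The function name is hypothetical, the 50 ms BFD value is an assumed example rather than a default, and the detection time is simplified to a single interval (real protocols usually wait for several missed messages, so the actual loss can be larger):

```python
def data_lost_gigabytes(link_gbps, detection_seconds):
    """Worst-case data discarded on a fully loaded link while a
    failure goes undetected, in decimal gigabytes (10**9 bytes)."""
    gigabits = link_gbps * detection_seconds
    return gigabits / 8  # 8 bits per byte


# Detection times in seconds; the BFD value is an assumption.
scenarios = [
    ("30 s keep alive", 30.0),
    ("10 s hello", 10.0),
    ("50 ms BFD (assumed)", 0.05),
]

for label, seconds in scenarios:
    lost = data_lost_gigabytes(10, seconds)
    print(f"{label}: {lost:.4f} GB lost on a 10 Gbps link")
```

For the 30-second case, this gives 37.5 GB, which matches the figure quoted for a 10 Gbps interface.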
Control plane and forwarding plane
The forwarding plane, or data plane, is an abstract concept where some processes, equipment, and hardware are used to forward traffic through the network. In other words, the forwarding plane defines all entities in the network responsible for receiving data, transporting it, and delivering it.
The forwarding plane works whenever data is carried from an input point, A, to an output point, B, and it does not need a working control plane to do so as long as the path already exists. The control plane only has to act when no path exists from A to B. It also acts in case of a failure, because the original path might be interrupted and must be built again.
So, why is this important in network automation? Because the control plane has to update forwarding paths if there is a problem with the forwarding plane, and these updates can cause packet drops, jitter, and delays. A stable network does not require any path updates and, consequently, demands minimal work from the control plane. Network automation should therefore avoid any action that causes the control plane to update the network unnecessarily.
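The division of labor between the two planes can be illustrated with a toy model. The following Python sketch is only an illustration under stated assumptions: the class and function names, the router names, and the prefixes are all hypothetical, and real routers implement this in dedicated hardware and routing daemons. The forwarding plane just looks up installed routes; the control plane only steps in to repair them after a failure:

```python
import ipaddress


class ForwardingPlane:
    """Toy forwarding table using longest-prefix match."""

    def __init__(self):
        self.routes = {}  # ip_network -> next hop name

    def install(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def remove_next_hop(self, next_hop):
        self.routes = {p: nh for p, nh in self.routes.items()
                       if nh != next_hop}

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [p for p in self.routes if addr in p]
        if not matches:
            return None  # no path installed: the packet is dropped
        best = max(matches, key=lambda p: p.prefixlen)
        return self.routes[best]


def control_plane_repair(fib, failed_next_hop, backup_paths):
    """Toy control plane: acts only when a path is broken."""
    fib.remove_next_hop(failed_next_hop)
    for prefix, next_hop in backup_paths.items():
        fib.install(prefix, next_hop)


fib = ForwardingPlane()
fib.install("10.0.0.0/8", "router-b")
fib.install("10.1.0.0/16", "router-c")
print(fib.lookup("10.1.2.3"))  # forwarding works with no control plane activity

# router-c fails, so the control plane installs a backup path.
control_plane_repair(fib, "router-c", {"10.1.0.0/16": "router-d"})
print(fib.lookup("10.1.2.3"))
```

Note that every call to `control_plane_repair` changes what `lookup` returns, which is the churn that automation should avoid triggering without need.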
Usually, when a router restarts, all the routing peers detect that the session went down and then came up. This down/up transition results in the control plane working to recompute all the route paths, generating thousands of updates in the entire network and, consequently, causing a churn to the forwarding plane. This recomputation can also cause routing flaps, which may create transient forwarding black holes and transient forwarding loops. These transient problems also consume a lot of resources on the control plane of the routers affected.
Therefore, graceful restart was created to avoid such drastic changes when a restart is required.
The idea is that all the control plane processes in one router can be restarted without affecting the forwarding plane or the control plane of the neighboring routers. In practice, a graceful restart is a method to restart the routing processes while forwarding continues undisturbed.
In 2003, IETF published RFC3623 to define the implementation of the graceful restart for OSPF. Today, the main control plane protocols have some sort of graceful restart, including BGP, IS-IS, MPLS, RSVP, and LDP.
When building network automation, this kind of method is the preferred way to restart routing processes, for example, when updating the software.
In this section, we’ve reviewed network architecture and its components. We got more details on routing and physical architecture components. We also learned how important control and data plane separation is, along with the failure types. It is important to know these network terminologies to help with network automation. Next, we’re going to review network management and its components.
Illustrating network management components, network bastions, and more
Before we finish this chapter, let’s touch on some of the terms used in network management and planning.
An access control list (ACL) can be implemented on either the forwarding plane or the control plane. When implemented on the forwarding plane, ACLs are used to limit IP reachability to certain parts of the network and to prevent IP spoofing. When implemented on the control plane, they protect the routing protocols and management ports from malicious connections.
ACLs are also used on the management interfaces to block undesirable traffic when using in-band management.
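The way an ACL filters traffic can be sketched in a few lines of Python. This is a simplified illustration, not a vendor implementation: the rule format, the function name `acl_permits`, and the example prefixes (taken from documentation address ranges) are assumptions, and the sketch assumes the common first-match behavior with an implicit deny at the end of the list:

```python
import ipaddress


def acl_permits(acl, source_ip):
    """Evaluate rules in order; the first match decides.

    acl: list of (action, prefix) tuples, where action is
         "permit" or "deny" (hypothetical rule format).
    """
    addr = ipaddress.ip_address(source_ip)
    for action, prefix in acl:
        if addr in ipaddress.ip_network(prefix):
            return action == "permit"
    return False  # assumed implicit deny when nothing matches


# Hypothetical ACL protecting a management port.
MGMT_ACL = [
    ("deny", "192.0.2.66/32"),   # one blocked host
    ("permit", "192.0.2.0/24"),  # the management subnet
]

for ip in ("192.0.2.10", "192.0.2.66", "198.51.100.1"):
    print(ip, "permitted" if acl_permits(MGMT_ACL, ip) else "denied")
```

Because evaluation stops at the first match, rule order matters: placing the permit for the whole subnet first would let the blocked host through.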
Management system and managed elements
The management system is the platform including the software and hardware responsible for managing the network. It can be centralized or distributed. The managed elements are the targets of the management system, including routers, switches, modems, repeaters, and intelligent racks, among others.
Note that the managed element does not need to be part of the network. It can be a support system, such as a rack with fans or an air conditioning unit. When writing code for network management, it is important to classify the elements appropriately so they can be managed accordingly.
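One possible way to classify managed elements in code is shown in the following Python sketch. The names `ElementKind` and `ManagedElement`, the property `is_network_node`, and the inventory entries are hypothetical, and a real management system would carry many more attributes:

```python
from dataclasses import dataclass
from enum import Enum


class ElementKind(Enum):
    ROUTER = "router"
    SWITCH = "switch"
    MODEM = "modem"
    REPEATER = "repeater"
    SUPPORT = "support"  # racks, fans, air conditioning units


@dataclass(frozen=True)
class ManagedElement:
    name: str
    kind: ElementKind

    @property
    def is_network_node(self):
        # Support systems are managed but are not part of the network.
        return self.kind is not ElementKind.SUPPORT


# Hypothetical inventory mixing network nodes and support systems.
inventory = [
    ManagedElement("core-rtr-1", ElementKind.ROUTER),
    ManagedElement("access-sw-7", ElementKind.SWITCH),
    ManagedElement("rack-12-fans", ElementKind.SUPPORT),
]

network_nodes = [e.name for e in inventory if e.is_network_node]
print(network_nodes)
```

With this kind of classification in place, automation code can, for example, push routing configuration only to network nodes while still collecting alarms from every element.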
In-band and out-of-band management
For OOB management, there is an isolated network infrastructure that carries only management traffic. It is not connected to the main network in any way, only to the management ports of every managed element. In other words, the forwarding plane does not carry any management traffic. In addition, the OOB network should be able to exist and deliver management traffic independently of the main network routing status because, in a catastrophic scenario, the OOB network should be enough to reach a managed element even if its network interfaces are down. It’s important to note that this network normally does not carry much traffic, so its nodes and interfaces have low throughput. Some OOB networks are implemented using mobile networks.
In in-band management, there is no physical network isolation between the forwarding plane and management traffic, so the interface that carries customer data also carries management traffic. In this scenario, ACLs are used extensively to block unwanted traffic toward the ports of the managed elements. In addition, some network architectures add priority queues to interfaces so that management traffic is delivered first and is not discarded on heavily loaded links.
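The effect of a priority queue on an in-band interface can be sketched with Python’s `heapq` module. This is a simplified strict-priority model with hypothetical class and packet names; real devices implement queuing in hardware and usually offer more elaborate scheduling:

```python
import heapq
import itertools


class PriorityInterfaceQueue:
    """Strict-priority queue: management packets always leave first."""

    MANAGEMENT = 0  # lower number is served first
    CUSTOMER = 1

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # keeps FIFO order within a class

    def enqueue(self, traffic_class, packet):
        heapq.heappush(self._heap, (traffic_class, next(self._order), packet))

    def dequeue(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = PriorityInterfaceQueue()
queue.enqueue(PriorityInterfaceQueue.CUSTOMER, "customer-1")
queue.enqueue(PriorityInterfaceQueue.CUSTOMER, "customer-2")
queue.enqueue(PriorityInterfaceQueue.MANAGEMENT, "ssh-to-router")

sent = [queue.dequeue() for _ in range(3)]
print(sent)
```

Even though the management packet arrived last, it is transmitted first, so management access survives when the link is congested with customer data.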
Some management systems use both methods to talk to the devices. Normally, heavy traffic, such as OS upgrades and event logging, goes in-band, while console access to the element goes via the OOB network.
Telemetry is not a new term and refers to any type of equipment that can monitor field variables remotely, such as temperature or humidity. This term was then imported to computer networks to refer to a group of procedures used to collect network information remotely.
Network telemetry refers to the area of computer networking that encompasses the procedures and systems used to define, collect, and analyze network data. In some cases, it refers specifically to a newer method of obtaining network data by streaming it from the network devices.
Management information base
An MIB can be public or private. When public, its definitions are published in an RFC, such as the interface group MIB defined in RFC2863. When private, it has to be provided by the vendor who owns the MIB.
The MIB is normally organized as a numbered tree (Figure 1.8). An object in the MIB tree is normally referred to by its object identifier (OID). For instance, the number of unicast packets sent on an interface output is represented by an object called ifOutUcastPkts, which has the OID .1.3.6.1.2.1.2.2.1.17 (http://www.net-snmp.org/docs/mibs/interfaces.html).
The OID normally contains a value that is a variable that can have different types, such as OCTETSTR, among others.
Figure 1.8 – The SNMP MIB tree
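Because an OID is just a path of numbers through the MIB tree, it can be handled in code as a tuple of integers. The following Python sketch uses only the standard library (no SNMP library is involved), and the helper names `parse_oid` and `in_subtree` are hypothetical. The OID values are the standard IF-MIB ones for ifTable and ifOutUcastPkts; the interface index 3 is an assumed example:

```python
def parse_oid(text):
    """Convert a dotted OID string such as '.1.3.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))


def in_subtree(oid, root):
    """True when oid sits at or below root in the MIB tree."""
    return oid[:len(root)] == root


IF_TABLE = parse_oid(".1.3.6.1.2.1.2.2")             # ifTable
IF_OUT_UCAST_PKTS = parse_oid(".1.3.6.1.2.1.2.2.1.17")  # ifOutUcastPkts

# Appending an interface index selects the counter for one interface.
counter_for_if_3 = IF_OUT_UCAST_PKTS + (3,)

print(in_subtree(IF_OUT_UCAST_PKTS, IF_TABLE))
print(".".join(str(n) for n in counter_for_if_3))
```

Comparing tuples rather than strings avoids false matches such as treating `.1.3.6.1.2.1.22` as if it were below `.1.3.6.1.2.1.2`.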
The term bastion comes from fortifications that were used to protect cannons during medieval wars. At that time, a bastion was an angularly shaped part of an outer wall, usually placed around the corners of a fort to allow defensive fire in many directions.
Like cannons in medieval times, network elements need layers for protection. Network bastions or bastion hosts are physical devices, normally computers, with defensive mechanisms built in that are connected to two or more networks.
Bastion hosts are popularly designed using Linux installed on a computer with multiple Ethernet ports. Each Ethernet port connects to a different part of the network where isolation or protection is desired.
To protect the networks, bastion hosts do not forward traffic, and normally, the IP forwarding capability is disabled, as in the following Linux example:
# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
A Linux box without IP forwarding can’t route traffic between its interfaces, so traffic has to originate in the bastion host itself to reach an external interface, and nothing leaves an Ethernet port unless it was generated locally in the box. Therefore, bastion hosts need to have authentication, such as a username and password, to allow a user or system to log in and run a shell locally. From the host, the user might be able to generate IP packets toward the other Ethernet ports.
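An automation script might want to verify that a bastion host really has forwarding disabled. The following Python sketch only parses the text printed by the `sysctl` command shown earlier; the function name is hypothetical, and how the output is collected from the host (for example, over a remote session) is left out on purpose:

```python
def ip_forwarding_disabled(sysctl_output):
    """Parse a line such as 'net.ipv4.ip_forward = 0' and report
    whether IPv4 forwarding is switched off."""
    key, _, value = sysctl_output.partition("=")
    if key.strip() != "net.ipv4.ip_forward":
        raise ValueError(f"unexpected sysctl key: {key.strip()!r}")
    return value.strip() == "0"


print(ip_forwarding_disabled("net.ipv4.ip_forward = 0"))
print(ip_forwarding_disabled("net.ipv4.ip_forward = 1"))
```

On Linux, the same setting can also be read from the file `/proc/sys/net/ipv4/ip_forward`, which contains just the value.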
Network automation will require an additional mechanism to allow accessing the network nodes for configuring them. We are going to cover these mechanisms later when writing some code to access the nodes.
An example of a bastion host is shown in Figure 1.9:
Figure 1.9 – Bastion host connecting production and corporate networks
FCAPS is a network management model and framework defined by ISO. The acronym stands for fault, configuration, accounting, performance, and security. Each of these is a management category into which the ISO model groups network management tasks. Let’s look at a brief description of each category:
- Fault: The goal of fault management is to recognize, isolate, correct, and log faults that occur in the network.
- Configuration: This refers to storing configurations from devices, tracking changes, and provisioning new configurations.
- Accounting: This is concerned with tracking network usage for data transported, per business, client, or user. The goal is to be able to bill or charge each of them appropriately.
- Performance: This is focused on ensuring that the network works at an acceptable level by monitoring network latency, packet loss, link utilization, packet discards, retransmissions, and error rates.
- Security: This refers to controlling access to assets and protocols in the network, which can include AAA systems, ACLs, and firewalls.
Why is FCAPS important for network automation? Because it’s better to write network automation code with tasks separated as in FCAPS, so we can place automation in the appropriate parts of the management system.
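One simple way to keep automation tasks separated by FCAPS category is to tag each task with its category. The following Python sketch shows the idea; the task names and the registry layout are hypothetical and are not part of the ISO model:

```python
from enum import Enum


class Fcaps(Enum):
    FAULT = "fault"
    CONFIGURATION = "configuration"
    ACCOUNTING = "accounting"
    PERFORMANCE = "performance"
    SECURITY = "security"


# Hypothetical automation tasks tagged with their FCAPS category.
TASKS = {
    "collect_alarms": Fcaps.FAULT,
    "backup_configs": Fcaps.CONFIGURATION,
    "push_config_change": Fcaps.CONFIGURATION,
    "export_usage_records": Fcaps.ACCOUNTING,
    "poll_link_utilization": Fcaps.PERFORMANCE,
    "audit_acls": Fcaps.SECURITY,
}


def tasks_for(category):
    """Return the task names registered under one FCAPS category."""
    return sorted(name for name, cat in TASKS.items() if cat is category)


print(tasks_for(Fcaps.CONFIGURATION))
```

With tasks grouped this way, each category can be deployed, scheduled, and granted permissions independently.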
Growing the network is hard because if you buy too many resources, you could lose money. However, if you buy too little, you could lose customers. Network planning is used to ensure performance and costs are going in the right direction. In other words, it consists of several activities whose final target is to define an optimal cost-effective network design for the future. Network planning engineers work with prediction and statistical models to draw the most probable future network growth.
Because of the nature of the predictions, the network planning team needs a lot of data from the network, and much of it is probably not easy to collect. You might be required to collect instantaneous data that is not available in the current management systems, so you can use your network automation skills for that.
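As a very small example of the kind of prediction involved, the following Python sketch fits a straight line to monthly peak utilization and estimates when a link would reach a planning threshold. It is a deliberately simple least-squares model, not what a planning team would necessarily use; the function names, the sample data, and the 80% threshold are all assumptions:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_threshold(threshold, months, utilization):
    """Months after the last sample until the fitted trend reaches
    the threshold, or None if utilization is not growing."""
    slope, intercept = linear_fit(months, utilization)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - months[-1]


# Hypothetical monthly peak utilization of one link, in percent.
months = [0, 1, 2, 3, 4, 5]
peak_util = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]

remaining = months_until_threshold(80.0, months, peak_util)
print(f"Upgrade needed in about {remaining:.1f} months")
```

The quality of such an estimate depends entirely on the data collected, which is where automation that gathers utilization samples regularly becomes valuable.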
One important point to mention is the work that companies are doing to create a safe place for traffic to flow. Most companies today have invested in separate, specialized teams to deal with security, and those teams have engineers dedicated to understanding the nuances of the network in order to define the security rules to be followed when designing and operating computer networks.
Your automation work will probably involve dealing with such security rules applied in the network, which will vary by company depending on the technology and level of security required. A good overview of network security can be found at https://en.wikipedia.org/wiki/Network_security.
In this chapter, we reviewed the key points of networking. The intention was to highlight the main concepts and define some of the network jargon. At this point, you have enough network background to discuss automation work with any network engineer. I hope that any network automation code you work on from now on will make more sense, now that you are more familiar with the network terms.
In the next chapter, we will learn about how networks are evolving to be programmable.