You're reading from Practical Site Reliability Engineering

Product type: Book
Published in: Nov 2018
Publisher: Packt
ISBN-13: 9781788839563
Edition: 1st Edition
Authors (3):
Pethuru Raj Chelliah

Pethuru Raj Chelliah (PhD) works as the chief architect at the Site Reliability Engineering Center of Excellence, Reliance Jio Infocomm Ltd. (RJIL), Bangalore. Previously, he worked as a cloud infrastructure architect at the IBM Global Cloud Center of Excellence, IBM India, Bangalore, for four years. He also had an extended stint as a TOGAF-certified enterprise architecture consultant in the Wipro Consulting Services division and as a lead architect in the corporate research division of Robert Bosch, Bangalore. He has more than 17 years of IT industry experience.

Shreyash Naithani

Shreyash Naithani is currently a site reliability engineer at Microsoft R&D. Prior to Microsoft, he worked with both start-ups and mid-level companies. He completed his PG Diploma from the Centre for Development of Advanced Computing, Bengaluru, India, and is a computer science graduate from Punjab Technical University, India. In a short span of time, he has had the opportunity to work as a DevOps engineer with Python/C#, and as a tools developer, site/service reliability engineer, and Unix system administrator. During his leisure time, he loves to travel and binge watch series.

Shailender Singh

Shailender Singh is a principal site reliability engineer and a solution architect with around 11 years of IT experience. He holds two master's degrees, in IT and computer application. He has worked as a C developer on the Linux platform and has had exposure to almost all infrastructure technologies, from hybrid to cloud-hosted environments. In the past, he has worked with companies including McKinsey, HP, HCL, Revionics, and Avalara, and these days he tends to use AWS, K8s, Terraform, Packer, Jenkins, Ansible, and OpenShift.

Chapter 11. Post-Production Activities for Ensuring and Enhancing IT Reliability

Business automation, augmentation, and acceleration are accomplished through a variety of microservices-based software applications working in conjunction with integrated platforms and optimized IT infrastructures. In short, IT is the biggest enabler of businesses across the globe: business offerings and outputs are enabled by scores of distinct IT advancements, and evolving business expectations are being met through a host of developments in the IT space. These improvements empower businesses to deliver newer and premium offerings fast. Software applications are presented to customers and consumers through intuitive and informative interfaces so that they can be used easily and without error. Furthermore, this continuous empowerment in the IT space, in turn, facilitates accomplishing more with less,...

Modern IT infrastructure


Today, software-defined cloud centers are very popular and widely leveraged for business agility, affordability, and productivity. The cloud idea fulfils the infrastructure's automation, optimization, and utilization requirements. The rapid maturation and stabilization of virtualization has made programmable hardware a reality, which is why infrastructure as code is the buzzword in the IT industry these days. IT infrastructure monitoring, measurement, and management are advancing considerably with the rise of the cloud paradigm. A variety of IT infrastructure operations are being automated and accelerated through a host of advanced and standardized tools. The simultaneous rise of the DevOps concept, along with a flurry of powerful cloud technologies and tools, has brought extensive automation and optimization to the IT space. IT self-service, pay-per-usage, and elasticity have become the core IT capabilities.

Cloud service...

Monitoring clouds, clusters, and containers


Cloud centers are increasingly being containerized and managed; well-entrenched containerized clouds are on the way. Forming and managing containerized clouds is simplified by a host of container orchestration and management tools, and there are both open source and commercial-grade container-monitoring tools. Kubernetes is emerging as the leading container orchestration and management platform. By leveraging these toolsets, the process of setting up and sustaining containerized clouds becomes faster, less risky, and more rewarding.

Tool-assisted monitoring of cloud resources (both coarse-grained and fine-grained) and of applications in production environments is crucial to scaling the applications and providing resilient services. In a Kubernetes cluster, application performance can be examined at many different levels: containers, pods, services, and clusters. Through a single pane of glass...

Cloud infrastructure and application monitoring


The cloud idea has disrupted, innovated, and transformed the IT world. Yet the various cloud infrastructures, resources, and applications need to be closely monitored and measured through automated tools. Automation is gathering momentum in the cloud era: cloud automation tools offer flexibility in the form of customization, configuration, and composition, and many manual and semi-automated tasks are becoming fully automated. In this section, we are going to discuss infrastructure monitoring as a step toward infrastructure optimization and automation. There are processes, platforms, procedures, and products to enable cloud monitoring.

Enterprise-scale and mission-critical applications are being cloud-enabled to be deployed in various cloud environments...

The monitoring tool capabilities


The cloud paradigm brings the much-needed flexibility of assigning the resources needed to support demand from cloud users. Establishing and enforcing appropriate policies and rules is important for assigning cloud resources to business applications and IT services. However, the effectiveness of policy management depends on the visibility that organizations have into their cloud resources. Organizations need to have the capability to create, modify, monitor, and update the policies. In short, cloud monitoring tools need to have the previously mentioned cloud-specific features, functionalities, and facilities to realize all the cloud-sponsored benefits.

Organizations deploying cloud computing services trust third-party providers to fulfil the quality of service (QoS) attributes, and performance, as noted previously, is the key QoS parameter. The monitoring tool has to monitor not only the actual levels of performance, as experienced by business users, but...

Prognostic, predictive, and prescriptive analytics


Any operational environment needs data analytics and machine learning capabilities to be intelligent in its everyday actions and reactions. The environments most affected include IT environments (traditional data centers or recent cloud-enabled data centers (CeDCs)), manufacturing and assembly floors, plant operations, and maintenance, repair, and overhaul (MRO) facilities. Increasingly, a variety of important environments are being filled with scores of networked and embedded devices, both resource-constrained and resource-intensive, along with toolsets and microcontrollers. Hospitals have a growing array of medical instruments, and homes have a number of connected wares and utensils, such as coffee makers, dishwashers, microwave ovens, and consumer electronics. Manufacturing floors have powerful equipment, machinery, and robots. Workshops, mechanical shops, and flight maintenance garages are becoming more sophisticated and smarter...

Log analytics


Every software and hardware system generates a lot of log data (big data), and it is essential to do real-time log analytics to quickly understand whether there is any deviation or deficiency. This extracted knowledge helps administrators to consider countermeasures in time. Log analytics, if done systematically, facilitates preventive, predictive, and prescriptive maintenance. Workloads, IT platforms, middleware, databases, and hardware solutions all create a lot of log data when they are working together to complete business functionalities. There are several log analytics tools on the market.
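As a minimal sketch of the idea of spotting a deviation in log data, the following Python snippet computes the share of ERROR lines in a batch of log lines and flags the batch when that share crosses a threshold. The log line format, the function names `error_ratio` and `deviates`, and the 0.2 threshold are assumptions made for illustration; real log analytics tools work on streams and use far richer models.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\S+)\s+(DEBUG|INFO|WARN|ERROR)\s+(.*)$")


def error_ratio(lines):
    """Return the fraction of parseable lines logged at ERROR level."""
    levels = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines that do not fit the assumed format are skipped
            levels[match.group(2)] += 1
    total = sum(levels.values())
    return levels["ERROR"] / total if total else 0.0


def deviates(lines, threshold=0.2):
    """Flag a batch of log lines whose error ratio exceeds the threshold."""
    return error_ratio(lines) > threshold


sample = [
    "2018-11-01T10:00:00Z INFO request served",
    "2018-11-01T10:00:01Z ERROR upstream timeout",
    "2018-11-01T10:00:02Z INFO request served",
    "2018-11-01T10:00:03Z ERROR upstream timeout",
]
print(error_ratio(sample))  # 0.5
print(deviates(sample))     # True
```

Running such a check on each batch or time window is one simple way to give administrators an early signal, which they can then follow up with the dedicated tools mentioned in this section.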

Logs play an important role in the IT industry. They are used for various purposes, such as IT operations, system and application monitoring, security and compliance, and much more. Having a centralized and standardized logging system makes life easy for software developers, who are often asked to troubleshoot the application, detect issues, enhance the...

IT operational analytics 


We discussed log data and its analytics in the previous section. There are log-management tools and log analytics platforms for gaining real-time information about all kinds of software and hardware systems. The insights they produce go a long way in stabilizing and strengthening various systems by proactively attending to system issues. There is also operational data for all kinds of systems under operation. The data from IT systems contains valuable insights into system usage, the user's experience, and behavior patterns. There are operational analytics platforms and engines, such as Splunk, for monitoring, searching, analyzing, visualizing, and acting on massive streams of real-time and historical machine data, from any source, format, or location. Operational analytics helps with the following:

  • Extracting operational insights
  • Reducing IT costs and complexity
  • Improving employee productivity
  • Identifying...

IT performance and scalability analytics


There are typically big gaps between the theoretical and practical performance limits. The challenge is how to enable systems to attain their theoretical performance level under any circumstance. The required performance level can suffer for various reasons, including poor system design, bugs in software, limited network bandwidth, third-party dependencies, and I/O access. Middleware solutions such as adapters, connectors, and drivers also contribute to unexpected performance degradation. The system's performance has to be maintained under any load (user, message, and data). There are several metrics for this, such as requests per second (RPS) and transactions per second (TPS). Performance testing is one way of recognizing performance bottlenecks and adequately addressing them. This testing is performed in the pre-production phase.
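To make the requests per second (RPS) metric concrete, here is a small Python sketch that buckets request arrival times into whole seconds and reports the peak and mean RPS. The function names and the sample arrival times are hypothetical, and timestamps are assumed to be non-negative seconds (for example, Unix epoch values) held in a list.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Map each whole second to the number of requests that arrived in it."""
    return Counter(int(ts) for ts in timestamps)


def peak_and_mean_rps(timestamps):
    """Return (peak RPS, mean RPS) over the seconds spanned by the data."""
    per_second = requests_per_second(timestamps)
    if not per_second:
        return 0, 0.0
    # Seconds covered from the first to the last request, inclusive,
    # so idle seconds in between lower the mean.
    span = max(per_second) - min(per_second) + 1
    return max(per_second.values()), len(timestamps) / span


# Hypothetical arrival times in seconds: 3 requests in second 0,
# 2 in second 1, none in second 2, and 1 in second 3.
arrivals = [0.1, 0.4, 0.9, 1.2, 1.3, 3.5]
peak, mean = peak_and_mean_rps(arrivals)
print(peak)  # 3
print(mean)  # 1.5
```

The same bucketing applies to TPS if the timestamps mark completed transactions instead of incoming requests.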

Once the software is running on production servers, the thing to do is to continuously and...

IT security analytics


IT infrastructure security, application security, and data security (at rest, in transit, and in use) are the top three security challenges, and there are security solutions approaching these issues at different levels and layers. Access-control mechanisms, cryptography, hashing, digests, digital signatures, watermarking, and steganography are the well-known and widely used means of ensuring strong security. There is also security testing and ethical hacking for identifying security risk factors and eliminating them at an early stage. All kinds of security holes, vulnerabilities, and threats are meticulously unearthed in order to deploy defect-free, safety-critical, and secure software applications. During the post-production phase, security-related data is extracted from both software and hardware products to produce security insights that in turn go a long way in empowering security experts and architects...

The importance of root-cause analysis 


The cost of service downtime is growing. There are reliable reports stating that the cost of downtime ranges from $100,000-$72,000 per minute. Identifying the root cause generally takes hours, as measured by the mean time to identification (MTTI). For a complex situation, the process may run into days. The MTTI is lengthy for various reasons, one being that there are not many tools to speed up identification. We need competent tools that add value by correlating data from different IT tools, such as APM, ITSM, SIEM, and ITOM, through open API connectors. As microservices and their instances run on containers, IT teams need to manage millions of data points. This transition demands highly advanced and automated tools. Pioneering AI algorithms will be commonly used to automate the precise identification of root causes.
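As a rough illustration of how MTTI can be tracked, the following Python sketch averages the time between when an incident was detected and when its root cause was identified. The incident records and the function name `mean_time_to_identification` are made up for this example; in practice the timestamps would come from an incident-management or ITSM tool.

```python
from datetime import datetime


def mean_time_to_identification(incidents):
    """Average minutes between detection and root-cause identification.

    `incidents` is a list of (detected_at, identified_at) datetime pairs.
    """
    durations = [
        (identified - detected).total_seconds() / 60
        for detected, identified in incidents
    ]
    return sum(durations) / len(durations) if durations else 0.0


# Hypothetical incident records.
incidents = [
    (datetime(2018, 11, 1, 10, 0), datetime(2018, 11, 1, 12, 0)),  # 120 minutes
    (datetime(2018, 11, 2, 9, 0), datetime(2018, 11, 2, 13, 0)),   # 240 minutes
]
mtti_minutes = mean_time_to_identification(incidents)
print(mtti_minutes)  # 180.0
```

Tracking this figure over time is one way to judge whether new correlation or automation tooling is actually shortening the identification step.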

Root-cause analysis is being touted as an important post-deployment activity for exactly pinpointing bugs and their roots in any software applications...

Summary


Several activities are being strategically planned and executed to enhance the resiliency, robustness, and versatility of enterprise, edge, and embedded IT. It is widely accepted that data analytics and machine learning are going to be the key differentiators for corporations in fulfilling the varying expectations of their customers, clients, and consumers. This chapter has described the various post-production data analytics that allow you to gain a deeper understanding of applications, middleware solutions, databases, and IT infrastructures in order to manage them effectively and efficiently. Machine-learning algorithms enable the formation of self-learning models to predict problems and prescribe viable solutions to surmount them. Thus, data analytics methods and ML algorithms come in handy in realizing resilient IT. The other important facets include static and dynamic code analyses to proactively identify bugs in software code to enhance application reliability...
