SysOps vs SRE vs DevOps
- SysOps practices and tooling overlap with DevOps.
- SRE (Site Reliability Engineering) is a subset of SysOps.
- SysOps vs DevOps
SysOps Overview
Systems Operations Engineers manage physical and cloud infrastructure. SysOps follows the ITIL (Information Technology Infrastructure Library) approach. Typical responsibilities include patch management, IaC (Infrastructure as Code), hypervisors, and VMs.
- Architecture Oversight: Coordinate with product managers and project managers.
- Defining and maintaining “static” infrastructure that cuts across applications and business units.
- Directory services (like LDAP, AD)
- Message buses (Kafka, RabbitMQ)
- Long-lived data storage (DBs)
- Long-lived VMs that may host containers managed by Kubernetes.
- Providing self-service tools for DevOps so they can cycle entire dev, staging, and prod resources safely. See also Platform Engineering.
- Performance Efficiency
- Cost optimization
SRE Overview
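Site Reliability Engineers write automation code to increase the stability and performance of systems. They focus on SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements). The SRE role originated at Google around 2003 and became widely known after Google's Site Reliability Engineering book was published in 2016. It fills the gap between SysOps and DevOps.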
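As a rough illustration of how an SLI is checked against an SLO, the sketch below computes an availability SLI and the share of the error budget that remains. The function names, the request counts, and the 99.9% target are assumptions made for this example, not values from these notes.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    A negative result means more requests failed than the SLO allows.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
budget = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

With a 99.9% target, 1,000 failures are allowed out of a million requests, so 500 failures spend about half of the budget.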
- Disaster Prevention
- Fault Tolerance: ability for a system to remain in operation even if some of the components used to build the system fail.
- Resilience: ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.
- Auto Backups, onsite, offsite
- Continuous Monitoring/Testing for:
- Business-wide security
- threat detection
- Reliability: distributed system design, recovery planning, and adapting to changing requirements
- Performance emergencies, playbooks
- Cost overrun emergencies, playbooks
- Security
- Zero Trust
- Data encryption in transit, at rest
- Disaster Recovery
- Backup and Restore: slow, cheap
- Pilot Light
- Warm Standby
- Multi-site active/active: fast, expensive
- Disaster Testing: Causing real disasters on purpose
- Principles of Chaos Engineering
- Chaos engineering was pioneered by Netflix (Chaos Monkey).
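One common way to tolerate the transient faults mentioned above is to retry with exponential backoff and jitter, and to exercise that path with deliberately injected failures. The sketch below is illustrative only: `TransientError`, `call_with_retries`, and `flaky_service` are hypothetical names, and the delays are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))


def flaky_service(failures_before_success):
    """Chaos-style fault injection: fail a fixed number of times, then succeed."""
    remaining = {"n": failures_before_success}

    def operation():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientError("injected fault")
        return "ok"

    return operation


if __name__ == "__main__":
    print(call_with_retries(flaky_service(2)))
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the delays instead of actually waiting.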
SysOps
- Monitoring
- Logging
  - Splunk: log mining.
  - Elastic Logstash: log ingestion and transformation.
  - Elasticsearch: log mining. Distributed, multitenant-capable full-text search engine with an HTTP web interface.
  - OpenSearch: Elasticsearch fork.
  - Loggly: log mining.
  - Nagios: monitoring, log storage and mining.
- Visualization
- Messaging
SRE (Site Reliability Engineering)
- Continuous Automated Testing
- Continuous monitoring and threat detection
  - AWS Security Hub
  - Amazon GuardDuty
  - Amazon Macie: detects PII exposure in S3.
- Chaos Engineering: actually break things, without warning
- Apache JMeter: written in Java.
- Locust: load testing, written in Python.
- Gatling
- The Grinder: Java.
- Network Security
IaC, EaC
- Git Workflow for Ops Infrastructure
  - GitOps = IaC + MRs + CI/CD
  - Declarative
  - IaC docs
  - Config docs
  - Argo Workflows
- Self-Serve Automated Infrastructure For Devs
  - Local Assets for Developers
  - Remote Assets for Developers
  - Integration Testing Assets for CI Build/Test Tools
  - Staging Assets for CD
    - supports QA, Acceptance Testing workflow
  - AWS Control Tower
  - AWS Organizations
- Infrastructure Provisioning
  - Terraform (HashiCorp)
  - Crossplane: Terraform vs Crossplane
  - Pulumi
  - AWS CloudFormation
  - AWS Elastic Beanstalk
  - AWS CDK (Cloud Development Kit): infrastructure defined in general-purpose languages, similar in approach to Pulumi.
- Configuration Mgt.
  - Ansible (Red Hat): playbooks.
  - Chef: legacy
  - Puppet: legacy
  - AWS Systems Manager
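The declarative side of GitOps noted above can be sketched as a reconcile step: compare the desired state stored in Git with the actual state of the environment, then compute the changes needed to converge. The dictionaries and the `plan_changes` function are hypothetical and do not reflect the API of any GitOps or provisioning tool.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Return the actions that would make `actual` match `desired`."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(
            k for k in desired if k in actual and desired[k] != actual[k]
        ),
        "delete": sorted(k for k in actual if k not in desired),
    }


# Hypothetical resources: what Git declares vs. what is currently running.
desired = {"web": {"replicas": 3}, "db": {"size": "large"}}
actual = {"web": {"replicas": 2}, "cache": {"size": "small"}}

plan = plan_changes(desired, actual)
print(plan)
```

In a real pipeline the plan would be reviewed in a merge request and applied by CI/CD; here it is only printed.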
Infrastructure
AWS Outposts: on-premises AWS cloud.
DNS
Load Balancers
Software
- Cloudflare
- Envoy: CNCF project, originally built at Lyft. Used by HashiCorp Consul as a sidecar proxy.
- Azure
  - Application Gateway
  - Traffic Manager
  - Load Balancer
- Google
  - Cloud Traffic Director
  - Cloud Load Balancing
- AWS
  - Gateway Load Balancer
  - Elastic Load Balancing
- VMware NSX
- Fastly Edge
- NGINX: owned by F5
- Barracuda
- A10 Thunder
- Kubernetes Ingress, etc.; see Containers page.
Hardware
CDN (Content Delivery Network)
APIs
- API Gateways
- Service Mesh
- Istio: Google, IBM.
- Linkerd: data-plane proxy written in Rust. Integrates with Traefik, Kong, and Gloo Edge.
- Traefik Mesh
- HashiCorp Consul Connect
- AWS App Mesh
- Apache ServiceComb
- Kuma: created by Kong.
Storage: File, Block, Object
File storage, block storage, or object storage?
- Kubernetes Storage (see Containers)
- Amazon
- Amazon Elastic File System (EFS)
- Amazon S3 (Simple Storage Service): object storage
- Amazon Elastic Block Store (EBS)
- AWS Storage Gateway: hybrid cloud and on-premises storage.
- Amazon Athena: query S3 data with SQL.
Event Buses
- Apache Kafka: distributed event streaming.
- Amazon EventBridge: serverless event bus that routes events between applications.
Messaging Queues
- RabbitMQ message broker
- Amazon SQS (Simple Queue Service)
- Apache ActiveMQ: Java-based message broker. JS and Python clients.
- ActiveMQ Artemis: next-generation broker.
Data Analytics
- Apache Spark: stream and batch processing. 3rd gen.
- Apache Flink: event-driven apps, stream and batch analytics, pipelines, ETL. Newer than Spark. 4th gen. Auto optimize. Many options for state maintenance. Supports replay. Known for “Big Data” and “Stream Processing”
- Amazon Kinesis: streams data into storage.
Data Warehouse, Lake
- Data Warehouse: structured, filtered data that has already been processed for a specific purpose
- Data Lake: raw data
- Amazon Redshift
- Google BigQuery: serverless cloud data warehouse.
- Snowflake
Data Routing and ETL
- Apache NiFi: originally “NiagaraFiles”, developed by the NSA. Known for “ETL” and “Data Integration”. DAGs for data routing and ETL. Low code. Web GUI. “Platform”.
- Apache Camel: Java enterprise integration “Framework”.
- Example (Camel plus Kafka plus NiFi): a Java app uses Camel to send messages to Kafka; NiFi consumes from Kafka.
- AWS Glue: serverless data integration.
Identity Mgt.
- LDAP
- MS Active Directory
- Azure AD (now Microsoft Entra ID): cloud Active Directory
- Amazon Cognito
- Google Cloud Directory Sync
Workflow Mgt., Event Scheduling
- Apache Airflow
- Orchestration Framework
- DAG: directed acyclic graph. Vertices and edges.
- ETL: extract, transform, load.
- Spring Batch: processes large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management.
- Luigi: tasks, data pipelines, batch jobs. Written by Spotify in Python.
- CloudWatch Events (now Amazon EventBridge), not CloudWatch in general.
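Orchestrators such as Airflow and Luigi model a workflow as a DAG so that each task runs only after the tasks it depends on. A minimal sketch of that ordering, using Python's standard-library `graphlib` rather than any orchestrator's API, with hypothetical task names:

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
etl_dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in the DAG."""
    return list(TopologicalSorter(dag).static_order())


print(run_order(etl_dag))
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why these tools require the graph to be acyclic.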