Yasoda Krishna Annapureddy

Software Engineer & System Architect. I design, code, and scale distributed systems that power global retail operations. With deep expertise in backend engineering, cloud architecture, and reliability, I bridge the gap between complex code and massive scale.

Technical Identity

What I bring to the table as a developer-first leader:

Scalable Backends

Building high-throughput, low-latency systems using Java, Python, and Node.js.

Distributed Systems

Architecting resilient microservices, event-driven flows, and cloud-native infrastructure.

Production Automation

Engineering self-healing systems, CI/CD pipelines, and developer tooling.

AI-Enhanced Engineering

Integrating LLMs and AI agents to revolutionize observability and operations.

Engineering Journey

root@walmart-global-tech:~
./exec_engineering --scope=nationwide --clusters=4600+

Walmart Global Tech

Software Engineer III (SRE & Distributed Systems)
Present

Architecture & Scale

  • Hybrid Cloud/Edge: Architected reliability for a hybrid control plane managing 4,600+ Kubernetes clusters across heterogeneous edge hardware.
  • Progressive Delivery: Designed a multi-phase rollout strategy (Pilot → 25 → 100 → 500 → Fleet) using Argo & Flagger, ensuring zero-downtime updates for millions of daily transactions.
  • Holiday Readiness: Led full-system profiling and stress testing for the Black Friday/holiday peak, certifying microservices for resilience under burst loads.

Engineering Verticals

  • Performance Tuning: Tuned JVM G1 garbage collection and thread pools, eliminating memory leaks in async pipelines and reducing CPU throttling by 60%.
  • Zero-Downtime Migrations: Executed nationwide Helm2 → Helm3 and Vault → Akeyless migrations with fallback readers, maintaining 100% uptime.
  • Observability Cost: Reduced metric ingestion costs by 50% by pruning 200+ high-cardinality metrics and optimizing histogram buckets.

lead@iqit-solutions:~
./modernize_stack --target=microservices

IQIT Solutions

Software Engineer
Jan 2024 – Jul 2024

Microservices Modernization

  • Architecture: Decomposed monolithic applications into Spring Boot microservices, implementing Kafka consumer groups for parallel batch processing.
  • Performance: Reduced latency by 30% by eliminating blocking database calls and optimizing async flows.

Security & Platform

  • Security: Implemented Spring Security with token rotation and rate limiting for healthcare data protection.
  • Containerization: Migrated legacy infrastructure to Docker/Kubernetes, improving deployment velocity and scalability.

researcher@geophysical-tech:~
./ingest_data --rate=high_frequency

Geophysical Technology / ISU

Software Engineer & Researcher
Aug 2022 – Dec 2023

Low-Level Systems

  • Real-Time Ingestion: Built high-frequency C/Java multithreaded sensor data ingestion systems with microsecond precision using lock-free queues.
  • Kernel Tuning: Optimized Linux IRQ pinning and network stack (sysctl rmem/wmem) for predictable interrupt handling and throughput.

Security Engineering

  • Identity: Engineered robust SSO/OAuth2 authentication systems for university academic platforms.

eng@ncr-corp:~
./process_transactions --consistency=strict

NCR Corporation

Software Engineer
Feb 2021 – Aug 2022

Financial Engineering

  • Transaction Integrity: Designed idempotent Rewards Transaction APIs using timestamp vectors to ensure 99.99% accuracy and replay safety.
  • Firmware Porting: Engineered the migration of critical fuel controller firmware from Windows to Linux, enhancing stability for edge devices.
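
The page does not describe how the timestamp vectors work, so the sketch below swaps in the more common idempotency-key pattern to illustrate replay safety: the first request with a given key is applied, and any retry or replay with that key gets back the original result without touching the balance again. The class and method names are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RewardResult:
    transaction_id: str
    points: int
    replayed: bool = False


class IdempotentRewardsService:
    """Hypothetical rewards API: a given idempotency key is applied at most once."""

    def __init__(self) -> None:
        # idempotency key -> (request fingerprint, first result)
        self._seen: Dict[str, Tuple[Tuple[str, int], RewardResult]] = {}
        self._balances: Dict[str, int] = {}
        self._lock = threading.Lock()

    def award(self, idempotency_key: str, member_id: str, points: int) -> RewardResult:
        request = (member_id, points)
        with self._lock:
            if idempotency_key in self._seen:
                first_request, first_result = self._seen[idempotency_key]
                # Reusing a key for a different payload is a client bug, not a retry.
                if first_request != request:
                    raise ValueError(f"idempotency key {idempotency_key!r} reused with a different payload")
                # Retries and replays get the original result; the balance is not changed again.
                return RewardResult(first_result.transaction_id, first_result.points, replayed=True)
            self._balances[member_id] = self._balances.get(member_id, 0) + points
            result = RewardResult(f"txn-{len(self._seen) + 1}", points)
            self._seen[idempotency_key] = (request, result)
            return result

    def balance(self, member_id: str) -> int:
        with self._lock:
            return self._balances.get(member_id, 0)
```

In a real service the key table would live in the database, and the key insert and balance update would commit in one transaction so a crash between them cannot double-apply a reward.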

Systems Thinking & Architecture

Design principles for massive scale and reliability.

Progressive Delivery

I don't just deploy; I orchestrate. Using Argo and Flagger, I implemented a "Stage1 → Stage4" rollout strategy that validates SLOs (latency, error rates) at every batch (25 → 100 → 500 clusters) before proceeding, preventing fleet-wide outages.
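
A minimal sketch of the wave-and-gate idea, assuming the pilot is a single cluster and that `deploy` and `slo_healthy` are hypothetical hooks (in the setup described, Argo and Flagger perform the actual syncing and metric analysis). Targets are cumulative, so `(1, 25, 100, 500)` means the pilot, then up to 25, 100 and 500 clusters, then the rest of the fleet.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


def plan_waves(clusters: Sequence[str], targets: Sequence[int]) -> List[List[str]]:
    """Split the fleet so cumulative cluster counts hit each target, then add the remainder."""
    waves: List[List[str]] = []
    start = 0
    for target in targets:
        end = min(target, len(clusters))
        if end > start:
            waves.append(list(clusters[start:end]))
            start = end
    if start < len(clusters):
        waves.append(list(clusters[start:]))  # final "fleet" wave
    return waves


@dataclass
class RolloutReport:
    promoted: List[str] = field(default_factory=list)
    halted_at_wave: Optional[int] = None


def run_rollout(
    clusters: Sequence[str],
    targets: Sequence[int],
    deploy: Callable[[str], None],
    slo_healthy: Callable[[str], bool],
) -> RolloutReport:
    report = RolloutReport()
    for index, wave in enumerate(plan_waves(clusters, targets)):
        for cluster in wave:
            deploy(cluster)
        # SLO gate: every cluster in this wave must pass before the next wave starts.
        if not all(slo_healthy(cluster) for cluster in wave):
            report.halted_at_wave = index
            return report
        report.promoted.extend(wave)
    return report
```

A halted report leaves the failing wave for rollback; this sketch only stops the blast radius from growing and does not roll anything back itself.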

Resilience Engineering

Systems must survive network partitions. I design for the edge, ensuring checkout workflows function even with degraded WAN connectivity, using local caching and eventual consistency models for telemetry.
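
One common way to get eventual consistency for telemetry over a flaky WAN is a bounded store-and-forward buffer. The sketch below assumes a hypothetical `send` callable that returns `True` once the upstream collector accepts a record; when the link is down, records queue locally and drain in order once it recovers.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForwardBuffer:
    """Buffers telemetry locally while the uplink is down and drains it in order later."""

    def __init__(self, send: Callable[[dict], bool], max_items: int = 10_000) -> None:
        self._send = send
        # Bounded: when full, the oldest records are dropped first to protect local memory.
        self._queue: Deque[dict] = deque(maxlen=max_items)

    def record(self, event: dict) -> None:
        self._queue.append(event)
        self.flush()

    def flush(self) -> int:
        """Send queued records oldest-first; stop at the first failure. Returns records sent."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break  # uplink unavailable; keep the rest for the next attempt
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._queue)
```

Dropping the oldest telemetry under pressure is a deliberate trade-off that fits metrics and logs; it would not be acceptable for checkout transactions themselves.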

Performance Profiling

Deep-dive optimization is key. I use JFR (Java Flight Recorder) to analyze thread pools and GC behavior, tuning heap compaction and eliminating serialization loops to meet strict p99 latency targets.

Leadership & Impact

Driving engineering excellence across the organization.

Holiday Readiness Lead

Played a pivotal role in certifying 4,600+ clusters for Black Friday. Led full-system stress testing, failure mode validation (packet loss, node churn), and microservice certification to ensure uninterrupted checkout operations.

Cost Optimization

Engineered a 50% reduction in observability costs by identifying and pruning 200+ unused high-cardinality metrics and optimizing Prometheus scrape intervals.
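
Prometheus reports per-metric series counts through its `/api/v1/status/tsdb` endpoint (`seriesCountByMetricName`, which lists only the top entries by default). A minimal sketch of picking pruning candidates from that response, assuming the set of metrics actually used by dashboards and alert rules has been collected separately (`used_metrics` and the `min_series` threshold are hypothetical inputs):

```python
from typing import List, Set, Tuple


def pruning_candidates(
    tsdb_status: dict, used_metrics: Set[str], min_series: int = 1_000
) -> List[Tuple[str, int]]:
    """Return (metric, series count) pairs that are expensive and referenced nowhere."""
    entries = tsdb_status["data"]["seriesCountByMetricName"]
    candidates = [
        (entry["name"], int(entry["value"]))
        for entry in entries
        if int(entry["value"]) >= min_series and entry["name"] not in used_metrics
    ]
    # Largest series counts first: these save the most ingestion when dropped.
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```

Candidates can then be dropped at scrape time with a `metric_relabel_configs` rule that uses `action: drop` on `__name__`.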

MTTR Reduction

Reduced Mean Time To Resolution by 40% through the creation of deterministic SRE playbooks and automated root cause analysis templates adopted org-wide.

Advanced Skills Matrix

Technical depth for distributed systems engineering.

Distributed Systems
Consensus Models, Event-Driven Architecture, Multi-Cluster Orchestration, CAP Theorem, Edge Computing
Kubernetes & Cloud
Custom CRDs, Operator Pattern, Helm3 (Advanced), Argo Rollouts, Flagger, Istio/Envoy
SRE & Reliability
SLO/SLI Design, Error Budgets, Chaos Engineering, Auto-Remediation, Incident Command
Performance
JVM Tuning (G1GC), Thread Profiling (JFR), Kernel Tuning (eBPF/Sysctl), Latency Optimization, Async Pipelines
Languages
Java (Expert), Python (Expert), C/C++, Go, Bash, SQL

Flagship Engineering Case Studies

Deep dives into complex systems I've architected and built.

4,600+ Cluster Fleet
Distributed Systems | Edge Computing

The Challenge

Managing 4,600+ independent Kubernetes clusters across retail stores with variable network conditions and hardware. Manual updates were impossible.

The Architecture

Built a GitOps-driven control plane using ArgoCD and Flagger. Designed a custom "Wave Rollout" controller that promotes artifacts from Pilot → Region → National fleet based on real-time health metrics.

Key Engineering

Implemented automated canary analysis checking p99 latency and error rates. Built fallback mechanisms for disconnected edge clusters to ensure eventual consistency.
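
Flagger expresses checks like these declaratively as metric thresholds in the Canary resource; the sketch below just spells out the comparison logic in plain Python. The thresholds (canary p99 within 10% of baseline, error rate at most 0.5 percentage points higher) and the `Sample` shape are illustrative assumptions, not the production values.

```python
from dataclasses import dataclass
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


@dataclass(frozen=True)
class Sample:
    latencies_ms: Sequence[float]
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(
    baseline: Sample,
    canary: Sample,
    max_p99_ratio: float = 1.10,
    max_error_rate_delta: float = 0.005,
) -> bool:
    # Both the tail latency and the error rate must stay close to the baseline.
    latency_ok = p99(canary.latencies_ms) <= p99(baseline.latencies_ms) * max_p99_ratio
    errors_ok = canary.error_rate <= baseline.error_rate + max_error_rate_delta
    return latency_ok and errors_ok
```

Comparing against a live baseline rather than a fixed number keeps the check meaningful across stores whose normal latency differs.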

AI SRE Agent
Python | LLM | Prometheus

The Challenge

Alert fatigue and slow MTTR due to high volumes of repetitive infrastructure alerts across the massive fleet.

The Solution

Engineered an autonomous Python agent that listens to alerts, queries Prometheus & OpenObserve for context, and uses an LLM to perform root cause analysis and suggest remediation.
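
A skeleton of that listen, enrich, analyze loop, assuming alerts arrive as Alertmanager webhook JSON. `fetch_context` and `llm` are hypothetical callables standing in for the Prometheus/OpenObserve queries and the model client, and the `cluster` label name is an assumption about how alerts are labeled.

```python
from typing import Callable, Dict, List, Set


def firing_alerts(payload: dict) -> List[dict]:
    """Firing alerts from an Alertmanager webhook body, one per fingerprint."""
    seen: Set[str] = set()
    unique: List[dict] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications need no triage
        key = alert.get("fingerprint") or repr(sorted(alert.get("labels", {}).items()))
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def build_prompt(alert: dict, context: str) -> str:
    labels = alert.get("labels", {})
    summary = alert.get("annotations", {}).get("summary", "")
    return (
        f"Alert {labels.get('alertname', 'unknown')} on cluster "
        f"{labels.get('cluster', 'unknown')}: {summary}\n"
        f"Recent telemetry:\n{context}\n"
        "Name the most likely root cause and one safe remediation step."
    )


def triage(
    payload: dict,
    fetch_context: Callable[[Dict[str, str]], str],
    llm: Callable[[str], str],
) -> List[str]:
    """Return one suggestion per distinct firing alert."""
    return [llm(build_prompt(a, fetch_context(a.get("labels", {})))) for a in firing_alerts(payload)]
```

Deduplicating before calling the model matters at fleet scale: repeated notifications for the same alert would otherwise multiply both cost and noise.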

Impact

Automated triage for 80% of production alerts and reduced MTTR by 40%, freeing up significant engineering time.

Zero-Downtime Migrations
Infrastructure | Security

The Challenge

Migrating critical infrastructure components (Helm2→3, Vault→Akeyless) across 4,600+ live clusters without disrupting checkout operations.

The Strategy

Designed dual-read/write paths and fallback logic. Implemented automated linting and dry-run simulations on edge replicas to catch regressions before rollout.
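
A fallback reader of the kind used during a secrets-store migration can be sketched as follows. `DictReader` is an in-memory stand-in for real Vault or Akeyless clients, and no vendor API is assumed; any object with a `get(path)` method fits the `SecretReader` protocol.

```python
import logging
from typing import Mapping, Optional, Protocol

log = logging.getLogger("secret-migration")


class SecretReader(Protocol):
    def get(self, path: str) -> Optional[str]: ...


class DictReader:
    """In-memory stand-in for a real secrets client (hypothetical)."""

    def __init__(self, data: Mapping[str, str]) -> None:
        self._data = dict(data)

    def get(self, path: str) -> Optional[str]:
        return self._data.get(path)


class FallbackReader:
    """Prefer the new store; fall back to the legacy store while the migration is in flight."""

    def __init__(self, primary: SecretReader, legacy: SecretReader) -> None:
        self._primary = primary
        self._legacy = legacy
        self.fallback_hits = 0  # should trend to zero before the legacy store is retired

    def get(self, path: str) -> Optional[str]:
        try:
            value = self._primary.get(path)
        except Exception as exc:  # a primary outage must not break the caller
            log.warning("primary read failed for %s: %s", path, exc)
            value = None
        if value is not None:
            return value
        self.fallback_hits += 1
        return self._legacy.get(path)
```

The `fallback_hits` counter doubles as a migration progress signal: once it stays at zero across the fleet, the legacy path can be removed.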

Outcome

Completed nationwide migrations with zero downtime and 100% data integrity.

AI + Developer Automation

Building the tools that build the software.

Log Analysis Bot

Automated log parsing and anomaly detection using Python and OpenObserve VRL functions.
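
The VRL parsing step runs inside OpenObserve and is not shown here. Assuming logs have already been aggregated into per-minute error counts, a rolling z-score is one simple detector for the anomaly side; the window size, threshold and 1.0 standard-deviation floor are illustrative choices, not the bot's actual settings.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def anomalous_windows(counts: Sequence[int], window: int = 10, threshold: float = 3.0) -> List[int]:
    """Indexes whose count sits more than `threshold` std devs above the trailing-window mean."""
    flagged: List[int] = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the deviation so a perfectly flat history does not flag every tiny bump.
        sigma = max(pstdev(history), 1.0)
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

Because the window trails the current point, a spike inflates the baseline for the following minutes, which naturally suppresses repeated alerts for the same burst.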

Cluster Health Checks

Async Python scripts to perform parallel health checks across thousands of K8s clusters.
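
A minimal sketch of a bounded-concurrency fleet check with `asyncio`. `probe` is a hypothetical coroutine (for example, one that requests each API server's `/readyz` endpoint through an async HTTP client); the semaphore caps in-flight requests so thousands of clusters can be checked without exhausting sockets.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple


async def check_fleet(
    clusters: Iterable[str],
    probe: Callable[[str], Awaitable[bool]],
    max_concurrency: int = 200,
    timeout_s: float = 10.0,
) -> Dict[str, str]:
    """Probe every cluster concurrently and map each one to a status string."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def check_one(cluster: str) -> Tuple[str, str]:
        async with semaphore:  # at most max_concurrency probes in flight
            try:
                healthy = await asyncio.wait_for(probe(cluster), timeout=timeout_s)
                return cluster, "healthy" if healthy else "unhealthy"
            except asyncio.TimeoutError:
                return cluster, "timeout"
            except Exception as exc:  # one bad cluster must not abort the sweep
                return cluster, f"error: {type(exc).__name__}"

    results = await asyncio.gather(*(check_one(c) for c in clusters))
    return dict(results)
```

Per-probe timeouts and per-probe exception handling keep one unreachable store from stalling or failing the whole sweep.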

Code Quality Automation

Integrated JaCoCo and SonarQube into CI pipelines to enforce quality gates automatically.
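
SonarQube's quality gate is configured on the server, but the underlying coverage check is easy to illustrate on its own. A minimal sketch that reads the report-level `LINE` counter from a JaCoCo XML report (the `jacoco.xml` a Maven or Gradle build produces); the 80% threshold is an assumption.

```python
import xml.etree.ElementTree as ET


def line_coverage(report_xml: str) -> float:
    """Line coverage ratio from a JaCoCo XML report string."""
    root = ET.fromstring(report_xml)
    # Report-wide totals are the <counter> elements that are direct children of <report>.
    for counter in root.findall("counter"):
        if counter.get("type") == "LINE":
            missed = int(counter.get("missed", "0"))
            covered = int(counter.get("covered", "0"))
            total = missed + covered
            return covered / total if total else 1.0
    raise ValueError("no report-level LINE counter found")


def gate(report_xml: str, minimum: float = 0.80) -> bool:
    """True when line coverage meets the minimum; CI fails the job otherwise."""
    return line_coverage(report_xml) >= minimum
```

In a pipeline, a step would read the report file after the test stage and exit non-zero when `gate` returns `False`.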

Education

Illinois State University
M.S. Computer Science (Data Science)
GPA: 4.0 / 4.0
K L University
B.Tech Computer Science
GPA: 9.02 / 10.0

What I'm Building Next

Exploring the frontier of software engineering.

Distributed AI Agents

Designing multi-agent systems that can autonomously manage complex infrastructure and business workflows.

High-Scale Backends

Pushing the limits of concurrency in Java and Python to handle next-generation data volumes.

Autonomous Reliability

Building self-driving reliability systems that predict and prevent outages before they impact users.

Contact Me

Let's connect to discuss Distributed Systems, AI Engineering, or High-Scale Architecture.