Lead Software Engineer - DevOps / Production Support

JPMorganChase
1 day ago
Full-time
On-site
Houston, Texas, United States
Software / Technology / IT
Description

We have an opportunity to impact your career and provide an adventure where you can push the limits of what's possible.

As a Lead Software Engineer at JPMorgan Chase within the Commercial & Investment Banking - Markets Tech - Trading / Derivatives Execution Tech team, you are an integral part of an agile team that works to enhance, build, and deliver trusted market-leading technology products in a secure, stable, and scalable way. As a core technical contributor, you are responsible for delivering critical technology solutions across multiple technical areas within various business functions in support of the firm’s business objectives.

This position will support the reliability, performance, and operational integrity of electronic and equities trading systems, with a specific focus on FIX protocol connectivity. This role is hands-on and operations-oriented, partnering closely with trading, technology, and development teams to ensure stable order flow, rapid incident response, and disciplined change execution. The position emphasizes Python automation, Linux troubleshooting, and Grafana-based observability, with C++ exposure used primarily to investigate issues and collaborate effectively with Application 

 

Job responsibilities

  • Executes creative software solutions, design, development, and technical troubleshooting with ability to think beyond routine or conventional approaches to build solutions or break down technical problems
  • Provide daily production support for electronic trading platforms, including FIX sessions, connectivity health, and order/trade workflow stability
  • Monitor system health and trading-impacting signals using Grafana dashboards and alerting to improve visibility into latency, errors, throughput, and availability
  • Lead incident triage and restoration activities during service degradation, including structured troubleshooting, stakeholder communications, and post-incident follow-up
  • Perform root cause analysis on recurring issues and implement durable remediation, including runbook improvements, alert tuning, and operational automation
  • Develops, reviews, and debugs secure, high-quality production code, including Python scripts and tools for health checks, operational workflows, reporting, and environment validation
  • Drives team adoption of enterprise-authorized AI-assisted engineering practices within the work environment to improve code quality, delivery speed, and operational outcomes (e.g., AI-assisted code review / refactoring, test strategy acceleration, incident/root-cause analysis support), while establishing consistent validation standards (secure coding, peer review, automated testing) and promoting reuse of effective patterns across the team
  • Applies knowledge of tools within the Software Development Life Cycle toolchain, including enterprise-authorized AI-assisted development and automation capabilities, to improve the value realized by automation
  • Troubleshoot Linux-based systems using logs, process and resource diagnostics, and network-level checks relevant to connectivity and application behavior
  • Partner with development teams to investigate complex issues in trading components by reading logs, traces, and diagnostic output, and interpret and discuss findings in contexts where components are implemented in C++
  • Adds to team culture of diversity, opportunity, inclusion, and respect

 

Required qualifications, capabilities, and skills

  • Formal training or certification in software engineering concepts and 5+ years of applied experience
  • Advanced proficiency in one or more programming languages, frameworks, and tools (e.g., Python, C++, Linux, Grafana)
  • Demonstrated experience in DevOps, production support, SRE, or application support in a mission-critical environment, with accountability for uptime and incident execution
  • Practical understanding of the FIX protocol
  • Strong Linux troubleshooting capability, including log analysis, process/resource diagnostics, and command-line proficiency
  • Hands-on experience with AWS and Terraform (infrastructure as code), and familiarity/experience with Atlas and Copilot as part of the deployment and platform toolchain
  • Ability to collaborate effectively across trading, operations, and engineering teams, including clear incident communications under time pressure
  • Proficiency in automation and continuous delivery methods, with advanced understanding of agile methodologies such as CI/CD, Application Resiliency, and Security
  • Demonstrated experience leading effective use of approved AI-assisted software development tools (e.g., for coding, code review, test acceleration, troubleshooting) with the ability to set team expectations for validating AI outputs for correctness, performance, and security
  • Strong understanding of responsible AI use in engineering workflows, including data sensitivity considerations, secure handling of inputs/outputs, and adherence to resiliency and security expectations; experience coaching engineers on safe, compliant adoption within delivery practices
  • Demonstrated proficiency in software applications and technical processes within a technical discipline (e.g., cloud, artificial intelligence, machine learning, mobile, etc.)

 

Preferred qualifications, capabilities, and skills
 
  • Knowledge of electronic trading or equities trading environments, including familiarity with order lifecycle concepts and trading-impacting incident patterns
  • Exposure to C++ sufficient to assist with investigation (e.g., reading stack traces, understanding logs and component behavior), without being a primary feature developer
  • Familiarity with incident management disciplines, including runbooks, post-incident reviews, alert quality management, and operational readiness practices
  • Basic networking knowledge relevant to troubleshooting connectivity and performance (e.g., TCP/IP behavior, port connectivity, latency sensitivity)