About Quantela
We are a technology company that offers outcomes business models. We empower our customers with the right digital infrastructure to deliver greater economic, social, and environmental outcomes for their constituents.
When the company was founded in 2015, we specialized in smart cities technology alone. Today, working with cities and towns, utilities, and public venues, our team of 280+ experts offers a vast array of outcomes business models through technologies like digital advertising, smart lighting, smart traffic, and digitized citizen services.
We pride ourselves on our agility, innovation, and passion to use technology for a higher purpose. Unlike other technology companies, we tailor our offerings (what we can digitize) and the business model (how we partner with our customers to deliver that digitization) to drive measurable impact where our customers need it most. Over the last several months alone, we have helped customers deliver outcomes like faster medical response times to save lives, reduced traffic congestion to keep cities moving, and new revenue streams to tackle societal issues like homelessness.
We are headquartered in Billerica, Massachusetts, in the United States, with offices across Europe and Asia.
The company has been recognized with the World Economic Forum’s ‘Technology Pioneers’ Award in 2019 and CRN’s IoT Innovation Award in 2020.
For the latest news and updates, please visit us at www.quantela.com.
Overview of the Role
The Site Reliability Engineer (SRE) ensures the availability, reliability, performance, and security of applications and infrastructure at the State Data Center (SDC). This role involves proactive monitoring, incident response, system optimization, and process improvements to maintain high service levels and compliance with security standards. The SRE will work closely with IT teams to enhance system resilience and efficiency.
Roles and Responsibilities
- Implement infrastructure monitoring (CPU, Memory, Disk, Network) using Zabbix, Prometheus, Grafana, or ELK Stack.
- Monitor database performance (PostgreSQL, MySQL, Oracle DB) and recommend optimizations.
- Establish log aggregation and alerting mechanisms to detect anomalies.
- Generate uptime and SLA compliance reports for management review.
- Diagnose system and network issues, escalate as required, and track resolution.
- Maintain a ticketing system for issue documentation and trend analysis.
- Conduct root cause analysis (RCA) and implement preventive measures.
- Perform post-incident reviews (PIRs) to improve system resilience.
- Ensure high availability and failover readiness for critical services.
- Optimize database indexing, query performance, and backup strategies.
- Perform capacity planning to ensure systems can handle peak loads.
- Implement automated scaling and load balancing for performance optimization.
- Enforce access control policies, including firewalls, SSH restrictions, and IAM.
- Ensure timely patching and hardening of OS, middleware, and databases.
- Monitor for security vulnerabilities and implement necessary mitigations.
- Ensure compliance with government security policies (CERT-In, ISO 27001).
- Ensure real-time replication of databases to the disaster recovery (DR) site.
- Conduct regular failover testing to validate DR readiness.
- Maintain documentation and runbooks for disaster recovery scenarios.
- Maintain incident reports, troubleshooting guides, and standard operating procedures (SOPs).
- Track service-level agreements (SLAs) and prepare compliance reports.
- Develop training sessions for internal teams on monitoring tools and processes.
Desired Skills/Background
- 5+ years of experience in SRE, IT Operations, or System Administration.
- Strong Linux (Ubuntu, RHEL, CentOS) and Windows Server knowledge.
- Experience with monitoring tools (Zabbix, Prometheus, Grafana, ELK, Splunk).
- Knowledge of networking, VPNs, firewalls, and load balancers.
- Familiarity with cloud services and on-premises infrastructure.
- Experience in database administration (PostgreSQL, MySQL, Oracle).
- Strong troubleshooting and incident management skills.
- Certifications such as AWS Certified SysOps Administrator, RHCE, ITIL, or Zabbix Certified Specialist.
- Experience working with State Data Centers (SDCs) and government IT projects.