November 21, 2022
For retail and e-commerce companies, exponential traffic spikes are a holiday season tradition that often peaks on Black Friday. And Ulta Beauty knows this all too well. As the largest beauty retailer in the US today, operating 1,300 stores nationwide, Ulta experienced the growing pains of its on-premises IT environment: slow rollouts, tedious infrastructure management, and a lack of visibility into critical systems. Sound familiar?
As companies brace themselves for the influx of holiday shopping activity, we sat down with Omar Koncobo, IT Director of e-commerce/Digital and Marketing Systems at Ulta Beauty, to share best practices and strategies for how to prepare for the holiday rush. Ulta recently decided to migrate to the cloud and take a microservices, API-first approach.
Our core technical team consists of architects, developers, and our DevOps team, who keep things running 24/7. We also get great support from our performance engineering team, especially for search. But it really takes a village that also involves the whole IT team, from networking to security to infrastructure.
We start getting ready for the holidays right after they’ve ended. There’s a lot of learning that comes from each holiday season, so while things are still fresh we regroup to review where we had hiccups and identify opportunities for improvement, and then build our roadmap from there.
Starting in July, we put our “readiness hat” on and start to test and fix in earnest.
A part of the team pauses everything else they’re doing to just focus on this holiday rush. We get alignment on the minimum size of our infrastructure; we decide how we will get the customer through the system, testing a variety of scenarios to optimize the customer experience with systems that will remain stable.
Today, we consider our holiday season to run from the Saturday before Thanksgiving through the Tuesday after Cyber Monday. The daily surge of traffic we get from the holidays can be 50% more than it is the rest of the year. That means our infrastructure needs to be double the size. And this is where the move to the cloud proved critical, because building extra infrastructure for peak coverage means being overprovisioned for normal traffic the rest of the year. Being in the cloud, we can scale up and down as needed and save a lot of money.
Monitoring is key for us. We have Sumo Logic dashboards that we run throughout the year, and as we approach the holidays we dial into some specific performance indicators on the systems we want to watch.
For example, we identify bad actors by tracking website activities that could indicate attacks and fraudulent activities. Using the Brute Force Attack dashboard, we can track indicators like invalid password attempts and login attempts per IP address and per country.
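As a rough illustration of the indicators mentioned above, the sketch below counts failed login attempts per IP address and per country. It is not Ulta's or Sumo Logic's implementation: the event shape, the function names, and the threshold of 5 failures are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical event shape: (ip_address, country_code, login_succeeded)
def flag_suspicious_ips(events, max_failures=5):
    """Return {ip: failure_count} for IPs with more than max_failures failed logins."""
    failures = Counter(ip for ip, _country, ok in events if not ok)
    return {ip: count for ip, count in failures.items() if count > max_failures}

def failures_by_country(events):
    """Count failed login attempts per country code."""
    return Counter(country for _ip, country, ok in events if not ok)

# Made-up sample data using documentation-reserved IP ranges.
events = [("203.0.113.7", "US", False)] * 6 + [
    ("198.51.100.2", "CA", False),
    ("198.51.100.2", "CA", True),
]
suspicious = flag_suspicious_ips(events)
by_country = failures_by_country(events)
```

In practice the threshold and the time window would be tuned against normal login behavior, since a fixed cutoff can flag shared networks such as offices or campuses.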
We also track order flow and volume analytics, with an Order Insight dashboard for monitoring system reliability and end-to-end operational issues. If there’s an increase in order cancellations, it may indicate a front-end issue or a problem with inventory in our warehouses. We can see the fastest-selling SKUs and where our traffic is coming from.
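One way to picture the cancellation signal is as a rate compared against a baseline. The sketch below is only an assumption about how such a check could look; the function names and the factor of 2.0 are hypothetical and do not come from the Order Insight dashboard.

```python
def cancellation_rate(orders_placed, orders_cancelled):
    """Fraction of placed orders that were cancelled in a time window."""
    if orders_placed == 0:
        return 0.0
    return orders_cancelled / orders_placed

def is_cancellation_spike(current_rate, baseline_rate, factor=2.0):
    """True when the current rate exceeds the baseline by the given factor."""
    return baseline_rate > 0 and current_rate > factor * baseline_rate

# Made-up numbers: 24 cancellations out of 200 orders, against a 5% baseline.
current = cancellation_rate(200, 24)
spike = is_cancellation_spike(current, baseline_rate=0.05)
```

A spike flagged this way says only that something changed; deciding whether the cause is a front-end issue or an inventory problem still takes the kind of drill-down described above.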
We set up a control room, or command center, where we monitor more than 20 dashboards, each covering a different data point.
Most of our real-time alerts come from Sumo Logic. We have people watching the dashboards but also watching alerts. We see alerts related to system health, for example when one of the instances is struggling or one of the nodes is maxed out on CPU. Sumo alerts us on those, and then the dashboard helps to confirm and pinpoint these issues.
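To make the "node maxed out on CPU" case concrete, here is a minimal sketch of a sustained-threshold check. The data layout, the 90% threshold, and the three-reading window are assumptions for illustration, not settings taken from Ulta's alerts.

```python
def maxed_out_nodes(cpu_samples, threshold=90.0, min_consecutive=3):
    """Return nodes whose last min_consecutive CPU readings are all at or above threshold.

    cpu_samples maps a node name to its CPU percentages, oldest reading first.
    """
    alerts = []
    for node, readings in cpu_samples.items():
        recent = readings[-min_consecutive:]
        # Require a full window so a single reading cannot trigger an alert.
        if len(recent) == min_consecutive and all(r >= threshold for r in recent):
            alerts.append(node)
    return sorted(alerts)

# Made-up readings for three hypothetical nodes.
samples = {
    "node-a": [50, 95, 97, 99],
    "node-b": [99, 40, 95, 96],
    "node-c": [99],
}
alerting = maxed_out_nodes(samples)
```

Requiring several consecutive readings is a common way to avoid alerting on a brief burst, at the cost of reacting slightly later to a real problem.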
The majority of our application performance monitoring (APM) tools are complementary to Sumo. From Sumo logs, we will see exactly which system the issue is coming from and jump on our application monitoring tool to tell us specifically at what level of the stack it’s happening.
Having the right tools in place in advance helps to quickly identify and remediate problems.
The first week of January is a blackout period for us: we don't make any system changes, so that our systems stay stable for returns. Returns are just as much a part of the holiday season digital experience for our customers as the shopping is, so we want to make sure our system is ready and stable for both.
The number one thing is planning. Don’t wait until the last minute. You need to know what’s working and what’s not working right after the holidays. Make a list of the key things you need to track for the next year.
The second most important thing is making sure you have the right performance engineering team in place and that you're preparing for all your different scenarios.
Third, the tools are really important. You need the right monitoring systems and the right alerting, and you need to know what to alert on and pay attention to.
Logs are like gold when you’re trying to troubleshoot an issue. So, make sure you have visibility into your logs to quickly see issues and address them to reduce your mean time to resolution.
Watch our entire conversation with Koncobo.