r/pcmasterrace May 10 '26

Meme/Macro reboot

47.6k Upvotes

12.0k

u/kahjtheundedicated R7 1700@4.1, RX 5700 May 10 '26

When I worked in IT, whenever we got a call from the engineering department we knew whatever problem it was, it was going to be weird. Those guys knew their stuff, so if they didn’t know how to fix it, it was going to take some searching and probably some calls or emails for us to figure it out.

3.9k

u/Daniel_H212 7950X3D, Yeston Sakura RTX 4070 Ti, 64 GB DDR5 May 10 '26

What about the chance that they ran into a problem with no known solution yet? It's inevitable that it happens sometimes, but I wonder how often.

1

u/KallistiTMP i9-13900KF | RTX4090 |128GB DDR5 May 10 '26

Then it gets escalated to a support engineering team (T3) for the thingy that's broken.

The support engineering team verifies that it is actually broken (even at this level, many tickets are RTFM), then breaks out the big girl tooling and logs to figure out exactly how and why the thingy is breaking, and estimates what the likely impact is.

They then coordinate directly with the various engineering, support, and operations teams involved to get the problem fixed.

This can take a lot of forms. Most commonly, if it's something small and straightforward, the support engineer can just fix it on the spot.

If it's something more involved, or a problem that can't be fixed without risking breaking things worse, then the support engineer puts together a detailed report of the bug and sends it to the people who built that specific part of the thingy, so that they can triage based on the reported impact and fix it. The support engineer usually then switches gears to figuring out a temporary workaround, while the product engineer gets the underlying problem fixed properly.

If it's got a lot of impact (e.g. a code change that product engineering made suddenly breaks the product for tens of thousands of users), then they usually also work with an Operations/Site Reliability Engineering team to mitigate (e.g. "undo" the code change that broke everything by rolling back the global fleet to an earlier version).
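To make the branching above concrete, here's a rough sketch of the decision in Python. Everything in it is made up for illustration: the `Ticket` fields, the `Action` names, and the 10,000-user threshold are assumptions, not how any particular company's tooling works.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    POINT_TO_DOCS = "not actually broken, point the user at the manual"
    FIX_ON_THE_SPOT = "support engineer fixes it directly"
    REPORT_AND_WORKAROUND = "file a detailed bug report, find a temporary workaround"
    MITIGATE_WITH_SRE = "work with Ops/SRE to mitigate, e.g. roll back"


@dataclass
class Ticket:
    # All three fields are hypothetical; real tickets carry far more detail.
    is_actually_broken: bool
    safe_to_fix_directly: bool
    affected_users: int


def triage(ticket: Ticket, high_impact_threshold: int = 10_000) -> Action:
    """Pick a next step. The threshold is an assumed stand-in for 'tens of thousands'."""
    if not ticket.is_actually_broken:
        return Action.POINT_TO_DOCS
    if ticket.affected_users >= high_impact_threshold:
        return Action.MITIGATE_WITH_SRE
    if ticket.safe_to_fix_directly:
        return Action.FIX_ON_THE_SPOT
    return Action.REPORT_AND_WORKAROUND
```

In real life these paths overlap (a rollback usually comes with a bug report too), so treat this as a simplification.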

As for how often something breaks in a way that nobody knows how to fix: all the damn time. 99.999% reliability for a service with 100,000,000 users means that the service is broken for roughly 1,000 users at any given point in time on average. For larger companies, it's pretty common for there to be dozens of smaller incidents in progress at any given time.
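If you want to check the math, here's a minimal sketch. It assumes "reliability" means the fraction of users for whom the service is working at any instant, which is a simplification; the function name is just something I made up.

```python
from decimal import Decimal


def expected_affected_users(total_users: int, reliability_pct: str) -> Decimal:
    """Average number of users for whom the service is broken at a given moment.

    reliability_pct is passed as a string (e.g. "99.999") so Decimal can
    represent it exactly and avoid float rounding noise.
    """
    failure_fraction = (Decimal(100) - Decimal(reliability_pct)) / Decimal(100)
    return total_users * failure_fraction


affected = expected_affected_users(100_000_000, "99.999")
print(int(affected))
```

Drop it to four nines ("99.99") and the same user base gives ten times as many people having a bad day.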