Windows feature that resets system clocks based on random data is wreaking havoc

KelsonV@lemmy.world · 2 years ago

Z4rK@lemmy.world · 2 years ago

This bug has wreaked havoc for me. We had a “last synchronized” timestamp persisted to a DB so that the system could deal robustly with server restarts and bootstrapping in new environments.

The synchronization was used to continuously fetch critical incidents and visualize them on a map. The data came through a third-party API that broke down if we asked for too much data at a time, so we had to keep track of when we last fetched data and only ask for updates since then.
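To picture the “only ask for updates since then” part, here is a minimal Python sketch. It is not the code from this system: the function name `chunked_windows` and the one-hour cap are assumptions made up for illustration, and the real API limits are not stated in the post.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def chunked_windows(
    start: datetime, end: datetime, max_span: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (from, to) windows covering [start, end).

    Each window is at most max_span long, so no single request
    asks the API for too much data at once.
    """
    if max_span <= timedelta(0):
        raise ValueError("max_span must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + max_span, end)
        yield cursor, window_end
        cursor = window_end


# Hypothetical values: last sync at midnight, current time 02:30 UTC.
last_sync = datetime(2023, 8, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2023, 8, 1, 2, 30, tzinfo=timezone.utc)

# Three windows: 00:00-01:00, 01:00-02:00, 02:00-02:30.
windows = list(chunked_windows(last_sync, now, timedelta(hours=1)))
```

Note that a `start` later than `end` yields no windows at all, so a loop over them would finish without requesting anything and without raising an error.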

Each time the synchronization ran, it would persist an updated timestamp to the DB.

Of course this routine ran just as the server jumped several months into the future for a few minutes. After this, the last-run timestamp was some time next year. Subsequent runs of the synchronization routine never found any updates, as the date range they asked for didn’t make sense.

It just ran successfully without finding any new issues. We were quite happy about it. It took months before we figured out we actually had a major discrepancy in our visualization map.

We had plenty of unit tests, integration tests, and system tests. We just didn’t think of having one that checked whether the server had time traveled to the future or not.

lolcatnip@reddthat.com · 2 years ago

If I’ve learned one thing from the last decade of movie and TV sci-fi, it’s that you always need to account for the possibility of time travel.

xavier666@lemm.ee · 2 years ago

Reminds me of a “bug” in a genealogy program that crashed for a client. Turns out there was incest in the client’s family, and entering the relation in the software caused a loop in the family tree.

lolcatnip@reddthat.com · 2 years ago

Why put “bug” in quotes? If a program crashes because of unexpected user input, that’s always a bug.

deafboy@lemmy.world · 2 years ago

Unexpected input 😏

dublet@lemmy.world · 2 years ago

https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time and https://infiniteundo.com/post/25509354022/more-falsehoods-programmers-believe-about-time

SzethFriendOfNimi@lemmy.world · 2 years ago

That’ll be one weird regression test. Imagine the comment you’ll have to write to explain “why” this test exists.

Z4rK@lemmy.world · edit-2 2 years ago

While the root issue was still unknown, we actually wrote one. It sort of made sense. Check that date.from isn’t later than date.to in the generated range used for the synchronization request. Obviously. You never know what some idiot future coder (usually yourself some weeks from now) would do, am I right?

However, it was far worse to write the code that fulfilled the test. In the very same few lines of code, we fetched the current date from time.now() plus some time span as date.to, fetched the last synchronization timestamp from the DB as date.from, and then validated that date.from wasn’t greater than date.to, and if it was, logged an error about it.

The validation code made no logical sense when looking at it.
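Roughly what that check could look like in Python, as a sketch under assumptions rather than the real code: `build_sync_range`, `lookahead`, and `max_lookback` are hypothetical names, and the fallback to a bounded recent window is an added idea. The description above only mentions logging an error.

```python
import logging
from datetime import datetime, timedelta, timezone
from typing import Tuple

logger = logging.getLogger("sync")


def build_sync_range(
    last_sync: datetime,
    now: datetime,
    lookahead: timedelta,
    max_lookback: timedelta,
) -> Tuple[datetime, datetime]:
    """Build the (date_from, date_to) range for one synchronization request."""
    date_to = now + lookahead
    date_from = last_sync
    if date_from > date_to:
        # For possible time travel scenarios: the stored timestamp is
        # ahead of the clock, e.g. because the clock jumped forward once.
        logger.error(
            "last_sync %s is after date_to %s; clock jump suspected",
            date_from.isoformat(),
            date_to.isoformat(),
        )
        # Assumption: re-fetch a bounded recent window instead of
        # trusting the stored value.
        date_from = date_to - max_lookback
    return date_from, date_to


# Hypothetical scenario: the stored timestamp is 55 days ahead of the clock.
now = datetime(2023, 8, 16, 12, 0, tzinfo=timezone.utc)
poisoned = now + timedelta(days=55)
date_from, date_to = build_sync_range(
    poisoned, now, timedelta(minutes=5), timedelta(days=1)
)
```

The size of `max_lookback` is a trade-off: too small and updates from the stuck period are lost, too large and the request may be more than the API can handle in one go.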

SzethFriendOfNimi@lemmy.world · 2 years ago

Feels like writing

Assert.is(false, "This should never happen");

and seeing it pop up one time?

towerful · 2 years ago

I feel like the 3rd party API should have had some error checking, although that might have strayed too far into a client’s business logic.
If it is an API of incidents, that suggests past incidents. And the whole “never trust user data” kinda implies they should throw an error if you request information about a time range in the future.
I guess, not throwing an error does allow the 3rd party to “schedule” an incident in the future, eg planned maintenance/downtime.

But then, that isn’t separation of concerns. Ideally those endpoints would be separate. One for planned hypothetical incidents and one for historical concrete incidents.

It’s definitely an odd scenario where you are taking your trusted data (from your systems and your database), then having to validate it.
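A server-side check of the kind suggested here, for a historical-incidents endpoint, might look something like this Python sketch. Everything in it is hypothetical: `validate_incident_query`, `InvalidRangeError`, and the five-minute skew tolerance are made-up names and values, not anything the actual third-party API is known to do.

```python
from datetime import datetime, timedelta, timezone


class InvalidRangeError(ValueError):
    """Raised when a requested time range cannot refer to past incidents."""


def validate_incident_query(
    date_from: datetime,
    date_to: datetime,
    now: datetime,
    skew: timedelta = timedelta(minutes=5),
) -> None:
    """Reject ranges that are inverted or that start in the future.

    `skew` tolerates small clock differences between client and server.
    """
    if date_from > date_to:
        raise InvalidRangeError("date_from is after date_to")
    if date_from > now + skew:
        raise InvalidRangeError("date_from is in the future")


# Hypothetical server time.
server_now = datetime(2023, 8, 16, 12, 0, tzinfo=timezone.utc)

# A normal request for the last hour passes silently.
validate_incident_query(server_now - timedelta(hours=1), server_now, server_now)
```

With a check like this, a client whose stored timestamp had drifted months ahead would get an explicit error back instead of an empty, successful-looking response.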

xavier666@lemm.ee · 2 years ago

// for possible time travel scenarios
// DO NOT DELETE!

Z4rK@lemmy.world · 2 years ago

lol I have to add this to the code now 😝

JoBo@feddit.uk · 2 years ago

So we have mini-Y2Ks happening, at random, because MS is oblivious to anything outside its own ecosystem? Cool, cool.

Treczoks@lemmy.world · 2 years ago

I’ve read the stuff on STS and my first thought was: How can anyone be so stupid to try such a loony concept and still be able to create a working piece of code?

AutoTL;DR@lemmings.world · 2 years ago

This is the best summary I could come up with:

A few months ago, an engineer in a data center in Norway encountered some perplexing errors that caused a Windows server to suddenly reset its system clock to 55 days in the future.

“With these updated routing tables, a lot of people were unable to make calls, as we didn’t have a correct state!” the engineer, who asked to be identified only by his first name, Simen, wrote in an email.

Simen had experienced a similar error last August when a machine running Windows Server 2019 reset its clock to January 2023 and then changed it back a short time later.

Windows systems with clocks set to the wrong time can cause disastrous errors when they can’t properly parse timestamps in digital certificates or they execute jobs too early, too late, or out of the prescribed order.

The mechanism, Microsoft engineers wrote, “helped us to break the cyclical dependency between client system time and security keys, including SSL certificates.”

Simen and Ken, who both asked to be identified only by their first names because they weren’t authorized by their employers to speak on the record, soon found that engineers and administrators had been reporting the same time resets since 2016.