
Add draft for VM time synchronisation decisions #577

Open

kgube wants to merge 8 commits into main from 231-add-NTP-DR
Conversation

kgube (Contributor) commented Apr 24, 2024

kgube marked this pull request as draft April 24, 2024 11:11
kgube changed the title from "Add draft for VM clock synchronisation recommendations" to "Add draft for VM time synchronisation decisions" Aug 21, 2024
kgube marked this pull request as ready for review August 21, 2024 08:18
scoopex (Contributor) commented Aug 21, 2024

Had a discussion with @kgube about the motivation, goals, and contents of this DR.

Input/framework conditions that may be useful (still need to be evaluated):

  • It would be good to reference software systems that customers might run which use shared quorum algorithms (ZooKeeper, RabbitMQ, etcd, Consul, Hazelcast, Ceph), and the relevance of accurate system time for them
  • It would be good to note that using public (internet) NTP servers from behind the same S-NAT IP can lead to rate limiting: if dozens of systems in a project use the same NTP servers, those servers see dozens of NTP sessions coming from a single IP
  • SCS environments themselves should be operated with at least three central, CSP-local NTP sources (for Ceph, RabbitMQ, ...)
  • From the user's perspective, it must not matter whether overcommit, or a VM not being "scheduled", affects the quality of time synchronisation in the virtualization layer used
  • The CSP offers at least three local, non-rate-limited NTP servers that have at least five statically defined upstream stratum servers or high-quality local time sources
  • We can define a minimum quality (offset, jitter, frequency drift, ...) that is based on the requirements of common systems and provides some reserve to keep popular systems running without problems
  • The CSP ensures that time of at least this minimum quality can be maintained in VMs with a reference setup
    • a defined chrony setup/configuration that uses the minimum of three CSP NTP servers
    • this should be possible with all flavors (in some virtualization technologies the size of the virtual machine affects its scheduling and, related to that, its time synchronisation)
    • a health check service activates several permanently running VMs (e.g. three) with a single defined flavor, distributed across the CSP landscape, and checks their time quality to evaluate compliance
  • Subordinate, but an exciting idea: how to provide the flavor images with a standardized setup by default that can be used independently of the CSP (e.g. via a standardized setup mechanism, or standardized references to the servers)
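For illustration, the reference setup sketched above could look like the following chrony configuration. This is a hypothetical sketch, not an agreed SCS standard: the names ntp1–ntp3.csp.example are placeholders for the three CSP-local NTP servers.

```
# /etc/chrony/chrony.conf -- hypothetical sketch of the reference setup.
# ntp{1,2,3}.csp.example are placeholder names for the three CSP-local,
# non-rate-limited NTP servers; an actual CSP would substitute its own.
server ntp1.csp.example iburst
server ntp2.csp.example iburst
server ntp3.csp.example iburst

# Step the clock on a large initial offset instead of slewing slowly,
# so freshly booted VMs reach the target quality quickly.
makestep 1.0 3

# Persist the measured frequency drift across restarts.
driftfile /var/lib/chrony/drift
```

The quality metrics mentioned above (offset, jitter, frequency drift) can be read on a running VM with `chronyc tracking` and `chronyc sources -v`, which a health check service could parse to evaluate compliance.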

kgube (Contributor, Author) commented Oct 23, 2024

I discussed the potential upstream topic with the Neutron team and created an RFE issue for it.

The topic will also be discussed during the PTG; it is currently scheduled for the 2024-10-24 15:00 - 16:00 UTC timeslot.

kgube (Contributor, Author) commented Feb 19, 2025

Unfortunately I could not attend the PTG, but the topic was discussed and there were some questions on the scope of the feature, which were forwarded to the RFE ticket and which I answered.
In particular, both OVN and dnsmasq allow global DHCP options, so provided that the link-local NTP server address is the same in all subnets (which would be a design goal), we can configure it as a global option and no dynamic port-specific DHCP configuration is necessary.
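As a rough illustration of such a global option (not a settled design), the NTP server can be advertised via DHCP option 42; in dnsmasq syntax this is a single global line. The link-local address below is a made-up placeholder for the uniform per-subnet NTP address, not a value defined anywhere in SCS:

```
# dnsmasq.conf: advertise an NTP server (DHCP option 42) to all clients,
# with no per-port or per-subnet DHCP configuration needed.
# 169.254.169.123 is a placeholder link-local address, not a defined SCS value.
dhcp-option=option:ntp-server,169.254.169.123
```

Because the option is global, it applies to every subnet dnsmasq serves, which is exactly why a uniform link-local address across subnets would be required.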

If we want to proceed with pursuing this feature, it would probably be best to track it in a separate issue.
The next step would be to take the RFE ticket to a Neutron drivers meeting, get affirmation of the scope of the feature from the team, and ask for guidance on how to proceed with the implementation.

artificial-intelligence (Contributor) left a comment

I'm not sure whether jitter times and the like should be mandated, but I currently don't have much time to review and research this topic in depth, so I don't want to hold this up.

jklare (Contributor) commented Feb 26, 2026

@kgube @artificial-intelligence we are trying to clean up the stale PRs at the moment. This PR has not moved for quite some time, and I am wondering if there is still work needed here. @garloff since the issue referenced here is from you, I am wondering if you also have input?

mbuechse (Contributor) commented Feb 26, 2026

From my POV, if no one wants to take over here, this can simply be merged as a draft. (Maybe we can get @kgube to add the missing sign-offs.)

jklare (Contributor) commented Mar 9, 2026

Works for me. @kgube could you add the Signed-off-by to commits 5a58ee9 and 6f86fed, so this can be merged?

kgube added 7 commits March 9, 2026 10:17, each with:
Signed-off-by: Konrad Gube <konrad.gube@cloudandheat.com>
jklare (Contributor) commented Mar 9, 2026

@kgube thank you so much for the swift update!
@mbuechse could you give this a final review and merge it as a draft as suggested? Do we need to replace the XXXX in the filename, or is this intentional?

kgube (Contributor, Author) commented Mar 9, 2026

I rebased the branch, and apparently it wasn't up to date with the linting rules. I'll try to fix this too!

kgube added a commit (Signed-off-by: Konrad Gube <konrad.gube@cloudandheat.com>)
kgube (Contributor, Author) commented Mar 9, 2026

Okay, I fixed the MD lint error, but I'm not sure what to do about the scs/check, which fails in the post job. scs-check-adr-syntax itself appears to pass.

mbuechse (Contributor) commented Mar 9, 2026

@jklare will do. The XXXX needs to be replaced; I think 0129 would be next. I can handle that (though not today).

jklare (Contributor) commented Mar 9, 2026

@kgube thanks for fixing the linter issues; sadly, I also have no idea what the scs/check post error means. Given that it also seems to fail similarly for entirely different PRs, I would assume it is not related to yours, but rather a general pipeline issue.

fkr (Member) commented Mar 9, 2026

> @kgube thanks for fixing the linter issues, sadly I also have no idea what the scs/check post error means. Given that this also seems to fail similarly for entirely different PRs, I would assume it is not related to yours, but rather a general pipeline issue.

Is Zuul working properly? I'm just on my phone right now, but it seems that Zuul is not responding properly, e.g. the web frontend does not even open for me.

@garloff?

mbuechse (Contributor) commented Mar 9, 2026

The Zuul UI does work for me, but it shows that no job has completed successfully since yesterday afternoon:
https://zuul.sovereignit.cloud/t/scs/builds

kgube (Contributor, Author) commented Mar 9, 2026

I restarted the check, and it gave the following console output in the context of the error:

2026-03-09 13:05:41.551038 | POST-RUN START: [trusted : github.com/SovereignCloudStack/zuul-scs-jobs/playbooks/base/post-logs.yaml@main]
2026-03-09 13:05:42.214389 | 
2026-03-09 13:05:42.214519 | PLAY [Base post-logs]
2026-03-09 13:05:42.225989 | 
2026-03-09 13:05:42.226082 | TASK [generate-zuul-manifest : Generate Zuul manifest]
2026-03-09 13:05:42.537722 | localhost | changed
2026-03-09 13:05:42.551784 | 
2026-03-09 13:05:42.551936 | TASK [generate-zuul-manifest : Return Zuul manifest URL to Zuul]
2026-03-09 13:05:42.581185 | localhost | ok
2026-03-09 13:05:42.590532 | 
2026-03-09 13:05:42.590623 | TASK [Get cloud config from vault]
2026-03-09 13:05:43.388015 | localhost | Output suppressed because no_log was given
failure
2026-03-09 13:05:43.389916 | 
2026-03-09 13:05:43.389972 | PLAY RECAP
2026-03-09 13:05:43.390032 | localhost | ok: 2 changed: 1 unreachable: 0 failed: 1 skipped: 0 rescued: 0 ignored: 0
2026-03-09 13:05:43.390064 | 
2026-03-09 13:05:43.493642 | POST-RUN END RESULT_NORMAL: [trusted : github.com/SovereignCloudStack/zuul-scs-jobs/playbooks/base/post-logs.yaml@main]

--- END OF STREAM ---

(Unfortunately the output is not saved, so I had to capture it live.)
