feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154
feat: metricsV2 + oTel + prometheus sample and Grafana dashboard#3154csviri wants to merge 66 commits intooperator-framework:nextfrom
Conversation
c7e6ca2 to
ece63e8
Compare
88c118c to
cf9eb57
Compare
24f992f to
0438611
Compare
9014b2b to
fe4ab6a
Compare
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…torsdk/operator/sample/metrics/MetricsHandlingSampleOperator.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…tor/api/monitoring/Metrics.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
…operatorsdk/operator/sample/deployment.yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
d60193a to
cf114e3
Compare
| | Meter name (Micrometer) | Type | Tags | Description | | ||
| |------------------------------------------|---------|-----------------------------------|----------------------------------------------------------------------| | ||
| | `reconciliations.executions` | gauge | `controller.name` | Number of reconciler executions currently in progress | | ||
| | `reconciliations.active` | gauge | `controller.name` | Number of resources currently queued for reconciliation | |
There was a problem hiding this comment.
is "active" really a good name for this?
| private static final String RECONCILIATIONS_STARTED = RECONCILIATIONS + "started" + TOTAL_SUFFIX; | ||
|
|
||
| private static final String RECONCILIATIONS_EXECUTIONS_GAUGE = RECONCILIATIONS + "executions"; | ||
| private static final String RECONCILIATIONS_QUEUE_SIZE_GAUGE = RECONCILIATIONS + "active"; |
There was a problem hiding this comment.
queue would be better name
There was a problem hiding this comment.
The thing is that queue is a bit misleading, because this is actually the queue + ongoing reconiliations, that is the reason I added active. (ExecutorService has a queue). Can expand the docs on this
There was a problem hiding this comment.
Actually made an adjustment so we have now active and queue where queue is that are submitted to executor service but not started yet, the active is the number of actually running threads with reconiliation logic. This seems to be more intuitive and useful.
- deleted redundant delete events - adjusted metrics now we have queue and active reconciliations (with the intuitive seantics) Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Chris Laprun <metacosm@gmail.com>
| ``` No newline at end of file | ||
| ``` | ||
|
|
||
| ## Metrics interface changes |
There was a problem hiding this comment.
This would technically be an API break and would require a new major version.
There was a problem hiding this comment.
Strinctly sepaking yes, but such minor API changes we do some times, see the migration document. As other frameworks sometimes. It is basically I think a better choice in terms of tradeoff, since because we don't really want to increase the major verion that often and we on the other hand we have quite an amount of APIs, that sometimes better to evolve this way IMO.
I also was trying to do backwards compatible, we still could. But at the end it looked like that it would be more confusing, that just having a table to be able to easily migrate from current impl. If that makes sense.
|
|
||
| # Check if helm is installed, download locally if not | ||
| echo -e "\n${YELLOW}Checking helm installation...${NC}" | ||
| if ! command -v helm &> /dev/null; then |
There was a problem hiding this comment.
Installing helm should be made optional. People might not want to have helm automatically installed or want to install it themselves some other way.
There was a problem hiding this comment.
What this does, it checks if helm command works (so already installed), if not, downloads it as a binary, but just locally in the project dir (so does not install it), and uses that binary directly. So this is not very invasive at the end IMO, usually it is done this way in such scripts.
| import java.util.function.Consumer; | ||
| import java.util.function.Function; | ||
|
|
||
| import org.jspecify.annotations.NonNull; |
There was a problem hiding this comment.
This should not have been introduced here. Nullability should be reviewed at the SDK level, not simply here for 2 methods.
There was a problem hiding this comment.
removed this, not sure how it got here, good catch!
|
|
||
| \* `namespace` tag is only included when `withNamespaceAsTag()` is enabled. | ||
|
|
||
| The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure |
There was a problem hiding this comment.
SLO acronym should be used expanded first before.
There was a problem hiding this comment.
SLO I consider a well known term, but actually we can just remote that from here. done.
| \* `namespace` tag is only included when `withNamespaceAsTag()` is enabled. | ||
|
|
||
| The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure | ||
| compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OTLP registry, where |
There was a problem hiding this comment.
OTLP acronym should be used expanded first.
| if (resourceEvent.getAction() == ResourceAction.ADDED) { | ||
| gauges.get(numberOfResourcesRefName(getControllerName(metadata))).incrementAndGet(); | ||
| } | ||
| var namespace = resourceEvent.getRelatedCustomResourceID().getNamespace().orElse(null); |
There was a problem hiding this comment.
Shouldn't that be in sync with what's done for MDC, i.e. use a default value instead of null for clustered resources?
There was a problem hiding this comment.
This way at the end we don't add the namespace tag (if null), that is according the guidelines.
|
|
||
| private void incrementCounter( | ||
| String counterName, String namespace, Map<String, Object> metadata, Tag... additionalTags) { | ||
| final var tags = new ArrayList<Tag>(2 + additionalTags.length); |
There was a problem hiding this comment.
This shouldn't be a List but a Set.
There was a problem hiding this comment.
That is really an implementation detail IMO
| null, | ||
| metadata, | ||
| Tag.of(EVENT, event.getClass().getSimpleName()), | ||
| Tag.of(ACTION, UNKNOWN_ACTION)); |
There was a problem hiding this comment.
Removed this, supposedly if we cannot add meaningful value it should added, thank you pointing out
|
|
||
| int retryNumber = retryInfo.map(RetryInfo::getAttemptCount).orElse(0); | ||
| if (retryNumber > 0) { | ||
| incrementCounter(RECONCILIATIONS_RETRIES_NUMBER, namespace, metadata); |
There was a problem hiding this comment.
How does the retry number get propagated? Seems like the counter that is incremented is completely decorrelated from the actual retry number?
There was a problem hiding this comment.
The idea is that we measure the number of retries, so we increase this number if the actual reconiliation is a retry. What is interesting at the end in the metrics is the rate of the retries.
Signed-off-by: Chris Laprun <metacosm@gmail.com>
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>



Goal of this PR is to provide a OTel + Prometheus + Grafana setup. So we:
Notes on new metrics implementation: