Query Optimization in the Prometheus World
A common motivation we hear from engineering organizations running into challenges with Prometheus is around query performance. As the volume of metric data we collect increases, it's only natural that consuming that data via dashboards and alerts becomes more expensive. But what tools does Prometheus provide to help here? In this post, we'll look at some of the visibility that Prometheus provides into its query workload, and the options we have to improve the performance of slow queries.
What Determines Query Performance?
John Potocny
John is a senior sales engineer at Chronosphere with nearly a decade of experience in the monitoring and observability space. John started as an engineer working on time-series data collection and analysis before moving to a pre-sales/customer-support role, and has worked with numerous companies across industries to solve their unique observability challenges.
Before we dive into the details, let's discuss the biggest factors affecting query performance. Generally, when querying data, the biggest factor in overall performance, aside from things like hardware constraints, is how much data we have to consider and process to generate our result. We've probably all experienced this in practice: both simple and complex queries return quickly against a small dataset, but as the amount of data scales up, performance degrades accordingly. For time-series databases like Prometheus, we can focus on two factors in particular that determine how much data a given query will consider:
- How many distinct series does the query have to process?
- How many data points does the query have to process?
Between these two, we generally care more about how many series are processed by a given query, although if your query looks over a very long time range, the number of data points/samples starts to matter as well. That's why it's usually recommended to downsample metric data to a lower resolution when storing it for longer periods of time. Besides saving money on storage, it also gives us a significant improvement in performance when we want to query over weeks or months of data at a time.
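As a rough illustration of why both factors matter (the series count and intervals here are just example numbers): a query that touches 500 series over 30 days of 15-second data has to read roughly 500 × (30 × 86,400 / 15) ≈ 86 million samples, while the same query against data downsampled to 5-minute resolution reads only about 4.3 million.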
What Kind of Observability Does Prometheus Give Us?
Now that we've looked at what determines the performance of Prometheus queries, let's dive into the tools that help us identify whether there are queries that need to be optimized. First, if you have Prometheus configured to scrape itself, it does include some high-level metrics on the query workload, such as information on time spent executing user-issued queries vs. automatically executed rules, like recording or alerting rules. This helps us keep track of the general performance of our Prometheus queries, as well as do things like alert us if a rule group is getting close to taking longer than its execution interval.
Note that queue_time is basically 0 in the image above; this is what we generally expect, unless the server is seeing more simultaneous requests than allowed by the server's query.max-concurrency flag.
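As a sketch of what the rule-group alert mentioned above can look like, here is a minimal alerting rule built on the self-scrape metrics prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds (the 80% threshold, group name and severity are arbitrary examples):

groups:
  - name: prometheus-meta
    rules:
      - alert: RuleGroupCloseToInterval
        # Fires when a rule group's last evaluation took more than 80% of its
        # configured evaluation interval.
        expr: prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} is close to exceeding its evaluation interval"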
What's noticeably missing here is any kind of information about the overall efficiency of the queries being run against Prometheus. There are no metrics that tell us how many series/data points are being fetched from the database vs. returned to the client. This presents a problem, since, as we noted above, how much data is being read vs. returned is the best way to understand how expensive our queries are, relatively speaking. We can still use the available metrics to detect whether queries are slow, but there's additional work involved to understand whether that's because they're fetching a lot of different series, fetching a lot of data points, or whether there's another issue, such as a lack of resources available to Prometheus.
Besides understanding whether queries are efficient and performant in general, the other thing we want to understand is which queries are the most expensive in our workload. From the available metrics, we can see whether Prometheus is spending more time evaluating alerting and recording rule queries vs. ad hoc queries, and which rule groups are most expensive in the case of rule queries causing issues. What we don't see here, though, are the queries themselves; that would be too much cardinality for Prometheus to emit back into itself.
To get a sense of the performance of specific queries, we can enable the Prometheus query log, which writes a JSON object containing details about every query executed against the server to a specified log file. Here's an example of the output:
{
  "params": {
    "end": "2020-02-08T14:59:50.368Z",
    "query": "up == 0",
    "start": "2020-02-08T13:59:50.368Z",
    "step": 5
  },
  "stats": {
    "timings": {
      "evalTotalTime": 0.000447452,
      "execQueueTime": 7.599e-06,
      "execTotalTime": 0.000461232,
      "innerEvalTime": 0.000427033,
      "queryPreparationTime": 1.4177e-05,
      "resultSortTime": 6.48e-07
    },
    "samples": {
      "totalQueryableSamples": 80,
      "peakSamples": 24
    }
  },
  "ts": "2020-02-08T14:59:50.387Z"
}
The "timings" information here tells us how long the given query spent in the various execution phases, similar to the metrics we saw above. Included as of Prometheus release v2.35, the "samples" statistics tell us the total samples fetched, along with the peak number of samples processed concurrently during the query's execution; this relates to the server's query.max-samples flag, so values that are too high here tell us whether the server's limit needs to be increased. (Note that increasing this may also increase memory usage.) In addition to the per-query statistics, the file will contain details of the HTTP client/endpoint that initiated the query, or the name of the rule group it's associated with in the case of alerting/recording rules.
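Enabling the query log itself is a one-line change; a minimal sketch of the relevant prometheus.yml section (the file path is just an example):

global:
  scrape_interval: 15s
  # Writes one JSON object per query, like the example above, to this file.
  query_log_file: /var/log/prometheus/query.log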
This is useful information, but again there are some drawbacks to the solution:
- In a complex environment, it will be difficult to identify where a particular user-initiated query originates from. Ideally, we'd know details like the name of the dashboard it's associated with, but unfortunately, Prometheus can't determine that on its own.
- We don't have the ability to limit what's logged to the query log, which means in a production environment it will grow very quickly. Ideally, we could set a threshold to only log queries that take longer than a certain duration, or that fetch a certain number of time series/data points when they execute, so we only log the expensive queries that we're interested in analyzing.
How Can We Optimize Our Queries?
We've seen the options we have to identify slow or inefficient queries within Prometheus. So what options do we have to optimize them? We noted at the start that query performance with PromQL is mostly determined by how many series/data points a query has to operate against, which means we should focus on ways to reduce the number of series/data points a query has to fetch. Broadly speaking, there are a few strategies available to us:
Shorten the Timeframe the Query Is Run Against
This is obviously not ideal, since it limits our ability to consider trends in the data we're looking at, but it's also probably the quickest option when you have a slow query and need to get results more quickly.
Reduce the Resolution of the Metrics Being Queried
If we lower the resolution of our data, we reduce the number of data points that have to be processed to evaluate our query over a given time range. This can be a reasonable approach, particularly when we're trying to improve the performance of queries that look back over a very long window of time. It can be tricky to do in Prometheus, though. Because Prometheus doesn't support downsampling data, the only way to control the resolution of our metrics is through the scrape interval of the jobs we configure. That doesn't give us a lot of flexibility. If you want fine-grained data for analyzing recent behavior and lower-resolution data for long-term trends, you have to configure two duplicate scrape jobs with different intervals, which means more load on Prometheus, more complexity and a more confusing user experience, since users need to explicitly choose which job to query against in their queries. Some long-term storage solutions for Prometheus do support downsampling data, though, so if you're running into this problem, they're worth considering if you're not already using one.
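For reference, the duplicate-scrape-job workaround looks something like the following sketch (the job names, intervals and target address are hypothetical, and queries have to pick one of the two job labels explicitly):

scrape_configs:
  - job_name: node-high-res    # fine-grained data for analyzing recent behavior
    scrape_interval: 15s
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: node-low-res     # coarser data intended for long-range queries
    scrape_interval: 2m
    static_configs:
      - targets: ["node-exporter:9100"]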
Reduce the Number of Series Being Queried
This is generally how Prometheus queries are optimized: through the use of recording rules that pre-aggregate the time series being queried under a new metric (a sketch of such a rule follows this list). Recording rules do have a few drawbacks, though:
- They need to be defined for each metric/query you're trying to optimize, which means you can quickly end up managing hundreds or even thousands of them as your metric use cases and data volumes grow.
- Additionally, recording rules run in the background on the system, so you're adding constant load to the database to evaluate them. This means that adding more rules to optimize different queries can become a significant source of work for Prometheus, and leads to its resource needs growing faster than you might expect.
- It's also worth noting that because rules are evaluated periodically, there can be delays before data from a given rule is available compared to when the raw series are available. This gets even worse if a recording rule queries metric names that are themselves generated by another recording rule. The worst-case time to new data availability becomes the scrape interval of the data plus the execution interval of all the rules involved in the chain.
- Finally, we have to remember to use the result of our recording rules in place of the original query everywhere; otherwise, we don't see any benefit. This means our rules need to be discoverable and understood by end users; they can't be added transparently in the background to automatically speed up our queries.
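As a sketch of what such a rule looks like, here is a rule file that pre-aggregates a hypothetical http_requests_total metric down to one series per service (the metric, label and group names are examples, not from the original post):

groups:
  - name: api-aggregations
    interval: 1m
    rules:
      - record: service:http_requests:rate5m
        # Collapses per-instance/per-path series into one series per service.
        expr: sum by (service) (rate(http_requests_total[5m]))

Every dashboard and alert that previously computed the sum by (service) expression directly would then have to be updated to query service:http_requests:rate5m for the rule to pay off, which is exactly the discoverability concern noted in the last bullet above.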
In addition to using recording rules, we can also look at removing dimensions from the metrics our applications emit so that fewer time series are collected; the obvious tradeoff is that fewer dimensions means less granular insight into our systems. It's also not always an option. If you are dealing with metrics from an off-the-shelf application rather than one you've instrumented yourself, it's not possible to reduce the cardinality of the data being collected without aggregation.
How Chronosphere Can Help
If you've read through this post because you're having trouble with query performance in your Prometheus setup, you're not alone! As we said at the beginning, this is a common problem that we see, and one that we help with all the time.
Chronosphere customers frequently see a significant improvement in query performance simply by upleveling from Prometheus, and they have all of the familiar tools available to them to optimize query performance, as well as new ones like Chronosphere's Aggregation Rules and our Query Builder to help understand what makes a slow query inefficient. We also provide our customers a detailed view of how they're querying the system, so it's easy to understand how the workload is behaving overall.
If you're interested in hearing more about how we can help improve your Prometheus experience, let us know. We'd love to hear from you!