Wednesday, October 07, 2009

Jopr, RHQ and the availability check interval (updated)

Jopr and RHQ periodically check the availability of each resource. This is done in the agent, where a periodic thread calls the getAvailability() method of each ResourceComponent. After the scan, the result is sent to the server, which shows the red or green state on the resource.
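To make the mechanism concrete, here is a minimal Java sketch of such a periodic scan. It is not the RHQ agent code: Availability, MonitoredComponent and AvailabilityScanner are stand-in names invented for this illustration, and counting a component that throws an exception as DOWN is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Stand-ins for illustration only; these are NOT the RHQ classes.
enum Availability { UP, DOWN }

interface MonitoredComponent {
    Availability getAvailability();
}

class AvailabilityScanner {
    private final Map<String, MonitoredComponent> components;

    AvailabilityScanner(Map<String, MonitoredComponent> components) {
        this.components = components;
    }

    // One scan: ask every component for its state.
    // Assumption of this sketch: a component that throws counts as DOWN.
    Map<String, Availability> scanOnce() {
        Map<String, Availability> report = new LinkedHashMap<>();
        for (Map.Entry<String, MonitoredComponent> entry : components.entrySet()) {
            Availability state;
            try {
                state = entry.getValue().getAvailability();
            } catch (RuntimeException ex) {
                state = Availability.DOWN;
            }
            report.put(entry.getKey(), state);
        }
        return report;
    }

    // Run scanOnce() every periodSecs seconds and hand each report to the sender
    // (in the real system the report would travel to the server).
    ScheduledExecutorService start(long periodSecs, Consumer<Map<String, Availability>> sender) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(() -> sender.accept(scanOnce()), 0, periodSecs, TimeUnit.SECONDS);
        return executor;
    }
}
```

Calling start(60, report -> System.out.println(report)) would run one scan per minute; the returned executor should be shut down when the scanner is no longer needed.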

This server-side processing puts considerable stress on large environments (say, hundreds of agents with tens of thousands of resources), so in Jopr 2.3 we increased the interval between these checks to 5 minutes.

For many use cases (smaller installs, testing) this is too long. Luckily, this interval is not cast in stone, but configurable in the agent settings file, conf/agent-configuration.xml:

<!--
Defines how often an availability scan is run. This type
of scan is used to determine what resources are up and running
and what resources have gone down. The value is specified in
seconds.
-->
<!--
<entry key="rhq.agent.plugins.availability-scan.period-secs" value="300"/>
-->

As you can see, the default is shown as a commented-out value. To change it, remove the XML comment markers around the entry and set the value back to e.g. 60 seconds (it probably does not make sense to go lower than that, as the load on the agent, your managed resources and the Jopr server will increase again):

<entry key="rhq.agent.plugins.availability-scan.period-secs" value="60"/>

After this is done, you need to restart the agent to have it read the new value (see also next paragraph).


As the agent writes its configuration into the Java Preferences as a backing store, the above change will not be honored directly. This may sound strange at first, but it has the advantage that you can run an agent, remove it, install a newer version and have the new version automatically use the saved values of the first agent install.

So you need to tell the agent that it should indeed read the new configuration file. This can be done by starting the agent with the option --cleanconfig or, better, by supplying -c agent-configuration.xml. This is explained at the top of the agent-configuration.xml file and also in the yellow box in the agent install document.
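The following Java sketch models why an edited file is ignored on a normal restart. It is a simplified illustration and not the agent's actual startup code: the class AgentConfigSketch, the method startAgent and its forceFile parameter (standing in for the start options described above) are hypothetical, and the value from the file is passed in as a plain int instead of being parsed from XML. Only java.util.prefs.Preferences is the real JDK API.

```java
import java.util.prefs.Preferences;

// Hypothetical model of the behaviour described above; not RHQ code.
class AgentConfigSketch {
    static final String KEY = "rhq.agent.plugins.availability-scan.period-secs";
    static final int DEFAULT_SECS = 300;

    // fileValue: the value currently written in the configuration file.
    // forceFile: true models a start that tells the agent to re-read the file.
    // Returns the period the agent would actually use after this start.
    static int startAgent(Preferences store, int fileValue, boolean forceFile) {
        boolean firstStart = store.get(KEY, null) == null;
        if (firstStart || forceFile) {
            // Only in these two cases does the file content reach the backing store.
            store.putInt(KEY, fileValue);
        }
        // On every other start the previously saved value wins.
        return store.getInt(KEY, DEFAULT_SECS);
    }
}
```

With a fresh preferences node, startAgent(node, 300, false) stores and returns 300. A later startAgent(node, 60, false) still returns 300 because the saved value shadows the edited file, while startAgent(node, 60, true) returns 60.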

Of course this applies to JBoss ON 2.3 as well.


Vinicius Carvalho said...

This is great, should help a lot. But we need an immediate response when a server goes down. Not for all resources, but if one of our Tomcat servers is down, we need a response ASAP, and 60 seconds is an eternity for an angry customer. Is there another way to set the availability check per resource, so we could have something like 10s for our Tomcat servers?

Heiko W. Rupp said...

This is on our "should-do" list and we could use some help here (which open source project couldn't? :-)

An idea would be to allow ResourceTypes and individual resources to
override the default, in addition to the global agent option. We need to be careful though that this does
not create a performance issue (e.g. setting all 100,000 resources to 1 second). Limiting the shorter intervals
to platform and server types could already help a lot though.

This is no trivial change, as we have a backfilling system (i.e. if the agent is down for some time, we mark its resources as down) that is tightly coupled with the availability scan period, so this would need some love too.
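As a rough illustration of the override idea, the lookup could be sketched in Java as below. Nothing like this exists in the agent as far as this post describes: ScanPeriodResolver, its methods and the lower bound minPeriodSecs are made up for this sketch, and the backfilling side is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-type and per-resource scan periods; not RHQ code.
class ScanPeriodResolver {
    private final int globalPeriodSecs;   // the global agent option
    private final int minPeriodSecs;      // floor that guards against overly short intervals
    private final Map<String, Integer> typeOverrides = new HashMap<>();
    private final Map<Integer, Integer> resourceOverrides = new HashMap<>();

    ScanPeriodResolver(int globalPeriodSecs, int minPeriodSecs) {
        this.globalPeriodSecs = globalPeriodSecs;
        this.minPeriodSecs = minPeriodSecs;
    }

    void overrideType(String resourceType, int periodSecs) {
        typeOverrides.put(resourceType, periodSecs);
    }

    void overrideResource(int resourceId, int periodSecs) {
        resourceOverrides.put(resourceId, periodSecs);
    }

    // Most specific setting wins: resource, then resource type, then global default.
    // The result is never allowed to drop below minPeriodSecs.
    int periodFor(int resourceId, String resourceType) {
        Integer period = resourceOverrides.get(resourceId);
        if (period == null) {
            period = typeOverrides.get(resourceType);
        }
        if (period == null) {
            period = globalPeriodSecs;
        }
        return Math.max(period, minPeriodSecs);
    }
}
```

With new ScanPeriodResolver(300, 10) and a type override of 10 seconds for "Tomcat Server", every Tomcat server is scanned every 10 seconds and everything else stays at 300. A per-resource override of 1 second would be raised to the 10 second floor.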