Thursday 17 April 2008

zenoss: creating zencommand custom plugins

Loads of work for me on Zenoss over the last few weeks.
FINALLY the class system, zProperties, RRD templates, etc. are making some sense inside my head. We want to monitor buckets of data.


I posted this here to see what people think:
http://community.zenoss.com/forums/viewtopic.php?p=18670#18670

I added tons of stuff to the Zenoss_configuration page in the wiki yesterday, and some more today.
I was up till 3:30am last night and only stopped when windows/mouse died.
I managed to put a Zenoss template containing several graphs and datasources into a ZenPack, then deleted it, and was not able to load it back. ARGH! So that is something else to figure out.


I had a couple of awkward problems to solve when getting a zencommand plugin to work so I thought I would share. If my analysis is correct this information might also be useful to others and could possibly be used to enhance the documentation on creating zencommand plugins.

So please take a look and see if this is useful.

J.

== Tip when working with zencommands. ==

If ONE data point in a whole template has an error (a template can have multiple graphs, data sources and data points), then the data does not get stored in rrdtool, even though testing with "zencommand run" reports a "storing" message. It might be safer to define datasources and graphs in a few separate templates instead of one template.
That way an error in one piece of data would not cause all data retrieval to stop.

I'm still finding it a little awkward how best to manage classes and attributes in zenoss.

== Large data values received had a problem being stored if set to AVERAGE, use COUNTER instead ==

Value returned was FramesReceived=74609722 but rrdtool tried to insert 74609722.0.


2008-04-17 12:36:12 ERROR zen.RRDUtil: rrd error not a simple integer: '74609722.0' Devices/frigg.ie.commprove.test/iubTGenStatusTxt_FramesReceived


Fix is easy, use COUNTER instead of AVERAGE for those large data types.

(A value monitored for a long time could grow and suddenly cause this kind of problem in a surprising way.)
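To illustrate where a trailing ".0" can come from, here is a minimal Python sketch. It assumes (I have not confirmed this in the Zenoss source) that the value is parsed as a float somewhere and then turned back into a string before it is handed to rrdtool. The function names are hypothetical and are not Zenoss code.

```python
# Sketch only: shows how an integer reading picks up ".0" when it goes
# through a float. The helper names are hypothetical, not part of Zenoss.

def as_float_string(raw):
    """Parse the raw reading as a float and format it again."""
    return str(float(raw))

def as_integer_string(raw):
    """Parse the raw reading and format it as a whole number."""
    return str(int(float(raw)))

raw_value = "74609722"  # FramesReceived as returned by the plugin

print(as_float_string(raw_value))    # 74609722.0
print(as_integer_string(raw_value))  # 74609722
```

The first form matches the string in the error message above; the second is the "simple integer" form.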

== zencommand has some problems scaling. ==

... if you quickly hack together a plugin! :)

"ERROR zen.zencommand: [Errno 24] Too many open files"

I see per-process open file limits of 254 and 258 on Solaris 10 boxes.

zencommand fires off multiple processes (number of servers * data sources) to retrieve data.

One may edit zencommand.py and reduce MAX_CONNECTIONS
gsed -i s/MAX_CONNECTIONS=256/MAX_CONNECTIONS=16/ $ZENHOME/Products/ZenRRD/zencommand.py

Problems were reduced but not eliminated.
There were 15 servers in the list and 3 zencommand datasources: 15 * 3 = 45, which is not too many.
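A quick way to compare that arithmetic with the limit on a given box is a few lines of Python. This is only a rough sketch: the `fds_per_command` figure of 3 (one pipe each for stdin, stdout and stderr) is my assumption, not something taken from zencommand, and the function names are made up for illustration.

```python
import resource

def open_file_limit():
    """Return the (soft, hard) per-process open file limits."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def estimated_descriptors(servers, datasources, fds_per_command=3):
    """Rough estimate of descriptors needed if every command runs at once.

    fds_per_command is an assumption (stdin, stdout, stderr pipes).
    """
    return servers * datasources * fds_per_command

soft, hard = open_file_limit()
needed = estimated_descriptors(servers=15, datasources=3)
print("soft limit: %s, hard limit: %s" % (soft, hard))
print("estimated descriptors for 15 servers * 3 datasources: %d" % needed)
```

If the estimate gets anywhere near the soft limit, "Too many open files" becomes a lot less surprising.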

If you implement a zencommand plugin that causes zencommand to use up file handles you can run into this issue. I have a generic plugin which is called with different parameters to feed several data sources. The performance template was applied to a list of solaris servers.

The first incarnation of the plugin is a script which calls the Nagios plugin check_http and also uses a temporary file. I figure the plugin is too slow/heavy: bash is spawned, and then another process to call check_http. Essentially, best practice for plugins is to keep them very light and do any messing/work on the server side.
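As a sketch of what "very light" could look like, here is a single-process check in Python that does the HTTP request itself, with no shell, no second process and no temporary file. It is not my actual plugin. The function name `check_http`, the data point names `time` and `size`, and the `opener` parameter (there only so the function can be tried without a network) are all made up for illustration. The output follows the Nagios plugin convention of "status text|name=value ...", and the exit codes 0 and 2 are the usual Nagios OK and CRITICAL.

```python
import time
import urllib.request

def check_http(url, timeout=10, opener=urllib.request.urlopen):
    """Fetch url in-process and return (exit_code, output_line).

    The output line uses the Nagios style "text|name=value name=value".
    """
    start = time.time()
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
            status = response.status
    except Exception as exc:
        # Any failure (timeout, refused connection, HTTP error) is CRITICAL.
        return 2, "HTTP CRITICAL: %s" % exc
    elapsed = time.time() - start
    line = "HTTP OK: status %d|time=%.3f size=%d" % (status, elapsed, len(body))
    return 0, line

# Example wiring for a command-line plugin (URL is a placeholder):
#   code, line = check_http("http://server.example/status")
#   print(line)
#   sys.exit(code)
```

Everything happens inside one process, so each run of the command costs one process and one socket on the Zenoss side.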

We wish to add more servers and more data sources using plugins like this, so it must scale. There is a limit on the resources plugins are allowed to use, which may not be very obvious.
