Supporting Windows monitoring has been in our plans since we began Knock. Here it comes.
We are used to Windows; I even started my career on Windows platforms (ASP.NET, C#, Exchange, IIS, SQL Server, MSMQ, WebServices, NT services and more).
Then I migrated to Linux. It was painful. The two systems are very different, but in the end, Linux rocks.
We started Knock with Linux support, Debian first, then CentOS, which was pretty straightforward.
Knock daemon is pure Python and coroutine based (gevent).
So we need Knock daemon on Windows, with a solid code base and unified metrics and alerts across Linux and Windows systems. So how do we achieve this well and fast, very fast?
C# or not C#
We knew we would have two options.
Option 1 was to perform a full rewrite of Knock Daemon in C#.
- Native code base, native packaging and installer tools
- We have extensive C# knowledge
- Two code bases to maintain
- Full rewrite required, with unittest coverage (transport layer, probes manager and scheduler, all other APIs)
- Different development tools and build systems
- We like python and co-routines.
- And we like python.
Option 2 was to keep the python code base and handle Windows with it.
- Unified code base, already heavily tested
- Same tools and build system
- Not native
- We need our development environment (git, virtualenv, PyCharm and so on)
- Requires NT Service, Win32 API, WMI access, Event log support and more
- A way to package Python for Windows and an installer
- Risky (bugs, resource usage)
The obvious answer was to go with C#.
However, having to maintain two complete code bases was a real blocker. So we took a couple of weeks to investigate option 2. And to be honest, we were not optimistic...
The development environment is critical: I cannot code with vim, nor with notepad.
We are using git. We are using PyCharm, it's good, it's Java based.
Let's install Python and Git for Windows (which comes with Git Bash, what a good idea), then PyCharm (which works like a charm on Windows); git clone, open the project, and we are in a familiar world.
We also need redis (which is ported) and squid (ported too) for some unittests.
Virtualenv is critical too. We use it for development and for the build system.
After some testing, we ended up with "pip install virtualenv", which worked fine.
OK, we rely on the old "cmd.exe" (which is in the same state as ten years ago, meaning horrible), but it works.
The NT Service was indeed a blocker: we need one. As we have our dev env ready, we can move ahead.
I was mentally prepared for a massive headache at this stage.
Pypiwin32 was the medicine.
In two hours (!!), we had a skeleton of a Python NT service, working in debug mode, installing in the system, starting, stopping, restarting, uninstalling.
With our NT Service working, the next step was to package it.
Requirements were simple: a standalone executable (no dependencies to install).
It took a bit longer. We tested several tools without success.
We ended up with PyInstaller. It works fine and creates a fully standalone EXE file, which is just perfect.
The only trick is that it scans ".py" files and was not picking up our probe implementations, which are not explicitly imported but dynamically loaded. Butcher mode on: we added a registry class which just references all probes (and does nothing else). With that, PyInstaller is happy.
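Here is a minimal sketch of the idea (not the actual Knock code). The stdlib modules `json` and `csv` stand in for real probe modules, and `PROBES` and `load_probe` are hypothetical names. The static `import` statements are what the PyInstaller analysis can see, while the lookup by name stays dynamic.

```python
# Static imports: PyInstaller follows import statements, so every module
# named here ends up bundled in the EXE.
# json and csv are stand-ins for real probe modules (assumption).
import csv
import importlib
import json

# Hypothetical registry: it only references the modules and does nothing else.
PROBES = {
    "json": json,
    "csv": csv,
}


def load_probe(name):
    """Load a probe module by name, the way a dynamic loader would."""
    if name not in PROBES:
        raise KeyError("probe not registered: %s" % name)
    # importlib returns the already imported (and therefore bundled) module.
    return importlib.import_module(name)
```

The registry is ugly but cheap: adding a probe means adding one import line and one dictionary entry.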
OK, I don't like Windows installers.
We tested several tools, it took two days (long...) and we selected the WiX Toolset.
It handles NT services, INI file manipulation and a customizable interface, and it generates a working MSI.
We discovered at the end that the MSI file needs to be signed with a code signing certificate (my god, why so much pain?) to avoid some warnings during installation. You have to buy the certificate, and to buy it you have to provide official documents. Well, this is ongoing and will come later, at low priority.
WMI and Win32
Before going ahead with our code base, we still have two blockers.
We need WMI access for the metrics and (maybe) Win32 API access.
The Win32 APIs should be handled by Pypiwin32.
A Python package, WMI, seems to offer WMI access. We need to ensure it works fine and that we have the required metrics for our operating system related probes.
The WMI package worked fine, but fetching full WMI objects was sometimes slow (we ended up using almost only WQL queries). We will go async (as usual in these cases), with some tricks for some fast refreshes. Good.
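To illustrate the WQL approach, here is a small sketch that builds a query selecting only the needed fields. `build_wql` is a hypothetical helper, and `Win32_PerfFormattedData_PerfOS_Memory` with its `AvailableBytes` field is just an example class. With the WMI package, such a string can be passed to the `query` method of a `wmi.WMI()` connection; that part is left in a comment since it only runs on Windows.

```python
def build_wql(wmi_class, fields, where=None):
    """Build a WQL SELECT limited to the given fields."""
    if not fields:
        raise ValueError("at least one field is required")
    wql = "SELECT %s FROM %s" % (", ".join(fields), wmi_class)
    if where:
        wql += " WHERE %s" % where
    return wql


query = build_wql("Win32_PerfFormattedData_PerfOS_Memory", ["AvailableBytes"])

# On a Windows box with the WMI package installed (not run here):
#   import wmi
#   rows = wmi.WMI().query(query)
```

Selecting a handful of fields avoids materializing every property of every instance, which is where the time went in the slow cases.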
Then we checked what WMI counters have in stock for our metrics. Pretty straightforward, with some minor issues:
- fields documented but not populated
- some network drivers not populating some counters
- massive headaches with CPU and memory counters
- threads instead of processes running
- formatted vs raw data
- load support
We finally got a go!
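The "formatted vs raw" point deserves a short illustration. Raw counters are cumulative, so a usable value needs two samples. The sketch below assumes the raw processor class `Win32_PerfRawData_PerfOS_Processor`, whose `PercentProcessorTime` accumulates idle time in 100 ns units (an inverse timer) against `Timestamp_Sys100NS`. `cpu_percent_from_raw` is a hypothetical helper, not the actual probe code.

```python
def cpu_percent_from_raw(sample0, sample1):
    """Compute CPU usage (%) between two raw samples.

    Each sample is a (PercentProcessorTime, Timestamp_Sys100NS) tuple.
    Assumption: the counter is an inverse 100 ns timer (it accumulates
    idle time), so usage = 100 * (1 - delta_counter / delta_time).
    """
    counter0, time0 = sample0
    counter1, time1 = sample1
    elapsed = time1 - time0
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    usage = 100.0 * (1.0 - float(counter1 - counter0) / elapsed)
    # Clamp: counter jitter can push the value slightly out of range.
    return max(0.0, min(100.0, usage))
```

For example, if the idle counter advanced by 7,500,000 units during a one second window (10,000,000 units), the CPU was busy 25% of the time.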
Knock Daemon code base
Here comes the big deal. Let's fire our unittests on Windows.
I was anticipating issues; I was wrong.
All packages installed fine, gevent was rock solid and, after minor code updates, all low level unittests were green.
We adapted the low level probes a bit to support the Linux and/or Windows execution pipelines, moved from a UDP domain socket to a standard UDP socket, integrated all that in the unittests, and everything was super green in a day.
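A sketch of what a standard UDP socket on the loopback interface looks like, which works the same way on Linux and Windows. The helper names, the loopback address and the payload format are hypothetical, chosen only for the example.

```python
import socket


def make_udp_receiver(host="127.0.0.1", port=0):
    """Bind a standard UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(2.0)
    return sock


def send_udp(address, payload):
    """Send one datagram to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, address)
    finally:
        sock.close()


# Round trip on the loopback interface.
receiver = make_udp_receiver()
send_udp(receiver.getsockname(), b"probe.cpu 42")
data, _ = receiver.recvfrom(4096)
receiver.close()
```

Binding to 127.0.0.1 keeps the traffic local to the box, which is the property the domain socket gave us for free.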
The next day we set up the Knock daemon NT Service, with unittests.
The event log :(
Then we lost two days. The winner was the Event log.
Instead of storing plain text buffers, they (who??) had the big idea to handle logs with formatted string buffers (kind of templates), with the magic goals of internationalizing the logs (just go English, man) and reducing the event log sizes (I need a minute to recover from that one).
I had already encountered that, but I forgot (maybe too messy for my poor brain).
To ease all this mess, these formatted string buffers have to be in a DLL (lol).
We ended up with win32evtlog.pyd and event id 1 (which is an undocumented "%s", gg guys).
We did some minor updates in our low level logger to handle log file rotation on Windows (which does not have logrotate) using TimedRotatingFileHandler.
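A minimal sketch of such a handler setup with the standard library. The rotation schedule (`when="midnight"`, `backupCount=7`), the logger name and the file name are assumptions for the example, not the values used in Knock.

```python
import logging
import os
import tempfile
from logging.handlers import TimedRotatingFileHandler


def build_rotating_logger(log_path, name="knock.sketch"):
    """Return a logger that rotates its file at midnight, keeping 7 files."""
    handler = TimedRotatingFileHandler(
        log_path, when="midnight", backupCount=7, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger, handler


# Demo in a temporary directory so nothing is written elsewhere.
log_dir = tempfile.mkdtemp()
log_file = os.path.join(log_dir, "daemon.log")
logger, handler = build_rotating_logger(log_file)
logger.info("daemon started")
handler.close()
```

The rotation happens inside the process, so no external tool such as logrotate is needed.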
We finished with some tricks about file paths (which are pretty dirty but avoid storing stuff inside the Windows registry).
The Daemon service was up, packaging and MSI tested.
We integrated this into our build system and tested the MSI again.
With solid foundations, Windows probes implementation was easy.
As expected, we remapped Windows metrics to existing ones (almost) without issues.
95% of the stuff is WMI based, except socket states, which are based on Win32 API calls.
Then we encountered some drift in probe scheduling on some Windows boxes. Not sure what the real root cause is at this stage (gevent, Windows, both...). We rewrote this part a bit and boosted the scheduling intervals a bit on Windows to avoid gaps in graphs.
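One common way to limit this kind of drift (a generic illustration, not necessarily what was changed in Knock) is to compute each deadline from a fixed start time instead of sleeping a fixed interval after each run. `next_deadline` is a hypothetical helper.

```python
import math


def next_deadline(start, interval, now):
    """Return the first tick of the grid start + k * interval that is > now."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if now < start:
        return start
    ticks_done = math.floor((now - start) / interval)
    return start + (ticks_done + 1) * interval
```

With a 10 second interval, a probe that finishes late at t = 12.5 is scheduled for t = 20, so the delay of one run does not accumulate into the following ones.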
The last issue was the NT Service starting slowly under heavy OS pressure at boot (slow disks, huge packs of services starting up, standalone EXE).
Step one was to delay service start. Not enough.
We moved the start signal a bit sooner and enabled a restart-on-failure action with a 2 minute delay, set through the WiX Toolset at service install.
This should be acceptable; moreover, we have no easy way to configure the delayed start delay and the (stupid) 30 second timeout enforced by the Windows service manager (it's in the registry, requires a reboot, is system wide, blablabla).
In the end, the Knock Daemon Service uses a bit more CPU than the Linux version, due to WMI fetches, but this remains low; the memory footprint stays under 40 MB, and network and IO usage are similar.
We got fast Windows support with a unified code base, unified metrics and unified alerts in a couple of weeks.
We reused our development environments and build systems.
We did not modify a single line of code in our monitoring backend to handle Windows.
And we keep Python and gevent. Which is just voodoo magic!
Lolo**2 = Laurent Labatut & Laurent Champagnac