Dovecot, samba, sysvrc and the case of the vanishing PID files

steve_v · 2022-06-08 06:30:34

I have been having an infrequent, annoying, and so far difficult to pin down issue with a machine running Devuan 10 (oldstable), and it goes something like this...

Every so often I'll notice something is up, such as changes to samba configuration not taking or clients complaining that dovecot's SSL cert has expired.
I'll then try to stop or restart the relevant service, only to have the init script (or the init script via 'service') return "OK", apparently without actually stopping or restarting anything. I can do a "service [dovecot|smbd|nmbd] stop" as many times as I like, it always returns "OK", and the processes remain running.

Investigation reveals that these processes have no corresponding PID files (even before trying to stop them), and that appears to be why the init scripts are lying to me.

Manually killing these processes and restarting the service writes out a new pidfile as expected, and everything is fine until the next time... which might be months away.

It's pretty annoying that the init script (or more to the point start-stop-daemon) is returning without error in this instance, but the bigger issue is obviously that the pid file is being removed while the processes are still running.

Here's the stop function from dovecot's init script:

do_stop()
{
    # Return
    #   0 if daemon has been stopped
    #   1 if daemon was already stopped
    #   2 if daemon could not be stopped
    #   other if a failure occurred
    start-stop-daemon --stop --quiet --retry=TERM/30/KILL/5 --pidfile $PIDFILE --name ${DAEMON##*/}
    RETVAL="$?"
    [ "$RETVAL" = 2 ] && return 2
    # Wait for children to finish too if this is a daemon that forks
    # and if the daemon is only ever run from this initscript.
    # If the above conditions are not satisfied then add some other code
    # that waits for the process to drop all resources that could be
    # needed by services started subsequently.  A last resort is to
    # sleep for some time.
    start-stop-daemon --stop --quiet --oknodo --retry=0/30/KILL/5 --pidfile $PIDFILE --name ${DAEMON##*/}
    [ "$?" = 2 ] && return 2
    # Many daemons don't delete their pidfiles when they exit.
    rm -f $PIDFILE
    return "$RETVAL"
}

And smbd:

        stop)

                log_daemon_msg "Stopping SMB/CIFS daemon" smbd

                start-stop-daemon --stop --quiet --pidfile $SMBDPID
                # Wait a little and remove stale PID file
                sleep 1
                if [ -f $SMBDPID ] && ! ps h `cat $SMBDPID` > /dev/null
                then
                        # Stale PID file, remove it (should be removed by
                        # smbd itself IMHO).
                        rm -f $SMBDPID
                fi

                log_end_msg 0

                ;;

At a cursory glance, it would appear that both of these scripts are indeed going to fail if there's no PID to work with...

So, given that this particular ugliness has (AFAIK) been in Debian/Devuan for quite some time, the question becomes:
Are the services refusing to exit during an automated restart (logrotate, unattended-upgrades etc.; both are liable to have persistent connections), with the init script then falling through to that blunt "stale pid file" instrument?
Is something else interfering? I can't really think of any likely suspects here.
Is it aliens?
How best do I find out, considering this only seems to happen once every few months? All I see in the logs are the expected "OK" restarts, but then I know the init scripts are lying, sooo?

Before I start instrumenting init scripts and/or writing a watch daemon to try and figure out what the hell is going on, has anyone seen this kind of thing before? General internet searches are becoming fairly useless for sysv-rc related things it seems, and I can't find much that doesn't start with "systemctl, blah, blah".

Any ideas? Likely suspects or further places to look? This is kind of a pain to diagnose right now, since once it's working everything stays working until whatever it is breaks it again...

Last edited by steve_v (2022-06-08 06:38:29)

Head_on_a_Stick · 2022-06-08 12:04:23

I have no experience of dovecot so this is a swing in the dark but have you tried setting $PIDBASE to a different directory via /etc/default/dovecot? Perhaps /var/run/dovecot/ is causing problems. How is /var/run/ mounted?

You could also try the suggested upstream init script to help eliminate start-stop-daemon as the culprit (sed 's|/usr/local|/usr|'):

https://doc.dovecot.org/installation_gu … .d_script/

And I notice that systemd uses socket activation for dovecot.service, perhaps that has something to do with it.

Last edited by Head_on_a_Stick (2022-06-08 12:04:45)

steve_v · 2022-06-10 05:06:28

Head_on_a_Stick wrote:

have you tried setting $PIDBASE to a different directory via /etc/default/dovecot?
Perhaps /var/run/dovecot/ is causing problems.

I really can't think of any reason /var/run/dovecot would be a problem, and it's working that way just perfectly at the moment - stopping and starting (both via the init script and doveadm) does what it should WRT master.pid.
I guess it wouldn't hurt to try though.

Head_on_a_Stick wrote:

How is /var/run/ mounted?

/var/run -> /run
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=6593300k,mode=755)

Pretty sure that's a default config from the Debian init scripts... somewhere, sometime. This install has been a rolling upgrade since at least as far back as Squeeze.

Head_on_a_Stick wrote:

You could also try the suggested upstream init script.

I'll give that a poke and see how it goes (will involve much patient waiting OFC). At least there's no 'rm -f $PIDFILE' in it anyway...

Frankly I'm wondering why anyone would ship an init script that does rm -f on a pidfile without checking to see if the relevant process is really gone to begin with, but presumably this one has been working for people for some time, right?
I mean, at least the smbd init script does a basic 'ps $PID' check, but here there's nothing.
Unless I'm missing something obvious, it looks like it will remove the pidfile even if start-stop-daemon returns non-zero (i.e. returns early only on "2"). That isn't even half-arse fail-safe.
This looks like some generic template script TBH, one intended to be tweaked for things like dovecot that fork a bunch of processes.
And yes, I checked, it's the one shipped in the Debian package.
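For comparison, here's a minimal sketch of the kind of guard I'd have expected before that rm. The helper name remove_pidfile_if_dead is made up for illustration and is not in any package; it assumes it runs as root (as init scripts do), so a failing 'kill -0' means the process is gone rather than merely owned by someone else.

```sh
#!/bin/sh
# Sketch only: remove a pidfile only when the recorded process is gone.
# remove_pidfile_if_dead is a hypothetical helper, not part of any package.
# Return:
#   0 if the pidfile was removed (or was already absent)
#   1 if the recorded process is still alive (pidfile kept)
#   2 if the pidfile is empty or not a number (pidfile kept)
remove_pidfile_if_dead() {
    pidfile=$1
    [ -e "$pidfile" ] || return 0
    pid=$(tr -d '[:space:]' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 2 ;;
    esac
    # kill -0 delivers no signal; it only tests whether the PID exists.
    # Assumption: we are root, so a failure means "no such process".
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    rm -f "$pidfile"
}
```

In do_stop that would stand in for the bare 'rm -f $PIDFILE', with a non-zero result passed back to the caller instead of being swallowed.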

Then again, I guess I should probably shut up about the fragile init script until I know it's really the problem.

Head_on_a_Stick wrote:

I notice that systemd uses socket activation for dovecot.service, perhaps that has something to do with it.

Not sure what you're getting at here, I have no systemd anything anywhere socket activated or otherwise.

Last edited by steve_v (2022-06-10 05:48:31)

Head_on_a_Stick · 2022-06-10 08:58:49

steve_v wrote:

This install has been a rolling upgrade since at least as far back as Squeeze.

Well I've just installed a fresh chimaera system (from the netinstall with just the standard utilities) and dovecot stops and restarts normally there so perhaps you have some outdated configuration files that are causing problems.

steve_v wrote:

Head_on_a_Stick wrote:
I notice that systemd uses socket activation for dovecot.service, perhaps that has something to do with it.
Not sure what you're getting at here, I have no systemd anything anywhere socket activated or otherwise.

I meant that perhaps dovecot relies on socket activation to be restarted correctly. Probably not though given that it seems to work (or run, at least) in a fresh Devuan system.

steve_v · 2022-06-10 12:10:28

Head_on_a_Stick wrote:

Well I've just installed a fresh chimaera system (from the netinstall with just the standard utilities) and dovecot stops and restarts normally there so perhaps you have some outdated configuration files that are causing problems.

Sure, maybe I do... The catch is that everything works fine here too, so long as I'm looking at it.
If it was not stopping properly reproducibly, I'd have it fixed by now.

As this is a working (and moderately important, at least to me) mail server, I'm not particularly keen to reinstall everything or revert to default configuration for a month to see if solution-by-scorched-earth sticks.
That's very much last-resort material, and besides being a total pain in the ass is unlikely to be educational as to the real cause.

Head_on_a_Stick wrote:

I meant that perhaps dovecot relies on socket activation to be restarted correctly.

Fair enough. Personally I doubt it though, dovecot has been around a lot longer than systemd, and again I'd expect a consistent result if that was the problem.

Aside, the focus on dovecot is all well and good (that is the more annoying problem), but TBH I still suspect something outside dovecot itself (or its configuration) since whatever it is appears to affect smbd and nmbd in much the same way.
They might be unrelated of course, but IMO it'd be a hell of a coincidence for 3 services to suffer the same problem without a common cause of some kind.

Really, I don't think I'm going to get anywhere with this right now, it'll have to wait until the cosmic rays or whatever it is strike again to even start.
I was really just hoping somebody had seen something like this before and I might get a hint that way, but oh well. Guess I'll come back when it happens again and I have more info.

I know this is all kinda vague for a tech support question, but then if it was easy I wouldn't be asking.

steve_v · 2022-06-10 12:55:59

Oh FFS. Just when I decide to let it cook for a bit...

Ignore dovecot for the minute, that's behaving itself right now. smbd and nmbd on the other hand:
Are running right now, and working fine.
Have a PID file...
But that PID file is empty. I know it contained a valid PID ~2 days ago, because I checked after I had to restart it.
The smbd and nmbd processes are ~2 days old (Jun 8 17:26), but the modification date on the PID file is 'Jun 10 17:49'.
Grepping the logs for that time and date (and an hour before) reveals... Nothing of any interest.
Sure enough, the init scripts can't stop those services.
Killing them manually and restarting with the init scripts writes out the correct PID, and all is well again.

And it gets better, it's not just dovecot and samba, they're only the canaries:

# find /var/run/ -iname '*.pid' -empty | xargs ls -la
-rw-r--r-- 1 root root 0 Jun 10 17:49 /var/run/apcupsd.pid
-rw------- 1 root root 0 Jun 10 17:49 /var/run/fail2ban/fail2ban.pid
-rw-r--r-- 1 root root 0 Jun 10 17:49 /var/run/php/php7.4-fpm.pid
-rw-r--r-- 1 redis redis 0 Jun 10 17:49 /var/run/redis/redis-server.pid
-rw-r--r-- 1 root root 0 Jun 10 17:49 /var/run/samba/nmbd.pid
-rw-r--r-- 1 root root 0 Jun 10 17:49 /var/run/spamd.pid

What the? Seriously? How did I miss that?

So it's not the init scripts at fault, it's been aliens all along.
I think it's also safe to say my initial statement that /var/run/dovecot/master.pid was missing is incorrect. I expect I looked after I had already run the dovecot init script, which of course fell through to that rm -f on account of having no PID to work with.
I can't exactly prove it right now, but I bet you a cookie the pidfile was present but empty like all the others.
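The find above only catches empty files. A rough sketch that also flags pidfiles pointing at a dead process might look like this; pidfile_state and scan_pidfiles are names invented here, and the 'kill -0' test again assumes root, since otherwise a live process owned by another user would be misreported as stale.

```sh
#!/bin/sh
# Sketch only; both helpers are hypothetical.
# pidfile_state prints one word describing a pidfile:
#   missing | empty | garbage | stale | ok
pidfile_state() {
    pidfile=$1
    if [ ! -e "$pidfile" ]; then echo missing; return; fi
    if [ ! -s "$pidfile" ]; then echo empty; return; fi
    pid=$(tr -d '[:space:]' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo garbage; return ;;
    esac
    # Assumption: running as root, so a kill -0 failure means "no such process".
    if kill -0 "$pid" 2>/dev/null; then echo ok; else echo stale; fi
}

# Print "state<TAB>path" for every *.pid file under a directory
# (default /run) whose state is anything other than "ok".
scan_pidfiles() {
    find "${1:-/run}/" -name '*.pid' -type f 2>/dev/null |
    while IFS= read -r f; do
        state=$(pidfile_state "$f")
        [ "$state" = ok ] || printf '%s\t%s\n' "$state" "$f"
    done
}
```

Run as 'scan_pidfiles /var/run' from cron or by hand, it should print nothing on a healthy box, which also sidesteps xargs running a bare 'ls -la' when find matches nothing.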

I'm still none the wiser as to what's fingering my files behind my back, and it's high time I went to sleep right now. But hey, data point is data point.

If you have any idea what that data point points to... I'm all ears.

I just sicced fnotifystat onto /var/run/, so let's see if it catches anyone...

Last edited by steve_v (2022-06-10 16:02:57)

alexkemp · 2022-06-10 16:58:58

Searching, I came across the multi-OS fswatch:

A monitor which periodically stats the file system, saves file modification times in memory and manually calculates file system changes, which can work on any operating system where stat (2) can be used.

It is within Synaptic:

A monitor based on inotify, a Linux kernel subsystem that reports file system changes to applications

That looks like a possibility, particularly since it can work from any of several available APIs.

steve_v · 2022-06-10 17:58:38

Well that was quick.

We have a weiner, and really, I should have seen it coming...
Via auditd (yeah, it's better than fnotifystat):

proctitle=/usr/bin/python3 /usr/bin/systemctl list-units --full --all 
type=PATH msg=audit(11/06/22 05:16:39.046:1736) : item=1 name=/var/run/samba/nmbd.pid inode=41571 dev=00:15 mode=file,644 ouid=root ogid=root rdev=00:00 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 
type=PATH msg=audit(11/06/22 05:16:39.046:1736) : item=0 name=/var/run/samba/ inode=16650 dev=00:15 mode=dir,755 ouid=root ogid=root rdev=00:00 nametype=PARENT cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 
type=CWD msg=audit(11/06/22 05:16:39.046:1736) : cwd=/root 
type=SYSCALL msg=audit(11/06/22 05:16:39.046:1736) : arch=x86_64 syscall=openat success=yes exit=3 a0=0xffffff9c a1=0x7f257ac19050 a2=O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC a3=0x1b6 items=2 ppid=16701 pid=16702 auid=steve uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts2 ses=16 comm=systemctl exe=/usr/bin/python3.7 subj==unconfined key=rundir

Sure enough:

-rw-r--r-- 1 root root 0 Jun 11 05:16 /var/run/samba/nmbd.pid
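For anyone chasing something similar, the interesting part of those records is the O_TRUNC in the SYSCALL line. Here's a small sketch for fishing the culprits out of a larger log; truncating_opens is a name I made up, and it assumes interpreted output laid out like the excerpt above (i.e. 'ausearch -i', with flags spelled out and one record per line).

```sh
#!/bin/sh
# Sketch only: truncating_opens is a hypothetical filter.
# Reads interpreted audit records on stdin and, for every SYSCALL record
# whose flags include O_TRUNC, prints "pid comm exe".
truncating_opens() {
    awk '
        /^type=SYSCALL/ && /O_TRUNC/ {
            pid = comm = exe = "?"
            for (i = 1; i <= NF; i++) {
                # "^pid=" is anchored, so ppid= does not match here.
                if ($i ~ /^pid=/)  pid  = substr($i, 5)
                if ($i ~ /^comm=/) comm = substr($i, 6)
                if ($i ~ /^exe=/)  exe  = substr($i, 5)
            }
            print pid, comm, exe
        }'
}

# Usage, with the key from the records above:
#   ausearch -k rundir -i | truncating_opens
# For the SYSCALL record quoted above this prints:
#   16702 systemctl /usr/bin/python3.7
```

It doesn't say which file got truncated (that lives in the PATH records of the same event), just who was doing the truncating.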

And it's not even some dodgy crap I backported myself this time:

Package systemctl:                              
i   1.4.4147-1~bpo10+1                                             oldstable-backports                         200

This particular footgun was installed to satisfy some crappy postinst stuff in the packages from zmrepo (which actually works fine without systemd but was clearly packaged by a muppet).
I thought it would be relatively safe in that context as all the postinst does is a 'systemctl status zoneminder' and nothing else should use it (this is Devuan after all, and we don't use systemd, right? RIGHT?)... But if it's on the system it looks like it gets called by a bunch of other things too - in this case, somewhere downstream of a 'service apache2 reload'.
Said zmrepo package was only installed because there's nothing for beowulf... Which is odd and kinda annoying, as there is a backport for ascii.
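A quick way to check whether a box is carrying this kind of stand-in is sketched below. systemctl_kind is a made-up helper; it relies on /run/systemd/system existing only when systemd is the running init, and on a script announcing itself with a "#!" first line, as the python one in the audit record does. It says nothing about what the stand-in does to pidfiles, only that it is there.

```sh
#!/bin/sh
# Sketch only: systemctl_kind is hypothetical. Both arguments are optional
# and exist mainly so the function can be tried against test files.
#   $1  path to a systemctl executable (default: the one on PATH, if any)
#   $2  directory that exists only when systemd is the running init
# Prints one of: absent | systemd | script-shim | binary-without-systemd
systemctl_kind() {
    bin=${1:-$(command -v systemctl 2>/dev/null)}
    rundir=${2:-/run/systemd/system}
    if [ -z "$bin" ] || [ ! -e "$bin" ]; then echo absent; return; fi
    if [ -d "$rundir" ]; then echo systemd; return; fi
    # No systemd running, yet something called systemctl is installed.
    if [ "$(head -c 2 "$bin")" = '#!' ]; then
        echo script-shim
    else
        echo binary-without-systemd
    fi
}
```

On a sysvinit machine anything other than "absent" is worth a second look before trusting scripts that call systemctl when they find it.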

Anyone care to speculate why a command that is supposed to simply list units is stomping all over everything? Or why there is a loaded footgun in the devuan repos without a depends !sysvinit safety catch?

Conclusions:
1) That standalone systemctl package is an extremely nasty trap. I don't know why it's nuking my pidfiles and I don't really speak python, but I might have a look later. For now it goes in the naughty corner where it belongs.
2) Installing zoneminder on this box was probably a bad idea to begin with. It's nominally a mailserver (with roundcube webmail via apache), but it had disk space to burn, a webserver already installed, and it was convenient at the time. 'twas intended to be a temporary solution, but you know how that usually goes.

Last edited by steve_v (2022-06-10 18:21:49)
