Hello:
While (still) trying to track down the origin of the shutdown problem that occasionally affects my U24 box, I have come back to the matter of the infamous Intel e1000e GbE controller.
Randomly, on shutdown the rig will freeze with all fans going at full blast with this output on the TTY1 screen:
Devuan GNU/Linux ascii devuan tty1
devuan login: [483.367459] EXT-fs (sdc1): re-mounted. Opts: (null)
[485.772216] e1000e: eth0 NIC Link is Down
[485.776885] kvm: exiting hardware virtualization
[485.777756] sd 9:0:3:0: [sdf] Synchronizing SCSI cache
[485.778154] sd 9:0:2:0: [sdf] Synchronizing SCSI cache
[485.781519] e1000e: EEE TX LPI TIMER: 00000000
[485.785219] ACPI: Preparing to enter sleep state S5
[485.868007] reboot: Power down <---- screen freezes at this point
From [485.868007] onwards, the only way out is a hard shutdown.
One idea I picked up during my searches on the web was to disable the EEE TX LPI timer.
Made sense, that was the last thing active, maybe it was not working properly.
But the onboard controller will have none of it:
[root@devuan ~]# ethtool --set-eee eth0 tx-lpi off
Cannot get EEE settings: Operation not supported
[root@devuan ~]#
Nor would it inform me of the actual state of the timer:
[root@devuan]# ethtool --show-eee eth0
Cannot get EEE settings: Operation not supported
[root@devuan]#
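For anyone wanting to script this check, here is a minimal sketch. It relies on nothing beyond the error text shown above: if the output contains "Operation not supported", the driver refused the EEE query. The function name eee_query_result is made up for illustration and is not part of ethtool.

```shell
#!/bin/sh
# Classify the text printed by `ethtool --show-eee <iface>`.
# Reads the tool's combined stdout/stderr on stdin and prints one word:
#   unsupported - the driver rejected the query (error text as shown above)
#   supported   - anything else, i.e. the driver answered the query
eee_query_result() {
    if grep -q 'Operation not supported'; then
        echo unsupported
    else
        echo supported
    fi
}

# Typical use on a live system (the interface name is an assumption):
#   ethtool --show-eee eth0 2>&1 | eee_query_result
```

The 2>&1 matters because the error message may be written to stderr rather than stdout.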
This is rather strange as the MSWindows driver allows me to change these parameters:
ie:
---
settings -> control panel -> system -> hardware -> device manager -> network adapters
Intel PRO/1000 MT Desktop Adapter
Advanced -> Wait for Link=Off | Wake on Link Settings=Disabled | Wake on Settings=Disabled
Power Management -> Allow the computer to turn off this device to save power -> unchecked
---
These settings survive both the reboot of the VM and a reboot on the host.
The first question would be why this would be so.
Why can't ethtool do the same thing? (v.4.19)
Then I reasoned that the next best thing would be to unload the e1000e module before the shutdown command, so I put together a script which took the place of the absurd xfsm-shutdown-helper bundled along with Xfce:
#!/bin/sh
# shutdown system directly (no shutdownhelper)
# disable onboard eth wol
# remove e1000e module
PATH=/sbin:/bin:/usr/sbin:/usr/bin
# sync && sudo ethtool -s eth0 wol d && sudo shutdown -h now
sync && sudo rmmod -s -v e1000e && sudo shutdown -h now
Thinking that with the e1000e module removed, it would be the end of the EEE timer on shutdown, I made a video grab of the shutdown process.
But to my surprise, the damned thing was still there ie: the shutdown screen still included a line for the EEE timer.
Devuan GNU/Linux 3 devuan tty1
devuan login: [ 286.719428] e1000e: eth0 NIC Link is Down
--- snip ---
[287.219230] e1000e: EEE TX LPI TIMER: 00000000 <-------------- | x |
[287.223022] ACPI: Preparing to enter sleep state S5
[287.223551] reboot: Power down
Now, if the module was unloaded, why is the EEE timer still around after the fact?
I can confirm the module gets unloaded as the LAN link goes down immediately, both with rmmod and with modprobe -r.
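A more direct check than watching the link is to look the module up in /proc/modules. A minimal sketch, assuming a POSIX shell and awk; module_loaded is a name made up for illustration:

```shell
#!/bin/sh
# Return 0 if the named module is listed in a /proc/modules-style file.
# $1 = module name, $2 = file to read (defaults to /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Example:
#   if module_loaded e1000e; then echo "e1000e is still loaded"; fi
```

Called right after the rmmod step, it would tell whether the module is really gone before the shutdown command runs.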
I once tried to get something useful from the Intel chaps, they really don't have the slightest clue.
ie: a waste of time
I wrote the maintainer of ethtool a couple of days ago but have not had a reply yet.
Any insight on this would be appreciated.
Thanks in advance,
Best,
A.
Last edited by Altoid (2021-04-23 01:58:57)
Well, I guess you know, but for reference. There is an open linux kernel bug[0] for the e1000e watchdog timer. And a patch included in 5.5.0. Did you test that Kernel already? Otherwise 4.19 still seems to need a reworked patch.
My 2 cents: Delayed work is quite dangerous indeed.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=205047
Last edited by geki (2021-04-18 17:03:06)
Hello:
... guess you know ...
... open linux kernel bug[0] for the e1000e watchdog timer.
No.
Had no idea but I am not at all surprised that the bug exists.
Nor am I surprised that it is unsolved almost two years later.
What I do know (own experience, web-wide rants) is that the e1000e, besides being a real piece of work, is intrinsically linked* to the Intel Management Engine.
In my Ultra 24's BIOS, the GbE entry is greyed out, you cannot disable it.
ie: the box's owner, OS administrator cannot disable the on-board GbE controller or access the settings the entry presumably allows you to change.
I have not been able to find any instructions anywhere on how to do it. Owner's manual, Field manual, etc. have no mention of it.
WTF?
I run Beowulf, which uses the 3.2.6-k version of the driver; the problems reported seem to be with the version in the upstream kernels.
And the shutdown problem I have is present from when I first installed Linux on this box, around late 2015.
Upgrading to the last BIOS available (1.56) did not solve the problem.
* : see this post from your link to the bug: https://bugzilla.kernel.org/show_bug.cgi?id=205047#c19
Vitaly Lifshits 2019-10-17 10:47:38 UTC
Please try:
1. rmmod mei && rmmod mei_me <----------------------------- | x |
2. removing the if in the patch and moving the call e1000_phy_hw_reset(&adapter->hw) outside of the while loop:
if (!(pcim_state & E1000_STATUS_PCIM_STATE))
    e1000_phy_hw_reset(&adapter->hw);
Understand what I am referring to?
My installation has both modules (mei and mei_me) blacklisted in /etc/modprobe.d.
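For reference, this is roughly how the blacklist entries can be checked. A sketch only: is_blacklisted is a hypothetical helper, and it only looks at *.conf files in one directory.

```shell
#!/bin/sh
# Return 0 if some *.conf file in the given directory blacklists the module.
# $1 = module name, $2 = directory (defaults to /etc/modprobe.d).
is_blacklisted() {
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+$1([[:space:]]|\$)" "${2:-/etc/modprobe.d}"/*.conf
}

# Example:
#   for m in mei mei_me; do
#       is_blacklisted "$m" && echo "$m is blacklisted"
#   done
```

Keep in mind that a blacklist line only stops the module from being loaded automatically through its aliases; it can still be loaded by hand or as a dependency, so lsmod has the final word.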
And yes, the problem I have was there before I blacklisted the modules.
Did you test that Kernel already?
No, not planning to do it.
... 4.19 still seems to need a reworked patch.
Lockdown has still some time to go, so I may look into trying the DKMS version of the driver.
But I've never done such a thing and would have to study a bit to see how it is done.
But what I really want to do is to disable the EEE timer.
It was the first thing I thought of doing and have come back to it after trying all sorts of ACPI magic, kernel command entries and DSDT mods, to no avail.
Maybe the email I wrote to the ethtool utility maintainer will get me something.
I fail to understand why e1000e: EEE TX LPI TIMER: 00000000 is still there at shutdown if the module had been previously removed.
Thanks for your input.
Best,
A.
Edited: mail was actually sent to maintainer of ethtool utility - on if it is at all possible to disable EEE Timer / how to do it.
Last edited by Altoid (2021-04-19 00:15:42)
That c19 refers to patch from c11. The important patch is referenced in c33 and c55, which went in upstream kernel 5.5. That one seems to undo a major regression wrt the watchdog timer handling. So, if you can, you should test kernel 5.5 or newer from beowulf-backports. I may have overlooked that you tested those kernel versions already.... See: https://pkginfo.devuan.org/cgi-bin/poli … mage-5.10*
linux-image-5.10.0-0.bpo.4-amd64 5.10.19-1~bpo10+1
http://deb.devuan.org/merged beowulf-backports/main amd64
I fail to understand why e1000e: EEE TX LPI TIMER: 00000000 is still there at shutdown if the module had been previously removed.
That is the nature of "My 2 cents: Delayed work is quite dangerous indeed. "
They delay the work item of the watchdog timer for that device. The device is unloaded. Then the watchdog tries to process the work item and instead of crashing or invalidating, it hangs waiting for the device, which is no longer there and therefore cannot answer, to answer. I am not involved there. Just the symptoms from afar.
And this is what they reverted, not to push delayed work items but process directly or otherwise "simpler".
Last edited by geki (2021-04-18 21:34:39)
I aided someone else with e1000e issues a decade ago, or so, on the gentoo forums. That user actually worked with the e1000e linux developer back then. The linux developer of the e1000e actually did not own the hardware, IIRC. The e1000e linux driver trashed its MAC address in its early days, IIRC. It is best to avoid that chipset....
Hello:
Please bear with me, I think our posts may have crossed.
That c19 refers to patch from c11.
Yes.
I was making a note wrt the fact that mei and mei_me were being put into play, for whatever reason.
Like I said, Intel ME and the e1000e controller go hand in hand.
Nasty crap ...
... important patch is referenced in c33 and c55, which went in upstream kernel 5.5.
... seems to undo a major regression wrt the watchdog timer handling.
I see.
... if you can, you should test kernel 5.5 or newer from beowulf-backports.
I'm usually wary of new kernels, which have a tendency to screw up things that have been working perfectly well from a long time back.
I'd feel much more comfortable if there was a patch or a point release eg: Devuan Beowulf 3.2 or whatever.
... you tested those kernel versions already...
No.
I have tried booting live distros using newer kernels to see what dmesg had to say as compared to what it says on Beowulf.
But that's about it, no long term testing.
In my experience with this problem, I can go as much as 15/18 days without a bad shutdown and then get as many as three in two days.
My average boot/shutdown cycle count is roughly 5 to 7 a day, occasionally a couple more.
See: https://pkginfo.devuan.org/cgi-bin/poli … mage-5.10*
linux-image-5.10.0-0.bpo.4-amd64 5.10.19-1~bpo10+1
http://deb.devuan.org/merged beowulf-backports/main amd64
Thanks for the info.
I'll check it out.
Best,
A.
Hello:
That is the nature of "My 2 cents: Delayed work is quite dangerous indeed. "
Ahh ...
Now I (sort of) understand.
They delay the work item of the watchdog timer ...
... device is unloaded.
... watchdog tries to process the work item and instead of crashing or invalidating, it hangs waiting for the device ...
... no longer there and therefore cannot answer, to answer.
I get where you are going.
But ...
Maybe I did not explain myself correctly.
Without any intervention on my part (ie: without explicitly unloading the module before shutdown, see script posted previously), when I get a bad shutdown instance, the output on the screen is the same.
ie:
Devuan GNU/Linux ascii devuan tty1
devuan login: [483.367459] EXT-fs (sdc1): re-mounted. Opts: (null)
[485.772216] e1000e: eth0 NIC Link is Down
[485.776885] kvm: exiting hardware virtualization
[485.777756] sd 9:0:3:0: [sdf] Synchronizing SCSI cache
[485.778154] sd 9:0:2:0: [sdf] Synchronizing SCSI cache
[485.781519] e1000e: EEE TX LPI TIMER: 00000000
[485.785219] ACPI: Preparing to enter sleep state S5
[485.868007] reboot: Power down <---- screen freezes at this point
Now, if I understand what you are saying, then I was right from the very start when I went looking for how to disable the #$&@ timer.
When a bad shutdown comes along, the e1000e: EEE TX LPI TIMER: 00000000 bit shows up and the box freezes as previously described.
But when the shutdown is normal, the e1000e: EEE TX LPI TIMER: 00000000 bit is also there but the box shuts down properly.
So ...
It is a question of timing, maybe a +/- 0.5s somewhere may be the 'trigger' for the bad shutdown which has proven impossible for me to replicate.
Absolutely unpredictable, I have had more than a fortnight of uneventful shutdowns and then as many as three bad ones in just a couple of days.
Then, out of nowhere ... 8^7
Just the symptoms from afar.
Thanks for the heads up.
... this is what they reverted, not to push delayed work items but process directly or otherwise "simpler".
So it is a kernel problem?
Of course, Intel is doing nobody a favour by making it impossible to disable the EEE Timer.
That would be a very quick fix.
---
... aided someone else with e1000e issues a decade ago ...
... actually worked with the e1000e linux developer back then.
... developer of the e1000e actually did not own the hardware, IIRC.
Not too easy to test then ...
Maybe that's why the driver is such a POS?
... best to avoid that chipset ...
Indeed ...
Unfortunately, it is what came onboard with the U24 which is, even by today's standards, a great piece of kit.
Keep in mind that it was brought to market almost 15 years ago, in mid 2007.
It was a great buy for me, practically brand new.
And all my slots are filled, so I'll have to make do till this glitch finally gets fixed or I can do something kernel-wise.
Because I hope to be using this HW for a few years more, maybe with a faster SAS controller and SSDs.
Thanks a lot for your input.
Finally I have had some light shed on this rather annoying problem.
Best,
A.
Hello:
Just the symptoms from afar.
I have found a way to disable this Energy-Efficient Ethernet thing in my Intel e1000e on-board controller.
Where the ethtool utility has no access, the kernel does: adding igb.EEE=0 to the kernel cmdline disables EEE during boot.
https://www.toradex.com/community/quest … ernet.html
This EEE issue seems to be something that has been around for a long time:
https://thatbytes.co.uk/posts/fun-with- … ontroller/ <--- 05/2012!
EDIT:
Doing all this again helped me remember that at one time (back in 2019) I tried solving this problem by adding the stanza e1000e.EEE=0 to the kernel command line.
It did not work: the e1000e: EEE TX LPI TIMER: 00000000 bit was also present in the tty1 output in case of a normal shutdown and the bad shutdowns kept happening.
See https://www.linuxquestions.org/question … ost5954899
Basically in order for EEE to kick in both devices need negotiate ...
... but the switch didn’t support this ...
My ISP-provided cheapo router most probably does not support EEE.
But I was already having this problem when I was on a shared WiFi connection.
But not everything is good news.
After setting this parameter and rebooting to see the results, I got a bad shutdown, albeit without the presence of the e1000e: EEE TX LPI TIMER: 00000000 bit:
Devuan GNU/Linux 3 devuan tty1
devuan login: [ 864.785061] EXT4-fs (sda1): re-mounted. Opts: (null)
[ 864.824466] kvm: exiting hardware virtualization
[ 864.910856] sd 7:0:3:0: [sde] Synchronizing SCSI cache
[ 864.911235] sd 7:0:2:0: [sdd] Synchronizing SCSI cache
[ 864.911634] sd 5:0:0:0: [sda] Synchronizing SCSI cache
[ 864.913092] sd 5:0:0:0: [sda] Stopping disk
[ 865.013903] ACPI: Preparing to enter sleep state S5
[ 865.014444] reboot: Power down
Note that there are no lines referring to the status of the NIC link or EEE:
ie:
e1000e: eth0 NIC Link is Down or e1000e: EEE TX LPI TIMER: 00000000.
Maybe (?) the bad shutdown was a consequence of having disabled EEE on the controller.
Won't know till at least 15 days go by without another bad shutdown.
Meanwhile, I'll enjoy my having found this tidbit of information.
[rant]
Information I asked for but the DHs at Intel e1000e support were unable to give me.
I'll assume that they did not know what I was talking about.
[/rant]
Best,
A.
Last edited by Altoid (2021-04-20 15:10:39)
The next step is to see kernel log of the shutdown. Dunno, if it needs to be configured to generate them. You could check /var/log/kern.log*, if there is a kernel oops or such.
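Something along these lines would do for a first pass. Just a sketch: find_kernel_trouble is a made-up name and the pattern list is my guess at what is worth looking for, not a complete list.

```shell
#!/bin/sh
# Print lines from the given log files that look like kernel trouble.
# The pattern list is an assumption; extend it as needed.
find_kernel_trouble() {
    grep -hE 'Oops|BUG:|Call Trace|Kernel panic' "$@"
}

# Example:
#   find_kernel_trouble /var/log/kern.log /var/log/kern.log.1
```

Rotated logs that have been compressed (.gz) would need zgrep instead of grep.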
Last edited by geki (2021-04-19 17:26:20)
Hello:
... see kernel log of the shutdown.
... needs to be configured to generate them.
... check /var/log/kern.log* ...
Yes, that would be /var/log/kern.log.
But at that stage, (preparing to enter S5), all system files are read only.
ie: all drives have been synced and stopped, the last one being the one with the /boot partition.
I have never found any useful data with respect to this bad shutdown problem in the log files.
It happens when no one is looking and save for the useless and volatile tty1 printout, leaves no trace behind. 8^7
According to /var/log/auth.log, I added the igb.EEE=0 instruction to the kernel command line @08:27.
I then shut down and got another bad shutdown instance.
But the kernel line addition was not active yet, so that bad shutdown only means that everything done to that point had had no effect.
Nothing new ...
These are the kernel log entries from that time forward, the last one at shutdown and the first one at boot:
groucho@devuan:/var/log$ tail -6000 kern.log
--- snip ---
Apr 19 08:29:27 devuan kernel: [ 3556.858961] e1000e: eth0 NIC Link is Down <--- | fix not active |
Apr 19 08:31:30 devuan kernel: [ 0.000000] microcode: microcode updated early to revision 0xa0b, date = 2010-09-28 <--- | fix active |
--- snip ---
Apr 19 08:45:16 devuan kernel: [ 858.318960] e1000e: eth0 NIC Link is Down
Apr 19 08:47:43 devuan kernel: [ 0.000000] microcode: microcode updated early to revision 0xa0b, date = 2010-09-28
--- snip ---
Apr 19 08:48:06 devuan kernel: [ 58.328174] e1000e: eth0 NIC Link is Down
Apr 19 08:50:08 devuan kernel: [ 0.000000] microcode: microcode updated early to revision 0xa0b, date = 2010-09-28
--- snip ---
Apr 19 11:42:27 devuan kernel: [10373.487079] e1000e: eth0 NIC Link is Down
Apr 19 12:51:32 devuan kernel: [ 0.000000] microcode: microcode updated early to revision 0xa0b, date = 2010-09-28
--- snip ---
Apr 19 16:08:27 devuan kernel: [ 2166.724262] e1000e: eth0 NIC Link is Down
Apr 19 16:10:32 devuan kernel: [ 0.000000] microcode: microcode updated early to revision 0xa0b, date = 2010-09-28
--- snip ---
groucho@devuan:/var/log$
Nothing in the log after the NIC is down.
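The same boundaries can be pulled out of the log without reading through 6000 lines. A minimal sketch, assuming each boot starts with a line stamped 0.000000 as in the excerpt above; last_line_before_boot is a name made up for illustration:

```shell
#!/bin/sh
# For every boot found in a kern.log-style file, print the last kernel line
# logged before it, i.e. the last thing written during the previous shutdown.
# A boot is recognised by the first line carrying the timestamp 0.000000.
last_line_before_boot() {
    awk '
        /\[ *0\.000000\]/ {
            # Only the first 0.000000 line of a boot triggers the print.
            if (!in_boot && prev != "") print prev
            in_boot = 1
            next
        }
        { in_boot = 0; prev = $0 }
    ' "$@"
}

# Example:
#   last_line_before_boot /var/log/kern.log
```

On my log this should list one e1000e: eth0 NIC Link is Down line per shutdown, which is the same information as the snipped listing above.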
The fact that the tty1 print out has no entries indicating the status of the NIC link or EEE (always present before) would seem to imply that we may be on the right track.
Now I just have to wait 15 days and see if the stanza added to the kernel command line actually fixed the problem.
A 'bad shutdown' quarantine if you're willing to pardon the pun.
Thanks a lot for your input.
Best,
A.
Last edited by Altoid (2021-04-19 19:59:47)
Hmm, on a second thought. For embedded hardware, there is this kernel (early) log piped to serial line feature. To debug early errors. I wonder if that feature works for shutdown, too?! To debug late errors? Never actually used that feature. At some point mounts are ro, right, and gfx stops updates to monitor. Is there some kernel commandline to activate that for debian kernels?
See: https://kernelnewbies.org/Linux_Kernel_ … e_Chapter3
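If I read the docs right, the kernel side of it is the console= parameter. A sketch of how the command line could be extended, untested on my side; ttyS0 at 115200 baud is an assumption and with_serial_console is a made-up helper:

```shell
#!/bin/sh
# Append serial-console arguments to a kernel command line unless a
# console= entry for ttyS0 is already there.
# ttyS0 and 115200n8 are assumptions: first serial port, 115200 baud, 8N1.
with_serial_console() {
    case " $1 " in
        *" console=ttyS0"*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "$1 console=tty0 console=ttyS0,115200n8" ;;
    esac
}

# Example:
#   with_serial_console "ro acpi_osi=Linux"
```

On Debian-like systems the resulting string would go into GRUB_CMDLINE_LINUX in /etc/default/grub, followed by update-grub. Kernel messages go to every console= listed, so tty1 output keeps working alongside the serial line.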
Last edited by geki (2021-04-19 21:52:11)
Hello:
... this kernel (early) log piped to serial line feature.
... works for shutdown, too?
I suppose it would.
But the last information you would get would be the tty1 output, up to where the screen reads reboot: Power down but doesn't power down and stays there.
Like you pointed out, waiting for a signal that it will never receive.
There's nothing more after that because at that stage, the OS is in a frozen and totally unresponsive state.
The only way out of that is a hard shutdown.
Thanks for your input.
Best,
A.
Last edited by Altoid (2021-04-20 00:26:58)
But the last information you would get would be the tty1 output, ...
I wonder, the above-mentioned link says something different AFAIUI. But it seems to be the best to use the latest kernel with its newest e1000e driver and hope for the best.
Last edited by geki (2021-04-20 06:25:33)
... above-mentioned link says something different AFAIUI.
I'll have another look.
Maybe I missed something.
Thanks for the heads up.
... use the latest kernel with its newest e1000e driver ...
From what I have seen, the e1000e driver has always been very problematic, its link to the Intel ME and how it works probably having much to do with that being so.
ie: up to now, an Intel network controller has been essential for IME to work.
The fact that I cannot disable my GbE controller (WTF?) or access all of its settings in Linux via the tool used to that effect (ethtool) speaks volumes.
I have not yet heard from ethtool's maintainer with respect to that.
Problems with this driver are known to exist as far back as 2012 (!) and it seems that, almost 10 years later, things have not changed much.
From where I sit, I don't see any need for the EEE feature in a desktop, workstation or server.
To me it is just another layer of complication, so I want it turned off.
I posted the solution I found to the Intel e1000e support forum.
But haven't heard from them yet and it's quite possible I won't.
In my view, this EEE thing is only useful (and only to a limited extent) in a portable, battery operated device or one in which the network component tends to run hot.
eg: some SoCs
And the same goes for any other energy saving features they come up with.
If disabling the controller's EEE does effectively do away with the bad shutdown problem I have, that will be it for the time being.
I'll upgrade my kernels conservatively, as I have always done.
Which is why I ultimately chose Debian as my distribution.
And when the developers/maintainers turned into DebHoles, moved to Devuan.
... and hope for the best.
Hope is in very short supply in this day and age.
I'd rather use the little there is for other, more important things. 8^D!
Thanks a lot for your input, your post steered me in the right direction.
Best,
A.
Last edited by Altoid (2021-04-20 15:15:24)
Hello:
I have not yet heard from ethtool's maintainer with respect to that.
I got a reply from him this afternoon.
I had asked:
What does being able to disable the EEE TX LPI timer in my 82566DM-2 GbE controller actually depend on?
Is it hardwired?
If so, could it be solved with a different firmware?
Here's a transcription of the relevant part of his reply:
In this case, ethtool is almost certainly only a messenger.
A request like this is passed to kernel and it's the NIC driver to either implement it or report that it is not supported.
And in your case it's querying the current setting that fails so it looks like either the device does not support getting and setting EEE parameters or the support in its driver (e1000e) is missing.
As clear as Perrier ...
So it's quite definitely the e1000e driver that is blocking access to both the status and configuration of the EEE settings in the 82566DM-2 GbE controller.
No doubt about that because I have been able to disable it completely via the igb.EEE=0 bit added to the kernel command line.
What is really irritating is that Intel Ethernet Products support insisted from the start that the only way to get it done was either through ethtool or modprobe, something which I repeatedly reported as being non-working solutions:
This is the last I heard from them back in 01/2019:
We typically turn off EEE using ethtool. Another method to do this is through modprobe as described in the readme for e1000e (https://downloadmirror.intel.com/15817/eng/readme.txt). It is not normal that EEE cannot be turned off with the previous methods, so it may be a change on the OEM end, and we are not aware of the modifications made by Sun. We strongly recommend to check with them on the root cause of your issue. Best regards ...
Draw your own conclusions.
Best,
A.
And another root cause of misbehaving intel NICs wrt EEE are (old!) CAT5 network cables. If you want, try with CAT6/7 (S)FTP cable, if you got such old cable. That solved the EEE issues with some, though, I wonder...
e1000e EEE driver part seems to need to "see" some link layer state and with old cables somehow gets it wrong and enters some bogus state, so that it cannot shutdown properly in your case?! That would be crazy, but I have seen all kind of things already. Always worth to try.
But yeah, it's a kind of energy-saving feature, which is not always necessary.
Last edited by geki (2021-04-21 06:23:10)
Hello:
... cause of misbehaving intel NICs wrt EEE are (old!) CAT5 network cables.
Would not be at all surprised.
But I don't think the POS router my telco provides has any EEE capability, so the problem is probably (in part) there.
... try with CAT6/7 (S)FTP cable ...
That could be a solution, if I had any need for EEE, which I don't.
Like I said, I think it is more a hindrance/problem for anything but a portable/non-mains device.
And then, after careful consideration of the pros/cons.
Just how much does a NIC in a portable device use?
How much energy is actually saved by adding this layer of complexity to an already very complex device?
e1000e EEE driver part seems to need to "see" some link layer state ...
From what I understand, EEE needs both devices (controller and router/switch/other controller) to autonegotiate what/how/when/whatever.
Otherwise it does not work.
More of a problem is (I think) that Intel's e1000e driver blocks all access to EEE, both for querying status and changing settings.
And that Intel Support makes no mention of it whatsoever and throws Sun under the bus.
Thanks for your input.
Best,
A.
Hello:
... and hope for the best.
The plot thickens ...
Since I set up the igb.EEE=0 stanza in the kernel command line, things had been coming along well enough.
But this morning I had another, albeit different, bad shutdown.
It had not reared its head for the longest while, probably because it was obscured by the other one.
This one reboots the box on shutdown with the fans on.
Not as bad but still quite annoying.
I then realised that I had not edited my shutdown script to its previous version.
ie: the one disabling WoL before shutting down and had left it at the version that removed the e1000e module before shutting down.
ie:
This one ...
#!/bin/sh
# added to troubleshoot nic related bad shutdown
PATH=/sbin:/bin:/usr/sbin:/usr/bin
# sync
# remove e1000e module
# shutdown system directly (no shutdownhelper)
sync && sudo rmmod -s -v e1000e && sudo shutdown -h now
instead of this other one ...
#!/bin/sh
# added to troubleshoot nic related bad shutdown
PATH=/sbin:/bin:/usr/sbin:/usr/bin
# sync
# disable onboard eth wol
# shutdown system directly (no shutdownhelper)
sync && sudo ethtool -s eth0 wol d && sudo shutdown -h now
Made me think that it was the reason for the tty1 output being different.
ie: no e1000e: eth0 NIC Link is Down or e1000e: EEE TX LPI TIMER: 00000000 in the output.
And that maybe the igb.EEE=0 bit was not really working. 8^7
Once things were as I thought they should be, I rebooted and shut down while getting a video and got the bad news:
The e1000e: eth0 NIC Link is Down and e1000e: EEE TX LPI TIMER: 00000000 lines now show in the output again.
So, the added stanza does not really work.
So I decided to make my shutdown script work a bit more and edited it to this version:
#!/bin/sh
# added to troubleshoot nic related bad shutdown
PATH=/sbin:/bin:/usr/sbin:/usr/bin
# sync
# disable onboard eth wol
# remove e1000e module
# shutdown system directly (no shutdownhelper)
sync && sudo ethtool -s eth0 wol d && sudo rmmod -s -v e1000e && sudo shutdown -h now
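One caveat with chaining everything with &&: if ethtool or rmmod fails, the shutdown command is never reached. A minimal sketch of a more forgiving variant, where try_step is a hypothetical helper and not something I have in my script:

```shell
#!/bin/sh
# Run a command; on failure report it on stderr but return success so that
# the following steps (in particular the shutdown itself) still run.
try_step() {
    "$@" || echo "warning: '$*' failed, continuing" >&2
    return 0
}

# Hypothetical rewrite of the last line of the script above:
#   sync
#   try_step sudo ethtool -s eth0 wol d
#   try_step sudo rmmod -s -v e1000e
#   sudo shutdown -h now
```

The trade-off is that a failed rmmod no longer stops the shutdown, so the warning on stderr is the only hint that the module was still loaded.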
A shutdown, reboot and video grab later got me this*:
* times edited for simplicity's sake
Devuan GNU/Linux 3 devuan tty1
devuan login: [ ] EXT4-fs (sda1): re-mounted. Opts: (null)
[ ] kvm: exiting hardware virtualization
[ ] sd 8:0:3:0: [sdg] Synchronizing SCSI cache
[ ] sd 8:0:2:0: [sdf] Synchronizing SCSI cache
[ ] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
[ ] sd 5:0:0:0: [sdb] Stopping disk
[ ] sd 4:0:0:0: [sda] Synchronizing SCSI cache
[ ] sd 4:0:0:0: [sda] Stopping disk
[ ] ACPI: Preparing to enter sleep state S5
[ ] reboot: Power down
Looking back, it makes sense as the NIC driver in use is not the igb driver but the e1000e one.
I'll have to try and see what using the e1000e.EEE=0 stanza gets me.
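Which driver a NIC is actually bound to can be read from sysfs. A sketch, assuming the usual /sys layout where device/driver is a symlink into the driver's directory; nic_driver is a made-up name:

```shell
#!/bin/sh
# Print the name of the kernel driver bound to a network interface.
# $1 = interface, $2 = sysfs mount point (defaults to /sys).
nic_driver() {
    link=$(readlink "${2:-/sys}/class/net/$1/device/driver") || return 1
    basename "$link"
}

# Example:
#   nic_driver eth0
```

A kernel command line entry prefixed with any other driver name (igb. in my case) never reaches the module that handles the NIC.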
Edit:
groucho@devuan:~$ sudo dmesg | grep e1000e
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.0-16-amd64 root=UUID=d6841f29-e39b-4c87-9c52-3a9c3bafe2d3 ro acpi_osi=Linux e1000e.eee=0 agp=off apparmor=0 ipv6.disable=1 enable_mtrr_cleanup nmi_watchdog=0
--- snip ---
[ 2.158949] e1000e: unknown parameter 'eee' ignored
--- snip ---
groucho@devuan:~$
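To avoid guessing at parameter names, the list the module itself advertises can be checked first. A sketch, assuming modinfo -p prints one name:description line per parameter; param_listed is a hypothetical helper:

```shell
#!/bin/sh
# Return 0 if the parameter name appears in `modinfo -p <module>` output.
# Reads that output on stdin; $1 = parameter name (case-sensitive).
param_listed() {
    grep -q "^$1:"
}

# Example:
#   if modinfo -p e1000e | param_listed EEE; then
#       echo "e1000e accepts an EEE parameter"
#   fi
```

Parameter names are case-sensitive, so eee and EEE are not the same thing as far as the kernel is concerned.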
Very sorry for the screw up. 8^7
Best,
A.
Last edited by Altoid (2021-04-22 21:24:18)
I continue posting my findings here. :-)
Looking at [0], just one of the many e1000e device/phy/whatever types supports ethtool EEE queries. Bad luck for you. Looking at [1], I guess you need to compile your favorite kernel version and comment out one line, to see if that helps: the call to e1000e_igp3_phy_powerdown_workaround_ich8lan in drivers/net/ethernet/intel/e1000e/ich8lan.c. Maybe your hardware hangs there, who knows. Maybe the workaround is no good on your hardware? Trial and error....
/**
 * e1000e_igp3_phy_powerdown_workaround_ich8lan - Power down workaround on D3
 * @hw: pointer to the HW structure
 *
 * Workaround for 82566 power-down on D3 entry:
 *   1) disable gigabit link
 *   2) write VR power-down enable
 *   3) read it back
 * Continue if successful, else issue LCD reset and repeat
 **/
[0] https://github.com/torvalds/linux/searc … G2_HAS_EEE
[1] https://github.com/torvalds/linux/searc … nd_ich8lan
Last edited by geki (2021-04-26 22:21:03)
Hello:
... posting my findings here.
No problem ...
But I'll have to continue with the rest on Part II.
You'll see why.
... you need to compile your favorite kernel version ...
Hmm ...
Thanks for the suggestion but I'll pass on that one.
I'm not comfortable with having to do all that just to be able to disable this EEE crap. 8^7
Yes, I saw those links but didn't understand what was going on.
I had not done it before so I was rather apprehensive but I managed to follow the instructions and compiled the latest e1000e driver module.
Please bear with me and see the results of my efforts in PII. 8^)
Thanks a lot for your input.
Best,
A.
Last edited by Altoid (2021-04-27 14:02:36)
Well, the same function should be in the driver you compiled yourself; comment that line out there and recompile. I wonder if that EEE timer has anything to do with the bad shutdown. Though, I may be wrong.
Hello:
... same function should be in the driver you compiled yourself; comment that line out there and recompile.
No idea how that works.
Quite happy to actually have been able to compile the driver.
For some reason I expected to see a .config file similar to what I have seen is used to compile a kernel.
So I'd check the boxes of the functions I wanted to compile in the driver.
But none of that came up.
And have no idea where to look or how to do it.
... if that EEE timer has anything to do with the bad shutdown.
Well ...
That is precisely what I am attempting to find out by disabling EEE.
Something that (for unknown reasons) seems to be impossible.
If you had read Part II (post #2) ... 8^D
I ran my rig for more than a week without a bad shutdown using the shutdown script described.
With EEE (supposedly disabled) and a standard shutdown -h now I immediately got a bad shutdown.
Could be a coincidence?
Maybe.
But I also got a line which would seem to be evidence of the EEE TX LPI Timer being active at shutdown, before the system went to S5 and powered down.
This in spite of EEE being disabled albeit with nothing in dmesg indicating that it was or it was not, like with Smart Power Down Disabled.
eg: e1000e: EEE Disabled or e1000e: unknown parameter 'eee' ignored.
... may be wrong.
I may be wrong.
But up to 'now' everything points to the EEE function in the NIC.
Would it be OK if we continue with this in PII?
That's where the driver module compilation (if it comes to that) would be discussed.
Thanks a lot for your input.
Best,
A.