@steve_v, are you trying to troll?
What you wrote until now is pretty illogical.
Moreover, you fail to understand what I am telling you.
there should be some investigations done... there is no investigation... because there is no resource available.
The most available resource for initial investigation of a bug only you have encountered is... You.
LOL. I hope you are not a developer, because what you wrote is totally dumb. This is a big part of any developer's work: trying to fix bugs that are not so obvious to find. And you know what? It is not the customers who do that work, but the developers…
You appear to want somebody else to take responsibility for this, yet you have so far:
* Reported it to a team that is not directly responsible for the component in question, with minimal information, no reproduction steps, and no logs or debug output.
You wrote "you appear". Indeed, this is your interpretation. Not the truth. Obviously, you did not take the time to read what I have stated before.
* Tried to divert discussion and tracking away from the official channels (bugtracker & mailing lists) to a user forum (which few developers frequent).
Ok, now I can affirm that you are a troll, because you are putting me on trial for intentions I never had.
* Failed to follow up when asked to reproduce the problem with an untainted kernel (taint disables kernel debugging), or engage on the bugtracker at all beyond the initial report. (hint: replies to the bugtracker mail people who might care, forum posts do not)
Now I can affirm that you are a troll who did not read everything, and who is acting very badly.
* Continually stated a "suspicion" that this is a software problem that "somebody" should investigate, yet focused your own cursory "investigation" entirely on hardware.
What are you talking about?
Why exactly do you expect the Devuan developers, of all people, to work on reproducing a "bug" you "suspect" you have found, when you yourself are apparently unwilling to do the same? Do you want them to feed and burp you as well?
It is not at all uncommon, regardless of the size of a project, for a bug report to remain that - just a report - until either the reporter or somebody else with the same problem provides enough information to reproduce the issue. Developers are not psychic, and what cannot be seen cannot be fixed.
Only then can it move to [confirmed] and work on finding the cause begin.
Where have I written that I was expecting anything from the Devuan developers? Are you dumb? Standardpoodle was amazed, not me.
It is a fact that bigger projects like Debian, Ubuntu, Fedora, etc. handle this kind of issue better. Here is an issue with Fedora and XFS:
- https://bugzilla.redhat.com/show_bug.cgi?id=2208553
And it is a fact that here we have nothing after 4 years. Is this a complaint? NO JUST FACTS.
Now just ignore me, because I will ignore you, as you need to get treatment.
zero wrote:I still suspect EXT4, but cannot prove this.
Isolated report from a single user (vs. the uncounted thousands running EXT4 daily), and no proposed mechanism or reliable reproduction: highly likely to remain an unproven suspicion.
This is why I can only suspect EXT4…
zero wrote:Unfortunately, Devuan team seems to be too small to handle this kind of issue.
If this is a bug in EXT4 (and frankly I very much doubt that) it should be punted upstream to the kernel mailing list. Investigating esoteric kernel bugs is not Devuan's responsibility.
And as long as EXT4 is only suspected, some investigation should be done. So, for now, there is no investigation, not because this is for sure an EXT4 bug, but because there is no resource available. Hence, the bug report goes stale…
Take a look at Debian, Ubuntu, etc. In these bigger projects, we get at least some feedback. Here we have nothing after 4 years.
Disks are fine, and nothing else happened after this event.
I repeat:
different controllers
different brands
two different disks
two different technologies: one HDD and one nvme
more than 100GB have been lost, while the system has at most 32GB of memory
but they all have in common EXT4.
So, I still suspect EXT4, but cannot prove this. Maybe I am totally wrong, but why EXT4? Well, there are some infamous known bugs:
- https://bugs.launchpad.net/ubuntu/+sour … bug/317781
- https://lore.kernel.org/lkml/50882787.3 … home.de/T/
- https://bugs.debian.org/cgi-bin/bugrepo … bug=785672
Unfortunately, Devuan team seems to be too small to handle this kind of issue.
I was looking for linux-image-5.19.0-trunk-amd64 in the Experimental repository, when I found that the Experimental package list was limited to these packages:
apulse
apulse-dbgsym
debug-symbols
db4.8-util
libdb4.8
libdb4.8++
libdb4.8++-dev
libdb4.8-dbg
libdb4.8-dev
libdb4.8-java
libdb4.8-java-dev
libdb4.8-java-gcj
libdb4.8-tcl
devuan-sanity
dummy-systemd-dev
eudev
eudev-dbgsym
debug-symbols
libeudev-dev
libeudev1
libeudev1-dbgsym
debug-symbols
overlay-boot
setnet
udptap
udptap-dbgsym
debug-symbols
I did not expect to see so few packages.
Why is there not almost the same number of packages (minus systemd related packages) as in Debian?
Would it be safe to mix Experimental Debian kernel packages, and maybe Experimental nvidia-driver with Devuan? Or should I compile drivers from kernel.org and get NVIDIA drivers from their website?
Thanks to you two. At least I am not crazy!
Seems that we have to deal with it, but it is not a big deal.
Thanks for your answer, you help me to better understand the situation.
I have concluded that there are at least two issues.
Another, not yet clearly identified, but probably related to the GPU and how ACPI is handled
For the first issue, this is definitely not what causes the system to hang at resume. Now I have to wait for the fix.
For the second issue, it reminds me of the exact same problem I got when I tried to enable Wayland with the 470 driver series, but here with Xorg only…
So I tried the latest driver (510.54) from the Nvidia website. Now I can suspend the system without getting any crash. Moreover, I now have Wayland working.
I do not understand how, but before the new install, and with the same drivers, the system worked fine…
Anyway, thanks for your help.
EDIT
What a noob… I did not notice that Nouveau was running, until I used some software that is GPU greedy… With the Nvidia 510.54 driver, I have the same bug as before, but not with Nouveau.
But at least, now I can blame Nvidia driver.
Yesterday I installed Devuan on a pc that used to work perfectly before, but this time I get a suspend issue.
Maybe I am wrong, but I think it is related to runit, as the same drivers are installed as well as the kernel, only runit is new.
Now if I suspend the system while being on gdm, or gnome, Devuan crashes at resume, leaving only a black screen, or some message. I have to hard reboot or reset with Alt Gr+Print Scrn+b.
But if, before suspend, I go to any tty, and then push the suspend button on the keyboard, the system resumes without any problem, but now I can get some output from dmesg.
For each "virtual" cpu I get lines like these:
BUG: scheduling while atomic: cpuhp/…
Preemption disabled at:
the call stack
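To see how many of these messages there are and which tasks they come from, the dmesg output can be filtered with a few lines of Python. This is only a rough sketch: it assumes the usual kernel message layout `BUG: scheduling while atomic: <task>/<pid>/0x<preempt count>`, and the sample lines below are made up for illustration, they are not taken from this machine.

```python
import re
from collections import Counter

# Assumed layout of the kernel message: task name, PID, preempt count in hex.
PATTERN = re.compile(
    r"BUG: scheduling while atomic: (?P<comm>.+)/(?P<pid>\d+)/0x(?P<count>[0-9a-fA-F]+)"
)

def count_atomic_bugs(log_text):
    """Count 'scheduling while atomic' reports per task name."""
    hits = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            hits[match.group("comm")] += 1
    return hits

# Hypothetical sample, only to show the expected input shape.
SAMPLE = """\
[  123.000001] BUG: scheduling while atomic: cpuhp/3/25/0x00000002
[  123.000002] Preemption disabled at:
[  123.000100] BUG: scheduling while atomic: cpuhp/11/70/0x00000002
"""

if __name__ == "__main__":
    for task, number in sorted(count_atomic_bugs(SAMPLE).items()):
        print(task, number)
```

In practice the text would come from a saved `dmesg` capture instead of `SAMPLE`; a per-task count like this is also easier to attach to a bug report than the raw log.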
It could also be related to xserver-xorg-video-nvidia, but I tried almost every combination that used to work just before the new installation (Chimaera kernel and drivers, backports kernel and drivers, Ceres…), and since this new installation I get the same issue each time.
Any idea on how to solve this issue?
Finally, to whom should I report this bug? Devuan only, if it is related to runit? To the Linux kernel maintainers?
Yesterday, I installed Devuan twice on my pc, and I noticed something strange the second time. I also get a suspend issue (but that one is for another topic).
Before that, Devuan was already installed on this same machine. It had worked very well since 2018, but was based on an older Debian install. Moreover, I wanted to set up runit without too much work.
I first downloaded the netinstall iso from the website, then wiped out the partition table.
The first time the installer:
set up the network by itself, with the right private IP and Google DNS, without DHCP;
asked me to add the user to the sudo group;
asked me some questions about country and locales, and finally set the time correctly.
At boot time, I noticed that I had made a mistake in the partitioning, so I wiped out the partition table again and reinstalled Devuan with the exact same options selected. But this time, the installer did not set up the network, did not ask me to add the user to the sudo group, did not ask me anything about the timezone, and did not set the clock correctly.
Can anybody explain why the installer behaves so inconsistently?
Well I have finally tested the RAM; all memtest86+ tests are OK (4/4 pass).
@Head_on_a_Stick & @fsmithred
Thanks for the links, rodsbooks.com helped me a lot to understand how UEFI works. Previously, in my understanding, UEFI was only a kind of BIOS wrapper.
As for legacy mode, I used to think that I could get it under Grub EFI.
@fsmithred,
To free up the memory that mate is using, drop to console (ctrl-alt-F2) log in as root and stop your display manager.
/etc/init.d/slim stop
Or replace slim in that command with lightdm or whatever dm you're using. Replace stop with start in that command when you're ready to go back to the desktop.
You are right, how did I not think about this before :s Anyway, even if I stop lightdm, then stop as many services as possible, I will never get the maximum RAM to test.
Finally, I did my tests with memtest86+ and all 4 passes are OK.
Thanks for all.
Ok thanks, I did not know that this version doesn't work in UEFI mode :s
I tried the UEFI version with usb sticks, but, strangely, I still get a black screen… The keyboard now seems to react (LEDs, etc.), but there is still nothing on screen.
As for memtester, it should be "less accurate" than memtest86+; indeed, I will have to allocate an amount of memory, leaving at least 2GB unchecked (MATE takes ~1.2GB).
Can you help me to get my system in Legacy mode?
As an aside, do you have some links about how EFI works with Grub? How can I manage grubx64.efi entries under /boot/efi?
I tried efibootmgr, but it seems useless: some entries are rewritten at boot?
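One way to check whether entries really are rewritten at boot would be to save the output of efibootmgr before and after a reboot and compare the two captures. The sketch below assumes the usual `BootXXXX* label` line format printed by efibootmgr; the function names and the sample text are hypothetical, not output from this machine.

```python
import re

# Assumed line format: "Boot0000* label" (active) or "Boot0000  label" (inactive).
ENTRY = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$")

def parse_entries(text):
    """Return {boot number: (is_active, label)} from saved efibootmgr output."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            number, active, label = match.groups()
            entries[number] = (active == "*", label.strip())
    return entries

def changed_entries(before, after):
    """Return entries that were added, removed, or modified between two captures."""
    changes = {}
    for number in sorted(set(before) | set(after)):
        if before.get(number) != after.get(number):
            changes[number] = (before.get(number), after.get(number))
    return changes

# Hypothetical captures, only to show the idea.
BEFORE = "BootCurrent: 0000\nBootOrder: 0000,0001\nBoot0000* devuan\nBoot0001* my-test-entry\n"
AFTER = "BootCurrent: 0000\nBootOrder: 0000\nBoot0000* devuan\n"

if __name__ == "__main__":
    print(changed_entries(parse_entries(BEFORE), parse_entries(AFTER)))
```

If an entry created by hand shows up in the "before" capture but not in the "after" one, that would suggest the firmware, not Grub, is dropping or rewriting it.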
I was trying to use memtest86+ for this topic.
Unfortunately, I get a black screen when I try to use it.
The keyboard's LEDs turn off, and do not react when I push keys like num lock and caps lock.
Here is the related grub.cfg:
### BEGIN /etc/grub.d/20_memtest86+ ###
menuentry "Memory test (memtest86+)" {
    insmod part_gpt
    insmod ext2
    set root='hd2,gpt1'
    if [ x$feature_platform_search_hint = xy ]; then
        search --no-floppy --fs-uuid --set=root --hint-bios=hd2,gpt1 --hint-efi=hd2,gpt1 --hint-baremetal=ahci2,gpt1 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    else
        search --no-floppy --fs-uuid --set=root xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    fi
    linux16 /memtest86+.bin
}
menuentry "Memory test (memtest86+, serial console 115200)" {
    insmod part_gpt
    insmod ext2
    set root='hd2,gpt1'
    if [ x$feature_platform_search_hint = xy ]; then
        search --no-floppy --fs-uuid --set=root --hint-bios=hd2,gpt1 --hint-efi=hd2,gpt1 --hint-baremetal=ahci2,gpt1 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    else
        search --no-floppy --fs-uuid --set=root xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    fi
    linux16 /memtest86+.bin console=ttyS0,115200n8
}
menuentry "Memory test (memtest86+, experimental multiboot)" {
    insmod part_gpt
    insmod ext2
    set root='hd2,gpt1'
    if [ x$feature_platform_search_hint = xy ]; then
        search --no-floppy --fs-uuid --set=root --hint-bios=hd2,gpt1 --hint-efi=hd2,gpt1 --hint-baremetal=ahci2,gpt1 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    else
        search --no-floppy --fs-uuid --set=root xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    fi
    multiboot /memtest86+_multiboot.bin
}
menuentry "Memory test (memtest86+, serial console 115200, experimental multiboot)" {
    insmod part_gpt
    insmod ext2
    set root='hd2,gpt1'
    if [ x$feature_platform_search_hint = xy ]; then
        search --no-floppy --fs-uuid --set=root --hint-bios=hd2,gpt1 --hint-efi=hd2,gpt1 --hint-baremetal=ahci2,gpt1 xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    else
        search --no-floppy --fs-uuid --set=root xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    fi
    multiboot /memtest86+_multiboot.bin console=ttyS0,115200n8
}
### END /etc/grub.d/20_memtest86+ ###
Well, here is the fstab (I have removed the UUIDs):
UUID= /boot ext4 defaults 0 2
UUID= /boot/efi vfat umask=0077 0 1
#/dev/mapper/devuan_root / ext4 errors=remount-ro 0 1
UUID= / ext4 errors=remount-ro 0 1
#/dev/mapper/devuan_usr /usr ext4 defaults 0 2
UUID= /usr ext4 defaults 0 2
#/dev/mapper/devuan_var /var ext4 defaults 0 2
UUID= /var ext4 defaults 0 2
#/dev/mapper/devuan_tmp /tmp ext4 defaults 0 2
UUID= /tmp ext4 defaults 0 2
#/dev/mapper/nvme_crypt /home ext4 defaults 0 2
UUID= /home ext4 defaults 0 2
#/dev/mapper/sde1_crypt /home/zero/public ext4 defaults 0 2
UUID= /home/zero/public ext4 defaults 0 2
Indeed, the two "faulty drives" are mounted under /home, while the other drives are mostly "read-only".
As for wearing out my SSD: a new and better SSD like the Kingston A400 120GB costs ~20€, while the SAMSUNG SSD 830 cost me ~90€ more than 7 years ago.
As I recall, the TBW of the SAMSUNG SSD 830 120GB is around 30; unfortunately, it seems the spec sheet for this drive was among the vanished files that I definitively lost :s.
After more than 7 years, my SAMSUNG SSD 830 displays ~11GB of total writes… So, theoretically, it should last 14 more years…
So, wearing out my SSD is not a real problem.
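As a rough check of that estimate, the remaining lifetime can be worked out from the figures above. This is only a sketch under two assumptions: the rated endurance (around 30) and the total writes so far (~11) are expressed in the same unit, and the write rate stays the same as over the last 7 years. The function name is just for illustration.

```python
def years_remaining(rated_writes, written_so_far, years_in_service):
    """Linear extrapolation of drive endurance from the average write rate so far."""
    writes_per_year = written_so_far / years_in_service
    return (rated_writes - written_so_far) / writes_per_year

# Figures quoted above: endurance ~30, ~11 written, after a bit more than 7 years.
print(round(years_remaining(30, 11, 7), 1))
```

With these inputs the extrapolation gives about 12 more years, a little less than the 14 mentioned above, but the conclusion stays the same: wear is not the limiting factor here.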
I think it should be better to display lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 119.2G 0 disk
├─sda1 8:1 0 1.9G 0 part /boot/efi
├─sda2 8:2 0 41.9G 0 part
├─sda3 8:3 0 10G 0 part
│ └─devuan_root 254:0 0 10G 0 crypt /
├─sda4 8:4 0 14G 0 part
├─sda5 8:5 0 14G 0 part
│ └─devuan_var 254:2 0 14G 0 crypt /var
├─sda6 8:6 0 5.6G 0 part
│ └─devuan_tmp 254:3 0 5.6G 0 crypt /tmp
└─sda7 8:7 0 31.9G 0 part
└─devuan_usr 254:1 0 31.9G 0 crypt /usr
sdb 8:16 0 931.5G 0 disk
├─sdb1 8:17 0 922G 0 part
│ └─sde1_crypt 254:5 0 922G 0 crypt /home/zero/public
├─sdb2 8:18 0 1K 0 part
├─sdb5 8:21 0 7.7G 0 part
└─sdb6 8:22 0 1.9G 0 part
sdc 8:32 1 3.8G 0 disk
└─sdc1 8:33 1 3.8G 0 part /boot
nvme0n1 259:0 0 465.8G 0 disk
└─nvme_crypt 254:4 0 465.8G 0 crypt /home
/dev/sda is the SAMSUNG_SSD_830 (no data loss)
/dev/sdb is the WDC_WD10EZEX (data loss)
/dev/nvme0n1 is the Samsung SSD 970 EVO 500GB (data loss)
Everything is "perfectly seated, and so on".
No strange noises, no dust (very clean inside), more than adequately ventilated. I repeat: for all tools, smartmontools included, everything is OK (even multiple/long tests).
Yes, I use some Ceres packages (mainly for lxc).
I installed sigil from the Devuan Ascii repository (0.9.7+dfsg-1). I had not used it before. Indeed, I switched from Debian to Devuan after ~15 years of using mainly SID. (I used to use FreeBSD too.)
I only tried to start Sigil for the first time on Devuan. But it complained that it could not run. As if it was unable to write anything on the nvme when the bug occurred?
I suppose, but I do not affirm, that EXT4 is the problem. Maybe the RAM is faulty, but that seems strange, as:
two different disks
more than 100GB have been lost, while the system has at most 32GB of memory
To get corrupted, this data (files, folders, etc.) from these two disks would have had to be "randomly loaded into RAM" (say 30GB at most) more than three times (3*30 = 90GB), and get corrupted each time. How and why would it be loaded into RAM?
Anyway, I will test the RAM as soon as I can.
System: Host: devuan Kernel: 4.19.0-0.bpo.4-amd64 x86_64 (64 bit) Desktop: MATE 1.20.4
Distro: Devuan GNU/Linux ascii
Machine: Device: desktop Mobo: Micro-Star model: B450M PRO-VDH (MS-7A38) v: 4.0
UEFI: American Megatrends v: M.40 date: 01/25/2019
CPU: Hexa core AMD Ryzen 5 2600 Six-Core (-HT-MCP-) cache: 3072 KB
clock speeds: max: 3400 MHz 1: 1390 MHz 2: 1518 MHz 3: 1383 MHz 4: 1398 MHz 5: 1379 MHz 6: 1399 MHz
7: 1549 MHz 8: 1592 MHz 9: 1356 MHz 10: 1323 MHz 11: 1457 MHz 12: 1386 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Oland PRO [Radeon R7 240/340]
Display Server: X.Org 1.19.2 drivers: ati,radeon (unloaded: modesetting,fbdev,vesa)
Resolution: 1920x1200@59.95hz
GLX Renderer: AMD OLAND (DRM 2.50.0, 4.19.0-0.bpo.4-amd64, LLVM 7.0.1)
GLX Version: 4.5 (Compatibility Profile) Mesa 18.3.6
Audio: Card-1 Advanced Micro Devices [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
driver: snd_hda_intel
Card-2 Advanced Micro Devices [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
driver: snd_hda_intel
Sound: Advanced Linux Sound Architecture v: k4.19.0-0.bpo.4-amd64
Network: Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller driver: r8169
IF: eth0 state: up speed: 100 Mbps duplex: full mac: 00:xx:xx:xx:xx:xx
Drives: HDD Total Size: 1132.3GB (79.0% used)
ID-1: /dev/nvme0n1 model: N/A size: 500.1GB
ID-2: /dev/sda model: SAMSUNG_SSD_830 size: 128.0GB
ID-3: /dev/sdb model: WDC_WD10EZEX size: 1000.2GB
ID-4: USB /dev/sdc model: USB_2.0_FD size: 4.0GB
Partition: ID-1: / size: 9.8G used: 1.2G (13%) fs: ext4 dev: /dev/dm-0
ID-2: /usr size: 32G used: 6.9G (23%) fs: ext4 dev: /dev/dm-1
ID-3: /boot size: 3.7G used: 175M (5%) fs: ext4 dev: /dev/sdc1
ID-4: /var size: 14G used: 1.4G (11%) fs: ext4 dev: /dev/dm-2
ID-5: /tmp size: 5.5G used: 25M (1%) fs: ext4 dev: /dev/dm-3
ID-6: /home size: 458G used: 153G (36%) fs: ext4 dev: /dev/dm-4
Sensors: System Temperatures: cpu: No active sensors found. Have you configured your sensors yet? mobo: N/A gpu: 33.0
Repos: Active apt sources in file: /etc/apt/sources.list
deb http://pkgmaster.devuan.org/merged ascii main contrib non-free
deb http://pkgmaster.devuan.org/merged ascii-updates main
deb http://pkgmaster.devuan.org/merged ascii-security main
deb http://pkgmaster.devuan.org/merged ascii-backports main contrib non-free
Active apt sources in file: /etc/apt/sources.list.d/ceres.list
deb http://fr.deb.devuan.org/merged ceres main non-free contrib
Info: Processes: 393 Uptime: 12:10 Memory: 19808.3/32183.9MB Client: Shell (bash) inxi: 2.3.5
The nvme is the Samsung SSD 970 EVO 500GB.
Summary from bug#313
There are 4 drives, two different brands, 3 different technologies (SSD, Hard Drive, and nvme), and data loss occurred on the HDD and the nvme… So different technologies, "different controllers", different brands but they all have in common EXT4.
The data vanished in a few seconds…
I tried photorec without success. The files are not recoverable.
I could add that it's a "relatively brand new pc—2 months old" and that I have tried
nvme-cli
smartmontools
Samsung Magician
All these tools say that EVERYTHING is OK, and strangely, Photorec does not find those files.
Finally, and I think this is very important: I was not logged in as root in any way (terminal, etc.), nor did I use sudo. Yet files that belonged to root or to other users/groups have also vanished, although my user did not have any "super rights". So more than 100GB on two disks vanished instantly, which is not possible and too fast for an "rm" command (which I never ran).
I launched Sigil (displayed in full screen) to write something, then I got an error message in a dialog box, saying something like "Sigil cannot run". So I clicked on "close" and noticed that the background image was gone…
I understood that my pictures were gone… :s I checked my home folder and indeed, my pictures were gone… it seems a few other files were gone too… Luckily, I did back up my system some days ago.
Now how can I track what happened? Nothing is logged in /var, and dmesg yields nothing. Moreover, neither smartctl nor Samsung Magician reports anything.
And how can I track this in the future?
---EDIT---
Files on another drive are missing. Possibly an EXT4 BUG??? How can I check this?
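One possible way to track this in the future would be to record a manifest of the files on each drive at regular intervals and compare it with the previous one, so that any disappearance can at least be dated and listed. The sketch below is only an illustration of that idea: the function names and the JSON manifest format are made up here, and it will not explain why files vanish, only which ones and roughly when.

```python
import json
import os

def snapshot(root):
    """Return {relative path: [size, mtime]} for every file found under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                info = os.lstat(full)
            except OSError:
                continue  # file disappeared or is unreadable while scanning
            manifest[os.path.relpath(full, root)] = [info.st_size, int(info.st_mtime)]
    return manifest

def save_manifest(manifest, path):
    """Write a manifest to a JSON file (ideally stored on a different drive)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle)

def load_manifest(path):
    """Read a manifest previously written by save_manifest."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)

def missing_files(old, new):
    """Return the paths present in the old manifest but absent from the new one."""
    return sorted(set(old) - set(new))
```

Run from cron (for example once an hour on /home and /home/zero/public), the list returned by `missing_files` narrows the time window of a disappearance, which makes it easier to match against logs and to write a usable bug report.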