/proc/sys/debug/exception-trace has something to do with controlling verbosity (following that sysdig article). It can hold a 0 (off) or 1 (on). On my machine, it seems to be turned on already.
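For reference, a quick check-and-set (the sysctl only takes 0 or 1; run the echo as root):
cat /proc/sys/debug/exception-trace
echo 1 > /proc/sys/debug/exception-trace    # 0 turns it back off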
sysdig.com has an article on troubleshooting containers. I believe this story has something to do with vsyscall later on in the document.
https://sysdig.com/blog/troubleshooting-containers/
But an early troubleshooting step shown there is to generate a core dump.
ulimit -c unlimited
bad_command_to_execute
For me, this is some kind of BOINC process. It might be that executing the shell script in init.d would show something interesting, but I suppose the core produced would be of bash, dash or something like that. So I think I need to manually start a BOINC job, in the hope that it makes this vsyscall.
I had a desktop machine lock up on me, and trying to get it going again has been problematic.
Most of my computers are set with vsyscall=emulate on the command line, and as near as I can tell the only reason to have that setting is that some sources of BOINC jobs require it. At one time (2 years ago), there was a special setting for BOINC having to do with LIBC215. I just went looking at Einstein@Home, and no option like that appears to be present now.
In order to boot consistently now, I have had to remove execute permissions on /etc/init.d/boinc-client. If I boot to multiuser mode and check the kernel command line, I can see the vsyscall=emulate. If I then restart BOINC, the system immediately locks up.
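For the record, the workaround and the check look like this:
chmod -x /etc/init.d/boinc-client    # keeps the init script from starting BOINC at boot
cat /proc/cmdline                    # confirms vsyscall=emulate is in effect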
I had updated the system to the new libc (2.29-2) a few days ago. It may be that it took BOINC a few days to "tickle" something, which is causing this crashing.
Is there any way to get more information as to what is happening?
Too many posts under my name at the top of the list.
As it appears that the original ASRock motherboard was at least partially at fault for the Ryzen 1600 not working, it seems possible that that CPU is not a dud.
---
I picked up an A320 motherboard on sale. If this Ryzen 1600 works, that's wonderful. But I doubt the application requires a 6-core/12-thread CPU.
Long ago, I heard of the stealth logserver.
I suppose the original stealth logserver was a printer attached to the console, which just printed everything that happened. If someone broke in, they could do nothing about the log entries already printed. Never mind the kernel messages about the printer being on fire, which were always a joke.
But the stealth logserver I remember from way back when was a computer on a LAN where all the non-stealth computers were configured to send their logs to a public logserver, which logged everything. Also on the LAN was a computer whose NIC had no assigned IP address and was in promiscuous mode, so it too could log everything. As it had no IP address, there was no (easy) way to get to it to erase the logs.
More recent writeups about stealth logservers don't talk about a stealth (promiscuous) machine paralleling a real logserver. Instead, all the machines on the LAN are configured to send log messages to a machine which doesn't exist.
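Roughly what that second config looks like, as a sketch (the address and MAC are made up; note that with plain UDP syslog the clients need a static ARP entry for the phantom address, or the frames never actually hit the wire):
# on each client, e.g. /etc/rsyslog.d/stealth.conf:
*.*  @192.0.2.99
# plus a static neighbour entry for the phantom logserver:
ip neigh add 192.0.2.99 lladdr 02:de:ad:be:ef:01 dev eth0 nud permanent
# on the stealth box: interface up, promiscuous, no address; just capture:
ip link set eth0 up promisc on
tcpdump -ni eth0 -w /var/log/stealth.pcap udp port 514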
Is it better that the stealth logserver is paralleling a real logserver, or that the config of all the other machines makes it obvious that the logs are on a "stealth" machine?
Recent articles on this concept talk about running a network intrusion detector, like snort, on a stealth machine. The idea of a stealth machine is that there is no way to get to it; it has no IP. So if the intrusion detection machine detects that an intrusion has happened, how does that message get to the LAN server and/or the router, to stop the intrusion?
The computer this is on is one of the smallest I have (and is eventually supposed to spend time in my truck).
The partition for /usr is getting a little short on space.
If a Debian/Devuan developer is working with one or more packages, the instructions about:
apt-get build-dep whatever-package
are not a problem. They are going to be using those packages over and over again, so there is a need for them to be on the system.
For someone like me, trying to track down a bug, they are a problem. I have to install a bunch of packages I don't really want on the final system, which means trying to remember what to delete/purge later on.
Is there a jail/container/vm solution to this?
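One candidate I should try, sketched here (pbuilder is the Debian tool made for exactly this; the ascii suite and merged mirror are my guesses at the right Devuan settings):
apt-get install pbuilder
pbuilder create --distribution ascii --mirror http://deb.devuan.org/merged
apt-get source ocl-icd
pbuilder build ocl-icd_*.dsc    # build-deps go into the throwaway chroot, not the real system
Everything the build pulled in disappears with the chroot, so there is nothing to remember to purge.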
I see that the DEB_BUILD_OPTIONS from the earliest comment is slightly different from what I have tonight. I think the important part is nostrip (don't strip the debug symbols out).
Tonight, I went playing with the libopencl1 (aka ocl-icd) source code again.
In the past, and again tonight, I have been running
DEB_BUILD_OPTIONS=nostrip,noopt dpkg-buildpackage -rfakeroot -uc -us
(Sometimes it is "nostrip noopt"; both seem to work.) But maybe my problem was that I didn't have the build-depends correct? So I did
apt-get build-dep ocl-icd
and it installed, I believe, 3 packages which weren't here before. Arrgh! :-) Okay, I've got a contaminated "source tree", so I made a new directory and re-downloaded the ocl-icd source.
Running the dpkg-buildpackage command (with the DEB...) takes longer than earlier tonight (with the contaminated source tree), and I again get the error: dump_vendor_icd not found.
The clinfo program needs that function to do anything (first libopencl function it calls). From reading various OpenCL related things, it seems most programs call it at the beginning. How can it not be present?
There is no C or C++ source code in the source package. What looks to be happening is that the C source is generated by something written in Ruby, at whatever point the Makefile (or dpkg-buildpackage) decides. How the heck do I track down why this function is missing in dynamically created source code? I would much rather work in FORTRAN-IV or FORTRAN-77. :-)
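About the only approach I can think of, as a sketch:
fgrep -rn dump_vendor_icd .    # every mention of the symbol, generated files included
make -n | grep -i ruby         # dry run; shows which make rule is supposed to invoke the generator
If the symbol only shows up in .rb files, then the C that should define it was never generated.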
A day or so ago, I ran across some new threads of interest about APUs and GPUs in a computer.
If I want to compare
%command%
with
DRI_PRIME=1 %command%
If the command is glxinfo and I go looking for the provider, the first version reports the APU and the second version reports the RX-550. So I am guessing that the motherboard graphics and APU are the default. If the command is glxgears, I get about a 20% higher frame rate with the RX-550. The monitor is connected to the motherboard VGA connector.
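Concretely, the comparison was (standard Mesa tools):
glxinfo | grep "OpenGL renderer"                # reports the APU here
DRI_PRIME=1 glxinfo | grep "OpenGL renderer"    # reports the RX-550
glxgears                                        # baseline frame rate
DRI_PRIME=1 glxgears                            # roughly 20% faster here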
I am guessing this means that amdgpu is able to work with both the APU and the GPU. I have never read anything about having to put DRI_PRIME=1 into anything to do with BOINC jobs, so I don't know if that is a solution or not. There is still the problem that clinfo hangs and never tells you anything.
Hello Toxic, sorry I didn't know a reply had come in.
I am just using Mesa3D as a provider of OpenCL. I had run into some problem quite a while ago about OpenCL (from Mesa?) co-existing with POCL on the same machine. I could try purging POCL if that is present, I hadn't remembered to look.
Some time ago, I found time to pull the old motherboard (old, though it is almost new). One of the support posts came undone, so that required a little fiddling. The odd header is in a different place between the old ASRock and the new MSI motherboard. Manufacturers really don't include everything they should in the instructions.
Anyway, I cleaned the old heat transfer compound off the stock Wraith cooler and CPU with the Arctic kit (the cooler never did look like a mirror). I put some Arctic MX-4 on the new (Ryzen 2600) CPU and re-installed the Wraith cooler. I had problems with the cooler not being coplanar with the CPU, which is aggravated by the screws on the cooler being (IMHO) too short.
As this new motherboard had never been booted before, I had some adjustments to make. Like different net hardware. And possibly that SATA1 is not available if you have an NVMe SSD (I do); that is poorly documented. Anyway, Devuan booted. BOINC ran fine with 3 cores (of 12). Run more than 3 cores, and eventually it dies. I thought at first I'd got another dog of a CPU; nothing in the logs. But run xsensors, and the CPU is running too warm. I think the mounting problem caused much of the thermal compound to get squeezed out.
Why spend time using a stock cooler whose screws are 2mm too short? So I bought the AM4-specific Noctua U12 tower cooler. After a wait for it to arrive in the mail, it popped on just fine. Used the Noctua compound with the Noctua cooler. I've run it with 3, 4, 5, 6 and now 8 cores under BOINC, and the temperatures sit in the high 50s to low 60s C. There is a lot of overlap; you couldn't predict the number of active cores from the temperatures.
I picked up a _big_ disk (Seagate 8TB NAS) and put it in this machine as well, the idea being to set it up as a documentation server (involving squid) should the Internet go down. It's on btrfs. I'm still trying to figure out how to get it to cache things for HTTP and HTTPS. Essentially I am the only person in the house who uses the computers, so I don't think there are any breaking-encryption problems with this.
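The squid side, as far as I currently understand it (a sketch only: this is squid 4 ssl-bump syntax, the option names changed from 3.x, and the CA path is made up):
http_port 3128 ssl-bump tls-cert=/etc/squid/bump-ca.pem generate-host-certificates=on
sslcrtd_program /usr/lib/squid/security_file_certgen -s /var/spool/squid/ssl_db -M 4MB
ssl_bump bump all
cache_dir ufs /srv/bigdisk/squid 100000 16 256
The clients then have to trust bump-ca.pem, which seems acceptable when I am the only user.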
---
This machine was originally a Ryzen 1600; I also have a Ryzen 1600X. That machine has a not-too-big RAID-10 and is running BackupPC, mostly to back up my LAN server, which is due for hardware upgrades yet this (almost finished) winter.
Most of my machines run amd64 CPU cores with AMD GPUs. Most of the GPUs are now Polaris. An exception is a mini-ITX running an A10 APU with an RX-550.
I had been running it with both the radeon and amdgpu modules, and I think radeon was doing the APU/motherboard side and amdgpu was doing the RX-550.
I tried getting it to use just the amdgpu module. I tried blacklisting and I tried kernel command line options. It runs, but it doesn't do GPU work for BOINC any more, and clinfo doesn't work. If you run clinfo at a command prompt, it hangs. Running clinfo under strace didn't seem to point to anything.
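For reference, the sort of thing I meant by blacklisting and command line options (whether the handoff switches even apply depends on which GCN generation the APU is, and I haven't verified that):
echo 'blacklist radeon' > /etc/modprobe.d/blacklist-radeon.conf
# or, on the kernel command line:
radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1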
I downloaded the source to clinfo and changed all print operations (to stdout) to fprintf to stderr, which should mean all prints are unbuffered. Well, it hangs on the very first call to the OpenCL library. What I think this call is supposed to do is just return how many platforms are present, so that clinfo can loop over them.
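A minimal reproduction of that first call, as a sketch (assumes the OpenCL headers and a libOpenCL are installed; the call in question should be clGetPlatformIDs, if I am reading clinfo right):
cat > /tmp/cl-hang.c <<'EOF'
#include <stdio.h>
#include <CL/cl.h>
int main(void)
{
    cl_uint n = 0;
    fprintf(stderr, "calling clGetPlatformIDs...\n");
    cl_int err = clGetPlatformIDs(0, NULL, &n);  /* just count the platforms */
    fprintf(stderr, "returned %d, %u platform(s)\n", (int)err, n);
    return 0;
}
EOF
gcc /tmp/cl-hang.c -o /tmp/cl-hang -lOpenCL && /tmp/cl-hang
If that hangs too, the problem is in the ICD loader itself, not in clinfo.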
I was doing this in emacs with the gdb interface, and if you try to single-step into that library call, it just hangs. I believe I can still get control of the process from within emacs. So, probably an endless loop?
Debian doesn't supply that library with debug symbols (and so it isn't surprising that Devuan doesn't either). So I downloaded the source and did
DEB_BUILD_OPTIONS="nostrip noopt debug" dpkg-buildpackage -b -uc -us
And the first time, dpkg-buildpackage said I didn't have all the prerequisites, so I downloaded more stuff.
Now the compile bombs out on an error. It cannot find dump_vendor_icd().
I am in the top directory of this package when running dpkg-buildpackage. I would have thought that function would be a few subdirectories down, but fgrep shows it is defined in a file in that top directory. I didn't look to see if it is also defined elsewhere.
Or maybe I am missing something about how to compile the package with debug symbols. So I am describing what I think I did, while changing consoles with a KVM switch to type stuff.
If I write back in this post, will it get noticed?
I am still working with problems related to graphics.
I never did get a local version of that AMD graphics firmware built. Debian cascading to Devuan did produce updates. On the machine I was having problems with, some newer kernels would boot and some wouldn't. That problem eventually resolved itself through kernel upgrades.
But the problem of the CPU locking up, whether lightly loaded, heavily loaded or something else, never went away. The firmware package that is in Devuan should load the right stuff. I've upgraded the BIOS to the newest available. I have tried playing with various settings on the kernel command line, and while they may turn lockups in an hour or 2 into lockups in several hours, the machine just wouldn't run.
To me, the symptoms looked like the idle lockup, where a CPU core goes to sleep and never wakes up, and at some point this brings the entire system down. I don't have sleep or hibernation or anything like that set up (it's a desktop, and when running it does BOINC or other stuff).
Oh, about the time I tried the Ryzen 5 2600, I also purchased a pure sine wave UPS, so it was on better power than before. I believe the power supply is a Seasonic Focus 80+ Platinum. Should be good enough components.
That was with a Ryzen 5 1600. I bought a Ryzen 5 2600 and tried that, and it does the same thing. This being winter, I had time to work on things, and so I am almost finished replacing the motherboard.
What is left is to remove the CPU/cooler from the old motherboard, clean the two surfaces off (I have the Arctic kit for this) and apply new compound to the CPU/cooler interface before installing the new cooler (factory Wraith) on the CPU.
If the new motherboard works, then I will be happy. I will report on that.
----
If the new system works, I am going to think that my old CPU is fine. I have another application for it, where I have an AM4 socket motherboard with the low-end A320 chipset.
AMD not long ago announced some new CPUs. One of them was the Ryzen 5 2600E, a 45W CPU. It appears that, at least in the near term, this is an OEM-only chip. But some people think that if you undervolt a regular 2600, you should be able to get it down to about 45W. The A320 chipset is not supposed to allow overclocking (and yet you can find manufacturers saying how wonderfully their A320 motherboards overclock), but perhaps it will let a person undervolt the CPU? I will report on that at some point. But I have another post to make.
Somewhere I had a message about some hardware problems I was having, and still am having.
The 4.19 kernel finally came out, and so I rebooted this Ryzen 1600 machine which has the idle problem.
In the bootup, there is a message to the effect that polaris11_k_mc.bin is missing.
Debian has 2 bug reports related to this missing firmware in firmware-amd-graphics (the most recent build was August, I believe). I've no idea when this package might be upgraded so that the 4.19 boot finds this file.
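At least checking what is installed versus what the kernel wants is easy:
dmesg | grep -i firmware                        # shows the failed load
ls /lib/firmware/amdgpu/ | grep -i polaris11    # polaris11_k_mc.bin doesn't show up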
So, I did an apt-get source firmware-amd-graphics, which corrects me and just downloads firmware-nonfree. There is a message to the effect that I should clone the repository at some debian.org address. Okay, I do that.
There is a README file near the top of this git repository which seems to say that I need to build a tarball first. The command is debian/bin/genorig.py
I run that, and it complains almost immediately that there is no debian_linux module to load.
Where is this module?
Looking around a bit, perhaps it is in a package called linux-tools? But Debian only has linux-tools up to stable. The package description in stable says it is a transitional thing to bring in linux-perf. linux-perf is versioned to a kernel. While the list of files for linux-perf-4.9 has Python files (I presume some are modules, in the sense of Python modules), none seem to be named debian_linux.py
How is a person supposed to build firmware-nonfree from source? Do I have to do this on the hardware that has this idle problem? (I can load up BOINC jobs to keep it from needing idle states on the CPU.)
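From poking around since, debian_linux looks to be a Python package that lives inside the Debian "linux" source package (under debian/lib/python), not in any binary .deb I can find. If that's right, something like this should let genorig.py run (a guess, not a recipe; the last argument is wherever the upstream linux-firmware checkout lives):
apt-get source linux
export PYTHONPATH=$PWD/linux-*/debian/lib/python
debian/bin/genorig.py ../linux-firmware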
aTdHvAaNnKcSe
Gord
I restarted 7 BOINC jobs with the kernel command line options that are supposed to help, and after about 3 days it found a long enough idle to freeze.
I restarted (without getting the kernel command line mods in), so I used that zen program to disable C6. The power went out before it crashed. I guess I forgot to tell the power company that I was going to do this test. :-)
I think I will wait for 4.19 before I start this again.
It still freezes in idle.
The 4.19 kernel is rumoured to have some chance of fixing this, but I've also seen articles saying similar things about earlier kernels. Wishful thinking? I've seen articles talking about overclocking as another way to avoid this freeze in idle, but I have never gotten into overclocking. Are there any good documents describing the various C-states and P-states and overclocking?
I guess there was no "real" problem with filesystems. The only thing that seems to have happened is that the "dirty bit" was set on the VFAT mounted at /boot/efi.
Fixing that "dirty bit", it boots now.
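For anyone else who hits this, the fix amounts to (dosfstools; the device name is from this machine):
fsck.vfat -a /dev/nvme0n1p1    # clears the dirty bit if the filesystem is otherwise clean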
However, that new BIOS doesn't fix the problem, at least not without kernel command line arguments. I again added
rcu_nocbs=0-11 processor.max_cstate=5 idle=nomwait
to the kernel command line. We'll see if that helps with the freezing in idle.
I had read about people having filesystem issues when these idle CPUs just go to sleep; I hadn't experienced one. Yesterday, this machine, with now only 3 BOINC jobs running, froze.
So, this morning I installed the latest (4.80) BIOS and went looking for this Power Supply Idle Control setting, to change it to "typical" (or common). When I went to reboot (from the "disks" in the system), the kernel panicked. As rEFInd is my bootloader on this machine, I put a rEFInd DVD in the machine, and that was able to boot to my root partition. I may have some work to do with fs repair. But the machine is busy doing almost nothing again, to see if this newer BIOS fixes the idle freeze problem.
How I am monitoring this is to ssh -X into that machine and run xsensors in the background. When the machine locks up, the xsensors display on this machine either dies or goes strange.
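In other words (the hostname is made up):
ssh -X ryzen1600 xsensors &
# the xsensors window dying or freezing is the tell that the remote machine locked up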
Greetings.
I have two computers now running Ryzen processors, a 1600X and a 1600. I don't believe the 1600X has this CPU hardware bug, but the 1600 does. Just running BOINC jobs, I can have the system lock up even with only 6 active threads (so 6 of 12 CPU threads).
I have a new enough BIOS that I can set the Power Supply Idle Control (or however it is described) to something like "typical". There are 2 newer BIOSes available that I haven't installed (I have the 4.6 BIOS installed, I think, as it has NVMe support). ASRock motherboard in this thing.
I booted a while ago with idle=nomwait, and after a couple of days it locked up (6 BOINC jobs). Before and after that, I had booted with rcu_nocbs=0-11. For Devuan kernels, I believe that option is ignored because some CONFIG_RCU thing (CONFIG_RCU_NOCB_CPU, if I have the name right) isn't set. Debian has a bug thread, where the last entry is I believe February of 2018, asking for kernels to be compiled with that config variable set, to which there has been no response.
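Easy enough to check against the running kernel:
grep RCU_NOCB /boot/config-$(uname -r)
# "# CONFIG_RCU_NOCB_CPU is not set" would mean rcu_nocbs= is a no-op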
I just rebooted with
rcu_nocbs=0-11 processor.max_cstate=5 idle=nomwait
and reduced the BOINC jobs to 3.
All this computer is doing is BOINC jobs, so nothing important. If people at Devuan are interested in this bug and want me to try things, I can do that. I believe it is currently running the 4.18.0-2 kernel package (which is the most current). There are rumours that the 4.19 kernel may have some fixes for this bug. There were rumours in the March to May timeframe of 2018 that AMD had released BIOS updates which cover this.
I did install the zenstates.py program from github, but so far I have only used it to list things.
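For reference, the listing and the C6 knob (this is the r4m0n ZenStates-Linux script; it needs root and the msr module, and the flag names are from my copy, so double-check against --help):
modprobe msr
python zenstates.py --list
python zenstates.py --c6-disable    # C6 being the suspect for the idle freeze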
And then the install script brings up the next page, where it lets me choose where to install grub (the flagged-bootable single partition on the NVMe), which also happens to be the first one listed.
I take the install media out as instructed and reboot; at the UEFI boot screen I choose to legacy boot the NVMe, and up comes Devuan ASCII. Check what is mounted, and it is just what I installed.
Reboot, go into the UEFI/BIOS, and it still doesn't give me the option of UEFI booting the NVMe SSD.
So, I guess I boot some recovery thing, repartition the NVMe with GPT, set things up about the way they were before, and see what happens then.
Okay, booted into the graphical-install Devuan ASCII DVD, blew away the GPT partitioning of the NVMe SSD, partitioned a single 20GB MBR partition, and installed Devuan into it. I am almost finished, just grub to install.
It asks me: "I've noticed that you have 2 other OSes installed on this computer (on the SATA SSD which is /dev/sda). Do you want me to write to the MBR of the _FIRST_ disk?"
What, pray tell, is the first disk? /dev/nvme0n1p1 sorts before /dev/sda*. The SATA SSD is partitioned with a GPT partition scheme, which includes a protective MBR (I believe).
It would be nicer if the query defined what it thinks the first disk is.
Click through the question, and get a message that I have to mount those things manually anyway. :-)
Go to adjust partitioning of the hard disk with fdisk. There were 2 partitions of about 30G each on a 1TB disk (well, 931+GB). Delete both of those. Set up a 32G (primary), a 10G (primary) and two 8G (1 primary, 1 extended) partitions. That should total 58G. Go to set aside most of what is left for a /home, and fdisk tells me that all the space on the device has been allocated.
Probably me not remembering how to set up MBR with fdisk: with three primaries plus the extended, all four MBR slots are used, and anything further has to be a logical partition inside that 8G extended, hence "all space allocated". So I used parted, and managed to get an unaligned partition the first time through.
The UEFI/BIOS still gives me no choice of a UEFI boot after having installed grub-efi-amd64. But, with the hard disk partitioned, I can now save my tarball, and then try to make a bootable legacy Linux on the NVMe SSD.
I booted the ASCII DVD in a UEFI manner to graphical/rescue. Console F1 has a message about fixing fbdev (xf86gamma).
The system asks me for the root partition, so I point it at nvme0n1p3. It asks about a boot partition, so I point at nvme0n1p2. It asks about the EFI partition, so I point at nvme0n1p1 (for /boot/efi).
It then asks about starting a shell in the root partition. But so far no questions about the /usr and /var partitions, so I am going to mount those manually from console F2 before answering the question.
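Roughly this, from console F2 (assuming the rescue mounts the chosen root at /target, as the Debian installer usually does):
mount /dev/nvme0n1p4 /target/usr
mount /dev/nvme0n1p5 /target/var
mount /dev/nvme0n1p6 /target/opt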
The cli installer notices it is in a UEFI environment and directs me to refractainstaller-uefi. The bigger screen resolution is noticeable. Running refractainstaller-uefi, it appears not to be cli.
The disk is partitioned as
nvme0n1p1 EFI
nvme0n1p2 /boot
nvme0n1p3 /
nvme0n1p4 /usr
nvme0n1p5 /var
nvme0n1p6 /opt
There is swap space on a hard disk. Eventually there will be /tmp, /var/log, /usr/local and /home on the hard disk.
So, I set or unset options to work with what I have already done.
The next step wants me to choose one of the two EFI partitions in the system. I try to pick nvme0n1p1, but both partitions seem to be chosen.
Next it wants me to choose a root partition (nvme0n1p3), but it never displays any partitions to choose from.
And that's where I have stopped for now. I am going to guess that there is a bug in the code for choosing which EFI partition to use (eventually I will remove the SATA SSD which has the other EFI partition), and a bug for choosing which partition to use for the root partition.
The kernel for this installer DVD is 3.16.0.
I saw email from the Devuan project a day or so ago informing me that ASCII (2.0) had been released. I have a copy from bittorrent. I tried to see whether it had NVMe and/or UEFI support, but was not successful in tracking this down. If that installer is better to try, I can do that.
At the moment, I think I will manually mount all the partitions at /mnt (or below /mnt), mount another partition of the hard disk somewhere else, and then make a tarball on the hard disk of what I have there.
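Something like this (device names are this machine's; the hard disk partition is a placeholder):
mount /dev/nvme0n1p3 /mnt
mount /dev/nvme0n1p2 /mnt/boot
mount /dev/nvme0n1p4 /mnt/usr
mount /dev/nvme0n1p5 /mnt/var
mkdir -p /backup && mount /dev/sda5 /backup
tar -C /mnt -cpzf /backup/nvme-install.tar.gz .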
I had done an install to a Ryzen 1600X with an NVMe SSD quite a while ago with this 3.16-based disk. Or I thought I had. I didn't think I had 2 refracta disks, but somehow did I pick the old one for this latest install?
I have the amd64 installer DVD being burned on another machine.
No, I haven't booted any Linux from the NVMe drive. I can change the order drives are searched in; I had tried moving the NVMe up to first. There is lots of storage on the LAN, so I could back that device up to a disk somewhere and install an MBR-based Linux. I would hope that an image-based backup of /dev/nvme0n1 would be sufficient.
Oh well, it's not raining and my coffeebreak is over. Time to get back to work.
I thought I had finished installing Devuan (Jessie/ASCII) to the NVMe (GPT partitioned). But my BIOS/UEFI only seems to recognize it as a legacy device, not as a UEFI device. I had installed rEFInd as a boot manager (largely to learn more about this other thing). I suppose I could install grub-efi and see if that makes a difference. How much analyzing does the UEFI do to determine what boot devices are present?
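Two checks that might narrow it down (standard tools):
efibootmgr -v            # what boot entries the firmware has; only works if this boot was UEFI
gdisk -l /dev/nvme0n1    # confirm the ESP's partition type GUID and that it is FAT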
I have a friend in SE Manitoba. He keeps telling my to invite bambi to dinner. I think you two went to the same school. :-)
The last number I have (which is a few years old) is that there are 50 deer within a radius of about 1 mile. I don't see them in the summer (too warm, I think), but in the winter I have close to 10 moose in the vicinity.