I’m running Legato on a Yocto 2.2.1 linux build (a generic x86 machine using kernel 4.8.12, with support for Smack and cgroup compiled into the kernel). It runs for about fifteen seconds and then crashes the kernel, forcing the machine to reboot.
I get three of these “timed out” messages when the framework starts up.
Legato: -WRN- | UNKNOWN[337]/framework T=main | LE_FILENAME DAEMON_DAEMONIZE() 71 | Timed out out after waiting 5000 ms for indication from child.
Then several seconds later I get this, followed by an immediate reboot:
Legato: CRT | UNKNOWN[338]/framework T=main | LE_FILENAME TryToRun() 1050 | Supervisor was killed by a signal 14.
There are no core files or log files left behind that I can find. The only way I got these log lines was by capturing it on video. True story.
Also, during those several seconds between the timeouts and the killed supervisor, I have tried running Legato’s log utility and other tools, but none of them work at all. The command will just hang until the system reboot.
I haven’t tried to integrate the Legato build into the Yocto build at all. I’m using Yocto to create the OS and the cross-compiling toolchain. Then I use the toolchain to build Legato (that took a while to figure out, btw.)
Is there any reason Legato would be incompatible with the Linux 4.8 kernel?
I haven’t read much of the source code, but much of what I’ve seen looks like very reasonable wrappers around standard OS calls. It’s mostly the cgroup, Smack, and IPC stuff that is unusual. I haven’t seen any warnings in the documentation about incompatibilities with more recent kernels, or with the x86 32-bit architecture.
I’ve seen the same issue while trying to use 4.8, and I’m not exactly sure where it’s coming from.
By setting: PREFERRED_VERSION_linux-yocto = "4.4%"
it seems to work fine, so somehow there is some incompatibility with 4.8.
I’ve created an internal issue but this is unlikely to be handled as high-priority as so far we don’t have any products running with a kernel as recent as 4.8.
If you happen to find out the issue don’t hesitate to let us know or upstream your solution!
I’ve learned that the kernel isn’t crashing. Rather, the startup program is rebooting the system on purpose because the startup process is taking too long, and the supervisor died from a SIGALRT.
Everything seems to launch fine until the sdir program is run, which is supposed to have something to do with the IPC bindings. I’m digging into that source today.
I would guess it might be linked to some change in behavior regarding smack in more recent kernel revisions, and there is something that we are not doing on the Legato side.
Did you make some progress on your side by any chance?
Actually that reminds me of some IPC issue we had on the 3.14 after some backporting of smack patches, it was solved by the following patch:
From ab5094eade4b424517b1f93adcaaa4d60bc6bfa4 Mon Sep 17 00:00:00 2001
From: Dragan Marinkovic <dmarinkovi@sierrawireless.com>
Date: Mon, 20 Mar 2017 20:08:23 -0700
Subject: [PATCH] Smack: Unable to Communicate through IPC
Before "Rework file hooks" patch was introduced, LSM hooks
were working off of file pointers. Once this patch was
introduced, most of the file hooks are working off of
security data from inodes.
It looks like that this change broke Legato IPC, and we need
some of that good, old stuff back.
Resolves: Jira:LE-6112
Signed-off-by: Dragan Marinkovic <dmarinkovi@sierrawireless.com>
Change-Id: I03d604cbb37210f425d0bcaa1125121c3bd28770
---
security/smack/smack_lsm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 88bc381..7953baf 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1892,7 +1892,7 @@ static int smack_file_receive(struct file *file)
if (file->f_mode & FMODE_WRITE)
may |= MAY_WRITE;
- rc = smk_curacc(smk_of_inode(inode), may, &ad);
+ rc = smk_curacc(file->f_security, may, &ad);
rc = smk_bu_file(file, may, rc);
return rc;
}
--
2.1.4
This definitely does look a bit like the SMACK issue that Dragan was dealing with before, but there are differences. The problem Dragan saw only occurred with unprivileged processes, but in this case ‘sdir load’ runs as root with the same SMACK label as the logCtrlDaemon, so there should not be any SMACK problems here.
DirectorySocketReadable() logs the “Connection has closed.” message when unixSocket_ReceiveMsg() returns LE_CLOSED. Unfortunately, the unixSocket_ReceiveMsg() returns LE_CLOSED in two different cases, and there’s no log message indicating which case it is hitting.
@cholmes, can you instrument the framework code a bit more and let us know what you see?
…It could just be that the ancillary data buffer (cmsgBuffer) is not big enough in unixSocket_ReceiveMsg(). Maybe something changed in the unix domain sockets implementation that requires more space for ancillary data in some cases?
I added a warning log entry to differentiate between bytesReceived==0 vs ECONNRESET. Both the sdir and logCtrlDaemon processes are reporting that bytesReceived == 0:
May 4 23:56:22 genericx86 Legato: -WRN- | sdir[354]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0
May 4 23:56:22 genericx86 Legato: -WRN- | logCtrlDaemon[347]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0
Could you please check the content of the msg_flags field in the msgHeader structure when bytesReceived == 0?
The man page for recvmsg() tells you what flags can be set in the msg_flags field.
If you see MSG_CTRUNC set, then it means the cmsgBuffer is not big enough. Try adding a bunch of space to it and see if that makes the problem go away. I have a feeling that the SMACK implementation may be adding a bunch of extra stuff to the credentials that are being delivered in the ancillary data message and this is causing the truncation of the message.
// If we didn't receive any ancillary data, and recvmsg() still returned zero,
// then the socket must have closed.
else if (bytesReceived == 0)
{
LE_WARN("recvmsg() failed with bytesReceived == 0");
LE_WARN_IF(msgHeader.msg_flags & MSG_CTRUNC, "recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=%d", CMSG_BUFF_SIZE);
LE_WARN_IF(msgHeader.msg_flags & MSG_DONTROUTE, "recvmsg() failed with MSG_DONTROUTE");
LE_WARN_IF(msgHeader.msg_flags & MSG_EOR, "recvmsg() failed with MSG_EOR");
LE_WARN_IF(msgHeader.msg_flags & MSG_OOB, "recvmsg() failed with MSG_OOB");
LE_WARN_IF(msgHeader.msg_flags & MSG_PEEK, "recvmsg() failed with MSG_PEEK");
LE_WARN_IF(msgHeader.msg_flags & MSG_TRUNC, "recvmsg() failed with MSG_TRUNC");
LE_WARN_IF(msgHeader.msg_flags & MSG_WAITALL, "recvmsg() failed with MSG_WAITALL");
return LE_CLOSED;
}
Here’s the result:
May 5 16:22:15 genericx86 Legato: -WRN- | sdir[356]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0
May 5 16:22:15 genericx86 Legato: -WRN- | logCtrlDaemon[349]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0
May 5 16:22:15 genericx86 Legato: -WRN- | logCtrlDaemon[349]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 654 | recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=10280
I notice that sdir is just getting zero bytes received. Only logCtrlDaemon sees the flag.
Hmm… Well, it certainly doesn’t look like it’s a buffer size problem.
I wonder if the Log Control Daemon is hitting its ulimit for the maximum number of file descriptors that it’s allowed to have open? Can you check the limit and try increasing it?
If that’s not it, what is the value of msgHeader.msg_controllen when recvmsg() returns with bytesReceived == 0?
root@genericx86:/# cat /proc/345/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 16106 16106 processes
Max open files 1024 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 16106 16106 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
We are allowed 1024 files. So how many files do we have open? 12.
// If we didn't receive any ancillary data, and recvmsg() still returned zero,
// then the socket must have closed.
else if (bytesReceived == 0)
{
LE_WARN("recvmsg() failed with bytesReceived == 0");
LE_WARN_IF(msgHeader.msg_flags & MSG_CTRUNC, "recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=%d and msg_controllen=%d", CMSG_BUFF_SIZE, (int)msgHeader.msg_controllen);
.... etc...
}
And the relevant log entry:
May 8 18:56:35 genericx86 Legato: -WRN- | logCtrlDaemon[355]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 654 | recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=51252 and msg_controllen=0
I’d be surprised if this is the cause, though, because both supervisor and sdir are running as root (with CAP_MAC_ADMIN and CAP_MAC_OVERRIDE privileges). There’s nothing in /legato/smack/onlycap right?