Legato on Generic Linux


#1

I’m running Legato on a Yocto 2.2.1 linux build (a generic x86 machine using kernel 4.8.12, with support for Smack and cgroup compiled into the kernel). It runs for about fifteen seconds and then crashes the kernel, forcing the machine to reboot.

I get three of these “timed out” messages when the framework starts up.

Legato: -WRN- | UNKNOWN[337]/framework T=main | LE_FILENAME DAEMON_DAEMONIZE() 71 | Timed out out after waiting 5000 ms for indication from child.

Then several seconds later I get this, followed by an immediate reboot:

Legato: CRT | UNKNOWN[338]/framework T=main | LE_FILENAME TryToRun() 1050 | Supervisor was killed by a signal 14.

There are no core files or log files left behind that I can find. The only way I got these log lines was by capturing it on video. True story.

Also, during those several seconds between the timeouts and the killed supervisor, I have tried running Legato’s log utility and other tools, but none of them work at all. The command will just hang until the system reboot.

Any clues as to where I should look?


Error while loading shared libraries
#2

Hey,

I think the problem is, that Legato is only compatible to Yocto 1.7.2.
http://legato.io/legato-docs/latest/yoctoLegatoOverview.html

But I would also be interested to have Legato together with a newer Yocto base. So keep this thread updated, if you have success :wink:


#3

I haven’t tried to integrate the Legato build into the Yocto build at all. I’m using Yocto to create the OS and the cross-compiling toolchain. Then I use the toolchain to build Legato (that took a while to figure out, btw.)

Is there any reason Legato would be incompatible with the Linux 4.8 kernel?

I haven’t read much of the source code, but much of what I’ve seen looks like very reasonable wrappers around standard OS calls. It’s mostly the cgroup, Smack, and IPC stuff that is unusual. I haven’t seen any warnings in the documentation about incompatibilities with more recent kernels, or with the x86 32-bit architecture.


#4

Hi cholmes,

I’ve seen the same issue while trying to use 4.8, and I’m not exactly sure where it’s coming from.
By setting:
PREFERRED_VERSION_linux-yocto = "4.4%"
it seems to work fine, so somehow there is some incompatibility with 4.8.

I’ve created an internal issue but this is unlikely to be handled as high-priority as so far we don’t have any products running with a kernel as recent as 4.8.

If you happen to find out the issue don’t hesitate to let us know or upstream your solution!


#5

I’ve learned that the kernel isn’t crashing. Rather, the startup program is rebooting the system on purpose because the startup process is taking too long, and the supervisor died from a SIGALRT.

Everything seems to launch fine until the sdir program is run, which is supposed to have something to do with the IPC bindings. I’m digging into that source today.


#6

I would guess it might be linked to some change in behavior regarding smack in more recent kernel revisions, and there is something that we are not doing on the Legato side.
Did you make some progress on your side by any chance?


#7

Actually that reminds me of some IPC issue we had on the 3.14 after some backporting of smack patches, it was solved by the following patch:

From ab5094eade4b424517b1f93adcaaa4d60bc6bfa4 Mon Sep 17 00:00:00 2001
From: Dragan Marinkovic <dmarinkovi@sierrawireless.com>
Date: Mon, 20 Mar 2017 20:08:23 -0700
Subject: [PATCH] Smack: Unable to Communicate through IPC

Before "Rework file hooks" patch was introduced, LSM hooks
were working off of file pointers. Once this patch was
introduced, most of the file hooks are working off of
security data from inodes.

It looks like that this change broke Legato IPC, and we need
some of that good, old stuff back.

Resolves: Jira:LE-6112
Signed-off-by: Dragan Marinkovic <dmarinkovi@sierrawireless.com>

Change-Id: I03d604cbb37210f425d0bcaa1125121c3bd28770
---
 security/smack/smack_lsm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 88bc381..7953baf 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1892,7 +1892,7 @@ static int smack_file_receive(struct file *file)
 	if (file->f_mode & FMODE_WRITE)
 		may |= MAY_WRITE;
 
-	rc = smk_curacc(smk_of_inode(inode), may, &ad);
+	rc = smk_curacc(file->f_security, may, &ad);
 	rc = smk_bu_file(file, may, rc);
 	return rc;
 }
-- 
2.1.4

It seems that on 4.9.25 the code is https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/security/smack/smack_lsm.c?id=refs/tags/v4.9.25#n1919 :

rc = smk_curacc(smk_of_inode(inode), may, &ad);
rc = smk_bu_file(file, may, rc);

So Dragan’s patch seems to apply and might solve the issue here as well.


#8

The startup process nearly completes, until the sdir program is launched. Then this happens:

Apr 26 17:01:42 genericx86 Legato:  DBUG | serviceDirectory[349]/serviceDirectory_exe T=main | serviceDirectory.c ClientConnectHandler() 1430 | Client connected:  pid = 358;  uid = 0;  gid = 0. 
Apr 26 17:01:42 genericx86 Legato:  DBUG | serviceDirectory[349]/serviceDirectory_exe T=main | serviceDirectory.c ProcessOpenRequestFromClient() 1160 | Processing OPEN request from client pid 358 <root> for service 'LogClient' (LogControlProtocol). 
Apr 26 17:01:42 genericx86 Legato:  DBUG | serviceDirectory[349]/serviceDirectory_exe T=main | serviceDirectory.c FollowBinding() 797 | FOLLOWING BINDING <root>.LogClient -> <root>.LogClient 
Apr 26 17:01:42 genericx86 Legato:  DBUG | serviceDirectory[349]/framework T=main | LE_FILENAME unixSocket_SendMsg() 430 | Sending fd 21. 
Apr 26 17:01:42 genericx86 Legato:  DBUG | serviceDirectory[349]/serviceDirectory_exe T=main | serviceDirectory.c DispatchToServer() 759 | Client (uid 0 'root', pid 358) connected to server (uid 0 'root', pid 351) for service 'LogClient' (protocol ID = 'LogControlProtocol'). 
Apr 26 17:01:42 genericx86 Legato:  DBUG | logCtrlDaemon[351]/framework T=main | LE_FILENAME DirectorySocketReadable() 550 | Connection has closed. 

Two points to note:

  • Other processes were able to connect tot he logCtrlDameon without problem. configTree, updateDaemon, and watchdog all were able to connect.
  • This error repeats about 50x per second until the start program reboots the machine.

#9

This definitely does look a bit like the SMACK issue that Dragan was dealing with before, but there are differences. The problem Dragan saw only occurred with unprivileged processes, but in this case ‘sdir load’ runs as root with the same SMACK label as the logCtrlDaemon, so there should not be any SMACK problems here.

DirectorySocketReadable() logs the “Connection has closed.” message when unixSocket_ReceiveMsg() returns LE_CLOSED. Unfortunately, the unixSocket_ReceiveMsg() returns LE_CLOSED in two different cases, and there’s no log message indicating which case it is hitting.

@cholmes, can you instrument the framework code a bit more and let us know what you see?

…It could just be that the ancillary data buffer (cmsgBuffer) is not big enough in unixSocket_ReceiveMsg(). Maybe something changed in the unix domain sockets implementation that requires more space for ancillary data in some cases?


#10

I added a warning log entry to differentiate between bytesReceived==0 vs ECONNRESET. Both the sdir and logCtrlDaemon processes are reporting that bytesReceived == 0:

May  4 23:56:22 genericx86 Legato: -WRN- | sdir[354]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0 
May  4 23:56:22 genericx86 Legato: -WRN- | logCtrlDaemon[347]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0

#11

Thanks, @cholmes.

Could you please check the content of the msg_flags field in the msgHeader structure when bytesReceived == 0?

The man page for recvmsg() tells you what flags can be set in the msg_flags field.

If you see MSG_CTRUNC set, then it means the cmsgBuffer is not big enough. Try adding a bunch of space to it and see if that makes the problem go away. I have a feeling that the SMACK implementation may be adding a bunch of extra stuff to the credentials that are being delivered in the ancillary data message and this is causing the truncation of the message.


#12

In unixSocket.c I extended the control buffer.

#define CMSG_BUFF_SIZE (CMSG_SPACE(sizeof(int)) + CMSG_SPACE(sizeof(struct ucred))+10240)

And I instrumented unixSocket_ReceiveMsg:

// If we didn't receive any ancillary data, and recvmsg() still returned zero,
// then the socket must have closed.
else if (bytesReceived == 0)
{
	LE_WARN("recvmsg() failed with bytesReceived == 0");
	LE_WARN_IF(msgHeader.msg_flags & MSG_CTRUNC, "recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=%d", CMSG_BUFF_SIZE);
	LE_WARN_IF(msgHeader.msg_flags & MSG_DONTROUTE, "recvmsg() failed with MSG_DONTROUTE");
	LE_WARN_IF(msgHeader.msg_flags & MSG_EOR, "recvmsg() failed with MSG_EOR");
	LE_WARN_IF(msgHeader.msg_flags & MSG_OOB, "recvmsg() failed with MSG_OOB");
	LE_WARN_IF(msgHeader.msg_flags & MSG_PEEK, "recvmsg() failed with MSG_PEEK");
	LE_WARN_IF(msgHeader.msg_flags & MSG_TRUNC, "recvmsg() failed with MSG_TRUNC");
	LE_WARN_IF(msgHeader.msg_flags & MSG_WAITALL, "recvmsg() failed with MSG_WAITALL");
    return LE_CLOSED;
}

Here’s the result:

May  5 16:22:15 genericx86 Legato: -WRN- | sdir[356]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0 
May  5 16:22:15 genericx86 Legato: -WRN- | logCtrlDaemon[349]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 653 | recvmsg() failed with bytesReceived == 0 
May  5 16:22:15 genericx86 Legato: -WRN- | logCtrlDaemon[349]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 654 | recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=10280

I notice that sdir is just getting zero bytes received. Only logCtrlDaemon sees the flag.


#13

I tried increasing the control buffer to 50k. No change, other than the larger buffer size being reported in the log file.

#define CMSG_BUFF_SIZE (CMSG_SPACE(sizeof(int)) + CMSG_SPACE(sizeof(struct ucred))+CMSG_SPACE(50*1024))


#14

Hmm… Well, it certainly doesn’t look like it’s a buffer size problem. :slight_smile:

I wonder if the Log Control Daemon is hitting its ulimit for the maximum number of file descriptors that it’s allowed to have open? Can you check the limit and try increasing it?

If that’s not it, what is the value of msgHeader.msg_controllen when recvmsg() returns with bytesReceived == 0?


#15

So let’s start with an excerpt from ps -eH just to get our process numbers right:

      root@genericx86:/# ps -eH
      338 ?        00:00:00   startSystem
      341 ?        00:00:00     supervisor
      343 ?        00:00:09       serviceDirector
      345 ?        00:00:02       logCtrlDaemon
      347 ?        00:00:00       configTree
      349 ?        00:00:00       updateDaemon
      351 ?        00:00:00       watchdog
      352 ?        00:00:07       sdir

Log Control daemon is pid 345. Check the limits:

root@genericx86:/# cat /proc/345/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             16106                16106                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       16106                16106                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        

We are allowed 1024 files. So how many files do we have open? 12.

root@genericx86:/# ls /proc/345/fd
0  1  10  11  2  3  4  5  6  7  8  9

I’ll instrument recvmsg() more thoroughly and get back to you.


#16

Here is the same process list, with security label:

root@genericx86:~# ps -eZH
LABEL                             PID TTY          TIME CMD
_                                 341 ?        00:00:00   startSystem
_                                 344 ?        00:00:00     supervisor
framework                         346 ?        00:00:01       serviceDirector
framework                         348 ?        00:00:00       logCtrlDaemon
framework                         350 ?        00:00:00       configTree
framework                         352 ?        00:00:00       updateDaemon
framework                         354 ?        00:00:00       watchdog
_                                 355 ?        00:00:00       sdir

#17

Interesting. msg_controllen is 0.

Source:

// If we didn't receive any ancillary data, and recvmsg() still returned zero,
// then the socket must have closed.
else if (bytesReceived == 0)
{
	LE_WARN("recvmsg() failed with bytesReceived == 0");
	LE_WARN_IF(msgHeader.msg_flags & MSG_CTRUNC, "recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=%d and msg_controllen=%d", CMSG_BUFF_SIZE, (int)msgHeader.msg_controllen);
    .... etc...
}

And the relevant log entry:

May 8 18:56:35 genericx86 Legato: -WRN- | logCtrlDaemon[355]/framework T=main | LE_FILENAME unixSocket_ReceiveMsg() 654 | recvmsg() failed with MSG_CTRUNC with CMSG_BUFF_SIZE=51252 and msg_controllen=0


#18

One thing that doesn’t look right in this listing is that the supervisor (and sdir) should be running with the “framework” label too (not “_”).


#19

I’d be surprised if this is the cause, though, because both supervisor and sdir are running as root (with CAP_MAC_ADMIN and CAP_MAC_OVERRIDE privileges). There’s nothing in /legato/smack/onlycap right?


#20

Jen, if you want to reproduce the issue I provided a docker image.
You just need docker (cf https://store.docker.com/editions/community/docker-ce-server-ubuntu?tab=description) and then:

docker run --rm -ti --name testvirt quay.io/legato/virt-x86:2.2.1-alpha

This runs a Yocto 2.2.1 based x86 image in QEmu with a 4.8.12 kernel.

Because of the issue it reboots few seconds after the boot.
As to prevent that you can void the start-up script:

echo > /etc/init.d/startlegato.sh

As for the toolchain, you can use http://downloads.sierrawireless.com/legato/virt/poky-swi-glibc-x86_64-meta-toolchain-swi-i586-toolchain-swi-2.2.1.sh like this:

export VIRT_TOOLCHAIN_DIR=/opt/swi/y22
export VIRT_TARGET_ARCH=x86
export LEGATO_SYSROOT=/opt/swi/y22/sysroots/i586-poky-linux
make virt

The target also has an SSH server so you can use regular scripts like update or app install.
To get the IP:

docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' testvirt 

To stop the target just do:

docker kill testvirt

QEMU support for device emulation