Finding the cause of random segfaults

grump
Party member
Posts: 583
Joined: Sat Jul 22, 2017 7:43 pm

Post by grump » Thu Jan 03, 2019 6:36 pm

I'm getting random segfaults in a 10K+ LOC project and I can't pin them on any single thing I'm doing - even letting it sit idle for a while makes it crash.

The only observation so far is that the crash occurs faster the more (temporary) ImageData / Canvas objects I'm creating. Could it be a memory leak? Memory usage is < 100 MB when it crashes, love.graphics.getStats() says ~6 MB texture memory is being used. It's not doing any object creation when idle.

I tried running the process in gdb to get a trace, but that's not really helpful:

Code:

Thread 1 "love" received signal SIGSEGV, Segmentation fault.
0x00007ffff74d43fe in ?? () from /usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2
(gdb) trace
Tracepoint 1 at 0x7ffff74d43fe
(gdb) 

I know nothing about gdb. Is there some magic way to retrieve a useful stack trace? Or even better, how can I find the exact line in my code where it crashes (other than putting debug prints all over 10K LOC)?
Last edited by grump on Thu Jan 03, 2019 8:30 pm, edited 1 time in total.

pgimeno
Party member
Posts: 1746
Joined: Sun Oct 18, 2015 2:58 pm

Re: Finding the cause of random segfaults

Post by pgimeno » Thu Jan 03, 2019 8:05 pm

That sucks. Yeah, gdb backtraces of LuaJIT are rarely helpful, if ever.

I doubt it's a leak. Are you using FFI? Maybe try with jit.off(); it will probably not segfault, but I'm not sure what that will tell you, if anything.
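
A minimal sketch of that experiment, assuming the call is placed at the very top of main.lua so it runs before anything else gets compiled. The `jit` guard is only there so the snippet also loads under plain Lua, where the `jit` table does not exist.

```lua
-- Turn the LuaJIT compiler off for the whole program; everything then runs
-- in the interpreter only. Placement at the top of main.lua is an assumption.
if jit then
  jit.off()
  -- jit.status() returns the compiler state first (false once it is off),
  -- followed by the active CPU/optimization flags.
  print(jit.status())
end
```

If the crash disappears or turns into an ordinary Lua error, that at least narrows the search to code paths the compiler was treating differently.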

grump

Re: Finding the cause of random segfaults

Post by grump » Thu Jan 03, 2019 8:32 pm

Yeah, I'm using ffi, but only sparingly, in code that has proven to be stable in a number of projects. I'm also using external libraries, nuklear (self-built) and FreeType (what's installed on the system).

I suspect there's a bug in LÖVE that causes this. In other projects, at least 1 in 50 launches of LÖVE fails with a segfault. It happens with even the simplest apps.

The crashes in this project started to occur when I changed something in the code that made the number of temporary Canvas objects go up from 8 to ~200 - no huge memory impact though, the canvases are tiny.

pgimeno wrote:
Thu Jan 03, 2019 8:05 pm
Maybe try with jit.off(); it will probably not segfault, but I'm not sure what that will tell you, if anything.

That was helpful. With the JIT off, a bug in my code became reproducible that did not show up with it on: a call to math.max where one argument is nil. I'll report back with my findings.

pgimeno

Re: Finding the cause of random segfaults

Post by pgimeno » Thu Jan 03, 2019 8:45 pm

Bugs with FFI can be subtle. In https://github.com/gvx/bitser/issues/9# ... -436060611 it was caused by a tail call: a local variable stopped being referenced once the tail call was made, while a pointer to it was still live. I remember another bug caused by storing a 64-bit pointer in a double, which only has 52 mantissa bits.
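
To illustrate the precision limit behind that second kind of bug (a sketch in plain Lua with made-up values, not the code from that report): Lua numbers are doubles, which represent integers exactly only up to 2^53, so a larger pointer-sized value can silently lose its low bits. Whether real addresses ever get that large depends on the platform.

```lua
-- Doubles carry 53 bits of integer precision (52 stored mantissa bits plus an
-- implicit leading one), so above 2^53 neighbouring integers collapse together.
local limit = 2^53
local collapsed = (limit + 1 == limit)   -- the +1 is rounded away

-- A made-up 64-bit-sized value with its lowest bit set loses that bit.
local fake_addr = 2^60 + 1
local lost_low_bit = (fake_addr == 2^60)

print(collapsed, lost_low_bit)
```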

grump

Re: Finding the cause of random segfaults

Post by grump » Fri Jan 04, 2019 6:19 am

It's memory corruption. In one part of the code I load a font with FreeType. In another part I process the font glyphs, where the font data has suddenly, unpredictably changed, and FreeType tells me there are now fewer glyphs than when the font was loaded. It's not even a corruption of LuaJIT data, it's memory allocated by FreeType - but what I'm doing in LÖVE does affect the outcome. This is gonna be so much fun.

grump

Garbage collection going wrong

Post by grump » Tue Jan 08, 2019 2:41 pm

I shouldn't jump to conclusions like this. It's not memory corruption, it's just garbage collection going wrong.

So I'm looking at these FreeType ffi bindings and they look super fishy to me:

Code:

function lib.open_face(library, args, face_index)
	local face = ffi.new'FT_Face[1]'
	checknz(C.FT_Open_Face(library, args, face_index or 0, face))
	return face[0]
end

I'm not 100% sure, but I think this is the cause of my problem. I think face, along with the one element it contains, will get garbage collected because nothing retains a reference to face. Is that correct?

pgimeno

Re: Finding the cause of random segfaults

Post by pgimeno » Wed Jan 09, 2019 2:03 am

Looks like you're correct. Indeed, if the reference isn't kept, it will vanish on GC. I've been there in a slightly different way: I subtracted 1 from the pointer I got from ffi.new() (to simulate a 1-based array like in Lua) and thought that keeping a reference to the result was enough. It turns out it isn't: I have to keep the original reference too, or use malloc and set up GC myself, which is what I ended up doing. Maybe malloc/free solves your problem as well.

And don't worry, it's easy to draw wrong conclusions from the symptoms when there are so many variables in play.
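
A minimal sketch of the first option described above (keeping the original reference alongside the shifted pointer), not the malloc-based route. The helper name `new_one_based` and the fields `base` and `view` are hypothetical, chosen only for this illustration.

```lua
local ffi = require('ffi')

-- Hypothetical helper: returns a 1-based view of a double buffer together
-- with the original allocation, so the caller keeps both alive.
local function new_one_based(n)
  local base = ffi.new('double[?]', n) -- this object owns the memory
  local view = base - 1                -- plain pointer; owns nothing
  return { base = base, view = view }
end

local arr = new_one_based(10)
for i = 1, 10 do
  arr.view[i] = i * 1.5                -- view[1] is base[0], view[10] is base[9]
end

collectgarbage() -- safe: arr.base is still referenced through the table
print(arr.view[10])
```

As long as the table holding `base` is reachable, the buffer stays allocated. Dropping `base` and keeping only `view` recreates the dangling pointer.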

pgimeno

Re: Finding the cause of random segfaults

Post by pgimeno » Fri Jan 11, 2019 11:42 am


I've made a quick proof of concept:

Code:

local ffi = require'ffi'

ffi.cdef[[

  struct tagRec {
    int a;
    int b;
  };

  typedef struct tagRec R;

  typedef struct tagRec *pR;

]]

x = ffi.new('R')
x.a = 2
x.b = 5
print(x.a) -- works fine
print(x.b) -- works fine

y = ffi.new('pR[1]')
y = y[0]
collectgarbage() -- now y is in unallocated memory

z = ffi.new('pR[1]')
z = z[0]
z.a = 1

This code segfaults for me at the z.a assignment.

They may not be aware of this. Are you going to report it? I'd be happy to if you want.


EDIT: Oops, I was doing something stupid there. Also, the FFI documentation suggests exactly this method for doing the & operation: http://luajit.org/ext_ffi_tutorial.html (under 'Translating C idioms'). But it still looks fishy to me.

grump

Re: Finding the cause of random segfaults

Post by grump » Sat Jan 12, 2019 11:36 am

Thanks, pgimeno. I have not reported it yet, still investigating. I've not been able to build a minimal test case from my code that reliably reproduces the issue. Go ahead and report it if you want to.

Edit: Here's a minimal example that reproduces the issue reliably on my system. It only works on Linux with FreeType installed. If it breaks, an assertion will fail. If it quits silently, no error has occurred.

Still looking for a way to make it fail without newImageData, but haven't been able to yet.
Attachments
ft_crash.love
(45.91 KiB)

pgimeno

Re: Finding the cause of random segfaults

Post by pgimeno » Sat Jan 12, 2019 3:21 pm

After more thought, I think that code is probably correct. Here's my take.

FT_Face is a pointer type. Evaluating 'face[0]' creates a new CDATA object (also a pointer) and copies into it the pointer value stored in the array. At that point the array is still a valid object (at least until the assignment finishes) and holds whatever FT_Open_Face set it to.

The catch is that a CDATA pointer does not need extra storage beyond itself. Since the new object is a CDATA pointer object, has a valid value, and is referenced, it won't vanish. If the container object used for calling FT_Open_Face vanishes, that won't have consequences because the value is already safe within the new CDATA object.

That's different from what I was doing that caused a crash:

Code:

  ptr = ffi.new('double[?]', 10) -- This creates *both* a buffer and a CDATA pointer object.
                                 -- The buffer is NOT contained in the CDATA object, but set up to be
                                 -- freed on GC of the pointer object.
  ptr = ptr - 1 -- This creates a new CDATA object: a pointer with the value of the previous one minus 1.
                -- The original reference is lost, therefore the buffer will be GC'd at some point,
                -- but the value of this pointer will be preserved, so it will point to invalid memory.

In the FT case, the buffer is allocated by the FT library and is therefore not GC'd by FFI.
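
The same pattern can be sketched without FreeType, assuming the C runtime's malloc stands in for the library's own allocator. The function `open_thing` is hypothetical and only mimics an out-parameter call such as FT_Open_Face.

```lua
local ffi = require('ffi')

ffi.cdef[[
  void *malloc(size_t size);
  void free(void *ptr);
]]

-- Stand-in for a C library that allocates an object and hands the pointer
-- back through an out-parameter.
local function open_thing()
  local out = ffi.new('int *[1]')  -- temporary one-element box for the result
  out[0] = ffi.cast('int *', ffi.C.malloc(ffi.sizeof('int')))
  return out[0]                    -- new pointer cdata holding a copy of the address
end

local p = open_thing()
p[0] = 42
collectgarbage()    -- may collect the temporary box, but not the malloc'ed int
local value = p[0]  -- still readable: FFI never owned that memory
print(value)
ffi.C.free(p)       -- memory from the C side must be released explicitly
```

Here the box created by ffi.new is the only thing the GC can reclaim. The pointed-to memory lives until it is freed by hand, which is the distinction drawn above.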

Edit: I've just seen your edit. Got the assertion error with LÖVE 0.10.2 and 0.9.1, but not with 11.2 or 0.9.2. Investigating.
