Roland_Yonaba
Inner party member
Posts: 1562
Joined: Tue Jun 21, 2011 6:08 pm
Contact:

Okay, first of all, the context:
I am currently working on a two-week project about raytracing. Initially, I wanted to explore the topic and try to get the basics used to render those beautiful images. I made a first attempt with local illumination (simple raytracing), then I came across this small global illumination path tracer from Kevin Beason, which was, by far, more realistic, and I wanted to give it a try. Anyway, I didn't want my port to be as compact, so I split the code into several files, did some design work and added a bit of object orientation, to have something more understandable.
Here's the link to the Github repository: smallpt-lua

For one thing, it works fine. I haven't pushed any examples of use yet, as I am still polishing some stuff in the background, but it works flawlessly.
Problem is, it runs really slow.

Possible explanations:
So far, I have to admit that global illumination rendering is a really expensive task. Actually, the program takes some inputs (scene, primitives, initial viewpoint), performs some quick camera setup, then processes each pixel to compose the output:

Here is the main loop skeleton:

Code: Select all

for y = 1, height do -- height
  for x = 1, width do -- width

    for sx = 0, aa-1 do -- for anti-aliasing
      for sy = 0, aa-1 do -- for anti-aliasing

        for samp = 1, spp do -- for sampling: we need at least 16~25 samples per pixel to have a clean output
          -- Process pixel x,y, with possible recursive calls (around 5, in general)
          -- evaluate a partial radiance for each sample, save it
        end
      end
    end
    -- accumulate all radiances to evaluate the final color for pixel x,y
  end
end
So, most of the operations involved in evaluating radiance are vector-based. There are A LOT of vector operations, actually. I am using my own vector class, and I am convinced that this is one of the bottlenecks...
But I can't see myself using a table-less structure here (in other words, multiple return values instead), as the code would no longer be readable or clean enough for me: functions passing vectors to each other would take lots of args, and the syntax would be a lot more verbose...
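For illustration, here is a minimal sketch of the trade-off in question (the function names vadd/vadd3/vdot3 are made up for the example, not taken from the repo):

```lua
-- Table-based vectors: readable, but each operation allocates a new table,
-- which the garbage collector then has to clean up.
local function vadd(a, b)
  return { a[1] + b[1], a[2] + b[2], a[3] + b[3] }
end

-- Table-less alternative: pass components as separate values and return
-- multiple values. No allocation per operation, but signatures grow verbose.
local function vadd3(ax, ay, az, bx, by, bz)
  return ax + bx, ay + by, az + bz
end

local function vdot3(ax, ay, az, bx, by, bz)
  return ax * bx + ay * by + az * bz
end

-- Usage: chaining means threading many values through every call.
local v = vadd({1, 2, 3}, {4, 5, 6})      -- one table allocated
local x, y, z = vadd3(1, 2, 3, 4, 5, 6)   -- no table allocated
local d = vdot3(x, y, z, 1, 0, 0)
```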

Then I started thinking of using love.thread as a possible way to gain speed.
Actually, I have a very basic understanding of the love.thread module and threads in general (and don't laugh at me, I can see you).

First of all, why can't we pass tables from one thread to another? (Well, that's more of a general question, I was just wondering why.)
Second, I know that multiple threads can be created with the love.thread module. And they can run in parallel. This is really nice because, in my case, I can make it so that each <x,y> pixel sampling would be processed in its own thread. Then I'll just have to pass the scene description, plus an <x,y> couple matching a pixel to be evaluated, to a thread, and it will return to the main thread the color for that pixel.

Let's say that I have to render a 200*100 image, which makes 20000 pixels to render.
Well, it doesn't mean I have to create 20000 threads, right?
And even if I could, they wouldn't all run at the same time?
So how many threads can I run at the same time, at most? This depends on the processor, right? (and stop laughing at me...)
What would be the pros/cons of doing this, though? Actually, what I've done so far with love.thread was loading resources in a separate thread, to preserve the framerate in the main thread. In this case, I will obviously have two or more threads running: the main one for basic display, and the others for raytracing operations... But does this necessarily mean a gain in terms of speed?

I have some others questions left, but they can wait, actually.
Sorry for the long post.

Mathias
Prole
Posts: 5
Joined: Thu Dec 13, 2012 12:48 pm

Creating a thread always takes some time, so it's not useful to create a thread for each pixel. Also, since your CPU probably has no more than 4-8 cores, it doesn't make sense to create many more than that.
The usual way is: you create a few threads and they ask the main thread for a work unit; in your case it makes sense to render one or a few lines per thread. When they are finished, they pass back the result and ask for a new work unit.
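Sketched in plain Lua, the work-unit idea looks like the code below. Note this is only a sketch of the dispatch pattern: there are no real threads here (the "workers" are ordinary functions), and the names next_unit/render_row are made up for the example.

```lua
local width, height = 8, 4

-- One work unit per scanline: a queue of row indices to render.
local queue = {}
for y = 1, height do queue[#queue + 1] = y end

-- The main thread hands out the next pending row, or nil when done.
local function next_unit()
  return table.remove(queue, 1)
end

-- Placeholder for the real per-pixel shading work.
local function render_row(y)
  local row = {}
  for x = 1, width do row[x] = x * y end
  return row
end

-- Each worker loops: ask for a unit, render it, hand back the result.
-- With love.thread, several of these would run in parallel.
local image = {}
local function worker()
  local y = next_unit()
  while y do
    image[y] = render_row(y)
    y = next_unit()
  end
end

for i = 1, 4 do worker() end
```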

Keep in mind, path tracing is pretty slow. So using Lua probably doesn't help either.

bartbes
Sex machine
Posts: 4944
Joined: Fri Aug 29, 2008 10:35 am
Location: The Netherlands
Contact:

Roland_Yonaba wrote:why can't we pass tables from one thread to another?

Because they're an extremely non-trivial data structure. Correctly copying tables is (practically) impossible. What do you do with loops, for instance?
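To make the "loops" problem concrete, here is a small sketch (not from any of the posted code) of a table that contains itself, and a cycle-aware deep copy. A naive recursive copy would never terminate on such a table, which is part of why moving tables across a thread boundary is so hard:

```lua
-- A table can contain itself: t.self.self.self... forever.
local t = { value = 42 }
t.self = t

-- A correct deep copy has to remember which tables it has already copied,
-- so it can reuse the copy instead of recursing forever.
local function deepcopy(orig, seen)
  if type(orig) ~= "table" then return orig end
  seen = seen or {}
  if seen[orig] then return seen[orig] end
  local copy = {}
  seen[orig] = copy
  for k, v in pairs(orig) do
    copy[deepcopy(k, seen)] = deepcopy(v, seen)
  end
  return copy
end

local c = deepcopy(t)
-- c.self == c: the loop structure is preserved in the copy. And this still
-- ignores metatables, userdata and functions, which a thread boundary also
-- has to deal with.
```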
Roland_Yonaba wrote:Let's say that I have to render a 200*100 image, which makes 20000 pixels to render.
Well, it doesn't mean I have to create 20000 threads, right?
Please no, even a GPU doesn't have that many threads. What you could do is have a bunch of worker threads, say 4, that each do a bunch of pixels; this means every thread is doing roughly 5000 pixels.
Roland_Yonaba wrote:And even if I could, they wouldn't all run at the same time?
No. But then again, you can almost never guarantee that; it will run mostly at the same time, though. As in, it will pretty much act like they were.
Also, OS threads aren't light in any way; you shouldn't be spawning tons of them, especially since you have to load a Lua state with the standard libraries in every one as well. You want to re-use threads as much as possible.
Roland_Yonaba wrote:But does this necessarily mean a gain in terms of speed?
No, threads never do. That is, they can make it faster, but they can't guarantee it.

Roland_Yonaba
Inner party member
Posts: 1562
Joined: Tue Jun 21, 2011 6:08 pm
Contact:

Mathias wrote:Keep in mind, path tracing is pretty slow. So using lua probably doesn't help either.
Totally. Well, I'm not looking for some lightning-fast stuff, but I am pretty sure something can be done to improve it in its current state, actually.
Mathias wrote:Also, since your CPU probably has no more than 4-8 cores, it doesn't make sense to create many more than that.
I agree with this. So this implies one thread per core (obviously). Well, this raises another question then: assuming I create, let's say, 4 threads, then I start running all of them (thread:start):

Code: Select all

-- main.lua
local threads = {}
for i = 1, 4 do
  -- create a worker (the exact newThread signature depends on the LÖVE version)
  threads[i] = love.thread.newThread("worker.lua")
  threads[i]:start()
end

-- etc etc
What's likely to happen if I run this on a dual-core machine, for instance?
Does this mean that, along with the main thread, only 1 additional thread will be processed at a time?
Like this:
...
EDIT: Well, partially answered by bartbes' previous post. But I'll still leave it open.

Inny
Party member
Posts: 652
Joined: Fri Jan 30, 2009 3:41 am
Location: New York

Now, as for the questions at hand

1. Love implements threads with two separate Lua states, and communicates between them via the Actor Model. As tables are complex structures, Love doesn't make any attempt to move them between Lua states, because it would certainly do them wrong for somebody. You'll have to work around it via serialization and deserialization, or have the values "stream" between the two states. Here's an idea for a protocol which mimics semaphores:

Worker performs work
Worker begins sending back results in packets (named X1...Xn)
Main thread waits on value named X1
Main continues reading values until some Xm, where 1 < m < n, comes back nil
Main waits on Xm until Worker fills it
Main continues pulling values between Xm through Xn until complete

This doesn't work for complex types, but for array types it'd probably work. It'd probably need some tweaking to get right.
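As a minimal sketch of the serialization workaround for array types (the function names are made up; this only handles flat arrays of numbers, which is what a row of color values would be):

```lua
-- Flatten a flat array of numbers into a string that can cross the
-- thread boundary, and rebuild it on the other side.
local function serialize(arr)
  return table.concat(arr, ",")
end

local function deserialize(str)
  local arr = {}
  for num in str:gmatch("[^,]+") do
    arr[#arr + 1] = tonumber(num)
  end
  return arr
end

local packet = serialize({ 0.5, 0.25, 1 })  -- what the worker would send
local colors = deserialize(packet)          -- what main would read back
```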

2. The maximum number of threads you can have depends on the system (it may be capped at 4096 or something; it's hard to know), but the maximum number of threads that can run simultaneously depends on the number of hardware threads the machine has, which is tied to the number of cores. For instance, I have a Core i5 laptop with 4 cores, and each core has 2 threads, so 8 threads at once. And one is already taken by the main thread, and other programs on the machine will want those threads as well (like the OS, the windowing system, firefox/chrome, etc.).

But, in my opening paragraph, I said that threads existed before multicore machines. As long as you're happy with waiting, you can have more threads than what's physically available. Since you're doing hard calculations, you probably don't want to wait. So you'll want fewer threads than the max, so that fewer threads are waiting for the machine to give up some physical CPU time. It sounds counterintuitive, but it has to do with cache hits and misses, and context switching: the more things that are vying for computation time, the more CPU time the OS has to waste switching between them. Though fewer threads means less work is being done, so you have to be johnny-on-the-spot with work to be performed. Design your threads around performing tasks, and have the main thread act as a dispatcher, sending in work to be done. That's called a "thread pool".

Hope this all helps.

bartbes
Sex machine
Posts: 4944
Joined: Fri Aug 29, 2008 10:35 am
Location: The Netherlands
Contact:

Roland_Yonaba wrote:Does this mean that, along with the main thread, only 1 additional thread will be processed at a time?
No, it means that your OS will be switching threads like crazy. One of the fundamentals of threading is that it is non-deterministic (at least, practically): you can't tell what's going to run when, or where, and in what order.

EDIT: Do note that it always switches like crazy, because of multiple processes.

Roland_Yonaba
Inner party member
Posts: 1562
Joined: Tue Jun 21, 2011 6:08 pm
Contact:

Okay, thanks all, that was a very useful piece of information, and nice explanations.
Thanks Inny, for the HotTubTaskMachine; I'll sit down and take a closer look at the code soon.

Actually, I just set up a temporary solution. What I was aiming for was to display (in love.draw) a basic interface and monitor current progress, while the rendering task completes in the background. That way the user can see the computation progress, instead of having an app that looks blocked in the rendering for-do-end loops.

So from now on, I'll be dealing with coroutines, which seem to be perfect for that.
Taking the original loop, it'll now look like:

Code: Select all

local function render(...)
  for y = 1, height do -- height
    for x = 1, width do -- width

      for sx = 0, aa-1 do -- for anti-aliasing
        for sy = 0, aa-1 do -- for anti-aliasing

          for samp = 1, spp do -- for sampling: we need at least 16~25 samples per pixel to have a clean output
            -- Process pixel x,y, with possible recursive calls (around 5, in general)
            -- evaluate a partial radiance for each sample, save it
          end
        end
      end
      -- accumulate all radiances to evaluate the final color for pixel x,y
      local progress = ... -- estimate the current progress
      coroutine.yield(false, progress)
    end
  end
  -- Assuming map is a 2D array listing all computed colors
  coroutine.yield(true, map)
end
Then I can just wrap this like:

Code: Select all

local renderer = coroutine.create(render)
And call it in the main file:

Code: Select all

function love.update(dt)
  local _, status, result = coroutine.resume(renderer, ...)
end
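The pattern can be seen end to end in this self-contained version, with a tiny fake workload standing in for the renderer so it runs outside LÖVE (the single-argument render and the y*y "work" are just placeholders):

```lua
-- A worker that yields (false, progress) while running,
-- then (true, result) when finished.
local function render(height)
  local map = {}
  for y = 1, height do
    map[y] = y * y                      -- stand-in for rendering row y
    coroutine.yield(false, y / height)  -- not done yet; report progress
  end
  coroutine.yield(true, map)            -- done; hand back the result
end

local renderer = coroutine.create(render)

-- Each "frame" (one love.update call) advances the render by one row.
local done, result
repeat
  local ok
  ok, done, result = coroutine.resume(renderer, 3)
  assert(ok, done)  -- on failure, the second value is the error message
until done
-- result is now the finished map; on earlier resumes it was the progress ratio
```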
Also, I can still try what I suggested earlier: using a table-less data structure for vectors, just multiple values.
I probably won't be that happy with it, but the gain in terms of speed could be noticeable.

There's also a method I've recently discovered that can be applied to global illumination to speed it up, called space-filling curves (Hilbert curves, for instance).
Actually, I still don't get how they really work (I opened a thread on Stack Overflow ages ago, but it seems nobody is willing to answer). I'll look into it too.
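For what it's worth, the standard index-to-coordinate mapping for a Hilbert curve (the d2xy routine usually given in C) ports to Lua like this; n is the grid side, assumed to be a power of two. Consecutive indices d map to grid cells that are always adjacent, which is what gives the curve its cache-friendly traversal order:

```lua
-- Rotate/flip a quadrant so the sub-curve is oriented correctly.
local function rot(n, x, y, rx, ry)
  if ry == 0 then
    if rx == 1 then
      x = n - 1 - x
      y = n - 1 - y
    end
    x, y = y, x
  end
  return x, y
end

-- Convert a distance d along the curve to (x, y) on an n*n grid.
local function d2xy(n, d)
  local x, y, t = 0, 0, d
  local s = 1
  while s < n do
    local rx = math.floor(t / 2) % 2
    local ry = (t % 2 + rx) % 2  -- low bit of t XOR rx
    x, y = rot(s, x, y, rx, ry)
    x = x + s * rx
    y = y + s * ry
    t = math.floor(t / 4)
    s = s * 2
  end
  return x, y
end

-- On a 2*2 grid the curve visits (0,0), (0,1), (1,1), (1,0).
local hx, hy = d2xy(2, 3)
```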
