June 23, 2016 | Dukus | 21 Comments
A few months ago I had some interesting performance problems with OpenGL on OSX. I identified the problem and made some workarounds so development could continue. This week I've properly fixed the issue, and I want to record it here so that I and others can avoid this mistake. So here's a scene, rendering on OSX, at an abysmal 14 frames per second on a MacBook Pro. That's right. 14. I've got the game paused so there isn't any time spent on updates; this is just drawing.
If I move the camera to a different location, the frame rate is 126. That's a difference of 63 or so milliseconds per frame. Ouch.
So after much debugging I determined that rendering animated models was causing the slowdown. The image of just trees doesn't have any deer or people moving around. And if I remove the people from my original test scene, the frame rate is over 100.
Since rendering houses and trees differs only slightly from rendering animated models, I disabled the shader code that animates the models, and the frame rate went back up to normal. This looks funny, and runs fast.
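For anyone checking the arithmetic: frame time is just 1000 divided by the frame rate, so 14 fps is about 71.4 ms per frame and 126 fps is about 7.9 ms. A tiny C++ sketch of that conversion follows (the function names are made up for illustration):

```cpp
#include <cassert>

// Convert a frame rate (frames per second) into milliseconds spent per frame.
double frame_time_ms(double fps) {
    assert(fps > 0.0);  // a zero or negative frame rate has no frame time
    return 1000.0 / fps;
}

// Extra milliseconds per frame paid at the slow rate compared to the fast one.
double frame_time_delta_ms(double slow_fps, double fast_fps) {
    return frame_time_ms(slow_fps) - frame_time_ms(fast_fps);
}
```

Calling `frame_time_delta_ms(14.0, 126.0)` gives roughly 63.5 ms, which is where the "63 or so" comes from.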
So here's the basic code that handles animation in GLSL. It looks pretty standard and is simple code. This isn't the entire shader, just enough to get an idea of how the animation part works.
struct BoneConstants
{
    mat4x4 transforms[64];
};
uniform BoneConstants bc;

in vec3 inputPosition;
in vec4 inputWeight;
in ivec4 inputIndex;

vec3 SkinPosition(vec3 position, ivec4 index, vec4 weight, BoneConstants bones)
{
    return ((bones.transforms[index.x] * vec4(position, 1.0)) * weight.x +
            (bones.transforms[index.y] * vec4(position, 1.0)) * weight.y +
            (bones.transforms[index.z] * vec4(position, 1.0)) * weight.z +
            (bones.transforms[index.w] * vec4(position, 1.0)) * weight.w).xyz;
}

void main()
{
    vec3 position = SkinPosition(inputPosition, inputIndex, inputIndex, bc);
    gl_Position = (gc.worldToProjection * (tc.transform * vec4(position, 1.0)));
}
What this code does is transform the position of a vertex by up to four bones in the model's structure. It then weights the results by how much influence each bone has on the vertex.
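The same blend can be sketched on the CPU. This is only an illustration in C++, not the game's code: `Vec4`, `Mat4`, `mul`, `translation` and `skin_position` are hypothetical names, and the matrix layout (rows, `m[r][c]`) is an assumption made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Vec4 { float x, y, z, w; };

// 4x4 matrix stored as rows: m[row][column] (layout assumed for this sketch).
struct Mat4 { float m[4][4]; };

// Matrix times column vector.
Vec4 mul(const Mat4& a, const Vec4& v) {
    return Vec4{
        a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z + a.m[0][3] * v.w,
        a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z + a.m[1][3] * v.w,
        a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z + a.m[2][3] * v.w,
        a.m[3][0] * v.x + a.m[3][1] * v.y + a.m[3][2] * v.z + a.m[3][3] * v.w};
}

// Helper for building test bones: a pure translation (zero offset = identity).
Mat4 translation(float tx, float ty, float tz) {
    return Mat4{{{1.0f, 0.0f, 0.0f, tx},
                 {0.0f, 1.0f, 0.0f, ty},
                 {0.0f, 0.0f, 1.0f, tz},
                 {0.0f, 0.0f, 0.0f, 1.0f}}};
}

// Transform the position by each of the four referenced bones, then sum the
// results scaled by each bone's weight. The bone array is taken by const
// reference, so nothing is copied.
Vec4 skin_position(const std::array<Mat4, 64>& bones, const Vec4& position,
                   const std::array<int, 4>& index,
                   const std::array<float, 4>& weight) {
    Vec4 out{0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < 4; ++i) {
        assert(index[i] >= 0 && index[i] < 64);
        const Vec4 p = mul(bones[static_cast<std::size_t>(index[i])], position);
        out.x += p.x * weight[i];
        out.y += p.y * weight[i];
        out.z += p.z * weight[i];
        out.w += p.w * weight[i];
    }
    return out;
}
```

For example, a vertex at (1, 1, 1) weighted half to an identity bone and half to a bone that translates by 2 along x ends up at x = 2, halfway between the two transformed positions.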
I stared at this code for a while (more than a while actually), and after messing about a bit, it finally dawned on me what's wrong with it. Face Palm.
To fix it, instead of calling a function to animate the models, I manually inlined the code. And my frame rate returned to normal, with animated characters.
void main()
{
    vec3 position = ((bc.transforms[inputIndex.x] * vec4(inputPosition, 1.0)) * inputWeight.x +
                     (bc.transforms[inputIndex.y] * vec4(inputPosition, 1.0)) * inputWeight.y +
                     (bc.transforms[inputIndex.z] * vec4(inputPosition, 1.0)) * inputWeight.z +
                     (bc.transforms[inputIndex.w] * vec4(inputPosition, 1.0)) * inputWeight.w).xyz;
    gl_Position = (gc.worldToProjection * (tc.transforms[gl_InstanceID] * vec4(position, 1.0)));
}
Wow. So what's going on there?
There are two ways to pass parameters to a function: by value or by reference.
When you pass a parameter by value, a copy of the variable is made so that any changes to the variable in the function don't affect its value in the calling function.
When you pass a parameter by reference, any modifications to the variable change it directly. No copy is made.
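The difference is easy to see in C++, which is used here purely as an illustration (this is not GLSL, and the function names are made up):

```cpp
#include <cassert>

// By value: the function works on its own copy; the caller's variable is untouched.
void set_by_value(int x) {
    x = 42;
    assert(x == 42);  // only the local copy changed
}

// By reference: the function works on the caller's variable directly.
void set_by_reference(int& x) {
    x = 42;
}
```

After `int a = 1; set_by_value(a);` the caller still sees `a == 1`; after `set_by_reference(a);` it sees `a == 42`.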
In my case with animation, the entire array of bone transformations is being copied, because it's being passed by value. My suspicion is that the program running on the GPU doesn't have enough registers to make this copy, so the GLSL compiler is generating code that copies the array bit by bit and then runs over and over to evaluate the final result. What's just a few matrix multiplies, scaling, and adding becomes many, many copies and conditionals. This possibly results in different execution paths per GPU thread, causing even more slowdown.
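To put a number on it: 64 matrices of 4x4 floats is 64 × 16 × 4 = 4096 bytes per copy. The C++ sketch below only demonstrates the language-level semantics on the CPU, with a hypothetical copy counter (`g_copies`) added for illustration; it says nothing about what any particular GLSL compiler actually emits.

```cpp
#include <cassert>
#include <cstring>

// Counts how many times a BoneConstants is copied (illustration only).
int g_copies = 0;

struct BoneConstants {
    float transforms[64][16];  // 64 matrices, 16 floats each

    BoneConstants() : transforms{} {}

    // User-provided copy constructor so that every copy can be counted.
    BoneConstants(const BoneConstants& other) {
        std::memcpy(transforms, other.transforms, sizeof(transforms));
        ++g_copies;
    }
};

// By value: the whole 64-matrix array is copied on every call.
float first_element_by_value(BoneConstants bones) {
    return bones.transforms[0][0];
}

// By const reference: no copy is made, and the callee cannot modify the data.
float first_element_by_reference(const BoneConstants& bones) {
    return bones.transforms[0][0];
}
```

Each call to `first_element_by_value` bumps `g_copies` by one, while `first_element_by_reference` leaves it unchanged.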
My first attempt before manually inlining this code was actually to pass the array by reference, but the OpenGL compiler yelled at me that you can't pass a uniform by reference.
On Windows and Linux, I suspect the compiler is smart enough to see that the function doesn't modify the array, and optimizes the copy away. (Or my GTX 980 and 290X are just too fast for me to notice the slowdown...)
Most people reference the global list of uniform bone transformations directly and never run into this issue. But since my custom shader language that generates GLSL doesn't have a concept of globals, everything is passed to functions if it's needed. Arghghghg.
So what's the real fix?
I don't want to have to manually repeat code in shaders; that's just bad programming practice. Luckily, I control the compiler for my own shading language, so I can get it to generate different code.
So I just recently added an 'inline' keyword for functions. The code gets inlined automatically and any value passed by reference isn't copied when the GLSL is generated.
Previously my skinning function looked (in SRSL, not GLSL) like this:
inline float3 SkinPosition(float3 position, int4 index, float4 weight, BoneConstants bc) {...}
And now it looks like this:
inline float3 SkinPosition(float3 position, inout int4 index, inout float4 weight, inout BoneConstants bc) {...}
No more repeated skinning code everywhere.
Getting my compiler to inline the code is pretty easy. However, as most shader languages don't feature a goto or label statement to jump over remaining code, it's hard (if not impossible) to inline a certain class of functions. So my inline feature doesn't handle inlining when returning from complex flow control. This really isn't an issue for shaders, as the programs tend to be straightforward and not have many loops or conditionals.
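As a rough illustration of that class of functions (in C++, with made-up names; this is not how my compiler represents anything), compare a function that returns from inside its branches with a single-exit version of the same logic:

```cpp
#include <cassert>

// Early returns inside branches: pasting this body into a caller would need
// a way to jump past the remaining statements after each return.
float clamp01_early_return(float v) {
    if (v < 0.0f) return 0.0f;
    if (v > 1.0f) return 1.0f;
    return v;
}

// Single exit: every path falls through to one final return, so the body can
// be substituted at the call site with the return turned into an assignment.
float clamp01_single_exit(float v) {
    float result = v;
    if (v < 0.0f) {
        result = 0.0f;
    } else if (v > 1.0f) {
        result = 1.0f;
    }
    return result;
}
```

Both give the same answers; only the second shape inlines by simple textual substitution.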
So long story short, don't pass uniform arrays and large structs to a function by value in GLSL.
Any update on how long til it comes out for osx?
Are you still reading your email @ support? Cause my last emails (many) never got any answers.
Thanks for sharing. It was really helpful!
Wow.. you just solved an issue I've been dealing with for more than 2 years.
THANK YOU
Haha, for a while I kept thinking "he isn't going to copy the same code in every part where it's needed, please tell me he doesn't do that". So I laughed when you explained what's bad programming practice. I'm glad you solved it and have optimised code!
Great to see an update!
BTW. Was there any "official" announcement made regarding the Mac version yet? Is there any rough timeline for e.g. beta release? What are the expected system requirements? Are you aiming for making it playable on cards like Iris 6100 (so pretty much on all new MacBooks in default configuration) or only on more powerful machines with dedicated GPUs?
Nice to see you solved your problem, but wouldn't a pointer have helped here too? Inlining should still be faster though.
Now how long before I can play this on linux?
Valgrind/Callgrind/Cachegrind? They should have told you that you spend massive amount of time on memory allocation there.
Though, it's an easy mistake to make, and easy to look over once done. Funny one though, made the same mistake during my studies once.
I've been following Banished since the very beginning, but I haven't played much lately due to lack of time.
The dedication and effort that this update conveys has been enough to make me post a big "thank you".
Also, the "Eureka" moment you just shared is very insightful, and as an amateur programmer made me genuinely chuckle.
Keep up the good work!
There's a typo in your "before" shader code; you're passing inputIndex instead of inputWeight to the weight argument. This had me confused for a bit because I thought the slowdown was coming from float-to-integer conversion. (Actually, I think there is no implicit conversion and that code would just have failed to compile.)
I wonder if it's just the struct that's hard for the compiler to inline? I'd be interested to know what would happen if you passed just the matrix bc.transform as an argument, instead of the whole bc struct.
Left me wondering if you're aware that in HLSL all functions are inlined automatically and you don't get a choice about it. I didn't know GLSL did things the complete opposite way - good to know! Thanks for Banished and the good Dev blog posts, hope development continues smoothly.
I suggest continuing to pass your primitive data types (e.g. int) by value, as passing them by reference can be slower.
Have you considered bundling it with Wine? Lots of games are released like this. Banished runs well in it.
Please do not bundle it with Wine.
Also to "g" what is even the point in that when the developer has already done most of the native porting?
Can't wait for Mac version
Thanks for the update! I still love this game. It's great that you're sharing your experiences with the community. These blog posts are really fun to read.
I'm really looking forward to the OSX version.
Keep up the good work!
Like most, I can't wait to play this on an operating system that doesn't nag you to death.
I am also waiting for the mac version. I feel like you'll see a flurry of sales on Steam when you do this. Hurry up!!
Yep, can't wait for the mac version. I'll buy on first day.
I love banished and I'm so looking forward to playing it on my Mac. Thank you for all your hard work.