Yesterday I posted Handling Transform Data in Vulkan, and it generally seemed well received by everyone, but as it turns out, I was doing something dumb. I was pretty sure that SSBOs should have been slower than UBOs on my (NVidia) graphics card, but none of the data I had showed that. In fact, all the data I had showed that all three buffer-based approaches to handling transform data performed about the same.
As weird as that was, I didn’t have any data to suggest otherwise, so I posted what I had. Luckily, someone (who I won’t name in case they’d prefer not to be forever immortalized on my blog) spotted my mistake:
Suffice to say, the person giving me this feedback has much more experience than I do, so I figured I needed to take a look at my tests and see if I could improve things. Luckily, today was my day off, so I had some time to kill.
If that feedback was correct, the problem was that my shaders were so simple that my GPU was burning through frames too quickly for any performance difference between my transform data approaches to matter: I was hitting a different performance bottleneck first. Sadly, I’ve spent a lot of time fixing graphics pipeline bottlenecks on different projects… I just didn’t think about them when collecting my data last time.
I kept almost everything the same as last time, since I’m pretty sure my test setup was mostly sound. However, I changed my fragment shader. You might recall that I was just outputting normals in my first attempt at benchmarking this. That seemed reasonable because I only cared about vertex shader performance, but it was also probably where I went wrong. So to re-test everything, I changed my fragment shader to the following:
This seemed like a reasonable way to avoid hitting the same bottlenecks as before, since it was easy to adjust the number of useless operations and so control how much time was spent in fragment processing. Also, since every test case already used the same fragment shader, this was an easy spot to add instructions that I knew would be applied uniformly to every test, so that the only variable between them was still transform data handling.
Now, my scene looked like this:
Also, one thing to note with these tests is that they seem to be consistent to within about 0.5 ms (when testing the Bistro Scene). That is, if I re-run the same test multiple times, all my results for that test are within about a half millisecond of each other. The Sponza tests were much more consistent (but the values were also much smaller). I’ve included detailed testing methodology notes at the end of this blog post, but the short version is that each test result is the average of 4096 frames, and I re-ran each test 3 times. The number you see in the graphs is the median of these three test runs.
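To make that arithmetic concrete, here is a minimal C++ sketch of how a plotted number could be derived: average the frame times within a run, then take the median of the three run averages. The function names are hypothetical and this is not the timing code from my benchmark; only the 4096-frame average, the three runs, and the median come from the methodology described above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <numeric>
#include <vector>

// Mean frame time in milliseconds for a single run.
// In the post, one run is 4096 frames; here any non-empty sample works.
double meanFrameTimeMs(const std::vector<double>& frameTimesMs) {
    assert(!frameTimesMs.empty());
    const double sum =
        std::accumulate(frameTimesMs.begin(), frameTimesMs.end(), 0.0);
    return sum / static_cast<double>(frameTimesMs.size());
}

// Median of the three per-run averages: the value that ends up in a graph.
double medianOfThree(std::array<double, 3> runAveragesMs) {
    std::sort(runAveragesMs.begin(), runAveragesMs.end());
    return runAveragesMs[1];
}

// Gap between the slowest and fastest run, as a rough consistency check
// (the post reports roughly 0.5 ms for the Bistro scene).
double runSpreadMs(const std::array<double, 3>& runAveragesMs) {
    const auto minMax =
        std::minmax_element(runAveragesMs.begin(), runAveragesMs.end());
    return *minMax.second - *minMax.first;
}
```

In use, each of the three runs would pass its recorded frame times to `meanFrameTimeMs`, and the three results would go to `medianOfThree` for the graph and to `runSpreadMs` to check run-to-run consistency. Taking the median rather than the mean of the runs means a single outlier run cannot shift the reported value.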
Ok, let’s start with Push Constants again. I’ve included my results from last time in this graph to showcase the differences made by the new test.
Similar to last time, there’s not really much interesting to say about these results other than that you can see the impact that the new fragment shader had on frame time pretty clearly in that graph.
Ok, here’s where things get interesting. Last time I ran this test, the data showed that Dynamic UBOs and the UBOs containing an array of structs performed pretty much evenly. However, I wasn’t sure whether the small difference I did see was enough to suggest a real performance difference or was just down to issues with my tests’ accuracy.
These new results don’t mean that there aren’t issues with the accuracy of my tests, but they do help confirm that this performance difference exists to some extent (at least at scale). An obvious criticism is that the Sponza test was still reporting the same values for everything, and that’s fair. There’s likely more I could be doing to benchmark the smaller scene, but I’m happy with the results from the Bistro scene, and I’ve now spent way, way too much time on this benchmark.
The other thing this helps confirm is that there is indeed a noticeable performance difference between using HOST_CACHED UBOs and DEVICE_LOCAL ones. I stand by the old conclusion that for data that updates on a per frame basis, you should just stick to HOST_CACHED.
And finally, the moment we’ve all been waiting for! Here’s what the UBO test results look like alongside the new test results for SSBOs:
As expected, with the new test, we can now actually see a real performance difference between using SSBOs and using UBOs on my NVidia GTX 1060. All is well with the universe, things look how they should.
Here are all the results put into a single graph.
With this new data, I’m going to walk back my previous assertion that I’m going to go with a single large SSBO for my transform data, and I think I will place my bets firmly in the large UBO pages camp. At least until I get an AMD card to play with. They seem fun.
It’s interesting to note that even with the change in test methodology, Push Constants still come out the loser in the larger scenes. I don’t know for sure what that means, but it’s definitely still a surprise. Hopefully that doesn’t point to a new problem with this benchmark, but you know, if it does and you can see what I’m doing wrong, send me a message on Twitter, Mastodon, or via e-mail and I’ll write up yet another one of these posts. Next time with an even bigger facepalm meme photo.
Until next time!