Lags, Freezes, and RenderGeom

So for a long time I was having occasional whole-machine crashes in Second Life, where I’d get a Blue Screen of Death from Windows, blaming the video drivers for something, or the screen would go black and the machine unresponsive, or the machine would stop responding with the screen just frozen on whatever I was last seeing, or the machine would power down entirely.

Camming around too fast or too far from my AV seemed to increase the crashing, so I slowly got trained not to do that (although ordinarily camming all over the heck is one of my chief occupations).

There were a few days when the entire machine would lock up about half the time that I tried to start up the SL viewer, and the other half the time it would usually lock up or crash within an hour or so. That was horrible, and I was afraid I’d just have to give up SL entirely. Then it stopped doing that for no apparent reason, and as of 1.19 I’ve pretty much stopped having SL-related machine crashes entirely. The viewer itself crashes once in a while, but not often, and it’s pretty quick to start up again.

Now I’m running 1.20, the former RC now actually Released, and things seem even better. FPS seems higher, and camming and zooming and so forth seem smoother. I’m back to camming all over the place, and I haven’t had any crashes at all yet on 1.20 (knock wood), although I expect that I will eventually. And I like some of the new 1.20 features (stretching images in the image previewer, clicking on names in chat and IM to bring up profiles, being able to be smug about how low my Avatar Rendering Cost is, etc., etc.), and I haven’t found anything I really dislike. Yay!

There’s still one very annoying viewer behavior left, though. Sometimes at random (and I think more often when there are a lot of AVs around, but I could be wrong about that) my fps (frames per second, the framerate) will drop from some normalish value to significantly less than one; that is, my view of SL will get updated less often than once per second! And it’s not only the view; because of the way the SL viewer is written, it means that it also won’t notice my keystrokes or mouse actions except every second or two (or three or…). Which makes it pretty much impossible to do anything, even chat!
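To make the "input only gets noticed once per frame" effect concrete, here is a toy model in Python. It is not the viewer's actual code, just a sketch that assumes a single-threaded loop which polls for input once at the start of each frame, with every frame taking the same amount of time. The function name and the sample numbers are made up for illustration.

```python
import math


def input_delays(event_times, frame_time):
    """Return how long each input event waits before the loop polls it.

    Toy model (an assumption, not the SL viewer's real loop): input is
    polled once at the start of every frame, and every frame takes
    exactly `frame_time` seconds. Times are in seconds.
    """
    delays = []
    for t in event_times:
        # The next poll happens at the first frame boundary at or after t.
        next_poll = math.ceil(t / frame_time) * frame_time
        delays.append(next_poll - t)
    return delays


# Hypothetical keystrokes at 0.5 s, 1.0 s and 2.5 s.
keystrokes = [0.5, 1.0, 2.5]

# Freezelag: 0.5 fps, so each frame takes 2 seconds.
slow = input_delays(keystrokes, frame_time=2.0)

# Normal-ish: 32 fps, so each frame takes 1/32 of a second.
fast = input_delays(keystrokes, frame_time=0.03125)
```

In this model a keystroke can wait up to one full frame before anything notices it, so at half a frame per second typing feels like wading through glue, while at 32 fps the wait is at most about 31 milliseconds.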

If I turn off prim rendering (control shift alt 9, natch), things get fast again. On the other hand everyone is bald and all buildings and objects vanish. :) So I can chat again and stuff, but I can’t sit on things, and I tend to walk into walls.

My investigations so far reveal that pretty much invariably what’s happening is that the “RenderGeom” part of the viewer has gone insane, and started to eat up huge amounts of time. The figure there shows the timing graphs (control shift 2) during one of these “freezelag” times, both with and without prim rendering on. The top part of the graph is with prim rendering on and a horribly low FPS rate; note that almost the entirety of each horizontal bar is the grey color of RenderGeom. The bottom half is with prim rendering off and a high FPS; RenderGeom is still significant, but not like 95% of rendering time. The bottom part is also what the bars look like, roughly, when I have prim rendering on but I’m not freezelagging. (Note that the chart is in normalized mode, so each bar is the same size, even though the absolute scene rendering time is much much much higher in the top ones.)
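The difference between the normalized view and the absolute times can be sketched with a few lines of Python. The millisecond figures below are invented for illustration (they are not measurements from my viewer), and lumping everything that isn't RenderGeom into one "Other" bucket is a simplification.

```python
def normalized_shares(timings_ms):
    """Convert per-stage frame timings into fractions of the whole frame.

    This mimics what a normalized timing chart does: every bar is scaled
    to the same length, so only the proportions are visible.
    """
    total = sum(timings_ms.values())
    return {name: t / total for name, t in timings_ms.items()}


# Hypothetical frame during freezelag, prim rendering on.
freezelag_frame = {"RenderGeom": 1900.0, "Other": 100.0}

# Hypothetical frame with prim rendering off.
fast_frame = {"RenderGeom": 10.0, "Other": 30.0}

slow_shares = normalized_shares(freezelag_frame)
fast_shares = normalized_shares(fast_frame)

# How many times longer the slow frame takes in absolute terms.
slowdown = sum(freezelag_frame.values()) / sum(fast_frame.values())
```

With these made-up numbers RenderGeom is 95% of the slow frame and 25% of the fast one, but the normalized bars hide the fact that the slow frame takes 50 times as long overall.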

Has anyone else experienced similar stuff? Have any clever suggestions? There’s a JIRA on the problem, and it has an internal Linden issue number, but it doesn’t seem to be getting a lot of visible attention or work. If anyone has seen any other related JIRA (I did search for “RenderGeom”), a pointer to that would also be greatly appreciated.

Isn’t that all fascinating? :) In other news, I’ve been flying around and exploring and dancing and talking to people and writing scripts and stuff!

7 Responses

  1. Great detective work, Dale. I am always fascinated by revelations regarding performance. I haven’t tried the timing graphs you show above, but I will now. Meanwhile, I always enjoy tweaking settings to see if I can get a few more fps out of the system. Usually it comes down to a choice of: Do you want to See? or Do you want to do activities?

  2. I’ve seen similar with the Linux viewer and nvidia drivers, but I’ve noticed that it occurs much less often, and for shorter periods, in the recent 1.20 viewer (and in the 1.20 RC viewer since about RC11 or RC12). My vague recollection – I haven’t checked this – is that there was a tweak to the RenderGeom code during the 1.20 RC phase, and that this was described in the release notes and / or the SL official blog.

  3. Gimme a blogroll linkypoo ? I blogroll linkypoo’d you :)

  4. Anyfing for Ava. :)

  5. I get performance anxiety.. and just relog

  6. One Linden developer sort of hinted that what is happening comes from the way Second Life is constantly pushing statistics into LL’s servers. Unlike many other “games” and “platforms”, LL really doesn’t trust the tiny vocal minority that always complains about lag and lack of performance. They wish to see the “big picture”, and this means that every SL resident logged in will happily send loads and loads of data about your computer’s performance up to LL (it was even mentioned that some things like FPS, or memory usage by the texture cache, are sent “several times per second”).

    This naturally allows LL to do an average of what is going on in SL which is rather close to the reality.

    It has, though, a problem.

    Allegedly, getting this level of metrics and statistics requires calling some more obscure OpenGL functions. These are hardly used by anyone (outside of research labs to evaluate a graphics card’s performance, or a developer team figuring out how effective their new rendering algorithms are). It looks like none of the games or MMORPGs out there that use OpenGL ever thought of using those functions. At least Apple has conceded that their own OpenGL drivers have, indeed, some bugs in those obscure functions, and became quite amused when LL reported that they were, in fact, using them extensively.

    So LL has a dilemma to solve. Either they wait until the OpenGL developers fix the bugs, or they stop gathering statistics. In the latter case, that also means having no clue on how buggy/crashy the SL clients actually are. It means having to rely on word-of-mouth by a vocal minority of a few hundred users that are always complaining, often without having a clue on how to tweak their systems for good performance and high stability. In the past it was quite clear that this approach doesn’t work: the biggest problems are found with the users that never protest and are silent, and LL wants to track them too, and have the “big picture” in order to be able to figure out if they’re in the right direction. On the other hand, while doing so, it means that these obscure OpenGL functions will crash (or “freeze”) the computer quite often (it has been shown that on a Mac, for instance, SL is currently the only application that can crash Leopard and force a reboot — because Apple’s cool GUI runs on top of OpenGL too, and when SL aggressively uses those nice obscure functions, the Mac can totally lose connection with the graphics card — and a reboot is the only way out).

    For some reason, the “freezelag” happens way more often when the grid is in an unstable mode and your SL client is repeatedly trying to load a texture that refuses to arrive. This somehow triggers the sloppily-coded OpenGL functions more often and they freeze the computer for a few seconds. Then the SL client tries to get the texture again, which doesn’t arrive — and the cycle repeats. In some cases the only workaround is to log out and try to log in again on a “mostly empty” sim, and hope that the dreaded textures that never arrive will not be called. Sometimes, however, these are on your avatar (skins, clothes, or attached prims…) and it will be hard to exactly pinpoint which one is the culprit. And sometimes everything works smoothly and you get no “freezelag” for days after days.

  7. Hm, interesting. The “nonarriving texture” idea appeals to me, in that at least sometimes it feels like there’s some particular point in the world that I have to avoid looking at to avoid the freezelag; that might be where the missing texture is wanting to be rendered. I should just look at the source, and see exactly what’s happening inside renderGeom…
