Thursday, September 22, 2005

anatomy of a nasty bug

i came across an interesting bug a month or two ago: if you run with desktop icons turned off in kde (which i do because i have zero use for them) you may discover that sooner or later the left, middle and right click menus on the desktop cease to work. i couldn't figure out what the cause was (though later i was to find a bug report on this).

and then a few days ago coolo mentioned on irc that a fellow from iceland had complained about this to SUSE. so i decided that 3.5 could not go out with this bug in it. i mean, if it was affecting both a canadian and an icelander ... well! (no, i have no idea what that actually means)

so i tried everything i could think of to trigger the bug and eventually managed to do it: i ran something using the "run command" dialog and ... poof, no menus! but just what in the heck did the "run command" dialog have to do with desktop menus? good question. so i threw a bunch of debug output into the appropriate areas (desktop.cc, krootwm.cc, etc) of the code to see when and where the mouse button events were arriving. and mysteriously enough, once the "run command" dialog was shown the events just disappeared. holy bermuda triangle, batman!

so i read over the code in minicli.cpp (the "run command" dialog code) and nothing looked amiss. so i started brute forcing it: remove code, compile kdesktop, dcopquit kdesktop && kdesktop, click (read debug), run command dialog, click. not the fastest process in the world. and when the last bit of code had been removed from minicli.cpp the problem persisted, and my blood temperature dropped.

well, i actually hadn't removed all the code from minicli.cpp: it was still creating the gui, which was generated by qt designer. how... odd. so i tried just not showing the dialog; no improvement. then i tried not creating the gui. success!

so something in the autogenerated code was causing the problem? i read through it. nothing seemed amiss (familiar feeling). on a lark i decided to remove the kde widgets and replace them with their qt counterparts one by one in case it was a bug in kdelibs. and lo! when i removed the last kpushbutton the bug went away.

and so off into libkdeui in kdelibs went i to examine kpushbutton. i could sense the desktop icon code in krootwm.cc receding far into the distance by this point. and you know what? nothing looked amiss in the kpushbutton code. so i started removing code again, and eventually tracked it down to this one line:

QToolTip::add( this, d->item.toolTip() );


you have no idea how many times i had restarted kdesktop by this point, so i wasn't about to give up now, and pointed my text editor at the qt tooltip code. i'd fixed a memory leak bug in here a couple years back so was at least familiar with this particular area of qt. and then i saw it: creating a tooltip creates a tooltip manager, and the tooltip manager installs an event filter that processes every event passed through the application. but nothing (wait for it!) looked amiss in the event filter!

so just what was going on? well, kdesktop filters events on the root x11 window, and filtering on the qapplication itself, even when done apparently right, messes this up.
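for anyone who hasn't met application-wide event filters before, here is a toy model of the idea in plain c++. to be clear, this is a sketch and not qt's code: `Event`, `Widget`, `Application` and `countSeenByFilter` are all made-up stand-ins, and it does not reproduce the kdesktop bug itself (the root window interaction is not modelled). all it shows is that one filter installed on the application gets a look at every event for every widget, and that a filter returning true makes the event vanish before the widget ever sees it.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// toy stand-ins for illustration only; these are NOT the real qt classes
struct Event {
    std::string type;
};

class Widget {
public:
    // record every event that actually reaches this widget
    void handle(const Event &e) { received_.push_back(e.type); }
    const std::vector<std::string> &received() const { return received_; }

private:
    std::vector<std::string> received_;
};

// a filter returns true to swallow the event, false to let it continue
using EventFilter = std::function<bool(Widget &, const Event &)>;

class Application {
public:
    void installEventFilter(EventFilter f) { filters_.push_back(std::move(f)); }

    // every event for every widget passes through the app-wide filters first.
    // returns true if the event reached its target widget.
    bool deliver(Widget &target, const Event &e) {
        // the most recently installed filter runs first
        for (auto it = filters_.rbegin(); it != filters_.rend(); ++it) {
            if ((*it)(target, e))
                return false; // swallowed: the widget never sees it
        }
        target.handle(e);
        return true;
    }

private:
    std::vector<EventFilter> filters_;
};

// hypothetical demo: a well-behaved filter that only watches events
int countSeenByFilter() {
    Application app;
    Widget desktop;
    Widget dialog;
    int seen = 0;
    app.installEventFilter([&seen](Widget &, const Event &) {
        ++seen;       // the filter observes the event ...
        return false; // ... but lets it through
    });
    app.deliver(desktop, Event{"mouse-press"});
    app.deliver(dialog, Event{"key-press"});
    // both widgets still got their events
    assert(desktop.received().size() == 1);
    assert(dialog.received().size() == 1);
    return seen;
}
```

the point of the toy: one innocent-looking filter sits in the path of every single event in the process, which is why something as small as a tooltip on a button can end up touching code that has nothing to do with buttons.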

it was at this point i backed off, having at least tracked the bug down, and simply replaced the kpushbuttons with qpushbuttons (resulting in 4x the original LOC in the process). so, 3.5 will not have this bug, but then again ... i won't get those 2 days back either.

(though in the name of honesty, i did get lots of other stuff done on those two days as well. including welcoming t. back home from her vacation =)

4 comments:

LMCBoy said...

Wow. Great work, Aaron. Nasty bug indeed. Good to know I'm not the only one who resorts to brute-force debugging methods :)

Ian Monroe said...

I just fixed a bug kind of like this in amaroK though not a tenth as bad. KDirOperator was eating up the DEL key from the playlist and for some reason no event filter would filter it until I put the filter on the entire playlistwindow, which I committed. Of course that caused other problems and a hilariously confusing bug report.

I suppose it has something to do with how Action Collections work? As it turned out, it was part of KDirOperator's action collection, so I used QObject::child to get the action collection and then disabled the DEL action.

Max Howell said...

I once tried to fix the same bug as Ian, but gave up after about 3 or 4 hours. Two days is pretty impressive! I usually lose faith in life, code and even beer after a few minutes debugging.

But heh, I enjoyed that story, and sympathised during its unfurling so many thanks :-)

Bram said...

Thank you for fixing this annoying bug :)