Debugging Deadlocks on Android

Deadlocks are well known to be difficult to debug, especially when you’re dealing with a large amount of concurrent code. Fortunately on Android we have some tools that are more than useful to detect where a deadlock is happening.

Debugging Deadlocks

Probably in your developer life you’ll end up dealing with deadlocks at least a couple (hundred thousand) times, while a deadlock-situation might not be that hard to detect, sometimes it can be hard to detect where the deadlock happened. This usually lead to questions like “where? and how do I trace it?”, so you start writing a new class, if you didn’t do that already, to track resource allocation and hunt the bug, a very time consuming process, especially if you’re in a rush. Let’s see what to do if you’re developing on Android platform.

A few days ago I’ve found that the release version of a program I was writing had a deadlock that wasn’t present in the debug version. Quite strange if you consider that the only difference were some log.d(). What to do?

Deadlock Identification

A deadlock wasn’t my first guess, so I started checking again the code to see if I was missing something obvious. Eventually, after ruling out any other possibility I had to tell myself that yep, it was a damn deadlock. How to identify it? It’s pretty easy, the only thing you need is a rooted phone: first of all deploy your application on the device, run it, wait for the deadlock to happen and then connect to the device using ADB:

c:\> adb shell
$ su
#

Get a list of all processes running on the system and take note of the PID that belongs to your process:

# ps
app_59 645 94 137584 18308 ffffffff afd0c52c S com.sec.android.app.samsungapps.una2
app_71 745 94 141712 21352 ffffffff afd0c52c S com.que.gpslogger
app_41 772 94 136820 16276 ffffffff afd0c52c S com.sec.android.widgetapp.clockweather.clockwidget

In my case the PID was 745, kill it using the -3 as argument:

# kill -3 745

This command creates a trace file into /data/anr/, on some Android versions you’ll have to create the directory by yourself otherwise the command won’t produce any output file.

# cd /data/anr/
# ls -l
ls -l
-rw-rw-rw- app_71 app_71 2237 2012-01-27 12:06 traces.txt
-rw-rw-rw- root root 4364 2012-01-27 12:33 traces.txt.bugreport

At this point all you have to do is pull the files and examine them:

c:\> adb pull /data/anr/traces.txt

Opening the file you’ll see something like this:

"Thread-23" daemon prio=5 tid=22 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x4056fbc0 self=0x1b5da0
  | sysTid=764 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1793248
  at android.os.MessageQueue.nativePollOnce(Native Method)
  at android.os.MessageQueue.next(MessageQueue.java:119)
  at android.os.Looper.loop(Looper.java:117)
  at com.que.gpslogger.module.b.a.run((null):-1)

"Thread-22" prio=5 tid=21 TIMED_WAIT
  | group="main" sCount=1 dsCount=0 obj=0x40573d58 self=0x1b6768
  | sysTid=765 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1796256
  at java.lang.Object.wait(Native Method)
  - waiting on <0x40573cf0> (a com.que.gpslogger.d.q)
  at java.lang.Object.wait(Object.java:395)
  at com.que.gpslogger.a.run((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

"Thread-21" prio=5 tid=20 TIMED_WAIT
  | group="main" sCount=1 dsCount=0 obj=0x4056e678 self=0x1b4b18
  | sysTid=763 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1789008
  at java.lang.Object.wait(Native Method)
  - waiting on <0x4056e5c0> (a com.que.gpslogger.d.n)
  at java.lang.Object.wait(Object.java:395)
  at com.que.gpslogger.a.run((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

"Thread-19" prio=5 tid=18 WAIT
  | group="main" sCount=1 dsCount=0 obj=0x4056d860 self=0x1b3f70
  | sysTid=761 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1786024
  at java.lang.Object.wait(Native Method)
  - waiting on <0x40573458> (a java.lang.VMThread)
  at java.lang.Object.wait(Object.java:358)
  at java.lang.Thread.join(Thread.java:914)
  at com.que.gpslogger.d.k.c((null):-1)
  at com.que.gpslogger.a.run((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

This is just an extract of the original file containing the stack trace for every thread associated to your process. Did you notice anything? Methods related to your core will probably be obfuscated by ProGuard: com.que.gpslogger.d.k.c, so it is really hard to understand what a give thread is doing. Fortunately we have a nice tool into the Android SDK called retrace.bat on Windows and retrace.sh on Linux, this tool is able to convert the obfuscated trace into a normal one, the only think you’ll have to provide is the mapping.txt file, you can find it into your project directory: <project_root>/bin/proguard/mapping.txt, this file contains the mapping between the normal and obfuscated code:

com.que.gpslogger.Data -> com.que.gpslogger.h:
int type -&gt; a
int size -&gt; b
int accuracy -&gt; c
long time -&gt; d
long getType() -&gt; a
void setTime(int) -&gt; d

We can easily proceed to code-deobfuscation simply running this script from its own directory:

[sourcecode]
c:\Android-sdk-windows\tools\proguard\bin&gt;retrace.bat -verbose C:\Projects\QueGPS\bin\proguard\mapping.txt C:\Projects\QueGPS\bin\proguard\traces.txt &gt; C:\Projects\QueGPS\bin\proguard\deobfuscated.txt
[/sourcecode]

After this step we’ll be able to examine our trace without any obfuscation, so open deobfuscated.txt and look for any thread that’s in WAIT or MONITOR state:

"Thread-23" daemon prio=5 tid=22 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x4056fbc0 self=0x1b5da0
  | sysTid=764 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1793248
  at android.os.MessageQueue.nativePollOnce(Native Method)
  at android.os.MessageQueue.next(MessageQueue.java:119)
  at android.os.Looper.loop(Looper.java:117)
  at com.que.gpslogger.module.position.GPS.void run()((null):-1)

"Thread-22" prio=5 tid=21 TIMED_WAIT
  | group="main" sCount=1 dsCount=0 obj=0x40573d58 self=0x1b6768
  | sysTid=765 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1796256
  at java.lang.Object.wait(Native Method)
  - waiting on <0x40573cf0> (a com.que.gpslogger.d.q)
  at java.lang.Object.wait(Object.java:395)
  at com.que.gpslogger.Main.void run()((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

"Thread-21" prio=5 tid=20 TIMED_WAIT
  | group="main" sCount=1 dsCount=0 obj=0x4056e678 self=0x1b4b18
  | sysTid=763 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1789008
  at java.lang.Object.wait(Native Method)
  - waiting on <0x4056e5c0> (a com.que.gpslogger.d.n)
  at java.lang.Object.wait(Object.java:395)
  at com.que.gpslogger.Main.void run()((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

"Thread-19" prio=5 tid=18 WAIT
  | group="main" sCount=1 dsCount=0 obj=0x4056d860 self=0x1b3f70
  | sysTid=761 nice=0 sched=0/0 cgrp=bg_non_interactive handle=1786024
  at java.lang.Object.wait(Native Method)
  - waiting on <0x40573458> (a java.lang.VMThread)
  at java.lang.Object.wait(Object.java:358)
  at java.lang.Thread.join(Thread.java:914)
  at com.que.gpslogger.alert.Notification.void stopNotification()((null):-1)
  at com.que.gpslogger.Main.void run()((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

Look for any thread in WAIT state and check if it has to be that state or not, in my case the Thread-19 is blocked in a join() call, something that doesn’t have to happen so there it is my problem.

"Thread-19" prio=5 tid=18 WAIT // It is waiting
  - waiting on <0x40573458> (a java.lang.VMThread) // Why threaddy are you here?
  at java.lang.Object.wait(Object.java:358)
  at java.lang.Thread.join(Thread.java:914)
  at com.que.gpslogger.alert.Notification.void stopNotification()((null):-1)
  at com.que.gpslogger.Main.void run()((null):-1)
  at java.lang.Thread.run(Thread.java:1019)

 

In the end I forgot to close a Looper, or better, the call to close the looper was into the debugging code that was stripped out when compiling in release, that’s the reason the debug version was behaving correctly. I hope this short memo will help you save some time :).

Quequero