Joe Leon

The Dig

December 4, 2024

Cracking Open APK Files at Scale

TL;DR: TruffleHog now automatically decodes Android Package Kit (APK) files and searches them for secrets. It runs ~9x faster than running an external decompiler first and then calling TruffleHog.



Android Package Kit (apk) files drive the apps on Android phones. Despite their cool extension (.apk), they’re essentially just zip files containing, among other things, compiled Java and Kotlin code. 
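Because an apk is a zip archive, Python’s standard zipfile module can open one directly. The snippet below is a minimal sketch: the helper name list_apk_entries is ours, and the in-memory archive is a stand-in with placeholder bytes rather than a real apk.

```python
import io
import zipfile


def list_apk_entries(apk):
    """Return the names of all entries in an apk (a path or a file-like object)."""
    with zipfile.ZipFile(apk) as archive:
        return archive.namelist()


# Build a tiny stand-in "apk" in memory so the sketch runs without a real file.
# The entry contents are placeholder bytes, not valid Android data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("AndroidManifest.xml", b"\x03\x00\x08\x00")
    archive.writestr("classes.dex", b"dex\n035\x00")
    archive.writestr("resources.arsc", b"\x02\x00\x0c\x00")

buffer.seek(0)
print(list_apk_entries(buffer))
```

Passing the path of a real apk to list_apk_entries should work the same way.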

During the build process, Java and Kotlin code undergo complex compilation and encoding steps that often obscure secrets. While TruffleHog has long supported scanning zip files for secrets, it lacked the specialized logic to process the compiled code inside apk files—until now.

TruffleHog now automatically decodes and parses APK files. And it’s really fast. By eliminating the need to decompile, TruffleHog performs apk secret scanning up to nine times faster, drastically reducing the time and effort required to secure Android applications.

9x Faster APK Secret Scanning

We recently downloaded the top 5 most popular apk files from APKMirror: Google Play, Google Authenticator, WhatsApp, Facebook, and Facebook Messenger. 

We scanned each apk 6 times:

  • 3 times using our old method (a script combining the jadx decompiler and the pre-updated TruffleHog) 

  • 3 times using the updated TruffleHog 

The results surprised us.



By decoding apk files natively within TruffleHog, we saw a 9x improvement over our previous open-source apk scanning method.
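If you’d like to run a similar comparison yourself, here is a rough sketch of a timing harness. It is not the script we used for the numbers above; the function names and the choice of wall-clock timing are assumptions made for illustration.

```python
import statistics
import subprocess
import time


def mean_runtime(commands, runs=3):
    """Run a sequence of commands `runs` times; return the mean wall-clock seconds.

    `commands` is a list of argument lists, executed in order on every run.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        for command in commands:
            # Output is discarded; only the elapsed time matters here.
            subprocess.run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def speedup(old_seconds, new_seconds):
    """How many times faster the new method is compared to the old one."""
    return old_seconds / new_seconds
```

For the old method, commands would be the jadx call followed by the trufflehog filesystem call on the decompiled folder. For the new method, it would be the single TruffleHog invocation you use to scan the apk directly.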

Why does this matter?

Apk files are known for leaking keys. Until now, researching apk secret leakage at scale was prohibitively expensive. Imagine having to wait over 3 minutes to scan just one apk file (like in the case of Facebook Messenger). Then multiply that by all the different versions and architecture releases of that app. It just took forever.


This version of Facebook Messenger has 40 variants! Imagine waiting 3 minutes to decompile each one!

Now, researchers and other TruffleHog users can efficiently search for leaked secrets in apk files at scale. We’re hopeful this will help bug bounty hunters, internal security teams, and Android developers. 

So, what exactly are the old and new methods?

The Old Method

In the past, the most straightforward way to scan an apk file for secrets was a two-part process: (1) Decompile the apk using a tool like jadx, (2) Scan the decompiled data using TruffleHog’s filesystem command. It works. And it’s thorough. But it just takes a long time.

Here’s a snippet of Python if you’d like to try it out:


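# Install jadx and trufflehog and add both to your PATH
import subprocess

# Step 1: decompile the apk into the __apkfiles directory
subprocess.run(["jadx", "file.apk", "-d", "__apkfiles"])
# Step 2: scan the decompiled output (secret verification disabled)
subprocess.run(["trufflehog", "filesystem", "__apkfiles", "--no-verification"])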

Note: There are other open-source apk secret scanning tools - like APKLeaks and apkscan - but those tools also leverage external decompilers (like jadx) and rely only on regex matching (without secret verification). Since validating secrets is core to our research at Truffle Security, we did not leverage those tools and instead created this simple Python script. But those tools also search for more than just secrets, so check them out!

The New Method

In adding support for apk files to TruffleHog, we initially looked at external decompilers like jadx and apktool, but those would require users to install a third-party tool, which was a non-starter. Also, while those tools are powerful, we wanted something lighter-weight and faster.

So we went a different route. We researched the most common places secrets leak in Android applications (e.g., AndroidManifest.xml, strings.xml, asset files such as JavaScript, and dex files), identified Golang packages to parse just those files (dextk and apkparser), and then added logic to scan for secrets in those specific locations.



Admittedly, this approach will not conduct as thorough a scan as using an external decompiler and then invoking TruffleHog. Still, it will identify the vast majority of secrets in a fraction of the time. The tradeoff is worth it.

Below is a detailed summary of all the file types the TruffleHog apk scanner checks for secrets.

XML

Android XML files are stored in a unique Android Binary XML format. If you simply unzip an apk file and open one of the .xml files, it looks like this:



Removing those special characters in red makes the text slightly more intelligible. However, there’s still an issue: Android stores XML attribute values (aka strings that could contain API keys and passwords) as reference IDs instead of the actual values. 

For example, instead of storing the plaintext string Example AWS Key, an Android XML file would store a reference to the string like this:


<activity  android:label="@7F0300b3">


The resource ID 7F0300b3 tells us where to find the value, but we need context about the application’s resources to retrieve the actual string value. Android provides that data in a special file named resources.arsc.

By parsing the resources.arsc file, we can build a ResourceTable to look up 7F0300b3 and discover the plaintext string (Example AWS Key). 
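To make the lookup concrete, here is a small sketch. Android resource IDs follow a 0xPPTTEEEE layout (package, type, entry). The dictionary below is a hypothetical stand-in for a real ResourceTable built from resources.arsc, and both function names are ours.

```python
def split_resource_id(reference):
    """Split a reference such as '@7F0300b3' into (package, type, entry) IDs."""
    resource_id = int(reference.lstrip("@"), 16)
    package_id = (resource_id >> 24) & 0xFF  # 0x7F for the app's own resources
    type_id = (resource_id >> 16) & 0xFF     # resource type (string, layout, ...)
    entry_id = resource_id & 0xFFFF          # index of the entry within that type
    return package_id, type_id, entry_id


def resolve(reference, resource_table):
    """Return the plaintext value for a reference, or the reference if unknown."""
    resource_id = int(reference.lstrip("@"), 16)
    return resource_table.get(resource_id, reference)


# Hypothetical table contents; a real table comes from parsing resources.arsc.
resource_table = {0x7F0300B3: "Example AWS Key"}

print(split_resource_id("@7F0300b3"))        # (127, 3, 179)
print(resolve("@7F0300b3", resource_table))  # Example AWS Key
```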

Avast’s open-source apkparser package exposes a function called ParseXml() that accepts both an XML file and a ResourceTable object. It uses the ResourceTable to resolve those resource IDs (in addition to reformatting the binary XML) to deliver a more complete picture of the original .xml file.


Example of secrets found in a reconstructed AndroidManifest.xml file.

Every xml file that TruffleHog identifies in the unzipped apk file is processed through this function with the appropriate ResourceTable context.

strings.xml

One of the xml files most likely to contain a secret is named strings.xml. It’s basically just a bunch of key/value pairs.


Sample strings.xml file.


Unfortunately, reconstructing this file was challenging. When we unzipped the apk files, we couldn’t find a file named strings.xml. But we did find a workaround.

It turns out that the resources.arsc file (discussed above) houses all key/value pairs from the strings.xml file. In particular, those values are located in the resource ID range: 0x7f000000-0x7fffffff. By iterating over every string key/value pair in that resource ID range, we can reconstruct the strings.xml file. It’s a hacky workaround, but we end up scanning the same data stored in the original strings.xml file.
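The idea can be sketched in a few lines of Python. This is an illustration only, not TruffleHog’s Go implementation: the input shape (a dictionary mapping resource IDs to name/value pairs) and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Resource IDs 0x7f000000-0x7fffffff belong to the app itself.
APP_ID_RANGE = range(0x7F000000, 0x80000000)


def rebuild_strings_xml(string_entries):
    """Rebuild a strings.xml-like document from {resource_id: (name, value)}."""
    root = ET.Element("resources")
    for resource_id in sorted(string_entries):
        if resource_id not in APP_ID_RANGE:
            continue  # skip anything outside the app's own ID range
        name, value = string_entries[resource_id]
        element = ET.SubElement(root, "string", name=name)
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical entries; the second ID is outside the app range and is skipped.
entries = {
    0x7F0300B3: ("aws_key", "Example AWS Key"),
    0x01040000: ("framework_label", "OK"),
}
print(rebuild_strings_xml(entries))
```

The printed document contains only the aws_key entry, which is the kind of text a secret scanner can then search.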

Dex

A dex file contains compiled code (it holds the bytecode that Java or Kotlin source code is compiled into, which is what runs on Android). Apk files generally include at least one dex file, usually named classes.dex, but if the app is extensive or modular, there might be multiple dex files, like classes2.dex, classes3.dex, etc.
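Given the naming convention above, picking the dex files out of an apk’s entry list is a simple pattern match. This is a sketch that assumes the classes.dex, classes2.dex, ... convention; a real scanner might match on the .dex extension instead.

```python
import re

# Matches classes.dex, classes2.dex, classes3.dex, ... at the archive root.
DEX_NAME = re.compile(r"^classes\d*\.dex$")


def find_dex_entries(entry_names):
    """Return the entry names that follow the classesN.dex naming convention."""
    return [name for name in entry_names if DEX_NAME.match(name)]


names = ["AndroidManifest.xml", "classes.dex", "classes2.dex", "lib/arm64-v8a/libfoo.so"]
print(find_dex_entries(names))  # ['classes.dex', 'classes2.dex']
```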

This is what a dex file looks like when you open it.



It’s chaos. It makes even less sense than the Android XML file. Fortunately, an awesome Golang package called dextk efficiently parses dex files into “source code.” Emphasis on the quotes around “source code” since, technically, the output is a bunch of bytecode instructions and their values.

For example, this is the sample output provided in the tool’s README.


invoke-interface method=android/os/Parcelable$Creator:createFromParcel:([Landroid/os/Parcel;]):Ljava/lang/Object; args=[2 1]
   move-result-object dst=0
   return-object value=0
   const/4 dst=0 value='0'

Source: https://github.com/csnewman/dextk#:~:text=%2C%20o)%0A%09%7D%0A%7D-,Output%3A,-android/support/v4

The first line above reads invoke-interface method=android/os/Parcelable$Creator:createFromParcel. This translates to calling a function named createFromParcel(). The source code is mostly there; it just looks funny.

In secret scanning, we primarily care about strings since that’s where developers store API keys and passwords. In dex bytecode, string literals all appear in const-string instructions. While we can easily filter and grab the const-string values, we have another problem.

TruffleHog scans for secrets using a technique called keyword pre-flighting. Each secret type (e.g., AWS, Stripe, etc.) has a keyword (e.g., AKIA for AWS, sk_live for Stripe, etc.), and we check that the keyword is near a suspected secret.
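Here is a simplified sketch of the pre-flighting idea: the cheap keyword check runs first, and the more expensive regex only runs when a keyword is present. The detector table and patterns below are illustrative assumptions, not TruffleHog’s actual detectors, and the check is done per chunk of text rather than by measuring distance.

```python
import re

# Illustrative detectors only; real detectors are more involved.
DETECTORS = {
    "aws": {
        "keywords": ["AKIA"],
        "pattern": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    },
    "stripe": {
        "keywords": ["sk_live"],
        "pattern": re.compile(r"\bsk_live_[0-9a-zA-Z]{24}\b"),
    },
}


def scan_chunk(chunk):
    """Return (detector, match) pairs, running a regex only if its keyword appears."""
    findings = []
    for name, detector in DETECTORS.items():
        if not any(keyword in chunk for keyword in detector["keywords"]):
            continue  # pre-flight failed: skip the regex for this detector
        for match in detector["pattern"].finditer(chunk):
            findings.append((name, match.group(0)))
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
print(scan_chunk('aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"'))
```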



To get the required keywords near suspected secrets, we parse several other dex bytecode instruction types (e.g., those that contain a class name or method name) and then place them near each suspected secret. It’s a little convoluted, but it ensures we can scan the decoded source code thoroughly. If you’re curious, the dex scanning logic is here.
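As a rough illustration of placing names near strings, the sketch below builds one text chunk per const-string and includes the class and method names from neighbouring instructions. The (opcode, value) tuple format, the window size, and the choice of opcodes are all assumptions made for this example; the actual logic lives in TruffleHog’s Go code.

```python
# Opcode prefixes whose operands carry a class or method name (assumed subset).
NAME_PREFIXES = ("invoke-", "new-instance")


def build_scan_chunks(instructions, window=2):
    """Return one chunk per const-string, prefixed with nearby class/method names.

    `instructions` is a list of (opcode, value) tuples in a simplified,
    hypothetical form of decoded dex output.
    """
    chunks = []
    for index, (opcode, value) in enumerate(instructions):
        if opcode != "const-string":
            continue
        start = max(0, index - window)
        neighbours = instructions[start:index] + instructions[index + 1:index + 1 + window]
        names = [v for op, v in neighbours if op.startswith(NAME_PREFIXES)]
        chunks.append(" ".join(names + [value]))
    return chunks


instructions = [
    ("new-instance", "com/example/PaymentsClient"),
    ("const-string", "example-secret-value"),
    ("invoke-direct", "com/example/PaymentsClient:setApiKey"),
    ("return-void", ""),
]
print(build_scan_chunks(instructions))
```

Each chunk now carries words like PaymentsClient and setApiKey next to the string, which gives a keyword check something to match.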

Everything Else

All other files in the decompressed and unarchived apk file are scanned normally (no special decoding). This allows TruffleHog to review many other target file types, like *.properties, *.json, and more. We found all kinds of files included as raw assets, including .git directories, sqlite databases, and more.
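Putting the sections above together, the per-file routing can be pictured like this. The handler labels and the function name are hypothetical, and the real scanner may route files differently.

```python
import posixpath


def pick_handler(entry_name):
    """Choose how an apk entry is decoded before scanning (labels are illustrative)."""
    base = posixpath.basename(entry_name)
    if base == "resources.arsc":
        return "resource-table"  # source of the strings.xml key/value pairs
    if base.endswith(".dex"):
        return "dex"             # decode bytecode, collect strings and names
    if base.endswith(".xml"):
        return "binary-xml"      # decode Android Binary XML with the resource table
    return "raw"                 # everything else is scanned as-is


for name in ["AndroidManifest.xml", "classes2.dex", "resources.arsc", "assets/config.json"]:
    print(name, "->", pick_handler(name))
```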

Limitations

Our implementation works super well. But it’s not perfect. There are a few limitations.

  1. Incomplete file coverage. Our scanning is not as thorough as fully decompiling an apk and then scanning each file. There’s no way to do this fast enough (or without external dependencies). But we do search all of the usual locations secrets leak in apk files. Unless a secret leaks in an uncommon location, we’ll find it.

  2. No built-in support for xapk and apkm files. TruffleHog doesn’t scan these out of the box, but if you unzip them and then point TruffleHog to that folder, it will scan the embedded apk files. Similarly, any apk files embedded in a zip must be unzipped first. The reasons are complicated, and we hope to address them soon.

  3. Encrypted and Packed DEX files. We can’t decrypt and unpack dex files. But this is a limitation of other tools too. If you have an idea, we’re open to suggestions!
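For the second limitation, the unzip-then-scan workaround can be scripted. This is a minimal sketch that assumes the bundle is an ordinary zip archive and that trufflehog is on your PATH; the function names are ours.

```python
import pathlib
import subprocess
import zipfile


def extract_bundle(bundle, destination):
    """Unzip an xapk/apkm bundle and return the paths of the apk files inside."""
    destination = pathlib.Path(destination)
    with zipfile.ZipFile(bundle) as archive:
        archive.extractall(destination)
    return sorted(destination.rglob("*.apk"))


def scan_folder(folder):
    """Point TruffleHog's filesystem command at the extracted folder."""
    return subprocess.run(["trufflehog", "filesystem", str(folder)])
```

Call extract_bundle("app.xapk", "extracted") first, then scan_folder("extracted"). The file and folder names here are placeholders.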

Special Thanks

Significant updates to TruffleHog’s open-source code base like this are a team effort. We’d like to thank Richard Gomez and Noman Shaikh for their feedback during development. We also thank Brandon Weeks for his expert review of our apk scanning logic.