Measuring the Usefulness of Multiple Models

The past several years have seen a massive increase in products, services, and features which are powered or enhanced by artificial intelligence: voice recognition, facial recognition, targeted advertisements, and so on. In the anti-virus industry, we’ve seen a similar trend with a push away from traditional, signature-based detection towards fancy machine learning models. Machine learning allows anti-virus companies to leverage large amounts of data and clever feature engineering to build models which can accurately detect malware and scale much better than manual reverse engineering. Using machine learning and training a model is simple enough that there are YouTube videos like “Build an Antivirus in 5 Minutes”. While these home brew approaches are interesting and educational, building and maintaining an effective, competitive model takes a lot longer than 5 minutes.

Filtering Popular Code and Effects on Model Accuracy

I’ve been conducting experiments to try to improve the accuracy of a malware detection tool called Judge. This post is about an experiment that involves finding which classes and methods occur frequently in the training data and excluding them from training. The intuition is that by filtering out popular code, the features will be more representative of what’s unique to each sample.

This idea, along with several others such as model stacking and blending (which I write about in Measuring the Usefulness of Multiple Models) and some implementation ideas, were given to me by Nikita Buchka (@advegoc and blogs).
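
The filtering idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not how Judge does it: each sample is assumed to be a list of method signatures, and the function names, the example signatures, and the 90% cutoff are all made up for the example.

```python
from collections import Counter


def find_popular(samples, max_fraction=0.5):
    """Return the features that appear in more than max_fraction of samples."""
    counts = Counter()
    for features in samples:
        # Count each feature once per sample, however often it repeats inside it.
        counts.update(set(features))
    limit = max_fraction * len(samples)
    return {feature for feature, n in counts.items() if n > limit}


def filter_popular(samples, popular):
    """Drop the popular features from every sample before training."""
    return [[f for f in features if f not in popular] for features in samples]


# Hypothetical training data: one list of method signatures per sample.
samples = [
    ["Landroid/util/Log;->d", "Lcom/evil/a;->x"],
    ["Landroid/util/Log;->d", "Lcom/game/Main;->run"],
    ["Landroid/util/Log;->d", "Lcom/evil/a;->x", "Lcom/evil/b;->y"],
]

popular = find_popular(samples, max_fraction=0.9)
filtered = filter_popular(samples, popular)
```

Here the logging call shows up in every sample, so it is the only feature that gets filtered out; what remains is closer to the code that is unique to each sample.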

Using Markov Chains for Android Malware Detection

If you’re chatting with someone, and they tell you “aslkjeklvm,e,zk3l1” then they’re speaking gibberish. But how can you teach a computer to recognize gibberish, and more importantly, why bother? I’ve looked at a lot of Android malware, and I’ve noticed that many samples have gibberish strings as literals in the code, as class names, in the signature, and so on. My hypothesis was that if you could quantify gibberishness, it would be a good feature in a machine learning model. I’ve tested this intuition, and in this post I’ll be sharing my results.
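
One way to quantify gibberishness is a character-level Markov chain: learn how often each character follows another in normal text, then score a string by the average log probability of its character transitions. The sketch below is a minimal illustration of that general technique, not necessarily the model used in the post. The three-sentence corpus, the add-one smoothing, and every function name are assumptions; a usable model would need far more training text.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return [c for c in text.lower() if c in ALPHABET]


def train(corpus):
    """Count character-to-character transitions across the corpus lines."""
    counts = defaultdict(Counter)
    for line in corpus:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts


def avg_log_prob(counts, text):
    """Average log probability per transition; lower means more gibberish-like."""
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for a, b in pairs:
        row = counts.get(a, {})
        row_total = sum(row.values())
        # Add-one smoothing so unseen transitions get a small, nonzero probability.
        total += math.log((row.get(b, 0) + 1) / (row_total + len(ALPHABET)))
    return total / len(pairs)


# Tiny stand-in corpus, purely for illustration.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "this is a simple sentence written in plain english",
    "malware authors often hide strings in their code",
]
model = train(corpus)

english_score = avg_log_prob(model, "the code is plain")
gibberish_score = avg_log_prob(model, "aslkjeklvm,e,zk3l1")
```

With this corpus the English phrase scores higher than the gibberish string, and that score (or a threshold on it) is the kind of number that could be fed into a classifier as a feature.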

Monitoring HTTPS Traffic of a Single App on OSX

If you reverse engineer network protocols or do any other network security stuff, you’ve probably needed to collect network traffic at least once – either to understand a protocol or look for sensitive information. Back in the good old days, this simply meant firing up tcpdump and watching those sweet, plaintext packets flow on by. Now, everyone has a stick up their butts about encryption – bunch of cry babies couldn’t handle getting their accounts hacked and their private info sold on the deep dark web for a few hundred dogecoin.

These days, doing any network analysis absolutely requires knowledge of HTTPS / SSL / TLS interception, and it turns out to be nontrivial almost all of the time! Of course, this makes sense because the entire point of TLS is to secure your communication. Like any other seldom trodden path, intercepting TLS has some caveats. First, you have to grok how Man-in-the-Middle works, how certificates work and how to install them on your system, and how to massage your OS and certain apps into using those certs. Finally, you’ve got to navigate a bunch of proxy documentation and configuration to actually intercept and display the traffic.

In this post, I’ll be describing how to monitor the encrypted HTTPS traffic of a single app on macOS as well as solutions to some of the frustrating problems I encountered.

How Bitcoin Improves Free Speech and Government

Many people are introduced to Bitcoin and other cryptocurrencies merely as a way to make money investing. They see the price rising, buy in, and hope it goes “to the moon”, without really understanding what it is or why the price is moving. I’m glad Bitcoin is getting popular. I’m a huge fan, but I don’t give a single shit about how good of an investment it is. Even though you might make a lot of money investing, it pales in comparison to how Bitcoin can fundamentally change the world.

I can talk all day about how a big chunk of the world is unbanked and doesn’t have access to financial services, how this is a huge problem, and who knows what will happen when 4 billion people suddenly have access to savings accounts and loans with a simple feature phone with internet. I can write pages and pages about how Bitcoin enables truly micro transactions and how I think it’ll fit in nicely as payment models for AI powered robot services, self-driving cars, and media streaming services. The list goes on and on, but this post covers how Bitcoin bolsters the power of speech and improves our relationship with government.

Calling JNI Functions with Java Object Arguments from the Command Line

When analyzing malware or penetration testing an app which uses a native library, it’s helpful to isolate and execute the library’s functions. This opens the door for debugging and using the malware’s own code against it. For example, if the malware has encrypted strings and the decryption is done by a native function, you could either spend a bunch of time reversing the algorithm to write your own decryption routine or you could just harness the function such that you can execute it with arbitrary inputs. If the malware author completely changes their decryption, you might not have to change anything. In this post, I’ll explain how to harness a native library and execute its functions even if they require arguments from a live JVM instance.

In a previous post, I explained how to create a Java VM from Android native code but I didn’t give any real examples of how to use it. In this post, I’ll give a concrete example.

Creating a Java VM from Android Native Code

If you’re writing native / JNI code for Android, it’s probably as a native method of an Android app. These methods are always passed a handle to the app’s Dalvik VM (the JNIEnv pointer) as the first parameter. You need this to create jstrings and other Java objects, lookup classes and fields, etc. It’s not normal for you to have to instantiate a VM from native code because most of the time, if you’re using the Java Native Interface (JNI), you started in Java land and are only dipping into native code land for them sweet, sweet performance benefits. However, if you’re reverse engineering or writing an exploit, you’re likely always delving into all kinds of unusual trouble which the developers reasonably believed would never happen or at least would only be a theoretical edge case.

I recently needed to create a VM from native code to pass Java object arguments to a JNI function. In this post, I want to share what I came up with and why I finally settled on this particular method.

Building with and Detecting Android's Jack Compiler

Recently, I needed to write a bunch of Smali code to use in tests for Simplify. While Smali syntax is simple and fairly easy to write, it’s also tedious and I needed to do some tricky, uncommon stuff. I wasn’t even sure how to do it in Smali. Luckily, it’s pretty easy to write Java and convert it to Smali. In a previous post, I talked about how to make a small alias to do this and went over some other use cases. Writing Java and converting to Smali makes it easy to quickly prototype lots of Smali code without worrying about Smali syntax or conventions. In this post, I want to show how to use a new Android compiler called Jack, which takes the place of dx and which you’ll need to know how to use if you want to continue converting Java to Smali.

Understanding Dalvik Static Fields part 2 of 2

In the first part of this series on Dalvik class fields, I wrote about how Dalvik handles static field literals. This article is focused on how field inheritance works and exploring all the different but equally valid ways of referencing fields at the bytecode level.

If you are familiar with Java, you probably already understand how Java field inheritance looks and behaves at the source code level, but bytecode is less strict and potentially more ambiguous (at least to humans) than source. JVM languages like Scala and Groovy compile to the same bytecode as Java, but both have very different source code restrictions.
