I’ve been conducting experiments to try and improve accuracy of a malware detection tool called Judge. This post is about an experiment involving finding which classes and methods occur frequently in the training data and excluding them from training. The intuition is that by filtering out popular code, the features will be more representative of what’s unique to each sample.
This idea, as well as several other such as model stacking and blending which I write about in Measuring the Usefulness of Multiple Models, as well as some implementation ideas were given to me by Nikita Buchka (@advegoc and blogs).
Fingerprinting Code
The first step in filtering out popular code is creating a fingerprinting mechanism which can produce a hash for code at the class and method level. Identical or very nearly identical Java code should map to the same hash value, and any small change in the code should lead to a totally different hash.
How can you do this? You might think you can just hash the actual Dalvik executable bytes of a method to get a fingerprint, but the exact same Java code may look totally different at the byte level because of differing register layouts and virtual table lookup indices.
My approach was to transform class and method properties as well as opcodes into a string. Properties include annotation size, interface count, access flags, register count, and so on. This string representation should be highly brittle and sensitive to almost any changes. Finally, the string is hashed into a long
using an implementation of MurmurHash3.
The source code is available here: DexFileFingerprinter.java. Also note, there’s a boolean strict
flag which is true by default. This was an untested idea of building less strict fingerprint strings of methods which would map code which was mostly similar but not exactly identical.
Measuring Popularity
Popularity was measured by fingerprinting every class and method in the benign sample set. This was 107916 apps taken from a variety of sources including non-english speaking markets. Labels were determined by a combination of VirusTotal and manual scanning. VirusTotal criteria were that it had 0 positives and had been scanned after 2 months of it first appearing on VirusTotal. This is to give vendors time to react to new threats. Since vendors tend to put highly sensitive models on VirusTotal, this should slightly bias the data set away from anything remotely resembling adware. However, this was remediated somewhat by adding ~50,000 samples from randomly crawled markets, many of which don’t appear on VirusTotal. These were determined to be benign by scanning them with multiple Android anti-malware products.
Popularity Distributions
Below is the class hash popularity distribution:
As you can see from the above figure, there is a very long tail to the distributions. This means that many classes are only seen rarely and there are only a few classes which are popular. Further, popular classes tend to be much, much more popular than non-popular classes. These popular classes probably include empty classes and classes from popular SDKs and libraries.
Below is the method hash popularity distribution which follows a similar distribution:
The code for generating these hashes is available here: graph.py
The data for these graphs is here: android-hash-counts.7z
Selecting Hashes
The threshold for what constituted a “popular” class was 75, which was determined visually. In other words, I kinda eyeballed it. For methods, the threshold was 100. This means, if a class was seen 75 times or a method was seen 100 times, it should be skipped when extracting features. However, this could lead to a loss of valuable information, maybe, so I added a few features for number of classes and methods skipped to the finished model.
The code for selecting hashes is available here: select_popular_hashes.py
Top 100 Class Paths
Below is a list of the top 100 class paths, sorted by the number of times they were seen. Note, the class path is the first class path seen, not a list of all class paths associated with a hash. From manual inspection, the first several classes were mostly empty and it’s likely that many paths map to the same hash.
# | Hash | Count | Path |
---|---|---|---|
1 | -6539677568569540121 | 87112 | Lpl/realitypump/ironsky/R; |
2 | 7879150481849290686 | 74164 | Lxr; |
3 | -2675497598636306534 | 70696 | Lcom/realitypump/G2/G2GLSurfaceView$1; |
4 | -5007384885342135023 | 69153 | Lcom/devilishgames/nativeextensions/admob/R$layout; |
5 | 7877155950984699004 | 68682 | Landroid/support/v4/app/SuperNotCalledException; |
6 | -7642079844424071458 | 65704 | Lcom/socialize/ui/comment/CommentDetailActivity; |
7 | 4173837491973374472 | 62937 | Landroid/support/v4/app/FragmentManagerImpl$1; |
8 | -5936151236995089740 | 60471 | Layj; |
9 | 2308415025046174030 | 57786 | Landroid/support/v4/content/ModernAsyncTask$4; |
10 | 6014432581692261376 | 56086 | Lcom/facebook/internal/FileLruCache$2; |
11 | -4099485795817903106 | 55354 | Lfq; |
12 | -902756984570334141 | 53880 | Lgj; |
13 | 2544757508923420717 | 53475 | Lcom/google/android/gms/internal/i; |
14 | -3399958614432743607 | 51806 | Landroid/support/v4/view/ViewConfigurationCompatFroyo; |
15 | 6743212674057076693 | 51617 | Landroid/support/v4/app/FragmentManagerState; |
16 | -7929451300585271506 | 51499 | Landroid/support/v4/view/accessibility/AccessibilityManagerCompat$AccessibilityManagerIcsImpl$1; |
17 | -5808535828223012233 | 51454 | Landroid/support/v4/view/VelocityTrackerCompatHoneycomb; |
18 | -740089631434253045 | 51388 | Lfo; |
19 | -5945944459787456127 | 51047 | Landroid/support/v4/app/ActivityCompatHoneycomb; |
20 | 4074238650528361008 | 50715 | Landroid/support/v4/view/ViewCompatGingerbread; |
21 | 8457541340339836364 | 50561 | Landroid/support/v4/app/ShareCompatJB; |
22 | -7602298695408306862 | 50458 | Landroid/support/v4/util/LogWriter; |
23 | 5534139342145712559 | 50432 | Lcom/google/ads/internal/j$a$1; |
24 | -4664196824785783660 | 50412 | Lretrofit/http/Streaming; |
25 | -989624789237851679 | 50381 | Lpl/realitypump/ironsky/R$style; |
26 | -5573814636106502307 | 50345 | Landroid/support/v4/view/ViewPager$Decor; |
27 | -4316020411289880285 | 50316 | Lpl/realitypump/ironsky/BuildConfig; |
28 | 2305477454044242758 | 50214 | Landroid/support/v4/content/pm/ActivityInfoCompat; |
29 | -2217179160054368884 | 49972 | Landroid/support/v4/app/TaskStackBuilderJellybean; |
30 | -3424197808730443223 | 49926 | Landroid/support/v4/view/ViewGroupCompatIcs; |
31 | 906934702213877563 | 49773 | Landroid/support/v4/accessibilityservice/AccessibilityServiceInfoCompatIcs; |
32 | -8430991917603055454 | 49760 | Landroid/support/v4/os/ParcelableCompatCreatorCallbacks; |
33 | 8802143954482027195 | 49534 | Landroid/support/v4/app/Fragment$InstantiationException; |
34 | 1819905089452899121 | 49474 | Landroid/support/v4/app/NavUtilsJB; |
35 | -2969348625307750576 | 49298 | Landroid/support/v4/net/ConnectivityManagerCompatJellyBean; |
36 | -9115786371403805780 | 49239 | Lcom/devilishgames/nativeextensions/admob/R$attr; |
37 | 7268059251633904232 | 49229 | Lee; |
38 | -7105014750669022041 | 49090 | Landroid/support/v4/widget/ResourceCursorAdapter; |
39 | 7889800717182803764 | 49026 | Lcom/ormma/view/Browser$4; |
40 | -5917040822247039264 | 49005 | Landroid/support/v4/view/VelocityTrackerCompat$BaseVelocityTrackerVersionImpl; |
41 | -5720493436074601324 | 48997 | Landroid/support/v4/view/VelocityTrackerCompat$HoneycombVelocityTrackerVersionImpl; |
42 | -1588742293679105513 | 48978 | Landroid/support/v4/os/ParcelableCompatCreatorHoneycombMR2Stub; |
43 | -1957521141270853035 | 48707 | Lcom/google/ads/util/g$1; |
44 | 4547472865708723651 | 48653 | Landroid/support/v4/os/ParcelableCompat$CompatCreator; |
45 | 679737453138114582 | 48048 | Landroid/support/v4/view/KeyEventCompatHoneycomb; |
46 | 7698708090613627434 | 47722 | Landroid/support/v4/app/FragmentManager$OnBackStackChangedListener; |
47 | 2814151969022890373 | 47699 | Landroid/support/v4/content/LocalBroadcastManager$BroadcastRecord; |
48 | 4341191700874667005 | 47672 | Landroid/support/v4/content/ContextCompatJellybean; |
49 | 7588347216197350949 | 47672 | Landroid/support/v4/content/ModernAsyncTask$WorkerRunnable; |
50 | 5989011503337385482 | 47576 | Landroid/support/v4/view/ViewGroupCompat$ViewGroupCompatIcsImpl; |
51 | -8484019840654742338 | 47487 | Landroid/support/v4/view/accessibility/AccessibilityRecordCompatJellyBean; |
52 | -7092896683557467419 | 47449 | Lcom/ormma/controller/Defines; |
53 | -4701498684616902325 | 47362 | Lcom/socialize/R; |
54 | 530400150896447174 | 47339 | Landroid/support/v4/view/ViewCompat$GBViewCompatImpl; |
55 | -5103961978418681524 | 47338 | Lcom/facebook/widget/GraphObjectAdapter$3; |
56 | 645745845319975076 | 47300 | Landroid/support/v4/view/ViewPager$PagerObserver; |
57 | 5597419562948276843 | 47252 | Landroid/support/v4/app/ListFragment$1; |
58 | 662320570537411526 | 47214 | Landroid/support/v4/widget/CursorAdapter$MyDataSetObserver; |
59 | 6941237681180831691 | 47206 | Landroid/support/v4/view/ViewPager$OnPageChangeListener; |
60 | 547540514056755988 | 47176 | Landroid/support/v4/app/LoaderManager$LoaderCallbacks; |
61 | 8037420148745296490 | 47120 | Landroid/support/v4/app/Fragment$SavedState; |
62 | 1431765083981923887 | 47057 | Landroid/support/v4/view/accessibility/AccessibilityNodeProviderCompat$AccessibilityNodeProviderStubImpl; |
63 | 4215379159449676179 | 47043 | Landroid/support/v4/view/accessibility/AccessibilityRecordCompatIcs; |
64 | -1402826908821352158 | 47038 | Landroid/support/v4/view/accessibility/AccessibilityNodeInfoCompatIcs; |
65 | 6569434689745479135 | 47027 | Landroid/support/v4/view/VelocityTrackerCompat$VelocityTrackerVersionImpl; |
66 | 8380934945701363758 | 46987 | Landroid/support/v4/widget/CursorAdapter$ChangeObserver; |
67 | -4672682710599146656 | 46984 | Lcom/google/android/gms/drive/metadata/SearchableOrderedMetadataField; |
68 | 2693836027826835029 | 46886 | Landroid/support/v4/view/ViewPager$SimpleOnPageChangeListener; |
69 | -786851131260004710 | 46787 | Landroid/support/v4/widget/CursorFilter$CursorFilterClient; |
70 | 8667997251744534852 | 46787 | Landroid/support/v4/app/BackStackRecord$Op; |
71 | -5475116159158576131 | 46748 | Landroid/support/v4/view/accessibility/AccessibilityRecordCompatIcsMr1; |
72 | 5002996181779391509 | 46723 | Landroid/support/v4/view/accessibility/AccessibilityNodeInfoCompatJellyBean; |
73 | 4936223830400664217 | 46719 | Landroid/support/v4/content/LocalBroadcastManager$1; |
74 | 239062217891206692 | 46658 | Landroid/support/v4/content/Loader$OnLoadCompleteListener; |
75 | -4913400411363934046 | 46548 | Landroid/support/v4/view/AccessibilityDelegateCompatIcs$AccessibilityDelegateBridge; |
76 | 5650689994489871625 | 46535 | Landroid/support/v4/view/AccessibilityDelegateCompatIcs$1; |
77 | 4892105813053381800 | 46481 | Landroid/support/v4/content/ModernAsyncTask$AsyncTaskResult; |
78 | 6021955058463343957 | 46358 | Landroid/support/v4/app/FragmentActivity$FragmentTag; |
79 | 997731539676885091 | 46090 | Landroid/support/v4/view/AccessibilityDelegateCompat$AccessibilityDelegateIcsImpl; |
80 | 1092001379795394415 | 46030 | Lcom/google/android/gms/common/GooglePlayServicesNotAvailableException; |
81 | 6233302609198898297 | 46028 | Landroid/support/v4/widget/SimpleCursorAdapter$ViewBinder; |
82 | 855966447952261336 | 46024 | Landroid/support/v4/widget/SimpleCursorAdapter$CursorToStringConverter; |
83 | 720132315695652952 | 45931 | Landroid/support/v4/app/FragmentState$1; |
84 | -4176127283910384678 | 45931 | Landroid/support/v4/app/FragmentManagerState$1; |
85 | 4286589477084046391 | 45931 | Landroid/support/v4/app/BackStackState$1; |
86 | 3462422210892878260 | 45776 | Landroid/support/v4/app/ListFragment$2; |
87 | -7093935369431053865 | 45718 | Landroid/support/v4/view/AccessibilityDelegateCompat$AccessibilityDelegateIcsImpl$1; |
88 | 5066173829313712639 | 45678 | Landroid/support/v4/app/Fragment$SavedState$1; |
89 | 3904811169429859734 | 45667 | Landroid/support/v4/content/Loader$ForceLoadContentObserver; |
90 | -2256735085640949925 | 45608 | Landroid/support/v4/view/AccessibilityDelegateCompat$AccessibilityDelegateJellyBeanImpl; |
91 | 343007254769290917 | 45595 | Landroid/support/v4/view/ViewPager$SavedState$1; |
92 | 3277445617080844160 | 45450 | Landroid/support/v4/view/accessibility/AccessibilityNodeProviderCompat$AccessibilityNodeProviderJellyBeanImpl; |
93 | -8522484960838180971 | 45430 | Lcom/google/ads/util/a$a; |
94 | 7484484164573575266 | 45385 | Landroid/support/v4/content/ModernAsyncTask$2; |
95 | 504944274815651800 | 45314 | Landroid/support/v4/view/ViewPager$1; |
96 | 5364514390186567115 | 45165 | Landroid/support/v4/view/accessibility/AccessibilityManagerCompatIcs$AccessibilityStateChangeListenerBridge; |
97 | -1930552062862662117 | 45153 | Landroid/support/v4/view/AccessibilityDelegateCompatIcs; |
98 | -1303476657272658237 | 45088 | Landroid/support/v4/view/accessibility/AccessibilityEventCompat$AccessibilityEventIcsImpl; |
99 | 8194078639849743862 | 45063 | Landroid/support/v4/app/LoaderManager; |
100 | -5328076847440050904 | 45023 | Landroid/support/v4/view/accessibility/AccessibilityRecordCompat$AccessibilityRecordJellyBeanImpl; |
Top 100 Method Signatures
Note, method signature is first one seen and the caveats about class paths also apply here – specifically, many popular methods are empty or nearly empty.
# | Hash | Count | Signature |
---|---|---|---|
1 | -7642079844424071458 | 108157 | Lcom/aaplab/android/robird/util/ThemeUtils;-> |
2 | -6539677568569540121 | 92107 | Lcom/handmark/pulltorefresh/library/R;-> |
3 | 8202678715559741078 | 85637 | Lcom/nostra13/universalimageloader/core/LoadAndDisplayImageTask$1;-> |
4 | 4890160260607866519 | 84235 | Ltwitter4j/internal/logging/LoggerFactory;-> |
5 | 2609197996308319503 | 84222 | Ltwitter4j/UserStreamAdapter;-> |
6 | 1118179837447239723 | 83255 | Lcom/google/analytics/tracking/android/HitBuilder;-> |
7 | -5936151236995089740 | 80384 | Lcom/nostra13/universalimageloader/core/assist/FlushedInputStream;-> |
8 | -7680277393425055583 | 78740 | Ltwitter4j/internal/http/HttpClientImpl$1;-> |
9 | -7092896683557467419 | 78187 | Lcom/nostra13/universalimageloader/core/DefaultConfigurationFactory;-> |
10 | -4701498684616902325 | 78131 | Ltwitter4j/auth/AuthorizationFactory;-> |
11 | -3124051984392334424 | 74756 | Lcom/aaplab/android/robird/ui/widget/photoview/ScrollerProxy;-> |
12 | -3752891835737638512 | 70619 | Lcom/nostra13/universalimageloader/core/ImageLoaderEngine$1;-> |
13 | 7877155950984699004 | 69899 | Landroid/support/v4/app/SuperNotCalledException;-> |
14 | -5007384885342135023 | 69160 | Lcom/aaplab/android/robird/R$xml;-> |
15 | 4628587325345143326 | 67922 | Lcom/google/analytics/tracking/android/AnalyticsGmsCoreClient$AnalyticsServiceConnection;-> |
16 | 3307061431700887237 | 67203 | Ltwitter4j/internal/logging/SLF4JLoggerFactory;-> |
17 | -8202474463569554070 | 67026 | Lcom/nostra13/universalimageloader/core/assist/DiscCacheUtil;-> |
18 | -9115786371403805780 | 66898 | Lcom/urbandroid/sleep/R$attr;-> |
19 | 8771223191024956139 | 66782 | Landroid/support/v4/app/NoSaveStateFrameLayout;-> |
20 | -8728496932605184659 | 66756 | Lcom/squareup/otto/ThreadEnforcer$2;-> |
21 | 336673459420538461 | 66275 | Lcom/nostra13/universalimageloader/core/assist/deque/LinkedBlockingDeque$Itr;-> |
22 | 5729975003762079487 | 66226 | Lcom/nostra13/universalimageloader/core/assist/deque/LinkedBlockingDeque$Itr;-> |
23 | -2401040049338412171 | 65842 | Lcom/nostra13/universalimageloader/core/display/SimpleBitmapDisplayer;-> |
24 | -2929060470316057339 | 65614 | Landroid/support/v4/database/DatabaseUtilsCompat;-> |
25 | 6315692033328304976 | 65281 | Landroid/support/v4/widget/SlidingPaneLayout$SlidingPanelLayoutImplBase;-> |
26 | -893111063096445170 | 65223 | Lcom/nostra13/universalimageloader/core/assist/MemoryCacheUtil$1;-> |
27 | -675924626915601184 | 65006 | Lcom/aaplab/android/robird/ui/SearchFragment$1;-> |
28 | -3847130604763184100 | 64027 | Lcom/urbandroid/common/http/StringResponseHandlerTemplate;-> |
29 | 1416323870660580531 | 63987 | Lcom/google/analytics/tracking/android/NoopDispatcher;-> |
30 | -3784336663845911879 | 63855 | Lcom/aaplab/android/robird/ui/widget/popupmenu/ListPopupWindowCompat$ResizePopupRunnable;-> |
31 | 8297103538565484969 | 63801 | Lcom/aaplab/android/robird/ui/widget/popupmenu/ListPopupWindowCompat$ResizePopupRunnable;-> |
32 | 4173837491973374472 | 63394 | Lcom/handmark/pulltorefresh/library/PullToRefreshBase$1;->run()V |
33 | -945002099871771481 | 62319 | Lcom/google/android/gms/internal/ar;-> |
34 | 2544757508923420717 | 61319 | Lcom/google/android/gms/internal/m;-> |
35 | -8312584374661018230 | 60614 | Lcom/google/android/gms/wallet/o;-> |
36 | -3011160232611646563 | 59302 | Lcom/twitter/Extractor$1;-> |
37 | 3315472411031331071 | 59008 | Lcom/handmark/pulltorefresh/library/extras/PullToRefreshWebView2$JsValueCallback;-> |
38 | -5559722447751808665 | 58966 | Lde/keyboardsurfer/android/widget/crouton/Manager$1;-> |
39 | 3699353597961706907 | 58289 | Landroid/support/v4/widget/SlidingPaneLayout$SlidingPanelLayoutImplJBMR1;-> |
40 | -8046356283759202622 | 58235 | Landroid/support/v4/widget/SlidingPaneLayout$SimplePanelSlideListener;-> |
41 | 6652028519967212390 | 57846 | Lcom/squareup/otto/Bus$2;-> |
42 | 7268059251633904232 | 57804 | Lcom/google/android/gms/maps/model/RuntimeRemoteException;-> |
43 | 2308415025046174030 | 57787 | Lcom/nostra13/universalimageloader/utils/ImageSizeUtils$1;-> |
44 | 6014432581692261376 | 56721 | Lcom/google/analytics/tracking/android/GAServiceProxy$2;->run()V |
45 | -7564279190829922402 | 56698 | Landroid/support/v4/content/ModernAsyncTask$InternalHandler;-> |
46 | -8522484960838180971 | 56075 | Lcom/handmark/pulltorefresh/library/PullToRefreshScrollView;-> |
47 | -8522152698274272858 | 56041 | Landroid/support/v4/widget/SearchViewCompatIcs;-> |
48 | 3363593219160481475 | 55914 | Landroid/support/v4/app/ShareCompatICS;-> |
49 | -6926701068898159683 | 55805 | Landroid/support/v4/content/ModernAsyncTask$InternalHandler;-> |
50 | 8721373828724890608 | 55392 | Landroid/support/v4/app/FragmentManagerState;-> |
51 | 4213088151749349370 | 55379 | Landroid/support/v4/app/FragmentManagerState;-> |
52 | 5389859633888440453 | 55377 | Landroid/support/v4/app/FragmentManagerState;-> |
53 | 7889800717182803764 | 55343 | Lcom/google/android/gms/tagmanager/o$1;->bc(Ljava/lang/String;)V |
54 | -1937623317952501810 | 55045 | Lcom/unity3d/player/d$2;-> |
55 | -1853703880280089887 | 54821 | Landroid/support/v4/widget/ListViewAutoScrollHelper;-> |
56 | 153996737814003651 | 54818 | Lcom/aaplab/android/robird/util/Events$PullDownToRefreshEvent;-> |
57 | -757778117290738428 | 54356 | Landroid/support/v4/widget/ViewDragHelper$Callback;-> |
58 | -7013990137290522882 | 54191 | Landroid/support/v4/os/ParcelableCompatCreatorHoneycombMR2;-> |
59 | -526199542608917401 | 54151 | Lcom/urbandroid/sleep/alarmclock/DontPressWithParentLayout;-> |
60 | 8802143954482027195 | 54126 | Landroid/support/v4/app/Fragment$InstantiationException;-> |
61 | -8778086982190029168 | 53895 | Landroid/support/v4/os/ParcelableCompatCreatorHoneycombMR2;->createFromParcel(Landroid/os/Parcel;)Ljava/lang/Object; |
62 | 2590316452339699669 | 53895 | Landroid/support/v4/os/ParcelableCompatCreatorHoneycombMR2;->createFromParcel(Landroid/os/Parcel;Ljava/lang/ClassLoader;)Ljava/lang/Object; |
63 | -902756984570334141 | 53880 | Landroid/support/v4/os/ParcelableCompatCreatorHoneycombMR2;->newArray(I)[Ljava/lang/Object; |
64 | -8155489102404949519 | 53812 | Lcom/google/analytics/tracking/android/FutureApis;-> |
65 | 7706868487092767732 | 53473 | Landroid/support/v4/app/ListFragment;-> |
66 | 5534139342145712559 | 53314 | Lcom/aaplab/android/robird/ui/MediaFragment$1;->onLoadingCancelled(Ljava/lang/String;Landroid/view/View;)V |
67 | -2916310843770363879 | 53211 | Lcom/aaplab/android/robird/ui/widget/HackyViewPager;-> |
68 | -5945944459787456127 | 53070 | Landroid/support/v4/app/ActivityCompatHoneycomb;->invalidateOptionsMenu(Landroid/app/Activity;)V |
69 | -3589436608246049539 | 53070 | Landroid/support/v4/app/ActivityCompatHoneycomb;->dump(Landroid/app/Activity;Ljava/lang/String;Ljava/io/FileDescriptor;Ljava/io/PrintWriter;[Ljava/lang/String;)V |
70 | -7627280748607221422 | 52984 | Landroid/support/v4/app/FragmentPagerAdapter;-> |
71 | -8514214176211164536 | 52797 | Lcom/google/android/gms/analytics/internal/IAnalyticsService$Stub$Proxy;-> |
72 | 8321315701923846730 | 52579 | Lcom/google/android/gms/dynamic/a$2;-> |
73 | -683162621362766186 | 52442 | Landroid/support/v4/content/ModernAsyncTask$3;-> |
74 | 5818020461474712792 | 52404 | Lcom/google/android/gms/analytics/internal/Command$1;->createFromParcel(Landroid/os/Parcel;)Lcom/google/android/gms/analytics/internal/Command; |
75 | -4927133974873939683 | 52197 | Lcom/google/android/apps/dashclock/api/ExtensionData$1;->createFromParcel(Landroid/os/Parcel;)Lcom/google/android/apps/dashclock/api/ExtensionData; |
76 | -3399958614432743607 | 52112 | Landroid/support/v4/view/ViewConfigurationCompatFroyo;->getScaledPagingTouchSlop(Landroid/view/ViewConfiguration;)I |
77 | -2969348625307750576 | 51979 | Landroid/support/v4/view/KeyEventCompatEclair;->isTracking(Landroid/view/KeyEvent;)Z |
78 | -2353508751924428670 | 51901 | Lcom/google/android/gms/games/multiplayer/realtime/RoomEntity$a;-> |
79 | -7929451300585271506 | 51785 | Landroid/support/v4/view/accessibility/AccessibilityManagerCompat$AccessibilityManagerIcsImpl$1;->onAccessibilityStateChanged(Z)V |
80 | 1657002645692441933 | 51685 | Landroid/support/v4/view/VelocityTrackerCompatHoneycomb;->getXVelocity(Landroid/view/VelocityTracker;I)F |
81 | -5901020707757742812 | 51678 | Landroid/support/v4/util/LogWriter;-> |
82 | -780048243655997484 | 51617 | Landroid/support/v4/app/FragmentManagerState;->describeContents()I |
83 | 6743212674057076693 | 51617 | Landroid/support/v4/app/FragmentManagerState;->writeToParcel(Landroid/os/Parcel;I)V |
84 | -736970292080916002 | 51616 | Lcom/google/android/gms/analytics/internal/Command$1;->createFromParcel(Landroid/os/Parcel;)Ljava/lang/Object; |
85 | -5241685838796254039 | 51611 | Lcom/google/android/gms/analytics/internal/IAnalyticsService$Stub$Proxy;->asBinder()Landroid/os/IBinder; |
86 | -5808535828223012233 | 51461 | Landroid/support/v4/view/VelocityTrackerCompatHoneycomb;->getYVelocity(Landroid/view/VelocityTracker;I)F |
87 | -2423550744584704325 | 51331 | Lcom/google/android/gms/internal/kc;-> |
88 | -6705063738206203334 | 51229 | Lcom/google/android/gms/internal/x;-> |
89 | -5437951835842775581 | 51091 | Landroid/support/v4/content/ModernAsyncTask$1;-> |
90 | 3400939097257307009 | 51055 | Lcom/google/android/gms/tagmanager/cx;-> |
91 | 9207932117182213076 | 50985 | Landroid/support/v4/view/MotionEventCompatEclair;->findPointerIndex(Landroid/view/MotionEvent;I)I |
92 | 8447758234808132189 | 50880 | Landroid/support/v4/app/NavUtilsJB;->getParentActivityIntent(Landroid/app/Activity;)Landroid/content/Intent; |
93 | 4074238650528361008 | 50793 | Landroid/support/v4/view/ViewCompatGingerbread;->setOverScrollMode(Landroid/view/View;I)V |
94 | -6034617258200236949 | 50764 | Landroid/support/v4/widget/ResourceCursorAdapter;-> |
95 | -5799817234132788985 | 50745 | Landroid/support/v4/widget/ResourceCursorAdapter;-> |
96 | 3788487412323627274 | 50741 | Landroid/support/v4/util/LogWriter;->flush()V |
97 | -6252821634745382551 | 50741 | Landroid/support/v4/util/LogWriter;->flushBuilder()V |
98 | 2401124206520909086 | 50741 | Landroid/support/v4/util/LogWriter;->close()V |
99 | -4316020411289880285 | 50701 | Lcom/handmark/pulltorefresh/library/BuildConfig;-> |
100 | 8457541340339836364 | 50645 | Landroid/support/v4/content/IntentCompatHoneycomb;->makeMainActivity(Landroid/content/ComponentName;)Landroid/content/Intent; |
Impacts on Model Accuracy
Unfortunately, it seemed like filtering classes as described above had a negative impact on model accuracy. It’s not fun to write about negative results, but it might spark some useful discussion or prevent someone else from wasting time. If you’ve found opposite effects, I’d love to know more details about how you implemented it.
Models were scored by selecting the top 100, 500, 1000, 5000, and 10,000 features using mutual information scoring and then performing a 3 fold cross validation. These metrics aren’t perfect since they don’t represent total possible performance, but they should be good for comparison.
Below is a table of model performance scores without popularity filtering:
features | recall | precision | accuracy |
---|---|---|---|
100 | 0.946433740702 | 0.977091995842 | 0.967735883444 |
500 | 0.938567454973 | 0.974466194496 | 0.96336181481 |
1000 | 0.946597359445 | 0.978583324659 | 0.968432732614 |
5000 | 0.955420185518 | 0.980420266832 | 0.972887206921 |
10000 | 0.958403081067 | 0.981061094076 | 0.97440419396 |
Below are the model scores with popularity filtering:
features | recall | precision | accuracy |
---|---|---|---|
100 | 0.860611027171 | 0.975812305816 | 0.932179212864 |
500 | 0.927680606418 | 0.97665201801 | 0.960124489399 |
1000 | 0.928871417962 | 0.977905691408 | 0.961129481943 |
5000 | 0.924236216036 | 0.979243260843 | 0.959762475956 |
10000 | 0.922213116853 | 0.979931426705 | 0.959205947827 |
Conclusion
I really expected better model performance with filtration, but a loss of ~1.5% is a fairly significant reduction in accuracy. Most of the loss in accuracy seems to be from recall since precision seems to be fairly unaffected. This suggests the model wasn’t able to detect certain types of malware when popular classes were filtered. I’m really not sure why this is. I’d need to investigate specifically which samples were missed to understand more, but there are other experiments which I think would help that I’d rather do first.
Perhaps I could also modify thresholds to filter out only the tippy top most popular methods and classes, or reduce the threshold to filter out anything which occurs more than a dozen or so times.
Future Work
The two things I really want to try are sample clustering and model blending. Sample clustering would allow me to cluster similar samples and reduce the total number of training samples. This would allow me to vastly scale up sample size without losing much information and without a proportional increase in training time.
Blending and stacking is training multiple models on the same data and then using the output of those models as features for a meta estimator. In this way, the strengths of different models can be combined and the individual weaknesses reduced. It’s pretty common in Kaggle competitions. I’ve written a bit about this in Measuring the Usefulness of Multiple Models.