Caught in our Net

Using neural networks to identify algorithmically generated domains (AGDs)

The problem with today’s generation

A while back, we released a new CAL Feed that leveraged our ability to detect domains that were generated via an algorithm.  This is an interesting cohort of domains — they’re typically generated by machines and for machines.  That alone makes them somewhat unique, given that most domains are generated by humans for humans.

The worst part is that even though these algorithmically generated domains (AGDs) are easy to spot for a human, they can be changed frequently with no human interaction from the adversary.  Pair this with an increase in dynamic resolution techniques, and you can see how the savvier adversaries can use this to evade detection and mitigation techniques from security practitioners. This sophisticated opportunity for evasion makes AGDs, and confusingly enough the domain generation algorithms (DGAs) that create them, an interesting point of study.

Traditional detection and mitigation techniques don’t quite work the same way with AGDs as they do with normal domains.  Sure, you can spot a domain and block it — but what happens tomorrow when the DGA spits out a different domain?  At best, you’re playing whack-a-mole trying to keep up.  At worst, the adversary is able to establish a command and control channel within your network without you even noticing.

Current DGA analysis requires some heavy lifting: you might have to capture a malware sample, reverse engineer it, figure out how the algorithm works with different inputs (sometimes called seeds), and then brute-force your options.  Depending on the malware family, you may be looking at tens of thousands of possible domains per day.  Even if you were able to do all of that for the dozens of malware families that exist, you’d be moving around massive indicator watchlists: pretty big haystacks with relatively few needles in them.  Our partners at Bambenek Consulting have done exactly that, and we wanted to help advance their research even further.
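To make the seed idea a little more concrete, here is a minimal sketch of a toy DGA in Python.  It is not taken from any real malware family: the function name toy_dga, the seed string, the SHA-256 construction, and the reserved .example TLD are all assumptions made purely for illustration.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, length: int = 16) -> list[str]:
    """Derive `count` random-looking domains from a seed and a date.

    Hypothetical construction for illustration only; real families
    use their own (often much messier) algorithms.
    """
    domains = []
    for i in range(count):
        # Same seed + same day + same index always yields the same bytes.
        material = f"{seed}|{day.isoformat()}|{i}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) onto the letters a-p.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(label + ".example")
    return domains


if __name__ == "__main__":
    for name in toy_dga("demo-seed", date(2020, 6, 1)):
        print(name)
```

Because the output depends only on the seed and the date, the malware and its operator can each compute the same list independently, and a defender who recovers both can brute-force every candidate for any given day.  That is exactly what makes the watchlists so large.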

Modern problems require futuristic solutions

To counter this, we can leverage some form of analytics to try to identify an AGD without any of that advance knowledge.  One of the interesting things about most DGAs is that they have to generate something unique enough to not be already registered.  Between that, and the computational difficulty of generating a “random” (really pseudorandom) string of characters, a lot of AGDs don’t pass the eye test for humans.  If I asked you to look at a domain like drjamesbbqbonanza[.]com and a domain like dyh7fhps5h0prsyd839oxov2[.]com you would probably know which one was created by a human and which one was generated by an algorithm.  So if humans can distinguish between the two, how can we make sure that an automated solution like CAL can do it?

This is a perfect opportunity for machine learning to bridge the gap between human intelligence and computer horsepower.  We wanted to take the opportunity to not just celebrate, but explain to the community, how and why we chose the machine learning solution that we did.  Terms like “machine learning” and “artificial intelligence” get bandied about as buzzwords, with little explanation or meaning.  As cool as they sound, they’re really just another tool in the toolbox for us to better understand data.  To clear up some of that muddled conversation, we wanted to provide a transparent and scientific explanation of how we’re using these advanced analytics techniques to solve the problem.

As a result, I’m pleased to announce that we have released a white paper, entitled Detection of Automatically Generated Domain Names in Real-time with Machine Learning, that details our experimental methods and findings.  If the paper seems daunting to you, don’t worry: the rest of this blog will break down some of the main concepts and findings in layman’s terms.  Still, I’d encourage you to take a look at some of the key facts and figures that helped us derive our solution.  We validated that solution at a 99.9% success rate with minimal false positives.

All about that paper

While we haven’t gone through the rigmarole of having our paper peer reviewed for publication in academic journals, we have some experience in this stuff.  We’ve constructed our paper in the model of a proper research paper to help anyone so inclined understand our underlying data and methodology.  To summarize the key sections for you:

  • The introduction lays out the basics of the problem surrounding AGDs.  Basically what I did above, but in a less colloquial and more formal tone for anyone who is maybe more interested in the machine learning part of the problem and has less knowledge of the problem space.
  • The methodology section lays out how we conducted our experiments: what data we used, the different machine learning techniques we tried, and the tradeoffs we knew we’d be making.
  • The results section explains how the different machine learning models did, in real numbers.  We give examples of things that the models did well or poorly, and some of the practical constraints about actually using these machine learning techniques in real engineering applications.
  • The conclusion lays out next steps — room to improve our experimental design or the underlying data, so that we can better apply our findings in future iterations.

We’re really excited to continue to improve the dialog surrounding machine learning and analytics in our space moving forward, so if you’re at all curious about this paper I would encourage you to check it out.  You’ll be seeing more like it!

Spoiler alert: it worked

If you don’t care to delve into the white paper and discover what things did or didn’t work, or how well, I’ll sum it up for you: it works, and pretty darn well.  We tried a couple of machine learning approaches, we made a few tradeoffs, but in the end what worked out was a long short-term memory (LSTM) network, a form of recurrent neural network.

We did our best to identify features of AGDs — they have weird character frequencies, they tend to be longer, etc.  We tried to feed those features into other machine learning approaches like decision trees.  But in the end, the LSTM was great at finding novel features of the dataset we hadn’t predicted.

Some of our hypotheses were right: longer domains are more likely to be AGDs, and long sequences of consecutive consonants are similarly predictive.  But the beauty of the LSTM is that if you let it crank away on a clearly labeled dataset like we had, it will find even more predictive features.  While it took a while to get a custom-tuned high-power GPU to do the crunching, when the dust settled we were left with a very powerful predictive model that took domain names in and spat out a confidence level that each domain was an AGD.
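For a feel of what those hand-picked features look like, here is a rough sketch in Python.  It is not the feature set from the white paper: the function names, the choice to treat “y” as a consonant, and the use of Shannon entropy as a stand-in for “weird character frequencies” are all assumptions for illustration.

```python
import math
from collections import Counter

VOWELS = set("aeiou")  # assumption: "y" counts as a consonant here


def label_of(domain: str) -> str:
    """Return the first label of a (possibly defanged) domain, lowercased."""
    return domain.lower().replace("[.]", ".").split(".")[0]


def longest_consonant_run(label: str) -> int:
    """Length of the longest run of consecutive consonant letters."""
    longest = current = 0
    for ch in label:
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # vowels, digits and hyphens all break the run
    return longest


def shannon_entropy(label: str) -> float:
    """Bits per character, based on character frequencies in the label."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def features(domain: str) -> dict:
    """Hypothetical hand-built feature vector for a single domain."""
    label = label_of(domain)
    return {
        "length": len(label),
        "digits": sum(ch.isdigit() for ch in label),
        "longest_consonant_run": longest_consonant_run(label),
        "entropy": shannon_entropy(label),
    }


if __name__ == "__main__":
    for d in ("drjamesbbqbonanza[.]com", "dyh7fhps5h0prsyd839oxov2[.]com"):
        print(d, features(d))
```

Run against the two example domains from earlier, the algorithmic one comes out longer, more digit-heavy, and higher in entropy.  Under these particular definitions, though, both domains share the same longest consonant run, a small hint as to why hand-picked features only get you so far and why letting a model learn its own is attractive.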

The up-front cost of this (dozens of hours of training, sanitizing, optimizing, and tinkering) now pays off in near-instant predictions based purely on a string of characters.  Here are some sample outputs from our LSTM:

As you can see, there’s a sliding scale of things that look particularly weird to us humans versus what looks like a real domain.  The confidence levels of our LSTM match that, which is a good sign.  Even better, when an old friend (Zloader aka Zeus Sphinx) resurfaced and researcher Johannes Bader published some brute-forced domains, we were able to validate our LSTM against those AGDs to ensure we could catch domains from families that weren’t in our training dataset.  Nearly 99% of the domains were appropriately labeled with a confidence of 99% or higher — so we’re definitely onto something here!
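The kind of check described above boils down to a simple calculation: out of a list of known AGDs, what fraction does the model score at or above a confidence threshold?  Here is a minimal sketch in Python, assuming a hypothetical score callable that stands in for the trained model; the function name detection_rate and the stand-in scores below are made up for illustration and are not our actual results.

```python
from typing import Callable, Iterable


def detection_rate(
    domains: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.99,
) -> float:
    """Fraction of known AGDs scored at or above `threshold`.

    `score` is a placeholder for whatever model returns a
    confidence between 0.0 and 1.0 for a single domain name.
    """
    domains = list(domains)
    if not domains:
        raise ValueError("need at least one domain to compute a rate")
    hits = sum(1 for d in domains if score(d) >= threshold)
    return hits / len(domains)


if __name__ == "__main__":
    # Stand-in scores for illustration only; not real model output.
    fake_scores = {
        "aaaa.example": 0.999,
        "bbbb.example": 0.995,
        "cccc.example": 0.42,
        "dddd.example": 0.99,
    }
    rate = detection_rate(fake_scores, fake_scores.get)
    print(f"{rate:.0%} of known AGDs scored at or above 0.99")
```

Since every domain in a brute-forced list is a true AGD, this number is effectively the model’s recall on that family at the chosen threshold.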

If I only had a heart…

A robot with a brain is interesting, but the reality is that if we can’t tie these insights into other analytical models and applications it’s just an academic exercise.  What an LSTM really does is detect weird-looking domains versus English-looking domains.  The white paper explains some of the nuances of this, and ways that we’re working to address those limitations.  At the end of the day there’s a top layer of cream that we wanted to isolate, combine with some of our other analytical techniques, and serve up for you to use in your day-to-day investigations.  We do this a few ways:

  1. The aforementioned CAL Feed contains some of the cream of the crop.  We look for very high confidence AGDs but also take into account things like registration date and DNS resolution patterns, and we remove sinkholed domains.  This gives you access to high-confidence AGDs as they’re registered.  To give you a sense of the data, we’re averaging around 71 high-confidence AGDs per day in this feed.  To date, 97% of them are unreported by other open source feeds, and none of them have had a false positive reported by the thousands of analysts that have access to them.  We’re even seeing reported observations of these indicators in customer environments as well, meaning that our predictive modeling is helping to prevent incidents in real-world networks!
  2. If you’re participating in CAL, you’ll start to see that it automatically classifies suspected AGDs for you.  This means any domains that you have in your instances — whether provided from premium feeds, partner data sources, or even your own internal case management — will get run through our LSTM automatically.  Anything that red-lines our model into “high confidence” territory will get the Classifier [Host.DGA.Suspected] applied and have its ThreatAssess score adjusted accordingly.  Not only does this help our human eyes further appreciate the relative priority of these AGDs, but it now enables other automated applications — such as orchestration via Playbooks or Case Management — to take appropriate next steps.
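The two steps above can be sketched in a few lines of Python.  This is a simplified illustration, not how CAL is actually implemented: the DomainRecord fields, the 0.99 confidence cutoff, and the 30-day registration window are all assumptions.  Only the classifier name Host.DGA.Suspected comes from the product itself.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

HIGH_CONFIDENCE = 0.99      # assumed cutoff, not the production value
MAX_AGE_DAYS = 30           # assumed "recently registered" window
DGA_CLASSIFIER = "Host.DGA.Suspected"


@dataclass
class DomainRecord:
    """Hypothetical record combining model output with enrichment data."""
    name: str
    agd_confidence: float
    registered_on: Optional[date] = None
    is_sinkholed: bool = False
    classifiers: set = field(default_factory=set)


def apply_classifier(record: DomainRecord) -> DomainRecord:
    """Tag any domain the model scores into high-confidence territory."""
    if record.agd_confidence >= HIGH_CONFIDENCE:
        record.classifiers.add(DGA_CLASSIFIER)
    return record


def feed_candidates(records: list, today: date) -> list:
    """Keep high-confidence, recently registered, non-sinkholed domains."""
    selected = []
    for record in records:
        if record.agd_confidence < HIGH_CONFIDENCE:
            continue
        if record.is_sinkholed or record.registered_on is None:
            continue
        age_days = (today - record.registered_on).days
        if 0 <= age_days <= MAX_AGE_DAYS:
            selected.append(record)
    return selected
```

The design point is that the model’s confidence is only one input.  The classifier gets applied on confidence alone so that downstream automation can react to it, while the feed layers registration and sinkhole checks on top to keep only the indicators most likely to matter.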

Humans learning about machine learning

Stay tuned as we identify and deploy additional applications of this fascinating family of technologies moving forward.  Our goal is to do more than just solve your data problems — it’s to explain in a clear, digestible manner the way in which we’re solving them.  Analytics doesn’t have to be a muddy buzzword that has lost all meaning, as is so common in our space.  Instead, let’s move the dialog together towards understanding the nature of our problem and what makes something the right solution.

As we continue to reap the benefits of those discoveries, so will you.  We’re looking forward to not just deploying these armies of tireless machines, but helping you understand what they’re doing for you.

About the Author
Drew Gidwani

Drew Gidwani is the Director of Analytics at ThreatConnect. He drives the data modeling, collection, and analytics both within the core ThreatConnect platform and in CAL. Previously, Drew worked for the Department of Defense where he leveraged his varied analysis experiences to scale growing intelligence teams in the face of the ever-changing threats we face today. Drew holds a B.S. from Carnegie Mellon University and an M.S. from Johns Hopkins University. He currently resides in Maryland with his fierce warrior dog named Gimli.