My Methodology to AWS Detection Engineering (Part 2: Risk Assignment)

Introduction

Welcome back to the second installment of the blog series on my methodology for threat detection engineering in AWS. I am humbled by the response to Part One, so thank you to everyone who reached out. I'm very happy that some of you found it helpful. In case you missed it, Part One covered RBA (risk-based alerting) object selection for AWS and an example of how it could be put to use. I recommend reading Part One before continuing here. But since you're back, I assume you are here for Part Two, so let's get into it. We will look at an example of risk score assignment that applies the concepts from Part One and the lessons I've learned using the available RBA features in Splunk ES.

This will be a shorter post as I want to focus on the key components that make up the risk assignment rule. Let's start by jumping right into the logic needed in this methodology.

Core Components:

Initial Filter 1

index=notable tag_field="aws"

In this approach, we will use the index that contains your triggered detections. This is up to you; you could just as easily add the logic from the rest of this post to every detection rule you have. Just know that the collect command will run and write results whenever you execute the correlation search ad hoc, which can cause headaches. I prefer to have one rule to manage my AWS risk assignments, as it gives me the freedom to manage everything in one spot…again, up to you.

Initial Filter 2

tag_field="aws"

I will explain my approach here as there are a few options:

- This is a filtering field and could be whatever field you use to tag and identify detections as AWS-specific alerts…assuming you have done this in some capacity.
- If you don't have this or a similar tag, but you happen to output the "sourcetype" in your detections, you could use that as your filter (sourcetype="aws:*").
- If there is no sourcetype or tag field, you could get really lazy with it and use the alert name field: search_name="*AWS*".
- If you've made it this far, chances are you probably want to just add the logic below to your individual AWS correlation searches... though I'd argue it's simpler to add a tag to those searches and filter on that field, as in the first option 😉

Severity & Fidelity

| eval severity=coalesce(severity,"informational") 
| eval fidelity=coalesce(fidelity,1)

This is probably obvious, but this bit of code simply says: if severity or fidelity already exists on the event (which it should), use it; otherwise, set a default value.
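To see the defaults kick in, here is a minimal sketch you can run without touching your notable index. It uses makeresults to fake three events; the row field and the sample severity/fidelity values are made up for illustration and are not part of the rule itself.

```spl
| makeresults count=3
| streamstats count as row
| eval severity=case(row=1,"high", row=2,"medium")
| eval fidelity=case(row=1,3)
| eval severity=coalesce(severity,"informational")
| eval fidelity=coalesce(fidelity,1)
| table row severity fidelity
```

Row 1 keeps its own values, row 2 keeps its severity but picks up the default fidelity of 1, and row 3 (which had neither field) gets both defaults.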

Risk Objects

| eval risk_object=mvappend(aws_identity_arn, aws_principal_id, target_resource_name, src_ip,instance_id,...)
| mvexpand risk_object
| eval risk_object_type=IF(match(risk_object,"^i-[0-9a-f]{7,17}$|\b(?:\d{1,3}\.){3}\d{1,3}\b"),"system","user")

Remember, not all of the risk objects described in Part One of this series will exist on every event, and that's okay: mvappend simply skips fields that are null. But you do need to include all of the possible risk object fields (note: I did not include them all above for brevity). The risk_object_type tells Splunk what type of object this is, but honestly, unless the object is an instance_id, a src_ip, or (if you are blessed enough to have them) a friendly instance/compute hostname, the type will be "user". The match condition uses lazy regex (because there's no other way to do it, don't @ me) to catch instance IDs and IPv4 addresses and set the type to "system"; everything else keeps the default of "user".

Now would be the time to include threat_object(s) and their respective types, which would be the target_* fields mentioned in Part One, but I am skipping over this part as I find these serve a better purpose as risk objects. If you want to know more about threat objects, see here.

Lastly, the mvexpand is necessary to apply risk to each individual risk object per notable event.
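If you want to watch the expansion and typing happen before pointing this at real notables, a small self-contained sketch like the one below should do it. The ARN, IP address, and instance ID are placeholder values I made up, and I only populate three of the risk object fields so you can see that missing ones are skipped.

```spl
| makeresults
| eval aws_identity_arn="arn:aws:iam::123456789012:user/example-user", src_ip="203.0.113.10", instance_id="i-0abc1234def567890"
| eval risk_object=mvappend(aws_identity_arn, aws_principal_id, src_ip, instance_id)
| mvexpand risk_object
| eval risk_object_type=if(match(risk_object,"^i-[0-9a-f]{7,17}$|\b(?:\d{1,3}\.){3}\d{1,3}\b"),"system","user")
| table risk_object risk_object_type
```

You should get three rows from the single input event: the ARN typed as "user", and the IP address and instance ID typed as "system". The aws_principal_id field does not exist here, so it contributes nothing.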

Base Risk Score

| lookup severity_score.csv severity output score as base_score 
| eval base_risk_score=base_score 
| eval risk_score=base_risk_score * fidelity

This creates your base score from the severity_score lookup, which contains simple key/value pairs that align with the scores outlined here or with your own custom scoring scheme. The risk_score is then the base score multiplied by the fidelity.
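Here is a quick worked example of the arithmetic. To keep it runnable anywhere, I swap the lookup for a case() statement, and the score values (20 through 100) and the fidelity of 2 are placeholders I picked for illustration, not a recommendation; use whatever your severity_score lookup actually contains.

```spl
| makeresults
| eval severity="high", fidelity=2
| eval base_score=case(severity="informational",20, severity="low",40, severity="medium",60, severity="high",80, severity="critical",100)
| eval base_risk_score=base_score
| eval risk_score=base_risk_score * fidelity
| table severity fidelity base_score risk_score
```

With these placeholder numbers, a "high" severity detection with a fidelity of 2 ends up with a risk_score of 80 * 2 = 160. In the real rule, the lookup file needs a severity column and a score column to match the lookup command above.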

The Magic: Risk Score Assignment

| eval event_time=orig_time 
| collect index=risk

I was totally kidding when I said "the magic," as this is pretty straightforward. The event_time piece ensures you are pulling in the original _time from the alerted event. If you were doing this within the original detection rules, then it would just be a matter of setting event_time to _time. However, in this approach, we are querying the notable index for the triggered detections, so orig_time is what we need.

The collect statement is what we use to send this information to the risk index. Trust me when I say this method (the collect command) is the most liberating in comparison to the alternatives, which I will briefly touch on later.
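If you want to preview what would be sent to the risk index without actually writing anything (handy given the ad hoc headache mentioned earlier), the collect command accepts a testmode option. A small sketch, using a made-up event built with makeresults; every field value here is a placeholder:

```spl
| makeresults
| eval search_name="AWS - Example Detection", risk_object="203.0.113.10", risk_object_type="system", risk_score=160, orig_time=now()
| eval event_time=orig_time
| collect index=risk testmode=true
```

With testmode=true, nothing is indexed; the results are shown as they would appear if they had been written. Drop the option (or set it to false) once you are happy with the output.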

Advice

When assigning risk with a standalone correlation rule to the risk index, you will need to deduplicate events in some way to prevent score inflation on any given object.

I think one of the simplest ways to do this is to use the sha256() hash function against any of the following fields:

- _raw
- A concatenated field that includes the alert name (search_name), your risk_objects, and event_time
- The notable description (message)

  Note: This assumes you have already configured the message in every one of your detection rules and that it includes enough information (like a timestamp) to make events unique enough to prevent "over-throttling". Be descriptive and clear in this message; it may be useful to include a timestamp variable here ($_time$).

- A concatenation of whatever other fields you think capture the core contents of an event, or a mix-and-match of the above
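For the concatenated-field option, a minimal sketch might look like the following. The dedup_hash field name, the "|" separator, and the sample values are my own assumptions for illustration. Because this version includes the risk object, it would have to be computed after the mvexpand in the real rule.

```spl
| makeresults
| eval search_name="AWS - Example Detection", risk_object="203.0.113.10", event_time=1700000000
| eval dedup_hash=sha256(search_name."|".risk_object."|".event_time)
| table search_name risk_object event_time dedup_hash
```

The separator is there so that adjacent values cannot run together and accidentally produce the same string from different inputs.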

I personally like the _raw of the contributing detection (notable) event. This implies that you trust the suppression rules of the contributing correlation rules, and to me it is the easiest of the options above. Once you have your dedup field, you will need to place a subsearch in this rule to filter out events that have already been scored, which will look something like this:

| eval raw_event_hash=sha256(_raw) 
| search NOT [search index=risk tag_field="aws" | table raw_event_hash]
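Putting the pieces from this post together, the whole rule could be sketched roughly as follows. Treat the ordering as my assumption: I compute the hash and filter out already-scored events first, so the rest of the pipeline only runs on new notables. I also only list the five risk object fields named earlier, so add the rest of yours. This sketch further assumes that raw_event_hash and tag_field end up on the events in your risk index, since the subsearch depends on them; it is worth confirming that in your environment before relying on it.

```spl
index=notable tag_field="aws"
| eval raw_event_hash=sha256(_raw)
| search NOT [search index=risk tag_field="aws" | table raw_event_hash]
| eval severity=coalesce(severity,"informational")
| eval fidelity=coalesce(fidelity,1)
| eval risk_object=mvappend(aws_identity_arn, aws_principal_id, target_resource_name, src_ip, instance_id)
| mvexpand risk_object
| eval risk_object_type=if(match(risk_object,"^i-[0-9a-f]{7,17}$|\b(?:\d{1,3}\.){3}\d{1,3}\b"),"system","user")
| lookup severity_score.csv severity output score as base_score
| eval base_risk_score=base_score
| eval risk_score=base_risk_score * fidelity
| eval event_time=orig_time
| collect index=risk
```

Keep in mind that subsearches are subject to result and runtime limits, so you may want to bound the time range of the inner search on a busy risk index.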

I would be remiss if I did not mention the alternative options for assigning risk in the Splunk platform. I have taken the liberty of giving them their 15 minutes of fame, or shame in the case of sendalert.

Alternative risk assignment methods

There are two methods in particular you could use (sendalert and the Risk Adaptive Action), but let me explain why I prefer not to use them:

- The sendalert risk command has some frustrating out-of-the-box limitations, such as not accepting multi-valued fields for the risk_objects, along with some other gotchas which have driven me to insanity. Keen readers will notice there is an mvexpand above, and the collect command method has a similar limitation. Both can be overcome with mvexpand on the risk_object, but sendalert has additional shortcomings that would require another paragraph to explain. I'll spare you the details... don't use this unless you hate yourself (kidding, relax).
- The Risk Adaptive Action is nice until you really want to manipulate the risk_score based on custom logic, which requires "Risk Factor" configuration.

I will spend a little time on the last bullet point. Yes, Splunk introduced Risk Factors a few years ago (2021, I believe), and they are great for those just getting started, but I, at least, didn't want to do everything within the console. I don't believe that approach scales all that well in a medium or large enterprise. I'd suggest using Risk Factors for general use cases, like the ones shown here; overall, this is a nice feature to include in your RBA strategy.

That said, given the use case I was going for (AWS risk assignment), which brings along its own complexities and issues that I mentioned in Part One, I opted for the freedom of SPL and a correlation search, which can be managed as detection-as-code, rather than Risk Factors configured in the console.

Conclusion

Whether you choose to assign your risk scores in this way or use Risk Adaptive Actions and Risk Factors, you'll need to ensure you're doing the proper tuning within the contributing detection rules and/or in your risk assignment rule. This is not a "set it and forget it" type approach—I am actually not sure that there are any in the detection engineering space—but if you find any, please reach through your screen and slap me with that information.

I did not add this above as I assume everyone is doing this, but map your detections to the MITRE ATT&CK framework (technique/sub-technique IDs) and, where possible, the Cloud Matrix, as this will be useful in the scoring and aggregation rules you create later.

You may have noticed that I conveniently skipped over the actual risk scoring part in this blog, aside from the inherited score from severity and its product when multiplied by fidelity. In my next post, I will be covering "Variable Scoring Using SPL," which I believe is the best part about all of this.

See you soon!

Le Bron Does Security?