My Methodology to AWS Detection Engineering (Part 1: Object Selection)
Introduction
Welcome to the first installment of my new blog series discussing my methodology for threat detection engineering in AWS. This blog assumes you are familiar with Splunk Enterprise Security, its terminology, and/or similar SIEM functionality related to "Risk-Based Alerting" concepts. If not, you can read some reference docs here and here or if you prefer videos you can go here and here. Also, if you need a refresher on AWS CloudTrail userIdentity fields, see the official documentation here.
To be clear, this is just what I have been doing; it is not meant to be prescriptive. While this approach uses Splunk at its core, the concepts apply to any SIEM that allows you to perform risk scoring or has the components to do so, such as creating indices, performing lookups, and using eval commands. That said, this concept is better described as "tailored event aggregation" (shoutout to Haylee Mills).
Traditional Approach
About five years ago, I began to question whether traditional methodologies for engineering detections and correlating events against AWS CloudTrail data were outdated. Is it really valuable to focus solely on fields like "sourceIPAddress" and/or "user"? What exactly is a "user" according to SIEM standards within the multiverse of AWS services and API actions? Can singular detections stand the test of time and remain high fidelity in ever-changing corporate environments?
To assist with my likely inadequate explanation of my approach (I know, buckle up), I will first describe what most are likely doing to detect threats. Then, I'll expand on the two fields that I believe are incorrectly used for aggregation or as the sole risk objects. I will use the example of an AWS IAM User being created in your environment. For the purposes of this blog, we will ignore the fact that it is a crime against your AWS environment to be creating an IAM User in the first place.
Traditionally, the core logic of your detection would look something like this:
Sigma:
logsource:
  product: aws
  service: cloudtrail
detection:
  selection:
    eventSource: iam.amazonaws.com
    eventName: CreateUser
  condition: selection
Splunk:
index=aws_cloudtrail eventSource="iam.amazonaws.com" eventName="CreateUser"
This may or may not trigger regularly in your environment; if you are an enterprise that was not built within the last ten years and you are not a cloud-native startup, chances are this is a semi-regular occurrence. At best, this is a moderate-fidelity alert, but it's more likely to generate some noise on its own. Even so, the fields in your output that matter most for traditional alerting and risk assignment are "src_ip" & "user". Here are my issues with these fields as they are used in traditional methods:
- src_ip
  - Issue: This is highly ephemeral and cannot be a reliable IOC by itself, especially for savvy threat actors. Not to mention, this could be an AWS service FQDN and not an IP address at all.
- user
  - Issue: This can be so many different value types within CloudTrail that it's not even funny... Seriously, I stopped laughing at this a long time ago. This may be a SIEM issue, but it was my issue, so now you get to hear about it. Is this field a friendly user name, an instance ID, an email address, etc.?
Cool, so what happens when the threat actor switches IP addresses (or AWS does) and starts using the newly created IAM User? As I continued to learn more about AWS and the other CSPs (AWS is my safe space), it became apparent to me that this was not good enough. It would never be enough to catch a cloud adversary with even a little savvy, except by accident. I believe that if you really want to cast the widest net without creating noise factories with your alerts, while still managing to detect both automated and hands-on-keyboard-style attacks, the game must change.
For me, the answer was to blend the concepts of risk-based alerting with the ephemeral nature of the cloud and to select some of the most critical fields logged by CloudTrail and adapt them as necessary. Specifically, I wanted to build something that would serve and revolve around an AWS IAM-centric approach but have room for creativity. After all, identity is the new endpoint... or perimeter... or whatever cliché is being marketed now.
Risk Objects & Methodology
The risk object fields that I have chosen to roll with are listed below, a.k.a. the fields that will have numerical values assigned to them per triggered detection event and be used for aggregation. I have renamed a few of them for readability and clarity throughout the blog (a minimal SPL normalization sketch follows the list):
- aws_identity_arn
  - Renamed from userIdentity.arn
- aws_principal_id
  - Renamed from userIdentity.principalId
  - *Cutting off the colon (:) and the value trailing this ID allows for more consistent correlation
- target_resource_name*
  - This is use-case dependent per detection and is a special field that can be the difference in correlating risk across a threat actor's movement from identity A to identity B, or it can assist with risk accumulation on an AWS resource: instance_id/database_name/etc.
  - From our above detection example: requestParameters.userName (IAM User)
- target_resource_arn(s)*
  - This field is created by concatenating metadata from the alerted event using values like the account ID, target_resource_name & knowledge of the ARN structure
  - In some cases, this ARN is simply provided in the CloudTrail requestParameters.{} or responseElements.{} data
  - Can be multiple ARNs
- target_resource_principal_id
  - This is the renamed unique ID provided in the CloudTrail responseElements.{} data; it will be shown later in the blog
- user
- src_ip

Optional Fields:
- instance_id*
  - Lazy regex example: "^i-[0-9a-f]{7,17}$"
  - For certain events this is extracted from the responseElements.{} or requestParameters.{} data
- aws_parent_principal_arn
  - Renamed from userIdentity.sessionContext.sessionIssuer.arn
  - Direct match for target_resource_arn values that are identity-based
- aws_parent_principal_id
  - Renamed from userIdentity.sessionContext.sessionIssuer.principalId
  - Direct match for target_resource_principal_id values that are identity-based
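To make these renames and derived fields concrete, here is a minimal SPL sketch for the CreateUser example from earlier. It is a sketch, not drop-in logic: the index name reuses aws_cloudtrail from the example above, the dotted field names assume JSON-style extractions from your CloudTrail ingestion (e.g., the AWS add-on), and user/src_ip are assumed to already be mapped by that ingestion.

```
index=aws_cloudtrail eventSource="iam.amazonaws.com" eventName="CreateUser"
| eval aws_identity_arn='userIdentity.arn'
| eval aws_principal_id=mvindex(split('userIdentity.principalId', ":"), 0)
| eval target_resource_name='requestParameters.userName'
| eval target_resource_principal_id='responseElements.user.userId'
| eval target_resource_arn=coalesce('responseElements.user.arn', "arn:aws:iam::" . recipientAccountId . ":user/" . target_resource_name)
| table _time aws_identity_arn aws_principal_id target_resource_name target_resource_arn target_resource_principal_id user src_ip
```

The split()/mvindex() pair is what drops the trailing colon value mentioned above, and the coalesce() shows both ways of getting the target ARN: taking it straight from responseElements when it is provided, or concatenating it from the account ID, the target_resource_name, and knowledge of the ARN structure.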
There are more fields that you could add to this list, but these have been the most relevant for me when stitching together disparate events. Now let's go back to our detection example and say a threat actor obtains exposed IAM User credentials for your environment.
One of the first things they will want to do with these credentials after enumeration (which is likely automated) is gain persistence within your environment by creating an attacker-controlled IAM User/Role and pivoting to this persistent identity. If you are using more traditional methodologies, most automated correlation logic will start to fall apart without manual IR investigation to tie relevant events together. This is where this approach really shines, by focusing on the actual AWS IAM principals performing the actions upfront as opposed to source IP addresses or inconsistent users values.
Detection Breakdown
High Volume of API Errors by AWS Principal (+5)
- Risk Objects:
- aws_identity_arn = arn:aws:iam::012345678910:user/Bob (Risk Score: 5)
- aws_principal_id = AIDAEXAMPLE1234567890 (Risk Score: 5)
- user = Bob (Risk Score: 5)
- src_ip = 1.1.1.1 (Risk Score: 5)
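For reference, the core of a detection like this might look something like the sketch below. The one-hour window, the error-count threshold, and the +5 risk_score are placeholders for whatever fits your environment; CloudTrail only populates errorCode on failed calls, which is what the initial filter keys on.

```
index=aws_cloudtrail errorCode=*
| eval aws_identity_arn='userIdentity.arn'
| eval aws_principal_id=mvindex(split('userIdentity.principalId', ":"), 0)
| bin _time span=1h
| stats count as error_count dc(eventName) as distinct_api_calls values(errorCode) as error_codes by _time aws_identity_arn aws_principal_id user src_ip
| where error_count > 50
| eval risk_score=5
```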
AWS GuardDuty Alert - Discovery:IAMUser/AnomalousBehavior (+20)
- Risk Objects:
- aws_identity_arn = N/a
- *Does not exist in GuardDuty output (though it should; feature request, please!)
- aws_principal_id = AIDAEXAMPLE1234567890 (Risk Score: 25)
- user = Bob (Risk Score: 25)
- src_ip = 1.1.1.1 (Risk Score: 25)
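If you are wondering where the risk objects come from for a GuardDuty finding, a rough sketch is below. The index name is a placeholder, and the field paths are based on the GuardDuty finding schema for IAM-related findings; depending on how you ingest findings (e.g., via EventBridge) they may carry a prefix such as detail., so adjust accordingly.

```
index=aws_guardduty type="Discovery:IAMUser/AnomalousBehavior"
| eval aws_principal_id=mvindex(split('resource.accessKeyDetails.principalId', ":"), 0)
| eval user='resource.accessKeyDetails.userName'
| eval src_ip='service.action.awsApiCallAction.remoteIpDetails.ipAddress'
| eval risk_score=20
| table _time type aws_principal_id user src_ip risk_score
```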
AWS Privileged IAM Role (+40)
- Risk Objects:
- aws_identity_arn = arn:aws:iam::012345678910:user/Bob (Risk Score: 45)
- aws_principal_id = AIDAEXAMPLE1234567890 (Risk Score: 65)
- target_resource_name = Alice (Risk Score: 40)
- target_resource_arn = arn:aws:iam::012345678910:role/Alice (Risk Score: 40)
- Remember when I said this could be multiple ARNs? Well, another ARN that you could use is arn:aws:sts::012345678910:assumed-role/Alice
- target_resource_principal_id = AROAEXAMPLE1234567890 (Risk Score: 40)
- user = Bob (Risk Score: 65)
- src_ip = 1.1.1.5 (Risk Score: 40)
Note: Adding risk to the responseElements.role.roleId, which I would refer to as target_resource_principal_id, allows for correlation to the aws_parent_principal_id in the events that follow.
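As a sketch of that note, here is roughly how the target_resource_* fields for this detection could be derived from a CreateRole event. What actually qualifies a role as "privileged" is left to your own policy logic and is omitted here; the second, assumed-role-style ARN in the mvappend() is the extra ARN mentioned above.

```
index=aws_cloudtrail eventSource="iam.amazonaws.com" eventName="CreateRole"
| eval aws_identity_arn='userIdentity.arn'
| eval aws_principal_id=mvindex(split('userIdentity.principalId', ":"), 0)
| eval target_resource_name='responseElements.role.roleName'
| eval target_resource_principal_id='responseElements.role.roleId'
| eval target_resource_arn=mvappend('responseElements.role.arn', "arn:aws:sts::" . recipientAccountId . ":assumed-role/" . target_resource_name)
| table _time aws_identity_arn aws_principal_id target_resource_name target_resource_arn target_resource_principal_id user src_ip
```

Scoring target_resource_principal_id here (the AROA... roleId) is exactly what lets the score accumulate against the aws_principal_id and aws_parent_principal_id values in the events that follow.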
EC2 Created with Open Security Group (+40)
- Risk Objects:
- aws_identity_arn = arn:aws:sts::012345678910:assumed-role/Alice (Risk Score: 40)
- *Risk score could be 80 if stripping role session name as noted below and if adding the second target_resource_arn in the AWS Privileged IAM Role detection
- aws_principal_id = AROAEXAMPLE1234567890 (Risk Score: 80)
- *See screenshot for bonus correlation opportunity (target_resource_principal_id)
- user = Alice (Risk Score: 80)
- *Correlated from Privileged IAM Role detection (target_resource_name)
- src_ip = 1.1.1.10 (Risk Score: 40)
- target_resource_name = i-0a1b2c3d4e5f6a7b8 (Risk Score: 40)
- *responseElements.instancesSet.items{}.instanceId
- aws_parent_principal_arn = arn:aws:iam::012345678910:role/Alice (Risk Score: 80)
- *Correlated from Privileged IAM Role detection (target_resource_arn)
- aws_parent_principal_id = AROAEXAMPLE1234567890 (Risk Score: 80)
- *See screenshots for bonus correlation opportunity (target_resource_principal_id). Also, this principal ID just so happens to be the same as the aws_principal_id due to the IAM User (Bob) assuming the Alice role. This is just how it logs in CloudTrail; I am not sure why this value would not be the principal ID for Bob, but go fight AWS about it, not me.
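For the curious, a rough sketch of how the identity fields for this event could be normalized is below. The replace() is the role-session-name stripping referenced in the score note above, which is what lets the assumed-role ARN line up with the second target_resource_arn from the previous detection. The open-security-group condition itself is omitted for brevity, and the field paths again assume JSON-style CloudTrail extractions.

```
index=aws_cloudtrail eventSource="ec2.amazonaws.com" eventName="RunInstances"
| eval aws_identity_arn='userIdentity.arn'
| eval aws_identity_arn=if(match(aws_identity_arn, ":assumed-role/"), replace(aws_identity_arn, "^(arn:aws:sts::\d{12}:assumed-role/[^/]+)/.*$", "\1"), aws_identity_arn)
| eval aws_principal_id=mvindex(split('userIdentity.principalId', ":"), 0)
| eval aws_parent_principal_arn='userIdentity.sessionContext.sessionIssuer.arn'
| eval aws_parent_principal_id='userIdentity.sessionContext.sessionIssuer.principalId'
| eval instance_id=mvindex('responseElements.instancesSet.items{}.instanceId', 0)
| table _time aws_identity_arn aws_principal_id aws_parent_principal_arn aws_parent_principal_id instance_id src_ip
```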
Expensive EC2 Instance Created (+20)
- Risk Objects:
- aws_identity_arn = arn:aws:sts::012345678910:assumed-role/Alice (Risk Score: 60)
- *Again, this could be 100 at this point
- aws_principal_id = AROAEXAMPLE1234567890 (Risk Score: 100)
- user = Alice (Risk Score: 100)
- *Correlated from Privileged IAM Role detection (target_resource_name)
- src_ip = 1.1.1.10 (Risk Score: 60)
- target_resource_name = i-0a1b2c3d4e5f6a7b8 (Risk Score: 60)
- *Correlated from Open Security Group detection (target_resource_name)
- aws_parent_principal_arn = arn:aws:iam::012345678910:role/Alice (Risk Score: 100)
- aws_parent_principal_id = AROAEXAMPLE1234567890 (Risk Score: 100)
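A sketch of the "expensive" check is below. The expensive_instance_types.csv lookup is a hypothetical file mapping the instance types you consider costly, and pulling instanceType from the same responseElements array as the instance ID above is an assumption about your event structure, so verify it against your data.

```
index=aws_cloudtrail eventSource="ec2.amazonaws.com" eventName="RunInstances"
| eval instance_id=mvindex('responseElements.instancesSet.items{}.instanceId', 0)
| eval instance_type=mvindex('responseElements.instancesSet.items{}.instanceType', 0)
| lookup expensive_instance_types.csv instance_type OUTPUT is_expensive
| where is_expensive="true"
| eval aws_parent_principal_arn='userIdentity.sessionContext.sessionIssuer.arn'
| eval aws_parent_principal_id='userIdentity.sessionContext.sessionIssuer.principalId'
| table _time aws_parent_principal_arn aws_parent_principal_id instance_id instance_type user src_ip
```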
AWS GuardDuty Alert - CryptoCurrency:EC2/BitcoinTool.B (+80)
- Risk Objects:
- src_ip = 1.1.1.15 (Risk Score: 80)
- *These are Bitcoin- and EC2-specific IPs, so this is effectively another IP change
- instance_id = i-0a1b2c3d4e5f6a7b8 (Risk Score: 140)
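This is where the instance_id risk object pays off: the GuardDuty EC2 finding is keyed on the instance rather than the IAM principal that launched it, but the instance ID it carries matches the target_resource_name/instance_id values already holding risk from the EC2 detections. A rough mapping sketch follows, with the same caveats as the earlier GuardDuty example (placeholder index name, finding-schema field paths that may be prefixed depending on ingestion).

```
index=aws_guardduty type="CryptoCurrency:EC2/BitcoinTool.B"
| eval instance_id='resource.instanceDetails.instanceId'
| eval src_ip='service.action.networkConnection.remoteIpDetails.ipAddress'
| eval risk_score=80
| table _time type instance_id src_ip risk_score
```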
Conclusion
When bringing this all together, you simply query your risk index, which contains the respective risk objects, take the aggregate score for each of them, and then set your threshold:
index=risk
| stats values(*) as * sum(risk_score) as total_score by risk_object
| where total_score > 100
Although this method of tailored event aggregation and risk object selection is not foolproof, it offers significant flexibility and room for creativity, and it plays a pivotal role in correlation compared to traditional alternatives. Prior to the final GuardDuty detection, it is very likely that a threat actor would trigger a high-fidelity risk rule based on the accumulated scores for the risk_object(s) correlated above. The above SPL is obviously simplified, but depending on your scoring and thresholds, it is quite possible that your configured event aggregators trigger an alarm around the time of the EC2 detections. Without these extra risk objects, it would likely take longer to surface critical events to your incident response team in any meaningful way in an enterprise environment.
In the next post, I will do a deep dive into what this looks like in Splunk SPL, explore different methods of performing risk assignment, and show how this can be adapted for the ever-changing AWS landscape.
Stay tuned!