
How Gem's Architecture Enables Real-Time Enrichment & Analysis of Cloud Telemetry at Scale

Security operations in cloud environments can be a blessing and a curse. The blessing is that cloud environments generate huge amounts of data, making it possible, in theory, to gain a level of visibility that’s difficult to achieve on premises. The curse is that managing all this information is a serious challenge: beyond the sheer scale, cloud data often requires extensive enrichment and correlation to be useful to SecOps teams.


But while the cloud offers extremely rich and abundant data, it isn’t always modeled in the way needed for investigation (as anyone who has investigated cloud alerts is surely well aware). Streamed log files from CSPs create further headaches: because streamed data arrives unordered, the information needed to enrich an event often isn’t available until after that event arrives for processing.


Gem’s platform is built to overcome these challenges, fusing data from across the cloud environment automatically and immediately. The platform brings a high level of context to the entire cloud security operations workflow, from context-aware detections that adapt to behavioral-analytics baselines, to seamless investigation and deep drilldowns into specific cloud assets in the UI.


Tackling this problem requires enriching our real-time data with multiple asynchronous processes. This enrichment powers many of our leading-edge detection & response features. Today we’ll focus on one of those cases: resolving AssumeRole chains in CloudTrail. We’ll discuss our strategy for ingesting raw CloudTrail logs, and how we structured our Snowflake data lake for maximum enrichment and correlation performance while keeping costs manageable.




How it all begins: Gem’s CloudTrail Ingestion Pipeline

CloudTrail is the basis for tracking the core activities within AWS environments. At Gem, we ingest raw CloudTrail logs into a Snowflake data lake for processing.


As activities occur in AWS environments, CloudTrail pushes log files into S3 buckets in the customer’s account. Gem maintains read-only access to these buckets and is notified in real time, via an AWS SNS (Simple Notification Service) notification, when new events are available. We can work with notifications generated either directly by the CloudTrail trail or with S3 event notifications.


Once we receive the notification, we ingest the new file immediately, after adding some custom columns, such as the customer identifier and originating trail information, to each row. The data is loaded into our Snowflake table using Snowpipe.
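For illustration, a pipe for this kind of flow could be declared roughly as follows. This is a minimal sketch that uses Snowpipe’s COPY transformations to stamp the extra columns onto each row; all object and column names here are illustrative, and Gem’s actual pipeline may instead add the columns before a file reaches the stage.

-- A minimal Snowpipe sketch, assuming an external stage over the
-- CloudTrail files and a target table with a VARIANT column for the
-- raw record. All names are illustrative.
CREATE PIPE INGESTION_PROD.PUBLIC.CLOUDTRAIL_PIPE
    AUTO_INGEST = TRUE  -- fire on bucket event notifications
AS
COPY INTO INGESTION_PROD.PUBLIC.CLOUDTRAIL
    (RAW, INGEST__CUSTOMER_NAME, INGEST__TRAIL_FILE)
FROM (
    SELECT
        $1,                -- the raw CloudTrail record
        'acme-corp',       -- custom column: customer identifier
        METADATA$FILENAME  -- custom column: originating trail file
    FROM @INGESTION_PROD.PUBLIC.CLOUDTRAIL_STAGE
)
FILE_FORMAT = (TYPE = JSON);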


Assumed Role Unchaining

So once we have the data in Snowflake, how do we enrich it?


Let’s focus on one specific use case: Assume Role Unchaining (for more details on this technique, see our blog here). When an analyst is investigating a suspicious action, the first question they generally ask is “who did it?” But in AWS, answering this simple question is often challenging because of complicated role-assumption chains: when looking at CloudTrail logs to see which user initiated a given action, often only the role associated with the action is recorded. To dig deeper, analysts need to find out which identity had assumed that role at the time of the action. Often, this process must be repeated to “unroll” the identity all the way back to an originating source user.
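To make the pain concrete, here is one hop of that manual process as a hedged SQL sketch. It assumes the raw CloudTrail record is stored in a VARIANT column (called RAW here; both the column name and the ARN are illustrative):

-- Find who called AssumeRole to create the session seen on a
-- suspicious event. If the caller is itself an assumed-role session,
-- the lookup must be repeated, hop by hop, back to a source user.
SELECT RAW:userIdentity:arn::STRING AS caller_arn
FROM INGESTION_PROD.PUBLIC.CLOUDTRAIL
WHERE RAW:eventName::STRING = 'AssumeRole'
  AND RAW:responseElements:assumedRoleUser:arn::STRING =
      'arn:aws:sts::123456789012:assumed-role/SuspiciousRole/session-1';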


To make investigation easier, we needed to abstract these challenges away: our backend is built on Snowflake, so ultimately we needed the ability to query CloudTrail logs in Snowflake by the relevant source identity (the original identity at the beginning of the AssumeRole chain). Compared with, for example, keeping these source identities in a separate key-value store, this approach gives us the most flexibility: we can easily answer questions like “what else was a particular user doing around the time of an alert” without having to untangle role assumption each time.


Requirements, alternatives and trade-offs

Implementing this approach presented several technical challenges. Among them is the fact that CloudTrail logs aren’t necessarily ordered: if we detected a suspicious event that needed to be “unrolled,” the corresponding AssumeRole events might not even have been ingested yet. To tackle this, we considered two possible approaches (both sketched in SQL after the list below):


  1. Mutating the Snowflake table: wait until all the relevant AssumeRole events have been received, then run an UPDATE statement that stamps the source identity value onto every relevant record.

  2. Using a separate enrichment table: create a separate enrichment table that can be JOINed efficiently with the main log events table, appending new source identity information to it as soon as it becomes available.
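In SQL terms, the contrast between the two approaches looks roughly like this. This is a minimal sketch: the table names match the examples later in this post, but the values and predicates are illustrative, and the SOURCE_IDENTITY_CONCAT column on the log table in approach 1 is hypothetical.

-- Approach 1 (rejected): mutate the log table in place, rewriting
-- every event row that belongs to the assumed-role session.
UPDATE INGESTION_PROD.PUBLIC.CLOUDTRAIL
SET SOURCE_IDENTITY_CONCAT = 'alice'            -- hypothetical enrichment column
WHERE SOURCE_IDENTITY_LOOKUP_KEY = 1234567890;  -- touches many rows/partitions

-- Approach 2 (chosen): append a single row per session to a separate
-- enrichment table; the log table stays append-only.
INSERT INTO ENRICHMENT_PROD.PUBLIC.CLOUDTRAIL_SOURCE_IDENTITY
    (SESSION_CREATION_DATETIME, SOURCE_IDENTITY_LOOKUP_KEY,
     SOURCE_IDENTITY_TYPE, SOURCE_IDENTITY_CONCAT, IS_CROSS_ACCOUNT)
VALUES
    ('2024-01-01 12:00:00', 1234567890, 'IAMUser', 'alice', FALSE);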


Each approach had advantages, but we opted to go with #2 for the following reasons:


  1. Log immutability: all of our log tables are append-only. This simplifies the architecture significantly and reduces the likelihood of data integrity issues. Maintaining the immutability of the log tables was a hard requirement in this case.

  2. Better write performance: an enrichment table requires writing only a single row per source identity value. Mutating the log tables would instead UPDATE many more rows (and hence partitions), because the same enrichment value would need to be written to every row of the AssumeRole session, consuming more compute resources.

  3. Cost: by keeping the enrichment table relatively small and well clustered, we could achieve read performance close to that of in-table data. Together with the favorable write performance, we believed this approach would result in a smaller bill.

Creating the tables

Optimizing the Schema

We designed the enrichment table schema for maximum lookup performance. The table is keyed by a source identity lookup key, a field we calculate ourselves, and we deliberately made it a numeric value rather than a human-readable string: lookups by number are significantly faster than lookups by a string value.


CREATE TABLE ENRICHMENT_PROD.PUBLIC.CLOUDTRAIL_SOURCE_IDENTITY (
    -- Customer identifiers (referenced by the clustering key and the join below)
    INGEST__CUSTOMER_NAME STRING,
    INGEST__ORGANIZATION_NAME STRING,
    -- Date field for narrowing down lookups
    SESSION_CREATION_DATETIME TIMESTAMP_LTZ(9),
    -- Lookup field
    SOURCE_IDENTITY_LOOKUP_KEY NUMBER,
    -- Enrichment fields
    SOURCE_IDENTITY_TYPE STRING,
    SOURCE_IDENTITY_CONCAT STRING,
    IS_CROSS_ACCOUNT BOOLEAN
);
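One way to derive such a numeric key (a sketch only; our actual derivation may differ) is Snowflake’s built-in 64-bit HASH function, which maps a session’s identifying string onto a NUMBER:

-- Illustrative only: deriving a numeric lookup key from an identifying
-- string with Snowflake's 64-bit HASH().
SELECT HASH('arn:aws:sts::123456789012:assumed-role/MyRole/session-1')
       AS SOURCE_IDENTITY_LOOKUP_KEY;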

Balancing Clustering Objectives

To optimize clustering, we needed a clustering strategy that matched our query patterns while balancing the size and number of clusters. Clusters that are too large hurt performance, because finding the right information within a given cluster requires scanning too much data. Clustering by customer was an obvious first step: all querying and analysis within the platform relates to a specific customer, and the divisions between customers are the largest groupings within our schema.


To reduce cluster size further, our second step was to cluster by date. This decision reflects the real-time nature of the platform: in security operations, virtually every question we ask is time-bounded. When analyzing a suspicious event, we want to know what other actions were taking place around the time of the event; when assessing whether the behavior of a given user or entity is normal, we want to compare it against a baseline of recent activity. Virtually every view in the Gem UI has a time-bounded component, so clustering by day made sense to us.


We still needed to cluster further to increase performance, and since the enrichment table is mostly joined on the source identity lookup key, clustering by source identity made the most sense. But a typical cloud environment has at least thousands of entities that could appear here, and creating a cluster for each wasn’t feasible: it would require too many fetch operations and reduce performance.


To balance these concerns, we round the source identity lookup key using a bitwise-AND operation. This caps the number of clusters while preserving the overall goal of keeping each source identity’s information consolidated in the same cluster.

ALTER TABLE ENRICHMENT_PROD.PUBLIC.CLOUDTRAIL_SOURCE_IDENTITY CLUSTER BY (
    INGEST__CUSTOMER_NAME,
    TO_DATE(SESSION_CREATION_DATETIME),
    BITAND(SOURCE_IDENTITY_LOOKUP_KEY, 255));
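Concretely, BITAND with the mask 255 (binary 11111111) keeps only the low 8 bits of the key, so every lookup key lands in one of 256 buckets:

-- Every key maps into one of 256 buckets (0-255).
SELECT BITAND(1234567890, 255);  -- returns 210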

The end result is that Gem can join the enrichment table to our main events table quickly and efficiently. The query below, which joins the two tables and lets us easily surface the relevant identities, runs constantly on our infrastructure, and would not be practical without these optimizations.

SELECT *
FROM INGESTION_PROD.PUBLIC.CLOUDTRAIL
LEFT JOIN
   ENRICHMENT_PROD.PUBLIC.CLOUDTRAIL_SOURCE_IDENTITY ON
   CLOUDTRAIL.SOURCE_IDENTITY_LOOKUP_KEY = CLOUDTRAIL_SOURCE_IDENTITY.SOURCE_IDENTITY_LOOKUP_KEY AND
   CLOUDTRAIL_SOURCE_IDENTITY.SESSION_CREATION_DATETIME BETWEEN DATEADD(hour, -12, CLOUDTRAIL.EVENTTIME) AND CLOUDTRAIL.EVENTTIME AND
   CLOUDTRAIL_SOURCE_IDENTITY.INGEST__ORGANIZATION_NAME = CLOUDTRAIL.INGEST__ORGANIZATION_NAME

Conclusion

Managing the enormous amounts of data available in the cloud presents unique challenges that demand specialized approaches to cloud security operations and incident response. Our approach separates enrichment context from raw logs: the log tables stay immutable, while we still handle delayed or out-of-order data streaming in from the cloud service provider. Clustering designed to match the structure of our queries lets us keep these two tables separate while maintaining high performance and cost efficiency.


Customers can choose a turnkey approach in which they store their cloud telemetry in Gem’s Snowflake instance, delivered as a Snowflake Managed Application, or they can store the data in their own private Snowflake security data lake and connect to Gem’s platform through the Snowflake connected application model. This makes the platform an ideal fit for regulated markets where data traceability and governance are required.


Learn how Gem's cloud-native and agentless Cloud Detection & Response (CDR) platform helps SecOps teams dramatically reduce the time to detect, triage, forensically investigate, and contain multi-stage cloud attacks across all major cloud providers (AWS, Azure, GCP) and identity providers (Okta, Azure AD, Google Workspace).


To book a demo, don’t hesitate to reach out.
