Lessons of Enterprise Scale Analytics, Part 2
In this series we’re exploring specific technical lessons that apply as Analytics implementations grow to ever larger scales. In our first entry, we introduced the idea that complexity in these circumstances only moves in one direction. Today we’re going to take that principle as a starting point and dive deep into grain considerations.
Grain is everything.
When a traditional relational database is normalized, the grain (that is: the level of detail that each row in the table represents) of each table is obvious and consistent. We have to consider it when we query the database, but not when we’re just storing the data. In Analytics, we have to consider the data grain all the way from the Recipe input node to the Dashboard query output, and the impact of that understanding grows proportionally with the complexity of the implementation. At enterprise scale, the impact is so central to success that I want to share three distinct strategies for managing grain.
1. Know where your measures live.
Aggregating an off-grain measure in a query can lead to duplication, inflated values, and an inability to recompute values dynamically. Imagine grouping an Opportunities Dataset by the related Account Name, and then summing the Account Number of Employees within that grouping. Because there are many Opportunities per Account, each Account Name potentially appears on multiple records, so the Account Number of Employees gets added multiple times, producing an inflated value. As a general rule, we should not include off-grain measures in our Datasets, to avoid accidentally aggregating them in queries. Instead, move the measure in question to the Dataset already at that grain, or create a new Dataset with appropriate keys to store the measure.
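To make the double-counting concrete, here’s a minimal Python sketch (the account names and values are hypothetical) of what happens when an Account-grain measure is denormalized onto Opportunity-grain rows and then summed:

```python
# Hypothetical rows at the Opportunity grain, with the Account's
# employee count (an off-grain, Account-level measure) copied onto each row.
opportunities = [
    {"account": "Acme", "amount": 100, "employees": 50},
    {"account": "Acme", "amount": 200, "employees": 50},
    {"account": "Beta", "amount": 300, "employees": 20},
]

# Summing the off-grain measure counts it once per Opportunity row.
inflated = sum(row["employees"] for row in opportunities
               if row["account"] == "Acme")
print(inflated)  # 100 -- double the true value

# Stored at the correct grain (one row per Account), the value is right.
accounts = {"Acme": 50, "Beta": 20}
print(accounts["Acme"])  # 50
```

The same arithmetic happens inside a Dashboard query when an off-grain measure is aggregated, which is why keeping the measure in a Dataset at its own grain avoids the problem entirely.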
The notable exception is a use case like targets vs. actuals, where we have a goal or quota value that applies in a certain grouping context, such as per-user-per-year or per-region-per-quarter. In that case we can join the goal value into the Dataset using the grouping dimensions as a key (e.g., “Smith-2025” or “Eastern-Q2”), and then in the query display the min, max, or average of that value as an aggregate in the grouping where it applies. If the grouping is correct, every value in the grouping will be the same, so the min, max, and average all match. We can then compare that value against the group sum of the actuals measure to compute goal attainment. This use case is the exception that proves the rule: if you know how your numeric values (measures) are stored, you’re well set up to consistently aggregate them correctly.
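Here’s a small Python sketch of the targets-vs-actuals pattern, using a hypothetical “Smith-2025” key and invented numbers. Because the joined goal repeats identically on every row in the grouping, min (or max, or average) recovers the single goal value, while sum is reserved for the actuals:

```python
# Hypothetical Opportunity rows with a per-user-per-year quota joined
# on an "owner-year" key, so the same goal repeats on every matching row.
rows = [
    {"owner_year": "Smith-2025", "amount": 40000, "goal": 100000},
    {"owner_year": "Smith-2025", "amount": 35000, "goal": 100000},
    {"owner_year": "Smith-2025", "amount": 30000, "goal": 100000},
]

actual = sum(r["amount"] for r in rows)  # 105000: sum the actuals measure
goal = min(r["goal"] for r in rows)      # 100000: min == max == avg here
attainment = actual / goal
print(f"{attainment:.0%}")               # 105%
```

If min and max ever disagree within a grouping, that’s a strong signal the join key doesn’t match the grouping context.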
2. Rely on lookup joins as much as possible.
A lot of Analytics builders who come from a database management background will want to use left joins to build their Datasets. However, we recommend against this in most cases, as a left join will add rows to your Dataset wherever there are multiple matches, changing your grain and row count and making validation difficult or impossible. In contrast, lookup joins keep both the grain and the row count consistent, which makes managing complexity over time much easier.
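The row-count difference is easy to see in a toy Python model of the two join semantics (the record values are hypothetical; this is a sketch of the behavior, not of any Analytics internals):

```python
# Base rows at the Opportunity grain.
opportunities = [
    {"opp_id": 1, "account": "Acme"},
    {"opp_id": 2, "account": "Beta"},
]

# The joined data has two matching rows for "Acme".
contacts = [
    {"account": "Acme", "contact": "Ann"},
    {"account": "Acme", "contact": "Bob"},
    {"account": "Beta", "contact": "Cal"},
]

def left_join(base, other, key):
    # One output row per (base row, match) pair: multiple matches
    # multiply base rows, breaking the original grain.
    return [{**b, **o} for b in base for o in other if b[key] == o[key]]

def lookup_join(base, other, key):
    # At most one match applied per base row: grain and row count preserved.
    first_match = {}
    for o in other:
        first_match.setdefault(o[key], o)
    return [{**b, **first_match.get(b[key], {})} for b in base]

print(len(left_join(opportunities, contacts, "account")))    # 3 rows
print(len(lookup_join(opportunities, contacts, "account")))  # 2 rows
```

Starting from two Opportunity rows, the left join emits three rows while the lookup join still emits two, which is exactly the property that keeps validation tractable at scale.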
We must be careful not to be fundamentalist about lookup joins, however. There are times and places for every join type. In fact, my team recently needed to use a left join when augmenting one junction object onto another. For this use case, we were creating a new grain based on the two junction inputs, so preserving either input grain or row count was less important than making sure we captured all the various combinations of record relationships. Please keep in mind this was an exceptional case, and much more complicated than the vast majority of everyday joins. A good Analytics builder will know why this kind of general rule is important, but can also identify when it doesn’t apply.
3. Compute derived fields at their matching grain.
Recently I was building a number of rollups where we were flagging Attendees of Activities and aggregating those flags to classify the parent record based on who had attended related Activities. Our client had requested that we bucket a number of Attendee Roles to simplify the classification, and in our initial update, we computed that bucketing late in the Recipe processing, after our Attendees had already been joined with their Activities and their parent records. When we later ran into issues with this bucketing, we found we had created extra challenges for ourselves: although the actual bucketing logic referred to Attendee Roles, applying that logic after the Attendee, Activity, and parent grains had been joined together obscured the source of the logic, making it harder to troubleshoot, maintain, or extend. To resolve this, we moved the Attendee Role bucketing much earlier in our Recipe processing, at the Attendee grain. This change made it much easier to see the source of the logic and change it as needed.
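A minimal Python sketch of the fix (the role names and bucket mapping are hypothetical): the derived field is computed at the Attendee grain first, and only then rolled up toward the parent grain:

```python
# Hypothetical Role-to-bucket mapping requested by the client.
ROLE_BUCKETS = {
    "Decision Maker": "Executive",
    "Economic Buyer": "Executive",
    "Evaluator": "Technical",
    "Influencer": "Technical",
}

# Attendee-grain rows: this is where the Role field actually lives.
attendees = [
    {"attendee_id": "A1", "activity_id": "ACT1", "role": "Decision Maker"},
    {"attendee_id": "A2", "activity_id": "ACT1", "role": "Evaluator"},
]

# Compute the derived field at its matching grain, before any joins.
for a in attendees:
    a["role_bucket"] = ROLE_BUCKETS.get(a["role"], "Other")

# Only afterwards roll the buckets up to the Activity grain.
activity_buckets = {}
for a in attendees:
    activity_buckets.setdefault(a["activity_id"], set()).add(a["role_bucket"])

print(sorted(activity_buckets["ACT1"]))  # ['Executive', 'Technical']
```

Because the bucketing logic lives next to the field it reads, a change to the role mapping touches exactly one place, before any join has mixed the grains together.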
What’s interesting here is that the matching grain for a given derived field isn’t always obvious, another of the many reasons why it’s so important for us to have an instinctive understanding of our working data model. In the example above, the business needs are in the realm of identifying Opportunities or Accounts, but the actual Attendee identification and flagging happens at the User or Contact level.
We can get a sense of whether we might be computing derived fields at the wrong grain by noticing whether our formulas reference a lot of fields whose API name prefixes come from joins. If they do, we should look to earlier branches in our Recipe and consider moving the formula computation to a finer grain. This is one of those small changes in the way we build that strongly influence the robustness and performance of our implementation as it reaches full scale.
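That heuristic can even be sketched mechanically. Here’s a rough Python illustration (the formula strings and the dot-prefix convention are assumptions for the sake of the example, not any Analytics API) that measures how many of a formula’s field references carry a join prefix:

```python
import re

def join_prefixed_ratio(formula: str) -> float:
    """Fraction of field references in a formula that carry a join prefix.

    A high ratio suggests the formula might belong at an earlier,
    finer grain, before the join that introduced the prefixes.
    """
    # Grab identifier-like tokens, allowing dotted prefixes.
    fields = re.findall(r"[A-Za-z_][\w.]*", formula)
    if not fields:
        return 0.0
    prefixed = [f for f in fields if "." in f]
    return len(prefixed) / len(fields)

print(join_prefixed_ratio("Account.Employees + Account.Revenue"))  # 1.0
print(join_prefixed_ratio("Amount * 2"))                           # 0.0
```

A ratio near 1.0 is the code-level version of the smell described above: everything the formula reads arrived via a join, so the computation probably wants to move upstream.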
I can’t overstate how key this idea is. If we understand grain in an Analytics context, we can derive from that understanding robust data structure strategies with the kind of consistency we need to have confidence in our row-level and aggregated numeric values.
We’ll continue this sub-series on Lessons of Enterprise Scale Analytics next time with a deep dive into both how visualizations relate to larger Analytics strategies and how we can supercharge our ETL processes to maximize dashboard query performance and maintainability.