Aggregation reduces the amount of data by combining events into thread groups and into function groups.
A striking example of the benefit of thread groups is a parallel code that runs on a cluster of SMP systems. In fact, this scenario was the inspiration for introducing the concept. To analyze the behavior of such an application, check whether the achieved data transfer rate is plausible with respect to the expected data rate (possibly a fraction of the advertised data rate). Of course, the effective and expected data transfer rates differ for messages that travel inside an SMP node (intra-node) and between two SMP nodes (inter-node).
In the Intel Trace Analyzer selecting Aggregation into the predefined process group All_Nodes is enough to make the distinction between intra-node and inter-node messages very easy: in the Message Profile the values for the intra-node messages appear on the diagonal of the matrix.
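The effect of aggregating into node groups can be illustrated outside the tool with a minimal Python sketch. It does not use any Intel Trace Analyzer API; the rank-to-node mapping `NODE_OF`, the message volumes in `MESSAGES`, and the helper `aggregate_by_group` are hypothetical names and data invented for this illustration.

```python
from collections import defaultdict

# Hypothetical mapping of MPI ranks to SMP nodes (two ranks per node).
NODE_OF = {0: "node0", 1: "node0", 2: "node1", 3: "node1"}

# Hypothetical per-rank message volumes in bytes: (sender, receiver) -> bytes.
MESSAGES = {
    (0, 1): 4000,
    (1, 0): 4000,
    (0, 2): 1000,
    (3, 1): 500,
    (2, 3): 8000,
}


def aggregate_by_group(messages, group_of):
    """Sum message volumes per (sender group, receiver group) pair."""
    matrix = defaultdict(int)
    for (src, dst), nbytes in messages.items():
        matrix[(group_of[src], group_of[dst])] += nbytes
    return dict(matrix)


node_matrix = aggregate_by_group(MESSAGES, NODE_OF)

# Diagonal entries (same sender and receiver group) are intra-node traffic,
# off-diagonal entries are inter-node traffic.
intra_node = {k: v for k, v in node_matrix.items() if k[0] == k[1]}
inter_node = {k: v for k, v in node_matrix.items() if k[0] != k[1]}
```

After aggregation the matrix has one row and one column per node instead of one per rank, so the intra-node values are exactly the entries whose row and column labels coincide.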
Note that selecting a process group generally results in displaying the information for the group's children (with the notable exception of the function profile). That is the reason why single, unthreaded processes or single threads cannot be selected for aggregation.
So far, only a hierarchy with two levels has been considered, but more complicated hierarchies are useful too: threads living on the same core (due to Hyper-Threading), threads living on different cores in the same CPU, threads living on the same FSB in different CPUs, threads living in the same SMP box on different FSBs, threads living in different boxes connected by a faster interconnect, threads living in SMP boxes connected by a slower interconnect, and so on. These considerations suggest allowing for deeply nested thread groups.
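A nested hierarchy of this kind can be pictured as a path per thread, from the outermost level to the innermost one. The following Python sketch is only an illustration under assumed data: the three levels in `LEVELS`, the thread placement in `LOCATION`, and the function `shared_level` are hypothetical and are not part of the Intel Trace Analyzer.

```python
# Hypothetical hierarchy levels, ordered from outermost to innermost.
LEVELS = ("box", "cpu", "core")

# Hypothetical placement of five threads in that hierarchy.
LOCATION = {
    0: ("box0", "cpu0", "core0"),
    1: ("box0", "cpu0", "core0"),  # same core as thread 0
    2: ("box0", "cpu0", "core1"),  # same CPU, different core
    3: ("box0", "cpu1", "core0"),  # same box, different CPU
    4: ("box1", "cpu0", "core0"),  # different box
}


def shared_level(a, b):
    """Return the innermost level that threads a and b have in common.

    Returns "interconnect" if the two threads do not even share a box.
    """
    level = "interconnect"
    for name, part_a, part_b in zip(LEVELS, LOCATION[a], LOCATION[b]):
        if part_a != part_b:
            break
        level = name
    return level
```

With such a description, every prefix of a path names one nested thread group, so messages between two threads can be attributed to the innermost group that contains both of them.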
If the thread group representing a single node is selected to concentrate on intra-node effects, the analysis might appear to be slower, not faster, than using the thread group All_Processes alone. Why is that? The reason is twofold. First, the Intel Trace Analyzer does not have to do any aggregation for the thread group All_Processes since it is flat (assuming no threads are used). Second, even though only a single SMP node is chosen, all other threads still go through the analysis and are collected in the artificially created thread group Other. Click on -> -> to make this group visible. To speed things up, choose a filter that lets only the threads of the selected SMP node pass. Note: Filtering and Aggregation are orthogonal mechanisms in the Intel Trace Analyzer.
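The difference between aggregating alone and filtering before aggregating can be sketched in a few lines of Python. This is a conceptual model only, not the tool's implementation; the event list `EVENTS`, the mapping `NODE_OF`, and the functions `aggregate_time` and `filter_events` are hypothetical.

```python
# Hypothetical mapping of threads to SMP nodes.
NODE_OF = {0: "node0", 1: "node0", 2: "node1", 3: "node1"}

# Hypothetical events: (thread, duration in microseconds).
EVENTS = [(0, 30), (1, 20), (2, 40), (3, 10)]


def aggregate_time(events, group_of, selected):
    """Sum durations per selected group; everything else lands in "Other"."""
    totals = {}
    for thread, duration in events:
        group = group_of[thread]
        key = group if group in selected else "Other"
        totals[key] = totals.get(key, 0) + duration
    return totals


def filter_events(events, group_of, selected):
    """Keep only the events of threads that belong to a selected group."""
    return [(t, d) for t, d in events if group_of[t] in selected]


# Aggregation alone: all events are still processed, the rest goes to "Other".
aggregated_only = aggregate_time(EVENTS, NODE_OF, {"node0"})

# Filter first, then aggregate: only the events of node0 are processed at all.
filtered = filter_events(EVENTS, NODE_OF, {"node0"})
filtered_then_aggregated = aggregate_time(filtered, NODE_OF, {"node0"})
```

In this model, aggregation decides how results are grouped while the filter decides which events are considered at all, which is why the two can be combined freely.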
Aggregation into function groups lets you decide at what level of detail to look at the activity of a thread or thread group. In many cases it might be enough to see that a code spends a certain percentage of its time in MPI without knowing in which particular function. (In some cases, optimizing the serial parts of the program might seem more rewarding than optimizing the communication structure.)
However, if the fraction spent in MPI exceeds the expectation, then it is interesting to know in which particular MPI call the time was spent. Function grouping allows exactly this shift in perspective by ungrouping the function group MPI.
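The shift in perspective obtained by ungrouping a function group can be sketched as follows. The function names, the timings in `TIMES`, the grouping in `FUNCTION_GROUP`, and the helper `profile` are hypothetical and serve only to illustrate the idea; they do not reflect how the Intel Trace Analyzer stores its data.

```python
# Hypothetical assignment of functions to function groups.
FUNCTION_GROUP = {
    "MPI_Send": "MPI",
    "MPI_Recv": "MPI",
    "MPI_Barrier": "MPI",
    "solve": "Application",
    "assemble": "Application",
}

# Hypothetical time spent per function, in seconds.
TIMES = {
    "MPI_Send": 12,
    "MPI_Recv": 30,
    "MPI_Barrier": 8,
    "solve": 40,
    "assemble": 10,
}


def profile(times, group_of, ungrouped=frozenset()):
    """Sum times per function group; members of ungrouped groups stay separate."""
    result = {}
    for func, seconds in times.items():
        group = group_of[func]
        key = func if group in ungrouped else group
        result[key] = result.get(key, 0) + seconds
    return result


# Coarse view: only the totals for MPI and Application.
coarse = profile(TIMES, FUNCTION_GROUP)

# Detailed view: MPI is ungrouped into its individual calls.
detailed = profile(TIMES, FUNCTION_GROUP, ungrouped={"MPI"})
```

In the coarse view MPI appears as a single entry; after ungrouping, the same total is split across the individual MPI calls while the Application group stays aggregated.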
While the argument given in Section 9.2.1 for having nested thread groups may not be that compelling, the reason for having nested function groups becomes quite clear as soon as nested modules, classes, and/or namespaces occur. It becomes even clearer if the binary instrumentation feature of Intel Trace Collector is used, because the result is thousands of instrumented functions.
Provided that there are adequate function groups, it is also much easier to categorize code by library or by author. In this way, it is possible to concentrate on precisely the code that is considered tunable while code that is controlled by third parties is aggregated into coarse categories.
Note that selecting a function group generally results in displaying the information for the group's children. That is the reason why single functions cannot be selected for aggregation.
In the case of timelines, certain events may not be visible at all times. This does not necessarily mean that they are not there. The reason is that the finer your grouping is, the less time is spent in each individual function or group. When aggregating over processes in a time interval where some processes are idle, nothing is displayed because of the idle state of the processes. Zooming in helps to see these events better. For a better overview, check one of the corresponding profiles. If the timeline and the profile seem to contradict each other, then the information from the profile is more precise.