This section provides a quick start to the Intel Trace Analyzer. It demonstrates essential features of the program using the example trace files poisson_sendrecv.single.stf and poisson_icomm.single.stf, which are available in the Intel Trace Analyzer's examples directory.
The traces were generated with two implementations of the same algorithm computing the same result over the same data set: a Poisson solver for a linear equation system. As the names imply, the first version uses MPI_Sendrecv to communicate, while the second version uses non-blocking communication.
This section illustrates how the first version leads to an overall serialization of the parallel algorithm and how the improved version solves the problem. Figure 1-2 shows the Intel Trace Analyzer after loading poisson_sendrecv.single.stf. The figure shows a main window and a child window, a so-called View. The View contains a Function Profile which shows that the program spent nearly all of its time in MPI.
When an Event Timeline is opened (see Section 4.1), a time scale appears above the Timeline in the View. Figure 1-4 shows a View containing a time scale, an Event Timeline and a Function Profile. These diagrams are called Charts (see Chapter 4). The status bar (see Section 3.2) found at the bottom of the View shows the current time interval and some other information.
The setup of the parallel processes seems to take most of the run time. Zoom into the interesting area on the right edge of the Event Timeline by dragging the mouse over the desired time interval with the left mouse button pressed, as shown in Figure 1-5. The result should look like Figure 1-6. Note the apparent iterative nature of the application.
Now zoom further into the trace to look at a single iteration and close the Function Profile. The result should look like Figure 1-7.
To see which particular MPI functions are used in the program, right-click on MPI in the Event Timeline and choose Ungroup Group MPI. The result should look like Figure 1-8. The Function Aggregation of the View changes so that the MPI functions are no longer aggregated (see Section 9.2) into the Function Group MPI but are shown individually. This is reflected in the status bar: the button titled Major Function Groups changes to MPI expanded in (Major Function Groups). Click this button to open the Function Group Editor (see Section 5.4), which enables you to create new function groups and to switch between them.
It is apparent that at the start of the iteration the processes communicate with their direct neighbors using MPI_Sendrecv. The way this data exchange is implemented shows a clear disadvantage: process i does not exchange data with its neighbor i+1 until the exchange between i-1 and i is complete. This makes the first MPI_Sendrecv block look like a staircase. The second block is already deferred and hence does not show the same effect. The MPI_Allreduce at the end of the iteration nearly resynchronizes all processes. The net effect is that the processes spend most of their time waiting for each other.
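The dependency chain described above can be sketched with a toy timing model (illustrative only; the process count and exchange cost are hypothetical values, not taken from the trace):

```python
# Toy model of the serialized halo exchange: process i cannot finish
# its MPI_Sendrecv with its lower neighbor until the exchange between
# ranks i-1 and i-2 has completed, producing the staircase shape seen
# in the Event Timeline. All numbers here are hypothetical.

def staircase_finish_times(num_procs, exchange_cost=1.0):
    """Finish time of the first exchange on each process when the
    exchanges form a dependency chain."""
    finish = [0.0] * num_procs
    for i in range(1, num_procs):
        # rank i starts its exchange only after rank i-1 has finished
        finish[i] = finish[i - 1] + exchange_cost
    return finish

print(staircase_finish_times(4))  # → [0.0, 1.0, 2.0, 3.0]
```

The linear growth of the finish times with rank mirrors the staircase shape of the first MPI_Sendrecv block in the Event Timeline.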
Looking at the status bar (see Figure 1-8) shows that one iteration is roughly 1.4 milliseconds long.
Figure 1-9 shows a View with an Event Timeline, two Function Profiles with their Load Balance tab, and a Message Profile that reveals the asymmetric pattern in the point-to-point messages.
Note that the time spent in MPI_Sendrecv grows with the process number, while the time for MPI_Allreduce decreases. The Message Profile (Section 4.6) in the bottom right corner of Figure 1-9 shows that messages traveling from a higher rank to a lower rank take more and more time with increasing rank, while the messages traveling from a lower rank to a higher rank reveal a weak even-odd pattern.
Because poisson_sendrecv.single.stf is such a striking example of serialization, nearly all of the Charts provided by the Intel Trace Analyzer reveal this pattern. In real-world cases, however, it may be necessary to formulate a hypothesis about how the program should behave and to check that hypothesis using the most suitable Chart.
A possible way to improve the performance of the program is to replace MPI_Sendrecv with non-blocking communication and thereby avoid the serialization. One iteration of the resulting program looks like the one shown in Figure 1-10. Note that a single iteration now takes about 0.9 milliseconds, whereas it took about 1.4 milliseconds before the change.
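Why the non-blocking version helps can be made concrete with a simple cost model (a hypothetical sketch, not derived from the trace data): chained exchanges cost roughly one exchange per rank, while overlapped exchanges cost roughly one exchange in total.

```python
# Hypothetical cost model comparing the two implementations;
# exchange_cost is an assumed per-exchange latency, not a measurement.

def serialized_step_time(num_procs, exchange_cost=1.0):
    # MPI_Sendrecv version: each rank waits for its lower neighbor,
    # so the last rank finishes after num_procs - 1 exchange costs
    return (num_procs - 1) * exchange_cost

def overlapped_step_time(num_procs, exchange_cost=1.0):
    # non-blocking version: all neighbor exchanges proceed concurrently,
    # so one exchange cost covers the whole step
    return exchange_cost

print(serialized_step_time(8), overlapped_step_time(8))  # → 7.0 1.0
```

The model is deliberately crude: it ignores the MPI_Allreduce and any computation, but it captures why the serialized step time grows with the process count while the overlapped step time stays flat.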
To compare two trace files, the Intel Trace Analyzer offers the so-called Comparison View (see Chapter 6). Open it from the menu of the View showing poisson_sendrecv.single.stf. In the dialog that appears, choose another View that shows poisson_icomm.single.stf. A new Comparison View is opened that shows an Event Timeline for each file and a Comparison Function Profile Chart (see Section 6.2.1) that shows a profile computed from both trace files. The time intervals, aggregation settings and filters are taken from the original Views.
After adjusting some configurable behavior (see Chapter 6) and zooming to the first iteration in each trace file, a Comparison View for these two files can look like Figure 1-11. It is now immediately obvious that one iteration in the improved program needs considerably less time than in the original program.
Note that you can create a Comparison View that compares one time interval and one process group against another time interval and another process group of the same trace file. Other useful applications of the Comparison View include scalability analysis, where you compare two runs of the same unmodified program with different processor counts and try to find out which functions scale well and which suffer from Amdahl's Law. Other scenarios include comparing two different MPI libraries, interconnects or machines.
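For such a scalability analysis, the upper bound on the achievable speedup follows Amdahl's Law. A minimal sketch (the serial fraction is an assumed input here, not something the Trace Analyzer reports directly):

```python
def amdahl_speedup(serial_fraction, num_procs):
    """Ideal speedup for a program in which serial_fraction of the
    work cannot be parallelized (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)

# even a 10% serial fraction caps the speedup well below the
# processor count
print(round(amdahl_speedup(0.1, 8), 2))  # → 4.71
```

Functions whose measured speedup falls well short of this bound are the natural candidates to examine in the Comparison View.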
Of course this introduction only scratches the surface. If there is not enough time to browse through the whole documentation, it is recommended to at least read through Chapter 9 to learn about features like filtering, tagging, process aggregation and function aggregation. These features have the potential to make analyzing parallel applications more efficient.