Software Debugging and Monitoring for Heterogeneous Many-Core Telecom Systems
This project is decomposed into six sub-projects, detailed on this page. Tracks 1 to 5 are part of the base NSERC Collaborative Research and Development project, while track 6 is to be financed by Prompt:
- Tracing, Debugging and Profiling Mechanisms and Architecture on Many-Core Systems
- Cloud Debugging and Monitoring
- Advanced Analysis
- Parallel and Incremental Analysis
- Model-Driven Engineering Support
- Case Studies
Tracing, Debugging and Profiling Mechanisms and Architecture on Many-Core Systems
This project will examine the common instrumentation and data collection needs of different tracing, debugging and profiling tools. In addition, new many-core architectures will be studied in order to propose efficient mechanisms and interactive tools to monitor these systems. This project has five research axes:
- Research on the architecture and performance aspects relevant to tracing of different heterogeneous many-core processors, such as Adapteva’s Parallella board with its 64-core Epiphany chip, and Kalray’s MPPA MANYCORE processors containing 256 cores.
- Examine the new Intel Processor Trace (PT) extensions, with a focus on the bare-metal platforms with thousands of processors used in Telecom equipment for tasks such as packet and baseband processing. These custom heterogeneous many-core processors are crucial for obtaining the desired level of performance, reliability and power efficiency. They resemble GPGPU processors in their highly parallel architecture but are optimized for networking rather than graphics tasks.
- Examine the architectural peculiarities of the many-core systems studied, and propose a tool architecture suitable for these large-scale parallel systems. Efficient algorithms are required both for controlling and for interacting with the instrumentation and tracing hardware of these heterogeneous many-core systems. Based on early prototyping with CoreSet commands in GDB, the whole interaction with systems containing 256 cores, and soon many more, needs to be rethought (see the GDB sketch after this list). Similarly, improved algorithms are required for efficient, low-disturbance data collection. On many-core systems, the data communication infrastructure rapidly becomes a bottleneck; when instrumentation traffic is superimposed on the same communication channels, it becomes very difficult to maintain a low level of disturbance.
- Research on the tool architecture from a different point of view: sharing data collection mechanisms between the different levels (kernel, user-space, bare metal, Java Virtual Machine and Python runtime) and monitoring applications such as tracers (LTTng, Ftrace, Perf), debuggers (GDB), profilers (Perf, OProfile) and specialized tools (Address/Thread/Heap Sanitizer). Existing mechanisms for static and dynamic instrumentation will be revisited with a special emphasis on scalability. At the instrumentation point, efficient handlers must be available to be called for verifying conditions, aggregating values, collecting data, and for more specialized tasks such as extracting a stack dump or scanning roots (global and local variables) to verify memory object reachability. A significant challenge is to propose new low-overhead, thread-safe algorithms to perform these various tasks (see the aggregation sketch after this list). Different strategies will be explored to install efficient handlers for those tasks; the handlers can be provided as bytecode with just-in-time compilation or as pre-compiled native code.
- Setup and support of these specialized hardware and software platforms to test the proposed algorithms on some of their bare-metal platforms, and integration of the proposed mechanisms in GDB and LTTng. [With the help of Ericsson and EfficiOS]
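The following sketch illustrates the core-set style of interaction discussed in the third axis, using GDB’s Python scripting API to apply one command to a whole range of threads. The `core-apply` command name and its range syntax are our own invention for illustration; on a real heterogeneous target the sets would designate cores rather than host-side threads, and the operations would have to be grouped and batched to scale past a few hundred cores.

```python
# Sketch of a core-set style command for GDB, using its Python API.
# Hypothetical: the "core-apply" name and range syntax are ours; real
# core-set support would address target cores, not just host threads.
import gdb

class CoreApply(gdb.Command):
    """core-apply FIRST-LAST CMD -- run CMD on each thread in the range."""

    def __init__(self):
        super().__init__("core-apply", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        args = arg.split(None, 1)
        if len(args) != 2:
            raise gdb.GdbError("usage: core-apply FIRST-LAST COMMAND")
        first, last = (int(n) for n in args[0].split("-"))
        command = args[1]
        saved = gdb.selected_thread()
        for thread in gdb.selected_inferior().threads():
            if first <= thread.num <= last:
                thread.switch()                 # make this thread current
                gdb.write("--- thread %d ---\n" % thread.num)
                gdb.execute(command, from_tty=False)
        if saved is not None and saved.is_valid():
            saved.switch()                      # restore the selection

CoreApply()                                     # register the command
```

Once loaded with GDB’s `source` command, `core-apply 0-63 bt` would collect a backtrace from threads 0 to 63; the strictly sequential loop is precisely what needs rethinking at 256 cores and beyond.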
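The fourth axis asks for low-overhead, thread-safe handlers. The sketch below conveys the per-thread aggregation structure in Python, purely to illustrate the design: the fast path touches only thread-local data, and a shared lock is taken once per thread for registration and on the rare reader-side merge. A production handler would be pre-compiled or JIT-compiled native code using per-CPU buffers and atomics.

```python
# Sketch of a low-disturbance aggregating handler: each thread writes
# only to its own slot, so the instrumented fast path takes no lock.
import threading

class PerThreadCounter:
    """Aggregating handler with per-thread slots and a rare merge."""

    def __init__(self, event_names):
        self._names = tuple(event_names)
        self._local = threading.local()
        self._slots = []                        # one slot dict per thread
        self._registry_lock = threading.Lock()  # registration/merge only

    def _slot(self):
        slot = getattr(self._local, "slot", None)
        if slot is None:                        # first event on this thread
            slot = self._local.slot = {name: 0 for name in self._names}
            with self._registry_lock:
                self._slots.append(slot)
        return slot

    def handle(self, event_name):
        """Called at the instrumentation point; no shared lock taken."""
        self._slot()[event_name] += 1

    def snapshot(self):
        """Reader side: merge all per-thread slots (keys are fixed, so
        concurrent increments only make counts momentarily stale)."""
        with self._registry_lock:
            slots = list(self._slots)
        return {name: sum(s[name] for s in slots) for name in self._names}
```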
Cloud Debugging and Monitoring
This project will study multi-level execution models in general, and in particular in the context of OpenStack with virtual machine migration. It has four research axes:
- Research the architecture of the OpenStack Cloud framework and propose different data collection, analysis and monitoring activities. Some of the parameters already monitored at a high level with OpenStack’s Tomograph will be reused. The parameters to monitor include resource usage (power consumption, CPU, disk and network usage) and various latencies (query response time, contention among virtual machines, migration latency). The computation of these latencies requires correlating information from several nodes. For instance, information about virtual machine migration, including the bandwidth required and the periods of performance degradation and non-availability, requires tracing information from user-space and kernel level on both the origin and destination hosts (see the sketch after this list).
- Research on the architecture and organization of Software Defined Networks, such as those managed with OpenDaylight, overlaid on top of the physical networks. The collected data will provide information about physical and logical networks and their current usage, helping to diagnose networking problems or possible violations of Terms of Service concerning the bandwidth available to specific client virtual machines. Discussions will be held with Ericsson support engineers to learn more about the most difficult problems they encounter in the field, and the metrics and views that could be extracted from the virtual and physical network layers in order to facilitate the monitoring and diagnosis of such problems.
- Propose efficient architectures and algorithms to collect and analyse information coming from several levels in the execution model, from physical machines to virtual machines (e.g. KVM) and to user-space, bare metal, Java Virtual Machine (JVM) and Python runtime information, extending our earlier work. The proposed architecture must efficiently support the analysis of mobile virtual machines, migrating from one physical machine to another, and must scale to large clusters and clouds. Different techniques for scaling will be examined, including hierarchical aggregation, dynamic selection of the level of detail, and sampling based on time (one request every n) or node (one node every n similar nodes).
- Setup and support of the OpenStack cloud environment and OpenDaylight Software Defined Network. [With the help of Ericsson to provide access to an internal test Cloud, already set up for another research project between Ericsson and École de technologie supérieure’s professor Mohamed Cheriet]
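As a minimal sketch of the cross-node latency correlation described in the first axis, the function below pairs hypothetical migration events from the origin and destination host traces to derive total migration latency and downtime per virtual machine. The event names and fields are invented for illustration; real input would be LTTng kernel and user-space events with clocks synchronized across hosts.

```python
# Sketch: correlate hypothetical migration events from the origin and
# destination host traces to compute per-VM migration latency and
# downtime. Event names/fields are invented; real input would be LTTng
# events with clocks synchronized across hosts.
from collections import defaultdict

def migration_latencies(origin_events, destination_events):
    """Events are dicts such as {"ts": ns, "name": ..., "vm": ...}."""
    starts, pauses, latencies = {}, {}, defaultdict(dict)
    for ev in sorted(origin_events, key=lambda e: e["ts"]):
        if ev["name"] == "migration_start":
            starts[ev["vm"]] = ev["ts"]
        elif ev["name"] == "vm_paused":        # non-availability begins
            pauses[ev["vm"]] = ev["ts"]
    for ev in sorted(destination_events, key=lambda e: e["ts"]):
        if ev["name"] == "vm_resumed" and ev["vm"] in starts:
            vm = ev["vm"]
            latencies[vm]["total_ns"] = ev["ts"] - starts[vm]
            if vm in pauses:                   # downtime seen by clients
                latencies[vm]["downtime_ns"] = ev["ts"] - pauses[vm]
    return dict(latencies)
```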
Advanced Analysis
This project addresses the problem of advanced trace analysis, identifying patterns and metrics of higher-level
behavior such as performance problems, contention or simply normal activities, in the context of specific
programming models (e.g., OpenCL, OpenMP, MPI...). It has five research axes:
- Research different programming models and examine how to relate trace information to programming constructs, and extract timing and other performance properties from execution traces, to be used in profiles and high-level models. For example, OpenCL delegates computation to heterogeneous processors, typically GPGPUs, sending requests and receiving results asynchronously. The tracing and profiling information generally arrives asynchronously as well, and must be synchronized and reconnected with the OpenCL constructs. This makes it possible to display a timeline of the parallel execution of the computation on the massively parallel SIMD cores, and a representation of the request queues (see the OpenCL sketch after this list). Interestingly, many special-purpose heterogeneous processors used in Telecom equipment have the same level of parallelism and asynchrony as GPGPUs. The work will thus target both types of devices, the GPGPUs being easier to access and experiment with initially.
- Concentrate on advanced analysis modules and views related to resource consumption, such as memory usage, input/output, and power consumption. In each of these cases, the system architecture and available hardware have evolved considerably, and suitable monitoring data collection and special-purpose views must be devised. Memory usage in the presence of virtualization and page merging is much more difficult to assess, and views built with system-wide information can help detect non-optimal usage patterns. Input/output devices, shared among numerous cores, can easily become a bottleneck. Furthermore, the type of analysis must be adapted to the underlying medium, for example rotating disks, solid-state storage or remote virtual storage. Power consumption can now be studied more efficiently with new registers for power consumption measurement. In addition, advanced scheduling algorithms can significantly impact the potential power savings by grouping interrupts and minimizing the number of wakeups from sleep mode.
- Pursue the previous work on using the modeled state tree for matching patterns of high-level behavior for specific programming models. The student will propose an organization to provide pattern dictionaries that can automatically be matched, simultaneously and efficiently, against large execution traces (see the pattern-matching sketch after this list). By using the state history tree to store the pattern state, the proposed mechanism will allow for easy navigation through the trace, and debugging of the patterns, understanding how they were matched to the trace under study.
- Research on trace correlation, building upon earlier work in this area and benefiting from the modeled state tree and state history tree. Two different use cases for trace correlation will be examined in particular. The first is when the same software package is used in two different contexts, one being considered a correct execution and the second labeled as problematic. In that case, the correlation should highlight the differences, which are likely related to the reported problem. The second use case is when two different versions of a software package are executed in the same context. Any difference, good or bad, is likely to be related to the differences highlighted by the correlation between the two traces.
- Test the proposed algorithms on industrial problems, provide feedback for refinement and optimisation purposes, and finally integrate the work into Trace Compass. [With the help of Ericsson]
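As an illustration of the first axis, the following minimal PyOpenCL sketch shows the raw material for an OpenCL timeline view: every enqueued command returns an event carrying device-clock timestamps (queued, submitted, started, ended), which must then be synchronized with the host-side trace timeline. The kernel and buffer sizes are arbitrary choices for this example.

```python
# Minimal PyOpenCL sketch: each enqueued command yields an event whose
# device-clock timestamps (queued/submit/start/end) are the raw material
# for an OpenCL timeline view, once synchronized with the host trace.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

prg = cl.Program(
    ctx, "kernel void twice(global float *a) { a[get_global_id(0)] *= 2; }"
).build()

host = np.arange(1 << 20, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=host)

evt = prg.twice(queue, host.shape, None, buf)   # asynchronous enqueue
evt.wait()
print("time in queue:", evt.profile.start - evt.profile.queued, "ns")
print("execution:    ", evt.profile.end - evt.profile.start, "ns")
```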
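The pattern-dictionary axis can be sketched as a set of small finite-state patterns advanced simultaneously in a single pass over the event stream. The event shapes and the example pattern below are invented; in Trace Compass, the pattern state would be stored in the state history tree rather than in Python objects, which is what enables navigating back to how a match was obtained.

```python
# Sketch: a dictionary of finite-state patterns advanced simultaneously
# in one pass over the trace. Steps are matched in order; non-matching
# events are simply ignored (interleaving is allowed).
class Pattern:
    def __init__(self, name, steps):
        self.name, self.steps = name, steps   # steps: list of predicates
        self.pos, self.start_ts = 0, None

    def feed(self, ev):
        """Advance on a matching event; return (start, end) on completion."""
        if self.steps[self.pos](ev):
            if self.pos == 0:
                self.start_ts = ev["ts"]
            self.pos += 1
            if self.pos == len(self.steps):
                span, self.pos = (self.start_ts, ev["ts"]), 0
                return span
        return None

def match_all(patterns, events):
    matches = []
    for ev in events:                 # single pass, all patterns at once
        for p in patterns:
            span = p.feed(ev)
            if span:
                matches.append((p.name, span))
    return matches

# Example entry of a pattern dictionary (event names are invented).
blocked_read = Pattern("blocked read", [
    lambda e: e["name"] == "syscall_entry_read",
    lambda e: e["name"] == "sched_switch_out",
    lambda e: e["name"] == "sched_switch_in",
    lambda e: e["name"] == "syscall_exit_read",
])
```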
Parallel and Incremental Analysis
This project examines the performance of trace analysis tools and the scalability to multiple cores. Doctoral
student T4D1 will study the problem of constructing a model of the traced system incrementally despite
missing information, because of lost events, or missing initial values not present in the trace or not yet
computed by other parallel analysis tasks. He will then propose new algorithms to tolerate the missing
information, determine which state information can still be computed, and label attributes with missing or
uncertain values. The missing or uncertain information may be corrected later using subsequent information, or information from parallel analysis tasks that become available at a later time. This will enable a
more robust modeled state computation algorithm. It will also facilitate the decoupling of the state computation for different time interval sections of the trace, allowing the otherwise sequential processing to
be efficiently performed in parallel. This project has three research axes:
- Provide a framework to measure the interactive analysis performance of the Tracing and Monitoring Framework. He will then characterize the scalability of the algorithms in terms of different parameters, such as the trace duration, the degree of overlap for the state intervals, and the model attribute tree size. He will study the effect of different possible trade-offs, such as generating a partial state history tree, trading storage and bandwidth for computation costs, and propose a more efficient configuration and architecture, pipelining different tasks in parallel threads.
- Propose new algorithms to split the state history database along different dimensions that can be computed in parallel and stored separately, such as separate attribute subtrees, subtrees for different resources, or separate time intervals. The challenge for time intervals is to ensure that a sufficient proportion of states can be computed without the initial state values at the beginning of the interval, such that the parallel computation can be performed efficiently, minimizing the need to later fix up the missing information once the state for earlier intervals is computed (see the sketch after this list). In addition, the disk-based format for the state history tree will need to be extended in order to allow for the parallel computation of the state for different sections.
- Test, integration and performance analysis of the proposed algorithms on large industrial problems. [With the help of Ericsson]
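A minimal sketch of the per-interval parallel computation with missing initial values: each interval is analysed with no initial state, recording which attributes were read before being set, and a cheap sequential pass then resolves them from earlier intervals. The event and state model is deliberately simplified; values that remain "?" correspond to genuinely missing information, as when events are lost.

```python
# Sketch: per-interval state computation in parallel despite missing
# initial values. Each interval is analysed with no initial state; reads
# that hit an unset attribute are recorded and resolved sequentially.
from concurrent.futures import ProcessPoolExecutor

UNKNOWN = "?"   # attribute value that is genuinely not recoverable

def analyze_interval(events):
    """First parallel pass: events are (ts, attr, value) assignments,
    or (ts, attr, "READ") queries that need the current value."""
    state, unresolved = {}, []
    for ts, attr, value in events:
        if value == "READ":
            if attr not in state:          # initial value unknown locally
                unresolved.append((ts, attr))
        else:
            state[attr] = value
    return state, unresolved               # final state == interval delta

def parallel_state(chunks):
    with ProcessPoolExecutor() as pool:    # intervals analysed in parallel
        results = list(pool.map(analyze_interval, chunks))
    running, resolved = {}, []
    for delta, unresolved in results:      # cheap sequential fix-up pass
        resolved += [(ts, attr, running.get(attr, UNKNOWN))
                     for ts, attr in unresolved]
        running.update(delta)
    return running, resolved

if __name__ == "__main__":
    chunks = [[(1, "cpu0", "user"), (2, "cpu1", "READ")],
              [(3, "cpu1", "kernel"), (4, "cpu0", "READ")]]
    # cpu1 at ts=2 stays "?" (missing initial value); cpu0 at ts=4 is
    # fixed up to "user" from the earlier interval.
    print(parallel_state(chunks))
```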
Model-Driven Engineering Support
This project will focus on the development of techniques to improve model quality using run-time information, thus dealing with the necessary information exchange between the modeling and the run-time levels. This project will also focus on the use of techniques from Model-Based Testing (MBT) to generate test
cases that allow the collection of sufficiently representative run-time information. All implementation work in this project will use Papyrus, an Eclipse-based open-source MDE tool environment with state-of-the-art support for the Unified Modeling Language (UML) and related notations such as MARTE, a UML profile for the Modeling and Analysis of Real-Time and Embedded Systems. It has four research axes:
- Extend Papyrus to allow for the appropriate model-level display of user-specified run-time information such as attribute values, current states, transitions taken and messages sent.
- Design, implement, and evaluate techniques for the automatic correction and refinement of environment assumptions through the detection of discrepancies between model-level and run-time information. Both simply structured assumptions (e.g., simply-typed attribute values, performance constraints, and message names) and richly structured assumptions (e.g., fault models, models of user behaviour, and behavioural interface and component specifications) in the form of, e.g., state machines, sequence diagrams, or constraints are to be supported.
- Extend Papyrus with an MBT plugin to allow for model-level specification of test cases through user interaction and random test case generation (see the sketch after this list). Specified tests will be executed on the code generated from the model and the results will be displayed on the model.
- Develop automatic test case generation techniques for correcting, refining, or completing model-level information. The resulting MBT plugin for Papyrus will be integrated with that developed in the previous part, to allow the generation of test cases suitable for the automatic correction and refinement of models.
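Random test case generation can be sketched as a random walk over a state-machine model: the emitted event sequence is the stimulus, and the visited states are the expected observations at run time. The dictionary-based model format below is invented for illustration; the actual plugin would operate on UML state machines inside Papyrus.

```python
# Sketch: random test-case generation from a state-machine model, in the
# spirit of the MBT plugin described above. The model format is invented;
# the real work would operate on UML state machines inside Papyrus.
import random

machine = {                       # state -> {event: next_state}
    "Idle":      {"connect": "Connected"},
    "Connected": {"send": "Connected", "disconnect": "Idle",
                  "timeout": "Error"},
    "Error":     {"reset": "Idle"},
}

def random_test_case(machine, initial="Idle", length=6, rng=random):
    """Random walk over the model: the event sequence is the test case,
    the visited states are the expected run-time observations."""
    state, events, expected = initial, [], [initial]
    for _ in range(length):
        event = rng.choice(sorted(machine[state]))
        state = machine[state][event]
        events.append(event)
        expected.append(state)
    return events, expected

if __name__ == "__main__":
    random.seed(1)                # reproducible example run
    stimulus, oracle = random_test_case(machine)
    print("stimulus:", stimulus)
    print("oracle:  ", oracle)
```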
Case Studies
This project will focus on useful case studies and has four research axes:
- Compare the performance of different tracing and monitoring tools, including LTTng, Perf, Ftrace, SystemTap and GDB, and examine their integration into the software development toolchain. He will then study how typical applications in the Cloud can be monitored with
these tools, including Apache, MariaDB, PostgreSQL and Drupal. By instrumenting and analyzing these diverse applications, the student will uncover any problems with the proposed approach and tools, and be in a position to suggest improvements to the proposed tools. In addition, he will propose innovative specialized views for these applications, again validating and demonstrating the efficiency of the proposed tools for developing custom analysis views and monitoring these important infrastructure Open Source projects.
- Study and instrument complex multi-level applications such as Eclipse and Chromium. He will then study the applicability and performance of the new tools and techniques proposed in this project. Both Eclipse and Chromium come with extensive run-time libraries and an elaborate threading model. Eclipse runs on top of the Java Virtual Machine and uses threading extensively. Ericsson has several full-time Eclipse developers in Montreal; they will provide useful feedback on the accuracy and relevance of the monitoring data and specialized multi-level views that the proposed tools will offer. Similarly, Chromium embeds a high-performance JavaScript engine and uses an elaborate threading and sandboxing model. Complex applications such as GMail, running in JavaScript within Chromium, are extremely difficult to analyze with the current tools. We are in close contact with the Montreal Google office, where significant research and development work is conducted on Chromium optimization; they will also provide us with valuable feedback on the value of the proposed tracing and monitoring tools. The complexity and large developer and user base of these projects make them ideal testbeds to evaluate the applicability, efficiency and scalability of the proposed approaches. They are therefore in a unique position to provide excellent feedback to improve further the proposed mechanisms and algorithms.
- Setup a realistic test environment and communicate back the interesting findings, including the setup of the different tools (LTTng, Perf, Ftrace, SystemTap, GDB), environment (OpenStack and OpenDaylight) and applications (Apache, MariaDB, PostgreSQL, Drupal, Eclipse and Chromium), in collaboration with the different groups developing these Open Source projects, and in particular with Ericsson (developing TMF), EfficiOS (developing LTTng, UST and Babeltrace) and Google Montreal (developing Chromium). A scripted session setup is sketched after this list.
- Setup and evaluate the different tools, environments and applications, and collaborate with researchers to interface with the test environment and quickly obtain feedback on the efficiency and relevance of their proposed techniques and algorithms.
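For the test environment axes, session setup can be scripted around the real lttng command-line client, as sketched below. The session name, output path, enabled events and the ApacheBench workload are illustrative choices only, not requirements of the tools.

```python
# Sketch: scripted LTTng session setup for the test environment, driving
# the real lttng command-line client. Session name, output path, event
# list and the ApacheBench workload are illustrative choices only.
import subprocess

def lttng(*args):
    subprocess.run(["lttng", *args], check=True)

def trace_workload(workload, session="casestudy", out="/tmp/casestudy-trace"):
    lttng("create", session, "--output=" + out)
    lttng("enable-event", "--kernel", "sched_switch,sched_wakeup")
    lttng("enable-event", "--userspace", "--all")  # any UST-enabled app
    lttng("start")
    try:
        subprocess.run(workload, check=True)       # exercise the system
    finally:
        lttng("stop")
        lttng("destroy", session)
    return out                        # trace directory, e.g. for Babeltrace

if __name__ == "__main__":
    # Example workload: 1000 HTTP requests against a local Apache.
    print(trace_workload(["ab", "-n", "1000", "http://localhost/"]))
```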