Voss, Joseph and Garcia, Joe A. and Proctor, W. Cyrus and Evans, R. Todd
Datacenters have a growing need to monitor and maintain complicated computing machines and to verify that their systems are functioning at a high level. To achieve this goal, administrators must be able to readily verify basic user operations and survey system performance, quickly discerning when a configuration is sub-optimal. We have created a system health and performance monitoring tool that tracks both current health and historical performance. The tool presents this data visually, enabling intuitive identification of realized and potential problems. The tool leverages Slurm, a workload manager common to, and critical in, High Performance Computing workflows. After evaluating several potential automation tools and testing frameworks, we built the tool around Jenkins, a popular and well-supported test automation framework that has been used in recent system health and regression testing, together with PyTest, an assertion-driven Python unit test framework. This project develops a test harness for the Texas Advanced Computing Center to run multiple extensible suites of benchmarking and system health applications, demonstrated on the Stampede2 and Lonestar5 HPC systems. The applications chosen to run within the test harness include existing in-house benchmarks, such as the Performance Assessment Workbench, and community benchmarks, e.g., STREAM, in addition to newly created system health monitoring scripts.
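To illustrate the assertion-driven style the abstract describes, the following is a minimal sketch of a PyTest-style Slurm health check. It is not the actual harness: the sample `sinfo` output, the drain threshold, and the function names are illustrative assumptions; a real check would shell out to `sinfo` on the target system.

```python
# Hypothetical sketch of an assertion-driven system health check in the
# style of PyTest. Parses `sinfo -o "%D %t"`-like output (node count and
# node state per line) and asserts on cluster state. The sample output
# and thresholds below are made-up assumptions, not TACC data.

def parse_sinfo(output):
    """Map node state -> total node count from sinfo-style output."""
    states = {}
    for line in output.strip().splitlines()[1:]:  # skip header row
        count, state = line.split()
        states[state] = states.get(state, 0) + int(count)
    return states

SAMPLE = """NODES STATE
4096 alloc
120 idle
2 drain
"""

def test_few_drained_nodes():
    # A real harness would capture live `sinfo` output here.
    states = parse_sinfo(SAMPLE)
    assert states.get("drain", 0) <= 4, "too many drained nodes"

def test_capacity_available():
    states = parse_sinfo(SAMPLE)
    assert states.get("idle", 0) > 0, "no idle nodes available"
```

Under PyTest's discovery conventions, any `test_*` function containing bare `assert` statements becomes a test case, which is what makes the framework a natural fit for scripted health checks.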
Ababao, Rainier and Garcia, Joe A. and Voss, Joseph and Proctor, W. Cyrus and Evans, R. Todd
As part of a reproducibility initiative at the Student Cluster Competition 2016, specific results and trends presented in “A parallel connectivity algorithm for de Bruijn graphs in metagenomic applications” were reproduced and verified. The general lack of reproducibility within the scientific community is a known issue, but few have the time, resources, or incentive to fully address it. The motivation for reproducibility lies in the need to independently validate previous research claims and to test the difficulty or ease with which those claims may be reasserted. This fundamental tenet becomes ever more important, particularly given the prohibitive simulation cost and data complexity currently associated with metagenomics. The algorithms in the aforementioned article provide a scalable, distributed-memory solution to the problem of assembling and labeling connected components in graphs associated with metagenomic samples. We aim to verify four of the components demonstrated by the article: the deterministic countability of connected components in the data sets used, the computation-to-communication ratio of different work-balanced parallel algorithm implementations, the results obtained from those implementations, and the scaling behavior of the algorithms as the number of MPI processes is increased.
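The first verification target above, the deterministic count of connected components, can be sketched in its serial form. This is not the paper's distributed MPI algorithm; it is the standard union-find analogue of what those implementations compute, with a made-up edge list for illustration.

```python
# Illustrative serial sketch (not the article's distributed algorithm):
# counting connected components of an undirected graph with union-find.
# The distributed implementations being verified compute the same
# quantity across MPI processes; the edge list here is a toy example.

def count_components(num_vertices, edges):
    """Return the number of connected components of the graph."""
    parent = list(range(num_vertices))

    def find(x):
        # Path-halving find: flatten the tree while walking to the root.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # union the two components
            components -= 1
    return components

# Toy example: 6 vertices forming a triangle and a path -> 2 components.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5)]
print(count_components(6, edges))  # -> 2
```

Because union-find produces the same component count regardless of edge ordering, the count serves as a deterministic quantity that independent runs, serial or distributed, can be checked against.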