Heavy Data, Light Measurements – network troubleshooting in a distributed computing environment
Complex, heavily distributed computing environments such as clouds and grids rely increasingly on intricate network setups that consist of multiple layers, span several domains and use different technologies.
Network issues can affect the user experience in different ways, sometimes severely, depending on where they occur in the chain between the user application and the physical connections.
Network troubleshooting is therefore a crucial task in managing a distributed computing centre or a grid/cloud infrastructure. The reliability of the computing facilities depends directly on the underlying network, which has to be properly designed and set up, and also equipped with adequate monitoring and troubleshooting tools. This is particularly true when the data centre of the computing infrastructure captures, processes, stores, shares and transfers large and complex data sets (Big Data).
Network monitoring and troubleshooting activities in a high-performance computing infrastructure should be as non-intrusive as possible: they should not take up precious resources or interfere with the traffic generated by the computing facility. Lightweight network measurement is therefore the preferred strategy.
A network measurement can be defined as “lightweight” if it does not compete with the real network traffic, i.e. with the data being transferred or shared by users or applications. Examples of lightweight measurements are:
- 1-way delay/RTT
- 1-way delay variation (jitter)
- Packet loss
The presentation will focus on the measurement of these three metrics and their impact on network performance as perceived by end users.
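As a rough illustration of how the three metrics relate to raw probe data, the sketch below summarises one run of small test packets. It is not taken from the talk or from any particular tool: the function name `summarize_probes` and its inputs are hypothetical, round-trip times stand in for one-way delays (which would need synchronised clocks at both ends), and delay variation is taken as the mean absolute difference between consecutive samples, which is only one of several conventions in use.

```python
from statistics import mean


def summarize_probes(sent_count, rtts_ms):
    """Summarise one run of lightweight probes (hypothetical helper).

    sent_count: number of probe packets sent.
    rtts_ms:    round-trip times in milliseconds of the probes that were
                answered, in the order they were sent.
    """
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    received = len(rtts_ms)
    if received > sent_count:
        raise ValueError("more replies than probes sent")

    # Packet loss: fraction of probes that never got a reply.
    loss = (sent_count - received) / sent_count

    # Delay: average round-trip time over the answered probes.
    avg_rtt = mean(rtts_ms) if rtts_ms else None

    # Delay variation (jitter): mean absolute difference between
    # consecutive samples. This is a simplifying assumption.
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    jitter = mean(diffs) if diffs else None

    return {"loss": loss, "avg_rtt_ms": avg_rtt, "jitter_ms": jitter}


if __name__ == "__main__":
    # Five probes sent, four answered.
    print(summarize_probes(5, [10.0, 12.0, 11.0, 15.0]))
```

A handful of small probes per interval is enough to feed such a summary, which is what keeps this kind of measurement from competing with user traffic.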
The talk will also present and discuss tools and monitoring infrastructures such as perfSONAR MDM.
Domenico Vicinanza and Alessandra Scicchitano