NCP-AII Free Practice Questions: "NVIDIA AI Infrastructure"
You're deploying a distributed training workload across multiple NVIDIA A100 GPUs connected with NVLink and InfiniBand. What steps are necessary to validate the end-to-end network performance between the GPUs before running the actual training job? (Select all that apply)
Correct answer: B, C, E
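When a collective benchmark such as an all-reduce test is used for this kind of validation, the measured time is usually turned into a "bus bandwidth" figure so that results can be compared against link speed regardless of GPU count. A minimal sketch of that arithmetic follows, assuming a ring-style all-reduce; the function name and the sample numbers are illustrative and do not come from the question.

```python
def allreduce_bus_bandwidth(size_bytes: int, seconds: float, n_ranks: int) -> float:
    """Return all-reduce bus bandwidth in GB/s for one measured run.

    Algorithm bandwidth is simply size / time. For all-reduce it is scaled
    by 2 * (n - 1) / n so the figure reflects traffic on each link.
    """
    if n_ranks < 2:
        raise ValueError("an all-reduce needs at least two ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    algorithm_bw = size_bytes / seconds / 1e9  # GB/s
    return algorithm_bw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB reduced across 8 GPUs in 10 ms.
    print(f"{allreduce_bus_bandwidth(10**9, 0.010, 8):.1f} GB/s")
```

A bus bandwidth far below what the NVLink or InfiniBand links should sustain is a hint to look at topology, cabling, or configuration before starting the real job.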
You're deploying a multi-GPU training job on a cluster using Slurm. You need to ensure that the GPUs allocated to the job are healthy and functioning correctly before the training starts. What's the MOST effective approach to pre-validate the GPU hardware?
Correct answer: C
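One lightweight form of pre-job validation is to query each GPU and refuse to start if any of them looks unhealthy. The sketch below parses CSV output from an `nvidia-smi --query-gpu` call; the helper names and the 85 °C threshold are assumptions chosen for illustration, and a fuller check would typically rely on DCGM diagnostics rather than this simple query.

```python
import subprocess

# Real nvidia-smi query fields; the selection of fields is an illustrative choice.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader,nounits",
]


def query_gpus() -> str:
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def unhealthy_gpus(csv_text: str, max_temp_c: int = 85):
    """Return [(gpu_index, [reasons])] for GPUs that fail the simple checks."""
    bad = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, temp, ecc = [field.strip() for field in line.split(",")]
        reasons = []
        # Fields may read "[N/A]" (for example when ECC is disabled), so only
        # compare values that are plain integers.
        if temp.isdigit() and int(temp) > max_temp_c:
            reasons.append("temperature")
        if ecc.isdigit() and int(ecc) > 0:
            reasons.append("uncorrectable ECC errors")
        if reasons:
            bad.append((int(index), reasons))
    return bad
```

In a Slurm setup, a script like this could be called from a prolog and exit non-zero when `unhealthy_gpus(query_gpus())` is not empty, so the node is not handed to the job.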
A server with 8 NVIDIA A100 GPUs is experiencing an unexpected shutdown under heavy load. The IPMI logs show a 'Power Supply Deasserted' event immediately preceding the shutdown. After replacing the PSU, the issue persists. What is the MOST likely cause of the continued shutdowns?
Correct answer: C
You've installed a server with multiple NVIDIA A100 GPUs intended for use with Kubernetes and NVIDIA's GPU Operator. After installing the GPU Operator, you notice that the GPUs are not being properly detected and managed by Kubernetes. Which of the following are potential causes and troubleshooting steps you should take?
Correct answer: B, C, D, E
You are tasked with installing a BlueField-2 DPU on a server. After physical installation, the DPU is not recognized by the host OS (Linux). You've verified the power and connection. What is the most likely first step you should take to troubleshoot the issue?
Correct answer: A
Consider the following scenario: You have a BlueField-2 DPU installed in a server. You are trying to establish RDMA (Remote Direct Memory Access) communication between the DPU and another server. However, the RDMA connection fails. Which of the following is the most crucial factor to verify in this scenario?
Correct answer: E
You're managing a cluster of servers with BlueField-2 DPUs. One server is experiencing intermittent network connectivity issues. You suspect a problem with the DPU's firmware. Which of the following is the MOST reliable method to determine the CURRENT firmware version of the BlueField-2 DPU?
Correct answer: A
You are running a distributed training job on a multi-GPU server. After several hours, the job fails with an NCCL (NVIDIA Collective Communications Library) error. The error message indicates a failure in inter-GPU communication. 'nvidia-smi' shows all GPUs are healthy. What is the MOST probable cause of this issue?
Correct answer: A, B
You are setting up a multi-node AI cluster with NVIDIA GPUs and InfiniBand for inter-node communication. You need to ensure the InfiniBand network is functioning optimally for GPU-accelerated workloads. What steps would you take to validate the InfiniBand installation and performance?
Correct answer: C
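A first pass at validating an InfiniBand installation is often just confirming that every port is active and has negotiated the expected rate, before moving on to bandwidth and latency tests. The sketch below parses text in the general shape printed by `ibstat`; the sample output, device names, and the 200 Gb/s expectation are made up for illustration, and real output contains more fields.

```python
# Illustrative ibstat-style text; not captured from a real system.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Link layer: InfiniBand
"""


def parse_ibstat(text: str):
    """Return one dict per port with its adapter name, state, and rate."""
    ports, adapter, current = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            adapter = line.split("'")[1]
        elif line.startswith("Port ") and line.endswith(":"):
            current = {"ca": adapter, "port": int(line[5:-1]), "state": None, "rate": None}
            ports.append(current)
        elif current is not None and line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Rate:"):
            current["rate"] = float(line.split(":", 1)[1].strip())
    return ports


def degraded_ports(ports, expected_rate: float):
    """Return (adapter, port) pairs that are not active or run below the expected rate."""
    return [
        (p["ca"], p["port"])
        for p in ports
        if p["state"] != "Active" or p["rate"] is None or p["rate"] < expected_rate
    ]


if __name__ == "__main__":
    print(degraded_ports(parse_ibstat(SAMPLE_IBSTAT), expected_rate=200))
```

A port check like this only shows that links are up; throughput and latency still need to be measured with a dedicated benchmark between nodes.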
You are troubleshooting an issue where a Docker container utilizing NVIDIA GPUs intermittently fails with a 'CUDA_ERROR_OUT_OF_MEMORY' error. The host system has sufficient memory and the individual GPU has enough memory as well. You suspect that the problem might be related to how memory is being allocated within the container environment. What steps can you take to investigate and potentially mitigate this issue?
Correct answer: A, D
You suspect a power supply issue is causing intermittent GPU failures in a server with four NVIDIA A100 GPUs. The server is rated for a peak power consumption of 3000W. You have a power meter available. Which of the following methods provides the most accurate assessment of the server's power consumption under full GPU load?
Correct answer: A
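Whatever measurement method is used, readings taken while the GPUs are under sustained load have to be compared against the rated peak, and short spikes matter as much as the average. A minimal sketch of that comparison follows, assuming a series of wattage samples from a meter; the function name and the sample values are hypothetical, and only the 3000 W rating comes from the question.

```python
def summarize_power(samples_w, rated_peak_w):
    """Summarize wattage samples taken under load against a rated peak."""
    if not samples_w:
        raise ValueError("at least one power sample is required")
    peak = max(samples_w)
    return {
        "peak_w": peak,
        "mean_w": sum(samples_w) / len(samples_w),
        "headroom_w": rated_peak_w - peak,  # negative means the rating was exceeded
        "over_rating": peak > rated_peak_w,
    }


if __name__ == "__main__":
    # Hypothetical readings in watts while all four GPUs run a stress workload.
    readings = [2100, 2600, 3050, 2800]
    print(summarize_power(readings, rated_peak_w=3000))
```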
You are observing that the memory bandwidth being achieved by your CUDA application on an NVIDIA A100 GPU is significantly lower than the theoretical peak bandwidth. Which of the following could be potential causes for this, and what actions can you take to validate or mitigate them? (Select all that apply)
Correct answer: C, D, E
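To judge how far an application falls short, it helps to compute both numbers being compared. The sketch below uses the standard formulas: theoretical peak from memory clock and bus width, and effective bandwidth from bytes moved over elapsed time. The function names are illustrative, and the clock and bus-width inputs are example values approximating an A100 40GB; they should be replaced with what the device actually reports.

```python
def theoretical_bandwidth_gbs(mem_clock_mhz: float, bus_width_bits: int,
                              transfers_per_clock: int = 2) -> float:
    """Peak bandwidth in GB/s: clock * transfers per clock * bus width in bytes."""
    return mem_clock_mhz * 1e6 * transfers_per_clock * bus_width_bits / 8 / 1e9


def effective_bandwidth_gbs(bytes_read: int, bytes_written: int, seconds: float) -> float:
    """Achieved bandwidth in GB/s for one kernel run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return (bytes_read + bytes_written) / seconds / 1e9


if __name__ == "__main__":
    # Example inputs only: 1215 MHz memory clock, 5120-bit bus (assumed values).
    peak = theoretical_bandwidth_gbs(1215, 5120)
    # Hypothetical kernel: copies 4 GB (read + write 4 GB each) in 8 ms.
    achieved = effective_bandwidth_gbs(4 * 10**9, 4 * 10**9, 0.008)
    print(f"peak {peak:.1f} GB/s, achieved {achieved:.1f} GB/s, "
          f"efficiency {achieved / peak:.0%}")
```

A low ratio points toward access-pattern or configuration issues worth profiling, while a ratio near the practical ceiling suggests the kernel is already bandwidth-bound.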
You are installing four NVIDIA A100 GPUs into a server designed for AI training. The server motherboard has multiple PCIe Gen4 x16 slots. However, the server's power supply unit (PSU) only has three 8-pin PCIe power connectors available. What is the BEST course of action to ensure all GPUs receive adequate power?
Correct answer: B
Consider the following Python code snippet which attempts to extract Digital Optical Monitoring (DOM) data from a transceiver using a hypothetical library 'transceiver_utils'. The transceiver is connected to port 'eth0'. However, the code consistently throws a 'TransceiverError: Invalid port' exception. What is the MOST likely cause of this error?
Correct answer: A