NCP-AII Free Practice Questions: "NVIDIA AI Infrastructure"
You are monitoring a server with 8 GPUs used for deep learning training. You observe that one of the GPUs reports a significantly lower utilization rate compared to the others, even though the workload is designed to distribute evenly. 'nvidia-smi' reports a persistent "XID 13" error for that GPU. What is the most likely cause?
Correct answer: B
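When chasing a recurring Xid on one GPU, it can help to count how often each code appears per PCI address in the kernel log before deciding between an application fault, a driver problem, or hardware. NVIDIA's Xid documentation lists Xid 13 as a graphics engine exception. The following is a minimal sketch that assumes the usual `NVRM: Xid (...)` line layout; the sample log lines and the name `count_xids` are made up for illustration.

```python
import re
from collections import Counter

# Matches NVIDIA kernel-driver lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 13, pid=4121, Graphics Exception ...
# The "PCI:" prefix is optional because older drivers omit it.
XID_LINE = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(log_lines):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in log_lines:
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Illustrative log lines (invented for this sketch, not real output).
SAMPLE_LOG = [
    "[  101.1] NVRM: Xid (PCI:0000:3b:00): 13, pid=4121, Graphics Exception: ESR 0x505648=0x1",
    "[  250.7] NVRM: Xid (PCI:0000:3b:00): 13, pid=4121, Graphics Exception: ESR 0x505648=0x1",
    "[  300.2] usb 1-1: new high-speed USB device number 2",
]

for (pci_address, xid), hits in count_xids(SAMPLE_LOG).items():
    print(f"{pci_address}: Xid {xid} seen {hits} time(s)")
```

On a live system the same function could be fed the lines of `dmesg` or the system journal instead of `SAMPLE_LOG`.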
You are tasked with implementing a monitoring solution for power consumption and thermal performance in an NVIDIA-powered AI cluster. You want to collect data from the Baseboard Management Controllers (BMCs) of the servers using Redfish. Which of the following Python code snippets demonstrates the correct approach for authenticating with the BMC and retrieving power and temperature readings?
Correct answer: D
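The answer options for this question were not captured, so the following is not the exam's snippet. It is a minimal sketch of one common pattern: HTTP Basic authentication against the BMC, then reading the standard Redfish `Power` and `Thermal` resources under a chassis. The helper names, the chassis ID `1`, and the host and credentials in the usage comment are assumptions; chassis IDs vary by vendor, and newer BMCs may expose `PowerSubsystem` and `ThermalSubsystem` instead.

```python
import base64
import json
import ssl
import urllib.request


def redfish_get(host, path, user, password, verify_tls=True):
    """GET one Redfish resource from a BMC using HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    context = ssl.create_default_context()
    if not verify_tls:  # only for lab BMCs with self-signed certificates
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def power_watts(power_doc):
    """Extract PowerConsumedWatts from each PowerControl entry."""
    return [entry.get("PowerConsumedWatts") for entry in power_doc.get("PowerControl", [])]


def temperatures_c(thermal_doc):
    """Map each sensor name to its ReadingCelsius value."""
    return {
        sensor.get("Name"): sensor.get("ReadingCelsius")
        for sensor in thermal_doc.get("Temperatures", [])
    }


# Hypothetical usage (host, credentials, and chassis ID are placeholders):
#   power = redfish_get("bmc.example.com", "/redfish/v1/Chassis/1/Power", "admin", "secret")
#   thermal = redfish_get("bmc.example.com", "/redfish/v1/Chassis/1/Thermal", "admin", "secret")
#   print(power_watts(power), temperatures_c(thermal))
```

Basic authentication keeps the sketch short. For frequent polling, a Redfish session created through `/redfish/v1/SessionService/Sessions` avoids sending credentials with every request.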
You have an Intel Xeon Gold server with 2 NVIDIA Tesla V100 GPUs. After deploying your AI application, you observe that one GPU is consistently running at a significantly higher temperature than the other. What could be a plausible reason for this behavior?
Correct answers: B, E
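To tell an uneven workload apart from an airflow or cooling problem, it helps to look at temperature next to utilization and power draw for each GPU. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the function names and the sample text are illustrative, not real measurements.

```python
import csv
import subprocess

QUERY_FIELDS = ["index", "temperature.gpu", "utilization.gpu", "power.draw", "fan.speed"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(text.strip().splitlines(), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, record)) for record in reader]


def query_gpus():
    """Run nvidia-smi and return the parsed per-GPU readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout)


def temperature_spread(rows):
    """Difference in degrees Celsius between the hottest and coolest GPU."""
    temps = [int(row["temperature.gpu"]) for row in rows]
    return max(temps) - min(temps)


# Illustrative output for two GPUs (invented values; passively cooled
# cards typically report "[N/A]" for fan speed).
SAMPLE = "0, 62, 95, 241.50, [N/A]\n1, 81, 96, 243.10, [N/A]\n"
rows = parse_gpu_csv(SAMPLE)
print(temperature_spread(rows))  # → 19
```

In this made-up sample both GPUs show similar utilization and power draw, so the temperature gap would point toward cooling or placement rather than load imbalance.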
An AI infrastructure relies on a liquid cooling system to dissipate heat from multiple NVIDIA GPUs. After a recent software update, users report intermittent performance degradation and system crashes. You suspect a cooling issue. Which TWO of the following checks are the MOST critical in diagnosing the root cause?
Correct answers: D, E
You are troubleshooting a performance issue with a GPU-accelerated application running inside a Docker container. The 'nvidia-smi' output inside the container shows the GPU is being utilized, but the performance is significantly lower than expected. Which of the following could be the cause of this performance bottleneck?
Correct answers: A, B, C, E
You are using a custom container runtime other than Docker (e.g., containerd) and need to integrate it with the NVIDIA Container Toolkit.
What command would you use to configure the NVIDIA Container Toolkit for this runtime? (Assume your runtime configuration file is located at '/etc/containerd/config.toml')
Correct answer: A
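The answer options were not captured here. As a hedged illustration, the NVIDIA Container Toolkit ships the `nvidia-ctk runtime configure` subcommand, which takes `--runtime` and `--config`. The sketch below only builds and prints that command line; the helper name `nvidia_ctk_configure_cmd` is hypothetical, and the use of `sudo` assumes the config file is root-owned.

```python
import shlex


def nvidia_ctk_configure_cmd(runtime, config_path):
    """Build the argv for configuring a container runtime with nvidia-ctk."""
    return [
        "sudo", "nvidia-ctk", "runtime", "configure",
        f"--runtime={runtime}",
        f"--config={config_path}",
    ]


cmd = nvidia_ctk_configure_cmd("containerd", "/etc/containerd/config.toml")
print(shlex.join(cmd))
# → sudo nvidia-ctk runtime configure --runtime=containerd --config=/etc/containerd/config.toml
```

The list can be passed to `subprocess.run(cmd, check=True)` to apply the change. The runtime normally has to be restarted afterwards (for example with `sudo systemctl restart containerd`) before the new configuration takes effect.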
You are running a distributed training job on a multi-GPU server. After several hours, the job fails with an NCCL (NVIDIA Collective Communications Library) error. The error message indicates a failure in inter-GPU communication. 'nvidia-smi' shows all GPUs are healthy. What is the MOST probable cause of this issue?
Correct answers: A, B
You have configured MIG on your A100 GPU, creating several MIG instances. You now want to allocate a specific MIG instance to a Docker container. How would you specify the necessary device option when running the 'docker run' command to ensure the container uses only that MIG instance? Assume the MIG instance UUID is GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
Correct answer: A
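The answer options were not captured, so the following is only a sketch of one way to pin a container to a single device: Docker's `--gpus` flag accepts a `device=<identifier>` value. The helper name and the image name `my-training-image:latest` are hypothetical, and the UUID is the placeholder from the question. On a real system, `nvidia-smi -L` lists the exact identifier to use, and MIG devices are often shown there with a `MIG-` prefix.

```python
import shlex


def docker_run_mig_cmd(device_uuid, image, *container_args):
    """Build a `docker run` argv that exposes only one GPU or MIG device."""
    return ["docker", "run", "--rm", "--gpus", f"device={device_uuid}",
            image, *container_args]


# Placeholder UUID copied from the question; the image name is hypothetical.
MIG_UUID = "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
cmd = docker_run_mig_cmd(MIG_UUID, "my-training-image:latest", "nvidia-smi", "-L")
print(shlex.join(cmd))
```

An alternative with the same intent, assuming the NVIDIA runtime is registered with Docker, is `--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=<identifier>`. Running `nvidia-smi -L` inside the container is a quick way to confirm that only the intended instance is visible.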
You are troubleshooting slow I/O performance in a deep learning training environment that uses the BeeGFS parallel file system. You suspect that metadata operations are bottlenecking the training process. How can you optimize metadata handling in BeeGFS to potentially improve performance?
Correct answer: B
After physically installing a new NVIDIA GPU in a server, you boot the system. You notice that the GPU is not recognized by the operating system. You've verified the card is properly seated and powered. What are the MOST LIKELY causes and solutions? (Select TWO)
Correct answers: B, D
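A useful first split when a new card is "not recognized" is whether the device is visible on the PCIe bus at all (a firmware, slot, or BIOS-level question) or visible but without a working driver. The sketch below looks for NVIDIA's PCI vendor ID `10de` in `lspci -nn` output. The function name is hypothetical, and the sample lines, including the device ID, are illustrative only.

```python
import re

# NVIDIA's PCI vendor ID is 10de; `lspci -nn` prints IDs as [vendor:device].
NVIDIA_ID = re.compile(r"\[10de:([0-9a-f]{4})\]", re.IGNORECASE)


def find_nvidia_devices(lspci_text):
    """Return (pci_slot, device_id) for every NVIDIA device in lspci -nn output."""
    found = []
    for line in lspci_text.splitlines():
        match = NVIDIA_ID.search(line)
        if match:
            slot = line.split()[0]  # lspci starts each line with the slot address
            found.append((slot, match.group(1).lower()))
    return found


# Illustrative `lspci -nn` lines (invented for this sketch).
SAMPLE = (
    "00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:a1c1]\n"
    "3b:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1db4] (rev a1)\n"
)
print(find_nvidia_devices(SAMPLE))  # → [('3b:00.0', '1db4')]
```

An empty result suggests the card is not enumerating on the bus, while a populated result with `nvidia-smi` still failing points toward the driver or kernel module.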