NCP-AII Free Practice Questions: "NVIDIA AI Infrastructure"

You're deploying a distributed training workload across multiple NVIDIA A100 GPUs connected with NVLink and InfiniBand. What steps are necessary to validate the end-to-end network performance between the GPUs before running the actual training job? (Select all that apply)

Correct answer: B, C, E

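One common way to exercise the GPU-to-GPU path end to end is to run a collective benchmark such as `all_reduce_perf` from NVIDIA's nccl-tests and compare the reported bus bandwidth against an expected floor. The sketch below is only an illustration of that check: it parses the summary line the benchmark prints, and the helper names, the sample text, and the thresholds are assumptions made for this example rather than values taken from the question.

```python
import re

# nccl-tests binaries end their report with a summary line such as:
#   "# Avg bus bandwidth    : 231.42"
_BUSBW_RE = re.compile(r"^#\s*Avg bus bandwidth\s*:\s*([0-9.]+)", re.MULTILINE)


def parse_avg_bus_bandwidth(output: str) -> float:
    """Return the average bus bandwidth (GB/s) reported in the benchmark output."""
    match = _BUSBW_RE.search(output)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found")
    return float(match.group(1))


def bandwidth_ok(output: str, min_gbps: float) -> bool:
    """True when the reported bandwidth meets the (site-specific) floor."""
    return parse_avg_bus_bandwidth(output) >= min_gbps


# Hypothetical, abridged tail of a benchmark run, used only for illustration.
SAMPLE = """\
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 231.42
#
"""

if __name__ == "__main__":
    print(bandwidth_ok(SAMPLE, 100.0))
```

The right floor depends on the interconnect and topology, so the value passed as `min_gbps` should come from a known-good baseline for the same hardware.
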
You're deploying BlueField OS to an Arm-based SmartNIC. After flashing the image, the system fails to boot and you observe a kernel panic related to device tree loading. Which of the following is the most likely cause?

You're deploying a multi-GPU training job on a cluster using Slurm. You need to ensure that the GPUs allocated to the job are healthy and functioning correctly before the training starts. What's the MOST effective approach to pre-validate the GPU hardware?

A server with 8 NVIDIA A100 GPUs is experiencing an unexpected shutdown under heavy load. The IPMI logs show a 'Power Supply Deasserted' event immediately preceding the shutdown. After replacing the PSU, the issue persists. What is the MOST likely cause of the continued shutdowns?

You've installed a server with multiple NVIDIA A100 GPUs intended for use with Kubernetes and NVIDIA's GPU Operator. After installing the GPU Operator, you notice that the GPUs are not being properly detected and managed by Kubernetes. Which of the following are potential causes and troubleshooting steps you should take?

Correct answer: B, C, D, E

You are tasked with installing a BlueField-2 DPU on a server. After physical installation, the DPU is not recognized by the host OS (Linux). You've verified the power and connection. What is the most likely first step you should take to troubleshoot the issue?

Consider the following scenario: You have a BlueField-2 DPU installed in a server. You are trying to establish RDMA (Remote Direct Memory Access) communication between the DPU and another server. However, the RDMA connection fails. Which of the following is the most crucial factor to verify in this scenario?

You're managing a cluster of servers with BlueField-2 DPUs. One server is experiencing intermittent network connectivity issues. You suspect a problem with the DPU's firmware. Which of the following is the MOST reliable method to determine the CURRENT firmware version of the BlueField-2 DPU?

You are running a distributed training job on a multi-GPU server. After several hours, the job fails with an NCCL (NVIDIA Collective Communications Library) error. The error message indicates a failure in inter-GPU communication. 'nvidia-smi' shows all GPUs are healthy. What is the MOST probable cause of this issue?

Correct answer: A, B

You are setting up a multi-node AI cluster with NVIDIA GPUs and InfiniBand for inter-node communication. You need to ensure the InfiniBand network is functioning optimally for GPU-accelerated workloads. What steps would you take to validate the InfiniBand installation and performance?

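A basic link-state check is often the first part of such a validation. The sketch below parses text in the layout printed by the `ibstat` utility and lists ports that are not active, not physically up, or running below an expected rate. The function names, the abridged sample text, and the 200 Gb/s threshold are assumptions made for this illustration, and a link check of this kind does not replace a bandwidth or latency test.

```python
import re


def parse_ibstat_ports(text):
    """Return one dict per port with its CA name, port number, and attributes."""
    ports = []
    ca_name = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        ca_match = re.match(r"CA '(.+)'", line)
        if ca_match:
            ca_name = ca_match.group(1)
            current = None  # CA-level attributes are ignored in this sketch
            continue
        port_match = re.match(r"Port (\d+):", line)
        if port_match:
            current = {"ca": ca_name, "port": int(port_match.group(1))}
            ports.append(current)
            continue
        if current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip().lower().replace(" ", "_")] = value.strip()
    return ports


def _rate(port):
    """Link rate as a float, or 0.0 when the field is missing or unparsable."""
    try:
        return float(port.get("rate", "0"))
    except ValueError:
        return 0.0


def unhealthy_ports(text, min_rate):
    """List (ca, port) pairs that are down or slower than min_rate (Gb/s)."""
    bad = []
    for port in parse_ibstat_ports(text):
        healthy = (
            port.get("state") == "Active"
            and port.get("physical_state") == "LinkUp"
            and _rate(port) >= min_rate
        )
        if not healthy:
            bad.append((port["ca"], port["port"]))
    return bad


# Hypothetical, abridged ibstat-style output used only for illustration.
SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Link layer: InfiniBand
"""

if __name__ == "__main__":
    print(unhealthy_ports(SAMPLE, 200))
```
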
You are configuring a switch port connected to a host in an NCP-AII environment. The host is running RoCEv2. To optimize performance and prevent packet loss, which flow control mechanism should you enable on the switch port?

Which of the following is the MOST critical consideration when planning the cooling strategy for a server rack containing multiple NVIDIA A100 GPUs?

You are configuring an NVIDIA A100 GPU in a server, and after installation and driver setup, nvidia-smi reports a power limit much lower than the GPU's specified TDP. What are the possible reasons for this?

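To see whether a GPU is running below its default power limit, the current and default limits can be compared directly. The sketch below parses CSV text of the kind produced by `nvidia-smi --query-gpu=index,power.limit,power.default_limit --format=csv,noheader,nounits`. The function name, the tolerance, and the sample values are assumptions made for this illustration, and the sample numbers are not measurements.

```python
import csv
import io

# Command whose output this sketch expects (run separately on the host).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,power.limit,power.default_limit",
    "--format=csv,noheader,nounits",
]


def find_lowered_limits(csv_text, tolerance_w=1.0):
    """Return (index, limit_w, default_w) for GPUs capped below their default."""
    lowered = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, limit, default = (field.strip() for field in row)
        try:
            limit_w, default_w = float(limit), float(default)
        except ValueError:
            continue  # e.g. a field reported as not available
        if default_w - limit_w > tolerance_w:
            lowered.append((int(index), limit_w, default_w))
    return lowered


# Hypothetical output for a two-GPU host, used only for illustration.
SAMPLE = "0, 250.00, 400.00\n1, 400.00, 400.00\n"

if __name__ == "__main__":
    print(find_lowered_limits(SAMPLE))
```

A limit below the default only shows that a cap is in effect. It does not say why, so a software cap set with `nvidia-smi -pl` is just one of the explanations worth checking.
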
You are troubleshooting an issue where a Docker container utilizing NVIDIA GPUs intermittently fails with a 'CUDA_ERROR_OUT_OF_MEMORY' error. The host system has sufficient memory and the individual GPU has enough memory as well. You suspect that the problem might be related to how memory is being allocated within the container environment. What steps can you take to investigate and potentially mitigate this issue?

Correct answer: A, D

You suspect a power supply issue is causing intermittent GPU failures in a server with four NVIDIA A100 GPUs. The server is rated for a peak power consumption of 3000W. You have a power meter available. Which of the following methods provides the most accurate assessment of the server's power consumption under full GPU load?

You are observing that the memory bandwidth being achieved by your CUDA application on an NVIDIA A100 GPU is significantly lower than the theoretical peak bandwidth. Which of the following could be potential causes for this, and what actions can you take to validate or mitigate them? (Select all that apply)

Correct answer: C, D, E

Which of the following are key benefits of using NVIDIA NVLink Switch in a multi-GPU server setup for AI and deep learning workloads?

Correct answer: A, D, E

You are installing four NVIDIA A100 GPUs into a server designed for AI training. The server motherboard has multiple PCIe Gen4 x16 slots. However, the server's power supply unit (PSU) only has three 8-pin PCIe power connectors available. What is the BEST course of action to ensure all GPUs receive adequate power?

Consider the following Python code snippet which attempts to extract Digital Optical Monitoring (DOM) data from a transceiver using a hypothetical library 'transceiver_utils'. The transceiver is connected to port 'eth0'. However, the code consistently throws a 'TransceiverError: Invalid port' exception. What is the MOST likely cause of this error?

