NCP-AII Free Practice Questions: "NVIDIA AI Infrastructure"

Question 1

You're deploying a distributed training workload across multiple NVIDIA A100 GPUs connected with NVLink and InfiniBand. What steps are necessary to validate the end-to-end network performance between the GPUs before running the actual training job? (Select all that apply)

（A）Ping all nodes to confirm basic network connectivity

（B）Employ 'iperf3' or 'nc' to measure TCP/UDP bandwidth between nodes over the InfiniBand network.

（C）Run NCCL tests to measure NVLink bandwidth and latency between GPUs on the same node.

（D）Manually inspect the physical cabling of NVLink bridges and InfiniBand connections.

（E）Use 'ibstat' to verify the status and link speed of the InfiniBand interfaces on each node.

Correct answer: B, C, E
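
When reading the results of an NCCL all-reduce benchmark (option C), the bus bandwidth is usually derived from the algorithm bandwidth with the factor 2(n-1)/n, as described in the nccl-tests performance notes. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of NCCL.

```python
def algorithm_bandwidth_gbs(message_bytes, seconds):
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return message_bytes / seconds / 1e9


def allreduce_bus_bandwidth_gbs(message_bytes, seconds, n_ranks):
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n."""
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    algbw = algorithm_bandwidth_gbs(message_bytes, seconds)
    return algbw * 2 * (n_ranks - 1) / n_ranks
```

Comparing the resulting figure against the expected NVLink or InfiniBand rate gives a quick sanity check before the training job starts.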

Question 2

You're deploying BlueField OS to an Arm-based SmartNIC. After flashing the image, the system fails to boot and you observe a kernel panic related to device tree loading. Which of the following is the most likely cause?

（A）Incorrect bootloader configuration (e.g., incorrect bootargs). The bootloader might not be pointing to the correct device tree blob (dtb) or root filesystem.

（B）The BlueField OS image is corrupted. A fresh download and re-flash should resolve the problem.

（C）The flashed image is not intended for your specific BlueField card revision. Ensure that the image corresponds to the hardware version.

（D）Insufficient memory allocated to the initrd image. This can lead to failures during initial system setup.

（E）The secure boot configuration is incorrectly set up. Disabling secure boot in the BIOS or bootloader might resolve the issue.

Correct answer: A

Question 3

You're deploying a multi-GPU training job on a cluster using Slurm. You need to ensure that the GPUs allocated to the job are healthy and functioning correctly before the training starts. What's the MOST effective approach to pre-validate the GPU hardware?

（A）Monitor the GPU temperature using 'nvidia-smi' during the first few minutes of the training job.

（B）Check the output of 'nvidia-smi' to ensure all GPUs are listed and have the expected memory.

（C）Execute the NVIDIA Data Center GPU Manager (DCGM) diagnostic suite on the allocated GPUs.

（D）Allocate all available GPUs to the job and assume they are healthy.

（E）Run a simple CUDA vector addition program on each GPU and check for errors.

Correct answer: C
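
The DCGM check in option C could be scripted roughly as follows, for example from a Slurm prolog. This is a sketch under assumptions: it relies on the `dcgmi diag -r <level>` command, and it treats a non-zero exit code as a failed diagnostic, which should be confirmed against the DCGM version in use. The function names are illustrative.

```python
import shutil
import subprocess


def build_dcgm_diag_command(run_level=3):
    """Return the argv list for a DCGM diagnostic at the given run level."""
    if run_level not in (1, 2, 3, 4):
        raise ValueError("run level must be between 1 and 4")
    return ["dcgmi", "diag", "-r", str(run_level)]


def run_dcgm_diag(run_level=3):
    """Run the diagnostic; True if dcgmi exits with status 0 (assumed to mean pass)."""
    if shutil.which("dcgmi") is None:
        raise RuntimeError("dcgmi not found on PATH")
    result = subprocess.run(
        build_dcgm_diag_command(run_level), capture_output=True, text=True
    )
    print(result.stdout)
    return result.returncode == 0
```

Higher run levels take longer, so a short level may suit a per-job check while a longer one fits scheduled maintenance.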

Question 4

A server with 8 NVIDIA A100 GPUs is experiencing an unexpected shutdown under heavy load. The IPMI logs show a 'Power Supply Deasserted' event immediately preceding the shutdown. After replacing the PSU, the issue persists. What is the MOST likely cause of the continued shutdowns?

（A）Insufficient system memory (RAM).

（B）Network congestion causing system instability.

（C）Overcurrent protection (OCP) tripping due to excessive inrush current during GPU startup.

（D）A faulty CMOS battery.

（E）Incompatible GPU driver version.

Correct answer: C

Question 5

You've installed a server with multiple NVIDIA A100 GPUs intended for use with Kubernetes and NVIDIA's GPU Operator. After installing the GPU Operator, you notice that the GPUs are not being properly detected and managed by Kubernetes. Which of the following are potential causes and troubleshooting steps you should take?

（A）The 'nvidia-docker2' runtime is not set as the default runtime in '/etc/docker/daemon.json'. Change the default runtime to 'nvidia' and restart the Docker daemon.

（B）The GPU Operator's configuration is incorrect, preventing it from properly discovering and managing the GPUs. Check the GPU Operator's logs and configuration files.

（C）The NVIDIA drivers are not properly installed on the host operating system before installing the GPU Operator. Verify the driver installation using 'nvidia-smi'.

（D）The NVIDIA Container Toolkit is not installed on the Kubernetes nodes. Install the toolkit according to NVIDIA's documentation.

（E）The Kubernetes nodes are not labeled correctly to indicate the presence of NVIDIA GPUs. Use 'kubectl label node <node-name> nvidia.com/gpu.present=true'.

Correct answer: B, C, D, E
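
One way to see whether Kubernetes has detected the GPUs is to inspect the output of `kubectl get nodes -o json` for the `nvidia.com/gpu.present` label and the `nvidia.com/gpu` allocatable resource. The sketch below parses that JSON; the function name is illustrative, and the JSON would be supplied by the caller.

```python
import json


def gpu_report(nodes_json):
    """Map node name -> (has gpu.present=true label, allocatable GPU count)."""
    report = {}
    for node in json.loads(nodes_json).get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        allocatable = node.get("status", {}).get("allocatable", {})
        report[metadata.get("name", "")] = (
            labels.get("nvidia.com/gpu.present") == "true",
            int(allocatable.get("nvidia.com/gpu", "0")),
        )
    return report
```

A node that carries the label but reports zero allocatable GPUs points toward the driver or container toolkit rather than labeling.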

Question 6

You are tasked with installing a BlueField-2 DPU on a server. After physical installation, the DPU is not recognized by the host OS (Linux). You've verified the power and connection. What is the most likely first step you should take to troubleshoot the issue?

（A）Check the system BIOS settings to ensure that IOMMU (Input/Output Memory Management Unit) is enabled and properly configured.

（B）Immediately reflash the BlueField-2 DPU firmware with the latest version.

（C）Replace the BlueField-2 DPU, assuming it's faulty hardware.

（D）Check the UEFI settings to ensure that the PCIe slot where the DPU is installed is enabled and configured correctly.

（E）Install the latest NVIDIA drivers on the host OS, specifically the BlueField-related drivers.

Correct answer: A

Question 7

Consider the following scenario: You have a BlueField-2 DPU installed in a server. You are trying to establish RDMA (Remote Direct Memory Access) communication between the DPU and another server. However, the RDMA connection fails. Which of the following is the most crucial factor to verify in this scenario?

（A）That the power supply to the DPU is providing sufficient wattage.

（B）That the clocks on both servers are synchronized using NTP (Network Time Protocol).

（C）That the TCP window size is properly tuned for high-bandwidth communication.

（D）That the correct MTU (Maximum Transmission Unit) is configured on the network interfaces involved in the RDMA connection.

（E）That the appropriate RDMA kernel modules (e.g., 'ib_uverbs') are loaded on both the DPU and the remote server.

Correct answer: E
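
The module check in option E can be sketched by scanning the contents of `/proc/modules`, where the first field of each line is the module name. The list of required modules below is an assumption for illustration; the exact set depends on the hardware and driver stack, and modules built into the kernel do not appear in this file.

```python
# Assumed example set; adjust for the actual driver stack.
REQUIRED_MODULES = ("ib_core", "ib_uverbs", "rdma_cm")


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not listed in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]
```

On a live system the text could come from `open("/proc/modules").read()`, run on both the DPU and the remote server.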

Question 8

You're managing a cluster of servers with BlueField-2 DPUs. One server is experiencing intermittent network connectivity issues. You suspect a problem with the DPU's firmware. Which of the following is the MOST reliable method to determine the CURRENT firmware version of the BlueField-2 DPU?

（A）Use the 'mst status' command to query the device status and firmware version.

（B）Run 'ethtool -i' on a network interface associated with the DPU.

（C）Query the DPU's BMC (Baseboard Management Controller) via IPMI or Redfish.

（D）Examine the '/proc/driver/mlx4_core/version' file.

（E）Check the system logs for firmware-related messages during boot.

Correct answer: A
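
For comparison, the `ethtool -i` output mentioned in option B is a list of `key: value` lines that includes a `firmware-version` entry. A minimal parsing sketch follows; the function name is illustrative, and the text would be captured from `ethtool -i <interface>` on the host.

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines from `ethtool -i` output into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses stay intact.
        key, separator, value = line.partition(":")
        if separator:
            info[key.strip()] = value.strip()
    return info
```

Reading `parse_ethtool_info(output).get("firmware-version")` then gives the firmware string the driver reports for that interface.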

Question 9

You are running a distributed training job on a multi-GPU server. After several hours, the job fails with an NCCL (NVIDIA Collective Communications Library) error. The error message indicates a failure in inter-GPU communication. 'nvidia-smi' shows all GPUs are healthy. What is the MOST probable cause of this issue?

（A）Driver incompatibility issue between NCCL and the installed NVIDIA driver version.

（B）Incorrect NCCL configuration, such as an invalid network interface or incorrect device affinity settings.

（C）A faulty network cable connecting the server to the rest of the cluster.

（D）A bug in the NCCL library itself; downgrade to a previous version of NCCL.

（E）Insufficient inter-GPU bandwidth; reduce the batch size to decrease communication overhead.

Correct answer: A, B
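
To investigate the configuration issues named in option B, NCCL can be asked for verbose logs with `NCCL_DEBUG=INFO`, and the interfaces it uses can be pinned with `NCCL_SOCKET_IFNAME` and `NCCL_IB_HCA`. The sketch below builds such an environment for a subprocess; the helper name and the example interface names are assumptions.

```python
import os


def nccl_debug_env(socket_ifname=None, ib_hca=None, base_env=None):
    """Return a copy of the environment with NCCL debugging settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # log topology and transport selection
    if socket_ifname:
        env["NCCL_SOCKET_IFNAME"] = socket_ifname  # e.g. "eth0" (hypothetical)
    if ib_hca:
        env["NCCL_IB_HCA"] = ib_hca  # e.g. "mlx5_0" (hypothetical)
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when relaunching the training job.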

Question 10

You are setting up a multi-node AI cluster with NVIDIA GPUs and InfiniBand for inter-node communication. You need to ensure the InfiniBand network is functioning optimally for GPU-accelerated workloads. What steps would you take to validate the InfiniBand installation and performance?

（A）Verify the InfiniBand drivers are installed and then run a standard TCP benchmark between the nodes.

（B）Run 'ibstat' to check InfiniBand interface status, use 'ping' to test connectivity, and rely on NCCL's internal checks during training.

（C）Run 'ibstat' to check InfiniBand interface status, use 'ibping' and 'ibperf' to test latency and bandwidth, and verify correct NCCL configuration during a distributed training run.

（D）Configure a static IP address on the InfiniBand interfaces, and rely on the operating system's network diagnostics.

（E）Use 'nvidia-smi' to monitor InfiniBand traffic, and rely on CUDA-aware MPI for communication validation.

Correct answer: C
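
The `ibstat` step in option C could be automated along these lines. This is a sketch that assumes the usual `ibstat` layout of `CA '<name>'` blocks containing `Port N:` sections with `State`, `Physical state`, and `Rate` fields. The sample text is hand-written for illustration, not captured from a real system.

```python
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4123
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
CA 'mlx5_1'
    CA type: MT4123
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
"""


def parse_ibstat_ports(text):
    """Return one dict per port with its CA name, port number, and fields."""
    ports, ca, port = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            ca, port = line.split("'")[1], None  # new adapter block
        elif line.startswith("Port ") and line.endswith(":"):
            port = {"ca": ca, "port": int(line[5:-1])}
            ports.append(port)
        elif port is not None and ":" in line:
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return ports


def unhealthy_ports(ports, min_rate=None):
    """List (ca, port) pairs that are not Active/LinkUp or are below min_rate."""
    bad = []
    for p in ports:
        ok = p.get("State") == "Active" and p.get("Physical state") == "LinkUp"
        if ok and min_rate is not None:
            rate = p.get("Rate", "0").split()
            ok = bool(rate) and float(rate[0]) >= min_rate
        if not ok:
            bad.append((p["ca"], p["port"]))
    return bad
```

For example, `unhealthy_ports(parse_ibstat_ports(SAMPLE_IBSTAT))` flags only the second adapter, whose port is down.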

Question 11

You are configuring a switch port connected to a host in an NCP-AII environment. The host is running RoCEv2. To optimize performance and prevent packet loss, which flow control mechanism should you enable on the switch port?

（A）None; flow control is not needed with RoCEv2.

（B）TCP flow control.

（C）Priority Flow Control (PFC), or 802.1Qbb, specifically for the traffic class associated with RoCEv2.

（D）Spanning Tree Protocol (STP).

（E）Simple Network Management Protocol (SNMP).

Correct answer: C

Question 12

Which of the following is the MOST critical consideration when planning the cooling strategy for a server rack containing multiple NVIDIA A100 GPUs?

（A）Applying thermal paste to the GPU memory chips.

（B）Increasing the fan speed of the server chassis fans to maximum.

（C）Using liquid cooling for the CPUs, but air cooling for the GPUs.

（D）Ensuring the server room temperature is kept below 25 degrees Celsius.

（E）Optimizing airflow to ensure hot air is efficiently exhausted from the rack and cool air is drawn in.

Correct answer: E

Question 13

You are configuring an NVIDIA A100 GPU in a server, and after installation and driver setup, nvidia-smi reports a power limit much lower than the GPU's specified TDP. What are the possible reasons for this?

（A）The system BIOS is limiting the power to the PCIe slot.

（B）The GPU is faulty.

（C）The driver is not correctly installed.

（D）The power supply is not providing enough power.

（E）The GPU is in a low-power mode due to inactivity.

Correct answer: A
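
To confirm the symptom described in the question, the enforced and default power limits can be compared using `nvidia-smi --query-gpu=index,power.limit,power.default_limit --format=csv,noheader,nounits`. The sketch below parses that CSV output; the function name is illustrative.

```python
def capped_gpus(csv_text):
    """Return (index, current_limit_w, default_limit_w) for GPUs limited below default."""
    capped = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            index = int(fields[0])
            current, default = float(fields[1]), float(fields[2])
        except (ValueError, IndexError):
            continue  # skip rows with non-numeric values such as "[N/A]"
        if current < default:
            capped.append((index, current, default))
    return capped
```

An empty result suggests the limit matches the default and the cause lies elsewhere.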

Question 14

You are troubleshooting an issue where a Docker container utilizing NVIDIA GPUs intermittently fails with a 'CUDA_ERROR_OUT_OF_MEMORY' error. The host system has sufficient memory and the individual GPU has enough memory as well. You suspect that the problem might be related to how memory is being allocated within the container environment. What steps can you take to investigate and potentially mitigate this issue?

（A）Increase the shared memory size for the container using the '--shm-size' flag when running the container.

（B）Set the environment variable inside the container to limit the number of GPUs visible to the application.

（C）Adjust the environment variable inside the container to ensure consistent GPU ordering.

（D）Monitor GPU memory usage both inside and outside the container using 'nvidia-smi' to identify memory leaks or excessive allocation.

（E）Lower the compute capability using '-compute' parameter on docker run.

Correct answer: A, D

Question 15

You suspect a power supply issue is causing intermittent GPU failures in a server with four NVIDIA A100 GPUs. The server is rated for a peak power consumption of 3000W. You have a power meter available. Which of the following methods provides the most accurate assessment of the server's power consumption under full GPU load?

（A）Use the power meter to measure the server's power consumption while running a synthetic benchmark that fully utilizes all GPUs simultaneously.

（B）Add the maximum power rating of each GPU to the CPU's TDP (Thermal Design Power).

（C）Use the power meter to measure the server's power consumption at idle and multiply by four.

（D）Check the server's BIOS for power consumption readings.

（E）Run 'nvidia-smi' and sum the reported power consumption for each GPU.

Correct answer: A
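
The difference between options A and E can be made concrete: the sum of per-GPU readings from `nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits` covers only the GPUs, while a wall power meter also captures CPUs, fans, and PSU losses. A small sketch follows, with illustrative function names and made-up readings in the usage note.

```python
def total_gpu_power_watts(csv_text):
    """Sum the power.draw column (watts) of 'index, power.draw' CSV lines."""
    total = 0.0
    for line in csv_text.strip().splitlines():
        total += float(line.split(",")[1])
    return total


def non_gpu_power_watts(wall_meter_watts, gpu_watts):
    """Power measured at the wall that nvidia-smi does not account for."""
    return wall_meter_watts - gpu_watts
```

With hypothetical readings of 1596 W summed across the GPUs and 2400 W at the meter, about 804 W would be drawn by the rest of the system, which a GPU-only sum would miss.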

Question 16

You are observing that the memory bandwidth being achieved by your CUDA application on an NVIDIA A100 GPU is significantly lower than the theoretical peak bandwidth. Which of the following could be potential causes for this, and what actions can you take to validate or mitigate them? (Select all that apply)

（A）The application is using single precision floating-point operations. Switch to double precision to increase memory bandwidth utilization.

（B）The system memory is fully occupied. Deallocate some memory.

（C）The GPU is being limited by power capping. Increase the power limit using 'nvidia-smi -pl' (if permitted) to allow the GPU to operate at higher clock speeds.

（D）The application is using uncoalesced memory access patterns. Refactor the code to ensure contiguous memory access by threads within a warp.

（E）The application is using a small transfer size per kernel launch. Increase the amount of data processed per kernel launch to amortize the overhead of kernel launch and data transfer.

Correct answer: C, D, E
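
The comparison in this question rests on two figures: theoretical peak bandwidth (memory clock times transfers per clock times bus width) and effective bandwidth (bytes read plus bytes written, divided by kernel time). A minimal sketch of both follows. The 1215 MHz clock and 5120-bit bus in the usage note are assumed values for a 40 GB A100 and should be checked against the actual device.

```python
def theoretical_peak_gbs(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak bandwidth in GB/s from memory clock, bus width, and data rate."""
    bytes_per_second = (
        memory_clock_mhz * 1e6 * transfers_per_clock * bus_width_bits / 8
    )
    return bytes_per_second / 1e9


def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Achieved bandwidth in GB/s for one kernel run."""
    return (bytes_read + bytes_written) / seconds / 1e9
```

Under those assumptions, `theoretical_peak_gbs(1215, 5120)` gives roughly 1555 GB/s; a kernel that lands far below this is a candidate for the coalescing and transfer-size checks in options D and E.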

Question 17

Which of the following are key benefits of using NVIDIA NVLink Switch in a multi-GPU server setup for AI and deep learning workloads?

（A）Reduced latency in inter-GPU data transfers.

（B）Enhanced security features compared to PCIe-based interconnections.

（C）Simplified GPU resource management.

（D）Increased GPU-to-GPU communication bandwidth.

（E）Support for larger GPU memory pools than a single server can physically accommodate.

Correct answer: A, D, E

Question 18

You are installing four NVIDIA A100 GPUs into a server designed for AI training. The server motherboard has multiple PCIe Gen4 x16 slots. However, the server's power supply unit (PSU) only has three 8-pin PCIe power connectors available. What is the BEST course of action to ensure all GPUs receive adequate power?

（A）Underclock the GPUs significantly to reduce their power consumption below the available PSU capacity.

（B）Replace the existing PSU with a higher wattage PSU that has at least four 8-pin PCIe power connectors.

（C）Use a PCIe power splitter cable on one of the 8-pin connectors to power two GPUs.

（D）Connect the GPUs using the motherboard's internal SATA power connectors.

（E）Install only three GPUs and leave the fourth unpowered.

Correct answer: B

Question 19

Consider the following Python code snippet which attempts to extract Digital Optical Monitoring (DOM) data from a transceiver using a hypothetical library 'transceiver_utils'. The transceiver is connected to port 'eth0'. However, the code consistently throws a 'TransceiverError: Invalid port' exception. What is the MOST likely cause of this error?

（A）The port 'eth0' does not exist or is not correctly associated with the transceiver.

（B）The fiber cable connected to the transceiver is damaged.

（C）The transceiver does not support DOM functionality.

（D）The 'transceiver_utils' library is outdated and does not support DOM data extraction.

（E）The Python code requires root privileges to access transceiver data.

Correct answer: A
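
The check implied by option A, confirming that the port name exists before querying the transceiver, can be sketched with the standard library alone. This is not the snippet the question refers to, and it does not use the hypothetical 'transceiver_utils' library. The function name is illustrative.

```python
import socket


def port_exists(port, known_interfaces=None):
    """Return True if `port` names a network interface on this host."""
    if known_interfaces is None:
        # if_nameindex() yields (index, name) pairs for local interfaces.
        known_interfaces = [name for _, name in socket.if_nameindex()]
    return port in known_interfaces
```

On hosts that use predictable interface naming, the port may appear under a name such as `enp3s0` rather than `eth0`, which would produce exactly this kind of invalid-port error.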
