NCA-GENM Free Practice Questions: NVIDIA Generative AI Multimodal

Question 1

You are developing a multimodal model that combines text and tabular data for predicting customer churn. The text data consists of customer reviews, and the tabular data includes demographics and transaction history. You've preprocessed both datasets. Which of the following approaches would be the MOST effective for integrating these modalities?

（A）All of the above.

（B）Convert the text data into numerical features using techniques like TF-IDF, then concatenate these features with the tabular data.

（C）Train separate models for text and tabular data, then average their predictions.

（D）Concatenate the raw text and tabular data into a single feature vector.

（E）Use a Transformer-based model to encode the text and a separate neural network for the tabular data, then fuse the embeddings.

Correct answer: B, E
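
The early-fusion idea in option B can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the reviews, the tabular columns, and the helper name `tfidf_vectors` are hypothetical, and a small smoothed TF-IDF written with the standard library stands in for a real vectorizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF vector per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = []
        for term in vocab:
            tf = counts[term] / len(toks)
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

# Toy data, assumed for illustration only.
reviews = ["great service great price", "slow support cancelled plan"]
tabular = [[34, 12.5], [51, 3.0]]  # e.g. age, monthly spend

vocab, text_features = tfidf_vectors(reviews)
# Early fusion: append the text features to each customer's tabular row.
fused = [row + vec for row, vec in zip(tabular, text_features)]
```

Each fused row can then be passed to any tabular classifier. Option E follows the same pattern but replaces the TF-IDF vector with a learned text embedding.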


Question 2

You've trained a large multimodal model that takes text and images as input and generates creative stories. While the model produces high-quality stories in general, it occasionally generates outputs that are factually incorrect or nonsensical. Which of the following techniques would be MOST effective in improving the model's factual accuracy and coherence?

（A）Increasing the model size by adding more layers.

（B）Implementing a retrieval-augmented generation (RAG) approach.

（C）Reducing the temperature parameter during generation.

（D）Removing dropout layers.

（E）Training the model on a smaller dataset.

Correct answer: B
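
A retrieval-augmented generation loop has two parts: fetch relevant passages, then condition the generator on them. The sketch below shows only the retrieval and prompt-assembly half. The knowledge base, the function names, and the word-overlap scoring are assumptions for illustration; a real system would typically use embedding similarity and pass the prompt to a generator model.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (toy stand-in for vector search)."""
    query_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_terms & set(doc.lower().split()))
        scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query, documents, k=2):
    """Prepend retrieved passages so the generator can ground its output in them."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nTask: {query}"

# Hypothetical knowledge base.
knowledge_base = [
    "The Eiffel Tower is in Paris and opened in 1889.",
    "Mount Fuji is the highest mountain in Japan.",
    "The Great Barrier Reef lies off the coast of Australia.",
]
prompt = build_prompt("Write a story set at the Eiffel Tower in Paris", knowledge_base, k=1)
```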


Question 3

When building a multimodal model using transformers, you observe that the model struggles to attend to the correct image regions when generating text descriptions. Which of the following techniques could you employ to improve the attention mechanism in the model?

（A）Decrease the learning rate.

（B）Reduce the dimensionality of the image features.

（C）Use a larger batch size during training.

（D）Implement a visual attention mechanism that explicitly guides the model to focus on relevant image regions.

（E）Increase the number of transformer layers in the text encoder.

Correct answer: D
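
One common form of visual attention is cross-attention, where a text-side query scores each image region and the regions are combined according to those scores. The following is a minimal sketch with toy numbers; the region features, the query, and the function names are assumptions, and learned projection matrices are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, region_features):
    """Scaled dot-product attention of one text query over image region vectors."""
    d = len(query)
    scores = [sum(q * r for q, r in zip(query, region)) / math.sqrt(d)
              for region in region_features]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region features.
    context = [sum(w * region[i] for w, region in zip(weights, region_features))
               for i in range(d)]
    return weights, context

# Toy inputs (assumed): three image regions and a decoder query aligned with region 1.
regions = [[1.0, 0.0], [0.0, 4.0], [0.5, 0.5]]
query = [0.0, 2.0]
weights, context = cross_attention(query, regions)
```

Inspecting `weights` per generated token is also a simple way to check which regions the model is attending to.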


Question 4

You are developing a text-to-image generation system using a diffusion model. During inference, you notice that the generated images often contain artifacts or inconsistencies. What is the most appropriate strategy to reduce these artifacts and improve the overall image quality?

（A）Train the model with a larger dataset of higher-resolution images.

（B）Reduce the batch size during inference.

（C）Use a simpler text encoder to reduce noise in the conditioning signal.

（D）Decrease the guidance scale (classifier-free guidance).

（E）Increase the number of diffusion steps during the reverse process (sampling).

Correct answer: E
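
Reverse-process sampling can be viewed as numerically integrating a trajectory in small steps, so fewer steps generally means a larger discretization error. The toy example below is only an analogy and is not a diffusion model: it integrates the simple ODE dx/dt = -x with fixed-step Euler updates (an assumption chosen because the exact answer is known) and compares the error for 5 and 50 steps.

```python
import math

def euler_integrate(x0, t_end, num_steps):
    """Integrate dx/dt = -x from t=0 to t_end with fixed-step Euler updates."""
    x = x0
    dt = t_end / num_steps
    for _ in range(num_steps):
        x = x + dt * (-x)
    return x

exact = math.exp(-1.0)  # true solution x(1) for x0 = 1
error_few = abs(euler_integrate(1.0, 1.0, 5) - exact)
error_many = abs(euler_integrate(1.0, 1.0, 50) - exact)
```

The trade-off is the same as in sampling: more steps cost more compute per generated image.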


Question 5

Consider the following PyTorch code snippet intended for training a variational autoencoder (VAE):

What potential issue(s) exist(s) in this code, and how would you address them?

（A）The binary cross-entropy (BCE) loss doesn't account for pixel values outside the range [0, 1]; normalize the input images to this range.

（B）All of the above.

（C）The Kullback-Leibler divergence (KLD) term isn't scaled appropriately for the batch size; divide it by the batch size to get a mean KLD loss.

（D）The BCE loss is summed across all pixels; average it by dividing by the total number of pixels in the input.

（E）The KLD calculation is incorrect; it should be 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1 - logvar).

Correct answer: B
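
The snippet the question refers to is not reproduced on this page. As a reference point only, the sketch below shows the standard VAE objective for a diagonal Gaussian posterior: a binary cross-entropy reconstruction term plus the closed-form KL divergence, averaged over the batch. Plain Python lists stand in for tensors, and all function names are hypothetical.

```python
import math

def bce_sum(recon, target, eps=1e-7):
    """Binary cross-entropy summed over pixels; both inputs must lie in [0, 1]."""
    total = 0.0
    for p, y in zip(recon, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total

def kl_sum(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)) summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def vae_loss(recon_batch, target_batch, mu_batch, logvar_batch):
    """Mean per-sample loss: (summed BCE + summed KL) divided by the batch size."""
    batch_size = len(recon_batch)
    bce = sum(bce_sum(r, t) for r, t in zip(recon_batch, target_batch))
    kld = sum(kl_sum(m, lv) for m, lv in zip(mu_batch, logvar_batch))
    return (bce + kld) / batch_size
```

Dividing both terms by the same batch size keeps their relative weighting independent of how many samples are in a batch.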


Question 6

Consider a scenario where you are developing a virtual assistant that can answer questions about images. You have a large dataset of images and corresponding question-answer pairs. Which architecture is BEST suited for this task?

（A）A combination of a pre-trained word embedding model (e.g., Word2Vec) for the question and a separate model for image classification.

（B）A Support Vector Machine (SVM) trained on image features and question keywords.

（C）A transformer-based model that processes both images and questions as sequences of tokens, allowing for attention-based interaction between modalities.

（D）A convolutional neural network (CNN) for image feature extraction followed by a recurrent neural network (RNN) for question encoding and a fully connected layer for answer prediction.

（E）A simple feedforward neural network that takes flattened image pixels and question embeddings as input.

Correct answer: C


Question 7

You are tasked with deploying a generative AI model using NVIDIA Triton Inference Server. Which configuration parameter within Triton is MOST crucial for optimizing throughput and minimizing latency when serving a large number of concurrent requests?

（A）Default Model Filename

（B）Instance Group Count

（C）Batching Preferences

（D）Input Data Type

（E）Max Queue Size

Correct answer: B
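
In Triton, the instance count is set per model in its `config.pbtxt` through the `instance_group` field, usually alongside `max_batch_size` and `dynamic_batching`. The helper below is a hypothetical convenience that renders such a fragment as text. The model name and the numbers are illustrative assumptions, and a complete configuration would normally also declare the backend and the input and output tensors.

```python
def render_triton_config(name, max_batch_size, instance_count, kind="KIND_GPU"):
    """Render a minimal config.pbtxt fragment (all values are illustrative)."""
    lines = [
        f'name: "{name}"',
        f"max_batch_size: {max_batch_size}",
        "instance_group [",
        "  {",
        f"    count: {instance_count}",  # number of model copies that run concurrently
        f"    kind: {kind}",
        "  }",
        "]",
        "dynamic_batching { }",  # let the server group queued requests into batches
    ]
    return "\n".join(lines) + "\n"

config_text = render_triton_config("text_to_image", max_batch_size=8, instance_count=2)
```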


Question 8

Consider the following code snippet used for evaluating a Generative Adversarial Network (GAN):

What does the code snippet calculate, and what do 'images1' and 'images2' represent in the context of GAN evaluation?

（A）Calculates the Frechet Inception Distance (FID); 'images1' and 'images2' represent real and fake images respectively.

（B）Calculates the Kernel Inception Distance (KID); 'images1' and 'images2' represent real and fake images respectively.

（C）Calculates the Peak Signal-to-Noise Ratio (PSNR); 'images1' and 'images2' represent real and fake images respectively.

（D）Calculates the Structural Similarity Index (SSIM); 'images1' and 'images2' represent real and fake images respectively.

（E）Calculates the Inception Score; 'images1' and 'images2' represent real and fake images respectively.

Correct answer: A
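
The snippet the question refers to is not reproduced on this page. To show what a Fréchet distance measures, the sketch below fits a Gaussian to each set of feature vectors and compares them. It is a simplification under stated assumptions: it uses diagonal covariances and hand-made feature lists, whereas FID proper extracts Inception-v3 features from the images and uses full covariance matrices with a matrix square root. The function name is hypothetical.

```python
import math
from statistics import fmean, pvariance

def frechet_distance_diag(features1, features2):
    """Fréchet distance between Gaussians fitted per dimension (diagonal covariance)."""
    dims = len(features1[0])
    total = 0.0
    for d in range(dims):
        col1 = [f[d] for f in features1]
        col2 = [f[d] for f in features2]
        mu1, mu2 = fmean(col1), fmean(col2)
        var1, var2 = pvariance(col1), pvariance(col2)
        # Per dimension: (mu1 - mu2)^2 + var1 + var2 - 2 * sqrt(var1 * var2)
        total += (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * math.sqrt(var1 * var2)
    return total

# Toy feature vectors standing in for features of real and generated images.
real_features = [[0.0, 1.0], [2.0, 3.0]]
fake_features = [[1.0, 2.0], [3.0, 4.0]]
distance = frechet_distance_diag(real_features, fake_features)
```

A lower value means the two feature distributions are closer; identical sets give zero.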


Question 9

Consider the following Python code snippet using PyTorch, intended to combine image and text embeddings:

Which of the following statements regarding the output shapes of these combined embeddings are TRUE? (Select TWO)

（A）combined_embedding_concat has shape (64, 512).

（B）combined_embedding_weighted has shape (32, 1024).

（C）combined_embedding_weighted has shape (32, 512).

（D）combined_embedding_add has shape (32, 1024).

（E）combined_embedding_concat has shape (32, 1024).

Correct answer: C, E
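
The snippet the question refers to is not reproduced on this page. The answer options are consistent with two input embeddings of shape (32, 512); that input shape, the fusion weight `alpha`, and the `shape` helper below are assumptions. Nested lists stand in for tensors here. In PyTorch the same three operations would typically be `torch.cat` along `dim=1`, elementwise `+`, and a weighted elementwise sum.

```python
def shape(matrix):
    """Return (rows, columns) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]))

batch, dim = 32, 512
image_emb = [[0.1] * dim for _ in range(batch)]  # assumed shape (32, 512)
text_emb = [[0.2] * dim for _ in range(batch)]   # assumed shape (32, 512)

# Concatenation along the feature axis doubles the feature dimension.
combined_embedding_concat = [img + txt for img, txt in zip(image_emb, text_emb)]

# Elementwise addition keeps the shape unchanged.
combined_embedding_add = [[a + b for a, b in zip(img, txt)]
                          for img, txt in zip(image_emb, text_emb)]

# A weighted sum (alpha is a hypothetical fusion weight) also keeps the shape.
alpha = 0.7
combined_embedding_weighted = [[alpha * a + (1 - alpha) * b for a, b in zip(img, txt)]
                               for img, txt in zip(image_emb, text_emb)]
```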


Question 10

You're using NVIDIA Triton to serve a multimodal model: a CLIP text encoder and a StyleGAN image generator. You need to ensure high throughput and minimal latency. Which Triton backend configuration is most suitable for this scenario, assuming both models are optimized for NVIDIA GPUs?

（A）Using just the Python backend with the models on CPU.

（B）A Python backend where both models are loaded into memory and inference is performed sequentially.

（C）A single model repository containing both models as TorchScript, served by a single Triton instance using the PyTorch backend.

（D）A single model repository with two model instances (CLIP as ONNX, StyleGAN as TensorRT) served by a single Triton instance, leveraging concurrent execution.

（E）Two separate model repositories, one for CLIP (as ONNX) and one for StyleGAN (as TensorRT), served by two Triton instances on different GPUs.

Correct answer: D


Question 11

You're designing a multimodal AI system for autonomous driving that integrates data from cameras (images), LiDAR (point clouds), radar (time-series), and GPS (geospatial). The system needs to make real-time decisions in complex urban environments. Which hardware and software components are crucial for achieving low latency and high accuracy in data processing and fusion?

（A）Real-time operating system (RTOS) for deterministic execution and minimal jitter.

（B）High-bandwidth, low-latency communication interfaces (e.g., PCIe Gen4/5) for data transfer between sensors and processing units.

（C）All of the above.

（D）NVIDIA GPUs with CUDA for accelerated processing of image and point cloud data.

（E）Sensor fusion algorithms optimized for GPU acceleration.

Correct answer: C


Question 12

You are training a conditional generative model to generate images based on text descriptions. You notice that the generated images often lack fine-grained details and tend to be blurry, even though the overall structure matches the text description. Which of the following techniques would be MOST effective in improving the image quality and adding finer details?

（A）Implement a perceptual loss function that compares high-level features of generated and real images.

（B）Increase the batch size used for training.

（C）Use a simpler generator architecture.

（D）Train the model for fewer epochs.

（E）Decrease the learning rate of the discriminator.

Correct answer: A


Question 13

You're building a virtual assistant using NVIDIA Avatar Cloud Engine (ACE). You want the avatar to respond to user queries with realistic facial expressions and lip synchronization. Which ACE components are essential for achieving this?

（A）Only a 3D avatar model.

（B）Riva ASR, Riva TTS, and Audio2Emotion.

（C）Only Riva ASR and TTS.

（D）Riva ASR, Riva TTS, Audio2Emotion, a 3D avatar model, and an animation engine.

（E）Riva ASR, Riva TTS, Audio2Emotion, and a 3D avatar model.

Correct answer: D


Question 14

You are building a system to generate captions for images. You want to evaluate how well the generated captions describe the content of the images. Which of the following metrics are most suitable for evaluating the quality of image captions?

（A）F1-Score

（B）Pixel Accuracy.

（C）ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

（D）BLEU (Bilingual Evaluation Understudy).

（E）Inception Score.

Correct answer: C, D
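
Both metrics are built on n-gram overlap between a generated caption and a reference caption: BLEU is precision-oriented and ROUGE is recall-oriented. The sketch below computes only the unigram building blocks with whitespace tokenization. The captions and function names are assumptions, and full BLEU additionally combines several n-gram orders with a brevity penalty.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Clipped count of candidate unigrams that also appear in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    return sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())

def bleu1_precision(candidate, reference):
    """Modified unigram precision, the n=1 building block of BLEU (no brevity penalty)."""
    return unigram_overlap(candidate, reference) / len(candidate.split())

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams recovered by the candidate."""
    return unigram_overlap(candidate, reference) / len(reference.split())

# Hypothetical captions.
reference = "a dog runs across the green field"
candidate = "a dog runs on the field"
precision = bleu1_precision(candidate, reference)
recall = rouge1_recall(candidate, reference)
```

Here five of the six candidate words appear in the reference, and five of the seven reference words are recovered.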


Question 15

You are working with a dataset of handwritten digits and training a Variational Autoencoder (VAE) to generate new digits. After training, you observe that the generated digits are blurry and lack sharp details. Which of the following modifications could potentially improve the quality of the generated digits in your VAE?

（A）Decreasing the dimensionality of the latent space.

（B）Increasing the weight of the KL divergence term in the VAE loss function.

（C）Increasing the capacity of the encoder and decoder networks (e.g., adding more layers or neurons).

（D）Reducing the weight of the KL divergence term in the VAE loss function.

（E）Using a simpler decoder architecture.

Correct answer: C, D


Question 16

You are tasked with building a multimodal generative AI model to create marketing content from product images and descriptions. The image encoder uses a pre-trained ResNet50 model, and the text encoder uses a pre-trained BERT model. After initial training, the generated content frequently misinterprets the image. Which of the following strategies is MOST effective in improving the model's ability to correctly interpret the image within the multimodal context?

（A）Increase the learning rate for the BERT model to prioritize text-based information.

（B）Fine-tune the ResNet50 model with a dataset of images specifically related to the product domain, using a contrastive loss function that encourages representations of images and corresponding text to be close in the embedding space.

（C）Freeze the weights of both the ResNet50 and BERT models to prevent overfitting.

（D）Decrease the batch size during training.

（E）Replace ResNet50 with a simpler image encoder like a shallow CNN to reduce computational complexity.

Correct answer: B
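
A contrastive objective of the kind described in option B pulls matched image and text embeddings together and pushes mismatched pairs apart. The sketch below implements one widely used variant, a symmetric InfoNCE loss over a batch of matched pairs. The temperature value, the toy embeddings, and the function names are assumptions, and plain lists stand in for encoder outputs.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text pairs."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy check: aligned pairs should score a lower loss than swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```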


Question 17

You are working on a sequence-to-sequence model for neural machine translation. You've implemented an attention mechanism, but the model is still struggling with long sentences, often losing context in the later parts of the translation. Which type of attention mechanism is most likely to alleviate this issue effectively?

（A）Multi-Head Attention

（B）Local (Hard) Attention

（C）Self-Attention

（D）Global (Soft) Attention

（E）Bahdanau Attention (Additive Attention)

Correct answer: A
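
Multi-head attention runs several attention computations in parallel on different slices of the representation and concatenates the results, so each head can track a different kind of dependency. The following is a simplified sketch under stated assumptions: it splits the feature dimension into heads and applies self-attention per head, while the learned query, key, value, and output projections of a real layer are omitted. Names and inputs are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def multi_head_self_attention(x, num_heads):
    """Split features into heads, attend per head, then concatenate the results."""
    head_dim = len(x[0]) // num_heads
    head_outputs = []
    for h in range(num_heads):
        chunk = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        head_outputs.append(attention(chunk, chunk, chunk))
    # Concatenate the per-head outputs back along the feature axis.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]

# Three toy token vectors with four features each, split across two heads.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
output = multi_head_self_attention(tokens, num_heads=2)
```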


Question 18

You're building a multimodal model that integrates text, images, and audio. The text data has many missing values. Which of the following strategies would be MOST effective for handling missing text data while leveraging the other modalities?

（A）Remove all data points with missing text values to ensure data integrity.

（B）Use a simple imputation method like replacing missing text with a placeholder like 'unknown'.

（C）Train a separate model to predict the missing text based on the available image and audio data, then impute the predicted values.

（D）Ignore the missing text values during training, assuming the model can learn from the available modalities.

（E）Use a multimodal generative model (e.g., VAE, GAN) to impute the missing text based on the learned joint representation of all modalities.

Correct answer: E

