[April 29, 2024] Try the Valid DSA-C02 Test Answers and Snowflake DSA-C02 Exam PDF Questions [Q13-Q29]


Accurate and updated questions from the real DSA-C02 exam question bank

Question # 13
Performance metrics are a part of every machine learning pipeline. Which of the following is not a performance metric used in machine learning?

  • A. Root Mean Squared Error (RMSE)
  • B. AUM
  • C. AU-ROC
  • D. R2 (R-Squared)

Correct Answer: B

Explanation:
Every machine learning task can be broken down into either regression or classification, and so can the performance metrics.
Metrics are used to monitor and measure the performance of a model (during training and testing), and do not need to be differentiable.
Regression metrics
Regression models have continuous output. So, we need a metric based on calculating some sort of distance between predicted and ground truth.
In order to evaluate Regression models, we'll discuss these metrics in detail:
Mean Absolute Error (MAE),
Mean Squared Error (MSE),
Root Mean Squared Error (RMSE),
R2 (R-Squared).
Mean Squared Error (MSE)
Mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model.
A few key points related to MSE:
It's differentiable, so it can be optimized better.
It penalizes even small errors by squaring them, which can lead to an overestimation of how bad the model is.
Error interpretation has to be done with the squaring factor (scale) in mind. For example, in our Boston Housing regression problem, we got MSE = 21.89, which corresponds to (price)².
Due to the squaring factor, it's fundamentally more sensitive to outliers than other metrics.
Mean Absolute Error (MAE)
Mean Absolute Error is the average of the absolute differences between the ground truth and the predicted values.
A few key points for MAE:
It's more robust towards outliers than MSE, since it doesn't exaggerate errors.
It gives us a measure of how far the predictions were from the actual output. However, since MAE uses the absolute value of the residual, it doesn't tell us the direction of the error, i.e. whether we're under-predicting or over-predicting the data.
Error interpretation needs no second thoughts, as the metric is on the same scale as the original variable.
MAE is not differentiable at zero, as opposed to MSE, which is differentiable everywhere.
Root Mean Squared Error (RMSE)
Root Mean Squared Error corresponds to the square root of the average of the squared difference between the target value and the value predicted by the regression model.
A few key points related to RMSE:
It retains the differentiable property of MSE.
Taking the square root tempers the inflation of errors introduced by MSE's squaring.
Error interpretation can be done smoothly, since the scale is now the same as that of the target variable.
Since the square root restores the original scale, it's less prone to struggle in the case of outliers than MSE.
R2 Coefficient of determination
The R2 coefficient of determination actually works as a post metric, meaning it's a metric that's calculated using other metrics.
The point of calculating this coefficient is to answer the question: "How much (what %) of the total variation in Y (target) is explained by the variation in X (regression line)?" A few intuitions related to R2 results:
If the sum of squared errors of the regression line is small => R2 will be close to 1 (ideal), meaning the regression captures most of the variance in the target variable.
Conversely, if the sum of squared errors of the regression line is high => R2 will be close to 0, meaning the regression wasn't able to capture much variance in the target variable.
You might think that the range of R2 is [0, 1], but it's actually (-∞, 1], because the ratio of the squared error of the regression line to that of the mean can exceed 1 if the squared error of the regression line is too high (greater than the squared error of the mean).
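As an illustrative sketch (not part of the original explanation), the four regression metrics above can be computed with scikit-learn; the arrays here are made up for demonstration:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical ground-truth and predicted values for a regression task.
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 5.0])

mae = mean_absolute_error(y_true, y_pred)   # same units as y
mse = mean_squared_error(y_true, y_pred)    # squared units of y
rmse = np.sqrt(mse)                         # back to the units of y
r2 = r2_score(y_true, y_pred)               # 1 is ideal; can be negative

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")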
Classification metrics
Classification problems are one of the world's most widely researched areas. Use cases are present in almost all production and industrial environments. Speech recognition, face recognition, text classification - the list is endless.
Classification models have discrete output, so we need a metric that compares discrete classes in some form.
Classification Metrics evaluate a model's performance and tell you how good or bad the classification is, but each of them evaluates it in a different way.
So in order to evaluate Classification models, we'll discuss these metrics in detail:
Accuracy
Confusion Matrix (not a metric but fundamental to others)
Precision and Recall
F1-score
AU-ROC
Accuracy
Classification accuracy is perhaps the simplest metric to use and implement and is defined as the number of correct predictions divided by the total number of predictions, multiplied by 100.
We can implement this by comparing ground truth and predicted values in a loop or simply utilizing the scikit-learn module to do the heavy lifting for us (not so heavy in this case).
Confusion Matrix
Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions. Here, each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class (note that some libraries, such as scikit-learn, use the transposed convention). Confusion Matrix is not exactly a performance metric but sort of a basis on which other metrics evaluate the results.
Each cell in the confusion matrix represents an evaluation factor. Let's understand these factors one by one:
True Positive (TP) signifies how many positive class samples your model predicted correctly.
True Negative (TN) signifies how many negative class samples your model predicted correctly.
False Positive (FP) signifies how many negative class samples your model predicted incorrectly. This factor represents the Type-I error in statistical nomenclature. Its position in the confusion matrix depends on the choice of the null hypothesis.
False Negative (FN) signifies how many positive class samples your model predicted incorrectly. This factor represents the Type-II error in statistical nomenclature. Its position in the confusion matrix also depends on the choice of the null hypothesis.
Precision
Precision is the ratio of true positives to the total predicted positives: TP / (TP + FP).
Recall/Sensitivity/Hit-Rate
Recall is essentially the ratio of true positives to all positives in the ground truth: TP / (TP + FN).
Precision-Recall tradeoff
To improve your model, you can improve either precision or recall - but typically not both at once. For example, if you try to reduce cases of non-cancerous patients being labeled as cancerous (FP/Type-I), no direct effect will take place on cancerous patients being labeled as non-cancerous (FN/Type-II).
F1-score
The F1-score metric uses a combination of precision and recall. In fact, the F1 score is the harmonic mean of the two.
AUROC (Area under Receiver operating characteristics curve)
Better known as the AUC-ROC score/curve. It makes use of the true positive rate (TPR) and the false positive rate (FPR).
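A minimal sketch tying the classification metrics above together with scikit-learn, using made-up labels and probabilities:

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical binary labels and predictions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
# Hypothetical predicted probabilities for the positive class (for AUC-ROC).
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:\n", confusion_matrix(y_true, y_pred))  # scikit-learn: rows = actual, cols = predicted
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))            # harmonic mean of the two
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))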


Question # 14
Mark the incorrect statement regarding the MIN / MAX functions.

  • A. The data type of the returned value is the same as the data type of the input values
  • B. For compatibility with other systems, the DISTINCT keyword can be specified as an argument for MIN or MAX, but it does not have any effect
  • C. NULL values are ignored unless all the records are NULL, in which case a NULL value is returned
  • D. NULL values are skipped unless all the records are NULL

Correct Answer: C

Explanation:
NULL values are ignored unless all the records are NULL, in which case a NULL value is returned


Question # 15
What can a Snowflake Data Scientist do in the Snowflake Marketplace as a Provider?

  • A. Eliminate the costs of building and maintaining APIs and data pipelines to deliver data to customers.
  • B. Publish listings for datasets that can be customized for the consumer.
  • C. Publish listings for free-to-use datasets to generate interest and new opportunities among the Snowflake customer base.
  • D. Share live datasets securely and in real-time without creating copies of the data or imposing data integration tasks on the consumer.

Correct Answer: A, B, C, D

Explanation:
All are correct!
About the Snowflake Marketplace
You can use the Snowflake Marketplace to discover and access third-party data and services, as well as market your own data products across the Snowflake Data Cloud.
As a data provider, you can use listings on the Snowflake Marketplace to share curated data offerings with many consumers simultaneously, rather than maintain sharing relationships with each individual consumer.
With Paid Listings, you can also charge for your data products.
As a consumer, you might use the data provided on the Snowflake Marketplace to explore and access the following:
Historical data for research, forecasting, and machine learning.
Up-to-date streaming data, such as current weather and traffic conditions.
Specialized identity data for understanding subscribers and audience targets.
New insights from unexpected sources of data.
The Snowflake Marketplace is available globally to all non-VPS Snowflake accounts hosted on Amazon Web Services, Google Cloud Platform, and Microsoft Azure, with the exception of Microsoft Azure Government.
Support for Microsoft Azure Government is planned.


Question # 16
Select the correct statements regarding Normalization.

  • A. Scikit-Learn provides a transformer called RecommendedScaler for Normalization.
  • B. The Normalization technique uses the mean and standard deviation for scaling.
  • C. Normalization is affected by outliers.
  • D. The Normalization technique uses the minimum and maximum values for scaling.

Correct Answer: C, D

Explanation:
Normalization is a scaling technique in Machine Learning applied during data preparation to change the values of numeric columns in the dataset to a common scale. It is not necessary for all datasets in a model; it is required only when the features of a machine learning model have different ranges.
Scikit-Learn provides a transformer called MinMaxScaler for Normalization.
This technique uses the minimum and maximum values for scaling. It is useful when the feature distribution is unknown. It is affected by outliers.
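A minimal sketch of Normalization with scikit-learn's MinMaxScaler on hypothetical data, which also illustrates the sensitivity to outliers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix with very different ranges per column.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [10.0, 1000.0]])  # the last row acts like an outlier

scaler = MinMaxScaler()          # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                  # (x - min) / (max - min) per column
# Note how the large value in each column compresses the other values toward 0,
# illustrating why normalization is affected by outliers.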


Question # 17
Which type of machine learning do Data Scientists generally use for solving classification and regression problems?

  • A. Supervised
  • B. Unsupervised
  • C. Reinforcement Learning
  • D. Instructor Learning
  • E. Regression Learning

Correct Answer: A

Explanation:
Supervised Learning
Overview:
Supervised learning is a type of machine learning that uses labeled data to train machine learning models. In labeled data, the output is already known. The model just needs to map the inputs to the respective outputs.
Algorithms:
Some of the most popularly used supervised learning algorithms are:
Linear Regression
Logistic Regression
Support Vector Machine
K Nearest Neighbor
Decision Tree
Random Forest
Naive Bayes
Working:
Supervised learning algorithms take labeled inputs and map them to the known outputs, which means you already know the target variable.
Supervised Learning methods need external supervision to train machine learning models. Hence, the name supervised. They need guidance and additional information to return the desired result.
Applications:
Supervised learning algorithms are generally used for solving classification and regression problems.
A few of the top supervised learning applications are weather prediction, sales forecasting, and stock price analysis.
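As an illustration of "mapping labeled inputs to known outputs", here is a minimal sketch (not from the original explanation) fitting one classifier and one regressor on synthetic labeled data:

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: labeled inputs with discrete targets.
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)  # learns to map inputs to known classes
print("predicted classes:", clf.predict(Xc[:5]))

# Regression: labeled inputs with continuous targets.
Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)
reg = LinearRegression().fit(Xr, yr)                 # learns to map inputs to known values
print("predicted values :", reg.predict(Xr[:5]))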


Question # 18
Which of the following metrics are used to evaluate classification models?

  • A. All of the above
  • B. F1 score
  • C. Confusion matrix
  • D. Area under the ROC curve

Correct Answer: A

Explanation:
Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of classification and regression. Some metrics, like precision-recall, are useful for multiple tasks. Classification and regression are examples of supervised learning, which constitutes a majority of machine learning applications. By using different metrics for performance evaluation, we should be able to improve our model's overall predictive power before we roll it out for production on unseen data. Evaluating a Machine Learning model only by its accuracy, without using other evaluation metrics, can lead to problems when the model is deployed on unseen data and may end in poor predictions.
Classification metrics are evaluation measures used to assess the performance of a classification model.
Common metrics include accuracy (proportion of correct predictions), precision (true positives over total predicted positives), recall (true positives over total actual positives), F1 score (harmonic mean of precision and recall), and area under the receiver operating characteristic curve (AUC-ROC).
Confusion Matrix
Confusion Matrix is a performance measurement for the machine learning classification problems where the output can be two or more classes. It is a table with combinations of predicted and actual values.
It is extremely useful for measuring the Recall, Precision, Accuracy, and AUC-ROC curves.
The four commonly used metrics for evaluating classifier performance are:
1. Accuracy: The proportion of correct predictions out of the total predictions.
2. Precision: The proportion of true positive predictions out of the total positive predictions (precision = true positives / (true positives + false positives)).
3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total actual positive instances (recall = true positives / (true positives + false negatives)).
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics (F1 score = 2 * ((precision * recall) / (precision + recall))).
These metrics help assess the classifier's effectiveness in correctly classifying instances of different classes.
Understanding how well a machine learning model will perform on unseen data is the main purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models for balanced datasets, but if the data is imbalanced, methods like AUC-ROC do a better job of evaluating model performance.
The ROC curve isn't just a single number but a whole curve that provides nuanced details about the behavior of the classifier. It is also hard to quickly compare many ROC curves to each other.
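To make the four formulas above concrete, here is a small worked example with hypothetical confusion-matrix counts:

# Hypothetical confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)            # 85 / 100 = 0.85
precision = tp / (tp + fp)                            # 40 / 50  = 0.80
recall = tp / (tp + fn)                               # 40 / 45  ~ 0.889
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean ~ 0.842

print(accuracy, precision, recall, f1)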


Question # 19
Which of the following are correct rules when using a data science model created via an External Function in Snowflake?

  • A. An external function can appear in any clause of a SQL statement in which other types of UDF can appear.
  • B. External functions can accept Model parameters.
  • C. External functions return a value. The returned value can be a compound value, such as a VARIANT that contains JSON.
  • D. External functions can be overloaded.

Correct Answer: A, B, C, D

Explanation:
From the perspective of a user running a SQL statement, an external function behaves like any other UDF.
External functions follow these rules:
External functions return a value.
External functions can accept parameters.
An external function can appear in any clause of a SQL statement in which other types of UDF can appear. For example:
select my_external_function_2(column_1, column_2)
from table_1;

select col1
from table_1
where my_external_function_3(col2) < 0;

create view view1 (col1) as
select my_external_function_5(col1)
from table9;

An external function can be part of a more complex expression:

select upper(zipcode_to_city_external_function(zipcode))
from address_table;
The returned value can be a compound value, such as a VARIANT that contains JSON.
External functions can be overloaded; two different functions can have the same name but different signatures (different numbers or data types of input parameters).


Question # 20
Which of the following is a useful tool for gaining insights into the relationship between features and predictions?

  • A. numpy plots
  • B. FULL dependence plots (FDP)
  • C. Partial dependence plots (PDP)
  • D. sklearn plots

Correct Answer: C

Explanation:
Partial dependence plots (PDP) are a useful tool for gaining insights into the relationship between features and predictions. They help us understand how different values of a particular feature impact the model's predictions.
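A minimal sketch of a partial dependence plot using scikit-learn's PartialDependenceDisplay (available in scikit-learn 1.0+); the model and data are hypothetical, and matplotlib is assumed to be installed:

import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

# Hypothetical model and data.
X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Plot how the model's average prediction changes as features 0 and 1 vary.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()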


Question # 21
What is the formula for measuring skewness in a dataset?

  • A. (3(MEAN - MEDIAN))/ STANDARD DEVIATION
  • B. (MEAN - MODE)/ STANDARD DEVIATION
  • C. MODE - MEDIAN
  • D. MEAN - MEDIAN

Correct Answer: A

Explanation:
The formula 3 × (mean − median) / standard deviation is Pearson's second skewness coefficient (median skewness). Since the normal curve is symmetric about its mean, its skewness is zero. This is a theoretical explanation; for mathematical proofs, you can refer to books or websites that discuss it in detail.
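A small worked example of this formula on a made-up right-skewed sample:

import numpy as np

# Hypothetical right-skewed sample.
data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 12, 20], dtype=float)

mean = data.mean()
median = np.median(data)
std = data.std(ddof=1)

# Pearson's second skewness coefficient: 3 * (mean - median) / std.
skewness = 3 * (mean - median) / std
print(skewness)  # positive => right-skewed; ~0 for a symmetric distribution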


Question # 22
Which one is not a type of Feature Scaling?

  • A. Min-Max Scaling
  • B. Robust Scaling
  • C. Economy Scaling
  • D. Standard Scaling

Correct Answer: C

Explanation:
Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale. This is important in machine learning because the scale of the features can affect the performance of the model.
Types of Feature Scaling:
Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by subtracting the minimum value and dividing by the range.
Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation.
Robust Scaling: Rescaling the features to be robust to outliers by subtracting the median and dividing by the interquartile range (IQR), as sketched below.
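The three types above map directly onto scikit-learn transformers. A minimal sketch contrasting them on hypothetical data containing an outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical single-feature data containing an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(MinMaxScaler().fit_transform(X).ravel())   # (x - min) / (max - min)
print(StandardScaler().fit_transform(X).ravel()) # (x - mean) / std
print(RobustScaler().fit_transform(X).ravel())   # (x - median) / IQR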
Benefits of Feature Scaling:
Improves Model Performance: By transforming the features to have a similar scale, the model can learn from all features equally and avoid being dominated by a few large features.
Increases Model Robustness: By transforming the features to be robust to outliers, the model can become more robust to anomalies.
Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest neighbors, are sensitive to the scale of the features and perform better with scaled features.
Improves Model Interpretability: By transforming the features to have a similar scale, it can be easier to understand the model's predictions.


Question # 23
Which of the following methods is used for multiclass classification?

  • A. loocv
  • B. one vs another
  • C. one vs rest
  • D. all vs one

Correct Answer: C

Explanation:
Binary vs. Multi-Class Classification
Classification problems are common in machine learning. In most cases, developers prefer using a supervised machine-learning approach to predict class labels for a given dataset. Unlike regression, classification involves designing the classifier model and training it to input and categorize the test dataset. For that, you can divide the dataset into either binary or multi-class problems.
As the name suggests, binary classification involves solving a problem with only two class labels. This makes it easy to filter the data, apply classification algorithms, and train the model to predict outcomes. On the other hand, multi-class classification is applicable when there are more than two class labels in the input train data.
The technique enables developers to categorize the test data into multiple binary class labels.
That said, while binary classification requires only one classifier model, the one used in the multi-class approach depends on the classification technique. Below are the two models of the multi-class classification algorithm.
One-Vs-Rest Classification Model for Multi-Class Classification
Also known as one-vs-all, the one-vs-rest model is a defined heuristic method that leverages a binary classification algorithm for multi-class classifications. The technique involves splitting a multi-class dataset into multiple sets of binary problems. Following this, a binary classifier is trained to handle each binary classification model with the most confident one making predictions.
For instance, with a multi-class classification problem with red, green, and blue datasets, binary classification can be categorized as follows:
Problem one: red vs. green/blue
Problem two: blue vs. green/red
Problem three: green vs. blue/red
The only challenge of using this model is that you must create a model for every class. The three classes above require three models, which can be challenging for large datasets with millions of rows, for slow models such as neural networks, and for datasets with a significant number of classes.
The one-vs-rest approach requires the individual models to produce a probability-like score. The class index with the largest score is then used to predict a class. As such, it is commonly used for classification algorithms that can naturally predict scores or numerical class membership, such as perceptron and logistic regression.
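A minimal sketch of the one-vs-rest model using scikit-learn's OneVsRestClassifier on synthetic three-class data; the parameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical three-class problem (e.g., red / green / blue).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# One binary logistic-regression model is fitted per class (class vs. rest);
# the class whose model gives the most confident score wins.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:10]))
print(len(ovr.estimators_))  # 3 underlying binary classifiers, one per class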


Question # 24
Which of the following is a common evaluation metric for binary classification?

  • A. Area under the ROC curve (AUC)
  • B. Accuracy
  • C. F1 score
  • D. Mean squared error (MSE)

Correct Answer: A

Explanation:
The area under the ROC curve (AUC) is a common evaluation metric for binary classification, which measures the performance of a classifier at different threshold values for the predicted probabilities. Other common metrics include accuracy, precision, recall, and F1 score, which are based on the confusion matrix of true positives, false positives, true negatives, and false negatives.


Question # 25
Which learning methodology applies the conditional probability of all the variables with respect to the dependent variable?

  • A. Supervised learning
  • B. Unsupervised learning
  • C. Reinforcement learning
  • D. Artificial learning

Correct Answer: A

Explanation:
The supervised learning methodology applies the conditional probability of all the variables with respect to the dependent variable; in general, the conditional probability of variables is simply a basic method of estimating statistics for random experiments.
Conditional probability is thus the likelihood of an event or outcome occurring based on the occurrence of some other event or prior outcome. Two events are said to be independent if one event occurring does not affect the probability that the other event will occur.
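Naive Bayes (listed among the supervised algorithms in Question # 17) is a concrete example of this methodology: it estimates the conditional probability of each variable given the dependent variable and combines them via Bayes' theorem. A minimal sketch with made-up data:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical labeled data: two features, binary dependent variable.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array([0, 0, 1, 1])

# GaussianNB models P(feature | class) per variable and applies Bayes' theorem.
model = GaussianNB().fit(X, y)
print(model.predict([[1.2, 2.1]]))        # most probable class
print(model.predict_proba([[1.2, 2.1]]))  # conditional class probabilities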


Question # 26
Which tools help data scientists manage the ML lifecycle & model versioning?

  • A. CRUX
  • B. MLFlow
  • C. Pachyderm
  • D. Albert

Correct Answer: B, C

Explanation:
Model versioning in a way involves tracking the changes made to an ML model that has been previously built.
Put differently, it is the process of making changes to the configurations of an ML Model. From another perspective, we can see model versioning as a feature that helps Machine Learning Engineers, Data Scientists, and related personnel create and keep multiple versions of the same model.
Think of it as a way of taking notes of the changes you make to the model through tweaking hyperparameters, retraining the model with more data, and so on.
In model versioning, a number of things need to be versioned, to help us keep track of important changes. I'll list and explain them below:
Implementation code: From the early days of model building through the optimization stages, the code - in this case, the source code of the model - plays an important role. This code undergoes significant changes during the optimization stages, which can easily be lost if not tracked properly. Because of this, code is one of the things taken into consideration during the model versioning process.
Data: In some cases, the training data changes significantly from its initial state during the model optimization phases, for example as a result of engineering new features from existing ones to train the model on. There is also metadata (data about your training data and model) to consider versioning; metadata can change several times without the training data actually changing, and we need to be able to track these changes through versioning.
Model: The model is a product of the two previous entities and, as stated in their explanations, an ML model changes at different points of the optimization phases through hyperparameter settings, model artifacts, and learned coefficients. Versioning helps keep a record of the different versions of a Machine Learning model.
MLflow and Pachyderm are tools used to manage the ML lifecycle and model versioning.
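A minimal sketch of tracking these versioned entities with MLflow, assuming a local MLflow tracking setup; parameter names and values are illustrative, and API details may vary slightly across MLflow versions:

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

with mlflow.start_run():
    C = 1.0
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)

    mlflow.log_param("C", C)                                 # hyperparameter
    mlflow.log_metric("train_accuracy", model.score(X, y))   # metric
    mlflow.sklearn.log_model(model, "model")                 # versioned model artifact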


Question # 27
How do you handle missing or corrupted data in a dataset?

  • A. All of the above
  • B. Assign a unique category to missing values
  • C. Drop missing rows or columns
  • D. Replace missing values with mean/median/mode

Correct Answer: A
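A minimal sketch of all three strategies with pandas on a hypothetical dataset:

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values.
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["Tokyo", None, "Osaka", "Tokyo"]})

# Option 1: drop rows (or columns) containing missing values.
dropped = df.dropna()

# Option 2: replace numeric gaps with the mean / median / mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())

# Option 3: assign a unique category to missing categorical values.
filled["city"] = filled["city"].fillna("Unknown")

print(dropped, filled, sep="\n\n")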


Question # 28
Consider a data frame df with 10 rows and index ['r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10'].
What does the aggregate method shown in below code do?
g = df.groupby(df.index.str.len())
g.aggregate({'A':len, 'B':np.sum})

  • A. Computes length of column A and Sum of Column B values
  • B. Computes Sum of column A values
  • C. Computes length of column A and Sum of Column B values of each group
  • D. Computes length of column A

Correct Answer: C

Explanation:
It computes the length of column A and the sum of column B values for each group.
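For reference, a self-contained, runnable version of the snippet with hypothetical column values:

import numpy as np
import pandas as pd

# Hypothetical data frame matching the question's index.
idx = ['r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10']
df = pd.DataFrame({'A': range(10), 'B': range(10, 20)}, index=idx)

# Group rows by the character length of their index labels
# (length 2: r1, r2, r3, r7, r8, r9; length 4: row4, row5, row6; length 5: row10).
g = df.groupby(df.index.str.len())

# len counts the rows of column A in each group; np.sum totals column B per group.
print(g.aggregate({'A': len, 'B': np.sum}))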


Question # 29
......

DSA-C02 exam question bank with PDF questions and test engine: https://www.jpntest.com/shiken/DSA-C02-mondaishu
