DSA-C03 Free Practice Questions: "Snowflake SnowPro Advanced: Data Scientist Certification"
You have a binary classification model deployed in Snowflake to predict customer churn. The model outputs a probability score between 0 and 1. You've calculated the following confusion matrix on a holdout set:

|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | 80                 | 20                 |
| Actual Negative | 10                 | 90                 |

What are the Precision, Recall, and Accuracy for this model, and what do these metrics tell you about the model's performance? (The question also provides a SELECT statement defining the true and false conditions: True Positive, True Negative, False Positive, False Negative.)
Correct answer: C
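For reference, here is how the three metrics fall out of the matrix above (TP = 80, FN = 20, FP = 10, TN = 90). This is a plain-Python sketch of the arithmetic, not the Snowflake SELECT the question refers to.

```python
# Counts taken from the confusion matrix in the question.
tp, fn = 80, 20   # actual positives: predicted positive / negative
fp, tn = 10, 90   # actual negatives: predicted positive / negative

precision = tp / (tp + fp)                    # 80 / 90  ≈ 0.889
recall    = tp / (tp + fn)                    # 80 / 100 = 0.800
accuracy  = (tp + tn) / (tp + fn + fp + tn)   # 170 / 200 = 0.850

print(f"Precision={precision:.3f} Recall={recall:.3f} Accuracy={accuracy:.3f}")
```

Read together, the numbers say the model raises relatively few false churn alarms (precision ≈ 0.89) but still misses one in five actual churners (recall = 0.80), with 85% of all holdout customers classified correctly.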
You are building a machine learning model using Snowpark Python to predict house prices. The dataset contains a feature column named 'location' which contains free-form text descriptions of house locations. You want to leverage a pre-trained Large Language Model (LLM) hosted externally to extract structured location features like city, state, and zip code from the free-form text within Snowpark. You want to minimize the data transferred out of Snowflake. Which approach is most efficient and secure?
Correct answer: D
You've created a Python stored procedure in Snowflake to train a model. The procedure successfully trains the model, saves it using 'joblib.dump', and then attempts to upload the model file to an internal stage. However, the upload fails intermittently with a FileNotFoundError. The stage is correctly configured, and the stored procedure has the necessary privileges. Which of the following actions are MOST likely to resolve this issue? (Select TWO)
Correct answer: A, B
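Intermittent FileNotFoundError during the upload step usually traces back to the file being written somewhere other than the stored procedure's writable scratch directory, or to the dump path and the upload path not matching. A minimal sketch of a handler that avoids both, assuming an internal stage named @MODEL_STAGE and a Snowpark session:

```python
import os
import joblib
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark import Session

def train_and_upload(session: Session) -> str:
    # Placeholder model; in the real procedure this is the trained estimator.
    model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

    # /tmp is the writable scratch area inside the procedure's sandbox;
    # dump and upload from the very same absolute path.
    local_path = os.path.join("/tmp", "churn_model.joblib")
    joblib.dump(model, local_path)

    # Upload via Snowpark's file API (stage name is a placeholder).
    session.file.put(local_path, "@MODEL_STAGE", auto_compress=False, overwrite=True)
    return f"uploaded {local_path}"
```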
You are tasked with forecasting the daily sales of a specific product for the next 30 days using Snowflake. You have historical sales data for the past 3 years, stored in a Snowflake table named 'SALES_DATA', with columns 'SALE_DATE' (DATE type) and 'SALES_AMOUNT' (NUMBER type). You want to use the Prophet library within a Snowflake User-Defined Function (UDF) for forecasting. The Prophet model requires the input data to have columns named 'ds' (for dates) and 'y' (for values). Which of the following code snippets demonstrates the CORRECT way to prepare and pass your data to the Prophet UDF in Snowflake, assuming you've already created the Python UDF 'prophet_forecast'?
Correct answer: D
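Whatever the surrounding UDF looks like, the preparation Prophet needs is the same: the date column has to become 'ds' and the value column 'y'. A hedged sketch of that step (done client-side here for clarity, using the question's table and column names):

```python
import pandas as pd
from prophet import Prophet
from snowflake.snowpark import Session

def forecast_next_30_days(session: Session) -> pd.DataFrame:
    # Pull the history and rename to the columns Prophet expects.
    history = (
        session.table("SALES_DATA")
        .select("SALE_DATE", "SALES_AMOUNT")
        .to_pandas()
        .rename(columns={"SALE_DATE": "ds", "SALES_AMOUNT": "y"})
    )

    model = Prophet()
    model.fit(history)

    future = model.make_future_dataframe(periods=30, freq="D")
    return model.predict(future)[["ds", "yhat"]].tail(30)
```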
A data scientist is analyzing website click-through rates (CTR) for two different ad campaigns. Campaign A ran for two weeks and had 10,000 impressions with 500 clicks. Campaign B also ran for two weeks with 12,000 impressions and 660 clicks. The data scientist wants to determine if there's a statistically significant difference in CTR between the two campaigns. Assume the population standard deviation is unknown and unequal for the two campaigns. Which statistical test is most appropriate to use, and what Snowflake SQL code would be used to approximate the p-value for this test (assume the click and impression counts for each campaign, e.g. 'clicks_b', are already defined Snowflake variables)?
Correct answer: A
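Because a CTR is a count of successes over a known number of trials, the comparison is usually framed as a two-proportion z-test. A plain-Python sketch of the p-value calculation with the question's numbers (the SQL version the question asks about approximates the same arithmetic):

```python
from math import sqrt
from scipy.stats import norm

clicks_a, impressions_a = 500, 10_000     # campaign A: CTR = 0.050
clicks_b, impressions_b = 660, 12_000     # campaign B: CTR = 0.055

p_a = clicks_a / impressions_a
p_b = clicks_b / impressions_b

# Pooled proportion under the null hypothesis of equal CTRs.
p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))

z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))             # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With these counts the two-sided p-value comes out at roughly 0.10, so the half-point CTR gap would not be significant at the conventional 5% level.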
You are a data scientist working with a large dataset of customer transactions stored in Snowflake. You need to identify potential fraud using statistical summaries. Which of the following approaches would be MOST effective in identifying unusual spending patterns, considering the need for scalability and performance within Snowflake?
Correct answer: B, D
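One scalable pattern is to compute per-customer summary statistics inside Snowflake and only flag transactions that sit far from the customer's own typical behaviour. A hedged Snowpark sketch; the table and column names are assumptions for illustration:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col, stddev

def flag_unusual_spend(session: Session):
    tx = session.table("CUSTOMER_TRANSACTIONS")   # assumed table / columns

    # Per-customer mean and standard deviation, computed in the warehouse.
    stats = tx.group_by("CUSTOMER_ID").agg(
        avg("AMOUNT").alias("MEAN_AMOUNT"),
        stddev("AMOUNT").alias("STD_AMOUNT"),
    )

    # Keep only transactions more than 3 standard deviations above the mean.
    return (
        tx.join(stats, on="CUSTOMER_ID")
          .filter(col("AMOUNT") > col("MEAN_AMOUNT") + 3 * col("STD_AMOUNT"))
    )
```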
You are developing a model to predict customer churn using Snowflake ML. After training a Gradient Boosting model, you want to understand the relationship between 'number_of_products' and the churn probability. You generate a partial dependence plot (PDP) for 'number_of_products'. The PDP shows a steep increase in churn probability as 'number_of_products' increases from 1 to 3, followed by a plateau. Which of the following statements are the MOST accurate interpretations of this PDP? Assume the dataset is balanced and has undergone proper preprocessing.
Correct answer: B, C
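For context, this is roughly how such a PDP is produced with scikit-learn once the training data has been sampled or exported from Snowflake. Everything below is illustrative; only the feature name comes from the question:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

# Toy stand-in for the real training frame pulled from Snowflake.
df = pd.DataFrame({
    "number_of_products": [1, 1, 2, 2, 3, 3, 4, 5, 5, 6],
    "tenure_months":      [3, 40, 12, 24, 6, 18, 30, 8, 48, 36],
    "churned":            [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
})
X, y = df[["number_of_products", "tenure_months"]], df["churned"]

model = GradientBoostingClassifier().fit(X, y)

# Average predicted churn as 'number_of_products' varies, with the other
# feature held at its observed values; this is the curve the question describes.
PartialDependenceDisplay.from_estimator(model, X, features=["number_of_products"])
plt.show()
```

The usual caveat when reading any PDP is that it shows an average marginal effect of the model, not a causal effect on real customers.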
You are using Snowflake Cortex to build a customer support chatbot that leverages LLMs to answer customer questions. You have a knowledge base stored in a Snowflake table. The following options describe different methods for using this knowledge base in conjunction with the LLM to generate responses. Which of the following approaches will likely result in the MOST accurate, relevant, and cost-effective responses from the LLM?
Correct answer: C
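For orientation, the retrieval-augmented pattern most options revolve around is: embed the question, pull only the most similar knowledge-base rows by vector similarity, and pass that small context to the LLM. A hedged Snowpark sketch; the table, column, and model names are assumptions, and the Cortex function signatures should be checked against the current documentation:

```python
from snowflake.snowpark import Session

def answer_question(session: Session, question: str) -> str:
    # Retrieve the 3 most similar articles. Assumes KNOWLEDGE_BASE has a
    # precomputed VECTOR column DOC_VEC built with the same embedding model.
    rows = session.sql(
        """
        SELECT CONTENT
        FROM KNOWLEDGE_BASE
        ORDER BY VECTOR_COSINE_SIMILARITY(
                   DOC_VEC,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', ?)
                 ) DESC
        LIMIT 3
        """,
        params=[question],
    ).collect()

    context = "\n\n".join(row["CONTENT"] for row in rows)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # Generate the answer with a Cortex-hosted model (model name is an assumption).
    return session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', ?) AS ANSWER",
        params=[prompt],
    ).collect()[0]["ANSWER"]
```

Sending only the top-matching passages, rather than the whole knowledge base, is what keeps the prompt small and the per-call token cost down.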
You are a data scientist working for a retail company that stores its transaction data in Snowflake. You need to perform feature engineering on customer purchase history data to build a customer churn prediction model. Which of the following approaches best combines Snowflake's capabilities with a machine learning framework (like scikit-learn) for efficient feature engineering? Assume your data is stored in a table named 'CUSTOMER_TRANSACTIONS' with columns like 'CUSTOMER_ID', 'TRANSACTION_DATE', 'AMOUNT', and 'PRODUCT_CATEGORY'.
Correct answer: B
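A common division of labour is to let Snowflake do the heavy per-customer aggregation and pull only the resulting one-row-per-customer feature table into pandas for scikit-learn. A hedged Snowpark sketch using the question's table and column names:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, count, max as max_, sum as sum_

def build_churn_features(session: Session):
    tx = session.table("CUSTOMER_TRANSACTIONS")

    # Aggregate inside Snowflake so only one row per customer leaves the warehouse.
    features = tx.group_by("CUSTOMER_ID").agg(
        count("AMOUNT").alias("NUM_TRANSACTIONS"),
        sum_("AMOUNT").alias("TOTAL_SPEND"),
        avg("AMOUNT").alias("AVG_SPEND"),
        max_("TRANSACTION_DATE").alias("LAST_PURCHASE_DATE"),
    )

    pdf = features.to_pandas()
    # From here, join churn labels and train locally, e.g. with scikit-learn:
    #   RandomForestClassifier().fit(pdf[feature_cols], labels)
    return pdf
```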
You are a data scientist working for a retail company. You've been tasked with identifying fraudulent transactions. You have a Snowflake table named 'TRANSACTIONS' with columns 'TRANSACTION_ID', 'AMOUNT', 'TRANSACTION_DATE', 'CUSTOMER_ID', and 'LOCATION'. You suspect outliers in transaction amounts might indicate fraud. Which of the following SQL queries is the MOST efficient and appropriate to identify potential outliers using the Interquartile Range (IQR) method, and incorporate necessary data type considerations for robust percentile calculations? Consider also the computational cost associated with each approach on a large dataset.


Correct answer: C
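For reference, the IQR approach the options revolve around computes Q1 and Q3 once with PERCENTILE_CONT and then filters the table in a single pass. A sketch issued through Snowpark for convenience, with the question's table and (underscored) column names:

```python
from snowflake.snowpark import Session

IQR_OUTLIERS_SQL = """
WITH stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY AMOUNT) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY AMOUNT) AS q3
    FROM TRANSACTIONS
)
SELECT t.TRANSACTION_ID, t.AMOUNT
FROM TRANSACTIONS t
CROSS JOIN stats s
WHERE t.AMOUNT < s.q1 - 1.5 * (s.q3 - s.q1)
   OR t.AMOUNT > s.q3 + 1.5 * (s.q3 - s.q1)
"""

def find_amount_outliers(session: Session):
    # Quartiles are computed once over the whole table and reused in the filter,
    # which keeps the number of scans low on a large dataset.
    return session.sql(IQR_OUTLIERS_SQL).collect()
```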
You are building a predictive model for customer churn using linear regression in Snowflake. You have identified several features, including 'CUSTOMER_AGE', 'MONTHLY_SPEND', and 'NUM_CALLS'. After performing an initial linear regression, you suspect that the relationship between 'CUSTOMER_AGE' and churn is not linear and that older customers might churn at a different rate than younger customers. You want to introduce a polynomial feature of 'CUSTOMER_AGE' (specifically, 'CUSTOMER_AGE_SQUARED') to your regression model within Snowflake SQL before further analysis with Python and Snowpark. How can you BEST create this new feature in a robust and maintainable way directly within Snowflake?


Correct answer: D
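One maintainable way to materialise such a derived feature is to define it once in a view, so the same 'CUSTOMER_AGE_SQUARED' definition is visible to SQL, Snowpark, and any downstream Python analysis. A hedged sketch; the source table and extra column names are assumptions:

```python
from snowflake.snowpark import Session

def create_churn_feature_view(session: Session):
    # Central definition of the squared-age feature (source table name assumed).
    session.sql("""
        CREATE OR REPLACE VIEW CUSTOMER_CHURN_FEATURES AS
        SELECT
            CUSTOMER_ID,
            CUSTOMER_AGE,
            MONTHLY_SPEND,
            NUM_CALLS,
            CUSTOMER_AGE * CUSTOMER_AGE AS CUSTOMER_AGE_SQUARED
        FROM CUSTOMER_CHURN_DATA
    """).collect()

    # Snowpark (and any SQL client) now sees the engineered column directly.
    return session.table("CUSTOMER_CHURN_FEATURES")
```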
You are building a time-series forecasting model in Snowflake to predict the hourly energy consumption of a building. You have historical data with timestamps and corresponding energy consumption values. You've noticed significant daily seasonality and a weaker weekly seasonality. Which of the following techniques or approaches would be most appropriate for capturing both seasonality patterns within a supervised learning framework using Snowflake?
Correct answer: B, D
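In a supervised setup, both seasonalities are usually encoded as calendar features derived from the timestamp, often as sine/cosine pairs so the cyclical wrap-around (hour 23 next to hour 0, Sunday next to Monday) is preserved. A pandas sketch with an assumed 'TIMESTAMP' column; the same expressions can be written in Snowflake SQL with HOUR() and DAYOFWEEK():

```python
import numpy as np
import pandas as pd

def add_seasonality_features(df: pd.DataFrame) -> pd.DataFrame:
    # Derive daily and weekly seasonality features from a TIMESTAMP column.
    ts = pd.to_datetime(df["TIMESTAMP"])

    hour = ts.dt.hour          # drives the daily pattern
    dow = ts.dt.dayofweek      # drives the weekly pattern

    # Cyclical encodings keep the end of each cycle adjacent to its start.
    df["HOUR_SIN"] = np.sin(2 * np.pi * hour / 24)
    df["HOUR_COS"] = np.cos(2 * np.pi * hour / 24)
    df["DOW_SIN"] = np.sin(2 * np.pi * dow / 7)
    df["DOW_COS"] = np.cos(2 * np.pi * dow / 7)
    return df
```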
You are using Snowpark Feature Store to manage features for your machine learning models. You've created several Feature Groups and now want to consume these features for training a model. To optimize retrieval, you want to use point-in-time correctness. Which of the following actions/configurations are essential to ensure point-in-time correctness when retrieving features using Snowpark Feature Store?
Correct answer: A, C
You've deployed a fraud detection model in Snowflake. The model is implemented as a Python UDF that uses a pre-trained scikit-learn model stored as a stage file. Your goal is to enable near real-time fraud detection on incoming transactions. Due to regulatory requirements, you need to maintain a detailed audit trail of all predictions, including the input features, model version, prediction scores, and any errors encountered during the prediction process. Which of the following approaches are valid and efficient for storing these audit logs and predictions in Snowflake?
Correct answer: A, E
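Whichever storage options are chosen, the core pattern is a dedicated audit table populated in the same step that produces the predictions, so input features, model version, score, and timestamp are captured together. A hedged sketch; the audit table, source table, UDF name, and version string are all assumptions:

```python
from snowflake.snowpark import Session

def score_and_audit(session: Session):
    # One-time setup: audit table keyed by transaction, inputs kept as VARIANT.
    session.sql("""
        CREATE TABLE IF NOT EXISTS FRAUD_PREDICTION_AUDIT (
            TRANSACTION_ID NUMBER,
            INPUT_FEATURES VARIANT,
            MODEL_VERSION  STRING,
            FRAUD_SCORE    FLOAT,
            SCORED_AT      TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
        )
    """).collect()

    # Score incoming transactions with the deployed UDF and log in one pass.
    session.sql("""
        INSERT INTO FRAUD_PREDICTION_AUDIT
            (TRANSACTION_ID, INPUT_FEATURES, MODEL_VERSION, FRAUD_SCORE)
        SELECT
            TRANSACTION_ID,
            OBJECT_CONSTRUCT('AMOUNT', AMOUNT, 'LOCATION', LOCATION),
            'v1.2.0',
            PREDICT_FRAUD(AMOUNT, LOCATION)
        FROM NEW_TRANSACTIONS
    """).collect()
```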
You are tasked with identifying Personally Identifiable Information (PII) within a Snowflake table named 'customer_data'. This table contains various columns, some of which may contain sensitive information like email addresses and phone numbers. You want to use Snowflake's data governance features to tag these columns appropriately. Which of the following approaches is the MOST effective and secure way to automatically identify and tag potential PII columns with the 'PII_CLASSIFIED' tag in your Snowflake environment, ensuring minimal manual intervention and optimal accuracy?
Correct answer: C
You're working on a fraud detection system for an e-commerce platform. You have a table 'TRANSACTIONS' with a 'TRANSACTION_AMOUNT' column. You want to bin the transaction amounts into several risk categories ('Low', 'Medium', 'High', 'Very High') using explicit boundaries. You want the bins to be inclusive of the lower boundary and exclusive of the upper boundary (e.g., [0, 100), [100, 500), etc.). Which of the following SQL statements using the 'WIDTH_BUCKET' function correctly bins the transaction amounts into these categories, assuming these boundaries: 0, 100, 500, 1000, and infinity, and assigns appropriate labels?


Correct answer: A
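The binning semantics the question specifies (lower bound inclusive, upper bound exclusive) are easiest to see with pandas' pd.cut using right=False; however the Snowflake side is written, with WIDTH_BUCKET or a CASE expression, it should reproduce this behaviour. A small sketch with made-up amounts:

```python
import numpy as np
import pandas as pd

amounts = pd.Series([0, 99.99, 100, 499.99, 500, 999.99, 1000, 12_500])

# [0, 100) -> Low, [100, 500) -> Medium, [500, 1000) -> High, [1000, inf) -> Very High
risk = pd.cut(
    amounts,
    bins=[0, 100, 500, 1000, np.inf],
    labels=["Low", "Medium", "High", "Very High"],
    right=False,   # lower boundary inclusive, upper boundary exclusive
)
print(pd.DataFrame({"TRANSACTION_AMOUNT": amounts, "RISK_CATEGORY": risk}))
```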