Associate-Developer-Apache-Spark問題集を掴み取れ！[最新2022]Databricks試験合格させます [Q69-Q93]

Associate-Developer-Apache-Spark問題集を掴み取れ！[最新2022]Databricks試験合格させます

Associate-Developer-Apache-Spark試験問題集PDF正確率保証と更新された問題

質問 69
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.
Code block:
1.def add_2_if_geq_3(x):
2. if x is None:
3. return x
4. elif x >= 3:
5. return x+2
6. return x
7.
8.add_2_if_geq_3_udf = udf(add_2_if_geq_3)
9.
10.transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

A. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
B. The Python function is unable to handle null values, resulting in the code block crashing on execution.
C. The operator used to adding the column does not add column predErrorAdded to the DataFrame.
D. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
E. The udf() method does not declare a return type.

正解: C

解説:
Explanation
Correct code block:
def add_2_if_geq_3(x):
if x is None:
return x
elif x >= 3:
return x+2
return x
add_2_if_geq_3_udf = udf(add_2_if_geq_3)
transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show() Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type, this is not a required argument. However, the default return type is StringType. This may not be the ideal return type for numeric, nullable data - but the code will run without specified return type nevertheless.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values, this is what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.

質問 70
Which of the following is one of the big performance advantages that Spark has over Hadoop?

A. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
B. Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
C. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
D. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.
E. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.

正解: D

解説:
Explanation
Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
Wrong, there is no "DAG format". DAG stands for "directed acyclic graph". The DAG is a means of representing computational steps in Spark. However, it is true that Hadoop does not use a DAG.
The introduction of the DAG in Spark was a result of the limitation of Hadoop's map reduce framework in which data had to be written to and read from disk continuously.
Graph DAG in Apache Spark - DataFlair
Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
No. Spark can certainly store data in HDFS (as well as other formats), but this is not a key performance advantage over Hadoop. Hadoop can use multiple file formats, not only parquet.
Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
No, resiliency is not asked for in the question. The question is about performance improvements.
Both Hadoop and Spark can be deployed on Kubernetes.
Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
No. DataFrames are a concept in Spark, but not in Hadoop.

質問 71
Which of the following code blocks returns a new DataFrame with only columns predError and values of every second row of DataFrame transactionsDf?
Entire DataFrame transactionsDf:
1.+-------------+---------+-----+-------+---------+----+
2.|transactionId|predError|value|storeId|productId| f|
3.+-------------+---------+-----+-------+---------+----+
4.| 1| 3| 4| 25| 1|null|
5.| 2| 6| 7| 2| 2|null|
6.| 3| 3| null| 25| 3|null|
7.| 4| null| null| 3| 2|null|
8.| 5| null| null| null| 2|null|
9.| 6| 3| 2| 25| 2|null|
10.+-------------+---------+-----+-------+---------+----+

A. transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value") (Correct)
B. transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")
C. transactionsDf.filter(col(transactionId).isin([3,4,6]))
D. transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")
E. 1.transactionsDf.createOrReplaceTempView("transactionsDf")
2.spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")
F. transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])

正解: A

解説:
Explanation
Output of correct code block:
+---------+-----+
|predError|value|
+---------+-----+
| 6| 7|
| null| null|
| 3| 2|
+---------+-----+
This is not an easy question to solve. You need to know that % stands for the module operator in Python. % 2 will return true for every second row. The statement using spark.sql gets it almost right (the modulo operator exists in SQL as well), but % 2 = 2 will never yield true, since modulo 2 is either 0 or 1.
Other answers are wrong since they are missing quotes around the column names and/or use filter or select incorrectly.
If you have any doubts about SparkSQL and answer options 3 and 4 in this question, check out the notebook I created as a response to a related student question.
Static notebook | Dynamic notebook: See test 1

質問 72
Which of the following options describes the responsibility of the executors in Spark?

A. The executors accept tasks from the driver, execute those tasks, and return results to the driver.
B. The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.
C. The executors accept jobs from the driver, analyze those jobs, and return results to the driver.
D. The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.
E. The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.

正解: A

解説:
Explanation
More info: Running Spark: an overview of Spark's runtime architecture - Manning (https://bit.ly/2RPmJn9)

質問 73
The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.
Sample of DataFrame transactionsDf:
1.+-------------+---------+-----+-------+---------+----+----------------+
2.|transactionId|predError|value|storeId|productId| f| transactionDate|
3.+-------------+---------+-----+-------+---------+----+----------------+
4.| 1| 3| 4| 25| 1|null|2020-04-26 15:35|
5.| 2| 6| 7| 2| 2|null|2020-04-13 22:01|
6.| 3| 3| null| 25| 3|null|2020-04-02 10:53|
7.+-------------+---------+-----+-------+---------+----+----------------+ Code block:
1.transactionsDf = transactionsDf.drop("transactionDate")
2.transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")

A. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment.
B. Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment. Operator to_unixtime() should be used instead of unix_timestamp().
C. The string indicating the date format should be adjusted. The withColumnReplaced operator should be used instead of the drop and assign pattern in the code block to replace column transactionDate with the new column transactionTimestamp.
D. Column transactionDate should be wrapped in a col() operator.
E. Column transactionDate should be dropped after transactionTimestamp has been written. The withColumn operator should be used instead of the existing column assignment. Column transactionDate should be wrapped in a col() operator.

正解: A

解説:
Explanation
This question requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
You can clearly see that column transactionDate should be dropped only after transactionTimestamp has been written. This is because to generate column transactionTimestamp, Spark needs to read the values from column transactionDate.
Values in column transactionDate in the original transactionsDf DataFrame look like 2020-04-26 15:35. So, to convert those correctly, you would have to pass yyyy-MM-dd HH:mm. In other words:
The string indicating the date format should be adjusted.
While you might be tempted to change unix_timestamp() to to_unixtime() (in line with the from_unixtime() operator), this function does not exist in Spark. unix_timestamp() is the correct operator to use here.
Also, there is no DataFrame.withColumnReplaced() operator. A similar operator that exists is DataFrame.withColumnRenamed().
Whether you use col() or not is irrelevant with unix_timestamp() - the command is fine with both.
Finally, you cannot assign a column like transactionsDf["columnName"] = ... in Spark. This is Pandas syntax (Pandas is a popular Python package for data analysis), but it is not supported in Spark.
So, you need to use Spark's DataFrame.withColumn() syntax instead.
More info: pyspark.sql.functions.unix_timestamp - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 3

質問 74
The code block displayed below contains an error. The code block should return DataFrame transactionsDf, but with the column storeId renamed to storeNumber. Find the error.
Code block:
transactionsDf.withColumn("storeNumber", "storeId")

A. Argument "storeId" should be the first and argument "storeNumber" should be the second argument to the withColumn method.
B. Instead of withColumn, the withColumnRenamed method should be used.
C. Instead of withColumn, the withColumnRenamed method should be used and argument "storeId" should be the first and argument "storeNumber" should be the second argument to that method.
D. The withColumn operator should be replaced with the copyDataFrame operator.
E. Arguments "storeNumber" and "storeId" each need to be wrapped in a col() operator.

正解: C

解説:
Explanation
Correct code block:
transactionsDf.withColumnRenamed("storeId", "storeNumber")
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation Static notebook | Dynamic notebook: See test 1

質問 75
Which of the following is the idea behind dynamic partition pruning in Spark?

A. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
B. Dynamic partition pruning performs wide transformations on disk instead of in memory.
C. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
D. Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
E. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.

正解: C

解説:
Explanation
Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
No - this is what adaptive query execution does, but not dynamic partition pruning.
Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
Wrong, this answer does not make sense, especially related to dynamic partition pruning.
Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
It is true that dynamic partition pruning works in joins using broadcast variables. This actually happens in both the logical optimization and the physical planning stage. However, data types do not play a role for the reoptimization.
Dynamic partition pruning performs wide transformations on disk instead of in memory.
This answer does not make sense. Dynamic partition pruning is meant to accelerate Spark - performing any transformation involving disk instead of memory resources would decelerate Spark and certainly achieve the opposite effect of what dynamic partition pruning is intended for.

質問 76
Which of the following DataFrame operators is never classified as a wide transformation?

A. DataFrame.aggregate()
B. DataFrame.join()
C. DataFrame.select()
D. DataFrame.sort()
E. DataFrame.repartition()

正解: C

解説:
Explanation
As a general rule: After having gone through the practice tests you probably have a good feeling for what classifies as a wide and what classifies as a narrow transformation. If you are unsure, feel free to play around in Spark and display the explanation of the Spark execution plan via DataFrame.[operation, for example sort()].explain(). If repartitioning is involved, it would count as a wide transformation.
DataFrame.select()
Correct! A wide transformation includes a shuffle, meaning that an input partition maps to one or more output partitions. This is expensive and causes traffic across the cluster. With the select() operation however, you pass commands to Spark that tell Spark to perform an operation on a specific slice of any partition. For this, Spark does not need to exchange data across partitions, each partition can be worked on independently. Thus, you do not cause a wide transformation.
DataFrame.repartition()
Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is known as a shuffle and, in turn, is classified as a wide transformation.
DataFrame.aggregate()
No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on one or more input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation.
DataFrame.join()
Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join() counts as a wide transformation.
DataFrame.sort()
False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as wide transformation.
More info: Understanding Apache Spark Shuffle | Philipp Brunenberg

質問 77
The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__.__3__(__4__))

A. 1. select
2. col("storeId")
3. as
4. StringType
B. 1. select
2. col("storeId")
3. cast
4. StringType
C. 1. cast
2. "storeId"
3. as
4. StringType()
D. 1. select
2. storeId
3. cast
4. StringType()
E. 1. select
2. col("storeId")
3. cast
4. StringType()

正解: E

解説:
Explanation
Correct code block:
transactionsDf.select(col("storeId").cast(StringType()))
Solving this question involves understanding that, when using types from the pyspark.sql.types such as StringType, these types need to be instantiated when using them in Spark, or, in simple words, they need to be followed by parentheses like so: StringType(). You could also use .cast("string") instead, but that option is not given here.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

質問 78
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:

A. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
B. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
C. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
E. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

正解: C

解説:
Explanation
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy")) Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime method to transform values in column transactionDate into strings, following the format requested in the question.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy")) No. Although almost correct, this uses the wrong format for the timestamp to date conversion: day/month/year instead of month/day/year.
transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy")) Incorrect. This answer uses wrong syntax. The command DataFrame.withColumnRenamed() is for renaming an existing column only has two string parameters, specifying the old and the new name of the column.
transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted") Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data - but this is irrelevant for this question, since we do not deal with grouped data here.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate")) No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like this:
2020-04-26 15:35:32, the default format specified in Spark for from_unixtime and not what is asked for in the question.
More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation Static notebook | Dynamic notebook: See test 1

質問 79
The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before
2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.
Schema:
1.root
2. |-- itemId: integer (nullable = true)
3. |-- attributes: array (nullable = true)
4. | |-- element: string (containsNull = true)
5. |-- supplier: string (nullable = true)
Code block:
1.schema = StructType([
2. StructType("itemId", IntegerType(), True),
3. StructType("attributes", ArrayType(StringType(), True), True),
4. StructType("supplier", StringType(), True)
5.])
6.
7.spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

A. The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
B. The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
C. Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
D. Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
E. Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

正解: D

解説:
Explanation
Correct code block:
schema = StructType([
StructField("itemId", IntegerType(), True),
StructField("attributes", ArrayType(StringType(), True), True),
StructField("supplier", StringType(), True)
])
spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath) This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not "one or multiple" as in the question.
Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType and StructType as shown in the question is wrong.
The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46") and not two consecutive non-keyword arguments as in the original code block (see documentation linked below).
Spark cannot identify the file format correctly, because either it has to be specified by using the DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().
Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
No. If StructField would be used for the columns instead of StructType (see above), the third argument specified whether the column is nullable. The original schema shows that columns should be nullable and this is specified correctly by the third argument being True in the schema in the code block.
It is correct, however, that the modification date threshold is specified incorrectly (see above).
The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.
Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Incorrect, the object types in the schema definition are correct and syntax of the call to Spark's DataFrameReader is correct.
The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
False. The data type of the schema is StructType and an accepted data type for the DataFrameReader.schema() method. It is correct however that the modification date threshold is specified incorrectly (see correct answer above).

質問 80
Which of the following describes the conversion of a computational query into an execution plan in Spark?

A. The executed physical plan depends on a cost optimization from a previous stage.
B. Spark uses the catalog to resolve the optimized logical plan.
C. The catalog assigns specific resources to the optimized memory plan.
D. The catalog assigns specific resources to the physical plan.
E. Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

正解: A

解説:
Explanation
The executed physical plan depends on a cost optimization from a previous stage.
Correct! Spark considers multiple physical plans on which it performs a cost analysis and selects the final physical plan in accordance with the lowest-cost outcome of that analysis. That final physical plan is then executed by Spark.
Spark uses the catalog to resolve the optimized logical plan.
No. Spark uses the catalog to resolve the unresolved logical plan, but not the optimized logical plan. Once the unresolved logical plan is resolved, it is then optimized using the Catalyst Optimizer.
The optimized logical plan is the input for physical planning.
The catalog assigns specific resources to the physical plan.
No. The catalog stores metadata, such as a list of names of columns, data types, functions, and databases.
Spark consults the catalog for resolving the references in a logical plan at the beginning of the conversion of the query into an execution plan. The result is then an optimized logical plan.
Depending on whether DataFrame API or SQL API are used, the physical plan may differ.
Wrong - the physical plan is independent of which API was used. And this is one of the great strengths of Spark!
The catalog assigns specific resources to the optimized memory plan.
There is no specific "memory plan" on the journey of a Spark computation.
More info: Spark's Logical and Physical plans ... When, Why, How and Beyond. | by Laurent Leturgez | datalex | Medium

質問 81
Which of the following code blocks selects all rows from DataFrame transactionsDf in which column productId is zero or smaller or equal to 3?

A. transactionsDf.filter(productId==3 or productId<1)
B. transactionsDf.where("productId"=3).or("productId"<1))
C. transactionsDf.filter(col("productId")==3 | col("productId")<1)
D. transactionsDf.filter((col("productId")==3) or (col("productId")<1))
E. transactionsDf.filter((col("productId")==3) | (col("productId")<1))

正解: E

解説:
Explanation
This question targets your knowledge about how to chain filtering conditions. Each filtering condition should be in parentheses. The correct operator for "or" is the pipe character (|) and not the word or. Another operator of concern is the equality operator. For the purpose of comparison, equality is expressed as two equal signs (==).
Static notebook | Dynamic notebook: See test 2

質問 82
The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionsId of DataFrame transactionsDf. Find the error.
Code block:
itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

A. The join method is inappropriate.
B. The merge method should be used instead of join.
C. The join statement is incomplete.
D. The union method should be used instead of join.
E. The join expression is malformed.

正解: C

解説:
Explanation
Correct code block:
itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.transactionId) The join statement is incomplete.
Correct! If you look at the documentation of DataFrame.join() (linked below), you see that the very first argument of join should be the DataFrame that should be joined with. This first argument is missing in the code block.
The join method is inappropriate.
No. By default, DataFrame.join() uses an inner join. This method is appropriate for the scenario described in the question.
The join expression is malformed.
Incorrect. The join expression itemsDf.itemId==transactionsDf.transactionId is correct syntax.
The merge method should be used instead of join.
False. There is no DataFrame.merge() method in PySpark.
The union method should be used instead of join.
Wrong. DataFrame.union() merges rows, but not columns as requested in the question.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation, pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 3

質問 83
Which of the following code blocks returns a DataFrame that is an inner join of DataFrame itemsDf and DataFrame transactionsDf, on columns itemId and productId, respectively and in which every itemId just appears once?

A. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId", how="inner").dropDuplicates(["itemId"])
B. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId, how="inner").distinct(["itemId"])
C. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId").distinct("itemId")
D. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates(["itemId"])
E. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates("itemId")

正解: D

解説:
Explanation
Filtering out distinct rows based on columns is achieved with the dropDuplicates method, not the distinct method which does not take any arguments.
The second argument of the join() method only accepts strings if they are column names. The SQL-like statement "itemsDf.itemId==transactionsDf.productId" is therefore invalid.
In addition, it is not necessary to specify how="inner", since the default join type for the join command is already inner.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

質問 84
Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?
Sample of DataFrame transactionsDf:
1.+-------------+---------+-----+-------+---------+----+
2.|transactionId|predError|value|storeId|productId| f|
3.+-------------+---------+-----+-------+---------+----+
4.| 1| 3| 4| 25| 1|null|
5.| 2| 6| 7| 2| 2|null|
6.| 3| 3| null| 25| 3|null|
7.+-------------+---------+-----+-------+---------+----+

A. transactionsDf.drop(["predError", "value"])
B. transactionsDf.drop([col("predError"), col("value")])
C. transactionsDf.drop(col("value"), col("predError"))
D. transactionsDf.drop("predError", "value")
E. transactionsDf.drop(value, predError)

正解: D

解説:
Explanation
Output of correct code block:
+-------------+-------+---------+----+
|transactionId|storeId|productId| f|
+-------------+-------+---------+----+
| 1| 25| 1|null|
| 2| 2| 2|null|
| 3| 25| 3|null|
+-------------+-------+---------+----+
To solve this question, you should be fmailiar with the drop() API. The order of column names does not matter
- in this question the order differs in some answers just to confuse you. Also, drop() does not take a list. The *cols operator in the documentation means that all arguments passed to drop() are interpreted as column names.
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

質問 85
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:

A. transactionsDf.format("parquet").option("mode", "append").save(path)
B. The code block is missing a reference to the DataFrameWriter.
C. The mode option should be omitted so that the command uses the default mode.
D. save() is evaluated lazily and needs to be followed by an action.
E. Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.
F. The code block is missing a bucketBy command that takes care of partitions.

正解: B

解説:
Explanation
Correct code block:
transactionsDf.write.format("parquet").option("mode", "append").save(path)

質問 86
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?

A. transactionsDf.resample(0.15, False, 3142)
B. transactionsDf.sample(0.85, 8429)
C. transactionsDf.sample(0.15, False, 3142)
D. transactionsDf.sample(True, 0.15, 8261)
E. transactionsDf.sample(0.15)

正解: D

解説:
Explanation
Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows:
DataFrame.sample(withReplacement=None, fraction=None, seed=None).
The first argument withReplacement specified whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark. But we have to enable it here, since the question asks for a row being able to appear more than once. So, we need to pass True for this argument.
About replacement: "Replacement" is easiest explained with the example of removing random items from a box. When you remove those "with replacement" it means that after you have taken an item out of the box, you put it back inside. So, essentially, if you would randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more times. "Without replacement" means that you would not put the item back into the box after removing it. So, every time you remove an item from the box, there is one less item in the box and you can never take the same item twice.
The second argument to the withReplacement method is fraction. This referes to the fraction of items that should be returned. In the question we are asked for 150 out of 1000 items - a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized processed repeatable. This means that if you would re-run the same sample() operation with the same random seed, you would get the same rows returned from the sample() command. There is no behavior around the random seed specified in the question. The varying random seeds are only there to confuse you!
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1

質問 87
The code block displayed below contains multiple errors. The code block should return a DataFrame that contains only columns transactionId, predError, value and storeId of DataFrame transactionsDf. Find the errors.
Code block:
transactionsDf.select([col(productId), col(f)])
Sample of transactionsDf:
1.+-------------+---------+-----+-------+---------+----+
2.|transactionId|predError|value|storeId|productId| f|
3.+-------------+---------+-----+-------+---------+----+
4.| 1| 3| 4| 25| 1|null|
5.| 2| 6| 7| 2| 2|null|
6.| 3| 3| null| 25| 3|null|
7.+-------------+---------+-----+-------+---------+----+

A. The select operator should be replaced by a drop operator.
B. The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
C. The column names should be listed directly as arguments to the operator and not as a list.
D. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.

正解: D

解説:
Explanation
Correct code block: transactionsDf.drop("productId", "f")
This question requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
Correct! Here, you need to figure out the many, many things that are wrong with the initial code block. While the question can be solved by using a select statement, a drop statement, given the answer options, is the correct one. Then, you can read in the documentation that drop does not take a list as an argument, but just the column names that should be dropped. Finally, the column names should be expressed as strings and not as Python variable names as in the original code block.
The column names should be listed directly as arguments to the operator and not as a list.
Incorrect. While this is a good first step and part of the correct solution (see above), this modification is insufficient to solve the question.
The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
Wrong. If you use the same pattern as in the original code block (col(productId), col(f)), you are still making a mistake. col(productId) will trigger Python to search for the content of a variable named productId instead of telling Spark to use the column productId - for that, you need to express it as a string.
The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
No. This still leaves you with Python trying to interpret the column names as Python variables (see above).
The select operator should be replaced by a drop operator.
Wrong, this is not enough to solve the question. If you do this, you will still face problems since you are passing a Python list to drop and the column names are still interpreted as Python variables (see above).
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3

質問 88
Which of the following is a viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

A. Increase values for the properties spark.sql.parallelism and spark.sql.partitions
B. Decrease values for the properties spark.default.parallelism and spark.sql.partitions
C. Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
D. Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
E. Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions

正解: C

解説:
Explanation
Decrease values for the properties spark.default.parallelism and spark.sql.partitions No, these values need to be increased.
Increase values for the properties spark.sql.parallelism and spark.sql.partitions Wrong, there is no property spark.sql.parallelism.
Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions See above.
Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions The property spark.dynamicAllocation.maxExecutors is only in effect if dynamic allocation is enabled, using the spark.dynamicAllocation.enabled property. It is disabled by default. Dynamic allocation can be useful when to run multiple applications on the same cluster in parallel. However, in this case there is only a single application running on the cluster, so enabling dynamic allocation would not yield a performance benefit.
More info: Practical Spark Tips For Data Scientists | Experfy.com and Basics of Apache Spark Configuration Settings | by Halil Ertan | Towards Data Science (https://bit.ly/3gA0A6w ,
https://bit.ly/2QxhNTr)

質問 89
Which of the following describes a narrow transformation?

A. A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
B. A narrow transformation is a process in which data from multiple RDDs is used.
C. narrow transformation is an operation in which data is exchanged across partitions.
D. A narrow transformation is an operation in which data is exchanged across the cluster.
E. A narrow transformation is an operation in which no data is exchanged across the cluster.

正解: E

解説:
Explanation
A narrow transformation is an operation in which no data is exchanged across the cluster.
Correct! In narrow transformations, no data is exchanged across the cluster, since these transformations do not require any data from outside of the partition they are applied on. Typical narrow transformations include filter, drop, and coalesce.
A narrow transformation is an operation in which data is exchanged across partitions.
No, that would be one definition of a wide transformation, but not of a narrow transformation. Wide transformations typically cause a shuffle, in which data is exchanged across partitions, executors, and the cluster.
A narrow transformation is an operation in which data is exchanged across the cluster.
No, see explanation just above this one.
A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like
16-bit or 8-bit float variables.
No, type conversion has nothing to do with narrow transformations in Spark.
A narrow transformation is a process in which data from multiple RDDs is used.
No. A resilient distributed dataset (RDD) can be described as a collection of partitions. In a narrow transformation, no data is exchanged between partitions. Thus, no data is exchanged between RDDs.
One could say though that a narrow transformation and, in fact, any transformation results in a new RDD being created. This is because a transformation results in a change to an existing RDD (RDDs are the foundation of other Spark data structures, like DataFrames). But, since RDDs are immutable, a new RDD needs to be created to reflect the change caused by the transformation.
More info: Spark Transformation and Action: A Deep Dive | by Misbah Uddin | CodeX | Medium

質問 90
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

A. spark.read.json(filePath)
B. spark.read.path(filePath)
C. spark.read().json(filePath)
D. spark.read.path(filePath, source="json")
E. spark.read().path(filePath)

正解: A

解説:
Explanation
spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Then, Spark identifies the file type to be read as JSON type by passing filePath into the DataFrameReader.json() method.
spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).
spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).
spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.
spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).
More info: pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 3

質問 91
Which of the following is a problem with using accumulators?

A. Only unnamed accumulators can be inspected in the Spark UI.
B. Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
C. Only numeric values can be used in accumulators.
D. Accumulator values can only be read by the driver, but not by executors.
E. Accumulators do not obey lazy evaluation.

正解: D

解説:
Explanation
Accumulator values can only be read by the driver, but not by executors.
Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good way to do that.
Only numeric values can be used in accumulators.
No. While pySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).
Accumulators do not obey lazy evaluation.
Incorrect - accumulators do obey lazy evaluation. This has implications in practice: When an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.
Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.
Wrong. A concern with accumulators is in fact that under certain conditions they can run for each task more than once. For example, if a hardware failure occurs during a task after an accumulator variable has been increased but before a task has finished and Spark launches the task on a different worker in response to the failure, already executed accumulator variable increases will be repeated.
Only unnamed accumulators can be inspected in the Spark UI.
No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide - Spark 3.1.2 Documentation, pyspark.Accumulator - PySpark 3.1.2 documentation, and pyspark.AccumulatorParam - PySpark 3.1.2 documentation

質問 92
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?

A. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
B. itemsDf.withColumn("supplier").alias("manufacturer")
C. itemsDf.withColumnsRenamed("supplier", "manufacturer")
D. itemsDf.withColumnRenamed("supplier", "manufacturer")
E. itemsDf.withColumn(["supplier", "manufacturer"])

正解: D

解説:
Explanation
itemsDf.withColumnRenamed("supplier", "manufacturer")
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming column supplier to column manufacturer.
Note that the question asks for "a copy of DataFrame itemsDf". This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied.
itemsDf.withColumnsRenamed("supplier", "manufacturer")
Incorrect. Spark's DataFrame API does not have a withColumnsRenamed() method.
itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
No. Watch out - although the col() method works for many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below, withColumnRenamed expects strings.
itemsDf.withColumn(["supplier", "manufacturer"])
Wrong. While DataFrame.withColumn() exists in Spark, it has a different purpose than renaming columns.
withColumn is typically used to add columns to DataFrames, taking the name of the new column as a first, and a Column as a second argument. Learn more via the documentation that is linked below.
itemsDf.withColumn("supplier").alias("manufacturer")
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, the alias() method on DataFrames would not help the cause of renaming a column much. DataFrame.alias() can be useful in addressing the input of join statements. However, this is far outside of the scope of this question. If you are curious nevertheless, check out the link below.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation, pyspark.sql.DataFrame.withColumn - PySpark 3.1.1 documentation, and pyspark.sql.DataFrame.alias - PySpark 3.1.2 documentation (https://bit.ly/3aSB5tm , https://bit.ly/2Tv4rbE , https://bit.ly/2RbhBd2) Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/31.html ,
https://bit.ly/sparkpracticeexams_import_instructions)