Latest 2022 Verified Free Databricks Associate-Developer-Apache-Spark Exam Questions [Q20-Q42]


Real Associate-Developer-Apache-Spark exam questions and answers, free of charge

Question 20
The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.
Code block:
transactionsDf.join(itemsDf, "itemId", how="broadcast")

  • A. The join method should be replaced by the broadcast method.
  • B. The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
  • C. broadcast is not a valid join type.
  • D. The syntax is wrong, how= should be removed from the code block.
  • E. Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.

Correct answer: C

Explanation:
broadcast is not a valid join type.
Correct! The code block should read transactionsDf.join(broadcast(itemsDf), "itemId"), with broadcast() imported from pyspark.sql.functions. This would imply an inner join (the default in DataFrame.join()), but since the join type is not given in the question, this would be a valid choice.
The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
This option does not apply here, since the syntax around broadcasting is incorrect.
Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
No, it is enabled by default, since the spark.sql.autoBroadcastJoinThreshold property is set to 10 MB by default. If that property were set to -1, automatic broadcast joining would be disabled.
More info: Performance Tuning - Spark 3.1.1 Documentation (https://bit.ly/3gCz34r)
The join method should be replaced by the broadcast method.
No, DataFrame has no broadcast() method.
The syntax is wrong, how= should be removed from the code block.
No, having the keyword argument how= is totally acceptable.

 

Question 21
The code block shown below should read all files with the file ending .png in directory path into Spark.
Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)

  • A. 1. open
    2. format
    3. "image"
    4. "fileType"
    5. open
  • B. 1. read()
    2. format
    3. "binaryFile"
    4. "recursiveFileLookup"
    5. load
  • C. 1. read
    2. format
    3. binaryFile
    4. pathGlobFilter
    5. load
  • D. 1. open
    2. as
    3. "binaryFile"
    4. "pathGlobFilter"
    5. load
  • E. 1. read
    2. format
    3. "binaryFile"
    4. "pathGlobFilter"
    5. load

Correct answer: E

Explanation:
Correct code block:
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)
Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read. The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator - the open operator shown in some of the answers does not exist.

 

Question 22
Which of the following code blocks applies the Python function to_limit on column predError in table transactionsDf, returning a DataFrame with columns transactionId and result?

  • A. spark.udf.register(to_limit, "LIMIT_FCN")
    spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
  • B. spark.udf.register("LIMIT_FCN", to_limit)
    spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")
  • C. spark.udf.register("LIMIT_FCN", to_limit)
    spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
  • D. spark.udf.register("LIMIT_FCN", to_limit)
    spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
  • E. spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")

Correct answer: C

Explanation:
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
Correct! First, you have to register to_limit as UDF to use it in a sql statement. Then, you can use it under the LIMIT_FCN name, correctly naming the resulting column result.
spark.udf.register(to_limit, "LIMIT_FCN")
spark.sql("SELECT transactionId, LIMIT_FCN(predError) AS result FROM transactionsDf")
No. In this answer, the arguments to spark.udf.register are flipped.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, to_limit(predError) AS result FROM transactionsDf")
Wrong, this answer does not use the registered LIMIT_FCN in the sql statement, but tries to access the to_limit method directly. This will fail, since Spark cannot access it.
spark.sql("SELECT transactionId, udf(to_limit(predError)) AS result FROM transactionsDf")
Incorrect, there is no udf method in Spark's SQL.
spark.udf.register("LIMIT_FCN", to_limit)
spark.sql("SELECT transactionId, LIMIT_FCN(predError) FROM transactionsDf AS result")
False. In this answer, the column that results from applying the UDF is not correctly renamed to result.
Static notebook | Dynamic notebook: See test 3

 

Question 23
Which of the following describes Spark actions?

  • A. Stage boundaries are commonly established by actions.
  • B. Actions are Spark's way of exchanging data between executors.
  • C. Actions are Spark's way of modifying RDDs.
  • D. Writing data to disk is the primary purpose of actions.
  • E. The driver receives data upon request by actions.

Correct answer: E

Explanation:
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable - they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.

 

Question 24
The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.
__1__.__2__(__3__, __4__, __5__)

  • A. 1. itemsDf
    2. broadcast
    3. transactionsDf
    4. "transactionId"
    5. "left_semi"
  • B. 1. itemsDf
    2. join
    3. broadcast(transactionsDf)
    4. "transactionId"
    5. "left_semi"
  • C. 1. transactionsDf
    2. join
    3. broadcast(itemsDf)
    4. "transactionId"
    5. "left_semi"
  • D. 1. transactionsDf
    2. join
    3. broadcast(itemsDf)
    4. transactionsDf.transactionId==itemsDf.transactionId
    5. "outer"
  • E. 1. transactionsDf
    2. join
    3. itemsDf
    4. transactionsDf.transactionId==itemsDf.transactionId
    5. "anti"

Correct answer: C

Explanation:
Correct code block:
transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")
This question is extremely difficult and exceeds the difficulty of questions in the exam by far.
A first indication of what is asked of you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() call. One answer option does not include this call, so you can disregard it. Another answer option wraps broadcast() around transactionsDf - the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.
When thinking about broadcast(), you may also remember that it is a function in the pyspark.sql.functions module. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.
Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
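To make the semantics of a broadcast left semi join concrete, here is a minimal plain-Python sketch that needs no Spark installation. It is a toy model, not Spark's implementation: the helper left_semi_join and the sample rows are hypothetical, and the "broadcast" is simply a set of keys built from the small side.

```python
# Toy model of a broadcast left semi join in plain Python (no Spark required).
# The helper name and the sample rows are made up for illustration.

def left_semi_join(left_rows, right_rows, key):
    """Keep each left row whose key value also occurs in right_rows."""
    # "Broadcast" step: reduce the small side to a set of keys that every
    # worker could hold in memory in full.
    right_keys = {row[key] for row in right_rows}
    # Semi join: emit left rows only, unchanged, and at most once each.
    return [row for row in left_rows if row[key] in right_keys]


transactions = [
    {"transactionId": 1, "value": 4},
    {"transactionId": 2, "value": 7},
    {"transactionId": 3, "value": None},
]
items = [
    {"transactionId": 1, "itemName": "a"},
    {"transactionId": 1, "itemName": "b"},  # duplicate key on the right side
    {"transactionId": 3, "itemName": "c"},
]

result = left_semi_join(transactions, items, "transactionId")
print(result)
```

Note that the result carries only the columns of the left side and that the duplicate key on the right side does not multiply rows, which is what distinguishes a semi join from an inner or outer join.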

 

Question 25
In which order should the code blocks shown below be run in order to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per column storeId and productId, where productId should be either 2 or 3 and the returned DataFrame should be sorted in ascending order by column storeId, leaving out any nulls in that column?
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
1. .mean("predError")
2. .groupBy("storeId")
3. .orderBy("storeId")
4. transactionsDf.filter(transactionsDf.storeId.isNotNull())
5. .pivot("productId", [2, 3])

  • A. 4, 2, 1
  • B. 4, 1, 5, 2, 3
  • C. 4, 3, 2, 5, 1
  • D. 4, 2, 5, 1, 3
  • E. 4, 5, 2, 3, 1

Correct answer: D

Explanation:
Correct code block:
transactionsDf.filter(transactionsDf.storeId.isNotNull()).groupBy("storeId").pivot("productId", [2, 3]).mean("predError").orderBy("storeId")
Output of correct code block:
+-------+----+----+
|storeId|   2|   3|
+-------+----+----+
|      2| 6.0|null|
|      3|null|null|
|     25| 3.0| 3.0|
+-------+----+----+
This question is quite convoluted and requires you to think hard about the correct order of operations.
The pivot method also makes an appearance - a method that you may not know all that much about (yet).
At the first position in all answers is code block 4, so the question is essentially just about the ordering of the remaining 4 code blocks.
The question states that the returned DataFrame should be sorted by column storeId. So, it should make sense to have code block 3 which includes the orderBy operator at the very end of the code block. This leaves you with only two answer options.
Now, it is useful to know more about the context of pivot in PySpark. A common pattern is groupBy, pivot, and then another aggregating function, like mean. In the documentation linked below you can see that pivot is a method of pyspark.sql.GroupedData - meaning that before pivoting, you have to use groupBy. The only answer option matching this requirement is the one in which code block 2 (which includes groupBy) is stated before code block 5 (which includes pivot).
More info: pyspark.sql.GroupedData.pivot - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
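The filter, groupBy, pivot, mean, orderBy sequence can be retraced by hand in plain Python. The sketch below is only an illustration of the logic under the data from the question, not how Spark executes it; the variable names and the helper mean_or_none are hypothetical.

```python
# Plain-Python re-computation of the grouped, pivoted mean (no Spark required).
from collections import defaultdict

# (transactionId, predError, value, storeId, productId), taken from the
# table in the question; None stands in for null.
rows = [
    (1, 3, 4, 25, 1),
    (2, 6, 7, 2, 2),
    (3, 3, None, 25, 3),
    (4, None, None, 3, 2),
    (5, None, None, None, 2),
    (6, 3, 2, 25, 2),
]
pivot_values = [2, 3]

# Code block 4: drop rows whose storeId is null.
filtered = [r for r in rows if r[3] is not None]

# Code blocks 2 and 5: group by storeId, then spread productId into columns.
cells = defaultdict(lambda: {p: [] for p in pivot_values})
for _, pred_error, _, store_id, product_id in filtered:
    bucket = cells[store_id]  # the group exists even if nothing is aggregated
    if product_id in pivot_values and pred_error is not None:
        bucket[product_id].append(pred_error)


# Code block 1: mean per cell, with None for empty cells.
def mean_or_none(values):
    return sum(values) / len(values) if values else None


# Code block 3: order by storeId.
result = [
    (store_id, *(mean_or_none(cells[store_id][p]) for p in pivot_values))
    for store_id in sorted(cells)
]
print(result)
```

The three resulting tuples correspond row by row to the output table shown for the correct code block.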

 

Question 26
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

  • A. itemsDf.sample(fraction=1000, seed=98263)
  • B. itemsDf.sample(fraction=0.1)
  • C. itemsDf.sample(fraction=0.1, seed=87238)
  • D. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
  • E. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

Correct answer: C

Explanation:
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact amount of rows. To ensure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed in the seed does not matter as long as it is an integer.
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.
Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). If you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, meaning the next time you take a ball from the bucket there would be a chance you could take the exact same ball again. If you took the balls without replacement, you would leave the ball outside the bucket and not put it back in as you take the next 999 balls.
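The bucket analogy, and the role of the seed, can be tried out in plain Python. This is a simplified sketch using the standard random module and should not be read as Spark's sampling implementation; both helper functions are hypothetical. Keeping each row with a fixed probability is one simple way to see why the row count is only approximate.

```python
import random


def sample_without_replacement(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def sample_with_replacement(rows, fraction, seed):
    """Draw about fraction * len(rows) rows, putting each one back afterwards."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(round(fraction * len(rows)))]


rows = list(range(10_000))

# Same seed, same rows; no row can appear twice.
first = sample_without_replacement(rows, 0.1, seed=87238)
second = sample_without_replacement(rows, 0.1, seed=87238)

# With replacement, the same row may be drawn more than once.
with_dupes = sample_with_replacement(rows, 0.1, seed=23536)

print(len(first), first == second, len(set(first)) == len(first))
print(len(with_dupes) - len(set(with_dupes)), "repeated draws with replacement")
```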
itemsDf.sample(fraction=1000, seed=98263)
Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.
itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
No, DataFrame.sampleBy() is meant for stratified sampling. This means that based on the values in a column in a DataFrame, you can draw a certain fraction of rows containing those values from the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should depend on.
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to specify a seed.
More info:
- pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.sampleBy - PySpark 3.1.2 documentation
- Types of Samplings in PySpark 3. The explanations of the sampling... | by Pinar Ersoy | Towards Data Science

 

Question 27
The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__.__3__(__4__))

  • A. 1. select
    2. col("storeId")
    3. as
    4. StringType
  • B. 1. select
    2. col("storeId")
    3. cast
    4. StringType
  • C. 1. cast
    2. "storeId"
    3. as
    4. StringType()
  • D. 1. select
    2. storeId
    3. cast
    4. StringType()
  • E. 1. select
    2. col("storeId")
    3. cast
    4. StringType()

Correct answer: E

Explanation:
Correct code block:
transactionsDf.select(col("storeId").cast(StringType()))
Solving this question involves understanding that types from the pyspark.sql.types module, such as StringType, need to be instantiated when used in Spark or, in simple words, need to be followed by parentheses, like so: StringType(). You could also use .cast("string") instead, but that option is not given here.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

Question 28
The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error.
Code block:
transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

  • A. Column predError should be sorted by desc_nulls_first() instead.
  • B. Instead of orderBy, sort should be used.
  • C. Column predError should be sorted in a descending way, putting nulls last.
  • D. Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
  • E. Column value should be wrapped by the col() operator.

Correct answer: C

Explanation:
Correct code block:
transactionsDf.orderBy('value', desc_nulls_last('predError'))
Column predError should be sorted in a descending way, putting nulls last.
Correct! By default, Spark sorts ascending, putting nulls first. So, the inverse sort of the default sort is indeed desc_nulls_last.
Instead of orderBy, sort should be used.
No. In PySpark, DataFrame.orderBy() is an alias of DataFrame.sort(), so swapping one for the other would not change the result. Sorting within each partition only, without a global order, is what DataFrame.sortWithinPartitions() does.
Column value should be wrapped by the col() operator.
Incorrect. DataFrame.orderBy() accepts both column names as strings and Column objects.
Column predError should be sorted by desc_nulls_first() instead.
Wrong. Since Spark's default sort order matches asc_nulls_first(), nulls would have to come last when inverted.
Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
No, this would just sort the DataFrame by the very last column, but would not take information from both columns into account, as noted in the question.
More info: pyspark.sql.DataFrame.orderBy - PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science
Static notebook | Dynamic notebook: See test 3
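The intended ordering (value ascending with nulls first, then predError descending with nulls last) can be imitated in plain Python to check one's understanding. This is a sketch with made-up sample pairs and hypothetical key helpers; it mirrors the ordering described above, not Spark's sort implementation.

```python
# (value, predError) pairs; None stands in for null. The data is made up.
rows = [(1, 3), (1, None), (None, 5), (1, 7), (0, 2)]


def asc_nulls_first_key(x):
    # False sorts before True, so None comes first; real values ascend.
    return (x is not None, x if x is not None else 0)


def desc_nulls_last_key(x):
    # Real values are negated so they descend; None gets the largest flag.
    return (x is None, -x if x is not None else 0)


ordered = sorted(
    rows,
    key=lambda r: (asc_nulls_first_key(r[0]), desc_nulls_last_key(r[1])),
)
print(ordered)
```

Within the three rows where value is 1, predError runs 7, 3, null: descending, with the null at the end.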

 

Question 29
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.
Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

  • A. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
  • B. The Python function is unable to handle null values, resulting in the code block crashing on execution.
  • C. The operator used to add the column does not add column predErrorAdded to the DataFrame.
  • D. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
  • E. The udf() method does not declare a return type.

Correct answer: C

Explanation:
Correct code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()
Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type, this is not a required argument. However, the default return type is StringType. This may not be the ideal return type for numeric, nullable data - but the code will run without specified return type nevertheless.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values, this is what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.
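The null handling of the Python function can be checked without Spark, since a UDF calls the plain function once per value and passes null as None. In the sketch below, the sample inputs and the variant without_null_check are made up for illustration.

```python
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x + 2
    return x


# Made-up inputs, including a null (None) as found in a nullable column.
sample_values = [3, 6, None, 1]
results = [add_2_if_geq_3(x) for x in sample_values]
print(results)


# Hypothetical variant without the None guard, to show what the guard prevents.
def without_null_check(x):
    return x + 2 if x >= 3 else x


error = None
try:
    without_null_check(None)
except TypeError as exc:  # comparing None with an int is not supported
    error = exc
print(type(error).__name__)
```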

 

Question 30
Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?

  • A. transactionsDf.withColumn("storeId", convert("storeId").as("string"))
  • B. transactionsDf.withColumn("storeId", col("storeId").convert("string"))
  • C. transactionsDf.withColumn("storeId", col("storeId").cast("string"))
  • D. transactionsDf.withColumn("storeId", col("storeId", "string"))
  • E. transactionsDf.withColumn("storeId", convert("storeId", "string"))

Correct answer: C

Explanation:
This question asks for your knowledge about the cast syntax. cast is a method of the Column class. It is worth noting that one could also convert a column type using the Column.astype() method, which is just an alias for cast.
Find more info in the documentation linked below.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

Question 31
Which of the following statements about garbage collection in Spark is incorrect?

  • A. Manually persisting RDDs in Spark prevents them from being garbage collected.
  • B. Garbage collection information can be accessed in the Spark UI's stage detail view.
  • C. Optimizing garbage collection performance in Spark may limit caching ability.
  • D. In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
  • E. Serialized caching is a strategy to increase the performance of garbage collection.

Correct answer: A

Explanation:
Manually persisting RDDs in Spark prevents them from being garbage collected.
This statement is incorrect, and thus the correct answer to the question. Spark's garbage collector will remove even persisted objects, albeit in an "LRU" fashion. LRU stands for least recently used.
So, during a garbage collection run, the objects that were used the longest time ago will be garbage collected first.
See the linked StackOverflow post below for more information.
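As an illustration of the "least recently used" order described above, here is a minimal LRU cache in plain Python. It is a toy model only and makes no claim about how Spark manages memory internally; the class LruCache and the block names are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()  # oldest use first, most recent use last

    def get(self, key):
        if key not in self._blocks:
            return None
        self._blocks.move_to_end(key)  # mark as most recently used
        return self._blocks[key]

    def put(self, key, value):
        self._blocks[key] = value
        self._blocks.move_to_end(key)
        evicted = []
        while len(self._blocks) > self.capacity:
            old_key, _ = self._blocks.popitem(last=False)  # drop the oldest
            evicted.append(old_key)
        return evicted


cache = LruCache(capacity=2)
cache.put("rdd_a", "partition data")
cache.put("rdd_b", "partition data")
cache.get("rdd_a")  # rdd_a is now more recently used than rdd_b
evicted = cache.put("rdd_c", "partition data")
print(evicted)
```

Because rdd_a was read after rdd_b was stored, rdd_b is the entry that gets dropped when the third block arrives.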
Serialized caching is a strategy to increase the performance of garbage collection.
This statement is correct. The more Java objects Spark needs to collect during garbage collection, the longer it takes. Storing a collection of many Java objects, such as a DataFrame with a complex schema, through serialization as a single byte array thus increases performance. This means that garbage collection takes less time on a serialized DataFrame than an unserialized DataFrame.
Optimizing garbage collection performance in Spark may limit caching ability.
This statement is correct. A full garbage collection run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing the number or duration of these slowdowns.
A full garbage collection run is triggered when the Old generation of the Java heap space is almost full. (If you are unfamiliar with this concept, check out the link to the Garbage Collection Tuning docs below.) Thus, one measure to avoid triggering a garbage collection run is to prevent the Old generation share of the heap space from becoming almost full.
To achieve this, one may decrease its size. Objects larger than the Old generation space will then be discarded instead of being cached (stored) in that space, where they would contribute to it becoming "almost full".
This will decrease the number of full garbage collection runs, increasing overall performance.
Inevitably, however, objects will need to be recomputed when they are needed. So, this mechanism only works when a Spark application needs to reuse cached data as little as possible.
Garbage collection information can be accessed in the Spark UI's stage detail view.
This statement is correct. The task table in the Spark UI's stage detail view has a "GC Time" column, indicating the garbage collection time needed per task.
In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
This statement is correct. The G1 garbage collector, also known as garbage first garbage collector, is an alternative to the default Parallel garbage collector.
While the default Parallel garbage collector divides the heap into a few static regions, the G1 garbage collector divides the heap into many small regions that are created dynamically. The G1 garbage collector has certain advantages over the Parallel garbage collector which improve performance particularly for Spark workloads that require high throughput and low latency.
The G1 garbage collector is not enabled by default, and you need to explicitly pass an argument to Spark to enable it. For more information about the two garbage collectors, check out the Databricks article linked below.

 

Question 32
Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)

  • A. spark.read.option("mergeSchema", "true").parquet(filePath)
  • B. spark.read.parquet(filePath)
  • C. nx = 0
    for file in dbutils.fs.ls(filePath):
        if not file.name.endswith(".parquet"):
            continue
        df_temp = spark.read.parquet(file.path)
        if nx == 0:
            df = df_temp
        else:
            df = df.union(df_temp)
        nx = nx+1
    df
  • D. spark.read.parquet(filePath, mergeSchema='y')
  • E. nx = 0
    for file in dbutils.fs.ls(filePath):
        if not file.name.endswith(".parquet"):
            continue
        df_temp = spark.read.parquet(file.path)
        if nx == 0:
            df = df_temp
        else:
            df = df.join(df_temp, how="outer")
        nx = nx+1
    df

Correct answer: A

Explanation:
This is a very tricky question and involves both knowledge about merging as well as schemas when reading parquet files.
spark.read.option("mergeSchema", "true").parquet(filePath)
Correct. Spark's DataFrameReader's mergeSchema option will work well here, since columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name that appear in both partitions had different data types.
spark.read.parquet(filePath)
Incorrect. While this would read in data from both partitions, only the schema in the parquet file that is read in first would be considered, so some columns that appear only in the second partition (e.g. tax_id) would be lost.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx+1
df
Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have the exact same number of columns with identical data types.
spark.read.parquet(filePath, mergeSchema="y")
False. While using the mergeSchema option is the correct way to solve this problem and it can even be called with DataFrameReader.parquet() as in the code block, it accepts the value True as a boolean or string variable. But 'y' is not a valid option.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith(".parquet"):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how="outer")
    nx = nx+1
df
No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated - the question says all columns that are included in the partitions should appear exactly once.
More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium
Static notebook | Dynamic notebook: See test 3
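The idea behind merging the two partition schemas can be sketched in plain Python: take the union of the column names, keep each name once, and refuse to merge when the same name carries different types. The function merge_schemas is hypothetical and deliberately simplified (it treats any type mismatch as a conflict); it is not Spark's merge logic.

```python
def merge_schemas(*schemas):
    """Union of column names; each name is kept once, type conflicts raise."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                raise ValueError(
                    f"conflicting types for column {name!r}: "
                    f"{merged[name]} vs {dtype}"
                )
            merged.setdefault(name, dtype)
    return merged


# Column names and types taken from the two schemas in the question.
first = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "productId": "integer", "f": "integer",
}
second = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "rollId": "integer", "f": "integer",
    "tax_id": "integer",
}

merged = merge_schemas(first, second)
print(sorted(merged))

# A made-up conflicting schema: the same column name with a different type.
conflict = None
try:
    merge_schemas(first, {"value": "string"})
except ValueError as exc:
    conflict = exc
print(conflict)
```

The merged result has eight columns: the six from the first partition plus rollId and tax_id from the second, each appearing exactly once.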

 

Question 33
Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

  • A. itemsDf.cache()
  • B. itemsDf.write.option('destination', 'memory').save()
  • C. itemsDf.store()
  • D. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
  • E. itemsDf.persist(StorageLevel.MEMORY_ONLY)

Correct answer: A

Explanation:
The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values to memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, however not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not have any arguments.
If you have trouble finding the storage level information in the documentation, please also see the student Q&A thread that sheds some light on this.
Static notebook | Dynamic notebook: See test 2

 

Question 34
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?

  • A. transactionsDf.distinct("storeId")
  • B. transactionsDf["storeId"].distinct()
  • C. transactionsDf.select(col("storeId").distinct())
  • D. transactionsDf.filter("storeId").distinct()
  • E. transactionsDf.select("storeId").distinct()

Correct answer: E

Explanation:
distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question.
More info: pyspark.sql.DataFrame.distinct - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

Question 35
Which of the following code blocks returns a single-column DataFrame of all entries in Python list throughputRates, which contains only float-type values?

  • A. spark.DataFrame(throughputRates, FloatType)
  • B. spark.createDataFrame((throughputRates), FloatType)
  • C. spark.createDataFrame(throughputRates, FloatType())
  • D. spark.createDataFrame(throughputRates)
  • E. spark.createDataFrame(throughputRates, FloatType)

Correct answer: C

Explanation:
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the correct operator to use here and the type FloatType() which is passed in for the command's schema argument is correctly instantiated using the parentheses.
Remember that it is essential in PySpark to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, spark returns a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While packing throughputRates in parentheses does not do anything to the execution of this command, not instantiating the FloatType with parentheses as in the previous answer will make this command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Given that it does not matter whether you pass throughputRates in parentheses or not, see the explanation of the previous answer for further insights.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Omitting the schema argument makes PySpark try to infer the schema. However, as you can see in the documentation (linked below), the inference only works if you pass in an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). Since you are passing a Python list of plain float values, Spark's schema inference will fail.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
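Why the parentheses matter can be seen without a cluster. The following is a plain-Python sketch, assuming a schema check based on isinstance; SketchDataType, SketchFloatType and accepts_schema are hypothetical stand-ins rather than PySpark's own classes, and PySpark's real validation may differ in detail.

```python
class SketchDataType:
    """Hypothetical stand-in for a Spark data type base class."""


class SketchFloatType(SketchDataType):
    """Hypothetical stand-in for a float column type."""


def accepts_schema(schema):
    """Accept only instances of a data type, never the bare class."""
    return isinstance(schema, SketchDataType)


# With parentheses: an instance is created, so the check passes.
instance_ok = accepts_schema(SketchFloatType())

# Without parentheses: the class object itself is passed, so the check fails.
class_ok = accepts_schema(SketchFloatType)

print(instance_ok, class_ok)
```

In PySpark itself, the corresponding call from the correct answer is spark.createDataFrame(throughputRates, FloatType()), with FloatType imported from pyspark.sql.types.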

 

Question 36
Which of the following code blocks saves DataFrame transactionsDf in location /FileStore/transactions.csv as a CSV file and throws an error if a file already exists in the location?

  • A. transactionsDf.write.save("/FileStore/transactions.csv")
  • B. transactionsDf.write.format("csv").mode("error").save("/FileStore/transactions.csv")
  • C. transactionsDf.write.format("csv").mode("error").path("/FileStore/transactions.csv")
  • D. transactionsDf.write.format("csv").mode("ignore").path("/FileStore/transactions.csv")
  • E. transactionsDf.write("csv").mode("error").save("/FileStore/transactions.csv")

Correct answer: B

Explanation:
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/28.html ,
https://bit.ly/sparkpracticeexams_import_instructions)

 

Question 37
Which of the following describes the most efficient way to resize a DataFrame from 16 to 8 partitions?

  • A. Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
  • B. Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
  • C. Use a wide transformation to reduce the number of partitions.
  • D. Use a narrow transformation to reduce the number of partitions.
  • E. Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.

Correct answer: D

Explanation:
Use a narrow transformation to reduce the number of partitions.
Correct! DataFrame.coalesce(n) is a narrow transformation, and in fact the most efficient way to resize the DataFrame of all options listed. One would run DataFrame.coalesce(8) to resize the DataFrame.
Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
Wrong. The coalesce operation avoids a full shuffle, but will move data if needed. This answer is incorrect because it says "fully shuffle", which is something the coalesce operation will not do. As a general rule, it reduces the number of partitions with the least possible movement of data.
More info: distributed computing - Spark - repartition() vs coalesce() - Stack Overflow
Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
Incorrect, since the numPartitions parameter must be an integer defining the exact number of partitions desired after the operation.
More info: pyspark.sql.DataFrame.coalesce - PySpark 3.1.2 documentation
Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
No. The repartition operation will fully shuffle the DataFrame. This is not the most efficient way of reducing the number of partitions of all listed options.
Use a wide transformation to reduce the number of partitions.
No. While possible via the DataFrame.repartition(n) command, the resulting full shuffle is not the most efficient way of reducing the number of partitions.
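The difference between a narrow and a wide resize can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: partitions are lists of rows, coalesce merges neighbouring partitions, and repartition deals rows out round-robin. The function names mirror the Spark operations but are hypothetical re-implementations; Spark's actual placement logic is more involved.

```python
def coalesce(partitions, n):
    """Merge whole neighbouring partitions into n groups (toy model)."""
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        # Every row of a source partition goes to the same target.
        merged[index * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Deal all rows out round-robin, as a full shuffle would (toy model)."""
    shuffled = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for index, row in enumerate(rows):
        shuffled[index % n].append(row)
    return shuffled


def targets_per_source(partitions):
    """For each source partition, count how many targets hold its rows."""
    targets = {}
    for target_index, part in enumerate(partitions):
        for source_index, _ in part:
            targets.setdefault(source_index, set()).add(target_index)
    return [len(targets[source]) for source in sorted(targets)]


# 16 partitions of 3 rows; each row is tagged with its source partition.
sixteen = [[(p, r) for r in range(3)] for p in range(16)]

narrow = targets_per_source(coalesce(sixteen, 8))    # one target per source
wide = targets_per_source(repartition(sixteen, 8))   # several targets per source
print(max(narrow), max(wide))
```

In the model, every source partition ends up in exactly one target under coalesce (a narrow dependency), while under repartition its rows are scattered over several targets (a wide dependency), which is why the latter requires a shuffle.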

 

Question 38
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?

  • A. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
  • B. itemsDf.merge(transactionsDf, "itemsDf.itemId == transactionsDf.storeId", "broadcast")
  • C. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
  • D. itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
  • E. itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))

Correct answer: D

Explanation:
The issue with all answers that have "broadcast" as the very last argument is that "broadcast" is not a valid join type. While the entry with "right_outer" is a valid statement, it is not a broadcast join. The item where broadcast() is wrapped around the equality condition is not valid code in Spark. broadcast() needs to be wrapped around the name of the small DataFrame that should be broadcast.
More info: Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 1
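What "sending the small DataFrame to all executors" means can be sketched in plain Python. The following is a toy model, not Spark code: broadcast_hash_join is a hypothetical helper, rows are dicts, partitions are lists, and the sample data is made up. It assumes an inner join, which is the default of DataFrame.join().

```python
def broadcast_hash_join(large_partitions, small_rows, large_key, small_key):
    """Join each partition of the large side against a shared lookup table."""
    # "Broadcast": build one lookup table from the small side; in Spark a
    # copy of it would be shipped to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)

    joined = []
    for partition in large_partitions:
        # Each partition is joined where it already lives, so the large
        # side is never shuffled.
        joined.append([
            {**large_row, **small_row}
            for large_row in partition
            for small_row in lookup.get(large_row[large_key], [])
        ])
    return joined


# Hypothetical sample data: itemsDf split over two partitions.
items_partitions = [
    [{"itemId": 1, "itemName": "bolt"}, {"itemId": 2, "itemName": "nut"}],
    [{"itemId": 3, "itemName": "gear"}],
]
transactions = [
    {"transactionId": 10, "storeId": 1},
    {"transactionId": 11, "storeId": 3},
    {"transactionId": 12, "storeId": 3},
]

result = broadcast_hash_join(items_partitions, transactions, "itemId", "storeId")
print([len(partition) for partition in result])
```

The PySpark counterpart is the correct answer itself: itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId), with broadcast imported from pyspark.sql.functions.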

 

Question 39
Which of the following describes Spark's Adaptive Query Execution?

  • A. Adaptive Query Execution reoptimizes queries at execution points.
  • B. Adaptive Query Execution features are dynamically switching join strategies and dynamically optimizing skew joins.
  • C. Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
  • D. Adaptive Query Execution is enabled in Spark by default.
  • E. Adaptive Query Execution applies to all kinds of queries.

Correct answer: B

Explanation:
Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
This is almost correct. All of these features, except for dynamically injecting scan filters, are part of Adaptive Query Execution. Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of Dynamic Partition Pruning and not of Adaptive Query Execution.
Adaptive Query Execution reoptimizes queries at execution points.
No, Adaptive Query Execution reoptimizes queries at materialization points.
Adaptive Query Execution is enabled in Spark by default.
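No, Adaptive Query Execution is disabled by default in Spark 3.1, the version the linked documentation covers, and needs to be enabled through the spark.sql.adaptive.enabled property. Note that since Spark 3.2.0 it is enabled by default.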
Adaptive Query Execution applies to all kinds of queries.
No, Adaptive Query Execution applies only to queries that are not streaming queries and that contain at least one exchange (typically expressed through a join, aggregate, or window operator) or one subquery.
More info: How to Speed up SQL Queries with Adaptive Query Execution, Learning Spark, 2nd Edition, Chapter 12 (https://bit.ly/3tOh8M1)
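As a minimal sketch of how these features could be switched on, the snippet below collects the relevant properties from the Spark 3.x configuration documentation and applies them to a session. AQE_SETTINGS and enable_aqe are hypothetical names, and spark is assumed to be an existing SparkSession (as provided in Databricks notebooks).

```python
# Properties from the Spark 3.x SQL configuration documentation.
AQE_SETTINGS = {
    # Master switch for Adaptive Query Execution.
    "spark.sql.adaptive.enabled": "true",
    # Dynamically coalesce shuffle partitions.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Dynamically optimize skew joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark):
    """Apply the AQE settings to an existing SparkSession (assumed)."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
    return spark
```

Calling enable_aqe(spark) in a notebook would apply all three settings at runtime; the same keys could instead be set in the cluster configuration.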

 

Question 40
Which of the following code blocks silently writes DataFrame itemsDf in avro format to location fileLocation if a file does not yet exist at that location?

  • A. itemsDf.write.avro(fileLocation)
  • B. itemsDf.save.format("avro").mode("ignore").write(fileLocation)
  • C. spark.DataFrameWriter(itemsDf).format("avro").write(fileLocation)
  • D. itemsDf.write.format("avro").mode("ignore").save(fileLocation)
  • E. itemsDf.write.format("avro").mode("errorifexists").save(fileLocation)

Correct answer: D

Explanation:
The trick in this question is knowing the "modes" of the DataFrameWriter. Mode ignore does nothing if a file already exists: it neither replaces that file nor throws an error. Mode errorifexists throws an error, and is the default mode of the DataFrameWriter. The question explicitly calls for the DataFrame to be "silently" written if it does not exist, so you need to specify mode("ignore") here to keep Spark from raising an error if the file already exists.
The "overwrite" mode would not be right here since, although it would be silent, it would overwrite the already-existing file. This is not what the question asks for.
It is worth noting that the option starting with spark.DataFrameWriter(itemsDf) cannot work, since spark references the SparkSession object, but that object does not provide the DataFrameWriter.
As you can see in the documentation (below), DataFrameWriter is part of PySpark's SQL API, but not of its SparkSession API.
More info:
DataFrameWriter: pyspark.sql.DataFrameWriter.save - PySpark 3.1.1 documentation
SparkSession API: Spark SQL - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
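The behaviour of the save modes can be summarized in a small plain-Python model. This is a sketch, not Spark code: save is a hypothetical function, storage is a dict standing in for the file system, the paths are made up, and FileExistsError stands in for the exception Spark would raise.

```python
def save(storage, path, rows, mode="errorifexists"):
    """Toy model of DataFrameWriter save modes; storage is a plain dict."""
    exists = path in storage
    if mode in ("error", "errorifexists"):
        if exists:
            # Spark raises its own exception type; this is a stand-in.
            raise FileExistsError(path)
        storage[path] = list(rows)
    elif mode == "ignore":
        # Write only if nothing is there; otherwise silently do nothing.
        if not exists:
            storage[path] = list(rows)
    elif mode == "overwrite":
        storage[path] = list(rows)
    elif mode == "append":
        storage.setdefault(path, []).extend(rows)
    else:
        raise ValueError("unknown save mode: " + mode)


storage = {"/tmp/existing": ["old"]}

# "ignore" leaves existing data untouched and writes to a free location.
save(storage, "/tmp/existing", ["new"], mode="ignore")
save(storage, "/tmp/fresh", ["new"], mode="ignore")

# The default mode refuses to write over existing data.
try:
    save(storage, "/tmp/existing", ["new"])
    raised = False
except FileExistsError:
    raised = True

print(storage, raised)
```

In PySpark, the mode is selected with DataFrameWriter.mode(), as in itemsDf.write.format("avro").mode("ignore").save(fileLocation).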

 

Question 41
Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?

  • A. 0
  • B. 1, 4, 6, 9
  • C. 1, 10
  • D. 1, 8
  • E. 7, 9, 10

Correct answer: D

Explanation:
1: Correct - This should just read "API" or "DataFrame API". The DataFrame is not part of the SQL API. To make a DataFrame accessible via SQL, you first need to create a DataFrame view. That view can then be accessed via SQL.
4: Although "K_38_INU" looks odd, it is a completely valid name for a DataFrame column.
6: No, StringType is a correct type.
7: Although a StringType may not be the most efficient way to store a phone number, there is nothing fundamentally wrong with using this type here.
8: Correct - TreeType is not a type that Spark supports.
9: No, Spark DataFrames support ArrayType variables. In this case, the variable would represent a sequence of elements with type LongType, which is also a valid type for Spark DataFrames.
10: There is nothing wrong with this row.
More info: Data Types - Spark 3.1.1 Documentation (https://bit.ly/3aAPKJT)

 

Question 42
......
