Latest July 2023 Databricks Associate-Developer-Apache-Spark question set, updated with 179 questions [Q40-Q57]


Valid Associate-Developer-Apache-Spark practice test questions, available as a free PDF download

Question # 40
Which of the following code blocks returns a new DataFrame with only columns predError and value of every second row of DataFrame transactionsDf?
Entire DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

  • A. transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")
  • B. transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")
  • C. transactionsDf.filter(col(transactionId).isin([3,4,6]))
  • D. transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value") (Correct)
  • E. transactionsDf.createOrReplaceTempView("transactionsDf")
    spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")
  • F. transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])

Correct answer: D

Explanation:
Output of correct code block:
+---------+-----+
|predError|value|
+---------+-----+
|        6|    7|
|     null| null|
|        3|    2|
+---------+-----+
This is not an easy question to solve. You need to know that % is the modulo operator in Python. The condition col("transactionId") % 2 == 0 is true for every second row. The statement using spark.sql gets it almost right (the modulo operator exists in SQL as well), but % 2 = 2 will never yield true, since a value modulo 2 is either 0 or 1.
The other answers are wrong since they are missing quotes around the column names and/or use filter or select incorrectly.
If you have any doubts about SparkSQL and answer options 3 and 4 in this question, check out the notebook I created as a response to a related student question.
Static notebook | Dynamic notebook: See test 1
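
To make the modulo logic concrete, here is a minimal plain-Python sketch (not PySpark) that mimics the filter-then-select step on the sample rows. The names rows, every_second, never_matches and string_modulo_error are illustrative assumptions; in Spark the same idea is expressed by the filter(col("transactionId") % 2 == 0) call from the correct answer.

```python
# Plain-Python stand-in for the sample data: (transactionId, predError, value).
# None plays the role of Spark's null.
rows = [
    (1, 3, 4),
    (2, 6, 7),
    (3, 3, None),
    (4, None, None),
    (5, None, None),
    (6, 3, 2),
]

# Keep rows whose transactionId is even, then keep only predError and value.
every_second = [(pred_error, value) for (tid, pred_error, value) in rows if tid % 2 == 0]

# "% 2 = 2" can never match, because n % 2 is always 0 or 1.
never_matches = [tid for (tid, _, _) in rows if tid % 2 == 2]

# Applying % to a plain string is string formatting, not arithmetic,
# which is why the column name has to be wrapped in col() in PySpark.
try:
    "transactionId" % 2
    string_modulo_error = None
except TypeError as exc:
    string_modulo_error = type(exc).__name__
```

The three pairs in every_second correspond to the three rows of the output table shown above.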


Question # 41
Which of the following code blocks reads all CSV files in directory filePath into a single DataFrame, with column names defined in the CSV file headers?
Content of directory filePath:
_SUCCESS
_committed_2754546451699747124
_started_2754546451699747124
part-00000-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-298-1-c000.csv.gz
part-00001-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-299-1-c000.csv.gz
part-00002-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-300-1-c000.csv.gz
part-00003-tid-2754546451699747124-10eb85bf-8d91-4dd0-b60b-2f3c02eeecaa-301-1-c000.csv.gz

  • spark.option("header",True).csv(filePath)

  • A. spark.read.format("csv").option("header",True).option("compression","zip").load(filePath)
  • B. spark.read().option("header",True).load(filePath)
  • C. spark.read.format("csv").option("header",True).load(filePath)
  • D. spark.read.load(filePath)

Correct answer: C

Explanation:
The files in directory filePath are partitions of a DataFrame that have been exported using gzip compression.
Spark automatically recognizes this situation and imports the CSV files as separate partitions into a single DataFrame. It is, however, necessary to specify that Spark should load the file headers in the CSV with the header option, which is set to False by default.
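
As a rough analogue outside Spark, the sketch below reads several gzip-compressed CSV part files from one directory and takes the column names from each file's header row. It uses only the Python standard library; the file names and contents are hypothetical, and it is meant to illustrate the role of the header row, not how Spark's reader is implemented.

```python
import csv
import gzip
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)

    # Marker file without data, like _SUCCESS in the listing above.
    (directory / "_SUCCESS").touch()

    # Two hypothetical gzip-compressed part files, each with its own header row.
    parts = {
        "part-00000.csv.gz": "transactionId,value\n1,4\n2,7\n",
        "part-00001.csv.gz": "transactionId,value\n6,2\n",
    }
    for name, content in parts.items():
        with gzip.open(directory / name, "wt", newline="") as handle:
            handle.write(content)

    # Read every part file; DictReader takes column names from the header row.
    records = []
    for path in sorted(directory.glob("*.csv.gz")):
        with gzip.open(path, "rt", newline="") as handle:
            records.extend(csv.DictReader(handle))
```

Without the header handling, the first line of every part file would be treated as an ordinary data row, which is the situation the header option avoids in Spark.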


Question # 42
Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

  • A. transactionsDf.drop(predError, value)
  • B. transactionsDf.drop(["predError", "value"])
  • C. transactionsDf.drop(col("predError"), col("value"))
  • D. transactionsDf.drop("predError", "value")
  • E. transactionsDf.drop("predError & value")

Correct answer: D

Explanation:
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2


Question # 43
Which of the following describes a shuffle?

  • A. A shuffle is a Spark operation that results from DataFrame.coalesce().
  • B. A shuffle is a process that compares data across executors.
  • C. A shuffle is a process that is executed during a broadcast hash join.
  • D. A shuffle is a process that compares data across partitions.
  • E. A shuffle is a process that allocates partitions to executors.

Correct answer: D

Explanation:
A shuffle is a Spark operation that results from DataFrame.coalesce().
No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors.
This is incorrect.
A shuffle is a process that is executed during a broadcast hash join.
No, broadcast hash joins avoid shuffles and yield performance benefits if at least one of the two tables is small in size (<= 10 MB by default). Broadcast hash joins can avoid shuffles because instead of exchanging partitions between executors, they broadcast a small table to all executors that then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors.
No, in a shuffle, data is compared across partitions, and not executors.
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)


Question # 44
The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.
Code block:
transactionsDf.agg("storeId").avg("value")

  • A. The avg("value") should be specified as a second argument to agg() instead of being appended to it.
  • B. Instead of avg("value"), avg(col("value")) should be used.
  • C. All column names should be wrapped in col() operators.
  • D. "storeId" and "value" should be swapped.
  • E. agg should be replaced by groupBy.

Correct answer: E

Explanation:
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/30.html ,
https://bit.ly/sparkpracticeexams_import_instructions)
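
The explanation above only links to notebooks, so here is a hedged plain-Python sketch of what a group-then-average computation does: rows are first grouped by storeId, and the average of value is taken within each group. The sample pairs are hypothetical, and the code is an emulation of the idea behind groupBy("storeId").avg("value"), not PySpark itself.

```python
from collections import defaultdict

# Hypothetical (storeId, value) pairs standing in for transactionsDf.
pairs = [(25, 4), (2, 7), (25, 2)]

# Step 1: group values by storeId (the job of groupBy).
groups = defaultdict(list)
for store_id, value in pairs:
    groups[store_id].append(value)

# Step 2: average within each group (the job of avg).
avg_by_store = {store_id: sum(values) / len(values) for store_id, values in groups.items()}
```

The grouping step is what the original code block is missing: agg() on its own has no notion of "per storeId".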


Question # 45
The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).
Code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5

spark.udf.register(pow_5, 'power_5_udf', T.LongType())
spark.sql('SELECT power_5_udf(value) FROM transactions')

  • A. The pow_5 method is unable to handle empty values in column value and the name of the column in the returned DataFrame is not result.
  • B. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and Spark driver does not call the UDF function appropriately.
  • C. The pow_5 method is unable to handle empty values in column value, the UDF function is not registered properly with the Spark driver, and the name of the column in the returned DataFrame is not result.
  • D. The returned DataFrame includes multiple columns instead of just one column.
  • E. The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the SparkSession cannot access the transactionsDf DataFrame.

Correct answer: B

Explanation:
Correct code block:
from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    if x:
        return x**5
    return x

spark.udf.register('power_5_udf', pow_5, T.LongType())
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')
Here it is important to understand how the pow_5 method handles empty values. In the wrong code block above, the pow_5 method is unable to handle empty values and will throw an error, since Python's ** operator cannot deal with any null value Spark passes into method pow_5.
The order of arguments for registering the UDF function with Spark via spark.udf.register matters. In the code snippet in the question, the arguments for the SQL method name and the actual Python function are switched. You can read more about the arguments of spark.udf.register and see some examples of its usage in the documentation (link below).
Finally, you should recognize that the original code block is missing an expression that renames the column created by the UDF. The renaming is done by SQL's AS result clause.
If you omit that clause, you end up with the column name power_5_udf(value) and not result.
More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
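
The null-handling point can be checked in plain Python, without Spark. In this sketch, None stands in for the null value that reaches the Python function; pow_5_unsafe is a hypothetical name for the version from the question, and the null-safe variant uses an explicit "is None" check, which is a small deviation from the "if x:" test in the correct code block above.

```python
def pow_5_unsafe(x):
    # Version from the question: fails when x is None.
    return x ** 5


def pow_5(x):
    # Null-safe variant: pass None through, otherwise raise to the 5th power.
    if x is None:
        return None
    return x ** 5


# None stands in for the null that is passed to the Python function.
try:
    pow_5_unsafe(None)
    unsafe_error = None
except TypeError as exc:
    unsafe_error = type(exc).__name__

results = [pow_5(v) for v in [4, 7, None, 2]]
```

The unsafe version raises a TypeError on None, while the null-safe version returns None for that row and the fifth power for the others.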


Question # 46
Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?

  • A. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
    dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
  • B. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
    dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))
  • C. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
  • D. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
    dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))
  • E. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
    dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))

Correct answer: B

Explanation:
This question is tricky. Two things are important to know here:
First, the syntax for createDataFrame: Here you need a list of tuples, like so: [(1,), (2,)]. To define a tuple in Python, if you just have a single item in it, it is important to put a comma after the item so that Python interprets it as a tuple and not just a normal parenthesis.
Second, you should understand the to_timestamp syntax. You can find out more about it in the documentation linked below.
For good measure, let's examine in detail why the incorrect options are wrong:
dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
This code snippet does everything the question asks for, except that the data type of the date column is a string and not a timestamp. When no schema is specified, Spark infers the column type from the data, and the values here are strings.

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))
In the first line of this code, Spark throws the following error: TypeError: Can not infer schema for type: <class 'str'>. This is because Spark expects to find row information, but instead finds strings. This is why you need to specify the data as tuples. Fortunately, the Spark documentation (linked below) shows a number of examples for creating DataFrames that should help you get on the right track here.

dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))
The issue with this answer is that the operator withColumnRenamed is used. This operator simply renames a column, but it has no power to modify its actual content. This is why withColumn should be used instead. In addition, the date format yyyy-MM-dd HH:mm:ss does not reflect the format of the actual timestamp: "23/01/2022 11:28:12".

dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))
Here, withColumnRenamed is used instead of withColumn (see above). In addition, the rows are not expressed correctly: they should be written as tuples, using parentheses and a trailing comma. Finally, even the date format is off here (see above).

More info: pyspark.sql.functions.to_timestamp - PySpark 3.1.2 documentation and pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 2
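
Both points from this explanation, the single-item tuple and the need for a pattern that matches the string, can be tried in plain Python. This is only an analogue: Python's datetime.strptime uses %-style codes rather than Spark's dd/MM/yyyy letters, and the variable names are illustrative assumptions.

```python
from datetime import datetime

# Parentheses alone do not make a tuple; the trailing comma does.
not_a_tuple = ("23/01/2022 11:28:12")
one_tuple = ("23/01/2022 11:28:12",)

# The parse pattern has to match the layout of the string. Python's strptime
# uses %-codes instead of Spark's dd/MM/yyyy letters, but the idea is the same.
parsed = datetime.strptime(one_tuple[0], "%d/%m/%Y %H:%M:%S")

# A year-first pattern does not describe a day-first string.
try:
    datetime.strptime(one_tuple[0], "%Y-%m-%d %H:%M:%S")
    mismatch_error = None
except ValueError as exc:
    mismatch_error = type(exc).__name__
```

Here not_a_tuple is just a string, while one_tuple is a tuple with one element, which is the row shape the explanation asks for.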


Question # 47
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

  • A. spark.read.path(filePath, source="json")
  • B. spark.read().path(filePath)
  • C. spark.read().json(filePath)
  • D. spark.read.json(filePath)
  • E. spark.read.path(filePath)

Correct answer: D

Explanation:
spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Then, Spark identifies the file type to be read as JSON type by passing filePath into the DataFrameReader.json() method.
spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).
spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).
spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.
spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).
More info: pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3


Question # 48
Which of the following statements about Spark's DataFrames is incorrect?

  • A. Data in DataFrames is organized into named columns.
  • B. The data in DataFrames may be split into multiple chunks.
  • C. RDDs are at the core of DataFrames.
  • D. Spark's DataFrames are equal to Python's DataFrames.
  • E. Spark's DataFrames are immutable.

Correct answer: D

Explanation:
Spark's DataFrames are equal to Python's or R's DataFrames.
No, they are not equal. They are only similar. A major difference is that Spark's DataFrames are distributed, whereas Python's DataFrames (pandas, for example) are not.


Question # 49
Which of the following code blocks generally causes a great amount of network traffic?

  • A. DataFrame.select()
  • B. DataFrame.count()
  • C. DataFrame.rdd.map()
  • D. DataFrame.collect()
  • E. DataFrame.coalesce()

Correct answer: D

Explanation:
DataFrame.collect() sends all data in a DataFrame from executors to the driver, so this generally causes a great amount of network traffic in comparison to the other options listed.
DataFrame.coalesce() just reduces the number of partitions and generally aims to reduce network traffic in comparison to a full shuffle.
DataFrame.select() is evaluated lazily and, unless followed by an action, does not cause significant network traffic.
DataFrame.rdd.map() is evaluated lazily, so on its own it does not cause great amounts of network traffic.
DataFrame.count() is an action. While it does cause some network traffic, for the same DataFrame, collecting all data in the driver would generally be considered to cause a greater amount of network traffic.


Question # 50
Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

  • A. spark.read.parquet("/FileStore/imports.parquet")
  • B. spark.read().parquet("/FileStore/imports.parquet")
  • C. spark.read().format('parquet').open("/FileStore/imports.parquet")
  • D. spark.mode("parquet").read("/FileStore/imports.parquet")
  • E. spark.read.path("/FileStore/imports.parquet", source="parquet")

Correct answer: A

Explanation:
Static notebook | Dynamic notebook: See test 1
(https://flrs.github.io/spark_practice_tests_code/#1/23.html ,
https://bit.ly/sparkpracticeexams_import_instructions)


Question # 51
The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.
__1__.__2__(__3__, __4__, __5__)

  • A. 1. itemsDf
    2. broadcast
    3. transactionsDf
    4. "transactionId"
    5. "left_semi"
  • B. 1. transactionsDf
    2. join
    3. itemsDf
    4. transactionsDf.transactionId==itemsDf.transactionId
    5. "anti"
  • C. 1. transactionsDf
    2. join
    3. broadcast(itemsDf)
    4. "transactionId"
    5. "left_semi"
  • D. 1. transactionsDf
    2. join
    3. broadcast(itemsDf)
    4. transactionsDf.transactionId==itemsDf.transactionId
    5. "outer"
  • E. 1. itemsDf
    2. join
    3. broadcast(transactionsDf)
    4. "transactionId"
    5. "left_semi"

Correct answer: C

Explanation:
Correct code block:
transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")
This question is extremely difficult and exceeds the difficulty of questions in the exam by far.
A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf - the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.
When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.
Both remaining answer options resolve to transactionsDf.join([...]) in the first 2 gaps, so you now have to figure out the details of the join. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
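
The semantics of a left semi join can be sketched in plain Python. The rows below are hypothetical, and turning the small side into a lookup set is only loosely analogous to broadcasting it; the sketch is meant to show which rows and columns survive, not how Spark executes the join.

```python
# Hypothetical rows standing in for the two DataFrames.
transactions = [
    {"transactionId": 1, "value": 4},
    {"transactionId": 2, "value": 7},
    {"transactionId": 3, "value": None},
]
items = [
    {"transactionId": 2, "itemName": "Outdoors Backpack"},
    {"transactionId": 3, "itemName": "Thick Coat"},
    {"transactionId": 3, "itemName": "Elegant Outdoors"},
]

# The small side is reduced to a lookup set, similar in spirit to shipping
# the small table to every executor so each one can match locally.
item_keys = {row["transactionId"] for row in items}

# Left semi join: keep each left row at most once if a match exists,
# and never add columns from the right side.
left_semi = [row for row in transactions if row["transactionId"] in item_keys]
```

Note that transactionId 3 appears only once in the result even though it matches two item rows, and that itemName never shows up in the output.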


Question # 52
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
+------+--------------------+--------------------+-------------------+
|itemId|            itemName|          attributes|           supplier|
+------+--------------------+--------------------+-------------------+
|     1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|
|     2|Elegant Outdoors ...|[red, summer, fre...|              YetiX|
|     3|   Outdoors Backpack|[green, summer, t...|Sports Company Inc.|
+------+--------------------+--------------------+-------------------+

  • A. itemsDf.select(~col('supplier').contains('X')).distinct()
  • B. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()
  • C. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()
  • D. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()
  • E. itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()

Correct answer: D

Explanation:
Output of correct code block:
+-------------------+
|           supplier|
+-------------------+
|Sports Company Inc.|
+-------------------+
The key to this question is understanding which operator negates a condition: the ~ (not) operator. In addition, you should know that DataFrames have no unique() method.
Static notebook | Dynamic notebook: See test 1
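
Why ~ works where Python's not keyword does not can be illustrated with a toy class. Condition and contains below are hypothetical stand-ins written for this sketch, not PySpark's Column implementation; the sketch assumes only that a lazily evaluated expression can overload ~ but cannot be collapsed into a single True or False.

```python
class Condition:
    """Toy stand-in for a column expression; not PySpark's implementation."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __invert__(self):
        # "~condition" builds a new, negated condition.
        return Condition(lambda value: not self.predicate(value))

    def __bool__(self):
        # "not condition" needs a plain bool, which a lazy expression cannot give.
        raise ValueError("cannot convert a condition into a bool")


def contains(letter):
    return Condition(lambda value: letter in value)


suppliers = ["Sports Company Inc.", "YetiX", "Sports Company Inc."]

no_x = ~contains("X")
# Filter, then remove duplicates (the role distinct() plays in the answer).
distinct_without_x = list(dict.fromkeys(s for s in suppliers if no_x.predicate(s)))

try:
    not contains("X")
    not_error = None
except ValueError as exc:
    not_error = type(exc).__name__
```

The filtered, deduplicated list contains a single supplier, matching the one-row output table above.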


Question # 53
Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

  • A. cache(itemsDf)
  • B. itemsDf.cache().count()
  • C. itemsDf.rdd.storeCopy()
  • D. itemsDf.cache(eager=True)
  • E. itemsDf.cache().filter()

Correct answer: B

Explanation:
Caching means storing a copy of a partition on an executor, so it can be accessed quicker by subsequent operations, instead of having to be recalculated. cache() is a lazily-evaluated method of the DataFrame. Since count() is an action (while filter() is not), it triggers the caching process.
More info: pyspark.sql.DataFrame.cache - PySpark 3.1.2 documentation, Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 2


Question # 54
Which of the following statements about RDDs is incorrect?

  • A. RDDs are great for precisely instructing Spark on how to do a query.
  • B. RDDs are immutable.
  • C. RDD stands for Resilient Distributed Dataset.
  • D. The high-level DataFrame API is built on top of the low-level RDD API.
  • E. An RDD consists of a single partition.

Correct answer: E

Explanation:
An RDD consists of a single partition.
Quite the opposite: Spark partitions RDDs and distributes the partitions across multiple nodes.


Question # 55
The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.
Code block:
transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_split")

  • A. The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet").
  • B. Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.
  • C. The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.
  • D. partitionOn("storeId") should be called before the write operation.
  • E. Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.

Correct answer: B

Explanation:
Correct code block:
transactionsDf.write.format("parquet").partitionBy("storeId").save("/FileStore/transactions_split")
More info: partition by - Reading files which are written using PartitionBy or BucketBy in Spark - Stack Overflow
Static notebook | Dynamic notebook: See test 1


Question # 56
The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error.
Code block:
transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

  • A. Column predError should be sorted by desc_nulls_first() instead.
  • B. Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
  • C. Column value should be wrapped by the col() operator.
  • D. Instead of orderBy, sort should be used.
  • E. Column predError should be sorted in a descending way, putting nulls last.

Correct answer: E

Explanation:
Correct code block:
transactionsDf.orderBy('value', desc_nulls_last('predError'))
Column predError should be sorted in a descending way, putting nulls last.
Correct! By default, Spark sorts ascending, putting nulls first. So, the inverse sort of the default sort is indeed desc_nulls_last.
Instead of orderBy, sort should be used.
No. In the DataFrame API, orderBy() is an alias of sort(), so swapping one for the other changes nothing. Sorting per partition without a global order is what sortWithinPartitions() does.
Column value should be wrapped by the col() operator.
Incorrect. DataFrame.orderBy() accepts both strings and Column objects.
Column predError should be sorted by desc_nulls_first() instead.
Wrong. Since Spark's default sort order matches asc_nulls_first(), nulls would have to come last when inverted.
Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.
No, this would just sort the DataFrame by the very last column, but would not take information from both columns into account, as noted in the question.
More info: pyspark.sql.DataFrame.orderBy - PySpark 3.1.2 documentation, pyspark.sql.functions.desc_nulls_last - PySpark 3.1.2 documentation, sort() vs orderBy() in Spark | Towards Data Science
Static notebook | Dynamic notebook: See test 3
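
The two-level ordering asked for in this question (value ascending with nulls first, then predError descending with nulls last) can be emulated in plain Python with a composite sort key. The rows and the sort_key helper are hypothetical, and None stands in for Spark's null; this is a sketch of the ordering rule, not of Spark's sorting machinery.

```python
# Hypothetical rows; None stands in for Spark's null.
rows = [
    {"transactionId": 1, "predError": 3, "value": 4},
    {"transactionId": 2, "predError": 6, "value": 4},
    {"transactionId": 3, "predError": 3, "value": None},
    {"transactionId": 4, "predError": None, "value": None},
    {"transactionId": 5, "predError": None, "value": 2},
    {"transactionId": 6, "predError": 3, "value": 2},
]


def sort_key(row):
    value, pred_error = row["value"], row["predError"]
    # value: ascending, nulls first.
    value_key = (0, 0) if value is None else (1, value)
    # predError: descending, nulls last.
    pred_key = (1, 0) if pred_error is None else (0, -pred_error)
    return (value_key, pred_key)


ordered_ids = [row["transactionId"] for row in sorted(rows, key=sort_key)]
```

Within each group of equal value, the larger predError comes first and a missing predError comes last, which mirrors what desc_nulls_last expresses in the correct code block.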


Question # 57
......
