Uploaded on Jan 8, 2026
The Databricks Certified Professional Data Engineer credential validates advanced expertise in designing, building, and optimizing scalable data pipelines using Databricks and Apache Spark. It demonstrates proficiency in data ingestion, transformation, ETL/ELT workflows, performance tuning, Delta Lake, data modeling, security, and production-grade data engineering best practices on cloud platforms.
Databricks Certified Professional Data Engineer
Databricks
Databricks-Certified-Professional-Data-Engineer
Exam Name: Databricks Certified Data Engineer Professional Exam
Exam Version: 12.6
Questions & Answers Sample PDF
(Preview content before you buy)
Check the full version using the link below.
https://pass2certify.com/exam/databricks-certified-professional-data-engineer
Unlock Full Features:
Stay Updated: 90 days of free exam updates
Zero Risk: 30-day money-back policy
Instant Access: Download right after purchase
Always Here: 24/7 customer support team
Question 1. (Single Select)
An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs
API as a parameter. The notebook to be scheduled will use this parameter to load data with the following
code:
df = spark.read.format("parquet").load(f"/mnt/source/{date}")
Which code block should be used to create the date Python variable used in the above code block?
A: date = spark.conf.get("date")
B: input_dict = input()
C: import sys; date = sys.argv[1]
D: date = dbutils.notebooks.getParam("date")
E: dbutils.widgets.text("date", "null"); date = dbutils.widgets.get("date")
Answer: E
Explanation:
The code block that should be used to create the date Python variable used in the above code block is:
dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")
This code block uses the dbutils.widgets API to create and read a text widget named “date” that can accept
a string value as a parameter. The default value of the widget is “null”, which means that if no parameter
is passed, the date variable will be “null”. However, if a parameter is passed through the Databricks Jobs
API, the date variable will be assigned the value of the parameter. For example, if the parameter is
“2021-11-01”, the date variable will be “2021-11-01”. This way, the notebook can use the date variable to
load data from the specified path.
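Putting the pieces together, a minimal sketch of the full notebook cell (using the widget name "date" from the question and the /mnt/source path from the code above) could look like this:

# Register a text widget so the notebook accepts a "date" parameter from the
# Jobs API; "null" is only the default used when no parameter is supplied.
dbutils.widgets.text("date", "null")

# Read the widget value into a Python variable.
date = dbutils.widgets.get("date")

# Load the Parquet batch for that date from the mounted source path.
df = spark.read.format("parquet").load(f"/mnt/source/{date}")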
The other options are not correct, because:
Option A is incorrect because spark.conf.get(“date”) is not a valid way to get a parameter passed through
the Databricks Jobs API. The spark.conf API is used to get or set Spark configuration properties, not
notebook parameters.
Option B is incorrect because input() is not a valid way to get a parameter passed through the Databricks
Jobs API. The input() function is used to get user input from the standard input stream, not from the API
request.
Option C is incorrect because sys.argv[1] is not a valid way to get a parameter passed through the
Databricks Jobs API. The sys.argv list is used to get the command-line arguments passed to a Python
script, not to a notebook.
Option D is incorrect because dbutils.notebooks.getParam(“date”) is not a valid way to get a parameter
passed through the Databricks Jobs API. The dbutils.notebooks API is used to get or set notebook
parameters when running a notebook as a job or as a subnotebook, not when passing parameters through
the API.
Question 2. (Multi Select)
A data team is automating a daily multi-task ETL pipeline in Databricks. The pipeline includes a notebook
for ingesting raw data, a Python wheel task for data transformation, and a SQL query to update
aggregates. They want to trigger the pipeline programmatically and see previous runs in the GUI. They
need to ensure tasks are retried on failure and stakeholders are notified by email if any task fails.
Which two approaches will meet these requirements? (Choose 2 answers)
A: Use the REST API endpoint /jobs/runs/submit to trigger each task individually as separate job runs and
implement retries using custom logic in the orchestrator.
B: Create a multi-task job using the UI, Databricks Asset Bundles (DABs), or the Jobs REST API
(/jobs/create) with notebook, Python wheel, and SQL tasks. Configure task-level retries and email
notifications in the job definition.
C: Trigger the job programmatically using the Databricks Jobs REST API (/jobs/run-now), the CLI
(databricks jobs run-now), or one of the Databricks SDKs.
D: Create a single orchestrator notebook that calls each step with dbutils.notebook.run(), defining a job for
that notebook and configuring retries and notifications at the notebook level.
E: Use Databricks Asset Bundles (DABs) to deploy the workflow, then trigger individual tasks directly by
referencing each task’s notebook or script path in the workspace.
Answer: B, C
Explanation:
Comprehensive and detailed explanation, drawn from the Databricks data engineering documentation:
Databricks Jobs supports defining multi-task workflows that include notebooks, SQL statements, and
Python wheel tasks. These can be configured with retry policies, dependency chains, and failure
notifications. The correct practice, as stated in the documentation, is to use the Jobs REST API
(/jobs/create) or Databricks Asset Bundles to define multi-task jobs, and then trigger them programmatically
using /jobs/run-now, CLI, or SDK. This allows the team to maintain full job history, handle retries
automatically, and receive alerts via configured email notifications. Using /jobs/runs/submit creates one-off
ad hoc runs without maintaining dependency visibility. Therefore, options B and C together satisfy the
operational, automation, and governance requirements.
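As a minimal sketch of the programmatic trigger in option C (the workspace URL, token, and job ID below are placeholders, and the job itself is assumed to already be defined with retries and email notifications as described in option B):

import requests

# Placeholders: substitute your workspace URL, access token, and job ID.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 123

# Trigger an existing multi-task job; retries, dependencies, and email
# notifications come from the job definition, not from this call.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
resp.raise_for_status()
print(resp.json()["run_id"])  # the run also appears in the Jobs UI run history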
Question 3. (Single Select)
The data science team has created and logged a production model using MLflow. The following code
correctly imports and applies the production model to output the predictions as a new DataFrame named
preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".
The data science team would like predictions saved to a Delta Lake table with the ability to compare all
predictions across time. Churn predictions will be made at most once per day. Which code block
accomplishes this task while minimizing potential compute costs?
A) preds.write.mode("append").saveAsTable("churn_preds")
B) preds.write.format("delta").save("/preds/churn_preds")
C) (option code not included in this preview)
D) (option code not included in this preview)
E) (option code not included in this preview)
A: Option A
B: Option B
C: Option C
D: Option D
E: Option E
Answer: A
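For context on the correct answer, the following is a minimal sketch of the append pattern from option A (assuming preds already holds the current day's predictions); on Databricks, saveAsTable writes a Delta table by default, so each daily append preserves the full history of predictions:

# Append today's predictions to the managed table; prior days' rows are kept,
# so predictions can be compared across time without rewriting the table.
preds.write.mode("append").saveAsTable("churn_preds")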
Question 4. (Single Select)
An upstream source writes Parquet data as hourly batches to directories named with the current date. A
nightly batch job runs the following code to ingest all data from the previous day as indicated by the date
variable:
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart,
which statement is correct?
A: Each write to the orders table will only contain unique records, and only those records without duplicates
in the target table will be written.
B: Each write to the orders table will only contain unique records, but newly written records may have
duplicates already present in the target table.
C: Each write to the orders table will only contain unique records; if existing records with the same key are
present in the target table, these records will be overwritten.
D: Each write to the orders table will only contain unique records; if existing records with the same key are
present in the target table, the operation will fail.
E: Each write to the orders table will run deduplication over the union of new and existing records, ensuring
no duplicate records are present.
Answer: B
Explanation:
This is the correct answer because the code uses the dropDuplicates method to remove any duplicate
records within each batch of data before writing to the orders table. However, this method does not check
for duplicates across different batches or in the target table, so it is possible that newly written records may
have duplicates already present in the target table. To avoid this, a better approach would be to use Delta
Lake and perform an upsert operation with MERGE INTO. Verified [Databricks Certified Data Engineer
Professional], under “Delta Lake” section; Databricks Documentation, under “dropDuplicates” section.
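As a minimal sketch of that merge-based alternative (the orders table name and the composite key come from the question; the DataFrame name new_orders_df is an assumption for illustration):

from delta.tables import DeltaTable

# Deduplicate within the incoming batch on the composite key.
new_orders_df = new_orders_df.dropDuplicates(["customer_id", "order_id"])

# Insert only records whose key is not already present in the target table.
orders_table = DeltaTable.forName(spark, "orders")
(orders_table.alias("t")
    .merge(new_orders_df.alias("s"),
           "t.customer_id = s.customer_id AND t.order_id = s.order_id")
    .whenNotMatchedInsertAll()
    .execute())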
Question 5. (Single Select)
A junior member of the data engineering team is exploring the language interoperability of Databricks
notebooks. The intended outcome of the below code is to register a view of all sales that occurred in
countries on the continent of Africa that appear in the geo_lookup table.
Before executing the code, running SHOW TABLES on the current database indicates the database
contains only two tables: geo_lookup and sales.
Which statement correctly describes the outcome of executing these command cells in order in an
interactive notebook?
A: Both commands will succeed. Executing SHOW TABLES will show that countries_af and sales_af have been
registered as views.
B: Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if
this entity exists, Cmd 2 will succeed.
C: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark
DataFrame.
D: Both commands will fail. No new variables, tables, or views will be created.
E: Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
Answer: E
Explanation:
This is the correct answer because Cmd 1 is written in Python and uses a list comprehension to extract the
country names from the geo_lookup table and store them in a Python variable named countries_af. This
variable will contain a list of strings, not a PySpark DataFrame or a SQL view. Cmd 2 is written in SQL and
tries to create a view named sales_af by selecting from the sales table where city is in countries_af.
However, this command will fail because countries_af is not a valid SQL entity and cannot be used in a SQL
query. To fix this, a better approach would be to use spark.sql() to execute a SQL query in Python and pass
the countries_af variable as a parameter. Verified [Databricks Certified Data Engineer Professional], under
“Language Interoperability” section; Databricks Documentation, under “Mix languages” section.
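As a minimal sketch of that fix (the geo_lookup and sales table names come from the question; the column names country, continent, and city are assumptions for illustration):

# Cmd 1 (Python): collect the African country names into a Python list of strings.
countries_af = [row.country for row in
                spark.sql("SELECT country FROM geo_lookup WHERE continent = 'AF'").collect()]

# A Python list cannot be referenced from a %sql cell, so create the view from
# Python instead, interpolating the list values into the SQL statement.
in_list = ", ".join(f"'{c}'" for c in countries_af)
spark.sql(f"""
    CREATE OR REPLACE TEMP VIEW sales_af AS
    SELECT * FROM sales WHERE city IN ({in_list})
""")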
Need more info? Check the link below:
https://pass2certify.com/exam/databricks-certified-professional-data-engineer
Thanks for Being a Valued Pass2Certify User!
Guaranteed Success: Pass Every Exam with Pass2Certify.
Save $15 instantly with promo code
SAVEFAST
Sales: [email protected]
Support: [email protected]