The Professional Data Engineer exam was long considered very difficult because of its extensive course outline, but it is far less daunting if you prepare with the Professional Data Engineer exam dumps PDF, the most reliable preparation material available. You can ace your exam with the help of this study material. Realexamcollection has earned a strong reputation among students as a provider of exam study material, and its dumps give you an idea of the actual format of the exam.
2020 Professional Data Engineer Real Exam - Pass Professional Data Engineer Exam - Realexamcollection
Google
PROFESSIONAL-DATA-ENGINEER Dumps PDF
Professional-Data-Engineer
https://www.realexamcollection.com/google/professional-data-engineer-dumps.html
Version: Demo
[Total Questions: 10]
Topic 3, MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can
create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome
communications challenges in space. Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their
topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship
between data consumers and providers in their system. After careful consideration, they decided the public cloud is
the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
Scale and harden their PoC to support significantly more data flows generated when they ramp to more
than 50,000 installations.
Refine their machine-learning cycles to verify and improve the dynamic models they use to control
topology definition.
MJTelco will also use three separate operating environments – development/test, staging, and production – to
meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
Scale up their production environment with minimal cost, instantiating resources when and where
needed in an unpredictable, distributed telecom user community.
Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
Provide reliable and timely access to data for analysis from distributed research workers
Maintain isolated environments that support rapid iteration of their machine-learning models without
affecting their customers.
Technical Requirements
Ensure secure and efficient transport and storage of telemetry data
Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
Allow analysis and presentation against data tables tracking up to 2 years of data, storing approximately 100 million
records/day
Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in
telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware
is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to
work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also,
we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and
infrastructure. Google Cloud’s machine learning will allow our quantitative researchers to work on our
high-value problems instead of problems with our data pipelines.
Question #:1 - (Exam Topic 3)
MJTelco needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last
2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the
device and a data record. The most common query is for all the data for a given device for a given day. Which
schema should you use?
A. Rowkey: date#device_id; Column data: data_point
B. Rowkey: date; Column data: device_id, data_point
C. Rowkey: device_id; Column data: date, data_point
D. Rowkey: data_point; Column data: device_id, date
E. Rowkey: date#data_point; Column data: device_id
Answer: A
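With the date#device_id key from option A, the most common query becomes a single-row lookup. Below is a minimal sketch of writing and reading with that key using the google-cloud-bigtable Python client; the project, instance, table, and column-family names are hypothetical, not part of the question.

from datetime import datetime, timezone

from google.cloud import bigtable

# Hypothetical project, instance, table, and column-family names.
client = bigtable.Client(project="my-project")
table = client.instance("telemetry-instance").table("device_records")

def write_record(device_id: str, data_point: bytes) -> None:
    # Key by date#device_id; the 15-minute records for the same device and
    # day land in one row as additional timestamped cell versions.
    day = datetime.now(timezone.utc).strftime("%Y%m%d")
    row = table.direct_row(f"{day}#{device_id}".encode())
    row.set_cell("data", b"data_point", data_point)
    row.commit()

def read_device_day(device_id: str, day: str):
    # All data for a given device on a given day is a point read on the key.
    return table.read_row(f"{day}#{device_id}".encode())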
Question #:2 - (Exam Topic 3)
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of
Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data
table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing
fine-grained analysis of each day’s events. They also want to use streaming ingestion. What should you do?
A. Create a table called tracking_table and include a DATE column.
B. Create a partitioned table called tracking_table and include a TIMESTAMP column.
C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
D. Create a table called tracking_table with a TIMESTAMP column to represent the day.
Answer: B
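For reference, a partitioned tracking_table as described in option B can be created with the BigQuery Python client roughly as follows; this is a sketch, and the project, dataset, and field names are assumptions.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical project, dataset, and field names.
schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("device_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]
table = bigquery.Table("my-project.tracking.tracking_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",  # partition on the TIMESTAMP column, not ingestion time
)
client.create_table(table)

A daily query that filters on DATE(event_ts) then scans only the matching partition, which is what minimizes the cost of fine-grained per-day analysis, and streaming inserts route to the partition matching each row's event_ts.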
Topic 4, Main Questions Set B
Question #:3 - (Exam Topic 4)
You work for an economic consulting firm that helps companies identify economic trends as they happen. As
part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100
most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are
updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other
data in BigQuery as cheaply as possible. What should you do?
A. Load the data every 30 minutes into a new partitioned table in BigQuery.
B. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source
in BigQuery
C. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine
the data programmatically with the data stored in Cloud Datastore
D. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query
BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
Answer: B
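A sketch of the federated (external) data source from option B, using the BigQuery Python client; the bucket, dataset, and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical bucket, dataset, and table names.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/average_prices.csv"]
external_config.autodetect = True

table = bigquery.Table("my-project.economics.average_prices")
table.external_data_configuration = external_config
client.create_table(table)

Because queries read the file in Cloud Storage directly, overwriting average_prices.csv every 30 minutes is all it takes to keep results current; there are no load jobs to schedule or pay for.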
Question #:4 - (Exam Topic 4)
You are designing the database schema for a machine learning-based food ordering service that will predict
what users want to eat. Here is some of the information you need to store:
The user profile: What the user likes and doesn’t like to eat
The user account information: Name, address, preferred meal times
The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data
schema. Which Google Cloud Platform product should you use?
A. BigQuery
B. Cloud SQL
C. Cloud Bigtable
D. Cloud Datastore
Answer: B
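A minimal sketch of a normalized schema for this service on Cloud SQL for PostgreSQL, created through the Cloud SQL Python Connector; the instance name, credentials, and exact table layout are illustrative assumptions, not a schema prescribed by the exam.

from google.cloud.sql.connector import Connector

# Illustrative normalized schema; a relational database handles the
# transactional reads and writes and enforces the relationships.
DDL = [
    """CREATE TABLE users (
        user_id    SERIAL PRIMARY KEY,
        name       TEXT NOT NULL,
        address    TEXT,
        meal_times TEXT)""",
    """CREATE TABLE preferences (
        user_id INT REFERENCES users(user_id),
        food    TEXT NOT NULL,
        liked   BOOLEAN NOT NULL)""",
    """CREATE TABLE orders (
        order_id  SERIAL PRIMARY KEY,
        user_id   INT REFERENCES users(user_id),
        placed_at TIMESTAMPTZ NOT NULL,
        origin    TEXT,
        recipient TEXT)""",
]

# Hypothetical instance and credentials.
connector = Connector()
conn = connector.connect(
    "my-project:us-central1:food-orders", "pg8000",
    user="app", password="change-me", db="orders",
)
cursor = conn.cursor()
for statement in DDL:
    cursor.execute(statement)
conn.commit()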
Topic 6, Main Questions Set C
Question #:5 - (Exam Topic 6)
You need to create a data pipeline that copies time-series transaction data so that it can be queried from within
BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a
new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily
structured, and your data science team will build machine learning models based on this data. You want to
maximize performance and usability for your data science team. Which two strategies should you adopt?
Choose 2 answers.
A. Denormalize the data as much as possible.
B. Preserve the structure of the data as much as possible.
C. Use BigQuery UPDATE to further reduce the size of the dataset.
D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
E. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery’s
support for external data sources to query.
Answer: D, E
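With the append-only design of option D, the current state of each transaction is recovered at query time rather than by running UPDATE statements. A hedged sketch, assuming hypothetical transaction_id and status_ts columns and a hypothetical table name:

from google.cloud import bigquery

client = bigquery.Client()

# Status changes are appended as new rows; a window function picks the
# latest row per transaction. Column and table names are hypothetical.
LATEST_STATUS = """
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY transaction_id
      ORDER BY status_ts DESC) AS rn
  FROM `my-project.finance.transactions`
)
WHERE rn = 1
"""

for row in client.query(LATEST_STATUS).result():
    ...  # each row is the latest version of one transaction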
Question #:6 - (Exam Topic 6)
A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions.
You have a REST API application with the requirement to serve predictions for an individual user ID with
latency under 100 milliseconds. You use the following query to generate predictions: SELECT
predicted_label, user_id FROM ML.PREDICT(MODEL ‘dataset.model’, TABLE user_features). How should
you create the ML pipeline?
A. Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service
account.
B. Create an Authorized View with the provided query. Share the dataset that contains the view with the
application service account.
C. Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow
Worker role to the application service account.
D. Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query.
Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application
service account so that the application can read predictions for individual users from Cloud Bigtable.
Answer: D
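A sketch of the pipeline from option D using the Apache Beam Python SDK. The query is the one given in the question; the project, instance, table, and column-family names are hypothetical.

import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable.row import DirectRow

PREDICT_QUERY = """
SELECT predicted_label, user_id
FROM ML.PREDICT(MODEL `dataset.model`, TABLE dataset.user_features)
"""

def to_bigtable_row(record):
    # Key rows by user_id so the REST application can serve one user's
    # prediction with a low-latency point read.
    row = DirectRow(row_key=str(record["user_id"]).encode())
    row.set_cell("serving", b"predicted_label",
                 str(record["predicted_label"]).encode())
    return row

# Hypothetical project, instance, and table names.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Predict" >> beam.io.ReadFromBigQuery(
            query=PREDICT_QUERY, use_standard_sql=True)
        | "ToRows" >> beam.Map(to_bigtable_row)
        | "Write" >> WriteToBigTable(
            project_id="my-project",
            instance_id="serving-instance",
            table_id="predictions")
    )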
Topic 5, Practice Questions
Question #:7 - (Exam Topic 5)
Cloud Dataproc charges you only for what you really use with _____ billing.
A. month-by-month
B. minute-by-minute
C. week-by-week
D. hour-by-hour
Answer: B
Explanation
One of the advantages of Cloud Dataproc is its low cost. Dataproc charges for what you really use with
minute-by-minute billing and a low, ten-minute-minimum billing period.
Reference: https://cloud.google.com/dataproc/docs/concepts/overview
Question #:8 - (Exam Topic 5)
The Dataflow SDKs have been recently transitioned into which Apache service?
A. Apache Spark
B. Apache Hadoop
C. Apache Kafka
D. Apache Beam
Answer: D
Explanation
The Dataflow SDKs are being transitioned to Apache Beam, as per the latest Google directive.
Reference: https://cloud.google.com/dataflow/docs/
Topic 1, Main Questions Set A
Question #:9 - (Exam Topic 1)
Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered
by Google App Engine and serves millions of users. How should you design the frontend to respond to a
database failure?
A. Issue a command to restart the database servers.
B. Retry the query with exponential backoff, up to a cap of 15 minutes.
C. Retry the query every second until it comes back online to minimize staleness of data.
D. Reduce the query frequency to once every hour until the database comes back online.
Answer: B
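A minimal sketch of the retry loop in option B, with jitter added. The 15-minute cap comes from the question; run_query stands in for whatever database call the frontend makes, and everything else is illustrative.

import random
import time

MAX_BACKOFF_SECONDS = 15 * 60  # the 15-minute cap from option B

def query_with_backoff(run_query, max_attempts=20):
    # Retry run_query with exponential backoff, doubling the delay after
    # each failure up to the cap, and sleeping a random fraction of the
    # delay (full jitter) so millions of clients do not retry in lockstep.
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return run_query()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, MAX_BACKOFF_SECONDS)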
Question #:10 - (Exam Topic 1)
Your company’s customer and order databases are often under heavy load. This makes performing analytics
against them difficult without harming operations. The databases are in a MySQL cluster, with nightly
backups taken using mysqldump. You want to perform analytics with minimal impact on operations. What
should you do?
A. Add a node to the MySQL cluster and build an OLAP cube there.
B. Use an ETL tool to load the data from MySQL into Google BigQuery.
C. Connect an on-premises Apache Hadoop cluster to MySQL and perform ETL.
D. Mount the backups to Google Cloud SQL, and then process the data using Google Cloud Dataproc.
Answer: B
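A sketch of the load step in option B with the BigQuery Python client, assuming the nightly mysqldump output has already been converted to CSV and copied to Cloud Storage; the bucket, dataset, and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical bucket, dataset, and table names; the nightly export is
# assumed to be a headered CSV in Cloud Storage.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.load_table_from_uri(
    "gs://my-bucket/nightly/orders.csv",
    "my-project.analytics.orders",
    job_config=job_config,
)
job.result()  # analytics now run in BigQuery, off the operational cluster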
Topic 2, Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world
manage their resources and transport them to their final destination. The company has grown rapidly,
expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because
they have not updated their infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in
real time at the parcel level. However, they are unable to deploy it because their technology stack, based on
Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their
orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
Use their proprietary technology in a real-time inventory-tracking system that indicates the location of
their loads
Perform analytics on all their orders and shipment logs, which contain both structured and unstructured
data, to determine how best to deploy resources, which markets to expand into. They also want to use
predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic’s architecture resides in a single data center:
Databases:
    8 physical servers in 2 clusters
        SQL Server – user data, inventory, static data
    3 physical servers
        Cassandra – metadata, tracking messages
    10 Kafka servers – tracking message aggregation and batch insert
Application servers – customer front end, middleware for order/customs:
    60 virtual machines across 20 physical servers
        Tomcat – Java services
        Nginx – static content
        Batch servers
Storage appliances:
    iSCSI for virtual machine (VM) hosts
    Fibre Channel storage area network (FC SAN) – SQL Server storage
    Network-attached storage (NAS) – image storage, logs, backups
Apache Hadoop/Spark servers:
    Core Data Lake
    Data analysis workloads
20 miscellaneous servers:
    Jenkins, monitoring, bastion hosts
Business Requirements
Build a reliable and reproducible environment with scaled parity of production.
Aggregate data in a centralized Data Lake for analysis
Use historical data to perform predictive analytics on future shipments
Accurately track every shipment worldwide using proprietary technology
Improve business agility and speed of innovation through rapid provisioning of new resources
Analyze and optimize architecture for performance in the cloud
Migrate fully to the cloud if all other requirements are met
Technical Requirements
Handle both streaming and batch data
Migrate existing Hadoop workloads
Ensure architecture is scalable and elastic to meet the changing demands of the company.
Use managed services whenever possible
Encrypt data in flight and at rest
Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth
and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data
around.
We need to organize our information so we can more easily understand where our customers are and what they
are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I
have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the analytics, and figuring out how to
implement the CFO’s tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing
where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I
don’t want to commit capital to building out a server environment.
https://www.realexamcollection.com/google/professional-data-engineer-dumps.html