Question.11 You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
– The user profile: What the user likes and doesn’t like to eat
– The user account information: Name, address, preferred meal times
– The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?
(A) BigQuery
(B) Cloud SQL
(C) Cloud Bigtable
(D) Cloud Datastore
Answer is (B) Cloud SQL
The database will be used to store all the transactional data of the product. We need a database purely for storing transactional data, not for analysis or ML, so the answer should be the database that stores transactional data, which is Cloud SQL. If you later want to analyze the data or train models, you can expose Cloud SQL to BigQuery as a federated (external) data source.
A: BigQuery is good for analysis, but frequent transactional reads and writes would be slow and costly there.
C: Bigtable is not a good fit for transactional data.
D: Datastore does support transactions, but its guarantees are weaker than a relational database's, and since the data schema is already well defined in this case, you should use an RDB (Cloud SQL).
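For illustration only, here is a minimal sketch of what such a relational schema might look like in Cloud SQL, written as SQLAlchemy models in Python (table and column names are hypothetical, not given in the question):

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    # Account information: name, address, preferred meal times.
    __tablename__ = "users"
    user_id = Column(Integer, primary_key=True)
    name = Column(String(120), nullable=False)
    address = Column(String(255))
    preferred_meal_times = Column(String(100))

class FoodPreference(Base):
    # Profile: what the user likes and doesn't like to eat.
    __tablename__ = "food_preferences"
    preference_id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.user_id"), nullable=False)
    food_item = Column(String(120), nullable=False)
    likes = Column(Integer, nullable=False)  # 1 = likes, 0 = dislikes

class Order(Base):
    # Order information: when orders are made, from where, to whom.
    __tablename__ = "orders"
    order_id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("users.user_id"), nullable=False)
    restaurant = Column(String(120))
    delivery_address = Column(String(255))
    ordered_at = Column(DateTime, nullable=False)

If the ML side later needs this data, BigQuery can query it in place through a Cloud SQL federated connection (EXTERNAL_QUERY) rather than duplicating it.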
Question.12 Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low. You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)
(A) Introduce data compression for each file to increase the rate of file transfer.
(B) Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
(C) Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.
(D) Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
(E) Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
Answers are:
A. Introduce data compression for each file to increase the rate of file transfer.
C. Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.
B – wrong: the solution should work without increasing the Internet bandwidth.
D – bundling 1,000 files into a TAR archive would help, but the volume keeps growing, so how would we choose that number? A fixed batch size does not scale.
E – the available bandwidth is already limited and underused; adding an S3-compatible endpoint with Storage Transfer Service will not help here.
Follow these rules of thumb when deciding whether to use gsutil or Storage Transfer Service:
– Transferring from another cloud storage provider: use Storage Transfer Service.
– Transferring less than 1 TB from on-premises: use gsutil.
– Transferring more than 1 TB from on-premises: use Transfer Service for on-premises data.
– Transferring less than 1 TB from another Cloud Storage region: use gsutil.
– Transferring more than 1 TB from another Cloud Storage region: use Storage Transfer Service.
Reference:
https://cloud.google.com/storage-transfer/docs/overview#gsutil
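To make option (C) concrete: with gsutil, the parallel, compressed upload is roughly gsutil -m cp -z csv ./*.csv gs://<bucket>/incoming/ (-m runs the transfers in parallel, -z gzip-compresses the .csv files in transit). The sketch below does the same with the Cloud Storage Python client and a thread pool; the bucket name and local directory are hypothetical.

import glob
import os
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-ingest-bucket")  # hypothetical bucket name

def upload(path: str) -> None:
    # One object per small CSV; many concurrent requests hide the 200 ms latency.
    blob = bucket.blob(f"incoming/{os.path.basename(path)}")
    blob.upload_from_filename(path)

csv_files = glob.glob("./outbox/*.csv")  # hypothetical local staging directory
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(upload, csv_files))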
Question.13 You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required. You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)
(A) Redis
(B) HBase
(C) MySQL
(D) MongoDB
(E) Cassandra
(F) HDFS with Hive
Answer is (B) HBase, (D) MongoDB, (E) Cassandra
A. Redis – Redis is an in-memory, non-relational key-value store. It is a great choice for a highly available in-memory cache that decreases data access latency, increases throughput, and eases the load on a relational or NoSQL database. Since the question is not about caching (and the dataset is far larger than memory), A is discarded.
B. HBase – meets the requirements.
C. MySQL – ACID is not required here, so a relational database is not needed.
D. MongoDB – meets the requirements.
E. Cassandra – Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
F. HDFS with Hive – Hive allows users to read, write, and manage petabytes of data using SQL. Hive is built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets, so it is closely integrated with Hadoop and is designed to work on petabytes of data. However, Hive is not a database, and its batch-oriented queries do not provide the low latency required here.
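As a small illustration of "querying against individual fields" in one of the chosen databases, here is a hedged MongoDB sketch with pymongo (the connection string, collection, and field names are made up for the example):

from datetime import datetime, timezone

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
telemetry = client["iot"]["telemetry"]

# A secondary index lets us query a single attribute without scanning everything.
telemetry.create_index([("device_id", ASCENDING), ("temperature_c", ASCENDING)])

telemetry.insert_one({
    "device_id": "sensor-0042",
    "recorded_at": datetime.now(timezone.utc),
    "temperature_c": 71.5,
    "humidity_pct": 40.2,
    # ... up to ~100 attributes per entry, as in the question
})

hot_readings = telemetry.find({"temperature_c": {"$gt": 70}}).limit(10)
for doc in hot_readings:
    print(doc["device_id"], doc["temperature_c"])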
Question.14 You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage 1 (query plan screenshot not reproduced). What is the most likely cause of the delay for this query?
(A) Users are running too many concurrent queries in the system
(B) The [myproject:mydataset.mytable] table has too many partitions
(C) Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
(D) Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
Answer is (D) Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
In the query plan timing breakdown, purple represents time spent reading and blue represents time spent writing, so the majority of this stage's time is spent reading.
Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can’t share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.
Reference:
https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
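One quick way to confirm the skew hypothesis is to count rows per country value. Here is a sketch with the BigQuery Python client (the table reference comes from the question, written in standard SQL backtick form instead of the legacy [project:dataset.table] form):

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT country, COUNT(*) AS row_count
    FROM `myproject.mydataset.mytable`
    GROUP BY country
    ORDER BY row_count DESC
    LIMIT 10
"""
# If one country accounts for most of the rows, the table is skewed on that column.
for row in client.query(sql).result():
    print(row.country, row.row_count)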
Question.15 Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first. What should you do?
(A) Create a file on a shared file and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
(B) Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
(C) Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
(D) Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
Answer is (D) Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
The need is to collate the bid events in real time and determine the first bid per item based on the timestamp of when each event occurred. This can be done by publishing to Pub/Sub and consuming the events with Dataflow through a pull subscription.
Reference:
https://stackoverflow.com/questions/62997414/push-vs-pull-for-gcp-dataflow
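A minimal sketch of the Dataflow side using the Apache Beam Python SDK: pull bid events from a Pub/Sub subscription, window them, and keep the earliest bid per item by event timestamp. The subscription path, JSON field names, and the 60-second window are assumptions for the example, not part of the question.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def parse_bid(message: bytes):
    bid = json.loads(message.decode("utf-8"))
    # Key by item so all bids for the same item are compared together.
    return bid["item"], bid

def earliest(bids):
    # The winning bid is the one with the smallest event timestamp.
    return min(bids, key=lambda bid: bid["timestamp"])

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadBids" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/bids")  # hypothetical
        | "ParseJson" >> beam.Map(parse_bid)
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "FirstBidPerItem" >> beam.CombinePerKey(earliest)
        | "Print" >> beam.Map(print)
    )

In production the last step would write the winning bids to a sink (for example BigQuery or Cloud SQL) instead of printing them; printing is used here only to keep the sketch self-contained.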