Question.16 A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently. The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database. Which AWS service should the company use to meet these requirements? (A) AWS Lambda (B) AWS Database Migration Service (AWS DMS) (C) AWS Direct Connect (D) AWS DataSync
Answer: B
Explanation:
The correct answer is B. AWS Database Migration Service (AWS DMS).
AWS DMS is specifically designed for database migration, making it the most suitable choice for migrating the on-premises SQL Server database to Amazon RDS for SQL Server. It supports both homogeneous migrations, such as this SQL Server to RDS for SQL Server scenario, and heterogeneous migrations between different database engines. DMS offers continuous data replication (change data capture), which keeps the target synchronized with the source while the applications continue to use the database. It allows for a cutover at a convenient time, resulting in minimal disruption to applications. The continuous replication and the option for a cutover window address the requirement for minimal downtime.
AWS Lambda (A) is a serverless compute service used for running code in response to events. While it could potentially be used for data migration, it would require significantly more custom coding and would be less efficient and more complex to manage than DMS for this specific task.
AWS Direct Connect (C) provides a dedicated network connection from on-premises to AWS. While it can improve network throughput, consistency, and security, it does not migrate data by itself: a migration tool is still required, and it does not address the downtime requirement. It can lower per-GB data transfer charges, but its dedicated port costs add to, rather than reduce, the overall cost of the migration.
AWS DataSync (D) is primarily used for transferring large datasets between on-premises storage systems and AWS storage services like S3, EFS, and FSx. It’s optimized for file-based data transfer, not for migrating database schemas and data directly to a managed database service like RDS.
DMS also keeps costs in check for recurring migrations: you pay for the replication instance (or DMS Serverless capacity) and storage used while the task runs, which is typically cheaper than building and maintaining a custom solution. Therefore, DMS is the most appropriate and cost-effective option that minimizes downtime.
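To make this concrete, below is a minimal boto3 sketch of defining a DMS task that performs a full load followed by ongoing change data capture, which is the pattern that enables a low-downtime cutover. The endpoint and replication instance ARNs, the schema name, and the task identifier are placeholders, not values from the scenario.

```python
import json

import boto3

dms = boto3.client("dms")

# Placeholder ARNs; the source (on-premises SQL Server) and target
# (Amazon RDS for SQL Server) endpoints and the replication instance
# would be created ahead of time.
SOURCE_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:source-sqlserver"
TARGET_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:target-rds-sqlserver"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:replication-instance"

# Select the tables to migrate (hypothetical "finance" schema).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-finance-schema",
            "object-locator": {"schema-name": "finance", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# Full load plus CDC keeps the target in sync until cutover,
# which is what minimizes application downtime.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="monthly-finance-migration",
    SourceEndpointArn=SOURCE_ENDPOINT_ARN,
    TargetEndpointArn=TARGET_ENDPOINT_ARN,
    ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```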
Further Research:
AWS DMS Best Practices: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_BestPractices.html
AWS DMS Documentation: https://aws.amazon.com/dms/
Question.17 A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour. Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.) (A) Configure AWS Glue triggers to run the ETL jobs every hour. (B) Use AWS Glue DataBrew to clean and prepare the data for analytics. (C) Use AWS Lambda functions to schedule and run the ETL jobs every hour. (D) Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift. (E) Use the Redshift Data API to load transformed data into Amazon Redshift.
Answer: AD
Explanation:
The correct answer is AD. Here’s why:
A. Configure AWS Glue triggers to run the ETL jobs every hour: AWS Glue triggers are a native and straightforward way to schedule and execute Glue ETL jobs. They can be configured to run on a schedule (time-based), based on events (like the completion of another job), or on demand. Using Glue triggers directly addresses the requirement for hourly data updates with minimal operational overhead, as it’s a managed feature of AWS Glue itself, requiring no additional services or custom code for scheduling. Lambda (option C) would introduce unnecessary complexity for a simple scheduled execution.
D. Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift: AWS Glue connections provide a centralized and managed way to store and manage connection information to various data sources, including Amazon RDS, MongoDB, and Amazon Redshift. This simplifies the ETL job configuration by allowing you to reference connections instead of hardcoding connection details in each job. This reduces the need to manually configure connections within each ETL script, leading to easier maintenance and reduced operational overhead.
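As a rough illustration of these two tasks, the boto3 sketch below stores a JDBC connection and creates an hourly scheduled trigger for an existing Glue job. The connection name, JDBC URL, credentials, and job name are hypothetical placeholders, and a real connection would usually also include VPC/subnet details in `PhysicalConnectionRequirements`.

```python
import boto3

glue = boto3.client("glue")

# Store connection details once so every ETL job can reference them by name
# (URL and credentials here are placeholders).
glue.create_connection(
    ConnectionInput={
        "Name": "redshift-analytics",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example-cluster:5439/dev",
            "USERNAME": "etl_user",
            "PASSWORD": "example-password",
        },
    }
)

# Run the ETL job at the top of every hour with a scheduled Glue trigger.
glue.create_trigger(
    Name="hourly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "rds-mongodb-to-redshift-etl"}],
    StartOnCreation=True,
)
```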
Why other options are not suitable:
- B. Use AWS Glue DataBrew to clean and prepare the data for analytics: While DataBrew can be used for data preparation, it’s primarily focused on interactive data exploration and visual data transformations, which aren’t as suitable for automated, scheduled ETL pipelines as Glue ETL jobs. It does not have the same programmatic flexibility and scaling capabilities as Glue ETL for this use case.
- C. Use AWS Lambda functions to schedule and run the ETL jobs every hour: Using Lambda to schedule Glue jobs introduces additional complexity and overhead. You would need to manage the Lambda function, its execution role, and ensure its reliability. Glue triggers provide a more direct and managed approach to scheduling Glue jobs.
- E. Use the Redshift Data API to load transformed data into Amazon Redshift: While the Redshift Data API can be used to load data, it is better suited for executing SQL statements and interacting with Redshift than for high-volume data loading within an ETL pipeline. Glue ETL jobs that write through the Glue Redshift connection (for example, via `dynamic_frame.toDF().write` or `glueContext.write_dynamic_frame`, as sketched below) offer better performance and integration for loading data from other data sources, and the Glue connector is optimized for loading data into Redshift.
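For context, a minimal PySpark sketch of the Glue-side write is shown below. It assumes a Glue connection named `redshift-analytics`, a catalog database and table for the transformed data, and an S3 staging path; all of these names are placeholders, not part of the question.

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read previously transformed data registered in the Glue Data Catalog
# (database and table names are placeholders).
transformed = glue_context.create_dynamic_frame.from_catalog(
    database="staging_db", table_name="transformed_transactions"
)

# Load into Redshift through the Glue connection; Glue stages the data in S3
# and issues a bulk load, which suits high-volume ETL better than row-level APIs.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=transformed,
    catalog_connection="redshift-analytics",
    connection_options={"dbtable": "analytics.transactions", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/redshift-temp/",
)
```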
Supporting Documentation:
- AWS Glue Triggers: https://docs.aws.amazon.com/glue/latest/dg/trigger-definition.html
- AWS Glue Connections: https://docs.aws.amazon.com/glue/latest/dg/connections-api.html
- AWS Glue DataBrew: https://aws.amazon.com/databrew/
- Redshift Data API: https://docs.aws.amazon.com/redshift-data-api/latest/APIReference/Welcome.html
Question.18 A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling. Which solution will meet this requirement? (A) Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups. (B) Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster. (C) Turn on concurrency scaling in the settings during the creation of any new Redshift cluster. (D) Turn on concurrency scaling for the daily usage quota for the Redshift cluster.
Answer: B
Explanation:
The correct answer is B: Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.
Here’s a detailed justification:
Amazon Redshift concurrency scaling automatically adds compute capacity to your Redshift cluster to handle increases in concurrent read and write queries. This ensures consistent performance even during peak demand. Concurrency scaling is not a cluster-wide setting enabled during cluster creation (option C), nor is it configured through a daily usage quota (option D); the quota only caps how much concurrency-scaling usage you are willing to pay for. Concurrency scaling for both read and write workloads is supported on RA3 node types, which is why the cluster in this scenario can use it.
Workload Management (WLM) allows you to prioritize and manage queries based on their importance. Concurrency scaling is configured at the WLM queue level. By enabling concurrency scaling for specific WLM queues, you allow Redshift to automatically spin up additional compute resources when queries assigned to that queue experience contention due to high concurrency. This distributes the workload across more resources, improving query performance. Redshift Serverless, mentioned in option A, is a different deployment option than a provisioned Redshift cluster using RA3 nodes. While Redshift Serverless also offers concurrency scaling features, the context specifically refers to an existing Redshift cluster. Therefore, the focus should be on the settings within that cluster.
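As an illustration, concurrency scaling is turned on per queue through the cluster parameter group's `wlm_json_configuration` parameter. The boto3 sketch below is a minimal example assuming manual WLM and a single queue; the parameter group name and concurrency value are placeholders.

```python
import json

import boto3

redshift = boto3.client("redshift")

# Each queue in the WLM configuration opts in to concurrency scaling
# by setting "concurrency_scaling" to "auto".
wlm_config = [
    {
        "query_group": [],
        "user_group": [],
        "query_concurrency": 5,
        "concurrency_scaling": "auto",
    }
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="custom-ra3-params",  # placeholder parameter group name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
            "ApplyType": "dynamic",
        }
    ],
)
```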
For more information, refer to the AWS documentation on Amazon Redshift concurrency scaling:
Configuring workload management (WLM) for concurrency scaling
Question.19 A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes. Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.) (A) Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically. (B) Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running. (C) Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically. (D) Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running. (E) Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.
Answer: AB
Explanation:
The most cost-effective solution for orchestrating long-running Athena queries daily involves a combination of AWS Lambda and AWS Step Functions. Lambda provides a serverless execution environment to invoke the Athena queries using the Boto3 `start_query_execution` API call. Because each query can run longer than Lambda's 15-minute timeout, the function only starts the query and returns immediately rather than waiting for it to finish, so you pay only for the brief compute time the function actually consumes.

AWS Step Functions manages the workflow and handles the asynchronous nature of Athena queries. Adding a Wait state in Step Functions is crucial: it allows the workflow to pause and then periodically check the status of the Athena query using the `get_query_execution` API call. With a Wait state, the workflow polls for completion without consuming compute resources while waiting, unlike a continuously running process. Once the query completes, the Step Functions workflow triggers the next query in the series.
Let’s examine why other options are less suitable. Using AWS Glue Python shell jobs (options C and D) can be more expensive. Glue jobs are generally intended for data transformation and ETL processes, and maintaining a constantly running Glue job to check query status is inefficient and costly. Option E, using Amazon MWAA and AWS Batch, would introduce significant overhead and cost for this specific use case. MWAA is typically employed for more complex, large-scale data pipelines, making it overkill for a series of Athena queries. Batch is suitable for compute-intensive tasks, not for orchestrating Athena queries.
Therefore, the combination of Lambda for invoking Athena queries and Step Functions with a Wait state for monitoring and orchestration offers the optimal balance of cost and efficiency.
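To make the pattern concrete, below is a minimal sketch of the two Lambda handlers the Step Functions workflow could invoke: one starts a query, the other reports its status after each Wait state. The database name and S3 output location are hypothetical placeholders.

```python
import boto3

athena = boto3.client("athena")


def start_query_handler(event, context):
    """Start an Athena query and return its execution ID to Step Functions."""
    response = athena.start_query_execution(
        QueryString=event["query"],
        QueryExecutionContext={"Database": "analytics_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}


def check_query_handler(event, context):
    """Called after a Wait state to report whether the query has finished."""
    response = athena.get_query_execution(QueryExecutionId=event["QueryExecutionId"])
    state = response["QueryExecution"]["Status"]["State"]  # e.g. RUNNING, SUCCEEDED, FAILED
    return {"QueryExecutionId": event["QueryExecutionId"], "Status": state}
```

A Choice state in the workflow can then loop back to the Wait state while the status is RUNNING or QUEUED, and move on to the next query once it is SUCCEEDED.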
Supporting Links:
- AWS Lambda: https://aws.amazon.com/lambda/
- AWS Step Functions: https://aws.amazon.com/step-functions/
- Amazon Athena: https://aws.amazon.com/athena/
- Athena Boto3 Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/athena.html
Question.20 A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options. The company’s current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache Hbase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS. Which extract, transform, and load (ETL) service will meet these requirements? (A) AWS Glue (B) Amazon EMR (C) AWS Lambda (D) Amazon Redshift
Answer: B
Explanation:
The correct answer is B (Amazon EMR) because it best fits the requirements of migrating complex workloads utilizing Apache Pig, Oozie, Spark, HBase, and Flink to AWS while aiming for similar or better performance and reduced operational overhead.
Here’s why:
- Amazon EMR provides a managed Hadoop framework: EMR simplifies the setup, operation, and scaling of big data frameworks like Hadoop, Spark, HBase, and Flink. It directly supports the existing workloads utilizing these technologies. (https://aws.amazon.com/emr/)
- Performance: EMR can leverage EC2 instances optimized for compute and memory, allowing for processing petabytes of data in seconds, mirroring the on-premises performance.
- Reduced Operational Overhead: EMR handles the underlying infrastructure, operating system patching, and framework updates, freeing the company from these tasks.
- Cost Optimization: EMR supports spot instances to reduce costs for fault-tolerant workloads. It also offers various instance types tailored for specific workloads.
- Suitable for complex workloads: EMR is designed for running complex, distributed data processing applications.
Now, let’s analyze why the other options are less suitable:
- AWS Glue: Glue is a serverless ETL service focused on data cataloging, transformation, and loading. Although Glue runs Spark, it does not provide Pig, Oozie, HBase, or Flink, so it is not a direct replacement for the full set of frameworks the company uses and may not deliver the required performance or flexibility for these specific workloads.
- AWS Lambda: Lambda is suitable for event-driven, serverless compute tasks, but it is not designed for large-scale data processing with frameworks like Spark or Flink. Its execution time limits and memory constraints make it unsuitable for petabyte-scale workloads.
- Amazon Redshift: Redshift is a data warehouse service, ideal for analytical queries and reporting. It is not a direct replacement for the processing frameworks the company currently uses and is more of a destination for processed data rather than an ETL platform in this context. While Redshift can perform some transformations, it’s not optimized for the complex operations performed by Spark or Flink.
Therefore, Amazon EMR is the most appropriate ETL service to meet the company’s requirements for migrating their on-premises workloads to AWS while maintaining performance and reducing operational overhead, due to its native support for the technologies they are already using at scale.
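For reference, a minimal boto3 sketch of provisioning an EMR cluster with the frameworks the company already uses might look like the following. The release label, instance types and counts, log bucket, and IAM role names are placeholders chosen for illustration, not values from the question.

```python
import boto3

emr = boto3.client("emr")

# Provision a cluster with the frameworks the on-premises workloads
# already rely on (sizes and release label are placeholders).
response = emr.run_job_flow(
    Name="migrated-etl-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[
        {"Name": "Spark"},
        {"Name": "HBase"},
        {"Name": "Flink"},
        {"Name": "Pig"},
        {"Name": "Oozie"},
    ],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "r5.4xlarge",
                "InstanceCount": 10,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://example-bucket/emr-logs/",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print(response["JobFlowId"])
```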