Question.36 A company plans to build a custom natural language processing (NLP) model to classify and prioritize user feedback. The company hosts the data and all machine learning (ML) infrastructure in the AWS Cloud. The ML team works from the company’s office, which has an IPsec VPN connection to one VPC in the AWS Cloud. The company has set both the enableDnsHostnames attribute and the enableDnsSupport attribute of the VPC to true. The company’s DNS resolvers point to the VPC DNS. The company does not allow the ML team to access Amazon SageMaker notebooks through connections that use the public internet. The connection must stay within a private network and within the AWS internal network. Which solution will meet these requirements with the LEAST development effort? (A) Create a VPC interface endpoint for the SageMaker notebook in the VPC. Access the notebook through a VPN connection and the VPC endpoint. (B) Create a bastion host by using Amazon EC2 in a public subnet within the VPC. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host. (C) Create a bastion host by using Amazon EC2 in a private subnet within the VPC with a NAT gateway. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host. (D) Create a NAT gateway in the VPC. Access the SageMaker notebook HTTPS endpoint through a VPN connection and the NAT gateway.
Correct Answer: A
In this scenario, the company requires that access to the Amazon SageMaker notebook remain within the AWS internal network, avoiding the public internet. By creating a VPC interface endpoint for SageMaker, the company can ensure that traffic to the SageMaker notebook remains internal to the VPC and is accessible over a private connection. The VPC interface endpoint allows private network access to AWS services, and it operates over AWS’s internal network, respecting the security and connectivity policies the company requires.
This solution requires minimal development effort compared to options involving bastion hosts or NAT gateways, as it directly provides private network access to the SageMaker notebook.
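To make the mechanics concrete, here is a minimal boto3 sketch of creating such an interface endpoint. This is an illustration under assumptions, not a prescribed implementation: the region, VPC ID, subnet ID, and security group ID are placeholders, and private DNS is enabled so the team’s existing VPC resolvers can resolve the notebook URL to private IPs.

```python
# Hedged sketch: create a VPC interface endpoint for SageMaker notebook
# access. All resource IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                   # placeholder VPC ID
    ServiceName="aws.sagemaker.us-east-1.notebook",  # SageMaker notebook service
    SubnetIds=["subnet-0123456789abcdef0"],          # placeholder subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],       # must allow HTTPS (443) from the VPN range
    PrivateDnsEnabled=True,  # resolve the notebook URL to private IPs inside the VPC
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```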
Question.37 A retail company is selling products through a global online marketplace. The company wants to use machine learning (ML) to analyze customer feedback and identify specific areas for improvement. A developer has built a tool that collects customer reviews from the online marketplace and stores them in an Amazon S3 bucket. This process yields a dataset of 40 reviews. A data scientist building the ML models must identify additional sources of data to increase the size of the dataset. Which data sources should the data scientist use to augment the dataset of reviews? (Choose three.) (A) Emails exchanged by customers and the company’s customer service agents (B) Social media posts containing the name of the company or its products (C) A publicly available collection of news articles (D) A publicly available collection of customer reviews (E) Product sales revenue figures for the company (F) Instruction manuals for the company’s products
Correct Answer: A,B,D
The data sources that the data scientist should use to augment the dataset of reviews are those that contain relevant and diverse customer feedback about the company or its products. Emails exchanged by customers and the company’s customer service agents can provide valuable insights into the issues and complaints that customers have, as well as the solutions and responses that the company offers. Social media posts containing the name of the company or its products can capture the opinions and sentiments of customers and potential customers, as well as their reactions to marketing campaigns and product launches. A publicly available collection of customer reviews can provide a large and varied sample of feedback from different online platforms and marketplaces, which can help to generalize the ML models and avoid bias.
Detect sentiment from customer reviews using Amazon Comprehend | AWS Machine Learning Blog
How to Apply Machine Learning to Customer Feedback
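As a small illustration of the kind of analysis the first reference describes, the sketch below runs one review through Amazon Comprehend’s sentiment detection with boto3. The region and the sample review text are placeholders, and in practice the reviews would be read from the S3 bucket rather than hard-coded.

```python
# Hedged sketch: sentiment analysis of a single review with Amazon Comprehend.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Placeholder review text; real input would come from the S3 review dataset.
review = "The product arrived late and the packaging was damaged."

result = comprehend.detect_sentiment(Text=review, LanguageCode="en")
print(result["Sentiment"])       # e.g. NEGATIVE
print(result["SentimentScore"])  # per-class confidence scores
```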
Question.38 A company processes millions of orders every day. The company uses Amazon DynamoDB tables to store order information. When customers submit new orders, the new orders are immediately added to the DynamoDB tables. New orders arrive in the DynamoDB tables continuously. A data scientist must build a peak-time prediction solution. The data scientist must also create an Amazon QuickSight dashboard to display near real-time order insights. The data scientist needs to build a solution that will give QuickSight access to the data as soon as new order information arrives. Which solution will meet these requirements with the LEAST delay between when a new order is processed and when QuickSight can access the new order information? (A) Use AWS Glue to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3. (B) Use Amazon Kinesis Data Streams to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3. (C) Use an API call from QuickSight to access the data that is in Amazon DynamoDB directly. (D) Use Amazon Kinesis Data Firehose to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3.
Correct Answer: B
The best solution for this scenario is to use Amazon Kinesis Data Streams to export the data from Amazon DynamoDB to Amazon S3, and then configure QuickSight to access the data in Amazon S3. This solution has the following advantages:
* It allows near real-time data ingestion from DynamoDB to S3 using Kinesis Data Streams, which can capture and process data continuously and at scale [1].
* It enables QuickSight to access the data in S3 using the Athena connector, which supports federated queries to multiple data sources, including Kinesis Data Streams [2].
* It avoids the need to create and manage a Lambda function or a Glue crawler, which are required for the other solutions.
The other solutions have the following drawbacks:
* Using AWS Glue to export the data from DynamoDB to S3 introduces additional latency and complexity, as Glue is a batch-oriented service that requires scheduling and configuration [3].
* Using an API call from QuickSight to access the data in DynamoDB directly is not possible, as QuickSight does not support direct querying of DynamoDB [4].
* Using Kinesis Data Firehose to export the data from DynamoDB to S3 is less efficient and flexible than using Kinesis Data Streams, as Firehose does not support custom data processing or transformation, and has a minimum buffer interval of 60 seconds [5].
[1] Amazon Kinesis Data Streams – Amazon Web Services
[2] Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue | AWS Big Data Blog
[3] AWS Glue – Amazon Web Services
[4] Visualising your Amazon DynamoDB data with Amazon QuickSight – DEV Community
[5] Amazon Kinesis Data Firehose – Amazon Web Services
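For reference, a hedged boto3 sketch of the first hop of this pipeline: sending DynamoDB item-level changes into a Kinesis data stream. The table name, stream name, and region are placeholders; delivering the stream to S3 and pointing QuickSight at it would follow separately.

```python
# Hedged sketch: wire DynamoDB change data into Kinesis Data Streams.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create the destination stream (on-demand mode absorbs spiky order traffic).
kinesis.create_stream(
    StreamName="orders-stream",                       # placeholder stream name
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
kinesis.get_waiter("stream_exists").wait(StreamName="orders-stream")

stream_arn = kinesis.describe_stream(StreamName="orders-stream")[
    "StreamDescription"
]["StreamARN"]

# Send every item-level change in the orders table to the stream.
dynamodb.enable_kinesis_streaming_destination(
    TableName="orders",                               # placeholder table name
    StreamArn=stream_arn,
)
```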
Question.39 A retail company uses a machine learning (ML) model for daily sales forecasting. The company’s brand manager reports that the model has provided inaccurate results for the past 3 weeks. At the end of each day, an AWS Glue job consolidates the input data that is used for the forecasting with the actual daily sales data and the predictions of the model. The AWS Glue job stores the data in Amazon S3. The company’s ML team is using an Amazon SageMaker Studio notebook to gain an understanding of the source of the model’s inaccuracies. What should the ML team do on the SageMaker Studio notebook to visualize the model’s degradation MOST accurately? (A) Create a histogram of the daily sales over the last 3 weeks. In addition, create a histogram of the daily sales from before that period. (B) Create a histogram of the model errors over the last 3 weeks. In addition, create a histogram of the model errors from before that period. (C) Create a line chart with the weekly mean absolute error (MAE) of the model. (D) Create a scatter plot of daily sales versus model error for the last 3 weeks. In addition, create a scatter plot of daily sales versus model error from before that period.
Correct Answer: B
The best way to visualize the model’s degradation is to create a histogram of the model errors over the last 3 weeks and compare it with a histogram of the model errors from before that period. A histogram is a graphical representation of the distribution of numerical data. It shows how often each value or range of values occurs in the data. A model error is the difference between the actual value and the predicted value. A high model error indicates a poor fit of the model to the data. By comparing the histograms of the model errors, the ML team can see if there is a significant change in the shape, spread, or center of the distribution. This can indicate if the model is underfitting, overfitting, or drifting from the data. A line chart or a scatter plot would not be as effective as a histogram for this purpose, because they do not show the distribution of the errors. A line chart would only show the trend of the errors over time, which may not capture the variability or outliers.
A scatter plot would only show the relationship between the errors and another variable, such as daily sales, which may not be relevant or informative for the model’s performance.
References:
* Histogram – Wikipedia
* Model error – Wikipedia
* SageMaker Model Monitor – visualizing monitoring results
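A brief pandas/matplotlib sketch of the comparison described above. The S3 path and column names (date, actual_sales, predicted_sales) are assumptions about the Glue job’s output, not known details, and reading Parquet from S3 assumes s3fs is installed.

```python
# Hedged sketch: compare model-error distributions before vs. during
# the degraded 3-week period. Path and column names are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet("s3://forecast-bucket/daily/")  # hypothetical Glue output
df["error"] = df["actual_sales"] - df["predicted_sales"]

cutoff = df["date"].max() - pd.Timedelta(weeks=3)    # assumes datetime column
recent = df.loc[df["date"] > cutoff, "error"]
baseline = df.loc[df["date"] <= cutoff, "error"]

# Overlaid histograms make shifts in center, spread, or shape visible.
plt.hist(baseline, bins=30, alpha=0.5, label="before")
plt.hist(recent, bins=30, alpha=0.5, label="last 3 weeks")
plt.xlabel("forecast error (actual - predicted)")
plt.legend()
plt.show()
```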
Question.40 An online retailer collects the following data on customer orders: demographics, behaviors, location, shipment progress, and delivery time. A data scientist joins all the collected datasets. The result is a single dataset that includes 980 variables. The data scientist must develop a machine learning (ML) model to identify groups of customers who are likely to respond to a marketing campaign. Which combination of algorithms should the data scientist use to meet this requirement? (Select TWO.) (A) Latent Dirichlet Allocation (LDA) (B) K-means (C) Semantic segmentation (D) Principal component analysis (PCA) (E) Factorization machines (FM)
Correct Answer: B,D
The data scientist should use K-means and principal component analysis (PCA) to meet this requirement. K-means is a clustering algorithm that can group customers based on their similarity in the feature space. PCA is a dimensionality reduction technique that can transform the original 980 variables into a smaller set of uncorrelated variables that capture most of the variance in the data. This can help reduce the computational cost and noise in the data, and improve the performance of the clustering algorithm.
Clustering – Amazon SageMaker
Dimensionality Reduction – Amazon SageMaker
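A minimal scikit-learn sketch of the PCA-then-K-means combination, assuming the joined dataset is a numeric matrix with one row per customer. The random data, the 95% variance threshold, and the cluster count of 8 are placeholders for illustration only.

```python
# Hedged sketch: reduce 980 variables with PCA, then segment customers
# with K-means. The input matrix X is synthetic stand-in data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 980))  # stand-in for the joined customer dataset

pipeline = make_pipeline(
    StandardScaler(),        # put all 980 variables on one scale
    PCA(n_components=0.95),  # keep components explaining 95% of the variance
    KMeans(n_clusters=8, n_init=10, random_state=0),  # customer segments
)
labels = pipeline.fit_predict(X)
print(np.bincount(labels))   # size of each discovered segment
```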