Question.61 IT leadership wants to transition a company's existing machine learning data storage environment to AWS as a temporary ad hoc solution. The company currently uses a custom software process that heavily leverages SQL as a query language and exclusively stores generated .csv documents for machine learning. The ideal state for the company would be a solution that allows it to continue to use its current workforce of SQL experts. The solution must also support the storage of CSV and JSON files and be able to query over semi-structured data. The following are high priorities for the company: * Solution simplicity * Fast development time * Low cost * High flexibility Which technologies meet the company's requirements? (A) Amazon S3 and Amazon Athena (B) Amazon Redshift and AWS Glue (C) Amazon DynamoDB and DynamoDB Accelerator (DAX) (D) Amazon RDS and Amazon ES
Correct Answer: A
Amazon S3 and Amazon Athena are technologies that meet the company's requirements for a temporary ad hoc solution for machine learning data storage and querying. Amazon S3 and Amazon Athena have the following features and benefits:
* Amazon S3 is a service that provides scalable, durable, and secure object storage for any type of data. Amazon S3 can store CSV and JSON files, as well as other formats, and can handle large volumes of data with high availability and performance. Amazon S3 also integrates with other AWS services, such as Amazon Athena, for further processing and analysis of the data.
* Amazon Athena is a service that allows querying data stored in Amazon S3 using standard SQL. Amazon Athena can query over semi-structured data, such as JSON, as well as structured data, such as CSV, without requiring any loading or transformation. Amazon Athena is serverless, meaning that there is no infrastructure to manage and users pay only for the queries they run. Amazon Athena also supports the AWS Glue Data Catalog, a centralized metadata repository that can store and manage the schema and partition information of the data in Amazon S3 (see the sketch after this list).
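To make the simplicity claim concrete, here is a minimal, hedged sketch of defining a table over CSV files and querying it through Athena from Python with boto3. The bucket, database, table, and column names are hypothetical placeholders, not details from the question:

```python
import boto3

# Hypothetical names -- substitute a real bucket, database, and output location.
athena = boto3.client("athena", region_name="us-east-1")

# Define an external table over CSV files already sitting in S3 (no loading step).
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS ml_data.training_samples (
    user_id   string,
    feature_1 double,
    label     int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-ml-bucket/csv/'
"""

QUERY = "SELECT label, COUNT(*) AS n FROM ml_data.training_samples GROUP BY label"

# Queries run asynchronously; results land in the output location, and
# get_query_execution can be polled for completion.
for sql in (DDL, QUERY):
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "ml_data"},
        ResultConfiguration={"OutputLocation": "s3://example-ml-bucket/athena-results/"},
    )
```

Because the table definition is only metadata in the Glue Data Catalog, the files in S3 are never moved or transformed, and the existing workforce's SQL skills carry over directly.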
Using Amazon S3 and Amazon Athena, the company can achieve the following high priorities:
* Solution simplicity: Amazon S3 and Amazon Athena are easy to use and require minimal configuration and maintenance. The company can simply upload the CSV and JSON files to Amazon S3 and use Amazon Athena to query them using SQL. The company does not need to worry about provisioning, scaling, or managing any servers or clusters.
* Fast development time: Amazon S3 and Amazon Athena can enable the company to quickly access and analyze the data without any data preparation or loading. The company can use the existing workforce of SQL experts to write and run queries on Amazon Athena and get results in seconds or minutes.
* Low cost: Amazon S3 and Amazon Athena are cost-effective and offer pay-as-you-go pricing models. Amazon S3 charges based on the amount of storage used and the number of requests made. Amazon Athena charges based on the amount of data scanned by each query. The company can further reduce costs by using compression, partitioning, and columnar formats to shrink the amount of data stored and scanned.
* High flexibility: Amazon S3 and Amazon Athena are flexible and can support various data types, formats, and sources. The company can store and query any type of data in Amazon S3, such as CSV, JSON, Parquet, and ORC. The company can also query data from multiple sources in Amazon S3, such as data lakes, data warehouses, and log files.
The other options are not as suitable as option A for the company’s requirements for the following reasons:
* Option B: Amazon Redshift and AWS Glue are technologies that can be used for data warehousing and data integration, but they are not ideal for a temporary ad hoc solution. Amazon Redshift is a service that provides a fully managed, petabyte-scale data warehouse that can run complex analytical queries using SQL. AWS Glue is a service that provides a fully managed extract, transform, and load (ETL) service that can prepare and load data for analytics. However, using Amazon Redshift and AWS Glue would require more effort and cost than using Amazon S3 and Amazon Athena. The company would need to load the data from Amazon S3 to Amazon Redshift using AWS Glue, which can take time and incur additional charges. The company would also need to manage the capacity and performance of the Amazon Redshift cluster, which can be complex and expensive.
* Option C: Amazon DynamoDB and DynamoDB Accelerator (DAX) are technologies that can be used for a fast and scalable NoSQL database with caching, but they are not suitable for the company's data storage and query needs. Amazon DynamoDB is a fully managed key-value and document database that can deliver single-digit millisecond performance at any scale. DynamoDB Accelerator (DAX) is a fully managed in-memory cache for DynamoDB that can improve read performance by up to 10 times. However, using Amazon DynamoDB and DAX would not allow the company to continue to use standard SQL as its query language; DynamoDB's PartiQL support covers only a limited SQL-compatible subset, so the company would largely need to use the DynamoDB API or the AWS SDKs to access and query the data, which requires more coding and learning effort. The company would also need to transform the CSV and JSON files into DynamoDB items, which involves additional processing and complexity.
* Option D: Amazon RDS and Amazon ES are technologies that can be used for relational databases and for search and analytics, but they are not optimal for the company's data storage and query scenario. Amazon RDS is a service that provides a fully managed relational database that supports various database engines, such as MySQL, PostgreSQL, and Oracle. Amazon ES is a service that provides a fully managed Elasticsearch cluster, which is mainly used for search and analytics purposes. However, using Amazon RDS and Amazon ES would not be as simple and cost-effective as using Amazon S3 and Amazon Athena. The company would need to load the data from Amazon S3 into Amazon RDS, which can take time and incur additional charges. The company would also need to manage the capacity and performance of the Amazon RDS and Amazon ES clusters, which can be complex and expensive. Moreover, Amazon RDS and Amazon ES do not handle semi-structured data, such as JSON, as naturally as Amazon S3 and Amazon Athena.
Question.62 A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users. The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns. Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory. Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.) (A) Add more deep trees to the random forest to enable the model to learn more features. (B) Include a copy of the samples in the test dataset in the training dataset. (C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. (D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. (E) Change the cost function so that false positives have a higher impact on the cost value than false negatives.
Correct Answer: C,D
The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:
* C. Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a technique called data augmentation, which can help increase the size and diversity of the training data for the minority class. This can help the random forest model learn more features and patterns from the positive class and reduce the imbalance ratio.
* D. Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user is actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class (see the combined sketch after this list).
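As a rough sketch of both mitigations, assuming scikit-learn and a NumPy feature matrix with labels (the names `X_train` and `y_train` are hypothetical placeholders), the snippet below oversamples the positive class with small Gaussian noise and trains a class-weighted random forest so that missing a paid user is penalized more heavily:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def augment_positives(X, y, copies=5, noise_scale=0.01, seed=0):
    """Approach C: duplicate positive-class rows and add small Gaussian noise.

    Assumes the features in X are numeric; categorical columns such as
    device or location should be excluded from the noise in practice.
    """
    rng = np.random.default_rng(seed)
    X_pos = X[y == 1]
    noisy = [X_pos + rng.normal(0.0, noise_scale, X_pos.shape) for _ in range(copies)]
    X_aug = np.vstack([X] + noisy)
    y_aug = np.concatenate([y, np.ones(len(X_pos) * copies, dtype=y.dtype)])
    return X_aug, y_aug

X_aug, y_aug = augment_positives(X_train, y_train)  # X_train, y_train are placeholders

# Approach D: weight errors on the rare positive class more heavily -- the
# cost-sensitive analogue of making false negatives costlier than false positives.
model = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 100}, random_state=0)
model.fit(X_aug, y_aug)
```

The noise scale and class weights here are illustrative; in practice they should be tuned against a validation set, with the noise scaled to each feature's range.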
Question.63 A finance company has collected stock return data for 5,000 publicly traded companies. A financial analyst has a dataset that contains 2,000 attributes for each company. The financial analyst wants to use Amazon SageMaker to identify the top 15 attributes that are most valuable to predict future stock returns. Which solution will meet these requirements with the LEAST operational overhead? (A) Use the linear learner algorithm in SageMaker to train a linear regression model to predict the stock returns. Identify the most predictive features by ranking absolute coefficient values. (B) Use random forest regression in SageMaker to train a model to predict the stock returns. Identify the most predictive features based on Gini importance scores. (C) Use an Amazon SageMaker Data Wrangler quick model visualization to predict the stock returns. Identify the most predictive features based on the quick model's feature importance scores. (D) Use Amazon SageMaker Autopilot to build a regression model to predict the stock returns. Identify the most predictive features based on an Amazon SageMaker Clarify report.
Correct Answer: D
Amazon SageMaker Autopilot is a fully managed solution that automatically explores different ML models and selects the most effective ones for a given prediction task. After model training, Amazon SageMaker Clarify can generate feature importance scores, identifying the top features in a straightforward, automated manner with minimal manual intervention.
By using SageMaker Autopilot, the financial analyst can obtain the desired feature importance ranking for predictive attributes with minimal setup and low operational overhead, as opposed to manually configuring and training models in SageMaker.
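For orientation, here is a minimal, hedged boto3 sketch of launching such an Autopilot regression job; the bucket paths, role ARN, and target column name are hypothetical placeholders. Autopilot produces a Clarify explainability report (feature attributions) among the best candidate's artifacts, from which the top attributes can be read off:

```python
import boto3

# Hypothetical S3 locations, IAM role, and target column -- replace with real values.
sm = boto3.client("sagemaker", region_name="us-east-1")

sm.create_auto_ml_job(
    AutoMLJobName="stock-returns-autopilot",
    ProblemType="Regression",
    AutoMLJobObjective={"MetricName": "MSE"},
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/stock-returns/train/",
        }},
        "TargetAttributeName": "future_return",
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/stock-returns/output/"},
    RoleArn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
```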
Question.64 A data scientist is building a new model for an ecommerce company. The model will predict how many minutes it will take to deliver a package. During model training, the data scientist needs to evaluate model performance. Which metrics should the data scientist use to meet this requirement? (Select TWO.) (A) InferenceLatency (B) Mean squared error (MSE) (C) Root mean squared error (RMSE) (D) Precision (E) Accuracy
Correct Answer: B,C
For regression tasks that predict continuous numerical values, such as estimating delivery times, appropriate evaluation metrics quantify the difference between predicted and actual values. Two commonly used metrics are:
* Mean Squared Error (MSE): Calculates the average of the squares of the errors, providing a measure of the quality of an estimator.
* Root Mean Squared Error (RMSE): The square root of MSE, offering an error metric in the same units as the target variable, which aids in interpretability.
These metrics are standard for assessing the performance of regression models, especially when the goal is to predict continuous outcomes accurately.
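As a quick illustration with scikit-learn (the delivery-time values below are toy numbers, not data from the question):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy delivery-time predictions in minutes (hypothetical values).
y_true = np.array([30.0, 45.0, 25.0, 60.0])
y_pred = np.array([28.0, 50.0, 24.0, 55.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error, in minutes^2
rmse = np.sqrt(mse)                        # back in minutes, easier to interpret
print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}")  # MSE = 13.75, RMSE = 3.71
```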
Question.65 A chemical company has developed several machine learning (ML) solutions to identify chemical process abnormalities. The time series values of independent variables and the labels are available for the past 2 years and are sufficient to accurately model the problem. The regular operation label is marked as 0. The abnormal operation label is marked as 1. Process abnormalities have a significant negative effect on the company's profits. The company must avoid these abnormalities. Which metrics will indicate an ML solution that will provide the GREATEST probability of detecting an abnormality? (A) Precision = 0.91 Recall = 0.6 (B) Precision = 0.61 Recall = 0.98 (C) Precision = 0.7 Recall = 0.9 (D) Precision = 0.98 Recall = 0.8
Correct Answer: B
The question asks for the solution with the greatest probability of detecting an abnormality, which is exactly what recall measures. Precision is the ratio of true positives (TP) to the total number of predicted positives (TP + FP), where FP is false positives. Recall is the ratio of true positives (TP) to the total number of actual positives (TP + FN), where FN is false negatives. A high precision means the ML solution raises few false alarms, while a high recall means it detects a high share of the true abnormalities. For the chemical company, the goal is to avoid process abnormalities, which are marked as 1 in the labels, so the company needs an ML solution with a high recall for the positive class: one that detects most of the abnormalities and minimizes false negatives. Among the four options, option B has the highest recall for the positive class, 0.98, meaning the ML solution detects 98% of the abnormalities and misses only 2%. Option B also has a reasonable precision of 0.61, meaning that 39% of its abnormality alerts are false alarms, which may be acceptable depending on the company's cost-benefit analysis. The other options have lower recall for the positive class and therefore higher false negative rates, which are more detrimental for the company than false positives.
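To make the trade-off concrete, here is a small sketch (made-up labels, not data from the question) computing both metrics with scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth and predictions: 1 = abnormal operation, 0 = regular.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print("precision:", precision_score(y_true, y_pred))  # 4 / (4 + 3) ~= 0.57
print("recall:   ", recall_score(y_true, y_pred))     # 4 / (4 + 1) = 0.80
```

A model tuned for recall accepts more false alarms (lower precision) in exchange for missing fewer true abnormalities, which matches the company's stated priority.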