Question.31 An aircraft engine manufacturing company is measuring 200 performance metrics in a time-series. Engineers want to detect critical manufacturing defects in near-real time during testing. All of the data needs to be stored for offline analysis. What approach would be the MOST effective to perform near-real time defect detection? (A) Use AWS IoT Analytics for ingestion, storage, and further analysis. Use Jupyter notebooks from within AWS IoT Analytics to carry out analysis for anomalies. (B) Use Amazon S3 for ingestion, storage, and further analysis. Use an Amazon EMR cluster to carry out Apache Spark ML k-means clustering to determine anomalies. (C) Use Amazon S3 for ingestion, storage, and further analysis. Use the Amazon SageMaker Random Cut Forest (RCF) algorithm to determine anomalies. (D) Use Amazon Kinesis Data Firehose for ingestion and Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection. Use Kinesis Data Firehose to store data in Amazon S3 for further analysis.
Correct Answer: D
* The company wants to perform near-real time defect detection on a time series of 200 performance metrics and store all of the data for offline analysis. The most effective approach is to use Amazon Kinesis Data Firehose for ingestion and the Amazon Kinesis Data Analytics Random Cut Forest (RCF) function for anomaly detection, with Kinesis Data Firehose also delivering the data to Amazon S3 for further analysis.
* Amazon Kinesis Data Firehose is a service that can capture, transform, and deliver streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk. Kinesis Data Firehose can handle any amount and frequency of data and automatically scales to match the throughput. It can also compress, encrypt, and batch the data before delivering it to the destination, reducing storage costs and enhancing security.
* Amazon Kinesis Data Analytics is a service that can analyze streaming data in real time using SQL or Apache Flink applications. Kinesis Data Analytics can use built-in functions and algorithms to perform various analytics tasks, such as aggregations, joins, filters, windows, and anomaly detection. One of the built-in algorithms that Kinesis Data Analytics supports is Random Cut Forest (RCF), an unsupervised algorithm for detecting anomalies in streaming data. RCF assigns an anomaly score to each record based on how distant it is from the rest of the data, and it can score records that contain multiple related numeric attributes, such as the 200 performance metrics collected during engine testing.
* Therefore, the company can use the following architecture to build the near-real time defect detection solution:
* Use Amazon Kinesis Data Firehose for ingestion: The company can use Kinesis Data Firehose to capture the streaming data from the aircraft engine testing, deliver it to Amazon S3, and serve as the streaming source for Amazon Kinesis Data Analytics. The company can configure the Kinesis Data Firehose delivery stream to specify the source, the buffer size and interval, the compression and encryption options, the error handling and retry logic, and the destination details.
* Use Amazon Kinesis Data Analytics Random Cut Forest (RCF) to perform anomaly detection:
The company can use Kinesis Data Analytics to create a SQL application that reads the streaming data from the Kinesis Data Firehose delivery stream and applies the RCF algorithm to detect anomalies. The company can use the RANDOM_CUT_FOREST or RANDOM_CUT_FOREST_WITH_EXPLANATION functions to compute the anomaly scores (and attributions) for each data point, and use a WHERE clause to keep only the records whose anomaly score exceeds a chosen threshold. Inside these functions, a CURSOR expression specifies the input stream, and a pump (CREATE OR REPLACE PUMP) writes the results to an in-application output stream that can be delivered to another destination, such as Amazon Kinesis Data Streams, Kinesis Data Firehose, or AWS Lambda (see the sketch after this list).
* Use Kinesis Data Firehose to store data in Amazon S3 for further analysis: The company can use Kinesis Data Firehose to store the raw and processed data in Amazon S3 for offline analysis. The company can use the S3 destination of the Kinesis Data Firehose delivery stream to store the raw data, and use another Kinesis Data Firehose delivery stream to store the output of the Kinesis Data Analytics application. The company can also use AWS Glue or Amazon Athena to catalog, query, and analyze the data in Amazon S3.
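To make this wiring concrete, here is a minimal Python sketch. It assumes the boto3 SDK and uses hypothetical stream, field, and column names; the embedded SQL follows the documented RANDOM_CUT_FOREST pattern and is an illustration, not production code.

```python
import json

import boto3

# Producer side: push one record of engine metrics (only a few shown here) to a
# Firehose delivery stream. Stream and field names are hypothetical.
firehose = boto3.client("firehose")
record = {"engine_id": "engine-042", "vibration": 0.87, "egt_temp_c": 642.5}
firehose.put_record(
    DeliveryStreamName="engine-metrics-ingest",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)

# SQL application code for Kinesis Data Analytics: RANDOM_CUT_FOREST appends an
# ANOMALY_SCORE column to every input row, and a pump writes the scored rows to
# an in-application output stream that can be delivered onward (for example to a
# second Firehose delivery stream and then Amazon S3).
RCF_APPLICATION_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "engine_id"     VARCHAR(32),
    "vibration"     DOUBLE,
    "egt_temp_c"    DOUBLE,
    "ANOMALY_SCORE" DOUBLE
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
        SELECT STREAM "engine_id", "vibration", "egt_temp_c", ANOMALY_SCORE
        FROM TABLE(
            RANDOM_CUT_FOREST(
                CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")
            )
        );
"""
```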
What Is Amazon Kinesis Data Firehose?
What Is Amazon Kinesis Data Analytics for SQL Applications?
RANDOM_CUT_FOREST Function – Amazon Kinesis Data Analytics SQL Reference
Question.32 A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model. What should the Specialist do to prepare the data for model training? (A) Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution. (B) Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude. (C) Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude. (D) Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.
Correct Answer: C
Normalization is a data preprocessing technique that can be used to scale the input features to a common range, such as [-1, 1] or [0, 1]. Normalization can help reduce the effect of outliers, improve the convergence of gradient-based algorithms, and prevent variables with a larger magnitude from dominating the model. One common method of normalization is standardization, which transforms each feature to have a mean of 0 and a variance of 1. This can be done by subtracting the mean and dividing by the standard deviation of each feature. Standardization can be useful for models that assume the input features are normally distributed, such as linear regression, logistic regression, and support vector machines. References:
* Data normalization and standardization: A video that explains the concept and benefits of data normalization and standardization.
* Standardize or Normalize?: A blog post that compares different methods of scaling the input features.
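As a concrete illustration of the standardization described above, here is a minimal Python sketch using scikit-learn (the feature values are made up for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two economic indicators on very different scales (illustrative values):
# GDP per capita in dollars vs. unemployment rate in percent.
X = np.array([
    [65_000.0, 3.7],
    [48_000.0, 5.2],
    [31_000.0, 8.9],
    [72_000.0, 3.1],
])

# Fit on the training data only, then reuse the same scaler for new data,
# so each feature ends up with mean 0 and variance 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```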
Question.33 A company wants to create an artificial intelligence (AI) yoga instructor that can lead large classes of students. The company needs to create a feature that can accurately count the number of students who are in a class. The company also needs a feature that can differentiate students who are performing a yoga stretch correctly from students who are performing a stretch incorrectly. To determine whether students are performing a stretch correctly, the solution needs to measure the location and angle of each student’s arms and legs. A data scientist must use Amazon SageMaker to process video footage of a yoga class by extracting image frames and applying computer vision models. Which combination of models will meet these requirements with the LEAST effort? (Select TWO.) (A) Image Classification (B) Optical Character Recognition (OCR) (C) Object Detection (D) Pose estimation (E) Image Generative Adversarial Networks (GANs)
Correct Answer: C,D
To count the number of students who are in a class, the solution needs to detect and locate each student in the video frame. Object detection is a computer vision model that can identify and locate multiple objects in an image. To differentiate students who are performing a stretch correctly from students who are performing a stretch incorrectly, the solution needs to measure the location and angle of each student’s arms and legs. Pose estimation is a computer vision model that can estimate the pose of a person by detecting the position and orientation of key body parts. Image classification, OCR, and image GANs are not relevant for this use case. References:
* Object Detection: A computer vision technique that identifies and locates objects within an image or video.
* Pose Estimation: A computer vision technique that estimates the pose of a person by detecting the position and orientation of key body parts.
* Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.
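To illustrate how the two models combine (using hypothetical model outputs and keypoint names, since the exact detector and pose model are not specified), counting students reduces to counting confident person detections, and checking a stretch reduces to computing joint angles from pose keypoints:

```python
import numpy as np

# Hypothetical model outputs for one extracted video frame:
# - object detection returns one bounding box per detected person
# - pose estimation returns (x, y) keypoints per person
person_boxes = [
    {"confidence": 0.97, "box": (120, 80, 260, 400)},
    {"confidence": 0.91, "box": (300, 90, 450, 410)},
]
keypoints = {"shoulder": (140.0, 150.0), "elbow": (170.0, 210.0), "wrist": (230.0, 215.0)}

# Feature 1: class size = number of confident person detections.
num_students = sum(1 for det in person_boxes if det["confidence"] >= 0.5)

def joint_angle(a, b, c):
    """Angle in degrees at point b formed by the segments b->a and b->c."""
    a, b, c = np.asarray(a), np.asarray(b), np.asarray(c)
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

# Feature 2: elbow angle, which can be compared against the expected pose.
elbow_angle = joint_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
print(num_students, round(elbow_angle, 1))
```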
Question.34 A company’s machine learning (ML) specialist is designing a scalable data storage solution for Amazon SageMaker. The company has an existing TensorFlow-based model that uses a train.py script. The model relies on static training data that is currently stored in TFRecord format. What should the ML specialist do to provide the training data to SageMaker with the LEAST development overhead? (A) Put the TFRecord data into an Amazon S3 bucket. Use AWS Glue or AWS Lambda to reformat the data to protobuf format and store the data in a second S3 bucket. Point the SageMaker training invocation to the second S3 bucket. (B) Rewrite the train.py script to add a section that converts TFRecord data to protobuf format. Point the SageMaker training invocation to the local path of the data. Ingest the protobuf data instead of the TFRecord data. (C) Use SageMaker script mode, and use train.py unchanged. Point the SageMaker training invocation to the local path of the data without reformatting the training data. (D) Use SageMaker script mode, and use train.py unchanged. Put the TFRecord data into an Amazon S3 bucket. Point the SageMaker training invocation to the S3 bucket without reformatting the training data.
Correct Answer: D
Amazon SageMaker script mode allows users to bring custom training scripts (such as train.py) without needing extensive modifications for specific data formats like TFRecord. By storing the TFRecord data in an Amazon S3 bucket and pointing the SageMaker training job to this bucket, the model can directly access the data, allowing the ML specialist to train the model without additional reformatting or data processing steps.
This approach minimizes development overhead and leverages SageMaker’s built-in support for custom training scripts and S3 integration, making it the most efficient choice.
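A minimal sketch of this setup with the SageMaker Python SDK might look like the following; the role ARN, bucket, instance type, and framework versions are placeholders:

```python
from sagemaker.tensorflow import TensorFlow

# train.py is used unchanged and reads the TFRecord files that SageMaker copies
# into the training channel (exposed to the script via the SM_CHANNEL_TRAINING
# environment variable).
estimator = TensorFlow(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.11",
    py_version="py39",
)

# Point the training job at the existing TFRecord data in S3 -- no reformatting.
estimator.fit({"training": "s3://example-training-bucket/tfrecords/"})
```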
Question.35 An agency collects census information within a country to determine healthcare and social program needs by province and city. The census form collects responses for approximately 500 questions from each citizen. Which combination of algorithms would provide the appropriate insights? (Select TWO.) (A) The factorization machines (FM) algorithm (B) The Latent Dirichlet Allocation (LDA) algorithm (C) The principal component analysis (PCA) algorithm (D) The k-means algorithm (E) The Random Cut Forest (RCF) algorithm
Correct Answer: C,D
* The agency wants to analyze the census data for population segmentation, which is a type of unsupervised learning problem that aims to group similar data points together based on their attributes.
The agency can use a combination of algorithms that can perform dimensionality reduction and clustering on the data to achieve this goal.
* Dimensionality reduction is a technique that reduces the number of features or variables in a dataset while preserving the essential information and relationships. Dimensionality reduction can help improve the efficiency and performance of clustering algorithms, as well as facilitate data visualization and interpretation. One of the most common algorithms for dimensionality reduction is principal component analysis (PCA), which transforms the original features into a new set of orthogonal features called principal components that capture the maximum variance in the data. PCA can help reduce the noise and redundancy in the data and reveal the underlying structure and patterns.
* Clustering is a technique that partitions the data into groups or clusters based on their similarity or distance. Clustering can help discover the natural segments or categories in the data and understand their characteristics and differences. One of the most popular algorithms for clustering is k-means, which assigns each data point to one of k clusters based on the nearest mean or centroid. K-means can handle large and high-dimensional datasets and produce compact and spherical clusters.
* Therefore, the combination of algorithms that would provide the appropriate insights for population segmentation are PCA and k-means. The agency can use PCA to reduce the dimensionality of the census data from 500 features to a smaller number of principal components that capture most of the variation in the data. Then, the agency can use k-means to cluster the data based on the principal components and identify the segments of the population that share similar characteristics.
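The two-step workflow can be sketched as follows; scikit-learn is used here only for brevity, and the number of citizens, components, and clusters are illustrative assumptions (the same pipeline maps onto the SageMaker built-in PCA and K-Means algorithms referenced below):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
responses = rng.normal(size=(10_000, 500))  # stand-in for 10k citizens x 500 answers

# Scale the answers, then reduce the 500 features to a handful of principal components.
X = StandardScaler().fit_transform(responses)
components = PCA(n_components=20).fit_transform(X)

# Cluster the compressed representation into population segments.
segments = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(components)
print(np.bincount(segments))  # number of citizens per segment
```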
Amazon SageMaker Principal Component Analysis (PCA)
Amazon SageMaker K-Means Algorithm