New D-DS-FN-23 Dumps (V8.02) – Offering the Best Preparation Materials to Ensure Your Success in DELL EMC D-DS-FN-23 Exam

1. In a decision tree, what is an example of a pure node?

25 positives; 75 negatives

50 positives; 50 negatives

75 positives; 25 negatives

100 positives; 0 negatives
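
One common way to quantify node purity is Shannon entropy over the class counts, which is zero only when every record in the node belongs to a single class. The sketch below assumes entropy as the purity measure (the question does not name one) and uses a hypothetical helper, `node_entropy`, to score the four choices above.

```python
import math

def node_entropy(positives, negatives):
    """Shannon entropy (in bits) of a two-class node; 0.0 means the node is pure."""
    total = positives + negatives
    entropy = 0.0
    for count in (positives, negatives):
        if count > 0:  # skip empty classes: p * log2(p) tends to 0 as p tends to 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Class counts taken from the four answer choices above.
for pos, neg in [(25, 75), (50, 50), (75, 25), (100, 0)]:
    print(f"{pos} positives; {neg} negatives -> entropy {node_entropy(pos, neg):.3f}")
```

A 50/50 split gives the maximum entropy of 1 bit, and the score falls toward zero as one class dominates.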

2. When would you prefer a Naive Bayes model to a logistic regression model for classification?

When you are using several categorical input variables with over 1000 possible values each.

When you need to estimate the probability of an outcome, not just which class it is in.

When all the input variables are numerical.

When some of the input variables might be correlated.

3. What is an appropriate assignment for a data scientist?

Monitor key performance indicators

Define an OLAP database schema

Conduct customer surveys

Develop predictive models

4. What is the output format from the Map function of MapReduce?

Key-value pairs

Binary representation of keys concatenated with structured data

Compressed index

Unique key record and separate records of all possible values
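
To make the shape of a map step's output concrete, here is a toy word-count mapper in plain Python. It is only an illustration: `map_words` is a hypothetical function, not part of the Hadoop API, and the input line is made up.

```python
def map_words(line):
    """Toy map step: emit one (word, 1) key-value pair per word in the line."""
    for word in line.lower().split():
        yield (word, 1)

# Made-up input line, for illustration only.
pairs = list(map_words("big data big insight"))
print(pairs)
```

A reduce step would then receive all values that share a key (here, every `1` emitted for the same word) and combine them.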

5. What does the R code z <- f[1:10, ] do?

Assigns the first 10 rows of f to the vector z

Assigns the 1st 10 columns of the 1st row of f to z

Assigns a sequence of values from 1 to 10 to z

Assigns the 1st 10 columns to z

6. What is a core deliverable at the end of the analytic project?

An implemented database design

A whitepaper describing the project and the implementation

A presentation for project sponsors

The training materials

7. Consider the following SQL statement:

SELECT employee_id, year, salary, avg(salary)

OVER

(PARTITION BY employee_id ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as result_1

FROM employee

ORDER BY employee_id, year

For each employee_id, what is returned as result_1?

Three year rolling average salary

Four year rolling average salary

Average salary across all employee_id values

Average employee_id
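
The frame `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` can be traced by hand with a small Python sketch. The function name `rolling_avg_by_employee` and the sample rows are assumptions made for illustration; the code only mimics the window frame from the statement above and is not how a database engine implements it.

```python
from collections import defaultdict

def rolling_avg_by_employee(rows, preceding=2):
    """Mimic avg(salary) OVER (PARTITION BY employee_id ORDER BY year
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) for (employee_id, year, salary) tuples."""
    partitions = defaultdict(list)
    for employee_id, year, salary in rows:
        partitions[employee_id].append((year, salary))

    result = []
    for employee_id in sorted(partitions):
        ordered = sorted(partitions[employee_id])  # ORDER BY year
        for i, (year, salary) in enumerate(ordered):
            # Frame: the current row plus at most `preceding` rows before it.
            frame = [s for _, s in ordered[max(0, i - preceding): i + 1]]
            result.append((employee_id, year, salary, sum(frame) / len(frame)))
    return result

# Made-up sample data, for illustration only.
sample = [(1, 2020, 100), (1, 2021, 110), (1, 2022, 120), (1, 2023, 130), (2, 2022, 90)]
for row in rolling_avg_by_employee(sample):
    print(row)
```

Note that the first rows of each partition average over fewer rows, because fewer than two preceding rows exist.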

8. What is the mandatory clause that must be included when using window functions?

OVER

RANK

PARTITION BY

RANK BY

9. In a fitted ARIMA(1,2,3) model, how many differences are applied?

0

1

2

3

10. If R factors are categorical variables, to which data classification level are they most closely related?

Nominal

Ordinal

Interval

Ratio

11. Consider this SQL statement:

SELECT product, prod_cost, avg(prod_cost) OVER (PARTITION BY product)

FROM product_detail

The OVER clause makes this what type of function?

Window function

Aggregate function

System function

User-defined function

12. In a Student's t-test, what is the meaning of the p-value?

It is the area under the appropriate tails of the Student's distribution

It is the "power" of the Student's t-test

It is the mean of the distribution for the null hypothesis

It is the mean of the distribution for the alternate hypothesis

13. Consider these itemsets:

(hat, scarf, coat)

(hat, scarf, coat, gloves)

(hat, scarf, gloves)

(hat, gloves)

(scarf, coat, gloves)

What is the confidence of the rule (gloves -> hat)?

75%

60%

66%

80%
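
The confidence of a rule X -> Y is the number of itemsets containing both X and Y divided by the number containing X. A minimal Python sketch over the five itemsets listed above; the helper name `confidence` is an assumption made for illustration.

```python
transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]

def confidence(antecedent, consequent, baskets):
    """confidence(X -> Y) = count(baskets with X and Y) / count(baskets with X)."""
    with_x = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_x if consequent <= b]
    return len(with_both) / len(with_x)

print(confidence({"gloves"}, {"hat"}, transactions))
```

Four itemsets contain gloves and three of those also contain a hat, so the ratio is 3/4.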

14. During the data preparation phase, you notice a high correlation between average spend on video games, age of players, and number of science fiction shows watched.

Which technique could you use to address the three correlated variables?

Square the three variables to remove the correlation

Combine the three variables into one new variable

Drop the three variables to improve the model

Use scaling to make the three variables equivalent in size

15. You are attempting to find the Euclidean distance between two centroids:

Centroid A's coordinates: (X = 2, Y = 4)

Centroid B's coordinates (X = 8, Y = 10)

Which formula finds the correct Euclidean distance?

SQRT((2-8)^2 + (4-10)^2) or 8.49

SQRT(((2-8) x 2) + ((4-10) x 2)) or 12.17

((2-8)^2 + (4-10)^2) or 72

((2-8) x 2 + (4-10) x 2) or 148
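
The Euclidean distance squares each coordinate difference, sums the squares, and takes the square root. The sketch below checks the arithmetic for the two centroids in the question, once by hand and once with the standard library's `math.dist` (available from Python 3.8); the variable names are illustrative.

```python
import math

centroid_a = (2, 4)
centroid_b = (8, 10)

# Square each coordinate difference, sum the squares, then take the square root.
by_hand = math.sqrt((centroid_a[0] - centroid_b[0]) ** 2
                    + (centroid_a[1] - centroid_b[1]) ** 2)

# math.dist computes the same Euclidean distance.
print(round(by_hand, 2), round(math.dist(centroid_a, centroid_b), 2))
```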

16. In linear regression modeling, which action can be taken to improve the linearity of the relationship between the dependent and independent variables?

Apply a transformation to a variable

Use a different statistical package

Calculate the R-Squared value

Change the units of measurement on the independent variable

17. Which chart type is intended to display correlations between sets of numeric data?

Scatterplot

Histogram

Pie chart

Line Chart

18. What does the Receiver Operating Characteristic (ROC) curve show?

Relationship between p-value and true positive rate

Relationship between p-value and true negative rate

Relationship between true positive rate and false positive rate

Relationship between true positive rate and true negative rate

19. A fair six-sided die is rolled. Let A denote the event that an odd number is rolled. Let C denote the event that a 1, 2, or 3 is rolled.

What is the value of the conditional probability, P(C|A)?

2/3

1/2

1/3

1/4
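
For equally likely outcomes, P(C|A) reduces to counting: the outcomes in both A and C, divided by the outcomes in A. A small Python sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}                  # fair die: every face equally likely
event_a = {n for n in outcomes if n % 2 == 1}  # an odd number is rolled
event_c = {1, 2, 3}                            # a 1, 2, or 3 is rolled

# P(C|A) = P(A and C) / P(A); with equally likely outcomes this is a ratio of counts.
p_c_given_a = Fraction(len(event_a & event_c), len(event_a))
print(p_c_given_a)
```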

20. Which word or phrase completes the statement? Business Intelligence is to ad-hoc reporting and dashboards as Data Science is to __________.

Optimization and Predictive Modeling

Alerts and Queries

Structured Data and Data Sources

Sales and profit reporting

21. Which method is used to solve for the coefficients b0, b1, ..., bn in the linear regression model Y = b0 + b1x1 + b2x2 + ... + bnxn?

Ordinary Least squares

Apriori Algorithm

Ridge and Lasso

Integer programming

22. What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?

Linear regression

Expected value

Variance

Quantiles

23. Refer to the exhibit.

You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-sum-of-squares (wss) data as shown in the exhibit.

How many customer groups should you specify?

2

3

4

8

24. Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?

Define the process to maintain the model

Try different analytical techniques

Try different variables

Transform existing variables

25. Which word or phrase completes the statement? “A theater actor is to ‘artistic and expressive’ as a data scientist is to __________.”

Communicative and collaborative

Introverted and technical

Logical and steadfast

Independent and intelligent

26. When is the GROUP BY ROLLUP clause used in an OLAP query?

All subtotals and grand totals are to be included in the output

Subtotals are only to be included in the output

Grand totals are only to be included in the output

Specific subtotals and grand totals for a combination of variables are only to be included in the output

27. You have run the association rules algorithm on your data set, and the two rules {banana, apple} => {grape} and {apple, orange}=> {grape} have been found to be relevant.

What else must be true?

{grape, apple, orange} must be a frequent itemset.

{banana, apple, grape, orange} must be a frequent itemset.

{grape} => {banana, apple} must be a relevant rule.

{banana, apple} => {orange} must be a relevant rule.

28. Which type of numeric value does a logistic regression model estimate?

Probability

A p-value

Any integer

Any real number

29. You are having a discussion with a business colleague. The colleague mentions that they want to perform K-means clustering on text file data stored in HDFS.

Which tool should be recommended?

Mahout

HBase

Scribe

Sqoop

30. In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

Discovery

Data Preparation

Model Building

Communicate Results

31. Your colleague, who is new to Hadoop, approaches you with a question. They want to know how best to access their data. This colleague has a strong background in data flow languages and programming.

Which query interface would you recommend?

Pig

Hive

Howl

HBase

32. What is a consideration when building decision trees?

Cannot handle variables that affect the outcome in a discontinuous way

Short decision trees are likely subject to overfit

Correlated variables can cause double-counting

Tree structure is sensitive to small changes in the training data

33. You need to run a hypothesis test across three normally distributed populations.

Which technique should you use?

Z-test

Welch's t-test

ANOVA

Wilcoxon rank sum test

34. The Marketing department of your company wishes to track opinion on a new product that was recently introduced. Marketing would like to know how many positive and negative reviews are appearing over a given period and potentially retrieve each review for more in-depth insight.

They have identified several popular product review blogs that historically have published thousands of user reviews of your company’s products. You have been asked to provide the desired analysis.

You examine the RSS feeds for each blog and determine which fields are relevant. You then craft a regular expression to match your new product’s name and extract the relevant text from each matching review.

What is the next step you should take?

Convert the extracted text into a suitable document representation and index into a review corpus

Use the extracted text and your regular expression to perform a sentiment analysis based on mentions of the new product

Read the extracted text for each review and manually tabulate the results

Group the reviews using Naïve Bayesian classification

35. Which process in text analysis can be used to reduce dimensionality?

Stemming

Parsing

Digitizing

Sorting

36. Which analytical method is considered unsupervised?

K-means clustering

Naïve Bayesian classifier

Decision tree

Linear regression

37. Refer to the exhibit.

Which type of data issue would you suspect based on the exhibit?

"Saturated" data, indicating potential issues with data definitions

Incomplete data, indicating potential issues with data transmission

Mis-scaled data, indicating potential issues with data entry

The exhibit does not raise any obvious concerns with the data.

38. You have created a Logistic Regression model to predict customer churn for your company. The company’s Marketing department wants to use your model to identify at-risk customers and offer incentives to keep them from leaving.

Using two different thresholds for the model provides the two confusion matrices shown in the graphic. Marketing understands the relative costs of missing at-risk customers versus offering incentives to customers who are not at risk. Therefore, you need their advice on how to set the appropriate threshold on the churn model.

You are meeting with the Marketing team. In the meeting, you plan to state: “Raising the threshold from 0.5 to 0.75 reduces the number of unnecessary incentives that can be offered, at the cost of missing more of the customers who churned.”

What is the most appropriate visual to reinforce this statement?

[Exhibit: four candidate visuals, labeled A through D]

Option A

Option B

Option C

Option D

39. Your customer provided you with 2,000 unlabeled records and asked you to separate them into three groups.

What is the correct analytical method to use?

K-means clustering

Linear regression

Naive Bayesian classification

Logistic regression

40. How is dimensionality defined in a "bag of words" document representation?

Average number of words per sentence in the document

Total number of words in the document

Number of unique terms in the document

Frequency of repeated words in the document

41. You received 100,000 home loan records and want to quickly determine if there is any correlation between mortgage age and mortgage amount before conducting advanced analysis.

Which tool should be used for the preliminary analysis?

Scatter plot

Stacked Bar chart

Box and Whisker plot

Histogram

42. What is the output of the K-means clustering algorithm?

Centroid positioning and entropy of each record in each cluster

Center of each discovered cluster and mapping of each record to a cluster

Two dimensional representation of the data and the clusters

Intercept and coefficients for each input variable in the dataset

43. You are provided with the following list.

Which window function is missing?

cume_dist()

dense_rank()

rank()

percent_rank()

first_value()

last_value()

lag()

lead()

ntile()

row_preceding()

row_number()

median()

cumulative_sum()

44. In text analysis, what makes the corpus representation dynamic?

Algorithms used to determine the classification or tagging

Search and retrieval process for finding the document that meets the search criteria

Inherent high dimensionality in the problem of text analysis

Requirement to update index and corpus metrics continuously

45. How are window functions different from regular aggregate functions?

Rows retain their separate identities and the window function can access more than the current row.

Rows are grouped into an output row and the window function can access more than the current row.

Rows retain their separate identities and the window function can only access the current row.

Rows are grouped into an output row and the window function can only access the current row.

46. You have created a Linear Regression model to predict total sales based on variables M, N, P and Q as shown in the graphic. You originally expected all variables to have positive coefficients.

Which action would you take?

Accept all variables and begin model validation steps against holdout data

Accept only positive variables and investigate potential correlation with the dependent variable

Accept only statistically significant variables and investigate correlated independent variables

Accept none of the variables and investigate correlations between all variables

47. You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. All the data currently available to you has been loaded into your analytics database; revenue data, pricing data, and online transaction data.

You find that all the data comes in different levels of granularity. The transaction data has timestamps (day, hour, minutes, seconds), pricing is stored at the daily level, and revenue data is only reported monthly.

What is your next step?

Report back to the business owner that the current data model does not support the business question.

Interpolate a daily model for revenue from the monthly revenue data.

Aggregate all data to the monthly level in order to create a monthly revenue model.

Disregard revenue as a driver in the pricing model, and create a daily model based on pricing and transactions only.

48. Which key role for a successful analytic project can provide business domain expertise with a deep understanding of the data and key performance indicators?

Business Intelligence Analyst

Project Manager

Project Sponsor

Business User

49. A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and transformed through a complex, multi-stage ETL process.

What is a concern the data scientist should have about the data?

It is too processed

It is not structured

It is not normalized

It is too centralized

50. You have just completed the Discovery phase of a project and finished interviewing the main stakeholders. You have identified the necessary data feeds and are now beginning to set up the analytic sandbox.

What is the next step?

Assess data quality

Perform ELT / ETL

Create data visualizations

Run descriptive statistics for several data sets

51. In which lifecycle stage are appropriate analytical techniques determined?

Model planning

Model building

Data preparation

Discovery

52. What is holdout data?

A subset of the provided data set selected at random and used to validate the model

A subset of the provided data set selected at random and used to initially construct the model

A subset of the provided data set that is removed by the data scientist because it contains data errors

A subset of the provided data set that is removed by the data scientist because it contains outliers

53. In a t-test with unknown variance, what values are used to calculate the t-statistic?

Sample mean, sample standard deviation, and sample size

Mean, sample standard deviation, and population size

Sample mean, standard deviation, and sample size

Mean, standard deviation, and population size

54. Which participant in a data analytics project is typically responsible for assessing the validity of the model?

Data scientist

Business user

Project sponsor

Project manager

55. In a user-defined aggregate function, what is SFUNC?

Window function

State transition function

Final calculation function

Segment-level calculation function

56. What is required in a presentation for project sponsors?

The "Big Picture" takeaways for executive level stakeholders

Data warehouse design changes

Line by line review of the developed code

Detailed statistical basis for the modeling approach used in the project

57. Consider the following itemsets:

(hat, scarf, coat)

(hat, scarf, coat, gloves)

(hat, scarf, gloves)

(hat, gloves)

(scarf, coat, gloves)

If the minimum support is 50%, what represents the complete list of frequent 2-itemsets?

(hat, scarf), (hat, gloves)

(hat, scarf), (scarf, coat), (coat, gloves)

(scarf, gloves), (scarf, coat) (hat, gloves)

(hat, scarf), (hat, gloves), (scarf, gloves), (scarf, coat)
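
The support of a 2-itemset is the fraction of itemsets that contain both items. The sketch below counts every pair by brute force over the five itemsets above and keeps those at or above the 50% threshold. It is a direct count for illustration, not the Apriori algorithm, and `frequent_pairs` is a hypothetical helper.

```python
from itertools import combinations

transactions = [
    {"hat", "scarf", "coat"},
    {"hat", "scarf", "coat", "gloves"},
    {"hat", "scarf", "gloves"},
    {"hat", "gloves"},
    {"scarf", "coat", "gloves"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for every 2-itemset whose support >= min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for pair in combinations(items, 2):
        support = sum(1 for b in baskets if set(pair) <= b) / len(baskets)
        if support >= min_support:
            result[pair] = support
    return result

for pair, support in frequent_pairs(transactions, 0.5).items():
    print(pair, support)
```

Each surviving pair appears in three of the five itemsets (support 0.6); the pairs (hat, coat) and (coat, gloves) appear in only two and fall below the threshold.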

58. Which activity is performed in the Operationalize phase of the data analytics lifecycle?

Try different variables

Try different analytical techniques

Assess the benefits

Transform existing variables

59. Which ROC curve represents a perfect model fit?

[Exhibit: four ROC curves, labeled A through D]

Exhibit A

Exhibit B

Exhibit C

Exhibit D

60. Which Hadoop service is responsible for requesting resources for, and monitoring the completion of, MapReduce processes?

Application Manager

NameNode

Application Master

DataNode

61. To ensure a successful analytic project, which key role can consult and advise the project team on the value of end results and how these will be used on a daily basis?

Business User

Project Manager

Data Scientist

Business Intelligence Analyst

62. Which word or phrase completes the statement? Emphasis color is to standard color as _______.

Main message is to context

Main message is to key findings

Frequent item set is to item

Pie chart is to proportions

63. A data scientist is preparing a presentation for a meeting with the project’s business sponsors. The distribution of per-sale revenue is an important finding from the analysis. The graphics illustrate four ways to plot the per-sale revenue distribution.

Which graphic is most appropriate for the sponsor presentation?

Figure A

Figure B

Figure C

Figure D

64. You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. You have tested all the theoretical models in the previous model planning stage, and all tests have yielded statistically insignificant results.

What is your next step?

Report that the results are insignificant, and reevaluate the original business question.

Run all the models again against a larger sample, leveraging more historical data.

Move forward on the model with the highest significance scores relative to the others.

Modify samples used by the models and iterate until a significant result occurs.

65. A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units.

Which action should the team recommend?

The manufacturing process should be inspected for problems.

A larger sample size should be taken to determine if the plant is functioning properly

A smaller sample size should be taken to determine if the plant is functioning properly

The manufacturing process is functioning properly and no further action is required.

66. Data visualization is used in the final presentation of an analytics project.

For what else is this technique commonly used?

Assessing data quality

Descriptive statistics

ETLT

Model selection

67. Refer to the exhibit.

The exhibit provides the decision tree for predicting whether someone is a good or bad credit risk.

What would be the assigned probability, p(good), of a single male with no known savings?

0.83

0

0.498

0.6

68. Which SQL OLAP extension provides all possible grouping combinations?

CUBE

ROLLUP

UNION ALL

CROSS JOIN
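
The difference between the ROLLUP and CUBE extensions comes down to which grouping sets each one generates. The Python sketch below enumerates them for two columns; the function names and the column names `region` and `product` are assumptions made for illustration, and no database is involved.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Every subset of the columns, as CUBE generates (2**n grouping sets)."""
    return [combo
            for r in range(len(columns), -1, -1)
            for combo in combinations(columns, r)]

def rollup_grouping_sets(columns):
    """Only the left-to-right prefixes, as ROLLUP generates (n + 1 grouping sets)."""
    return [tuple(columns[:r]) for r in range(len(columns), -1, -1)]

cols = ["region", "product"]  # hypothetical column names
print(cube_grouping_sets(cols))
print(rollup_grouping_sets(cols))
```

The empty tuple stands for the grand total. With two columns, the first list has one more entry than the second: the subtotal by `product` alone.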

69. Assume you are performing an analysis for fraud detection on credit card usage. You will need to ensure that higher-risk transactions, which may indicate fraudulent credit card activity, are retained in your data for analysis and not dropped as outliers during pre-processing.

What is the approach for loading data into the analytical sandbox for this analysis?

ELT

ETL

EDW

OLTP

70. What type of data is represented in the exhibit?

Structured

Unstructured

Quasi-structured

Semi-structured

71. When is a Wilcoxon Rank-Sum test used?

When an assumption about the distribution of the populations cannot be made

When the data can be easily sorted

When the populations represent the sums of other values

When the data cannot be easily sorted

72. Refer to the Exhibit.

For effective visualization, what is the chart's primary flaw?

The use of 3 dimensions.

The slanting of axis labels.

The location of the legend.

The order of the columns.

73. What requests resources from YARN during a MapReduce job?

Map and reduce tasks

ApplicationMaster

ApplicationsManager

DataNodes

74. Since R factors are categorical variables, they are most closely related to which data classification level?

nominal

ordinal

interval

ratio

75. What is a distinct property of Logistic Regression compared with Linear Regression?

Logistic Regression handles missing values well

Logistic Regression is robust with redundant or correlated variables

Logistic Regression returns probability estimates of an event

Logistic Regression works well with discrete variables that have many distinct values

76. You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%.

What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited?

42.0

4.2

0.42

0.042

77. Consider the example of an analysis for fraud detection on credit card usage. You will need to ensure higher-risk transactions that may indicate fraudulent credit card activity are retained in your data for analysis, and not dropped as outliers during pre-processing.

What will be your approach for loading data into the analytical sandbox for this analysis?

ELT

ETL

EDW

OLTP

78. What is an appropriate data visualization to use in a presentation for an analyst audience?

Pie chart

Area chart

Stacked bar chart

ROC curve

79. How is HDFS defined?

Large “web table” capable of holding millions of rows and millions of columns

Row-column oriented datastore supporting redundancy and high availability

Reliable, redundant distributed file system

Reliable file system stored on a single extensible storage platform

80. Which word or phrase completes the statement? Structured data is to OLAP data as quasi-structured data is to __________.

Clickstream data

XML data

Text documents

Image files

81. You have been assigned to run a logistic regression model for each of 100 countries, and all the data is currently stored in a PostgreSQL database.

Which tool/library would you use to produce these models with the least effort?

MADlib

Mahout

RStudio

HBase

82. A data scientist plans to classify the sentiment polarity of 10,000 product reviews collected from the Internet.

What is the most appropriate model to use? Suppose labeled training data is available.

Naïve Bayesian classifier

Linear regression

Logistic regression

K-means clustering

83. What does the R code nv <- v[v < 1000] do?

Selects the values in vector v that are less than 1000 and assigns them to the vector nv

Sets nv to TRUE or FALSE depending on whether all elements of vector v are less than 1000

Removes elements of vector v less than 1000 and assigns the elements >= 1000 to nv

Selects values of vector v less than 1000, modifies v, and makes a copy to nv

84. You have run a Linear Regression model on the data shown in the graphic.

Which value is a reasonable guess for R-squared?

-.8

.8

.25

1.25

85. You have created a scatterplot of two continuous variables for 2000 records. You want to add a line to the scatterplot to check linearity of the data.

Which function would best address this need?

abline()

glm()

hist()

lm()

86. Why do the Naïve Bayesian classifier implementations use the log of probability value rather than the pure probability value?

To ensure the conditional independence of attribute values

To avoid numerical underflow errors in high dimensional problems

To obtain a more accurate estimate of the probabilities without the need for a Laplace smoothing

To invalidate the variables that are continuous
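
The numerical issue behind this question is easy to reproduce. The sketch below multiplies many small probabilities directly and then sums their logarithms instead. The setup (100 attributes, each with conditional probability 1e-5) is hypothetical and chosen only to force the effect; `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical setup: 100 attributes, each with conditional probability 1e-5.
probs = [1e-5] * 100

direct_product = math.prod(probs)            # true value 1e-500 is below the float range
log_score = sum(math.log(p) for p in probs)  # same information, kept as a sum of logs

print(direct_product)  # underflows to 0.0
print(log_score)
```

Because the logarithm is monotonic, comparing classes by their summed log-probabilities picks the same winner as comparing the raw products would, without the product collapsing to zero.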

87. Consider the following SQL query:

SELECT product_id FROM supplier_A

UNION

SELECT product_id FROM supplier_B;

What is the expected result?

All product_id values from both tables with duplicates or repeating rows

All product_id values from supplier_A table but not from supplier_B table

All product_id values from supplier_B table but not from supplier_A table

All product_id values from both tables with no duplicates or repeating rows
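
The behavior of UNION can be checked against an in-memory SQLite database using Python's standard `sqlite3` module. The table contents below are made up for illustration, and `UNION ALL` is included only as a contrast; other database engines should treat the two operators the same way, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier_A (product_id INTEGER);
    CREATE TABLE supplier_B (product_id INTEGER);
    INSERT INTO supplier_A VALUES (1), (2), (2), (3);
    INSERT INTO supplier_B VALUES (3), (4);
""")

union_rows = conn.execute(
    "SELECT product_id FROM supplier_A "
    "UNION SELECT product_id FROM supplier_B ORDER BY product_id"
).fetchall()
union_all_rows = conn.execute(
    "SELECT product_id FROM supplier_A "
    "UNION ALL SELECT product_id FROM supplier_B ORDER BY product_id"
).fetchall()

print(union_rows)      # each product_id appears once
print(union_all_rows)  # repeated values are kept
conn.close()
```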

88. In data visualization, which type of chart is recommended to represent frequency data?

Line chart

Histogram

Q-Q chart

Scatterplot

89. Which word or phrase completes the statement? “Excessive emphasis color is to bar chart as __________.”

Multicollinearity is to OLS

Multicollinearity is to serial correlation

Confidence is to leverage

Confidence interval is to regression

90. You submit a MapReduce job to a Hadoop cluster. Although the job was successfully submitted, you notice that it is not completing.

What should be done?

Ensure that a DataNode is running

Ensure that the TaskTracker is running

Ensure that the NameNode is running

Ensure that the JobTracker is running

91. Trend, seasonal, and cyclical are components of a time series.

What is another component?

Irregular

Linear

Quadratic

Exponential

92. Variable D is not significantly impacting the dependent variable.

After seeing your findings, the majority of your team agreed that variable B should be positively impacting the dependent variable.

What is a possible reason the coefficient for variable B was negative and not positive?

Variable B is interacting with another variable due to correlated inputs

Variable B needs a quadratic transformation due to its relationship to the dependent variable

The information gain from variable B is already provided by another variable

Variable B needs a logarithmic transformation due to its relationship to the dependent variable

93. Refer to the exhibit.

You have run a linear regression model against your data, and have plotted true outcome versus predicted outcome. The R-squared of your model is 0.75.

What is your assessment of the model?

The R-squared may be biased upwards by the extreme-valued outcomes. Remove them and refit to get a better idea of the model's quality over typical data.

The R-squared is good. The model should perform well.

The extreme-valued outliers may negatively affect the model's performance. Remove them to see if the R-squared improves over typical data.

The observations seem to come from two different populations, but this model fits them both equally well.

94. If distributed Item-based Collaborative Filtering is an algorithm supported by Mahout, what is the use case category of the algorithm?

Classification

Recommenders

Frequent Itemset

Clustering

95. Your risk analysis team has access to new customer financial data. You want to use this data to improve your prediction of credit default. Previously, the team was using only credit bureau scores, loan size, and customer income to assess risk of default.

What is the null hypothesis that should be used to evaluate the model?

New model predicts as well as the toss of a coin weighted by the average default rate

New model predicts better than the toss of a coin weighted by the average default rate

Model using the new financial data predicts the outcome just as well as the previous model

Model using the new financial data predicts the outcome better than the previous model

96. Which assumption makes the Naïve Bayesian classifier different from the general Bayesian model?

Number of features cannot be greater than the number of records

Features of a class are conditionally independent of one another

All variables need to be numeric

Fewer features can be used with the Naïve Bayes classifier

97. Refer to the exhibit.

You have plotted the distribution of savings account sizes for your bank.

How would you proceed, based on this distribution?

The data is extremely skewed. Replot the data on a logarithmic scale to get a better sense of it.

The data is extremely skewed, but looks bimodal; replot the data in the range 2,500-10,000 to be sure.

The accounts of size greater than 2500 are rare, and probably outliers. Eliminate them from your future analysis.

The data is extremely skewed. Split your analysis into two cohorts: accounts less than 2500, and accounts greater than 2500
