Praise

"A must-read resource for anyone who is serious about embracing the opportunity of big data."
— Craig Vaughan, Vice President at SAP

"This timely book says out loud what has finally become apparent: in the modern world, data is business, and you can no longer think business without thinking data. Read this book and you will understand the science behind thinking data."
— Ron Bekkerman, Chief Data Officer at Carmel Ventures

"A great book for business managers who lead or interact with data scientists, who wish to better understand the principles and algorithms available without the technical details of single-disciplinary books."
— Ronny Kohavi, Partner Architect at Microsoft Online Services Division

"Provost and Fawcett have distilled their mastery of both the art and science of real-world data analysis into an unrivalled introduction to the field."
— Geoff Webb, Editor-in-Chief of Data Mining and Knowledge Discovery Journal

— Claudia Perlich, Chief Scientist of Dstillery and Advertising Research Foundation Innovation Award Grand Winner (2013)
"A foundational piece in the fast-developing world of Data Science. A must-read for anyone interested in the Big Data revolution."
— Justin Gapper, Business Unit Analytics Manager at Teledyne Scientific and Imaging

"For all skill levels, but especially useful for the budding data scientist. It is the first book of its kind, with a focus on data science concepts as applied to practical business problems. It is full of real-world examples outlining familiar, accessible problems in the business world: customer churn, targeted marketing, even whiskey analytics! It shows how to approach problem solving and be successful at it. Whether you are looking for a good comprehensive overview of data science or are a budding data scientist in need of the basics, this is a must-read."
— Chris Volinsky, Director of Statistics Research at AT&T Labs and Winning Team Member for the $1 Million Netflix Challenge

"This book goes beyond data analytics 101. It's the essential guide for those of us (all of us?) whose businesses are built on the ubiquity of data opportunities and the new mandate for data-driven decision-making."
— Tom Phillips, CEO of Dstillery and Former Head of Google Search and Analytics

"Intelligent use of data has become a force powering business to new levels of competitiveness. To thrive in this data-driven ecosystem, engineers, analysts, and managers alike must understand the options, design choices, and tradeoffs before them. With motivating examples, clear exposition, and a breadth of details covering not only the "hows" but the "whys", Data Science for Business is the perfect primer for those wishing to become involved in the development and application of data-driven systems."
— Josh Attenberg, Data Science Lead at Etsy
"Data is the foundation of new waves of productivity growth, innovation, and richer customer insight. Only recently viewed broadly as a source of competitive advantage, dealing well with data is rapidly becoming table stakes to stay in the game. The authors' deep applied experience makes this a must-read: a window into your competitor's strategy."
— Alan Murray, Serial Entrepreneur; Partner at Coriolis Ventures

"One of the best data mining books, which helped me think through various ideas on liquidity analysis in the FX business. The examples are excellent and help you take a deep dive into the subject! This one is going to be on my shelf for a lifetime!"
— Nidhi Kathuria, Vice President of FX at Royal Bank of Scotland

"An excellent and accessible primer to help businessfolk better appreciate the concepts, tools, and techniques employed by data scientists… and for data scientists to better appreciate the business context in which their solutions are deployed."
— Joe McCarthy, Director of Analytics and Data Science at Atigeo

— Ira Laefsky, MS Engineering (Computer Science)/MBA Information Technology and Human Computer Interaction Researcher, formerly on the Senior Consulting Staff of Arthur D. Little, Inc. and Digital Equipment Corporation

"… not only the "hows" but the "whys", Data Science for Business is the perfect primer for those wishing to become involved in the development and application of data-driven systems."
— Ted O'Brien, Co-Founder/Director of Talent Acquisition at Starbridge Partners and Publisher of the Data Science Report
Data Science for Business
Foster Provost and Tom Fawcett
Data Science for Business
by Foster Provost and Tom Fawcett

Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact the corporate/institutional sales department: 800-998-9938 or [email protected]
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Rebecca Demarest

July 2013: First Edition
Revision History for the First Edition:
2013-07-25: First release
2013-12-19: Second release
See http://oreilly.com/catalog/errata.csp?isbn=9781449361327 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-449-36132-7 [LSI]
For our parents.
Table of Contents
Preface

1. Introduction: Data-Analytic Thinking
The Ubiquity of Data Opportunities
Example: Hurricane Frances
Example: Predicting Customer Churn
Data Science, Engineering, and Data-Driven Decision Making
Data Processing and "Big Data"
From Big Data 1.0 to Big Data 2.0
Data and Data Science Capability as a Strategic Asset
Data-Analytic Thinking
This Book
Data Mining and Data Science, Revisited
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist
Summary
2. Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.
From Business Problems to Data Mining Tasks
Supervised Versus Unsupervised Methods
Data Mining and Its Results
The Data Mining Process
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Implications for Managing the Data Science Team
Other Analytics Techniques and Technologies
Statistics
Database Querying
Data Warehousing
Regression Analysis
Machine Learning and Data Mining
Answering Business Questions with These Techniques
Summary
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.
Models, Induction, and Prediction
Supervised Segmentation
Selecting Informative Attributes
Example: Attribute Selection with Information Gain
Supervised Segmentation with Tree-Structured Models
Visualizing Segmentations
Trees as Sets of Rules
Probability Estimation
Example: Addressing the Churn Problem with Tree Induction
Summary
4. Fitting a Model to Data
Fundamental concepts: Finding "optimal" model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.
Exemplary techniques: Linear regression; Logistic regression; Support vector machines.
Classification via Mathematical Functions
Linear Discriminant Functions
Optimizing an Objective Function
An Example of Mining a Linear Discriminant from Data
Linear Discriminant Functions for Scoring and Ranking Instances
Support Vector Machines, Briefly
Regression via Mathematical Functions
Class Probability Estimation and Logistic "Regression"
* Logistic Regression: Some Technical Details
Example: Logistic Regression Versus Tree Induction
Nonlinear Functions, Support Vector Machines, and Neural Networks
Summary
5. Overfitting and Its Avoidance
Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.
Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization.
Generalization
Overfitting
Overfitting Examined
Holdout Data and Fitting Graphs
Overfitting in Tree Induction
Overfitting in Mathematical Functions
Example: Overfitting Linear Functions
* Example: Why Is Overfitting Bad?
From Holdout Evaluation to Cross-Validation
The Churn Dataset Revisited
Learning Curves
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
A General Method for Avoiding Overfitting
* Avoiding Overfitting for Parameter Optimization
Summary
6. Similarity, Neighbors, and Clusters
Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.
Similarity and Distance
Nearest-Neighbor Reasoning
Example: Whiskey Analytics
Nearest Neighbors for Predictive Modeling
How Many Neighbors and How Much Influence?
Geometric Interpretation, Overfitting, and Complexity Control
Issues with Nearest-Neighbor Methods
Some Important Technical Details Relating to Similarities and Neighbors
Heterogeneous Attributes
* Other Distance Functions
* Combining Functions: Calculating Scores from Neighbors
Clustering
Example: Whiskey Analytics Revisited
Hierarchical Clustering
Nearest Neighbors Revisited: Clustering Around Centroids
Example: Clustering Business News Stories
Understanding the Results of Clustering
* Using Supervised Learning to Generate Cluster Descriptions
Stepping Back: Solving a Business Problem Versus Data Exploration
Summary
7. Decision Analytic Thinking I: What Is a Good Model?
Fundamental concepts: Careful consideration of what is desired from data science results; Expected value as a key evaluation framework; Consideration of appropriate comparative baselines.
Exemplary techniques: Various evaluation metrics; Estimating costs and benefits; Calculating expected profit; Creating baseline methods for comparison.
Evaluating Classifiers
Plain Accuracy and Its Problems
The Confusion Matrix
Problems with Unbalanced Classes
Problems with Unequal Costs and Benefits
Generalizing Beyond Classification
A Key Analytical Framework: Expected Value
Using Expected Value to Frame Classifier Use
Using Expected Value to Frame Classifier Evaluation
Evaluation, Baseline Performance, and Implications for Investments in Data
Summary
8. Visualizing Model Performance
Fundamental concepts: Visualization of model performance under various kinds of uncertainty; Further consideration of what is desired from data mining results.
Exemplary techniques: Profit curves; Cumulative response curves; Lift curves; ROC curves.
Ranking Instead of Classifying
Profit Curves
ROC Graphs and Curves
The Area Under the ROC Curve (AUC)
Cumulative Response and Lift Curves
Example: Performance Analytics for Churn Modeling
Summary
9. Evidence and Probabilities
Fundamental concepts: Explicit evidence combination with Bayes' Rule; Probabilistic reasoning via assumptions of conditional independence.
Exemplary techniques: Naive Bayes classification; Evidence lift.
Example: Targeting Online Consumers with Advertisements
Combining Evidence Probabilistically
Joint Probability and Independence
Bayes' Rule
Applying Bayes' Rule to Data Science
Conditional Independence and Naive Bayes
Advantages and Disadvantages of Naive Bayes
A Model of Evidence "Lift"
Example: Evidence Lifts from Facebook "Likes"
Evidence in Action: Targeting Consumers with Ads
Summary
10. Representing and Mining Text
Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.
Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models.
Why Text Is Important
Why Text Is Difficult
Representation
Bag of Words
Term Frequency
Measuring Sparseness: Inverse Document Frequency
Combining Them: TFIDF
Example: Jazz Musicians
* The Relationship of IDF to Entropy
Beyond Bag of Words
N-gram Sequences
Named Entity Extraction
Topic Models
Example: Mining News Stories to Predict Stock Price Movement
The Task
The Data
Data Preprocessing
Results
Summary
11. Decision Analytic Thinking II: Toward Analytical Engineering
Fundamental concept: Solving business problems with data science starts with analytical engineering: designing an analytical solution, based on the data, tools, and techniques available.
Exemplary technique: Expected value as a framework for data science solution design.
Targeting the Best Prospects for a Charity Mailing
The Expected Value Framework: Decomposing the Business Problem and Recomposing the Solution Pieces
A Brief Digression on Selection Bias
Our Churn Example Revisited with Even More Sophistication
The Expected Value Framework: Structuring a More Complicated Business Problem
Assessing the Influence of the Incentive
From an Expected Value Decomposition to a Data Science Solution
Summary
12. Other Data Science Tasks and Techniques
Fundamental concepts: Our fundamental concepts as the basis of many common data science techniques; The importance of familiarity with the building blocks of data science.
Exemplary techniques: Association and co-occurrences; Behavior profiling; Link prediction; Data reduction; Latent information mining; Movie recommendation; Bias-variance decomposition of error; Ensembles of models; Causal reasoning from data.
Co-occurrences and Associations: Finding Items That Go Together
Measuring Surprise: Lift and Leverage
Example: Beer and Lottery Tickets
Associations Among Facebook Likes
Profiling: Finding Typical Behavior
Link Prediction and Social Recommendation
Data Reduction, Latent Information, and Movie Recommendation
Bias, Variance, and Ensemble Methods
Data-Driven Causal Explanation and a Viral Marketing Example
Summary
13. Data Science and Business Strategy
Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability.
Thinking Data-Analytically, Redux
Achieving Competitive Advantage with Data Science
Sustaining Competitive Advantage with Data Science
Formidable Historical Advantage
Unique Intellectual Property
Unique Intangible Collateral Assets
Superior Data Scientists
Superior Data Science Management
Attracting and Nurturing Data Scientists and Their Teams
Examine Data Science Case Studies
Be Ready to Accept Creative Ideas from Any Source
Be Ready to Evaluate Proposals for Data Science Projects
Example Data Mining Proposal
Flaws in the Big Red Proposal
A Firm's Data Science Maturity
14. Conclusion
The Fundamental Concepts of Data Science
Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data
Changing the Way We Think About Solutions to Business Problems
What Data Can't Do: Humans in the Loop, Revisited
Privacy, Ethics, and Mining Data About Individuals
Is There More to Data Science?
Final Example: From Crowd-Sourcing to Cloud-Sourcing
Final Words
A. Proposal Review Guide
B. Another Sample Proposal
Glossary
Bibliography
Index
Preface
Data Science for Business is intended for several sorts of readers:
• Business people who will be working with data scientists, managing data science-oriented projects, or investing in data science ventures,
• Developers who will be implementing data science solutions, and
• Aspiring data scientists.

This is not a book about algorithms, nor is it a replacement for a book about algorithms. We deliberately avoided an algorithm-centered approach. We believe there is a relatively small set of fundamental concepts or principles that underlie techniques for extracting useful knowledge from data. These concepts serve as the foundation for many well-known data mining algorithms. Moreover, these concepts underlie the analysis of data-centered business problems, the creation and evaluation of data science solutions, and the evaluation of general data science strategies and proposals. Accordingly, we organized the exposition around these general principles rather than around specific algorithms. Where it is necessary to describe procedural details, we use a combination of text and diagrams, which we think are more accessible than a listing of detailed algorithmic steps.

The book does not presume a sophisticated mathematical background. However, by its very nature the material is somewhat technical: the goal is to impart a significant understanding of data science, not just to give a high-level overview. In general, we have tried to minimize the mathematics and make the exposition as "conceptual" as possible.

Colleagues in industry comment that the book is invaluable for helping to align the understanding of the business, technical/development, and data science teams. That observation is based on a small sample, so we are curious to see how general it truly is (see Chapter 5!). Ideally, we envision a book that any data scientist would give to collaborators from the development or business teams, effectively saying: if you really want to design and implement top-notch data science solutions to business problems, we all need to have a common understanding of this material.

Colleagues also tell us that the book has been quite useful in an unforeseen way: for preparing to interview data science job candidates. The demand from business for hiring data scientists is strong and increasing. In response, more and more job seekers are presenting themselves as data scientists. Every data science job candidate should understand the fundamentals presented in this book. (Our industry colleagues tell us that they are surprised how many do not. We have half-seriously discussed a follow-up pamphlet, "Cliff's Notes to Interviewing for Data Science Jobs.")
Our Conceptual Approach to Data Science

In this book we introduce a collection of the most important fundamental concepts of data science. Some of these concepts are "headliners" for chapters, and others are introduced more naturally through the discussions (and thus are not necessarily labeled as fundamental concepts). The concepts span the process from envisioning the problem, to applying data science techniques, to deploying the results to improve decision-making. The concepts also undergird a large array of business analytics methods and techniques.

The concepts fit into three general types:
1. Concepts about how data science fits in the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams; ways of thinking about how data science leads to competitive advantage; and tactical concepts for doing well with data science projects.
2. General ways of thinking data-analytically. These help in identifying appropriate data and considering appropriate methods. The concepts include the data mining process as well as the collection of different high-level data mining tasks.
3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science tasks and their algorithms.

For example, one fundamental concept is that of determining the similarity of two entities described by data. This ability forms the basis for various specific tasks. It may be used directly to find customers similar to a given customer. It forms the core of several prediction algorithms that estimate a target value, such as the expected resource usage of a client or the probability that a customer will respond to an offer. It is also the basis for clustering techniques, which group entities by their shared features without a focused objective. Similarity likewise forms the basis of information retrieval, in which documents or web pages relevant to a search query are retrieved. Finally, it underlies several common algorithms for recommendation.
A traditional algorithm-oriented book might present each of these tasks in a different chapter, under different names, with common aspects buried in algorithm details or mathematical propositions. In this book we instead focus on the unifying concepts, presenting specific tasks and algorithms as natural manifestations of them.

As another example, in evaluating the utility of a pattern, we see a notion of lift (how much more prevalent a pattern is than would be expected by chance) recurring broadly across data science. It is used to evaluate very different sorts of patterns in different contexts. Algorithms for targeting advertisements are evaluated by computing the lift one gets for the targeted population. Lift is used to judge the weight of evidence for or against a conclusion. Lift helps determine whether a co-occurrence (an association) in data is interesting, as opposed to simply being a natural consequence of popularity.

We believe that explaining data science around such fundamental concepts not only aids the reader, it also facilitates communication between business stakeholders and data scientists. It provides a shared vocabulary and enables both parties to understand each other better. The shared concepts lead to deeper discussions that may uncover critical issues otherwise missed.
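Two of the unifying concepts named above, similarity and lift, are simple enough to sketch in a few lines of Python. The following is only an illustrative sketch, not the book's own method: the function names, the choice of Euclidean distance as the similarity measure, and all of the customer and basket data are assumptions made up for this example.

```python
import math


def euclidean_distance(a, b):
    """Distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, entities, k=1):
    """Return the names of the k entities closest to the target vector."""
    ranked = sorted(
        entities, key=lambda name: euclidean_distance(target, entities[name])
    )
    return ranked[:k]


def lift(n_both, n_a, n_b, n_total):
    """Observed co-occurrence rate of A and B divided by the rate expected
    if A and B were independent: p(A, B) / (p(A) * p(B))."""
    if n_a <= 0 or n_b <= 0 or n_total <= 0:
        raise ValueError("n_a, n_b, and n_total must be positive")
    return (n_both / n_total) / ((n_a / n_total) * (n_b / n_total))


# Hypothetical customers described by (age, years as customer, monthly spend).
customers = {"A": (35, 3, 50), "B": (37, 2, 55), "C": (62, 10, 20)}
nearest = most_similar((36, 3, 52), customers, k=2)

# Hypothetical counts: 1,000 baskets; 300 contain item A, 400 contain item B,
# and 200 contain both.
basket_lift = lift(200, 300, 400, 1000)

print(nearest)                # ['A', 'B']
print(round(basket_lift, 2))  # 1.67
```

A lift above 1 means the two items occur together more often than chance alone would suggest; a lift near 1 means the co-occurrence is about what their individual popularity predicts. In the similarity part, attributes measured on very different scales would normally be rescaled before computing distances, a step this sketch omits.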
To the Instructor

This book has been used successfully as a textbook for a very wide variety of data science courses. Historically, the book arose from the development of Foster's multidisciplinary Data Science classes at the Stern School at NYU, starting in the fall of 2005. [1] The original class was nominally for MBA students and MSIS students, but drew students from schools across the university. The most interesting aspect of the class was not that it appealed to MBA and MSIS students, for whom it was designed. More interesting, it also was found to be very valuable by students with strong backgrounds in machine learning and other technical disciplines. Part of the reason seemed to be that the focus on fundamental principles and other issues besides algorithms was missing from their curricula.

At NYU we now use the book in support of a variety of data science-related programs: the original MBA and MSIS programs, undergraduate business analytics, NYU/Stern's new MS in Business Analytics program, and as the Introduction to Data Science for NYU's new MS in Data Science. In addition, (prior to publication) the book has been adopted by a number of other universities for programs in several countries (and counting), in business schools, in computer science programs, and for more general introductions to data science.

Stay tuned to the book's websites (see below) for information on how to obtain helpful instructional material, including lecture slides, sample homework questions and problems, example project instructions based on the frameworks from the book, exam questions, and more to come.

We keep an up-to-date list of known adoptees on the book's website. Click Who's Using It at the top.

[1] Of course, each author has the distinct impression that he did the majority of the work on the book.
Other Skills and Concepts

There are many other concepts and skills that a practical data scientist needs to know besides the fundamental principles of data science. These skills and concepts will be discussed in Chapter 1 and Chapter 2. The interested reader is encouraged to visit the book's website for pointers to material for learning these additional skills and concepts (for example, scripting in Python, Unix command-line processing, datafiles, common data formats, databases and querying, big data architectures and systems like MapReduce and Hadoop, data visualization, and other related topics).
Sections and Notation

In addition to occasional footnotes, the book contains boxed "sidebars." These are essentially extended footnotes. We reserve these for material that we consider interesting and worthwhile, but too long for a footnote and too much of a digression for the main text.
Technical Details Ahead: A note on the starred sections

The occasional mathematical details are relegated to optional "starred" sections. These section titles will have asterisk prefixes, and they will be preceded by a paragraph rendered like this one. Such "starred" sections contain more detailed mathematics and/or more technical details than elsewhere, and this introductory paragraph explains their purpose. The book is written so that these sections may be skipped without loss of continuity, although in a few places we remind readers that details appear there.
Constructions in the text like (Smith and Jones, 2003) indicate a reference to an entry in the bibliography (in this case, the 2003 article or book by Smith and Jones); "Smith and Jones (2003)" is a similar reference. A single bibliography for the entire book appears in the endmatter.
In this book we try to keep math to a minimum, and what math there is we have simplified as much as possible without introducing confusion. For our readers with technical backgrounds, a few comments may be in order regarding our simplifying choices.

1. We avoid Sigma (Σ) and Pi (Π) notation, commonly used in textbooks to indicate sums and products, respectively. Instead we simply use equations with ellipses like this:

$f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$

In the technical, "starred" sections we sometimes adopt Sigma and Pi notation when this ellipsis approach is just too cumbersome. We assume people reading these sections are somewhat more comfortable with math notation and will not be confused.

2. Statistics books are usually careful to distinguish between a value and its estimate by putting a "hat" on variables that are estimates. In this book we are almost always talking about estimates from data, and putting hats on everything makes equations verbose and ugly. Everything should be assumed to be an estimate from data unless we say otherwise.

3. We simplify notation and remove extraneous variables where we believe they are clear from context. For example, when we discuss classifiers mathematically, we are technically dealing with decision predicates over feature vectors. Expressing this formally would lead to equations like:

$\hat{f}_R(\mathbf{x}) = x_{\mathrm{Age}} \times -1 + 0.7 \times x_{\mathrm{Balance}} + 60$

Instead we opt for the more readable:

$f(\mathbf{x}) = \mathrm{Age} \times -1 + 0.7 \times \mathrm{Balance} + 60$

with the understanding that $\mathbf{x}$ is a vector and Age and Balance are components of it.

We have tried to be consistent with typography, reserving fixed-width typewriter fonts like sepal_width to indicate attributes or keywords in data. For example, in the text-mining chapter, a word like "discussing" designates a word in a document, while discuss might be the resulting token in the data.

The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

Throughout the book we have placed special inline tips and warnings relevant to the material. They will be rendered differently depending on whether you are reading paper, PDF, or an ebook, as follows:

A sentence or paragraph typeset like this signifies a tip or a suggestion.

This text and element signifies a general note.

Text rendered like this signifies a warning or caution. These are more important than tips and are used sparingly.
Using Examples

In addition to being an introduction to data science, this book is intended to be useful in discussions of and day-to-day work in the field. Answering a question by citing this book and quoting examples does not require permission. We appreciate, but do not require, attribution. Formal attribution usually includes the title, author, publisher, and ISBN. For example: "Data Science for Business by Foster Provost and Tom Fawcett (O'Reilly). Copyright 2013 Foster Provost and Tom Fawcett, 978-1-449-36132-7."

If you feel your use of examples falls outside fair use or the permission given above, feel free to contact us at [email protected]
Safari® Books Online

Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world's leading authors in technology and business. Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O'Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have two web pages for this book, where we list errata, examples, and any additional information. You can access the publisher's page at http://oreil.ly/data-science and the authors' page at http://www.data-science-for-biz.com.
To comment or ask technical questions about this book, send email to bookques[email protected].
For more information about O'Reilly Media's books, courses, conferences, and news, visit http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
Thanks to all the many colleagues and others who have provided invaluable ideas, feedback, criticism, suggestions, and encouragement based on discussions and many prior draft manuscripts. At the risk of missing someone, let us thank in particular: Panos Adamopoulos, Manuel Arriaga, Josh Attenberg, Solon Barocas, Ron Bekkerman, Josh Blumenstock, Ohad Brazilay, Aaron Brick, Jessica Clark, Nitesh Chawla, Peter Devito, Vasant Dhar, Jan Ehmke, Theos Evgeniou, Justin Gapper, Tomer Geva, Daniel Gillick, Shawndra Hill, Nidhi Kathuria, Ronny Kohavi, Marios Kokkodis, Tom Lee, Philipp Marek, David Martens, Sophie Mohin, Lauren Moores, Alan Murray, Nick Nishimura, Balaji Padmanabhan, Jason Pan, Claudia Perlich, Gregory Piatetsky-Shapiro, Tom Phillips, Kevin Reilly, Maytal Saar-Tsechansky, Evan Sadler, Galit Shmueli, Roger Stein, Nick Street, Kiril Tsemekhman, Craig Vaughan, Chris Volinsky, Geoff Webb, Debbie Yuster, and Rong Zheng.
We would also like to extend a collective thank you to the students in Foster's classes: Data Mining for Business Analytics, Practical Data Science, Introduction to Data Science, and the Data Science Research Seminar. Questions and issues that arose when using prior drafts of this book provided essential feedback for improving it.
Thanks to all the colleagues who have taught us about data science and about how to teach data science over the years. Special thanks to Maytal Saar-Tsechansky and Claudia Perlich. Maytal graciously shared with Foster her notes for her data mining class many years ago. The classification tree example in Chapter 3 (thanks especially for the "bodies" visualization) is based mostly on her idea and example. Her ideas and example were also the genesis for the visualization comparing the partitioning of the instance space with trees and linear discriminant functions in Chapter 4, the "Will David Respond" example in Chapter 6 is based on her example, and there are probably other things long forgotten.
Claudia has taught companion sections of Data Mining for Business Analytics/Introduction to Data Science along with Foster for the past few years, and has taught him much about data science in the process (and beyond).
Thanks to David Stillwell, Thore Graepel, and Michal Kosinski for providing the Facebook Like data for some of the examples. Thanks to Nick Street for providing the cell nuclei data and for letting us use the cell nuclei image in Chapter 4. Thanks to David Martens for his help with the mobile locations visualization. Thanks to Chris Volinsky for providing data from his work on the Netflix Challenge. Thanks to Sonny Tambe for early access to his results on big data technologies and productivity. Thanks to Patrick Perry for pointing us to the bank call center example used in Chapter 12. Thanks to Geoff Webb for the use of the Magnum Opus association mining system.
Most of all, we thank our families for their love, patience, and encouragement.
A great deal of open source software was used in the preparation of this book and its examples. The authors wish to thank the developers and contributors of:
• Python and Perl
• Scipy, Numpy, Matplotlib, and Scikit-Learn
• Weka
• The Machine Learning Repository at the University of California at Irvine (Bache & Lichman, 2013)
Finally, we encourage readers to check our website for updates to this material, new chapters, errata, addenda, and accompanying slide sets.
- Foster Provost and Tom Fawcett
CHAPTER 1
Introduction: Data-Analytic Thinking
Dream no small dreams for they have no power to move the hearts of men.
- Johann Wolfgang von Goethe
The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors' movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data: the realm of data science.
The Ubiquity of Data Opportunities
With vast amounts of data now available, companies in almost every industry are focused on exploiting data for competitive advantage. In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far outstripped the capacity of manual analysis. At the same time, computers have become far more powerful, networking has become ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses than previously possible. The convergence of these phenomena has given rise to the increasingly widespread business application of data science principles and data mining techniques. Probably the widest applications of data mining techniques are in marketing, for tasks such as targeted marketing, online advertising, and recommendations for cross-selling.
Data mining is used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value. The finance industry uses data mining for credit scoring and trading, and in operations via fraud detection and workforce management. Major retailers from Walmart to Amazon apply data mining throughout their businesses, from marketing to supply-chain management. Many firms have differentiated themselves strategically with data science, sometimes to the point of evolving into data mining companies.
The primary goals of this book are to help you view business problems from a data perspective and to understand the principles of extracting useful knowledge from data. There is a fundamental structure to data-analytic thinking, and there are basic principles that should be understood. There are also particular areas where intuition, creativity, common sense, and domain knowledge must be brought to bear. A data perspective will provide you with structure and principles, and this will give you a framework for systematically analyzing such problems. As you get better at data-analytic thinking, you will develop intuition as to how and where to apply creativity and domain knowledge.
Throughout the first two chapters of this book, we will discuss in detail various topics and techniques related to data science and data mining. The terms "data science" and "data mining" are often used interchangeably, and the former has taken on a life of its own as various individuals and organizations try to capitalize on the current hype surrounding it. At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.
As a term, "data science" often is applied more broadly than the traditional use of "data mining," but data mining techniques provide some of the clearest illustrations of the principles of data science.
It is important to understand data science even if you never intend to apply it yourself. Data-analytic thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a consultant, or a potential investment target proposes to improve a particular business application by extracting knowledge from data, you should be able to assess the proposal systematically and decide whether it is sound or flawed. This does not mean that you will be able to tell whether it will actually succeed (for data mining projects, that often requires trying), but you should be able to spot obvious flaws, unrealistic assumptions, and missing pieces.
Throughout the book, we will describe a number of fundamental data science principles and will illustrate each with at least one data mining technique that embodies the principle. For each principle there are usually many specific techniques that embody it, so in this book we have chosen to emphasize the basic principles in preference to specific techniques. That said, we will not make a big deal about the difference between data science and data mining, except where it will have a substantial effect on understanding the actual concepts. Let's examine two brief case studies of analyzing data to extract predictive patterns.
Example: Hurricane Frances
Consider an example from a New York Times story from 2004:
Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct hit on Florida's Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their newest data-driven weapons ... predictive technology. A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes' worth of shopper history that is stored in Wal-Mart's data warehouse, she felt that the company could "start predicting what's going to happen, instead of waiting for it to happen," as she put it. (Hays, 2004)
Consider why data-driven prediction might be useful in this scenario. It might be useful to predict that people in the path of the hurricane would buy more bottled water. Maybe, but this point seems a bit obvious, and why would we need data science to discover it? It might be useful to project the amount of increase in sales due to the hurricane, to ensure that local Wal-Marts are properly stocked. Perhaps mining the data could reveal that a particular DVD sold out in the hurricane's path, but maybe it sold out that week at Wal-Marts across the country, not just where the hurricane's landfall was imminent. The prediction could be somewhat useful, but it is probably more general than Ms. Dillman was intending.
It would be more valuable to discover patterns due to the hurricane that were not obvious. To do this, analysts might examine the huge volume of Wal-Mart data from prior, similar situations (such as Hurricane Charley) to identify unusual local demand for products. From such patterns, the company might be able to anticipate unusual demand for products and rush stock to the stores ahead of the hurricane's landfall.
Indeed, that is what happened. The New York Times (Hays, 2004) reported that: "... the experts mined the data and found that the stores would indeed need certain products, and not just the usual flashlights. 'We didn't know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,' Ms. Dillman said in a recent interview. 'And the pre-hurricane top-selling item was beer.'"1
1. Of course! What goes better with strawberry Pop-Tarts than a nice cold beer?
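The distinction drawn above, between demand that is unusual locally and demand that rose everywhere, can be sketched as a simple lift comparison. This is only a minimal illustration, not Wal-Mart's method: the sales figures, the product names, and the `local_lift` helper are all hypothetical.

```python
# Hypothetical weekly unit sales; all numbers are invented for illustration.
# Each entry: (baseline_week_units, pre_hurricane_week_units)
stores_in_path = {
    "strawberry_pop_tarts": (100, 700),
    "new_release_dvd": (50, 150),
}
stores_elsewhere = {
    "strawberry_pop_tarts": (1000, 1050),
    "new_release_dvd": (500, 1500),
}


def sales_ratio(baseline, current):
    """Return current sales as a multiple of the baseline rate."""
    return current / baseline


def local_lift(product):
    """Ratio of the in-path sales increase to the increase everywhere else.

    A value near 1 means sales changed equally across the country; a value
    well above 1 suggests demand specific to the hurricane's path.
    """
    in_path = sales_ratio(*stores_in_path[product])
    elsewhere = sales_ratio(*stores_elsewhere[product])
    return in_path / elsewhere


for product in stores_in_path:
    print(product, round(local_lift(product), 2))
```

In this made-up data the DVD tripled both in the hurricane's path and elsewhere, so its local lift is 1, while the Pop-Tarts increase is concentrated in the path. A real analysis would also need to handle seasonality, stock-outs, and noise.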
Example: Predicting Customer Churn
How are such data analyses performed? Consider a second, more typical business scenario and how it might be treated from a data perspective. This problem will serve as a running example that will illuminate many of the issues raised in this book and provide a common frame of reference.
Assume you just landed a great analytical job with MegaTelCo, one of the largest telecommunication firms in the United States. They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. Communications companies are now engaged in battles to attract each other's customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around: one company must spend on incentives to attract a customer, while another company loses revenue when the customer departs.
You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo's vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts.
Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget? Answering this question is much more complicated than it may seem initially.
We will return to this problem repeatedly through the book, adding sophistication to our solution as we develop an understanding of the fundamental data science concepts.
In reality, customer retention has been a major use of data mining technologies, especially in the telecommunications and finance businesses. More generally, these were some of the earliest and widest adopters of data mining technologies, for reasons discussed later.
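One naive way to frame the selection question is to score each customer by the expected benefit of making the offer and then spend the budget on the highest scores. The sketch below assumes that we already have, for each customer, an estimated churn probability, an estimated chance that the offer retains a would-be churner, and a customer value, and that every offer costs the same. Those estimates, the field names, and the greedy rule are illustrative assumptions, not the solution developed later in the book.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    p_churn: float           # assumed estimate: probability of leaving at contract end
    p_saved_by_offer: float  # assumed estimate: chance the offer retains a would-be churner
    value: float             # assumed value of keeping the customer, in dollars


def expected_benefit(c: Customer, offer_cost: float) -> float:
    """Expected value retained by targeting c, minus the cost of the offer."""
    return c.p_churn * c.p_saved_by_offer * c.value - offer_cost


def select_targets(customers, offer_cost, budget):
    """Greedy rule: take customers in order of expected benefit, stopping when
    the benefit is no longer positive or the incentive budget is exhausted."""
    ranked = sorted(customers, key=lambda c: expected_benefit(c, offer_cost), reverse=True)
    chosen = []
    for c in ranked:
        over_budget = (len(chosen) + 1) * offer_cost > budget
        if expected_benefit(c, offer_cost) <= 0 or over_budget:
            break
        chosen.append(c.name)
    return chosen


customers = [
    Customer("A", p_churn=0.60, p_saved_by_offer=0.50, value=1000.0),  # benefit about 200
    Customer("B", p_churn=0.05, p_saved_by_offer=0.50, value=1000.0),  # benefit about -75
    Customer("C", p_churn=0.40, p_saved_by_offer=0.25, value=2000.0),  # benefit about 100
    Customer("D", p_churn=0.90, p_saved_by_offer=0.10, value=1500.0),  # benefit about 35
]
print(select_targets(customers, offer_cost=100.0, budget=250.0))
```

Note that customer D, the most likely to churn, is not the most attractive target here, because the offer is assumed unlikely to change that customer's mind. This is one reason the question is harder than simply ranking customers by churn probability.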
Data Science, Engineering, and Data-Driven Decision Making
Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data. In this book, we will view the ultimate goal of data science as improving decision making, as this generally is of direct interest to business.
Figure 1-1. Data science in the context of various data-related processes in the organization.
Figure 1-1 places data science in the context of various other closely related and data-related processes in the organization. It distinguishes data science from other aspects of data processing that are gaining increasing attention in business. Let's start at the top.
Data-driven decision making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition. For example, a marketer could select advertisements based purely on her long experience in the field and her eye for what will work. Or, she could base her selection on the analysis of data regarding how consumers react to different ads. She could also use a combination of these approaches. DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or lesser degrees.
The benefits of data-driven decision making have been demonstrated conclusively. Economist Erik Brynjolfsson and his colleagues from MIT and Penn's Wharton School conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, and Kim, 2011). They developed a measure of DDD that rates firms as to how strongly they use data to make decisions across the company. They show that, statistically, the more data-driven a firm is, the more productive it is, even controlling for a wide range of possible confounding factors. And the differences are not small. One standard deviation higher on the DDD scale is associated with a 4% to 6% increase in productivity. DDD also is correlated with higher return on assets, return on equity, asset utilization, and market value, and the relationship seems to be causal.
The sort of decisions we will be interested in in this book mainly fall into two types: (1) decisions for which "discoveries" need to be made within data, and (2) decisions that repeat, especially at massive scale, and so can benefit from even small increases in decision-making accuracy based on data analysis. The Walmart example above illustrates a type 1 problem: Linda Dillman would like to discover knowledge that will help Walmart prepare for Hurricane Frances's imminent arrival.
In 2012, Walmart's competitor Target was in the news for a data-driven decision-making case of its own, also a type 1 problem (Duhigg, 2012). Like most retailers, Target cares about consumers' shopping habits, what drives them, and what can influence them. Consumers tend to have inertia in their habits, and getting them to change is very difficult. Decision makers at Target knew, however, that the arrival of a new baby in a family is one point where people do change their shopping habits significantly. In the Target analyst's words, "As soon as we get them buying diapers from us, they're going to start buying everything else too."
Most retailers know this, and so they compete with each other trying to sell baby-related products to new parents. Since most birth records are public, retailers obtain information on births and send out special offers to the new parents.
However, Target wanted to get a jump on their competition. They were interested in whether they could predict that people are expecting a baby. If they could, they would gain an advantage by making offers before their competitors. Using techniques of data science, Target analyzed historical data on customers who later were revealed to have been pregnant, and were able to extract information that could predict which consumers were pregnant. For example, pregnant mothers often change their diets, their wardrobes, their vitamin regimens, and so on. These indicators could be extracted from historical data, assembled into predictive models, and then deployed in marketing campaigns.
We will discuss predictive models in much detail as we go through the book. For the time being, it is sufficient to understand that a predictive model abstracts away most of the complexity of the world, focusing in on a particular set of indicators that correlate in some way with a quantity of interest (who will churn, who will purchase, who is pregnant, etc.). Importantly, in both the Walmart and the Target examples, the data analysis was not testing a simple hypothesis. Instead, the data were explored with the hope that something useful would be discovered.2
Our churn example illustrates a type 2 DDD problem. MegaTelCo has hundreds of millions of customers, each a candidate for defection. Tens of millions of customers have contracts expiring each month, so each one of them has an increased likelihood of defection in the near future. If we can improve our ability to estimate, for a given customer, how profitable it would be for us to focus on her, we can potentially reap large benefits by applying this ability to the millions of customers in the population. This same logic applies to many of the areas where we have seen the most intense application of data science and data mining: direct marketing, online advertising, credit scoring, financial trading, help-desk management, fraud detection, search ranking, product recommendation, and so on.
The diagram in Figure 1-1 shows data science supporting data-driven decision making, but also overlapping with data-driven decision making. This highlights the often overlooked fact that, increasingly, business decisions are being made automatically by computer systems. Different industries have adopted automatic decision making at different rates. The finance and telecommunications industries were early adopters, largely because of their early development of data networks and implementation of massive-scale computing, which allowed the aggregation and modeling of data at a large scale, as well as the application of the resulting models to decision making.
In the 1990s, automated decision making changed the banking and consumer credit industries dramatically. Also in the 1990s, banks and telecommunications companies implemented massive-scale systems for managing data-driven fraud control decisions. As retail systems were increasingly computerized, merchandising decisions were automated.
Famous examples include Harrah's casinos' reward programs and the automated recommendations of Amazon and Netflix. Currently we are seeing a revolution in advertising, due in large part to a huge increase in the amount of time consumers are spending online, and the ability online to make (literally) split-second advertising decisions.
Data Processing and "Big Data"
It is important to digress here to address another point. There is a lot to data processing that is not data science, despite the impression one might get from the media. Data engineering and processing are critical to support data science, but they are more general. For example, these days many data processing skills, systems, and technologies are often mistakenly cast as data science. To understand data science and data-driven businesses, it is important to understand the differences. Data science needs access to data, and it often benefits from sophisticated data engineering that data processing technologies may facilitate, but these technologies are not data science technologies per se. They support data science, as shown in Figure 1-1, but they are useful for much more. Data processing technologies are very important for many data-oriented business tasks that do not involve extracting knowledge or data-driven decision making, such as efficient transaction processing, modern web system processing, and online advertising campaign management.
2. Target was successful enough that this case raised ethical questions on the deployment of such techniques. Concerns of ethics and privacy are interesting and very important, but we leave their discussion for another time and place.
"Big data" technologies (such as Hadoop, HBase, and MongoDB) have received considerable media attention recently. Big data essentially means datasets that are too large for traditional data processing systems, and that therefore require new processing technologies. As with the traditional technologies, big data technologies are used for many tasks, including data engineering. Occasionally, big data technologies are actually used for implementing data mining techniques. However, much more often the well-known big data technologies are used for data processing in support of the data mining techniques and other data science activities, as represented in Figure 1-1.
Previously, we discussed Brynjolfsson's study demonstrating the benefits of data-driven decision making. A separate study, conducted by economist Prasanna Tambe of NYU's Stern School, examined the extent to which big data technologies seem to help firms (Tambe, 2012). He finds that, after controlling for various possible confounding factors, using big data technologies is associated with significant additional productivity growth. Specifically, one standard deviation higher utilization of big data technologies is associated with 1% to 3% higher productivity than the average firm; one standard deviation lower in terms of big data utilization is associated with 1% to 3% lower productivity.
This leads to potentially very large productivity differences between firms at the extremes.
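To make the size of that difference concrete, the short calculation below compares a firm one standard deviation above the average with a firm one standard deviation below it. Treating the reported association as symmetric and exact is an illustrative assumption; the study reports an association, and the `relative_gap` helper is hypothetical.

```python
# Reported association (Tambe, 2012): plus or minus one standard deviation in
# big data utilization goes with 1% to 3% higher or lower productivity than
# the average firm. Treating this as symmetric and exact is an assumption.

def relative_gap(pct):
    """Percent by which a +1 SD firm out-produces a -1 SD firm."""
    high = 1 + pct / 100  # productivity index of the +1 SD firm (average = 1)
    low = 1 - pct / 100   # productivity index of the -1 SD firm
    return (high / low - 1) * 100


for pct in (1, 3):
    print(f"{pct}% per SD -> gap of about {relative_gap(pct):.1f}%")
```

Under these assumptions, the gap between the two firms is roughly 2% at the low end of the reported range and roughly 6% at the high end.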
From Big Data 1.0 to Big Data 2.0
One way to think about the state of big data technologies is to draw an analogy with the business adoption of Internet technologies. In Web 1.0, businesses busied themselves with getting the basic Internet technologies in place, so that they could establish a web presence, build electronic commerce capability, and improve the efficiency of their operations. We can think of ourselves as being in the era of Big Data 1.0. Firms are busying themselves with building the capabilities to process large data, largely in support of their current operations, for example to improve efficiency.
Once firms had incorporated Web 1.0 technologies thoroughly (and in the process had driven down the prices of the underlying technology), they started to look further. They began to ask what the Web could do for them, and how it could improve things they had always done, and we entered the era of Web 2.0, where new systems and companies began taking advantage of the interactive nature of the Web. The changes brought on by this shift in thinking are pervasive; the most obvious are the incorporation of social-networking components and the rise of the "voice" of the individual consumer (and citizen).
We should expect a Big Data 2.0 phase to follow Big Data 1.0. Once firms have become capable of processing massive data in a flexible fashion, they should begin asking: "What can I now do that I couldn't do before, or do better than I could do before?" This is likely to be the golden era of data science. The principles and techniques we introduce in this book will be applied far more broadly and deeply than they are today.
It is important to note that in the Web 1.0 era, some precocious companies began applying Web 2.0 ideas far ahead of the mainstream. Amazon is a prime example, incorporating the consumer's "voice" early on, in the rating of products and in product reviews (and deeper, in the rating of product reviews). Similarly, we see some companies already applying Big Data 2.0. Amazon again is a company at the forefront, providing data-driven recommendations from massive data. There are other examples as well. Online advertisers must process extremely large volumes of data (billions of ad impressions per day is not unusual) and maintain a very high throughput (real-time bidding systems make decisions in tens of milliseconds). We should look to these and similar industries for hints at advances in big data and data science that subsequently will be adopted by other industries.
Data and Data Science Capability as a Strategic Asset
The prior sections suggest one of the fundamental principles of data science: data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets. Too many businesses regard data analytics as pertaining mainly to realizing value from some existing data, often without careful regard to whether the business has the appropriate analytical talent. Viewing these as assets allows us to think explicitly about the extent to which one should invest in them. Often, we don't have exactly the right data to best make decisions and/or the right talent to best support making decisions from the data. Further, thinking of these as assets should lead us to the realization that they are complementary. The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without suitable data science talent. As with all assets, it is often necessary to make investments. Building a top-notch data science team is a nontrivial undertaking, but it can make a huge difference for decision making. We will discuss strategic considerations involving data science in detail in Chapter 13. Our next case study will introduce the idea that thinking explicitly about how to invest in data assets very often pays off handsomely.
The classic story of little Signet Bank from the 1990s provides a case in point. Previously, in the 1980s, data science had transformed the business of consumer credit. Modeling the probability of default had changed the industry from personal assessment of the likelihood of default to strategies of massive scale and market share, which brought along concomitant economies of scale. It may seem strange now, but at the time, credit cards essentially had uniform pricing, for two reasons: (1) the companies did not have adequate information systems to deal with differential pricing at massive scale, and (2) bank management believed customers would not stand for price discrimination. Around 1990, two strategic visionaries (Richard Fairbanks and Nigel Morris) realized that information technology was powerful enough that they could do more sophisticated predictive modeling, using the sort of techniques that we discuss throughout this book, and offer different terms (nowadays: pricing, credit limits, low-initial-rate balance transfers, cash back, loyalty points, and so on). These two men had no success persuading the big banks to take them on as consultants and let them try. Finally, after running out of big banks, they succeeded in garnering the interest of a small regional Virginia bank: Signet Bank. Signet Bank's management was convinced that modeling profitability, not just default probability, was the right strategy. They knew that a small proportion of customers actually account for more than 100% of a bank's profit from credit card operations (because the rest are break-even or money-losing). If they could model profitability, they could make better offers to the best customers and "skim the cream" of the big banks' clientele.
But Signet Bank had one really big problem in implementing this strategy.
They did not have the appropriate data to model profitability with the goal of offering different terms to different customers. No one did. Since banks were offering credit with a specific set of terms and a specific default model, they had the data to model profitability (1) for the terms they actually had offered in the past, and (2) for the sort of customer who was actually offered credit (that is, those who were deemed worthy of credit by the existing model).
What could Signet Bank do? They brought into play a fundamental strategy of data science: acquire the necessary data at a cost. Once we view data as a business asset, we should think about whether and how much we are willing to invest. In Signet's case, data could be generated on the profitability of customers given different credit terms by conducting experiments. Different terms were offered at random to different customers. This may seem foolish outside the context of data-analytic thinking: you're likely to lose money! This is true. In this case, the losses are the cost of data acquisition. The data-analytic thinker needs to consider whether she expects the data to have sufficient value to justify the investment.
So what happened with Signet Bank? As you might expect, when Signet began randomly offering terms to customers for data acquisition, the number of bad accounts soared. Signet went from an industry-leading "charge-off" rate (2.9% of balances went unpaid) to almost 6% charge-offs. Losses continued for a few years while the data scientists worked to build predictive models from the data, evaluate them, and deploy them to improve profit.
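The data-acquisition experiment described above can be sketched as random assignment followed by a per-term average. This is a toy simulation under loudly stated assumptions: the term names, the profit figures, and the `simulated_profit` stand-in are invented for illustration and do not come from Signet.

```python
import random
from collections import defaultdict

# Hypothetical credit terms; the names and numbers below are invented.
TERMS = ["low_rate", "standard_rate", "high_limit"]


def simulated_profit(term, rng):
    """Stand-in for the unknown real-world profit of offering `term` to a customer."""
    mean = {"low_rate": -20.0, "standard_rate": 10.0, "high_limit": 40.0}[term]
    return rng.gauss(mean, 50.0)


def run_experiment(n_customers, seed=0):
    """Assign terms uniformly at random, then average the observed profit per term."""
    rng = random.Random(seed)
    observed = defaultdict(list)
    for _ in range(n_customers):
        term = rng.choice(TERMS)  # random assignment, independent of the customer
        observed[term].append(simulated_profit(term, rng))
    return {term: sum(values) / len(values) for term, values in observed.items()}


averages = run_experiment(30000)
for term in TERMS:
    print(term, round(averages[term], 1))
```

Because the terms are assigned at random, differences in average profit can be attributed to the terms rather than to which customers happened to receive them. Some terms lose money on average in this simulation; in the framing of the text, those losses are the price paid for the data.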
Because the firm viewed these losses as investments in data, they persisted despite complaints from stakeholders. Eventually, Signet's credit card operation
turned around and became so profitable that it was spun off to separate it from the bank's other operations, which were now overshadowing the consumer credit success.

Fairbanks and Morris became Chairman and CEO, and President and COO, and proceeded to apply data science principles throughout the business: not just to customer acquisition but to retention as well. When a customer calls looking for a better offer, data-driven models calculate the potential profitability of various possible actions (different offers, including sticking with the status quo), and the customer service representative's computer presents the best offers to make.

You may not have heard of little Signet Bank, but if you are reading this book you have probably heard of the spin-off: Capital One. Fairbanks and Morris's new company grew to be one of the largest credit card issuers in the industry, with one of the lowest charge-off rates. In 2000, the bank was reported to be carrying out 45,000 of these "scientific tests," as they called them.

Studies giving clear quantitative demonstrations of the value of a data asset are hard to find, primarily because firms are hesitant to divulge results of strategic value. One exception is a study by Martens and Provost (2011) assessing whether data on the specific transactions of a bank's consumers can improve models for deciding what product offers to make. The bank built models from data to decide whom to target with offers for different products. The investigation examined a number of different types of data and their effects on predictive performance. Sociodemographic data provide a substantial ability to model the sort of consumers that are more likely to purchase one product or another. However, sociodemographic data only go so far; after a certain volume of data, no additional advantage is conferred.
In contrast, detailed data on customers' individual (anonymized) transactions improve performance substantially over just using sociodemographic data. The relationship is clear and striking and, significantly for the point here, the predictive performance continues to improve as more data are used, increasing throughout the range investigated by Martens and Provost with no sign of abating. This has an important implication: banks with bigger data assets may have an important strategic advantage over their smaller competitors. If these trends generalize, and the banks are able to apply sophisticated analytics, banks with bigger data assets should be better able to identify the best customers for individual products. The net result will be either increased adoption of the bank's products, decreased cost of customer acquisition, or both.

3. You can read more about Capital One's story (Clemons & Thatcher, 1998; McNamee, 2001).

The idea of data as a strategic asset is certainly not limited to Capital One, nor even to the banking industry. Amazon was able to gather data early on online customers, which has created significant switching costs: consumers find value in the rankings and recommendations that Amazon provides. Amazon can therefore retain customers more easily, and can even charge a premium (Brynjolfsson & Smith, 2000). Harrah's casinos
famously collected and mined data on gamblers, and moved from being a small player in the casino business in the mid-1990s to the acquisition of Caesar's Entertainment in 2005 to become the world's largest gambling company. The huge valuation of Facebook has been credited to its vast and unique data assets (Sengupta, 2012), including both information about individuals and their likes and information about the structure of the social network. Information about network structure has been shown to be important for prediction, and remarkably helpful in building models of who will buy certain products (Hill, Provost, & Volinsky, 2006). It is clear that Facebook has a remarkable data asset; whether it has the right data science strategies to take full advantage of it is an open question.

In the book we will discuss in more detail many of the fundamental concepts behind these success stories, in exploring the principles of data mining and data-analytic thinking.
Data-Analytic Thinking

Analyzing case studies such as the churn problem improves our ability to approach problems "data-analytically." Promoting such a perspective is a primary goal of this book. When faced with a business problem, you should be able to assess whether and how data can improve performance. We will discuss a set of fundamental concepts and principles that facilitate careful thinking. We will develop frameworks to structure the analysis so that it can be done systematically.

As mentioned above, it is important to understand data science even if you never intend to do it yourself, because data analysis is now so critical to business strategy. Businesses are increasingly driven by data analytics, so there is great professional advantage in being able to interact competently with and within such businesses. Understanding the fundamental concepts, and having frameworks for organizing data-analytic thinking, not only allows one to interact competently, but also helps one to envision opportunities for improving data-driven decision-making, or to see data-oriented competitive threats.

Firms in many traditional industries are exploiting new and existing data resources for competitive advantage. They employ data science teams to bring advanced technologies to bear to increase revenue and decrease costs. In addition, many new companies are being developed with data mining as a key strategic component. Facebook and Twitter, along with many other "Digital 100" companies (Business Insider, 2012), have high valuations due primarily to the data assets they are committed to capturing or creating. Marketers must organize and understand data-driven campaigns, venture capitalists must be able to invest wisely in businesses with substantial data assets, and business strategists must be able to devise plans that exploit data.

4. Of course, this is not a new phenomenon. Amazon and Google are well-established companies that get tremendous value from their data assets.

As a few examples: if a consultant presents a proposal to exploit a data asset to improve your business, you should be able to assess whether the proposal makes sense. If a competitor announces a new data partnership, you should recognize when it may put you at a strategic disadvantage. Or, let's say you take a position with a venture firm and your first project is to assess the potential for investing in an advertising company. The founders present a convincing argument that they will realize significant value from a unique body of data they will collect, and on that basis are arguing for a substantially higher valuation. Is this reasonable? With an understanding of the fundamentals of data science, you should be able to devise a few probing questions to determine whether their valuation arguments are plausible.

On a scale less grand, but probably more common, data analytics projects reach into all business units. Employees throughout these units must interact with the data science team. If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making. As we will describe in the next chapter, this requires close interaction between the data scientists and the business people responsible for the decision-making. Firms where the business people do not understand what the data scientists are doing are at a substantial disadvantage, because they waste time and effort or, worse, because they ultimately make wrong decisions.
The need for managers with data analysis skills
The consulting firm McKinsey and Company estimates that "there will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions." (Manyika, 2011). Why 10 times as many managers and analysts as people with deep analytical skills? Surely data scientists aren't so difficult to manage that they need 10 managers each! The reason is that a business can get leverage from a data science team for making better decisions in multiple areas of the business. However, as McKinsey points out, the managers in those areas need to understand the fundamentals of data science to get that leverage effectively.
This Book

This book concentrates on the fundamentals of data science and data mining. These are a set of principles, concepts, and techniques that structure thinking and analysis. They allow us to understand data science processes and methods surprisingly deeply, without needing to focus in depth on the large number of specific data mining algorithms.

There are many good books covering data mining algorithms and techniques, from practical guides to mathematical and statistical treatments. This book instead focuses on the fundamental concepts and how they help us to think about problems where data mining may be brought to bear. That does not mean that we will ignore the data mining techniques; many algorithms are exactly the embodiment of the basic concepts. But with only a few exceptions we will not concentrate on the deep technical details of how the techniques actually work; we will try to provide just enough detail so that you will understand what the techniques do, and how they are based on the fundamental principles.
Data Mining and Data Science, Revisited

This book devotes a good deal of attention to the extraction of useful (nontrivial, hopefully actionable) patterns or models from large bodies of data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), and to the fundamental data science principles underlying such data mining. In our churn-prediction example, we would like to take the data on prior churn and extract patterns, for example patterns of behavior, that are useful: patterns that can help us to predict which customers are more likely to leave in the future, or that can help us to design better services.

The fundamental concepts of data science are drawn from many fields that study data analytics. We introduce these concepts throughout the book, but let's briefly discuss a few now to get the basic flavor. We will elaborate on all of these and more in later chapters.

Fundamental concept: Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages. The Cross Industry Standard Process for Data Mining, abbreviated CRISP-DM (CRISP-DM Project, 2000), is one codification of this process. Keeping such a process in mind provides a framework to structure our thinking about data analytics problems. For example, in actual practice one repeatedly sees analytical "solutions" that are not based on careful analysis of the problem or are not carefully evaluated. Structured thinking about analytics emphasizes these often under-appreciated aspects of supporting decision-making with data. Such structured thinking also contrasts critical points where human creativity is necessary with points where high-powered analytical tools can be brought to bear.
Fundamental concept: From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest. In our churn example, a customer would be an entity of interest, and each customer might be described by a large number of attributes, such as usage, customer service history, and many other factors. Which of these actually gives us information on the customer's likelihood of leaving the company when her contract expires? How much information? Sometimes this process is referred to roughly as finding variables that "correlate" with churn (we will discuss this notion precisely). A business analyst may be able to hypothesize some and test them, and there are tools to help facilitate this experimentation (see "Other Analytics Techniques and Technologies" on page 35). Alternatively, the analyst could apply information technology to automatically discover informative attributes, essentially doing large-scale automated experimentation. Further, as we will see, this concept can be applied recursively to build models to predict churn based on multiple attributes.

Fundamental concept: If you look too hard at a set of data, you will find something, but it might not generalize beyond the data you are looking at. This is referred to as overfitting a dataset. Data mining techniques can be very powerful, and the need to detect and avoid overfitting is one of the most important concepts to grasp when applying data mining to real problems. The concept of overfitting and its avoidance permeates data science processes, algorithms, and evaluation methods.

Fundamental concept: Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used. If our goal is the extraction of potentially useful knowledge, how can we formulate what is useful? It depends critically on the application in question.
For our churn-management example, how exactly are we going to use the patterns extracted from historical data? Should the value of the customer be taken into account in addition to the likelihood of leaving? More generally, does the pattern lead to better decisions than some reasonable alternative? How well would one have done by chance? How well would one do with a smart "default" alternative?

These are just four of the fundamental concepts of data science that we will explore. By the end of the book, we will have discussed a dozen such fundamental concepts in detail, and will have illustrated how they help us to structure data-analytic thinking and to understand data mining techniques and algorithms, as well as data science applications, quite generally.
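The overfitting concept above can be made concrete with a small simulation. The following is a minimal sketch, not a method from this book: the function names, the number of attributes, and the sample sizes are all illustrative assumptions. Both the labels and the attributes are pure coin flips, so no attribute can truly predict anything; yet searching through enough attributes finds one that appears to work on the data that was searched.

```python
import random


def accuracy(attribute, labels):
    """Fraction of individuals whose attribute value equals their label."""
    matches = sum(a == y for a, y in zip(attribute, labels))
    return matches / len(labels)


def overfitting_demo(n_attributes=200, n_train=20, n_holdout=1000, seed=0):
    """Search random attributes for the best 'predictor' of random labels.

    All sizes are hypothetical choices made for illustration only.
    """
    rng = random.Random(seed)
    n_total = n_train + n_holdout
    # Labels are coin flips: by construction nothing can predict them.
    labels = [rng.randint(0, 1) for _ in range(n_total)]
    # Each attribute is also a sequence of coin flips, unrelated to the labels.
    attributes = [[rng.randint(0, 1) for _ in range(n_total)]
                  for _ in range(n_attributes)]

    train_labels, holdout_labels = labels[:n_train], labels[n_train:]
    # "Look hard": keep the attribute that matches the training labels best.
    best = max(attributes, key=lambda a: accuracy(a[:n_train], train_labels))
    train_acc = accuracy(best[:n_train], train_labels)
    holdout_acc = accuracy(best[n_train:], holdout_labels)
    return train_acc, holdout_acc


if __name__ == "__main__":
    train_acc, holdout_acc = overfitting_demo()
    print(f"accuracy on the data we searched: {train_acc:.2f}")
    print(f"accuracy on held-out data:        {holdout_acc:.2f}")
```

The attribute chosen by the search looks informative on the 20 individuals it was selected on, but on held-out individuals it should do no better than guessing. Comparing the two numbers is a simple way to detect that the "pattern" does not generalize.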
Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist

Before proceeding, we should briefly revisit the engineering side of data science. At the time of this writing, discussions of data science commonly mention not just analytical skills and techniques for understanding data but also the popular tools used. Definitions of data scientists (and advertisements for positions) specify not just areas of expertise but also specific programming languages and tools. It is common to see job advertisements mentioning data mining techniques (e.g., random forests, support vector machines), specific application areas (recommendation systems, ad placement optimization), alongside popular software tools for processing big data (Hadoop, MongoDB). There is often little distinction between the science and the technology for dealing with large datasets.

We must point out that data science, like computer science, is a young field. The particular concerns of data science are fairly new, and general principles are just beginning to emerge. The state of data science may be likened to that of chemistry in the mid-19th century, when theories and general principles were being formulated and the field was largely experimental. Every good chemist had to be a competent lab technician. Similarly, it is hard to imagine a working data scientist who is not proficient with certain sorts of software tools.

Having said this, this book focuses on the science and not on the technology. You will not find instructions here on how best to run massive data mining jobs on Hadoop clusters, or even what Hadoop is or why you might want to learn about it.5 We focus here on the general principles of data science that have emerged. In 10 years' time the predominant technologies will likely have changed or advanced enough that a discussion here would be obsolete, while the general principles are the same as they were 20 years ago and will likely change little over the coming decades.
Summary

This book is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making. As the massive collection of data has spread through just about every industry sector and business unit, so have the opportunities for mining the data. Underlying the extensive body of techniques for mining data is a much smaller set of fundamental concepts comprising data science. These concepts are general and encapsulate much of the essence of data mining and business analytics.

Success in today's data-oriented business environment requires being able to think about how these fundamental concepts apply to particular business problems: to think data-analytically. For example, in this chapter we discussed the principle that data should be thought of as a business asset, and once we are thinking in this direction we start to ask whether (and how much) we should invest in data. Thus, an understanding of these fundamental concepts is important not only for data scientists themselves, but for anyone working with data scientists, employing data scientists, investing in data-heavy ventures, or directing the application of analytics in an organization.

Thinking data-analytically is aided by conceptual frameworks discussed throughout the book. For example, the automated extraction of patterns from data is a process with well-defined stages, which are the subject of a later chapter. Understanding the process and the stages helps to structure our data-analytic thinking, making it more systematic and therefore less prone to errors and omissions.

There is convincing evidence that data-driven decision-making and big data technologies substantially improve business performance. Data science supports data-driven decision-making, and sometimes conducts such decision-making automatically, and it depends upon technologies for "big data" storage and engineering, but its principles are separate. The data science principles we discuss in this book also differ from, and are complementary to, other important technologies, such as statistical hypothesis testing and database querying (which have their own books and classes). The next chapter describes some of these differences in more detail.

5. Hadoop is one of today's "big data" technologies for processing massive datasets that exceed the capacity of relational database systems. Hadoop is based on the MapReduce parallel processing framework introduced by Google.
CHAPTER 2
Business Problems and Data Science Solutions
Fundamental concepts: A set of canonical data mining tasks; The data mining process; Supervised versus unsupervised data mining.
An important principle of data science is that data mining is a process with fairly well-understood stages. Some involve the application of information technology, such as the automated discovery and evaluation of patterns from data, while others mostly require an analyst's creativity, business knowledge, and common sense. Understanding the whole process helps to structure data mining projects, so that they are closer to systematic analyses than to heroic endeavors driven by chance and individual acumen.

Since the data mining process breaks up the overall task of finding patterns from data into a set of well-defined subtasks, it is also useful for structuring discussions about data science. In this book, we will use the process as an overarching framework for our discussion. This chapter introduces the data mining process, but first we provide additional context by discussing common types of data mining tasks. Introducing these allows us to be more concrete when presenting the overall process, as well as when introducing other concepts in subsequent chapters.

We close the chapter by discussing a set of important business analytics subjects that are not the focus of this book (but for which there are many other helpful books), such as databases, data warehousing, and basic statistics.
From Business Problems to Data Mining Tasks

Each data-driven business decision-making problem is unique, comprising its own combination of goals, desires, constraints, and even personalities. As with much engineering, though, there are sets of common tasks that underlie the business problems. In collaboration with business stakeholders, data scientists decompose a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks. For example, our telecommunications churn problem is unique to MegaTelCo: there are specifics of the problem that are different from the churn problems of any other telecommunications firm. However, a subtask that will likely be part of the solution to any churn problem is to estimate from historical data the probability of a customer terminating her contract shortly after it has expired. Once the idiosyncratic MegaTelCo data have been assembled into a particular format (described in the next chapter), this probability estimation fits the mold of one very common data mining task. We know a lot about solving the common data mining tasks, both scientifically and practically. In later chapters, we will also provide data science frameworks to help with the decomposition of business problems and with the re-composition of the solutions to the subtasks.

A critical skill in data science is the ability to decompose a data-analytics problem into pieces such that each piece matches a known task for which tools are available. Recognizing familiar problems and their solutions avoids wasting time and resources reinventing the wheel. It also allows people to focus attention on the more interesting parts of the process that require human involvement: parts that have not been automated, so human creativity and intelligence must come into play.
Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different types of tasks these algorithms address. It is worth defining these tasks clearly. The next several chapters will use the first two (classification and regression) to illustrate several fundamental concepts. In what follows, the term "an individual" will refer to an entity about which we have data, such as a customer or a consumer, or it could be an inanimate entity such as a business. We will make this notion more precise in Chapter 3. In many business analytics projects, we want to find "correlations" between a particular variable describing an individual and other variables. For example, in historical data we may know which customers left the company after their contracts expired. We may want to find out which other variables correlate with a customer leaving in the near future. Finding such correlations is the most basic example of classification and regression tasks.

1. Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: "Among all the customers of MegaTelCo, which are likely to respond to a given offer?" In this example the two classes could be called will respond and will not respond.
For a classification task, a data mining procedure produces a model that, given a new individual, determines which class that individual belongs to. A closely related task is scoring or class probability estimation. A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that the individual belongs to each class. In our customer response scenario, a scoring model would be able to evaluate each individual customer and produce a score of how likely each is to respond to the offer. Classification and scoring are very closely related; as we shall see, a model that can do one can usually be modified to do the other.

2. Regression ("value estimation") attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. An example regression question would be: "How much will a given customer use the service?" The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage. A regression procedure produces a model that, given an individual, estimates the value of the particular variable specific to that individual. Regression is related to classification, but the two are different. Informally, classification predicts whether something will happen, whereas regression predicts how much something will happen. The difference will become clearer as the book progresses.

3. Similarity matching attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities.
For example, IBM is interested in finding companies similar to its best business customers, in order to focus its sales force on the best opportunities. It uses similarity matching based on "firmographic" data describing characteristics of the companies. Similarity matching is the basis for one of the most popular methods for making product recommendations (finding people who are similar to you in terms of the products they have liked or purchased). Similarity measures underlie certain solutions to other data mining tasks, such as classification, regression, and clustering. We discuss similarity and its uses at length in Chapter 6.

4. Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose. An example clustering question would be: "Do our customers form natural groups or segments?" Clustering is useful in preliminary domain exploration to see which natural groups exist, because these groups in turn may suggest other data mining tasks or approaches. Clustering is also used as input to decision-making processes focusing on questions such as: What products should we offer or develop? How should our customer care teams (or sales teams) be structured? We discuss clustering in depth in Chapter 6.

5. Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them. An example co-occurrence question
would be: What items are commonly purchased together? While clustering looks at similarity between objects based on the objects' attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions. For example, analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect. Deciding how to act upon this discovery might require some creativity, but it could suggest a special promotion, product display, or combination offer. Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis. Some recommendation systems also perform a type of affinity grouping by finding, for example, pairs of books that are purchased frequently by the same people ("people who bought X also bought Y"). The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.

6. Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population. An example profiling question would be: "What is the typical cell phone usage of this customer segment?" Behavior may not have a simple description; profiling cell phone usage might require a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, and so on. Behavior can be described generally over an entire population, or down to the level of small groups or even individuals.
Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions into computer systems (such as someone breaking into your iTunes account). For example, if we know what kind of purchases a person typically makes on a credit card, we can determine whether a new charge on the card fits that profile or not. We can use the degree of mismatch as a suspicion score and issue an alarm if it is too high.

7. Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Link prediction is common in social networking systems: "Since you and Karen share 10 friends, maybe you'd like to be Karen's friend?" Link prediction can also estimate the strength of a link. For example, for recommending movies to customers one can think of a graph between customers and the movies they have watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.

8. Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. The smaller dataset may be easier to deal with or to process. Moreover, the smaller dataset may better reveal the information. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset
revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences). Data reduction usually involves loss of information. What is important is the trade-off for improved insight.

9. Causal modeling attempts to help us understand what events or actions actually influence others. For example, consider that we use predictive modeling to target advertisements to consumers, and we observe that the targeted consumers indeed purchase at a higher rate after having been targeted. Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway? Techniques for causal modeling include those involving a substantial investment in data, such as randomized controlled experiments (e.g., so-called "A/B tests"), as well as sophisticated methods for drawing causal conclusions from observational data. Both experimental and observational methods for causal modeling can generally be viewed as "counterfactual" analysis: they attempt to understand what would be the difference between the situations (which cannot both happen) where the "treatment" event (e.g., showing an advertisement to a particular individual) were to happen, and were not to happen. In all cases, a careful data scientist should always include with a causal conclusion the exact assumptions that must be made in order for the causal conclusion to hold (there are always such assumptions; always ask). When undertaking causal modeling, a business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
Even in the most careful randomized, controlled experimentation, assumptions are made that could render the causal conclusions invalid. The discovery of the "placebo effect" in medicine illustrates a notorious situation where an assumption was overlooked in carefully designed randomized experimentation.

Discussing all of these tasks in detail would fill multiple books. In this book, we present a collection of the most fundamental data science principles, principles that together underlie all of these types of tasks. We will illustrate the principles mainly using classification, regression, similarity matching, and clustering, and will discuss others when they provide important illustrations of the fundamental principles (toward the end of the book).

Consider which of these types of tasks might fit the churn-prediction problem. Often, practitioners formulate churn prediction as a problem of finding segments of customers who are more or less likely to leave. This segmentation problem sounds like a classification problem, or possibly a clustering problem, or even a regression problem. To decide the best formulation, we first need to introduce some important distinctions.
Supervised and Unsupervised Methods
Consider two similar questions we might ask about a customer population. The first is: "Do our customers naturally fall into different groups?" Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is referred to as unsupervised. Contrast this with a slightly different question: "Can we find groups of customers who have particularly high likelihoods of canceling their service soon after their contracts expire?" Here there is a specific target defined: will a customer leave when their contract expires? In this case, segmentation is being done for a specific reason: to take action based on the likelihood of churn. This is called a supervised data mining problem.
A note on the terms: supervised and unsupervised learning
The terms supervised and unsupervised were inherited from the field of machine learning. Metaphorically, a teacher "supervises" the learner by carefully providing target information along with a set of examples. An unsupervised learning task might involve the same set of examples but would not include the target information. The learner would be given no information about the purpose of the learning, but would be left to form its own conclusions about what the examples have in common.
The difference between these questions is subtle but important. If a specific target can be provided, the problem can be phrased as a supervised one. Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful. A supervised technique is given a specific purpose for the grouping: predicting the target. Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities are meaningful or will be useful for any particular purpose.

Technically, another condition must be met for supervised data mining: there must be data on the target. It is not enough that the target information exist in principle; it must also exist in the data. For example, it might be useful to know whether a given customer will stay for at least six months, but if in the historical data this retention information is missing or incomplete (if, say, the data are only retained for two months), the target values cannot be provided. Acquiring data on the target often is a key data science investment. The value of the target variable for an individual is often called the individual's label, emphasizing that often (not always) one must incur expense to actively label the data.

Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised.
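The requirement that target data actually exist in the data can be made concrete with a small sketch. Assuming hypothetical customer records with a signup day, an optional cancellation day, and the last day covered by the retained history, the function below assigns the label "stayed at least six months" only when the history is long enough to know the answer. The field names and the 183-day approximation of six months are illustrative assumptions.

```python
SIX_MONTHS_DAYS = 183  # assumed approximation of six months

def retention_label(signup_day, cancel_day, last_observed_day):
    """Return True/False if the six-month outcome is known, else None.

    All arguments are day numbers (ints); cancel_day is None when no
    cancellation appears in the retained data.
    """
    if cancel_day is not None:
        return (cancel_day - signup_day) >= SIX_MONTHS_DAYS
    # No cancellation seen: the customer is known to have stayed only if
    # the retained history covers the full six months.
    if (last_observed_day - signup_day) >= SIX_MONTHS_DAYS:
        return True
    return None  # the outcome is not in the data, so no label can be provided

records = [
    {"id": "a", "signup": 0, "cancel": 40, "last_seen": 400},
    {"id": "b", "signup": 0, "cancel": None, "last_seen": 400},
    {"id": "c", "signup": 350, "cancel": None, "last_seen": 400},  # under two months of history
]
labels = {r["id"]: retention_label(r["signup"], r["cancel"], r["last_seen"]) for r in records}
print(labels)
```

Records that come back as `None` cannot serve as training examples for a supervised method, however useful the target would be in principle.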
The fundamental principles of data mining that we will present underlie all these types of technique.

Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target. Regression involves a numeric target, while classification involves a categorical (often binary) target. Consider these similar questions we might address with supervised data mining:

"Will this customer purchase service S1 if given incentive I?" This is a classification problem because it has a binary target (the customer either purchases or does not).

"Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?" This is also a classification problem, with a three-valued target.

"How much will this customer use the service?" This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.

There are subtleties among these questions that should be brought out. For business applications we often want a numerical prediction over a categorical target. In the churn example, a basic yes/no prediction of whether a customer is likely to continue to subscribe to the service may not be sufficient; we want to model the probability that the customer will continue. This is still considered classification modeling rather than regression because the underlying target is categorical. Where necessary for clarity, this is called "class probability estimation."

A vital part of the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining (and for which we can obtain values for some example data). We will return to this in Chapter 3.
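To make the three kinds of prediction concrete, here is a minimal sketch that uses a single, deliberately simple technique (k-nearest neighbors) to produce a class prediction, a class probability estimate, and a numeric regression estimate. The technique is chosen only for brevity, and the feature layout (tenure in months, number of support calls), field names, and data values are hypothetical.

```python
def nearest_neighbors(train, query, k):
    """Return the k training rows closest to query (squared Euclidean distance)."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row["x"], query))
    return sorted(train, key=dist)[:k]

def predict_class(train, query, k=3):
    """Classification: majority vote over a categorical (here binary) target."""
    votes = [row["churned"] for row in nearest_neighbors(train, query, k)]
    return max(set(votes), key=votes.count)

def predict_class_probability(train, query, k=3):
    """Class probability estimation: fraction of neighbors that churned."""
    neighbors = nearest_neighbors(train, query, k)
    return sum(1 for row in neighbors if row["churned"]) / len(neighbors)

def predict_usage(train, query, k=3):
    """Regression: average of a numeric target (minutes of usage)."""
    neighbors = nearest_neighbors(train, query, k)
    return sum(row["minutes"] for row in neighbors) / len(neighbors)

# Hypothetical training data: x = (tenure in months, support calls).
train = [
    {"x": (2, 5), "churned": True, "minutes": 120},
    {"x": (3, 4), "churned": True, "minutes": 150},
    {"x": (4, 6), "churned": False, "minutes": 200},
    {"x": (24, 0), "churned": False, "minutes": 600},
    {"x": (30, 1), "churned": False, "minutes": 650},
    {"x": (36, 0), "churned": False, "minutes": 700},
]
query = (3, 5)
print(predict_class(train, query))              # categorical answer
print(predict_class_probability(train, query))  # numeric score over a categorical target
print(predict_usage(train, query))              # numeric target
```

The second function still addresses a classification problem: its output is a number, but the underlying target (churned or not) is categorical.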
Data Mining and Its Results
There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct.

In our churn example, consider the deployment scenario in which the results will be used. We want to use the model to predict which of our customers will leave. Specifically, assume that data mining has created a class probability estimation model M. Given each existing customer, described using a set of characteristics, M takes these characteristics as input and produces a score or probability estimate of attrition. This is the use of the results of data mining. The data mining produces the model M from some other, often historical, data.

Figure 2-1 illustrates these two phases. Data mining produces the probability estimation model, as shown in the top half of the figure. In the use phase (bottom half), the model is applied to a new, unseen case and it generates a probability estimate for it.

Figure 2-1. Data mining versus the use of data mining results. The upper half of the figure illustrates the mining of historical data to produce a model. Importantly, the historical data have the target ("class") value specified. The bottom half shows the result of the data mining in use, where the model is applied to new data for which we do not know the class value. The model predicts both the class value and the probability that the class variable will take on that value.
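The two phases can be sketched in a few lines. In the hypothetical example below, the mining phase builds a very simple class probability estimation model M (Laplace-smoothed churn frequencies per plan type) from labeled historical data, and the use phase applies M to a new customer whose outcome is unknown. The model form, field names, and data are assumptions chosen for brevity, not the modeling method the book develops later.

```python
from collections import defaultdict

def mine_model(historical):
    """Mining phase: estimate churn probability per segment from labeled history."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [churned count, total count]
    for row in historical:
        counts[row["plan"]][0] += row["churned"]
        counts[row["plan"]][1] += 1
    # Laplace smoothing keeps estimates away from exactly 0 or 1 on small samples.
    return {seg: (c + 1) / (n + 2) for seg, (c, n) in counts.items()}

def use_model(model, customer, default=0.5):
    """Use phase: score a new customer; no class value is needed or available."""
    return model.get(customer["plan"], default)

historical = [
    {"plan": "basic", "churned": 1},
    {"plan": "basic", "churned": 1},
    {"plan": "basic", "churned": 0},
    {"plan": "premium", "churned": 0},
    {"plan": "premium", "churned": 0},
]
M = mine_model(historical)              # built once, from data with known outcomes
new_customer = {"plan": "basic"}        # note: no "churned" field
print(use_model(M, new_customer))
```

Keeping `mine_model` and `use_model` as separate functions mirrors the distinction in the text: the first needs labeled historical data, while the second needs only the model and the new case.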
The Data Mining Process
Data mining is a craft. It involves the application of a substantial amount of science and technology, but the proper application still involves art as well. But as with many mature crafts, there is a well-understood process that places a structure on the problem, allowing reasonable consistency, repeatability, and objectiveness. A useful codification of the data mining process is given by the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000), illustrated in Figure 2-2.[1]
Figure 2-2. The CRISP data mining process.

This process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure. Often the entire process is an exploration of the data, and after the first iteration the data science team knows much more. The next iteration can be much better informed. Let us now discuss the steps in detail.
1. See also the Wikipedia page on the CRISP-DM process model.

Business Understanding
Initially, it is vital to understand the problem to be solved. This may seem obvious, but business projects seldom come pre-packaged as clear and unambiguous data mining problems. Often recasting the problem and designing a solution is an iterative process of discovery. The diagram shown in Figure 2-2 represents this as cycles within a cycle, rather than as a simple linear process. The initial formulation may not be complete or optimal, so multiple iterations may be necessary for an acceptable solution formulation to appear.

The Business Understanding stage represents a part of the craft where the analysts' creativity plays a large role. Data science has some things to say, as we will describe, but often the key to a great success is a creative problem formulation by some analyst regarding how to cast the business problem as one or more data science problems. High-level knowledge of the fundamentals helps creative business analysts see novel formulations.

We have a set of powerful tools to solve particular data mining problems: the basic data mining tasks discussed in "From Business Problems to Data Mining Tasks." Typically, the early stages of the endeavor involve designing a solution that takes advantage of these tools. This can mean structuring (engineering) the problem such that one or more subproblems involve building models for classification, regression, probability estimation, and so on.

In this first stage, the design team should think carefully about the problem to be solved and about the use scenario. This itself is one of the most important fundamental principles of data science, to which we have devoted two entire chapters (Chapter 7 and Chapter 11). What exactly do we want to do? How exactly would we do it? What parts of this use scenario constitute possible data mining models? In discussing this in more detail, we will begin with a simplified view of the use scenario, but as we go forward we will loop back and realize that often the use scenario must be adjusted to better reflect the actual business need. We will present conceptual tools to help our thinking here; for example, framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks.
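As a preview of the expected value framing mentioned above, the following sketch shows how a targeting decision might separate into a probability (something a data mining model could estimate) and outcome values (something the business must supply). The offer cost, profit, and probabilities are hypothetical numbers chosen for illustration.

```python
def expected_value_of_targeting(p_response, value_if_response, value_if_no_response):
    """Expected value = p * v_response + (1 - p) * v_no_response."""
    return p_response * value_if_response + (1 - p_response) * value_if_no_response

# Hypothetical numbers: an offer costs $1 to send and yields $100 profit if accepted.
cost, profit = 1.0, 100.0
for p in (0.005, 0.01, 0.05):
    ev = expected_value_of_targeting(p, profit - cost, -cost)
    print(f"p={p:.3f} expected value per targeted customer = {ev:+.2f}")
```

Under these assumed numbers the break-even response probability is 1%. The decomposition suggests where data mining fits: estimating `p_response` per customer is a class probability estimation task, while the two values come from business knowledge.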
Data Understanding
If solving the business problem is the goal, the data comprise the available raw material from which the solution will be built. It is important to understand the strengths and limitations of the data because rarely is there an exact match with the problem. Historical data often are collected for purposes unrelated to the current business problem, or for no explicit purpose at all. A customer database, a transaction database, and a marketing response database contain different information, may cover different intersecting populations, and may have varying degrees of reliability.

It is also common for the costs of data to vary. Some data will be available virtually for free, while others will require effort to obtain. Some data may be purchased. Still other data simply will not exist and will require entire ancillary projects to arrange their collection. A critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited. Even after all datasets are acquired, collating them may require additional effort. For example, customer records and product identifiers are notoriously variable and noisy. Cleaning and matching customer records to ensure only one record per customer is itself a complicated analytics problem (Hernández & Stolfo, 1995; Elmagarmid, Ipeirotis, & Verykios, 2007).

As data understanding progresses, solution paths may change direction in response, and team efforts may even fork. Fraud detection provides an illustration of this. Data mining has been used extensively for fraud detection, and many fraud detection problems involve classic supervised data mining tasks. Consider the task of catching credit card fraud. Charges show up on each customer's account, so fraudulent charges are usually caught, if not initially by the company, then later by the customer when account activity is reviewed. We can assume that nearly all fraud is identified and reliably labeled, since the legitimate customer and the person perpetrating the fraud are different people and have opposite goals. Thus credit card transactions have reliable labels (fraud and legitimate) that may serve as targets for a supervised technique.

Now consider the related problem of catching Medicare fraud. This is a huge problem in the United States, costing billions of dollars annually. Though this may seem like a conventional fraud detection problem, as we consider the relationship of the business problem to the data, we realize that the problem is significantly different. The perpetrators of fraud (medical providers who submit false claims, and sometimes their patients) are also legitimate service providers and users of the billing system. Those who commit fraud are a subset of the legitimate users; there is no separate disinterested party who will declare exactly what the "correct" charges should be. Consequently the Medicare billing data have no reliable target variable indicating fraud, and a supervised learning approach that could work for credit card fraud is not applicable. Such a problem usually requires unsupervised approaches such as profiling, clustering, anomaly detection, and co-occurrence grouping.

The fact that both of these are fraud detection problems is a superficial similarity that is actually misleading. In data understanding we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply. It is not unusual for a business problem to contain several data mining tasks, often of different types, and combining their solutions will be necessary (see Chapter 11).
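When no reliable fraud labels exist, one unsupervised starting point is to flag providers whose billing deviates strongly from that of their peers. The sketch below uses a robust z-score based on the median and the median absolute deviation (MAD). The provider names, amounts, and the 3.5 threshold are illustrative assumptions, and a flagged provider is only a candidate for review, not a confirmed case of fraud.

```python
import statistics

def flag_anomalies(amounts_by_provider, threshold=3.5):
    """Return providers whose robust z-score exceeds the threshold."""
    values = list(amounts_by_provider.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread among peers, so nothing stands out
    flagged = []
    for provider, amount in amounts_by_provider.items():
        # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
        robust_z = 0.6745 * (amount - median) / mad
        if abs(robust_z) > threshold:
            flagged.append(provider)
    return flagged

# Hypothetical average billed amount per claim for six providers.
billing = {"P1": 100, "P2": 110, "P3": 95, "P4": 105, "P5": 102, "P6": 480}
print(flag_anomalies(billing))
```

Note that no target variable appears anywhere in this code; that is exactly what distinguishes it from the supervised approach available for credit card fraud.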
Data Preparation
The analytic technologies that we can bring to bear are powerful, but they impose certain requirements on the data they use. They often require data to be in a form different from how the data are provided naturally, and some conversion will be necessary. Therefore a data preparation phase often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results.

Typical examples of data preparation are converting data to tabular format, removing or inferring missing values, and converting data to different types. Some data mining techniques are designed for symbolic and categorical data, while others handle only numeric values. In addition, numerical values must often be normalized or scaled so that they are comparable. Standard techniques and rules of thumb are available for doing such conversions. Chapter 3 discusses the most typical format for mining data in some detail.

In general, though, this book will not focus on data preparation techniques, which could be the topic of a book by themselves (Pyle, 1999). We will define basic data formats in following chapters, and will only be concerned with data preparation details when they shed light on some fundamental principle of data science or are necessary to present a concrete example.

More generally, data scientists may spend considerable time early in the process defining the variables used later in the process. This is one of the main points at which human creativity, common sense, and business knowledge come into play. Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables (and sometimes it can be surprisingly hard for them to admit it).
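The typical conversions listed above can be sketched on a toy record set. This minimal example infers a missing numeric value with the mean, rescales the numeric field to the range 0 to 1, and converts a categorical field into indicator columns so the result is a purely numeric table. The field names and the specific choices (mean imputation, min-max scaling, one-hot encoding) are assumptions for illustration; other choices are equally common.

```python
def prepare(rows, numeric_field, categorical_field):
    """Impute, scale, and one-hot encode two fields of a list of dict records."""
    # 1. Infer missing numeric values using the mean of the observed values.
    observed = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    mean = sum(observed) / len(observed)
    filled = [r[numeric_field] if r[numeric_field] is not None else mean for r in rows]

    # 2. Min-max scale to [0, 1] so numeric features are comparable.
    lo, hi = min(filled), max(filled)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    scaled = [(v - lo) / span for v in filled]

    # 3. One-hot encode the categorical field for techniques that need numbers.
    categories = sorted({r[categorical_field] for r in rows})
    table = []
    for r, s in zip(rows, scaled):
        one_hot = [1 if r[categorical_field] == c else 0 for c in categories]
        table.append([s] + one_hot)
    header = [numeric_field] + [f"{categorical_field}={c}" for c in categories]
    return header, table

rows = [
    {"age": 20, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 40, "plan": "basic"},
]
header, table = prepare(rows, "age", "plan")
print(header)
for line in table:
    print(line)
```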
One very general and important concern during data preparation is to beware of "leaks" (Kaufman et al. 2012). A leak is a situation where a variable collected in historical data gives information on the target variable, information that appears in historical data but is not actually available when the decision has to be made. As an example, when predicting whether at a particular point in time a website visitor would end the session or continue surfing to another page, the variable "total number of webpages visited in the session" is predictive. However, the total number of webpages visited in the session would not be known until after the session was over (Kohavi et al., 2000), at which point one would know the value of the target variable! As another illustrative example, consider predicting whether a customer will be a "big spender"; knowing the categories of the items purchased (or worse, the amount of tax paid) is very predictive, but these are not known at decision-making time (Kohavi & Parekh, 2003). Leakage must be considered carefully during data preparation, because data preparation typically is performed after the fact, from historical data. We present a more detailed example of a real leak that was challenging to find in Chapter 14.
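One simple guard against leaks is to record, for every candidate variable, when its value becomes known, and to admit only variables that are known at decision time. The sketch below assumes a hypothetical hand-maintained catalog for the website-session example; the variable names and the two availability categories are illustrative.

```python
# Hypothetical feature catalog: when does each variable's value become known?
FEATURE_AVAILABILITY = {
    "pages_viewed_so_far": "at_decision_time",
    "referrer_type": "at_decision_time",
    "total_pages_in_session": "after_outcome",  # only known once the session ends
}

def usable_features(candidates, availability=FEATURE_AVAILABILITY):
    """Split candidate variables into those safe to use and potential leaks."""
    usable, leaked = [], []
    for name in candidates:
        # Variables of unknown provenance are treated as potential leaks
        # until someone has reviewed them by hand.
        if availability.get(name) == "at_decision_time":
            usable.append(name)
        else:
            leaked.append(name)
    return usable, leaked

usable, leaked = usable_features(
    ["pages_viewed_so_far", "total_pages_in_session", "mystery_column"]
)
print("use:", usable)
print("review or drop:", leaked)
```

A catalog like this does not detect leaks automatically; it only forces the question "would we know this when the decision is made?" to be asked of every variable.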
Modeling
Modeling is the subject of the next several chapters and we will not dwell on it here, except to say that the output of modeling is some sort of model or pattern capturing regularities in the data. The modeling stage is the primary place where data mining techniques are applied to the data. It is important to have some understanding of the fundamental ideas of data mining, including the sorts of techniques and algorithms that exist, because this is the part of the craft where the most science and technology can be brought to bear.
Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on. If we look hard enough at any dataset we will find patterns, but they may not survive careful scrutiny. We would like to have confidence that the models and patterns extracted from the data are true regularities and not just idiosyncrasies or sample anomalies. It is possible to deploy results immediately after data mining, but this is inadvisable; it is usually far easier, cheaper, quicker, and safer to test a model first in a controlled laboratory setting.

Equally important, the evaluation stage also serves to help ensure that the model satisfies the original business goals. Recall that the primary goal of data science for business is to support decision making, and that we started the process by focusing on the business problem we would like to solve. Usually a data mining solution is only a piece of the larger solution, and it needs to be evaluated as such. Further, even if a model passes strict evaluation tests "in the lab," there may be external considerations that make it impractical. For example, a common flaw with detection solutions (such as fraud detection, spam detection, and intrusion monitoring) is that they produce too many false alarms. A model may be extremely accurate (> 99%) by laboratory standards, but evaluation in the actual business context may reveal that it still produces too many false alarms to be economically feasible. (How much would it cost to provide the staff to deal with all those false alarms? What would be the cost in customer dissatisfaction?)

Evaluating the results of data mining includes both quantitative and qualitative assessments. Various stakeholders have interests in the business decision making that will be accomplished or supported by the resultant models. In many cases, these stakeholders need to "sign off" on the deployment of the models, and in order to do so need to be satisfied by the quality of the model's decisions. What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes.[2] To facilitate such qualitative assessment, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists). And if the model itself is not comprehensible (e.g., maybe the model is a very complex mathematical formula), how can the data scientists work to make the behavior of the model comprehensible?

Finally, a comprehensive evaluation framework is important because getting detailed information on the performance of a deployed model may be difficult or impossible. Often there is only limited access to the deployment environment, so making a comprehensive evaluation "in production" is difficult. Deployed systems typically contain many "moving parts," and assessing the contribution of a single part is difficult. Firms with sophisticated data science teams wisely build testbed environments that mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.

Nonetheless, in some cases we may want to extend evaluation into the development environment, for example by instrumenting a live system to be able to conduct randomized experiments. In our churn example, if we have decided from laboratory tests that a data mined model will give us better churn reduction, we may want to move on to an "in vivo" evaluation, in which a live system randomly applies the model to some customers while keeping other customers as a control group (recall our discussion of causal modeling from Chapter 1). Such experiments must be designed carefully, and the technical details are beyond the scope of this book. The interested reader could start with the lessons-learned articles by Ron Kohavi and his coauthors (Kohavi et al., 2007, 2009, 2012).

We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model's decision making. For example, behavior can change, in some cases, like fraud or spam, in direct response to the deployment of models. Additionally, the output of the model is critically dependent on the input data; input data can change in format and in substance, often without any alerting of the data science team. Raeder et al. (2012) present a detailed discussion of system design to help deal with these and other related evaluation-in-deployment issues.
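The false-alarm concern raised in this section comes down to simple arithmetic once the rarity of the event is taken into account. The sketch below works through a hypothetical case: one million transactions, 0.1% of them fraudulent, and a model that is right 99% of the time on both fraudulent and legitimate transactions. All of these numbers are assumptions chosen to illustrate the point.

```python
def alarm_breakdown(n_cases, fraud_rate, sensitivity, specificity):
    """Return (true_alarms, false_alarms, precision) for a detection model."""
    n_fraud = n_cases * fraud_rate
    n_legit = n_cases - n_fraud
    true_alarms = n_fraud * sensitivity          # fraud correctly flagged
    false_alarms = n_legit * (1 - specificity)   # legitimate cases flagged anyway
    precision = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, precision

true_alarms, false_alarms, precision = alarm_breakdown(1_000_000, 0.001, 0.99, 0.99)
print(f"true alarms:  {true_alarms:,.0f}")
print(f"false alarms: {false_alarms:,.0f}")
print(f"share of alarms that are real fraud: {precision:.1%}")
```

Under these assumptions the model is 99% accurate overall, yet roughly ten out of every eleven alarms are false. Whether that is acceptable depends on the cost of investigating each alarm, which is a business question rather than a laboratory one.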
2. For example, in one data mining project a model was created to diagnose problems in local phone networks, and to dispatch technicians to the likely site of the problem. Before deployment, a team of phone company stakeholders requested that the model be tweaked so that exceptions were made for hospitals.

Deployment
In deployment the results of data mining, and increasingly the data mining techniques themselves, are put into real use in order to realize some return on investment. The clearest cases of deployment involve implementing a predictive model in some information system or business process. In our churn example, a model for predicting the likelihood of churn could be integrated with the business process for churn management, for example, by sending special offers to customers who are predicted to be particularly at risk. (We will discuss this in increasing detail as the book proceeds.) A new fraud detection model may be built into a workforce management information system, to monitor accounts and create "cases" for fraud analysts to examine.

Increasingly, the data mining techniques themselves are deployed. For example, for targeting online advertisements, systems are deployed that automatically build (and test) models in production when a new advertising campaign is presented. Two main reasons for deploying the data mining system itself, rather than the models produced by a data mining system, are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for its data science team to manually curate each model individually. In these cases, it may be best to deploy the data mining phase into production. In doing so, it is critical to instrument the process to alert the data science team of any seeming anomalies and to provide fail-safe operation (Raeder et al., 2012).

Deployment can also be much less "technical." In a celebrated case, data mining discovered a set of rules that could help to quickly diagnose and fix a common error in industrial printing. The deployment succeeded simply by taping a sheet of paper containing the rules to the side of the printers (Evans & Fisher, 2002). Deployment can also be much more subtle, such as a change to data acquisition procedures, or a change to strategy, marketing, or operations resulting from insight gained from mining the data.
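The advice to instrument a deployed process and provide fail-safe operation can be illustrated with a small wrapper around a scoring function. In this sketch the function name `score_with_failsafe`, the fallback score, and the use of a plain list to collect alerts are all assumptions for illustration; a production system would route alerts to whatever monitoring the organization actually uses.

```python
def score_with_failsafe(model, record, expected_fields, fallback_score, alerts):
    """Score a record; on bad input or model failure, alert and use a fallback."""
    missing = [f for f in expected_fields if record.get(f) is None]
    if missing:
        alerts.append(f"missing fields {missing}; used fallback score")
        return fallback_score
    try:
        return model(record)
    except Exception as exc:  # fail safe: scoring should not crash the business process
        alerts.append(f"model error: {exc!r}; used fallback score")
        return fallback_score

# Hypothetical churn model: more support calls means higher churn risk.
def churn_model(record):
    return min(1.0, 0.1 * record["support_calls"])

alerts = []
ok_score = score_with_failsafe(churn_model, {"support_calls": 3}, ["support_calls"], 0.5, alerts)
bad_score = score_with_failsafe(churn_model, {"support_calls": None}, ["support_calls"], 0.5, alerts)
print(ok_score, bad_score, alerts)
```

The point of the design is that a change in the input data, which often happens without any notice to the data science team, produces an alert and a safe default rather than a silent wrong score or an outage.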
Deploying a model into a production system typically requires that the model be re-coded for the production environment, usually for greater speed or compatibility with an existing system. This may incur substantial expense and investment. In many cases, the data science team is responsible for producing a working prototype, along with its evaluation. These are passed to a development team.

Practically speaking, there are risks with "over the wall" transfers from data science to development. It may be helpful to remember the maxim: "Your model is not what the data scientists design, it's what the engineers build." From a management perspective, it is advisable to have members of the development team involved early on in the data science project. They can begin as advisors, providing critical insight to the data science team. Increasingly in practice, these particular developers are "data science engineers," software engineers who have particular expertise both in the production systems and in data science. These developers gradually assume more responsibility as the project matures. At some point the developers will take the lead and assume ownership of the product. Generally, the data scientists should still remain involved in the project into final deployment, as advisors or as developers depending on their skills.
Regardless of whether deployment is successful, the process often returns to the Business Understanding phase. The process of mining data produces a great deal of insight into the business problem and the difficulties of its solution. A second iteration can yield an improved solution. Just the experience of thinking about the business, the data, and the performance goals often leads to new ideas for improving business performance, and even to new lines of business or new ventures.

Note that it is not necessary to fail in deployment to start the cycle again. The Evaluation stage may reveal that results are not good enough to deploy, and that we need to adjust the problem definition or get different data. This is represented by the "shortcut" link from Evaluation back to Business Understanding in the process diagram. In practice, there should be shortcuts back from each stage to each prior one, because the process always retains some exploratory aspects, and a project should be flexible enough to revisit prior steps based on discoveries made.[3]
3. Software professionals may recognize the similarity to the philosophy of "Fail faster to succeed sooner" (Muoio, 1997).

Implications for Managing the Data Science Team
It is tempting, but usually a mistake, to view the data mining process as a software development cycle. Indeed, data mining projects are often treated and managed as engineering projects, which is understandable when they are initiated by software departments, with data generated by a large software system and analytics results fed back into it. Managers are usually familiar with software technologies and are comfortable managing software projects. Milestones can be agreed upon and success is usually unambiguous. Software managers might look at the CRISP data mining cycle (Figure 2-2) and think it looks comfortably similar to a software development cycle, so they should be right at home managing an analytics project the same way.

This can be a mistake because data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem. Engineering a data mining solution directly for deployment can be an expensive premature commitment. Instead, analytics projects should prepare to invest in information to reduce uncertainty in various ways. Small investments can be made via pilot studies and throwaway prototypes. Data scientists should review the literature to see what else has been done and how it has worked. On a larger scale, a team can invest substantially in building experimental testbeds to allow extensive agile experimentation. If you are a software manager, this will look more like research and exploration than you are used to, and maybe more than you are comfortable with.
Software skills versus analytical skills
Although data mining involves software, it also requires skills that may not be common among programmers. In software engineering, the ability to write efficient, high-quality code from requirements may be paramount. Team members may be evaluated using software metrics such as the amount of code written or the number of bug tickets closed. In analytics, it is more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results. In building a data science team, these qualities, rather than traditional software engineering expertise, are the skills that should be sought.
Other Analytics Techniques and Technologies
Business analytics involves the application of various technologies to the analysis of data. Many of these go beyond this book's focus on data-analytic thinking and the principles of extracting useful patterns from data. Nonetheless, it is important to be acquainted with these related techniques, to understand what their goals are, what role they play, and when it may be beneficial to consult experts in them. To this end, we present six groups of related analytic techniques. Where appropriate we draw comparisons and contrasts with data mining. The main difference is that data mining focuses on the automated search for knowledge, patterns, or regularities from data.[4]
4. It is important to keep in mind that it is rare for the discovery to be completely automated. The important factor is that data mining automates the search and discovery process at least partially, rather than providing technical support for manual search and discovery.
Statistics
The term "statistics" has two different uses in business analytics. First, it is used as a catchall term for the computation of particular numeric values of interest from data (e.g., "We need to gather some statistics on our customers' usage to determine what's going wrong here."). These values often include sums, averages, rates, and so on. Let's call these "summary statistics." Often we want to dig deeper and calculate summary statistics conditionally on one or more subsets of the population (e.g., "Does the churn rate differ between male and female customers?" and "What about high-income customers in the Northeast (denotes a region of the USA)?"). Summary statistics are the basic building blocks of much data science theory and practice.
Summary statistics should be chosen with close attention to the business problem to be solved (one of the fundamental principles we will present later), and also with attention to the distribution of the data they are summarizing. For example, the average (mean) income in the United States according to the 2004 Census Bureau Economic Survey was over $60,000. If we were to use that as a measure of the average income in order to make policy decisions, we would be misleading ourselves. The distribution of incomes in the United States is highly skewed, with many people making relatively little and some people making fantastically much. In such cases, the arithmetic mean tells us relatively little about how much people are making. Instead, we should use a different measure of "average" income, such as the median. The median income (the amount where half the population makes more and half makes less) in the United States in the 2004 Census study was only $44,389, considerably less than the mean. This example may seem obvious because we are so accustomed to hearing about the "median income," but the same reasoning applies to any computation of summary statistics: have you thought about the problem you would like to solve or the question you would like to answer? Have you considered the distribution of the data, and whether the chosen statistic is appropriate?
The other use of the term "statistics" is to denote the field of study that goes by that name, which we can differentiate by using the proper name, Statistics. The field of Statistics provides us with a huge amount of knowledge that underlies analytics, and can be thought of as a component of the larger field of Data Science. For example, Statistics helps us to understand different data distributions and what statistics are appropriate to summarize each. Statistics helps us understand how to use data to test hypotheses and to estimate the uncertainty of conclusions. In relation to data mining, hypothesis testing can help determine whether an observed pattern is likely to be a valid, general regularity as opposed to a chance occurrence in some particular dataset. Most relevant to this book, many of the techniques for extracting models or patterns from data have their roots in Statistics.
For example, a preliminary study may suggest that customers in the Northeast have a churn rate of 22.5%, whereas the nationwide average churn rate is only 15%. This may be just a chance fluctuation, since the churn rate is not constant; it varies over regions and over time, so differences are to be expected. But the Northeast rate is one and a half times the U.S. average, which seems unusually high. What is the chance that this is due to random variation? Statistical hypothesis testing is used to answer such questions.
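To make the Northeast churn question concrete, here is a minimal sketch of a one-sided exact binomial test using only the Python standard library. Only the 22.5% and 15% rates come from the example above; the sample sizes (200 and 40 customers) and the helper name `binomial_upper_tail` are illustrative assumptions.

```python
from math import comb


def binomial_upper_tail(successes: int, n: int, p: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p): a one-sided exact p-value."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes, n + 1)
    )


NATIONAL_RATE = 0.15  # overall churn rate from the text

# Hypothetical sample sizes (assumptions): both samples show a 22.5% churn rate.
p_large = binomial_upper_tail(successes=45, n=200, p=NATIONAL_RATE)
p_small = binomial_upper_tail(successes=9, n=40, p=NATIONAL_RATE)

print(f"n=200: p-value = {p_large:.4f}")
print(f"n=40:  p-value = {p_small:.4f}")
```

Under these assumed sample sizes, the same observed rate yields a much smaller p-value for the larger sample than for the smaller one, which is why the size of the preliminary study matters when judging whether a difference could be due to chance.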
Chapter 2: Business Problems and Data Science Solutions
Closely related is the quantification of uncertainty into confidence intervals. The overall churn rate is 15%, but there is some variation; traditional statistical analysis may reveal that 95% of the time the churn rate is expected to fall between 13% and 17%. This contrasts with the (complementary) process of data mining, which may be seen as hypothesis generation. Can we find patterns in the data in the first place? Hypothesis generation should then be followed by careful hypothesis testing (generally on different data; see Chapter 5). In addition, data mining procedures may produce numerical estimates, and we often also want to provide confidence intervals on these estimates. We will return to this when we discuss the evaluation of the results of data mining.
In this book we are not going to spend more time discussing these basic statistical concepts. There are plenty of introductory books on statistics and statistics for business, and any treatment we would try to squeeze in would be either very narrow or superficial. That said, one statistical term that is often heard in the context of business analytics is "correlation." For example, "Are there any indicators that correlate with a customer's later defection?" As with the term statistics, "correlation" has both a general-purpose meaning (variations in one quantity tell us something about variations in the other) and a specific technical meaning (e.g., linear correlation based on a particular mathematical formula). The notion of correlation will be the jumping-off point for the rest of our discussion of data science for business, starting in the next chapter.
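The confidence-interval statement above can be illustrated with a small sketch. It uses the normal-approximation (Wald) interval for a proportion, which is one common textbook choice and not necessarily the analysis the authors have in mind; the sample of 1,225 customers with 184 churners and the function name `proportion_confidence_interval` are assumptions chosen so that the interval comes out near the 13% to 17% range quoted in the text.

```python
from math import sqrt


def proportion_confidence_interval(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) interval for a proportion; z=1.96 gives ~95%."""
    rate = successes / n
    margin = z * sqrt(rate * (1 - rate) / n)
    return rate - margin, rate + margin


# Hypothetical sample (assumption): 184 churners among 1,225 customers, about 15%.
low, high = proportion_confidence_interval(successes=184, n=1225)
print(f"churn rate: {184 / 1225:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

With a smaller sample the same 15% rate would produce a wider interval, so the width of the interval is itself a statement about how much data stands behind the estimate.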
Database Querying
A query is a specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system. Many tools are available to answer one-off or repeating queries about data posed by an analyst. These tools are usually frontends to database systems, based on Structured Query Language (SQL) or a tool with a graphical user interface (GUI) to help formulate queries (e.g., query-by-example, or QBE). For example, if the analyst can define "profitable" in operational terms computable from items in the database, then a query tool could answer: "Who are the most profitable customers in the Northeast?" The analyst enters the query and runs it to retrieve a list of the most profitable customers, possibly ranked by profitability. This activity differs fundamentally from data mining in that there is no discovery of patterns or models.
Database queries are appropriate when an analyst already has an idea of what might be an interesting subpopulation of the data, and wants to investigate this population or confirm a hypothesis about it. For example, if an analyst suspects that middle-aged men living in the Northeast have some particularly interesting churning behavior, she could compose a SQL query:
SELECT * FROM CUSTOMERS WHERE AGE > 45 and SEX='M' and DOMICILE = 'NE'
If those are the customers of interest, a query tool could be used to retrieve all of the information about them ("*") from the CUSTOMERS table in the database. In contrast, data mining could be used to come up with this query in the first place, as a pattern or regularity in the data. A data mining procedure might examine prior customers who did and did not defect, and determine that this segment (characterized as "AGE is greater than 45 and SEX is male and DOMICILE is Northeast-USA") is predictive with respect to churn rate. After translating this into a SQL query, a query tool could then be used to find the matching records in the database.
Query tools generally have the ability to execute sophisticated logic, including computing summary statistics over subpopulations, sorting, joining together multiple tables with related data, and more. Data scientists often become quite adept at writing queries to extract the data they need.
On-line Analytical Processing (OLAP) provides an easy-to-use GUI to query large data collections, for the purpose of facilitating data exploration. The idea of "on-line" processing is that it is done in real time, so analysts and decision makers can find answers to their queries quickly and efficiently. Unlike the "ad hoc" querying enabled by tools like SQL, for OLAP the dimensions of analysis must be pre-programmed into the OLAP system. If we have foreseen that we would want to explore sales volume by region and time, we could have these three dimensions programmed into the system, and drill down into populations, often simply by clicking, dragging, and manipulating dynamic charts.
OLAP systems are designed to facilitate manual or visual exploration of the data by analysts. OLAP performs no modeling or automatic pattern finding. As an additional contrast, unlike with OLAP, data mining tools generally can incorporate new dimensions of analysis easily as part of the exploration. OLAP tools can be a useful complement to data mining tools for discovery from business data.
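The querying ideas in this section can be tried on a toy table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the rows, the NAME and SALES columns, and the 'NE'/'SW' region codes are made-up assumptions, and the first query selects only NAME rather than "*" to keep the output short. The second query shows a summary statistic computed over subpopulations, the kind of roll-up that an OLAP tool would expose along a pre-programmed dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS (NAME TEXT, AGE INTEGER, SEX TEXT, DOMICILE TEXT, SALES REAL)"
)
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?, ?)",
    [
        ("Avery", 52, "M", "NE", 120.0),
        ("Blake", 38, "M", "NE", 80.0),
        ("Casey", 61, "F", "NE", 200.0),
        ("Devon", 47, "M", "SW", 150.0),
        ("Ellis", 49, "M", "NE", 95.0),
    ],
)

# Ad hoc query: the analyst already knows which subpopulation to look at.
segment = conn.execute(
    "SELECT NAME FROM CUSTOMERS "
    "WHERE AGE > 45 AND SEX = 'M' AND DOMICILE = 'NE' ORDER BY NAME"
).fetchall()
segment_names = [row[0] for row in segment]
print(segment_names)

# Summary statistic over subpopulations: total sales per region.
sales_by_region = dict(
    conn.execute(
        "SELECT DOMICILE, SUM(SALES) FROM CUSTOMERS GROUP BY DOMICILE"
    ).fetchall()
)
print(sales_by_region)
conn.close()
```

Note that both queries only retrieve or aggregate what the analyst asked for; neither discovers that this particular segment is worth asking about, which is the part data mining contributes.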
Data Warehousing
Data warehouses collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database. Analytical systems can access data warehouses. Data warehousing may be seen as a facilitating technology of data mining. It is not always necessary, as most data mining does not access a data warehouse, but firms that decide to invest in data warehouses often can apply data mining more broadly and more deeply in the organization. For example, if a data warehouse integrates records from sales and billing as well as from human resources, it can be used to find characteristic patterns of effective salespeople.
Regression Analysis
Some of the same methods we discuss in this book are at the core of a different set of analytic methods, which often are collected under the rubric of regression analysis, and which are widely applied in the field of statistics and also in other fields founded on econometric analysis. This book focuses on different issues than those usually encountered in a regression analysis book or class. Here we are less interested in explaining a particular dataset than in extracting patterns that will generalize to other data, for the purpose of improving some business process. Typically, this will involve estimating or predicting values for cases that are not in the analyzed data set. So, as an example, in this book we are less interested in digging into the reasons for churn (important as they may be) in a particular historical set of data, and more interested in predicting which customers who have not yet left would be the best to target to reduce future churn. Therefore, we will spend some time talking about testing patterns on new data to evaluate their generality, and about techniques for reducing the tendency to find patterns that are specific to a particular set of data but do not generalize to the population from which the data come.
The topic of explanatory modeling versus predictive modeling can elicit deep-felt debate,5 which goes well beyond our focus. What is important is to realize that there is considerable overlap in the techniques used, but that the lessons learned from explanatory modeling do not all apply to predictive modeling. So a reader with some background in regression analysis may encounter new and even seemingly contradictory lessons.6
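As a rough illustration of "testing patterns on new data," the sketch below fits a straight line by ordinary least squares on part of a small dataset and then measures its error on customers that were held out of the fitting. The data (service calls and overage charges), the every-third-row holdout scheme, and the function names are all assumptions made for illustration; Chapter 5 of the book treats evaluation properly.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


def mean_squared_error(model, xs, ys):
    """Average squared difference between the line's estimates and actual values."""
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data (assumption): service calls (x) and monthly overage charges (y).
calls = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
charges = [1.0, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.3, 16.9, 19.0]

# Hold out every third customer; fit only on the rest.
rows = list(zip(calls, charges))
train = [row for i, row in enumerate(rows) if i % 3 != 2]
test = [row for i, row in enumerate(rows) if i % 3 == 2]

model = fit_line(*zip(*train))
train_error = mean_squared_error(model, *zip(*train))
test_error = mean_squared_error(model, *zip(*test))
print(f"slope={model[1]:.2f}, train MSE={train_error:.3f}, held-out MSE={test_error:.3f}")
```

The held-out error, not the error on the fitting data, is the quantity that speaks to whether the pattern generalizes to cases outside the analyzed data set.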
Machine Learning and Data Mining
The collection of methods for extracting (predictive) models from data, now known as machine learning methods, was developed in several fields contemporaneously, most notably Machine Learning, Applied Statistics, and Pattern Recognition. Machine Learning as a field of study arose as a subfield of Artificial Intelligence, which was concerned with methods for improving the knowledge or performance of an intelligent agent over time, in response to the agent's experience in the world. Such improvement often involves analyzing data from the environment and making predictions about unknown quantities, and over the years this data analysis aspect of machine learning has come to play a very large role in the field. As machine learning methods were deployed broadly, the scientific disciplines of Machine Learning, Applied Statistics, and Pattern Recognition developed close ties, and the separation between the fields has blurred.
5. The interested reader is urged to read the discussion by Shmueli (2010).
6. Those who pursue the study in depth will have the seeming contradictions worked out. Such deep study is not necessary to understand the fundamental principles.
The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and the two remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely linked that researchers commonly participate in both communities and move between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.
Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition (how an intelligent agent will use learned knowledge to reason and act in its environment), which are not concerns of Data Mining.
Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.
Answering Business Questions with These Techniques
To illustrate how these techniques apply to business analytics, consider a set of questions that may arise and the technologies that would be appropriate for answering them. These questions are all related, but each is subtly different. It is important to understand these differences in order to understand what technologies one needs to employ and what people may be necessary to consult.
1. Who are the most profitable customers? If "profitable" can be defined clearly based on existing data, this is a straightforward database query. A standard query tool could be used to retrieve a set of customer records from a database. The results could be sorted by cumulative transaction amount, or some other operational indicator of profitability.
2. Is there really a difference between the profitable customers and the average customer? This is a question about a conjecture or hypothesis (in this case, "There is a difference in value to the company between the profitable customers and the average customer"), and statistical hypothesis testing would be used to confirm or disconfirm it. Statistical analysis could also derive a probability or confidence bound that the difference was real. Typically, the result would be like: "The value of these profitable customers is significantly different from that of the average customer, with probability < 5% that this is due to random chance."
3. But who really are these customers? Can I characterize them? We often would like to do more than just list out the profitable customers. We would like to describe common characteristics of profitable customers. The characteristics of individual customers can be extracted from a database using techniques such as database querying, which also can be used to generate summary statistics. A deeper analysis should involve determining what characteristics differentiate profitable customers from unprofitable ones. This is the realm of data science, using data mining techniques for automated pattern finding, which we discuss in depth in the subsequent chapters.
4. Will some particular new customer be profitable? How much revenue should I expect this customer to generate? These questions could be addressed by data mining techniques that examine historical customer records and produce predictive models of profitability. Such techniques would generate models from historical data that could then be applied to new customers to generate predictions. Again, this is the subject of the following chapters.
Note that this last pair of questions are subtly different data mining questions. The first, a classification question, may be phrased as a prediction of whether a given new customer will be profitable (yes/no, or the probability thereof). The second may be phrased as a prediction of the (numerical) value that the customer will bring to the company. More on that as we proceed.
Summary
Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. We will refer back to the data mining process repeatedly throughout the book, showing how each fundamental concept fits in. In turn, understanding the fundamentals of data science substantially improves the chances of success as an enterprise invokes the data mining process.
The various fields of study related to data science have developed a set of canonical task types, such as classification, regression, and clustering. Each task type serves a different purpose and has an associated set of solution techniques. A data scientist typically attacks a new project by decomposing it such that one or more of these canonical tasks is revealed, choosing a solution technique for each, and then composing the solutions. Doing this expertly may take considerable experience and skill. A successful data mining project involves an intelligent compromise between what the data can do (i.e., what they can predict, and how well) and the project goals. For this reason it is important to keep in mind how data mining results will be used, and to use this to inform the data mining process itself.
Data mining differs from, and is complementary to, important supporting technologies such as statistical hypothesis testing and database querying (which have their own books and classes). Though the boundaries between data mining and related techniques are not always sharp, it is important to know about other techniques' capabilities and strengths in order to know when to use them.
To a business manager, the data mining process is useful as a framework for analyzing a data mining project or proposal. The process provides a systematic organization, including a set of questions that can be asked about a project or a proposed project, to help understand whether the project is well conceived or is fundamentally flawed. We will return to this after we have discussed in detail some more of the fundamental principles themselves, to which we turn now.
CHAPTER 3
Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.
The previous chapters discussed models and modeling at a high level. This chapter delves into one of the main topics of data mining: predictive modeling. Following our example of data mining for churn prediction from the first section, we will begin by thinking of predictive modeling as supervised segmentation: how can we segment the population into groups that differ from each other with respect to some quantity of interest? In particular, how can we segment the population with respect to something that we would like to predict or estimate? The target of this prediction can be something we would like to avoid, such as which customers are likely to leave the company when their contracts expire, which accounts have been defrauded, which potential customers are likely not to pay off their account balances (write-offs, such as defaulting on one's phone bill or credit card balance), or which web pages contain objectionable content. The target might instead be cast in a positive light, such as which consumers are most likely to respond to an advertisement or special offer, or which web pages are most appropriate for a search query.
In the process of discussing supervised segmentation, we introduce one of the fundamental ideas of data mining: finding or selecting important, informative variables or "attributes" of the entities described by the data. What exactly it means to be "informative" varies among applications, but generally, information is a quantity that reduces uncertainty about something. So, if an old pirate gives me information about where his treasure is hidden, that does not mean that I know for certain where it is; it only means that my uncertainty about where the treasure is hidden is reduced. The better the information, the more my uncertainty is reduced.
Now, recall the notion of "supervised" data mining from the previous chapter. A key to supervised data mining is that we have some target quantity we would like to predict or to otherwise understand better. Often this quantity is unknown or unknowable at the time we would like to make a business decision, such as whether a customer will churn soon after her contract expires, or which accounts have been defrauded. Having a target variable crystallizes our notion of finding informative attributes: is there one or more other variables that reduce our uncertainty about the value of the target? This also gives a common analytics application of the general notion of correlation discussed above: we would like to find knowable attributes that correlate with the target of interest, that is, that reduce our uncertainty in it. Just finding these correlated variables may provide important insight into the business problem.
Finding informative attributes also is useful to help us deal with increasingly large databases and data streams. Datasets that are too large pose computational problems for analytic techniques, especially when the analyst does not have access to high-performance computers. One tried-and-true method for analyzing very large datasets is first to select a subset of the data to analyze. Selecting informative attributes provides an "intelligent" method for selecting an informative subset of the data. In addition, attribute selection prior to data-driven modeling can increase the accuracy of the modeling, for reasons we will discuss in Chapter 5.
Finding informative attributes also is the basis for a widely used predictive modeling technique called tree induction, which we will introduce toward the end of this chapter as an application of this fundamental concept. Tree induction incorporates the idea of supervised segmentation in an elegant manner, repeatedly selecting informative attributes. By the end of this chapter we will have achieved an understanding of the basic concepts of predictive modeling; the fundamental notion of finding informative attributes, along with one particular, illustrative technique for doing so; the notion of tree-structured models; and the process for extracting tree-structured models from a dataset, that is, performing supervised segmentation.
Models, Induction, and Prediction
Generally speaking, a model is a simplified representation of reality created to serve a purpose. It is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on constraints on information or tractability. For example, a map is a model of the physical world. It abstracts away a tremendous amount of information that the mapmaker deemed irrelevant for its purpose. It preserves, and sometimes further simplifies, the relevant information. For example, a road map keeps and highlights the roads, their basic topology, their relationships to places one would want to travel, and other relevant information. Various professions have well-known model types: an architectural blueprint, an engineering prototype, the Black-Scholes model of option pricing, and so on. Each of these abstracts away details that are not relevant to its main purpose and keeps those that are.
Figure 3-1. Data mining terminology for a supervised classification problem. The problem is supervised because it has a target attribute and some "training" data where we know the value for the target attribute. It is a classification (rather than regression) problem because the target is a category (yes or no) rather than a number.
In data science, a predictive model is a formula for estimating the unknown value of interest: the target. The formula could be mathematical, or it could be a logical statement such as a rule. Often it is a hybrid of the two. Given our division of supervised data mining into classification and regression, we will consider classification models (and class-probability estimation models) and regression models.
Terminology: prediction
In common usage, prediction means to forecast a future event. In data science, prediction more generally means to estimate an unknown value. This value could be something in the future (in common usage, true prediction), but it could also be something in the present or in the past. Indeed, since data mining usually deals with historical data, models very often are built and tested using events from the past. Predictive models for credit scoring estimate the likelihood that a potential customer will default (become a write-off). Predictive models for spam filtering estimate whether a given piece of email is spam. Predictive models for fraud detection judge whether an account has been defrauded. The key is that the model is intended to be used to estimate an unknown value.
This is in contrast to descriptive modeling, where the primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process. A descriptive model of churn behavior would tell us what customers who churn typically look like.1 A predictive model may be judged solely on its predictive performance, although we will discuss why intelligibility is nonetheless important. The difference between these model types is not as strict as this may imply; some of the same techniques can be used for both, and usually one model can serve both purposes (though sometimes poorly). Sometimes much of the value of a predictive model is in the understanding gained from looking at it rather than in the predictions it makes.
Before we discuss predictive modeling further, we must introduce some terminology. Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features. So, for our churn-prediction problem we would like to build a model of the propensity to churn as a function of customer account attributes, such as age, income, length of time with the company, number of calls to customer service, overage charges, customer demographics, data usage, and others.
Figure 3-1 illustrates some of the terminology we introduce here, in an oversimplified example problem of credit write-off prediction. An instance or example represents a fact or a data point, in this case a historical customer who had been given credit. This is also called a row in database or spreadsheet terminology. An instance is described by a set of attributes (fields, columns, variables, or features). An instance is also sometimes called a feature vector, because it can be represented as a fixed-length ordered collection (vector) of feature values. Unless stated otherwise, we will assume that the values of all the attributes (but not the target) are present in the data.
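The terminology above maps naturally onto a small data structure. The sketch below is one possible representation, not a prescribed one: the class name `Instance`, its three attributes (taken from the churn attributes listed in the text), and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instance:
    """One row of the dataset: attribute values plus an optional target value."""

    age: int
    income: float
    service_calls: int
    churned: Optional[bool] = None  # target variable; None when unknown

    def feature_vector(self) -> Tuple[float, ...]:
        """Fixed-length, ordered collection of attribute values (target excluded)."""
        return (float(self.age), float(self.income), float(self.service_calls))


# Training data: historical customers whose target value is known (labeled).
training_data = [
    Instance(age=34, income=52_000.0, service_calls=5, churned=True),
    Instance(age=58, income=71_000.0, service_calls=0, churned=False),
]

# A new customer: attributes are present, the target is what a model would estimate.
new_customer = Instance(age=41, income=60_000.0, service_calls=2)
print(new_customer.feature_vector(), new_customer.churned)
```

Keeping the target out of `feature_vector` mirrors the assumption stated above: attribute values are present for every instance, while the target is known only for the historical, labeled ones.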
1. Descriptive modeling often is used to work toward a causal understanding of the data generating process (why do people churn?).
Many Names for the Same Things
The principles and techniques of data science historically have been studied in several different fields, including machine learning, pattern recognition, statistics, databases, and others. As a result, there often are several different names for the same things. We typically will refer to a dataset, whose form usually is the same as a table of a database or a worksheet of a spreadsheet. A dataset contains a set of examples or instances. An instance also is referred to as a row of a database table or sometimes a case in statistics.
The features (table columns) have many different names as well. Statisticians speak of independent variables or predictors as the attributes supplied as input. In operations research you may also hear explanatory variable. The target variable, whose values are to be predicted, is commonly called the dependent variable in Statistics. This terminology may be somewhat confusing: the independent variables may not be independent of each other (or of anything else), and the dependent variable does not always depend on all the independent variables. For this reason we avoid the dependent/independent terminology in this book. Some experts consider the target variable to be included in the set of features; some do not. The important thing is rather obvious: the target variable is not used to predict itself. However, it may be that prior values of the target variable are quite helpful for predicting future values, so these prior values may be included as features.
The creation of models from data is known as model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths). Our models are general rules in a statistical sense (they usually do not hold 100% of the time; often not nearly), and the procedure that creates the model from the data is called the induction algorithm or learner. Most inductive procedures have variants that induce models both for classification and for regression. We will discuss mainly classification models because they tend to receive less attention in other treatments of statistics, and because they are relevant to many business problems (and thus much work in data science focuses on classification).
Terminology: induction and deduction
Induction can be contrasted with deduction. Deduction starts with general rules and specific facts, and creates other specific facts from them. The use of our models can be considered a procedure of (probabilistic) deduction. We will get to this shortly.
The input data for the induction algorithm, used for inducing the model, are called the training data. As mentioned in Chapter 2, they are called labeled data because the value for the target variable (the label) is known.
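The distinction between induction and deduction can be made concrete with a deliberately tiny learner. The sketch below induces a one-threshold rule from labeled training data and then applies that rule to specific customers. The data, the single attribute (number of service calls), and the function names are assumptions for illustration only; this is not the induction algorithm the chapter goes on to present.

```python
# Labeled training data (hypothetical): (service_calls, churned)
training_data = [
    (0, False), (1, False), (2, False), (3, True),
    (4, False), (5, True), (6, True),
]


def induce_threshold_rule(data):
    """Induction: pick the threshold t for the rule 'calls >= t -> churn'
    that makes the fewest mistakes on the training data."""
    candidates = sorted({calls for calls, _ in data})

    def errors(t):
        return sum((calls >= t) != churned for calls, churned in data)

    return min(candidates, key=errors)


def predict_churn(threshold, calls):
    """Deduction: apply the general rule to one specific customer."""
    return calls >= threshold


threshold = induce_threshold_rule(training_data)
print(threshold, predict_churn(threshold, 7), predict_churn(threshold, 1))
```

The induced rule still misclassifies one training customer (the one with four calls), which matches the point that such models are general rules only in a statistical sense.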
Let's return to our example churn problem. Based on what we learned in Chapters 1 and 2, we might decide that in the modeling stage we should build a "supervised segmentation" model, which divides the sample into segments having (on average) a higher or lower tendency to leave the company after contract expiration. To think about how this might be done, let's now turn to one of our fundamental concepts: how can we select one or more attributes/features/variables that will best divide the sample with respect to our target variable of interest?
Supervised Segmentation

Recall that a predictive model focuses on estimating the value of some particular target variable of interest. An intuitive way of thinking about extracting patterns from data in a supervised manner is to try to segment the population into subgroups that have different values for the target variable (and within a subgroup the instances have similar values for the target variable). If the segmentation is done using values of variables that will be known when the target is not, then these segments can be used to predict the value of the target variable. Moreover, the segmentation may at the same time provide a human-understandable set of segmentation patterns. One such segment expressed in English might be: "Middle-aged professionals who reside in New York City on average have a churn rate of 5%." Specifically, the term "middle-aged professionals who reside in New York City" is the definition of the segment (which references some particular attributes) and "a churn rate of 5%" describes the predicted value of the target variable for the segment.2

We are often interested in applying data mining when we have many attributes and are not sure exactly what the segments should be. In our churn-prediction problem, who is to say which segments are best for predicting the propensity to churn? If there exist in the data segments with significantly different (average) values for the target variable, we would like to be able to extract them automatically. This brings us to our fundamental concept: how can we judge whether a variable contains important information about the target variable? How much? We would like to automatically obtain a selection of the more informative variables with respect to the particular task at hand (namely, predicting the value of the target variable). Even better, we might like to rank the variables by how good they are at predicting the value of the target. Consider just the selection of the single most informative attribute.
Solving this problem will introduce our first concrete data mining technique: simple, but easily extended to be very useful. In our example, which variable gives us the most information about the future churn rate of the population? Being a professional? Age? Place of residence? Income? Number of complaints to customer service? Amount of overage charges?

2. The predicted value can be estimated from the data in several ways, which we will get to. At this point, we can think of it roughly as some kind of average over the training data that fall into the segment.

Chapter 3: Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

Figure 3-2. A set of people to be classified. The label over each head represents the value of the target variable (write-off or not). Colors and shapes represent different predictor attributes.

We now will look carefully at one useful way to select informative variables, and later we will show how this technique can be used repeatedly to build a supervised segmentation. While this is very useful and illustrative, keep in mind that direct, multivariate supervised segmentation is just one application of this fundamental idea of selecting informative variables. This notion should become one of your conceptual tools when thinking about data science problems more generally. For example, as we go forward we will delve into other modeling approaches, ones that do not incorporate variable selection directly. When the world presents you with very large sets of attributes, it may be (extremely) useful to go back to this early idea and select a subset of informative attributes. Doing so can substantially reduce the size of an unwieldy dataset and, as we will see, often will improve the accuracy of the resulting model.
Selecting Informative Attributes

Given a large set of examples, how do we select an attribute to partition them in an informative way? Let's consider a binary (two-class) classification problem, and think about what we would like to get out of it. To be concrete, Figure 3-2 shows a simple segmentation problem: twelve people represented as stick figures. There are two types of heads, square and circular; two types of bodies, rectangular and oval; and two of the people have gray bodies while the rest are white. These are the attributes we will use to describe the people. Above each person is the binary target label, Yes or No, indicating (for example) whether the person becomes a loan write-off. We could describe the data on these people as:
• Attributes:
  - head-shape: square, circular
  - body-shape: rectangular, oval
  - body-color: gray, white
• Target variable:
  - write-off: Yes, No

So let's ask ourselves: which of the attributes would be best for segmenting these people into groups, in a way that distinguishes write-offs from non-write-offs? Technically, we would like the resulting groups to be as pure as possible. By pure we mean homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure. If at least one member of the group has a different value for the target variable than the rest of the group, then the group is impure. Unfortunately, in real data we seldom expect to find a variable that will make the segments pure. However, if we can reduce the impurity substantially, then we can both learn something about the data (and the corresponding population) and, importantly for this chapter, use the attribute in a predictive model: in our example, predicting that members of one segment will have higher or lower write-off rates than those in another segment. If we can do that, then we can, for example, offer credit to those with the lower predicted write-off rates, or offer different credit terms based on the different predicted write-off rates.

Technically, there are several complications:

1. Attributes rarely split a group perfectly. Even if one subgroup happens to be pure, the other may not be. For example, in Figure 3-2, consider if the second person were not there. Then body-color=gray would create a pure segment (write-off=no). However, the other associated segment, body-color=white, still is not pure.
2. In the prior example, the condition body-color=gray only splits off a single data point into the pure subset. Is this better than another split that does not produce any pure subset, but reduces the impurity more broadly?
3. Not all attributes are binary; many attributes have three or more distinct values. We must take into account that one attribute can split into two groups while another might split into three groups, or seven. How do we compare these?
4. Some attributes take on numeric values (continuous or integer). Does it make sense to make a segment for every numeric value? (No.) How should we think about creating supervised segmentations using numeric attributes?
Fortunately, for classification problems we can address all of these issues by creating a formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable. Such a formula is based on a purity measure. The most common splitting criterion is called information gain, and it is based on a purity measure called entropy. Both concepts were invented by one of the pioneers of information theory, Claude Shannon, in his seminal work in the field (Shannon, 1948).

Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest. So, for example, a mixed-up segment with lots of write-offs and lots of non-write-offs would have high entropy. More technically, entropy is defined as:

Equation 3-1. Entropy

$$\text{entropy} = -p_1 \log(p_1) - p_2 \log(p_2) - \cdots$$

Each $p_i$ is the probability (the relative percentage) of property $i$ within the set, ranging from $p_i = 1$ when all members of the set have property $i$ to $p_i = 0$ when no members of the set have property $i$. The ⋯ simply indicates that there may be more than two properties (and for the technically minded, the logarithm is generally taken as base 2).

Since the entropy equation might not lend itself to intuitive understanding, Figure 3-3 shows a plot of the entropy of a set containing 10 instances of two classes, + and –. We can see that entropy measures the general disorder of the set, ranging from zero at minimum disorder (the set has members all with the same, single property) to one at maximum disorder (the properties are equally mixed). Since there are only two classes, p+ = 1 – p–. Starting with all negative instances at the lower left, p+ = 0, the set has minimal disorder (it is pure) and the entropy is zero. If we start to switch the class labels of elements of the set from – to +, the entropy increases. Entropy is maximized at 1 when the instance classes are balanced (five of each) and p+ = p– = 0.5. As more class labels are switched, the + class starts to predominate and the entropy decreases again. When all instances are positive, p+ = 1 and the entropy is again at its minimum of zero.

As a concrete example, consider a set S of 10 people with seven of the non-write-off class and three of the write-off class. So:

p(non-write-off) = 7/10 = 0.7
p(write-off) = 3/10 = 0.3
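Figure 3-3. Entropy of a two-class set as a function of p(+).

$$\begin{aligned}
\text{entropy}(S) &= -\left[0.7 \times \log_2(0.7) + 0.3 \times \log_2(0.3)\right] \\
&\approx -\left[0.7 \times (-0.51) + 0.3 \times (-1.74)\right] \\
&\approx 0.88
\end{aligned}$$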
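The entropy calculation above can be checked with a few lines of code. The following is a minimal sketch, not taken from the book: the function name `entropy` and the label strings are illustrative choices, and the logarithm is taken as base 2 as the text suggests.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a collection of class labels (Equation 3-1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Each term is -p_i * log2(p_i); Counter only holds classes that occur,
    # so log2(0) is never evaluated.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The set S from the text: seven non-write-offs and three write-offs.
S = ["non-write-off"] * 7 + ["write-off"] * 3
print(round(entropy(S), 2))  # 0.88
```

A perfectly balanced two-class set gives an entropy of exactly 1, and a set with a single class gives 0, matching the endpoints and peak of the curve in Figure 3-3.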
Entropy is only part of the story. We would like to measure how informative an attribute is with respect to our target: how much gain in information it gives us about the value of the target variable. An attribute segments a set of instances into several subsets. Entropy only tells us how impure one individual subset is. Fortunately, with entropy to measure how disordered any set is, we can define information gain (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates. Strictly speaking, information gain measures the change in entropy due to any amount of new information being added; here, in the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute. Suppose the attribute we split on has k different values. Let's call the original set of examples the parent set, and the result of splitting on the attribute values the k children sets. Thus, information gain is a function of both a parent set and the children resulting from some partitioning of the parent set: how much information has this attribute provided? That depends on how much purer the children are than the parent. Stated in the context of predictive modeling, if we were to know the value of this attribute, how much would it increase our knowledge of the value of the target variable? Specifically, the definition of information gain (IG) is:

Equation 3-2. Information gain

$$IG(\text{parent}, \text{children}) = \text{entropy}(\text{parent}) - \left[p(c_1) \times \text{entropy}(c_1) + p(c_2) \times \text{entropy}(c_2) + \cdots\right]$$
Notably, the entropy for each child ($c_i$) is weighted by the proportion of instances belonging to that child, $p(c_i)$. This directly addresses our concern from above that splitting off a single example, and noticing that that set is pure, may not be as good as splitting the parent set into two large, relatively pure subsets, even if neither is pure.

As an example, consider the split in Figure 3-4. This is a two-class problem (• and ★). Examining the figure, the children sets certainly seem "purer" than the parent set. The parent set has 30 instances consisting of 16 dots and 14 stars, so:

$$\begin{aligned}
\text{entropy}(\text{parent}) &= -\left[p(\bullet) \times \log_2 p(\bullet) + p(\star) \times \log_2 p(\star)\right] \\
&\approx -\left[0.53 \times (-0.9) + 0.47 \times (-1.1)\right] \\
&\approx 0.99 \quad \text{(very impure)}
\end{aligned}$$

The entropy of the left child is:

$$\begin{aligned}
\text{entropy}(\text{Balance} < 50K) &= -\left[p(\bullet) \times \log_2 p(\bullet) + p(\star) \times \log_2 p(\star)\right] \\
&\approx -\left[0.92 \times (-0.12) + 0.08 \times (-3.7)\right] \\
&\approx 0.39
\end{aligned}$$

The entropy of the right child is:

$$\begin{aligned}
\text{entropy}(\text{Balance} \geq 50K) &= -\left[p(\bullet) \times \log_2 p(\bullet) + p(\star) \times \log_2 p(\star)\right] \\
&\approx -\left[0.24 \times (-2.1) + 0.76 \times (-0.39)\right] \\
&\approx 0.79
\end{aligned}$$

Using Equation 3-2, the information gain of this split is:
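$$\begin{aligned}
IG &= \text{entropy}(\text{parent}) - \left[p(\text{Balance} < 50K) \times \text{entropy}(\text{Balance} < 50K) + p(\text{Balance} \geq 50K) \times \text{entropy}(\text{Balance} \geq 50K)\right] \\
&\approx 0.99 - \left[0.43 \times 0.39 + 0.57 \times 0.79\right] \\
&\approx 0.37
\end{aligned}$$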
So this split reduces entropy substantially. In predictive modeling terms, the attribute provides a lot of information about the value of the target.
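The information gain of the Balance split can be reproduced as a short sketch of Equation 3-2. The function names are illustrative, and the per-child class counts (12 dots and 1 star on the left, 4 dots and 13 stars on the right) are an assumption inferred from the proportions quoted in the text (0.92/0.08, 0.24/0.76, and child weights 0.43/0.57 of a 30-instance parent); they are not stated explicitly.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy of a set described by its per-class counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(parent_counts, children_counts):
    """Equation 3-2: parent entropy minus the size-weighted child entropies."""
    total = sum(parent_counts)
    weighted_children = sum(
        (sum(child) / total) * entropy_from_counts(child)
        for child in children_counts
    )
    return entropy_from_counts(parent_counts) - weighted_children


# Counts are [dots, stars]; child counts are inferred, see the note above.
parent = [16, 14]
children = [[12, 1],   # Balance < 50K
            [4, 13]]   # Balance >= 50K

ig = information_gain(parent, children)
print(round(ig, 2))  # 0.38
```

Working from exact counts gives roughly 0.38, whereas the text's 0.37 comes from rounding the intermediate entropies and proportions to two decimals before combining them.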
Figure 3-4. Splitting the "write-off" sample into two segments, based on splitting the Balance attribute (account balance) at 50K.

As a second example, consider another candidate split shown in Figure 3-5. This is the same parent set as in Figure 3-4, but instead we consider splitting on the attribute Residence with three values: OWN, RENT, and OTHER. Without showing the detailed calculations:
entropy(parent) ≈ 0.99
entropy(Residence=OWN) ≈ 0.54
entropy(Residence=RENT) ≈ 0.97
entropy(Residence=OTHER) ≈ 0.98
IG ≈ 0.13
Figure 3-5. A classification tree split on the three-valued Residence attribute.

The Residence variable does have a positive information gain, but it is lower than that of Balance. Intuitively, this is because, while the one child Residence=OWN has considerably reduced entropy, the other values RENT and OTHER produce children that are no purer than the parent. Thus, based on these data, the Residence variable is less informative than Balance.

Looking back at our concerns from above about creating supervised segmentations for classification problems, information gain addresses them all. It does not require absolute purity. It can be applied to any number of child subsets. It takes into account the relative sizes of the children, giving more weight to larger subsets.3
Numeric variables
We have not discussed exactly what to do if the attribute is numeric. Numeric variables can be "discretized" by choosing a split point (or many split points) and then treating the result as a categorical attribute. For example, Income could be divided into two or more ranges. Information gain can be applied to evaluate the segmentation created by this discretization of the numeric attribute. We are still left with the question of how to choose the split point(s) for the numeric attribute. Conceptually, we can try all reasonable split points and choose the one that gives the highest information gain.
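One way to picture "trying all reasonable split points" is the sketch below. It is an illustration under stated assumptions rather than the book's procedure: candidate thresholds are taken to be the midpoints between consecutive distinct sorted values, only a single binary split is searched, and the income figures and labels are made-up toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a collection of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_split_point(values, labels):
    """Return (threshold, gain) for the binary split with the highest information gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent_entropy = entropy(labels)
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        threshold = (low + high) / 2  # midpoint candidate (an assumption)
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical incomes (in thousands) and write-off labels.
incomes = [20, 35, 40, 60, 75, 90]
write_off = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain = best_split_point(incomes, write_off)
print(threshold, gain)  # 50.0 1.0
```

In this toy data a threshold of 50 separates the classes perfectly, so the gain equals the full parent entropy of 1.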
Finally, what about supervised segmentations for regression problems, that is, problems with a numeric target variable? Looking at reducing the impurity of the child subsets still makes intuitive sense, but information gain is not the right measure, because entropy-based information gain is based on the distribution of the properties in the segmentation. Instead, we would want a measure of the purity of the numeric (target) values in the subsets. We will not go through a derivation here, but the fundamental idea is important: a natural measure of impurity for numeric values is variance. If the set has all the same values for the numeric target variable, then the set is pure and the variance is zero. If the numeric target values in the set are very different, then the set will have high variance. We can create a notion similar to information gain by looking at reductions in variance between parent and children. The process proceeds in direct analogy to the derivation of information gain above. To create the best segmentation given a numeric target, we might choose the one that produces the best weighted average variance reduction. In essence, we would again be finding variables that have the best correlation with the target or, alternatively, are most predictive of the target.
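The analogy with information gain can be made concrete with a small sketch. The text does not give a formula, so the one used here (parent variance minus the size-weighted child variances, using population variance) is an assumption about what "weighted average variance reduction" means, and the numbers are made-up toy values.

```python
from statistics import pvariance


def variance_reduction(parent_values, children_values):
    """Parent variance minus the size-weighted average of the child variances."""
    n = len(parent_values)
    weighted_children = sum(
        (len(child) / n) * pvariance(child) for child in children_values
    )
    return pvariance(parent_values) - weighted_children


# Hypothetical numeric target values, split into two segments.
parent = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
good_split = [[10.0, 12.0, 11.0], [30.0, 32.0, 31.0]]   # similar values grouped together
poor_split = [[10.0, 30.0, 11.0], [12.0, 32.0, 31.0]]   # each child is still mixed

print(variance_reduction(parent, good_split))
print(variance_reduction(parent, poor_split))
```

The first split groups similar target values together and removes almost all of the parent's variance; the second leaves each child nearly as spread out as the parent, so its reduction is much smaller.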
Example: Attribute Selection with Information Gain

Now we are ready to apply our first concrete data mining technique. For a dataset with instances described by attributes and a target variable, we can determine which attribute is the most informative with respect to estimating the value of the target variable. (We will delve into this more deeply below.) We also can rank a set of attributes by their informativeness, in particular by their information gain. This can be used simply to understand the data better. It can be used to help predict the target. Or it can be used to reduce the size of the data to be analyzed, by selecting a subset of attributes in cases where we cannot or do not want to process the entire dataset.

3. Technically, there remains a concern with attributes that have very many values, as splitting on them may result in a large information gain but not be predictive. This problem ("overfitting") is the subject of Chapter 5.

To illustrate the use of information gain, we introduce a simple but realistic dataset taken from the machine learning dataset repository at the University of California at Irvine.4 It is the Mushroom dataset. From the description:

    This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500–525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.
Each data example (instance) is one mushroom sample, described in terms of its observable attributes (the features). The twenty-odd attributes and the values for each are listed in Table 3-1. For a given example, each attribute takes on a single discrete value (e.g., gill-color=black). We use 5,644 examples from the dataset, comprising 2,156 poisonous and 3,488 edible mushrooms. This is a classification problem because we have a target variable, called edible?, with two values, yes (edible) and no (poisonous), specifying our two classes. Each of the rows in the training set has a value for this target variable. We will use information gain to answer the question: "Which single attribute is the most useful for distinguishing edible (edible?=Yes) mushrooms from poisonous (edible?=No) ones?" This is a basic attribute selection problem. In much larger problems we could imagine selecting the best ten or fifty attributes out of several hundred or thousand, and often you want to do this if you suspect there are far too many attributes for your mining problem, or that many are not useful. Here, for simplicity, we will find the single best attribute instead of the top ten.

Table 3-1. The attributes of the Mushroom dataset

Attribute name               Possible values
CAP-SHAPE                    bell, conical, convex, flat, knobbed, sunken
CAP-SURFACE                  fibrous, grooves, scaly, smooth
CAP-COLOR                    brown, buff, cinnamon, gray, green, pink, purple, red, white, yellow
BRUISES?                     yes, no
ODOR                         almond, anise, creosote, fishy, foul, musty, none, pungent, spicy
GILL-ATTACHMENT              attached, descending, free, notched
GILL-SPACING                 close, crowded, distant
GILL-SIZE                    broad, narrow
GILL-COLOR                   black, brown, buff, chocolate, gray, green, orange, pink, purple, red, white, yellow
STALK-SHAPE                  enlarging, tapering
STALK-ROOT                   bulbous, club, cup, equal, rhizomorphs, rooted, missing
STALK-SURFACE-ABOVE-RING     fibrous, scaly, silky, smooth
STALK-SURFACE-BELOW-RING     fibrous, scaly, silky, smooth
STALK-COLOR-ABOVE-RING       brown, buff, cinnamon, gray, orange, pink, red, white, yellow
STALK-COLOR-BELOW-RING       brown, buff, cinnamon, gray, orange, pink, red, white, yellow
VEIL-TYPE                    partial, universal
VEIL-COLOR                   brown, orange, white, yellow
RING-NUMBER                  none, one, two
RING-TYPE                    cobwebby, evanescent, flaring, large, none, pendant, sheathing, zone
SPORE-PRINT-COLOR            black, brown, buff, chocolate, green, orange, purple, white, yellow
POPULATION                   abundant, clustered, numerous, scattered, several, solitary
HABITAT                      grasses, leaves, meadows, paths, urban, waste, woods
EDIBLE? (target variable)    yes, no

4. See the UC Irvine Machine Learning Repository page.
Since we now have a way to measure information gain, this is straightforward: we are asking for the single attribute that gives the highest information gain. To do this, we calculate the information gain achieved by splitting on each attribute. The information gain from Equation 3-2 is defined on a parent and a set of children. The parent in each case is the whole dataset. First we need entropy(parent), the entropy of the whole dataset. If the two classes were perfectly balanced in the dataset, it would have an entropy of 1. This dataset is slightly unbalanced (more edible than poisonous mushrooms are represented) and its entropy is 0.96.

To illustrate entropy reduction graphically, we will show a number of entropy graphs for the mushroom domain (Figure 3-6 through Figure 3-8). Each graph is a two-dimensional description of the entire dataset's entropy as it is divided in various ways by different attributes. On the x axis is the proportion of the dataset (0 to 1), and on the y axis is the entropy (also 0 to 1) of a given piece of the data. The amount of shaded area in each graph represents the amount of entropy in the dataset when it is divided by some chosen attribute (or not divided, in the case of Figure 3-6). Our goal of having the lowest entropy corresponds to having as little shaded area as possible.
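The procedure just described, computing the information gain of every attribute against the whole dataset as parent and ranking the results, might be sketched as follows. The function names are illustrative, and the six rows are a made-up toy sample that only borrows attribute names from Table 3-1; they are not records from the actual Mushroom dataset. The class counts used for the parent entropy (3,488 edible, 2,156 poisonous) are the ones given in the text.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Base-2 entropy of a collection of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, attribute, target):
    """Information gain from splitting rows on every value of one attribute."""
    children = defaultdict(list)
    for row in rows:
        children[row[attribute]].append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy([row[target] for row in rows]) - weighted


def rank_attributes(rows, target):
    """Attributes sorted from most to least informative."""
    attributes = [name for name in rows[0] if name != target]
    gains = {a: information_gain(rows, a, target) for a in attributes}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Parent entropy for the class balance quoted in the text.
dataset_entropy = entropy(["yes"] * 3488 + ["no"] * 2156)
print(round(dataset_entropy, 2))  # 0.96

# Hypothetical toy rows, for illustration only.
rows = [
    {"odor": "none",   "cap-color": "brown", "edible?": "yes"},
    {"odor": "none",   "cap-color": "white", "edible?": "yes"},
    {"odor": "almond", "cap-color": "brown", "edible?": "yes"},
    {"odor": "foul",   "cap-color": "white", "edible?": "no"},
    {"odor": "foul",   "cap-color": "brown", "edible?": "no"},
    {"odor": "musty",  "cap-color": "white", "edible?": "no"},
]
ranking = rank_attributes(rows, "edible?")
print(ranking[0][0])  # odor
```

In the toy rows, odor separates the classes perfectly while cap-color barely helps, so odor is ranked first; on real data the same ranking function would be applied to all of the attributes in Table 3-1.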
The first chart, Figure 3-6, shows the entropy of the entire dataset. In such a chart, the highest possible entropy corresponds to the entire area being shaded; the lowest possible entropy corresponds to the entire area being white. Such a chart is useful for visualizing the information gain from different partitions of a dataset, because any partition can be shown simply as slices of the graph (with widths corresponding to the proportion of the dataset), each with its own entropy. The weighted sum of entropies in the information gain calculation will be depicted simply by the total amount of shaded area.

Figure 3-6. Entropy chart for the entire Mushroom dataset. The entropy for the entire dataset is 0.96, so 96% of the area is shaded.
Figure 3-7. Entropy chart for the Mushroom dataset as split by GILL-COLOR. The amount of shading corresponds to the total (weighted sum) entropy, with each bar corresponding to the entropy of one of the attribute's values, and the width of the bar corresponding to the prevalence of that value in the data.

For our entire dataset, the global entropy is 0.96, so Figure 3-6 shows a large shaded area below the line y = 0.96. We can think of this as our starting entropy: any informative attribute should produce a new graph with less shaded area. Now we show the entropy charts of three sample attributes. Each value of an attribute occurs in the dataset with a different frequency, so each attribute splits the set in a different way. Figure 3-7 shows the dataset split apart by the attribute GILL-COLOR, whose values are coded as y (yellow), u (purple), n (brown), and so on. The width of each bar represents the proportion of the dataset that has that value, and the height is its entropy. We can see that GILL-COLOR reduces the entropy somewhat; the shaded area in Figure 3-7 is considerably less than the area in Figure 3-6.
Figure 3-8. Entropy chart for the Mushroom dataset as split by SPORE-PRINT-COLOR. The amount of shading corresponds to the total (weighted sum) entropy, with each bar corresponding to the entropy of one of the attribute's values, and the width of the bar corresponding to the prevalence of that value in the data.

Similarly, Figure 3-8 shows how SPORE-PRINT-COLOR decreases uncertainty (entropy). A few of the values, such as h (chocolate), specify the target value perfectly and thus produce zero-entropy bars. But notice that they do not account for very much of the population, only about 30%.

Figure 3-9 shows the graph produced by ODOR. Many of the values, such as a (almond), c (creosote), and m (musty), produce zero-entropy partitions; only n (no odor) has considerable entropy (about 20%). In fact, ODOR has the highest information gain of any attribute in the Mushroom dataset. It can reduce the dataset's total entropy to about 0.1, which gives it an information gain of 0.96 - 0.1 = 0.86. What is this saying? Many odors are completely characteristic of poisonous or edible mushrooms, so odor is a very informative attribute to check when considering mushroom edibility.5 If you are going to build a model to determine mushroom edibility using only a single feature, you should choose its odor. If you were going to build a more complex model, you might start with the attribute ODOR before considering adding others. In fact, this is exactly the topic of the next section.

5. This assumes that odor can be measured accurately, of course. If your sense of smell is poor, you may not want to bet your life on it. Frankly, you probably would not want to bet your life on the results of mining data from a field guide. Nevertheless, it makes a nice example.
Figure 3-9. Entropy chart for the Mushroom dataset as split by ODOR. The amount of shading corresponds to the total (weighted sum) entropy, with each bar corresponding to the entropy of one of the attribute's values, and the width of the bar corresponding to the prevalence of that value in the data.
Supervised Segmentation with Tree-Structured Models

We have now introduced one of the fundamental ideas of data mining: finding informative attributes from the data. Let's continue on the topic of creating a supervised segmentation, because as important as it is, attribute selection alone does not seem to be sufficient. If we select the single variable that gives the most information gain, we create a very simple segmentation. If we select multiple attributes, each giving some information gain, it is not clear how to put them together. Recall from earlier that we would like to create segments that use multiple attributes, such as "Middle-aged professionals who reside in New York City on average have a churn rate of 5%." We now introduce an elegant application of the ideas we have developed for selecting important attributes, to produce a multivariate (multiple attribute) supervised segmentation.

Consider a segmentation of the data to take the form of a "tree," such as that shown in Figure 3-10. In the figure, the tree is upside down with the root at the top. The tree is made up of nodes (interior nodes and terminal nodes) and branches emanating from the interior nodes. Each interior node in the tree contains a test of an attribute, with each branch from the node representing a distinct value, or range of values, of the attribute. Following the branches from the root node down (in the direction of the arrows), each path eventually terminates at a terminal node, or leaf. The tree creates a segmentation of the data: every data point will correspond to one and only one path in the tree, and thereby to one and only one leaf. In other words, each leaf corresponds to a segment, and the attributes and values along the path give the characteristics of the segment. So the rightmost path in the tree in Figure 3-10 corresponds to the segment "Older, unemployed people with high balances." The tree is a supervised segmentation, because each leaf contains a value for the target variable. Since we are talking about classification, here each leaf contains a classification for its segment. Such a tree is called a classification tree or, more loosely, a decision tree.

Classification trees often are used as predictive models: "tree-structured models." In use, when presented with an example whose classification we do not know, we can predict its classification by finding the corresponding segment and using the class value at the leaf. Mechanically, one would start at the root node and descend through the interior nodes, choosing branches based on the specific attribute values in the example. The nonleaf nodes are often referred to as "decision nodes," because when descending through the tree, at each node one uses the values of the attribute to make a decision about which branch to follow. Following these branches ultimately leads to a final decision about what class to predict: eventually a terminal node is reached, which gives a class prediction. In a tree, no two parents share descendants and there are no cycles; the branches always "point downwards," so that every example always ends up at a leaf node with some specific class determination.

Consider how we would use the classification tree in Figure 3-10 to classify an example of the person named Claudio from Figure 3-1. The values of Claudio's attributes are Balance=115K, Employed=No, and Age=40. We begin at the root node, which tests Employed. Since the value is No, we take the right branch. The next test is Balance. The value of Balance is 115K, which is greater than 50K, so we take the right branch again to a node that tests Age. The value is 40, so we take the left branch. This brings us to a leaf node specifying class=Not Write-off, representing a prediction that Claudio will not default.
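The descent through decision nodes can be sketched in code. This is a hypothetical reconstruction, not the tree in Figure 3-10 itself: the text only describes Claudio's path (Employed=No goes right, Balance above 50K goes right, Age 40 goes left to Not Write-off), so the age cutoff of 45 (`AGE_CUTOFF`) and the classes at the other three leaves are assumptions made purely so the example runs. The class and function names are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    prediction: str  # class value stored at the terminal node


@dataclass
class Decision:
    attribute: str                    # attribute tested at this node
    go_left: Callable[[dict], bool]   # True -> follow the left branch
    left: Union["Decision", Leaf]
    right: Union["Decision", Leaf]


def classify(node, example):
    """Descend from the root to a leaf; return the prediction and the path taken."""
    path = []
    while isinstance(node, Decision):
        branch = "left" if node.go_left(example) else "right"
        path.append((node.attribute, branch))
        node = node.left if branch == "left" else node.right
    return node.prediction, path


AGE_CUTOFF = 45  # assumed value; the text only implies a cutoff above 40

tree = Decision(
    "Employed", lambda e: e["Employed"] == "Yes",
    Leaf("Not Write-off"),                      # assumed leaf
    Decision(
        "Balance", lambda e: e["Balance"] < 50_000,
        Leaf("Not Write-off"),                  # assumed leaf
        Decision(
            "Age", lambda e: e["Age"] < AGE_CUTOFF,
            Leaf("Not Write-off"),              # the leaf Claudio reaches in the text
            Leaf("Write-off"),                  # assumed leaf
        ),
    ),
)

claudio = {"Balance": 115_000, "Employed": "No", "Age": 40}
prediction, path = classify(tree, claudio)
print(prediction)  # Not Write-off
print(path)        # [('Employed', 'right'), ('Balance', 'right'), ('Age', 'left')]
```

Because every decision node sends an example down exactly one branch, each example reaches exactly one leaf, which is the "one and only one path" property described above.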
Another way of saying this is that we have classified Claudio into a segment defined by (Employed=No, Balance=115K, Age …

… Demi Lovato
Support=0.010; Strength=0.419; Lift=27.59; Leverage=0.0100
I really hate slow computers & Random laughter when remembering something -> Finding Money In Your Pocket
Support=0.010; Strength=0.726; Lift=25.80; Leverage=0.0099
Skittles & Glowsticks -> Being Hyper!
Support=0.011; Strength=0.529; Lift=25.50
Linkin Park & Disturbed & System of a Down & Korn -> Slipknot
Support=0.011; Strength=0.862; Lift=25.50; Leverage=0.0107
Lil Wayne & Rihanna -> Drake
2. Thanks to Wally Wang for help with this. 3. View this page.
Chapter 12: Other Data Science Tasks and Techniques
Support=0.011; Strength=0.619; Lift=25.33; Leverage=0.0104
Skittles & Mountain Dew -> Gatorade
Support=0.010; Strength=0.519; Lift=25.23; Leverage=0.0100
SpongeBob SquarePants & Converse -> Patrick Star
Support=0.010; Strength=0.654; Lift=24.94; Leverage=0.0097
Ricci & Taylor Swift -> Miley Cyrus
Support=0.010; Strength=0.490; Lift=24.90; Leverage=0.0100
Disturbed & Three Days Grace -> Breaking Benjamin
Support=0.012; Strength=0.701; Lift=24.64; Leverage=0.0117
Eminem & Lil Wayne -> Drake
Support=0.014; Strength=0.594; Lift=24.30; Leverage=0.0131
Adam Sandler & System of a Down & Korn -> Slipknot
Support=0.010; Strength=0.819; Lift=24.23; Leverage=0.0097
Pink Floyd & Slipknot & System of a Down -> Korn
Support=0.010; Strength=0.810; Lift=24.05; Leverage=0.0097
Music & Anime -> Manga
Support=0.011; Strength=0.675; Lift=23.99; Leverage=0.0110
Medium IQ & Sour Gummy Worms -> I Love Cookie Dough
Support=0.012; Strength=0.568; Lift=23.86; Leverage=0.0118
Rihanna & Drake -> Lil Wayne
Support=0.011; Strength=0.849; Lift=23.55; Leverage=0.0104
I Love Cookie Dough -> Sour Gummy Worms
Support=0.014; Strength=0.569; Lift=23.28; Leverage=0.0130
Laughing until it hurts and you can't breathe! & I really hate slow computers -> Finding Money In Your Pocket
Support=0.010; Strength=0.651; Lift=23.12; Leverage=0.0098
Evanescence & Three Days Grace -> Breaking Benjamin
Support=0.012; Strength=0.656; Lift=23.06; Leverage=0.0117
Fun & Disneyland -> Walt Disney World
Support=0.011; Strength=0.615; Lift=22.95; Leverage=0.0103
i finally stop laughing... i look at you and start all over -> That awkward moment when you look at someone who is looking at you.
Support=0.011; Strength=0.451; Lift=22.92; Leverage=0.0104
Selena Gomez -> Miley Cyrus
Support=0.011; Strength=0.443; Lift=22.54; Leverage=0.0105
Co-occurrences and Associations: Finding Items That Go Together
Reese's & Starburst -> Kelloggs Pop-Tarts
Support=0.011; Strength=0.493; Lift=22.52; Leverage=0.0102
Skittles & SpongeBob SquarePants -> Patrick Star
Support=0.012; Strength=0.590; Lift=22.49; Leverage=0.0112
Disney & DORY & Toy Story -> Finding Nemo
Support=0.011; Strength=0.777; Lift=22.47; Leverage=0.0104
Katy Perry & Taylor Swift -> Miley Cyrus
Support=0.011; Strength=0.441; Lift=22.43; Leverage=0.0101
AKON & Black Eyed Peas -> Usher
Support=0.010; Strength=0.731; Lift=22.42; Leverage=0.0097
Eminem & Drake -> Lil Wayne
Support=0.014; Strength=0.807; Lift=22.39; Leverage=0.0131
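To make the four numbers attached to each rule concrete, here is a minimal sketch of how they could be computed for a rule A -> B from a collection of users' Likes. It assumes the usual definitions: support is p(A, B), strength is p(B | A), lift is p(A, B) / (p(A) p(B)), and leverage is p(A, B) - p(A) p(B). The function name `rule_metrics` and the ten-user toy data set are invented for illustration; they are not the Facebook data behind the list above.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, strength, lift, and leverage for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    p_a = sum(a <= t for t in transactions) / n          # fraction containing all of A
    p_b = sum(b <= t for t in transactions) / n          # fraction containing all of B
    p_ab = sum((a | b) <= t for t in transactions) / n   # fraction containing A and B
    return {
        "support": p_ab,
        "strength": p_ab / p_a,
        "lift": p_ab / (p_a * p_b),
        "leverage": p_ab - p_a * p_b,
    }


# Hypothetical toy data: each set is one user's Likes.
likes = [
    {"Eminem", "Drake", "Lil Wayne"},
    {"Eminem", "Drake", "Lil Wayne"},
    {"Eminem", "Drake"},
    {"Lil Wayne"},
    {"Disney"},
    {"Disney", "Toy Story"},
    {"Toy Story"},
    {"Drake"},
    {"Eminem"},
    {"Disney"},
]

metrics = rule_metrics(likes, {"Eminem", "Drake"}, {"Lil Wayne"})
print(metrics)
```

A lift well above 1 and a positive leverage both say the same thing in different units: the items co-occur more often than they would if they were independent.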
Most uses of association mining involve domains (such as Facebook Likes) where readers already have a fair knowledge of the domain. This is because otherwise, since the mining is unsupervised, evaluation depends much more critically on domain knowledge validation (recall the discussion in Chapter 6): we do not have a well-defined target task for an objective evaluation. However, an interesting practical use of association mining is to explore data that we do not understand very well. Consider starting a new job. Exploring the company's customer transaction data and examining the strong co-occurrences can quickly give a broad overview of the taste relationships in the customer base. With that in mind, look back at the co-occurrences in the Facebook Likes and pretend that this was not a domain of popular culture: these and others like them (there are huge numbers of such associations) would give you a very broad view of the related tastes of the customers.
Profiling: Finding Typical Behavior
Profiling attempts to characterize the typical behavior of an individual, group, or population. An example profiling question might be: What is the typical credit card usage of this customer segment? The answer could be a simple average of spending, but such a simple description might not represent the behavior well for our business task. For example, fraud detection often uses profiling to characterize normal behavior and then looks for instances that deviate substantially from the normal behavior, especially in ways that have previously been indicative of fraud (Fawcett & Provost, 1997; Bolton & Hand, 2002). Profiling credit card usage for fraud detection might require a complex description of weekday and weekend averages, international usage, usage across merchants and product categories, usage from suspicious merchants, and so on. Behavior can be described generally over an entire population, at the level of small groups, or even for each individual. For example, each credit card user might be profiled with respect to his or her own
international usage, so as not to create many false alarms for an individual who commonly travels abroad.
Profiling combines concepts discussed previously. Profiling can essentially involve clustering, if there are subgroups of the population with different behaviors. Many profiling methods seem complicated, but in essence they are simply instantiations of the fundamental concept introduced in Chapter 4: define a numeric function with some parameters, define a goal or objective, and find the parameters that best meet the objective.
So let's consider a simple example from business operations management. Businesses would like to use data to help understand how well their call centers are supporting their customers.4 One aspect of supporting customers well is not leaving them on hold for long periods of time. So how might we profile the typical wait time of our customers who call into the call center? We might calculate the mean and standard deviation of the wait time.
That seems like exactly what a manager with basic statistical training might do, and it turns out to be a simple instance of model fitting. Here's why. Let's assume that customer wait times follow a Normal or Gaussian distribution. Saying such things can cause a nonmathematical person to fear what's to come, but it just means that the distribution follows a bell curve with some particularly nice properties. Importantly, it is a "profile" of the wait times that (in this case) has only two important parameters: the mean and the standard deviation.
When we calculate the mean and standard deviation, we are finding the "best" profile or model of wait time under the assumption that it is Normally distributed. In this case "best" is the same notion that we discussed for logistic regression: for example, the mean we calculate from the data gives us the mean of the Gaussian distribution that is most likely to have generated the data (the "maximum likelihood" model). This view illustrates why a data science perspective can help even in simple scenarios: it is much clearer now what we are doing when we calculate averages and standard deviations, even if our memory of the details from statistics classes is foggy.
We also need to keep in mind the fundamental principle introduced in Chapter 4 and elaborated in Chapter 7: we need to consider carefully what we want from our data science results. Here we would like to profile the "normal" wait time of our customers. If we plot the data and they do not look like they came from a Gaussian (the symmetric bell curve that goes to zero quickly in the "tails"), we should reconsider simply reporting the mean and standard deviation. We might instead report the median, which is not so sensitive to the skew, or possibly even better, fit a different distribution (maybe after talking to a statistically oriented data scientist about what might be appropriate).
4. The interested reader is encouraged to read Brown et al. (2005) for a technical treatment and details on this application.
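The claim that the mean and standard deviation are the "best" Gaussian parameters can be checked numerically. The sketch below fits a Gaussian by maximum likelihood and confirms that moving the mean away from the sample mean lowers the likelihood. It also shows the mean being pulled above the median on skewed data. The wait times are simulated (drawn from a skewed distribution chosen purely for illustration), and the names `fit_gaussian` and `log_likelihood` are hypothetical helpers, not part of any library.

```python
import math
import random
import statistics


def fit_gaussian(xs):
    """Maximum-likelihood Gaussian fit: the sample mean and the 1/n standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return mu, sigma


def log_likelihood(xs, mu, sigma):
    """Log-likelihood of the data under a Gaussian with the given parameters."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )


# ASSUMPTION: simulated, right-skewed wait times in seconds (not real call-center data).
rng = random.Random(0)
wait_times = [rng.lognormvariate(3.5, 1.0) for _ in range(2000)]

mu, sigma = fit_gaussian(wait_times)
median = statistics.median(wait_times)

best = log_likelihood(wait_times, mu, sigma)
shifted = log_likelihood(wait_times, mu + 5.0, sigma)  # any other mean fits worse

print(f"mean={mu:.1f} median={median:.1f}")
print(f"log-likelihood at fitted mean={best:.1f}, at shifted mean={shifted:.1f}")
```

On skewed data like this the fitted Gaussian is still the most likely Gaussian, but it is a poor profile of a typical wait, which is why the median or a different distribution may serve better.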
Figure 12-1. A distribution of wait times for callers into a bank's call center.
To illustrate how a data science-savvy manager might proceed, let's look at a distribution of wait times for callers into a bank's call center over a couple of months. Figure 12-1 shows such a distribution. Importantly, we see how visualizing the distribution should cause our data science radar to issue an alert. The distribution is not a symmetric bell curve. We should then worry about simply profiling wait times by reporting the mean and standard deviation. For example, the mean (100) does not seem to satisfy our desire to profile how long our customers normally wait; it seems too large. Technically, the long "tail" of the distribution skews the mean upward, so it does not represent faithfully where most of the data really lie. It does not represent faithfully the normal wait time of our customers.
To give more depth to what our data science-savvy manager might do, let's go a little further. We will not get into the details here, but a common trick for dealing with data that are skewed in this way is to take the logarithm (log) of the wait times. Figure 12-2 shows the same distribution as Figure 12-1, except using the logarithms of the wait times. We now see that after this simple transformation, the wait times look very much like the classic bell curve.
Figure 12-2. The distribution of wait times for callers into a bank's call center after a quick transformation of the data.
Indeed, Figure 12-2 also shows an actual Gaussian distribution (the bell curve) fit to the bell-shaped distribution, as described above. It fits very well, and thus we have a justification for reporting the mean and standard deviation as summary statistics of the profile of (log) wait times.5
This simple example extends nicely to more complex situations. Shifting contexts, let's say we want to profile customer behavior in terms of their spending and their time on our website. We believe these to be correlated, but not perfectly, as with the points plotted in Figure 12-3. Again, a very common tack is to apply the fundamental notion of Chapter 4: choose a parameterized numeric function and an objective, and find parameters that maximize the objective. For example, we can choose a two-dimensional Gaussian, which is essentially a bell oval instead of a bell curve: an oval-shaped blob that is very dense in the center and thins out toward the edges. This is represented by the contour lines in Figure 12-3.
5. A statistically trained data scientist might have noticed immediately the shape of the distribution of the original data, shown in Figure 12-1. This is a so-called log-normal distribution, which just means that the logs of the quantities in question are normally distributed.
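The effect of the log transformation can be illustrated with simulated data. In the sketch below, wait times are drawn from a log-normal distribution (an assumption standing in for the bank's data, which is not available here), and a simple skewness statistic is computed before and after taking logs. The helper `skewness` is written out by hand for illustration.

```python
import math
import random


def skewness(xs):
    """Sample skewness: the average cubed z-score (about 0 for a symmetric bell curve)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n


# ASSUMPTION: simulated log-normal wait times, not the data behind Figures 12-1 and 12-2.
rng = random.Random(1)
wait_times = [rng.lognormvariate(4.0, 0.8) for _ in range(5000)]
log_waits = [math.log(w) for w in wait_times]

raw_skew = skewness(wait_times)
log_skew = skewness(log_waits)

# After the transformation, the mean and standard deviation summarize the profile well.
log_mean = sum(log_waits) / len(log_waits)
log_sd = math.sqrt(sum((x - log_mean) ** 2 for x in log_waits) / len(log_waits))

print(f"skewness before={raw_skew:.2f}, after log={log_skew:.2f}")
print(f"profile of log wait times: mean={log_mean:.2f}, sd={log_sd:.2f}")
```

The raw wait times have a long right tail, while their logarithms are close to symmetric, so the two-parameter Gaussian profile becomes reasonable only after the transformation.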
Figure 12-3. A profile of our customers with respect to their spending and the time they spend on our website, represented as a two-dimensional Gaussian fit to the data.
We can keep extending the idea to more and more sophisticated profiling. What if we believe there are different subgroups of customers with different behaviors? We may not be willing to simply fit a single Gaussian distribution to the behavior. However, maybe we are comfortable assuming that there are k groups of customers, each of whose behavior is normally distributed. We can fit a model with multiple Gaussians, called a Gaussian Mixture Model (GMM). Applying our fundamental concept again, finding the maximum-likelihood parameters identifies the k Gaussians that fit the data best (with respect to this particular objective function). We see an example with k=2 in Figure 12-4. The figure shows how the fitting procedure identifies two different groups of customers, each modeled by a two-dimensional Gaussian distribution.
Figure 12-4. A profile of our customers with respect to their spending and the time they spend on our website, represented as a Gaussian Mixture Model (GMM), with two two-dimensional Gaussians fit to the data. The GMM provides a "soft" clustering of our customers along these two dimensions.
Now we have a rather sophisticated profile, which we can understand as a surprisingly straightforward application of our fundamental principles. An interesting side note is that this GMM has produced a clustering for us, but in a different way from the clusterings presented in Chapter 6. This illustrates how fundamental principles, rather than specific tasks or algorithms, form the basis for data science. In this case, clustering can be done in many different ways, just as classification and regression can be.
Note: "Soft" Clustering
Incidentally, you may notice that the clusters in the GMM overlap with each other. The GMM provides what is called a "soft" or probabilistic clustering. Each point does not strictly belong to a single cluster, but instead has a degree or probability of membership in each cluster. In this particular clustering, we can think of a point as more likely to have come from some clusters than from others. However, there is still a possibility, perhaps remote, that the point may have come from any of them.
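A minimal sketch of soft clustering with a Gaussian mixture follows. To keep it short it works in one dimension (spending only) rather than the two dimensions of Figure 12-4, and it uses the standard expectation-maximization procedure as the fitting method, which is one common way to search for maximum-likelihood mixture parameters. The data are simulated, and `fit_gmm_1d` and `membership` are hypothetical helper names.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def membership(x, weights, mus, sigmas):
    """Soft assignment: probability that x came from each Gaussian component."""
    p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(p)
    return [pj / total for pj in p]


def fit_gmm_1d(xs, k=2, iters=100):
    """Fit a k-component 1D Gaussian mixture by expectation-maximization."""
    ordered = sorted(xs)
    n = len(xs)
    # Crude initialization: spread the starting means across the sorted data.
    mus = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mean = sum(xs) / n
    sigmas = [math.sqrt(sum((x - mean) ** 2 for x in xs) / n)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: membership probabilities for every point.
        resp = [membership(x, weights, mus, sigmas) for x in xs]
        # M-step: re-estimate each component from its softly assigned points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)
    return weights, mus, sigmas


# ASSUMPTION: simulated spending for two customer groups (means 20 and 60).
rng = random.Random(0)
spend = [rng.gauss(20, 5) for _ in range(300)] + [rng.gauss(60, 8) for _ in range(300)]

weights, mus, sigmas = fit_gmm_1d(spend)
print("component means:", [round(m, 1) for m in mus])
print("membership of a customer spending 40:", membership(40.0, weights, mus, sigmas))
```

A customer spending 40 sits between the two groups, so neither membership probability is negligible; a customer spending 20 belongs almost entirely to the first group.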
Link Prediction and Social Recommendation
Sometimes, instead of predicting a property (target value) of a data item, it is more useful to predict connections between data items. A common example of this is predicting that a link should exist between two individuals. Link prediction is common in social networking systems: since you and Karen share 10 friends, maybe you'd like to be Karen's friend? Link prediction can also estimate the strength of a link. For example, for recommending movies to customers, one can think of a graph between customers and the movies they have watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.
There are many approaches to link prediction, and even an entire chapter of this book would not do them justice. However, we can understand a wide variety of approaches using our fundamental concepts of data science. Let's consider the social network case. Knowing what you know now, if you had to predict either the presence or the strength of a link between two individuals, how would you go about framing the problem? We have several alternatives. We could presume that links should be between similar individuals. We know then that we need to define a similarity measure that takes into account the important aspects of our application.
Could we define a similarity measure between two individuals that would indicate that they might like to be friends? (Or are already friends, depending on the application.) Sure. Using the example above directly, we could consider the similarity to be the number of shared friends. Of course, the similarity measure could be more sophisticated: we could weight the friends by the amount of communication, geographical proximity,
or some other factor, and then find or devise a similarity function that takes these strengths into account. We could use this friend strength as one aspect of similarity while also including others (since after Chapter 6 we are comfortable with multivariate similarity), such as shared interests, shared demographics, and so on. In essence, we could apply what we know about finding similar data items to people, by considering the different ways in which people can be represented as data.
That is one way to attack the link prediction problem. Let's consider another, just to continue to illustrate how the fundamental principles apply to other tasks. Since we want to predict the existence (or strength) of a link, we might well decide to cast the task as a predictive modeling problem. So we can apply our framework for thinking about predictive modeling problems. As always, we start with business and data understanding. What would we consider to be an instance? At first, we might think: wait a minute, here we are looking at the relationship between two instances. Our conceptual framework comes in very handy: let's stick to our guns and define an instance for prediction. What exactly is it that we want to predict? We want to predict the existence of a relationship (or its strength, but let's just consider existence here) between two people. So an instance should be a pair of people!
Once we have defined an instance to be a pair of people, we can proceed smoothly. Next, what would be the target variable? Whether the relationship exists, or would be formed if recommended. Would this be a supervised task? Yes, we can get training data where links already do or do not exist, or if we wanted to be more careful, we could invest in acquiring labels specifically for the recommendation task (we would need to spend a bit more time than we have here on defining the exact semantics of the link). What would be the features?
These would be features of the pair of people, such as how many friends the two individuals have in common, how similar their interests are, and so on. Now that we have cast the problem in the form of a predictive modeling task, we can start to ask what sorts of models we would apply and how we would evaluate them. This is the same conceptual procedure we go through for any predictive modeling task.
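Both framings above can be sketched on a toy friendship graph. The first function ranks unlinked pairs by the number of shared friends (the similarity approach); the second builds one instance per pair of people, with a feature and a target, as the predictive modeling framing would require. The graph, the names in it, and the function names are all invented for illustration.

```python
from itertools import combinations

# Hypothetical undirected friendship graph: person -> set of friends.
friends = {
    "you": {"ann", "bob", "cat"},
    "karen": {"ann", "bob", "cat", "dan"},
    "ann": {"you", "karen"},
    "bob": {"you", "karen"},
    "cat": {"you", "karen"},
    "dan": {"karen"},
}


def shared_friends(graph, a, b):
    """Similarity between two people: the number of friends they have in common."""
    return len(graph[a] & graph[b])


def candidate_links(graph):
    """Score every pair that is not yet linked, most similar pairs first."""
    scores = [
        (shared_friends(graph, a, b), a, b)
        for a, b in combinations(sorted(graph), 2)
        if b not in graph[a]
    ]
    return sorted(scores, reverse=True)


def pair_instances(graph):
    """Predictive modeling framing: one instance per pair, with a feature and a target."""
    return [
        {"pair": (a, b), "shared_friends": shared_friends(graph, a, b), "linked": b in graph[a]}
        for a, b in combinations(sorted(graph), 2)
    ]


ranked = candidate_links(friends)
print("top suggestion:", ranked[0])
print("number of pair instances:", len(pair_instances(friends)))
```

In a real application the instance would carry many more features (shared interests, demographics, communication volume), and a model trained on existing links would replace the raw count.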
Data Reduction, Latent Information, and Movie Recommendation
For some business problems, we would like to take a large set of data and replace it with a smaller set that preserves much of the important information in the larger set. The smaller dataset may be easier to deal with or to process. Moreover, the smaller dataset may better reveal the information contained within it. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer preferences for movie genre). Data reduction usually involves sacrificing some information, but what is important is the trade-off between the insight or manageability gained and the information lost. This is often a trade worth making.
As with link prediction, data reduction is a general task, not a particular technique. There are many techniques, and we can use our fundamental principles to understand them. Let's discuss a popular technique as an example.
Let's continue to talk about movie recommendations. In a now famous (at least in data science circles) contest, the movie rental company Netflix™ offered a million dollars to the individual or team that could best predict how consumers would rate movies. Specifically, they set a prediction performance goal on a holdout data set and awarded the prize to the team that first reached this goal.6 Netflix made available historical data on movie ratings assigned by its customers. The winning team7 produced an extremely complicated technique, but much of the success is attributed to two aspects of the solution: (i) the use of ensembles of models, which we will discuss in "Bias, Variance, and Ensemble Methods" on page 308, and (ii) data reduction. The main data reduction technique that the winners used can be described easily using our fundamental concepts.
The problem to be solved was essentially a link prediction problem, where specifically we would like to predict the strength of the link between a user and a movie, the strength representing how much the user would like it. As we just discussed, this can be cast as a predictive modeling problem.
However, what would be the features for this relationship between a user and a movie? One of the most popular approaches for providing recommendations, described in detail in a very nice article by several of the Netflix competition winners (Koren, Bell, & Volinsky, 2009), is to base the model on latent dimensions underlying the preferences. The term "latent," in data science, means "relevant but not observed explicitly in the data." Chapter 10 discussed topic models, another form of latent model, where the latent information is the set of topics in the documents. Here the latent dimensions of movie preference include possible characterizations like serious versus escapist, comedy versus drama, orientation toward children, or gender orientation. Even if these are not represented explicitly in the data, they may be important for judging whether a particular user will like the movie. The latent dimensions also could include possibly ill-defined things like depth of character development or quirkiness, as well as dimensions never explicitly articulated, since the latent dimensions will emerge from the data.
Again, we can understand this advanced data science approach as a combination of fundamental concepts. The idea of the latent dimension approaches to recommendation is to represent each movie as a feature vector using the latent dimensions, and also to represent each user's preferences as a feature vector using the latent dimensions. Then it is easy to find movies to recommend to any user: compute a similarity score between the user and all the movies; the movies that best match the user's preferences would be those movies most similar to the user, when both are represented by the same latent dimensions.
6. There are some technicalities to the rules of the Netflix Challenge, which you can find on the Wikipedia page.
7. The winning team, Bellkor's Pragmatic Chaos, had seven members. The history of the contest and the evolution of the team is complicated and fascinating. See the Wikipedia page on the Netflix Prize.
Figure 12-5. A collection of movies placed in a "taste space" defined by the two strongest latent dimensions mined from the Netflix Challenge data. See the text for a detailed discussion. A customer would also be placed somewhere in the space, based on the movies she has previously viewed or rated. A similarity-based recommendation approach would suggest the closest movies to the customer as candidate recommendations.
Figure 12-5 shows a two-dimensional latent space actually mined from the Netflix movie data,8 as well as a collection of movies represented in this new space. The interpretation of such latent dimensions mined from data must be inferred by the data scientists or business users. The most common way is to observe how the dimensions separate the movies, then apply domain knowledge.
In Figure 12-5, the latent dimension represented by the horizontal axis seems to separate the movies into drama-oriented films on the right and action-oriented films on the left.
8. Thanks to one of the members of the winning team, Chris Volinsky, for his help here.
At the extremes, on the far right we see films of the heart such as The Sound of Music, Moonstruck, and When Harry Met Sally. On the far left we see whatever is the opposite of films of the heart (films of the gut?), including films focusing on the stereotypical likes of men and adolescent boys (The Man Show, Porky's), killing (Texas Chainsaw Massacre, Reservoir Dogs), speed (The Fast and the Furious), and monster hunting (Van Helsing). The latent dimension represented by the vertical axis seems to separate the movies by intellectual appeal versus emotional appeal, with movies like Being John Malkovich, Fear and Loathing in Las Vegas, and Annie Hall at one extreme, and Maid in Manhattan, The Fast and the Furious, and You've Got Mail at the other. Feel free to disagree with these interpretations of the dimensions; they are completely subjective. One thing is clear, though: The Wizard of Oz captures an unusual balance of whatever tastes are represented by the latent dimensions.
To use this latent space for recommendation, a customer also would be placed somewhere in the space, based on the movies she has rented or rated. The movies closest to the position of the customer would be good candidates for recommendations. Note that when making recommendations, we always need to keep in mind our business understanding. For example, different movies may have different profit margins, so we may want to combine this knowledge with the knowledge of the most similar movies.
But how do we find the right latent dimensions in the data? We apply the fundamental concept introduced in Chapter 4: we represent the similarity calculation between a user and a movie as a mathematical formula using some number d of as-yet-unknown latent dimensions. Each dimension is represented by a set of weights (the coefficients) on each movie and a set of weights on each customer. A high weight means that this dimension is strongly associated with the movie or the customer. The meaning of the dimension is purely implicit in the weights on the movies and customers.
For example, we might look at the movies that are weighted highly on some dimension versus the low-weighted movies and decide, "the highly weighted movies are all 'quirky.'" We could then think of the dimension as the degree of quirkiness of the movie, but it is important to keep in mind that this interpretation of the dimension was imposed by us. The dimension is simply some way in which the movies clustered in the data on how customers rated movies.
Recall that to fit a numeric-function model to data, we find the optimal set of parameters for the numeric function. Initially, the d dimensions are purely a mathematical abstraction; only after the parameters are selected to fit the data can we try to formulate an interpretation of the meaning of the latent dimensions (and sometimes such an effort is fruitless). Here, the parameters of the function are the (unknown) weights for each customer and each movie along these dimensions. Intuitively, data mining is determining simultaneously (i) how quirky the movie is, and (ii) how much this viewer likes quirky movies.
We also need an objective function to determine what is a good fit. We define our objective function for training based on the set of movie ratings we have observed. We find a set of weights that characterizes the users and the movies along these dimensions. Different objective functions are used for the movie-recommendation problem. For example, we can choose the weights that allow us to best predict the observed ratings in the training data (subject to regularization, as discussed in Chapter 4). Alternatively, we could choose the dimensions that best explain the variation in the observed ratings. This is often called "matrix factorization," and the interested reader might start with the paper about the Netflix Challenge (Koren, Bell, & Volinsky, 2009).
The result is that we have for each movie a representation along some reduced set of dimensions (perhaps how quirky it is, perhaps whether it is a "tear-jerker" or a "guy flick," or whatever): the best d latent dimensions that the training finds. We also have a representation of each user in terms of their preferences along these dimensions. We can now look back at Figure 12-5 and the associated discussion. These are the two latent dimensions that best fit the data, that is, the dimensions that result from fitting the data with d=2.
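As a rough illustration of the fitting step, the sketch below learns d=2 latent weights per user and per movie by stochastic gradient descent on squared rating error with L2 regularization. This is one simple way to optimize the first objective mentioned above; it is not the winning team's actual method. The 4x4 rating table, the learning rate, the regularization strength, and the function names are all assumptions chosen so the example runs quickly.

```python
import random


def predict(P, Q, u, m):
    """Predicted rating: dot product of the user's and the movie's latent weights."""
    return sum(pu * qm for pu, qm in zip(P[u], Q[m]))


def rmse(P, Q, ratings):
    return (sum((r - predict(P, Q, u, m)) ** 2 for u, m, r in ratings) / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_movies, d=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Fit latent weights by SGD on squared error plus an L2 penalty."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(d)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(d)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - predict(P, Q, u, m)
            for f in range(d):
                pu, qm = P[u][f], Q[m][f]
                P[u][f] += lr * (err * qm - reg * pu)
                Q[m][f] += lr * (err * pu - reg * qm)
    return P, Q


# Hypothetical ratings: users 0-1 like movies 0-1, users 2-3 like movies 2-3.
full = [
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
# Hold out user 0's rating of movie 3 to see what the model predicts for it.
ratings = [(u, m, full[u][m]) for u in range(4) for m in range(4) if (u, m) != (0, 3)]

P, Q = factorize(ratings, n_users=4, n_movies=4)
train_error = rmse(P, Q, ratings)
print(f"training RMSE: {train_error:.3f}")
print(f"predicted rating for held-out (user 0, movie 3): {predict(P, Q, 0, 3):.2f}")
```

The learned columns of weights have no built-in meaning; any label such as "quirkiness" would have to be read off afterward by inspecting which movies score high or low on each dimension.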
Bias, Variance, and Ensemble Methods
In the Netflix competition, the winners also took advantage of another common data science technique: they built lots of different recommendation models and combined them into a single model. In data mining parlance, this is referred to as creating an ensemble model. Ensembles have been observed to improve generalization performance in many situations, not just for recommendations, but broadly across classification, regression, class probability estimation, and more.
Why is a collection of models often better than a single model? If we consider each model as a sort of "expert" on a target prediction task, we can think of an ensemble as a collection of experts. Instead of just asking one expert, we find a group of experts and then somehow combine their predictions. For example, we could have them vote on a classification, or we could average their numeric predictions. Notice that this is a generalization of the method introduced in Chapter 6 to turn similarity computations into "nearest neighbor" predictive models. To make a k-NN prediction, we find a group of similar examples (very simple experts) and then apply some function to combine their individual predictions. Thus a k-nearest-neighbor model is a simple ensemble method. Generally, ensemble methods use more complex predictive models as their "experts"; for example, they may build a group of classification trees and then report an average (or weighted average) of the predictions.
When might we expect ensembles to improve our performance? Certainly, if each expert knew exactly the same things, they would all give the same predictions, and the ensemble would provide no advantage. On the other hand, if each expert was knowledgeable in a slightly different aspect of the problem, they might give complementary predictions,
and the whole group might provide more information than any individual expert. Technically, we would like the experts to make different sorts of errors: we would like their errors to be as unrelated as possible, and ideally completely independent. In averaging the predictions, the errors would then tend to cancel out, the predictions would indeed be complementary, and the ensemble would be superior to using any one expert. Ensemble methods have a long history and are an active area of research in data science. Much has been written about them. The interested reader may want to start with the review article by Dietterich (2000).
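The cancellation of independent errors is easy to see in a small simulation. In this sketch each "expert" is deliberately simplistic: it returns the true value plus independent Gaussian noise. That idealization (independent, unbiased errors) is an assumption; real models usually have correlated errors, so the gain in practice is smaller.

```python
import random
import statistics


def expert_prediction(truth, rng, noise_sd=1.0):
    """A hypothetical expert: the true value plus its own independent error."""
    return truth + rng.gauss(0.0, noise_sd)


def mean_squared_error(n_experts, trials=2000, seed=0):
    """Average squared error when the predictions of n_experts are averaged."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        truth = rng.uniform(0, 10)
        preds = [expert_prediction(truth, rng) for _ in range(n_experts)]
        total += (statistics.fmean(preds) - truth) ** 2
    return total / trials


single = mean_squared_error(1)
ensemble = mean_squared_error(25)
print(f"single expert MSE: {single:.3f}")
print(f"25-expert ensemble MSE: {ensemble:.3f}")
```

With fully independent errors the squared error shrinks roughly in proportion to the number of experts; if every expert made the same error, averaging would not help at all.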
One way to understand why ensembles work is to understand that the errors a model makes can be characterized by three factors:
1. Inherent randomness,
2. Bias, and
3. Variance.
The first, inherent randomness, simply covers cases where a prediction is not "deterministic" (that is, we simply do not always get the same value for the target variable every time we see the same set of features). For example, customers described by a certain set of characteristics may not always either purchase our product or not. The prediction may simply be inherently probabilistic given the information we have. Thus, a portion of the observed "error" in prediction is simply due to this inherent probabilistic nature of the problem. We can debate whether a particular data-generating process is truly probabilistic, as opposed to our simply not seeing all the requisite information, but that debate is largely academic,9 because the process may be essentially probabilistic based on the data we have available. Let's proceed assuming that we have reduced the randomness as much as we can, and there simply is some theoretical maximum accuracy that we can achieve for this problem. This accuracy is called the Bayes rate, and it is generally unknown. For the rest of this section, we will consider the Bayes rate as being "perfect" accuracy.
9. The debate sometimes can bear fruit. For example, thinking about whether we have all the requisite information might reveal a new attribute that could be obtained and that would increase the possible predictability.
Beyond inherent randomness, models make errors for two other reasons. The modeling procedure may be "biased." What this means can be understood best in reference to learning curves (recall "Learning Curves" on page 130). Specifically, a modeling procedure is biased if, no matter how much training data we give it, the learning curve never reaches perfect accuracy (the Bayes rate). For example, suppose we learn a (linear) logistic regression to predict response to an advertising campaign. If the true response really is more complex than the linear model can represent, the model will never achieve perfect accuracy.
The other source of error is due to the fact that we do not have an infinite amount of training data; we have some finite sample that we want to mine. Modeling procedures usually give different models from even slightly different samples. These different models will tend to have different accuracies. How much the accuracy tends to vary across different training sets (let's say, of the same size) is referred to as the modeling procedure's variance. Procedures with more variance tend to produce models with larger errors, all else being equal.
You might see now that we would like to have a modeling procedure that has no bias and no variance, or at least low bias and low variance. Unfortunately (and intuitively), there typically is a trade-off between the two. Lower-variance models tend to have higher bias, and vice versa. As a very simple example, we might decide to estimate the response to our advertising campaign by ignoring all the customer features and simply predicting the (average) purchase rate. This will be a very low-variance model, because we will tend to get about the same average from different datasets of the same size. However, we have no hope of getting perfect accuracy if there are customer-specific differences in propensity to purchase. On the other hand, we might decide to model customers based on one thousand detailed variables.
Επί του παρόντος, μπορεί να έχουμε την ευκαιρία να αποκτήσουμε πολύ καλύτερη ακρίβεια, αλλά αναμένουμε ότι θα υπάρχει πολύ μεγαλύτερη διαφοροποίηση στα μοντέλα που λαμβάνονται με βάση ελαφρώς διαφορετικές ομάδες εκπαίδευσης. Συνεπώς, δεν περιμένουμε απαραίτητα ότι χίλιες μεταβλητές θα είναι καλύτερες. Ο πλούτος δεν γνωρίζει ακριβώς ποια πηγή λάθους (προκατάληψη ή διακύμανση) θα κυριαρχήσει. Μπορεί να σκέφτεστε: Φυσικά. Όπως μάθαμε στο Κεφάλαιο 5, το μοντέλο χιλιάδων μεταβλητών θα είναι υπερπροσαρμοσμένο. Πρέπει να εφαρμόσουμε κάποιο είδος ελέγχου επιπλοκών, όπως να επιλέξουμε ένα υποσύνολο των μεταβλητών που θα χρησιμοποιήσουμε. Αυτό ακριβώς είναι. Ένα πιο προηγμένο επίπεδο συνήθως μας δίνει μια μικρότερη προκατάληψη, αλλά μια μεγαλύτερη προκατάληψη. Ο κανόνας της πολυπλοκότητας συνήθως προσπαθεί να διαχειριστεί την αντιστάθμιση (συχνά άγνωστη) μεταξύ μεροληψίας και τυχαίας, για να βρει ένα «γλυκό σημείο» όπου ο συνδυασμός των σφαλμάτων είναι μικρότερος. Πώς θα μπορούσαμε να εφαρμόσουμε την επιλογή μεταβλητής στο σύμπτωμα χιλίων μεταβλητών. Εάν υπάρχουν πραγματικές διαφορές για συγκεκριμένους πελάτες, συμπεριλαμβανομένου του ποσοστού αγοράς, και έχουμε αρκετά δεδομένα εκπαίδευσης, σίγουρα η επιλογή μεταβλητής δεν θα αποκλείσει τα πάντα γενικά, κάτι που θα μας άφηνε με τον προαναφερθέντα μέσο όρο στον πληθυσμό. Ας ελπίσουμε ότι θα αποκτούσαμε ένα μοντέλο με μια υποκατηγορία των μεταβλητών που μας επιτρέπουν να προβλέψουμε ακόμη και το καλύτερο δυνατό δεδομένων των διαθέσιμων δεδομένων διδασκαλίας.
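The trade-off can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the purchase propensity `true_rate`, the single customer feature `x`, and both model-fitting functions are hypothetical stand-ins (a 20-bin lookup table plays the role of the flexible, many-variable model). It fits each procedure on many training sets of the same size and reports, for one customer, the bias (average prediction minus the true rate) and the variance of the prediction across training sets.

```python
import random
import statistics


def true_rate(x):
    """Hypothetical purchase propensity for a customer with feature x in [0, 1)."""
    return 0.1 + 0.8 * x


def sample_training_set(rng, n):
    """Draw n (feature, bought) pairs from the assumed data-generating process."""
    data = []
    for _ in range(n):
        x = rng.random()
        bought = 1 if rng.random() < true_rate(x) else 0
        data.append((x, bought))
    return data


def mean_only_model(data):
    """Ignore the feature; always predict the average purchase rate."""
    rate = statistics.fmean(y for _, y in data)
    return lambda x: rate


def binned_model(data, n_bins=20):
    """Flexible model: a separate purchase rate for each narrow slice of x."""
    overall = statistics.fmean(y for _, y in data)
    bins = [[] for _ in range(n_bins)]
    for x, y in data:
        bins[min(int(x * n_bins), n_bins - 1)].append(y)
    # Fall back to the overall rate for a slice with no training examples.
    rates = [statistics.fmean(b) if b else overall for b in bins]
    return lambda x: rates[min(int(x * n_bins), n_bins - 1)]


def bias_and_variance(fit, x0, n_sets=500, n=200, seed=1):
    """Bias and variance of the prediction at x0 across many training sets."""
    rng = random.Random(seed)
    preds = [fit(sample_training_set(rng, n))(x0) for _ in range(n_sets)]
    bias = statistics.fmean(preds) - true_rate(x0)
    return bias, statistics.pvariance(preds)


if __name__ == "__main__":
    for name, fit in [("mean-only", mean_only_model), ("binned", binned_model)]:
        bias, variance = bias_and_variance(fit, x0=0.925)
        print(f"{name:10s} bias={bias:+.3f} variance={variance:.4f}")
```

In this setup the mean-only procedure should show a large bias but very little variance, while the binned procedure should show nearly no bias but noticeably more variance, since each slice is estimated from only a handful of customers.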
Technically, the accuracies we discuss in this section are the expected values of the accuracies of the models. We omit that qualification because the discussion otherwise becomes technically baroque. The reader interested in understanding bias, variance, and the trade-off between them might start with the technical but quite readable article by Friedman (1997).
Now we can see why ensemble techniques might work. If we have a modeling method with high variance, averaging over multiple predictions reduces the variance in the predictions. Indeed, ensemble methods tend to improve predictive ability more for higher-variance methods, such as in cases where you would expect more overfitting (Perlich, Provost, & Simonoff, 2003). Ensemble methods are often used with tree induction, as classification and regression trees tend to have high variance. In the field you may hear about random forests, bagging, and boosting. These are popular ensemble methods for trees (the latter two are more general). Check out Wikipedia to find out more about them.
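As a rough illustration of variance reduction, the sketch below applies bagging (bootstrap aggregating) to a deliberately high-variance base model. Everything here is an assumption made for the example: the target function, the noise level, and the choice of a one-nearest-neighbor predictor in place of a tree, which keeps the code short and dependency-free. It measures how much the prediction at a single point varies across independently drawn training sets, with and without bagging.

```python
import math
import random
import statistics


def make_training_set(rng, n=100):
    """Noisy observations of a smooth (hypothetical) target function."""
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 1))
            for x in (rng.random() for _ in range(n))]


def one_nn_predict(data, x0):
    """High-variance base model: copy the label of the closest training point."""
    return min(data, key=lambda point: abs(point[0] - x0))[1]


def bagged_predict(data, x0, rng, n_models=25):
    """Average the base model's predictions over bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        resample = rng.choices(data, k=len(data))  # sample with replacement
        preds.append(one_nn_predict(resample, x0))
    return statistics.fmean(preds)


def prediction_variance(n_sets=300, x0=0.3, seed=2):
    """Variance of the prediction at x0 across training sets: (single, bagged)."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_sets):
        data = make_training_set(rng)
        single.append(one_nn_predict(data, x0))
        bagged.append(bagged_predict(data, x0, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)


if __name__ == "__main__":
    single_var, bagged_var = prediction_variance()
    print(f"single model variance: {single_var:.3f}")
    print(f"bagged model variance: {bagged_var:.3f}")
```

The bagged prediction should vary noticeably less from one training set to the next than the single model's prediction. A production ensemble would more likely use a library implementation of bagged trees or random forests; this sketch only shows the mechanism.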
Data-Driven Causal Explanation and a Viral Marketing Example

One important topic that we have only touched on in this book (in Chapter 2 and Chapter 11) is causal explanation from data. Predictive modeling is extremely useful for many business problems. However, the sort of predictive modeling that we have discussed so far is based on correlations rather than on knowledge of causation. We often want to look more deeply into a phenomenon and ask what influences what. We may want to do this simply to understand our business better, or we may want to use data to improve decisions about how to intervene to cause a desired outcome.

Consider a detailed example. Recently there has been much attention paid to "viral" marketing. One common interpretation of viral marketing is that consumers can be helped to influence each other to purchase a product, and so a marketer can get significant benefit by "seeding" certain consumers (e.g., by giving them the product for free), who will then be "influencers": they will increase the likelihood that the people they know will purchase the product. The holy grail of viral marketing is to be able to create campaigns that spread like an epidemic, but the critical assumption behind "virality" is that consumers actually influence each other. How much do they? Data scientists work to measure such influence by observing in the data whether, once a consumer has the product, her social network neighbors indeed have an increased likelihood of purchasing the product.

Unfortunately, a naive analysis of the data can be tremendously misleading. For important sociological reasons (McPherson, Smith-Lovin, & Cook, 2001), people tend to cluster in social networks with people who are similar to them. Why is this important? It means that social network neighbors are likely to have similar product preferences, and we would expect the neighbors of people who choose or like a product also to choose or like the product, even in the absence of any causal influence among the consumers! Indeed, based on the careful application of causal analysis, it was shown in the Proceedings of the National Academy of Sciences (Aral, Muchnik, & Sundararajan, 2009) that traditional methods for estimating influence in viral marketing analysis overestimated the influence by as much as 700%!

There are various methods for careful causal explanation from data, and they can all be understood within a common data science framework. The point of discussing this here, toward the end of the book, is that understanding these sophisticated techniques requires a grasp of the fundamental principles presented so far. Careful causal data analysis requires an understanding of investments in acquiring data, of similarity measurements, of expected value calculations, of correlation and finding informative variables, of fitting equations to data, and more.

Chapter 11 gave a taste of this more sophisticated causal analysis when we returned to the telecommunications churn problem and asked: shouldn't we be targeting those customers whom the special offer is most likely to influence? This illustrated the key role played by the expected value framework, along with several other concepts. There are other techniques for causal understanding that use similarity matching (Chapter 6) to simulate the "counterfactual" that someone might both receive a "treatment" (e.g., an incentive to stay) and not receive the treatment. Still other causal analysis methods fit numeric functions to data and interpret the coefficients of the functions.10

The point is that we cannot understand causal data science without first understanding the fundamental principles. Causal data analysis is just one such example; the same applies to other more sophisticated methods you may encounter.
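The following sketch shows how similarity between friends alone can masquerade as influence. It is a toy simulation, not the method of Aral, Muchnik, and Sundararajan: the function `naive_influence_gap`, the uniform "taste" variable, and the way friends are paired are all assumptions for illustration. Crucially, each person's purchase depends only on his or her own taste, so the true causal influence between friends is zero by construction.

```python
import random


def naive_influence_gap(homophily, n_pairs=20000, seed=3):
    """P(person buys | friend bought) minus P(person buys | friend did not).

    Purchases depend only on each person's own taste, so any gap reported
    here cannot be caused by influence between friends.
    """
    rng = random.Random(seed)
    bought_when_friend_bought = []
    bought_when_friend_did_not = []
    for _ in range(n_pairs):
        taste = rng.random()  # person's own propensity to buy
        if homophily:
            # Friends resemble each other: the friend's taste is close to ours.
            friend_taste = min(1.0, max(0.0, taste + rng.gauss(0, 0.05)))
        else:
            # Control case: friends are paired at random.
            friend_taste = rng.random()
        friend_bought = rng.random() < friend_taste
        person_bought = rng.random() < taste  # no dependence on the friend
        if friend_bought:
            bought_when_friend_bought.append(person_bought)
        else:
            bought_when_friend_did_not.append(person_bought)

    def rate(outcomes):
        return sum(outcomes) / len(outcomes)

    return rate(bought_when_friend_bought) - rate(bought_when_friend_did_not)


if __name__ == "__main__":
    print("gap with similar friends:", round(naive_influence_gap(True), 3))
    print("gap with random friends: ", round(naive_influence_gap(False), 3))
```

When friends have similar tastes, the naive comparison should report a large gap even though nobody influences anybody; when friends are paired at random, the gap should be close to zero. A careful causal analysis has to separate these two explanations.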
Summary

There are very many specific techniques used in data science. To achieve a solid understanding of the field, it is important to step back from the specifics and think about the sorts of tasks to which the techniques are applied. In this book, we have focused on a collection of the most common tasks (finding correlations and informative attributes, finding similar data items, classification, probability estimation, regression, clustering), showing that the concepts of data science provide a firm foundation for understanding both the tasks and the methods for solving them. In this chapter, we presented several other important data science tasks and techniques, and illustrated that they too can be understood based on the foundation provided by our fundamental concepts. Specifically, we discussed:

• finding interesting co-occurrences or associations among items, such as purchases;
• profiling typical behavior, such as credit card usage or customer wait time;
• predicting links between data items, such as potential social connections between people;
• reducing our data to make it more manageable or to reveal hidden information, such as latent movie preferences;
• combining models as if they were experts with different expertise, for example to improve movie recommendations; and
• drawing causal conclusions from data, such as whether and to what extent the fact that socially connected people buy the same products is actually because they influence each other (necessary for viral campaigns), or simply because socially connected people have very similar tastes (which is well established in sociology).

A solid understanding of the basic principles helps you to understand more complex techniques as instances or combinations of them.

10. It is beyond the scope of this book to explain the conditions under which this can be given a causal interpretation. But if someone presents a regression equation to you with a causal interpretation of the equation's parameters, ask questions about exactly what the coefficients mean and why one can interpret them causally until you are satisfied. For such analyses, comprehension by decision makers is paramount; insist on comprehension of any such results.
CHAPTER 13
Data Science and Business Strategy
Fundamental concepts: Our principles as the basis of success for a data-driven business; Acquiring and sustaining competitive advantage via data science; The importance of careful curation of data science capability.
In this chapter we discuss the interaction between data science and business strategy, including a high-level perspective on choosing problems to be solved with data science. We see that the fundamental concepts of data science allow us to think clearly about strategic issues. We also show how, taken as a whole, the array of concepts is useful for thinking about tactical business decisions, such as evaluating proposals for data science projects from consultants or internal data science teams. We also discuss in detail the curation of data science capability.

Increasingly we see stories in the press about how yet another aspect of business has been addressed with a data science-based solution. As we discussed in Chapter 1, a confluence of factors has led contemporary businesses to be strikingly data rich compared to their predecessors. But the availability of data alone does not ensure successful data-driven decision making. How does a business ensure that it gets the most from its wealth of data? The answer of course is manifold, but two important factors are: (i) the firm's management must think data-analytically, and (ii) the management must create a culture where data science, and data scientists, will thrive.
Thinking Data-Analytically, Redux

Criterion (i) does not mean that the managers have to be data scientists. However, managers have to understand the fundamental principles well enough to envision and/or appreciate data science opportunities, to supply the appropriate resources to the data science teams, and to be willing to invest in data and experimentation. Furthermore, unless the firm has a seasoned, practical data scientist on its management team, the management often must steer the data science team carefully to make sure that the team stays on track toward an eventually useful business solution. This is very difficult if the managers don't really understand the principles. Managers need to be able to ask probing questions of a data scientist, who often can get lost in technical details. We need to accept that each of us has strengths and weaknesses, and since data science projects span so much of a business, a diverse team is essential. Just as we can't expect a manager necessarily to have deep expertise in data science, we can't expect a data scientist necessarily to have deep expertise in business solutions. However, an effective data science team involves collaboration between the two, and each needs to have some understanding of the fundamentals of the other's area of responsibility. Just as it would be a Sisyphean task to manage a data science team that had no understanding of the fundamental concepts of business, it likewise is extremely frustrating at best, and often a tremendous waste, for data scientists to struggle under a management that does not understand the basic principles of data science.

For example, it is not uncommon for data scientists to struggle under a management that (sometimes vaguely) sees the potential benefit of predictive modeling, but does not have enough appreciation for the process to invest in proper training data or in proper evaluation procedures. Such a company may "succeed" in engineering a model that is predictive enough to produce a viable product or service, but it will be at a severe disadvantage to a competitor who invests in doing the data science well.

A solid grounding in the fundamentals of data science has much more far-reaching strategic implications. We know of no systematic scientific study, but broad experience has shown that as executives, managers, and investors increase their exposure to data science projects, they see more and more opportunities in turn. We see extreme cases in companies like Google and Amazon (there is a vast amount of data science underlying web search, as well as Amazon's product recommendations and other offerings). Both of these companies eventually built subsequent products offering "big data" and data science-related services to other businesses. Many, possibly most, data science-oriented startups use Amazon's cloud storage and processing services for some tasks. Google's "Prediction API" is increasing in sophistication and utility (we don't know how broadly used it is).

Those are extreme cases, but the basic pattern is seen in almost every data-rich firm. Once the data science capability has been developed for one application, other applications throughout the business become obvious. Louis Pasteur famously wrote, "Fortune favors the prepared mind." Modern thinking on creativity focuses on the juxtaposition of a new way of thinking with a mind "saturated" with a particular problem. Working through case studies (either in theory or in practice) of data science applications helps prime the mind to see opportunities and connections to new problems that could benefit from data science.

For example, in the late 1980s and early 1990s, one of the largest phone companies had applied predictive modeling, using the techniques we have described in this book, to the problem of reducing the cost of repairing problems in the telephone network and to the design of speech recognition systems. With the increased understanding of the use of data science for helping to solve business problems, the firm subsequently applied similar ideas to decisions about how to allocate a massive capital investment to best improve its network, and how to reduce fraud in its burgeoning wireless business. The progression continued. Data science projects for reducing fraud discovered that incorporating features based on social-network connections (via who-calls-whom data) into fraud prediction models improved the ability to discover fraud substantially. In the early 2000s, telecommunications firms produced the first solutions using such social connections to improve marketing, and improve marketing it did, showing huge performance lifts over traditional targeted marketing based on socio-demographic, geographic, and prior purchase data. Next, in telecommunications, such social features were added to models for churn prediction, with equally beneficial results. The ideas diffused to the online advertising industry, and there was a subsequent flurry of development of online advertising based on the incorporation of data on online social connections (at Facebook and at other firms in the online advertising ecosystem).

This progression was driven both by experienced data scientists moving among business problems and by data science-savvy managers and entrepreneurs, who saw new opportunities for data science advances in the academic and business literature.
Achieving Competitive Advantage with Data Science

Increasingly, firms are considering whether and how they can obtain competitive advantage from their data and/or from their data science capability. This is important strategic thinking that should not be superficial, so let's spend some time digging into it.

Data and data science capability are (complementary) strategic assets. Under what conditions can a firm achieve competitive advantage from such an asset? First of all, the asset has to be valuable to the firm. This seems obvious, but note that the value of an asset to a firm depends on the other strategic decisions that the firm has made. Outside of the context of data science, in the personal computer industry in the 1990s, Dell famously gained a substantial early competitive advantage over industry leader Compaq by using web-based systems that allowed customers to configure computers to their personal needs and liking. Compaq could not get the same value from web-based systems. One main reason was that Dell and Compaq had implemented different strategies: Dell already was a direct-to-customer computer retailer, selling via catalogs; web-based systems held tremendous value given this strategy. Compaq sold computers mainly via retail outlets; web-based systems were not nearly as valuable given this alternative strategy. When Compaq tried to replicate Dell's web-based strategy, it faced a severe backlash from its retailers. The upshot is that the value of the new asset (web-based systems) depended on each company's other strategic decisions.
The lesson is that we need to think carefully in the business understanding phase about how data and data science can provide value in the context of our business strategy, and also whether they would do the same in the context of our competitors' strategies. This can identify both possible opportunities and possible threats. A direct data science analogy of the Dell-Compaq example is Amazon versus Borders. Even very early, Amazon's data on customers' book purchases allowed personalized recommendations to be delivered to customers while they were shopping online. Even if Borders were able to exploit its data on who bought what books, its brick-and-mortar retail strategy did not allow the same seamless delivery of data science-based recommendations.

So, a prerequisite for competitive advantage is that the asset be valuable in the context of our strategy. We have already begun to talk about the second set of criteria: in order to gain competitive advantage, competitors either must not possess the asset, or must not be able to obtain the same value from it. We should think both about the data asset(s) and the data science capability. Do we have a unique data asset? If not, do we have an asset the utilization of which is better aligned with our strategy than with the strategy of our competitors? Or are we better able to take advantage of the data asset due to our better data science capability?

The flip side of asking about achieving competitive advantage with data and data science is asking whether we are at a competitive disadvantage. It may be that the answers to the previous questions are affirmative for our competitors and not for us. In what follows we will assume that we are looking to achieve competitive advantage, but the arguments apply symmetrically if we are trying to achieve parity with a data-savvy competitor.
Sustaining Competitive Advantage with Data Science

The next question is: even if we can achieve competitive advantage, can we sustain it? If our competitors can easily duplicate our assets and capabilities, our advantage may be short-lived. This is an especially critical question if our competitors have greater resources than we do: by adopting our strategy, they may surpass us.

One strategy for competing based on data science is to plan to always keep one step ahead of the competition: always be investing in new data assets, and always be developing new techniques and capabilities. Such a strategy can make for an exciting and possibly fast-growing business, but generally few companies are able to execute it. For example, you must have confidence that you have one of the best data science teams, since the effectiveness of data scientists has a huge variance, with the best being much more talented than the average. If you have a great team, you may be willing to bet that you can keep ahead of the competition. We will discuss data science teams more below.

The alternative to always keeping one step ahead of the competition is to achieve sustainable competitive advantage due to a competitor's inability to replicate, or their elevated expense of replicating, the data asset or the data science capability. There are several avenues to such sustainability.
Formidable Historical Advantage

Historical circumstances may have placed our firm in an advantageous position, and it may be too costly for competitors to reach the same position. Amazon again provides an outstanding example. In the "Dotcom Boom" of the 1990s, Amazon was able to sell books below cost, and investors continued to reward the company. This allowed Amazon to amass tremendous data assets (such as massive data on online consumers' buying preferences and online product reviews), which then allowed it to create valuable data-based products (such as recommendations and product ratings). These historical circumstances are gone: it is unlikely today that investors would provide the same level of support to a competitor that was trying to replicate Amazon's data asset by selling books below cost for years on end (not to mention that Amazon has moved far beyond books).

This example also illustrates how the data products themselves can increase the cost to competitors of replicating the data asset. Consumers value the data-driven recommendations and product reviews/ratings that Amazon provides. This creates switching costs: competitors would have to provide extra value to Amazon's customers to entice them to shop elsewhere, either with lower prices or with some other valuable product or service that Amazon does not provide. Thus, when the data acquisition is tied directly to the value provided by the data, the resulting virtuous cycle creates a catch-22 for competitors: they need customers in order to acquire the necessary data, but they need the data in order to provide equivalent services to attract the customers.

Entrepreneurs and investors might turn this strategic consideration around: what historical circumstances exist now that may not continue indefinitely, and that may allow me to gain access to or build a data asset more cheaply than will be possible in the future? Or what will allow me to build a data science team that would be more costly (or impossible) to build in the future?
Unique Intellectual Property

Our firm may have unique intellectual property. Data science intellectual property can include novel techniques for mining the data or for using the results. These might be patented, or they might just be trade secrets. In the former case, a competitor either will be unable to (legally) duplicate the solution, or will face an increased expense of doing so, either by licensing our technology or by developing new technology to avoid infringing on the patent. In the case of a trade secret, the competitor may simply not know how we have implemented our solution. With data science solutions, the actual mechanism is often hidden, with only the result being visible.
Unique Intangible Collateral Assets

Our competitors may not be able to figure out how to put our solution into practice. With successful data science solutions, the actual source of good performance (for example, with effective predictive modeling) may be unclear. The effectiveness of a predictive modeling solution may depend critically on the problem engineering, the attributes created, the combining of different models, and so on. It often is not clear to a competitor how performance is achieved in practice. Even if our algorithms are published in detail, many implementation details may be critical to getting a solution that works in the lab to work in production.

Furthermore, success may be based on intangible assets, such as a company culture that is particularly suitable to the deployment of data science solutions. For example, a culture that embraces business experimentation and the (rigorous) supporting of claims with data will naturally be an easier place for data science solutions to succeed. Alternatively, if developers are encouraged to understand data science, they are less likely to screw up an otherwise top-quality solution. Recall our maxim: Your model is not what your data scientists design, it's what your engineers implement.
Senior Data Scientists Maybe our data scientists are much better than our competitors. There is huge variation in the quality and abilities of data scientists. Soft among well-trained data scientists, itp will be well-received in the product learning community, as some individuals possess a combination of innate creativity, analytical wisdom, business sense, and tenacity that enables them to create remarkably better solutions than ihr colleagues. This extreme skills gap is illustrated by the annual results in the KDD Cup data mining competition. At any given time, the leading professional society for natural products, ACM SIGKDD, holds its annual conference (the ACM SIGKDD International Conference on Knowledge Dissemination and Data Mining). Every year, the conference organizes a data mining competition. Some data scientists love to compete and there are many competitions. The Netflix contest, discussed in Chapter 12, is one of the most famous, and such contests also target a crowdsourcing business (see Kaggle). The One KDD Cup lives on as the granddaddy of data mining trophies and has also been held every year since 1997. Why is this relevant? Some of the best data scientists in the world participate in these competitions. Depending on the year and the task, hundreds or thousands of competitors try to solve the problem. Since data science talent is evenly distributed, one would think that computing would rarely see the same people win competitions repeatedly. But that's a lot of what we see. There are people who have won teams repeatedly, sometimes several years in a row and for multiple roles each year (sometimes the competition is longer
320
|
Chapter 13: Data Science and Business Strategy
Μια εργασία).1 Είναι σημαντικό ότι υπάρχει σημαντική διαφοροποίηση στην ικανότητα ακόμη και των καλύτερων επιστημόνων δεδομένων, και αυτό φαίνεται από τα «αντικειμενικά» αποτελέσματα των αγώνων KDD Cup. Το αποτέλεσμα είναι ότι, λόγω της μεγάλης ποικιλίας στην ικανότητα, οι ιδανικοί επιστήμονες δεδομένων μπορούν να επιλέξουν και να επιλέξουν ευκαιρίες εργασίας που ικανοποιούν τις επιθυμίες τους σε σχέση με τον μισθό, την κουλτούρα, τον χρόνο προόδου κ.λπ. Αυτή η απόκλιση στην ποιότητα των επιστημόνων δεδομένων ενισχύεται από το απλό γεγονός ότι η κορυφαία επιστήμη δεδομένων έχει μεγάλη ζήτηση. Οποιοσδήποτε μπορεί να αυτοαποκαλείται επιστήμονας δεδομένων, οι πραγματικές μικρές επιχειρήσεις μπορούν πραγματικά να αξιολογήσουν τον επιστήμονα δεδομένων καθώς και να προσλάβουν ενδεχομένως. Αυτό ελέγχει τα άλλα αλιεύματα: απαιτούν τουλάχιστον έναν κορυφαίο επιστήμονα νοημοσύνης για να αξιολογήσει πραγματικά την ποιότητα της μελλοντικής εργασίας. Έτσι, εάν οι πόροι της εταιρείας μας έχουν καταφέρει να δημιουργήσουν μια ισχυρή ικανότητα επιστήμης δεδομένων, έχουμε ένα σημαντικό και διαρκές πλεονέκτημα έναντι των ανταγωνιστών που αντιμετωπίζουν προβλήματα με την πρόσληψη επιστημόνων δεδομένων. Επίσης, οι κορυφαίοι επιστήμονες δεδομένων απολαμβάνουν τη συνεργασία με άλλους κορυφαίους επιστήμονες δεδομένων, γεγονός που αυξάνει το πλεονέκτημά μας. Πρέπει επίσης να αποδεχθούμε την έκκληση ότι η επιστήμη των δεδομένων είναι εν μέρει μια τέχνη. Η αναλυτική εμπειρία χρειάζεται ζεύγος για να αποκτήσει κανείς, και όλα τα υπέροχα βιβλία και οι βιντεοδιαλέξεις από μόνα τους δεν θα μετατρέψουν κάποιον σε δάσκαλο. Η τέχνη μαθαίνεται από την εμπειρία. Η μαζική αποτελεσματική μαθησιακή διαδρομή μοιάζει με αυτή της κλασικής επιχείρησης: οι επίδοξοι επιστήμονες δεδομένων εργάζονται ως μαθητευόμενοι στους πλοιάρχους. 
Αυτό θα μπορούσε να είναι ένα μεταπτυχιακό πρόγραμμα αδενίνης με έναν καθηγητή αιχμής προσανατολισμένο στις εφαρμογές, σε μεταδιδακτορικό πρόγραμμα ή στη βιομηχανία που συνεργάστηκε με έναν από τους καλύτερους επιστήμονες βιομηχανικών τεκμηρίων. Σε κάποιο σημείο, ο μαθητευόμενος έχει τα κατάλληλα προσόντα για να γίνει «δημοσιογράφος» και στη συνέχεια θα εργαστεί πιο ανεξάρτητα σε μια ομάδα ή ακόμη και θα εκτελέσει τα δικά του έργα. Αμέτρητοι επιστήμονες δεδομένων υψηλής ποιότητας εργάζονται με χαρά σε αυτό το στέλεχος για την καριέρα τους. Μερικά μικρά υποσύνολα γίνονται κύριοι, λόγω του συνδυασμού του ταλέντου τους να βλέπουν τις δυνατότητες των νέων ευκαιριών της φυσικής δεδομένων (περισσότερα σε μια στιγμή) και της κυριαρχίας τους στη θεωρία και την τεχνική. Μερικοί από αυτούς στη συνέχεια αναλαμβάνουν μαθητευόμενους. Η κατανόηση της διαδρομής μάθησης μπορεί ακόμη και να βοηθήσει στην εστίαση των προσπαθειών πρόσληψης αναζητώντας ερευνητές δεδομένων που έχουν μάθει από καπετάνιους υψηλού επιπέδου. Μπορεί επίσης να χρησιμοποιηθεί τακτικά με λιγότερο εμφανή τρόπο: εάν μπορείτε να προσλάβετε έναν κύριο επιστήμονα δεδομένων, οι επίδοξοι επιστήμονες δεδομένων υψηλού επιπέδου μπορούν να γίνουν οι μαθητευόμενοι μας. Εκτός από την επιλογή, ένας κορυφαίος επιστήμονας δεδομένων πρέπει να έχει ένα ισχυρό και ικανό δίκτυο. Δεν εννοούμε ένα δίκτυο με την έννοια ότι κάποιος μπορεί να το βρει σε ένα επαγγελματικό σύστημα τεχνολογίας στο διαδίκτυο. Οι απαιτήσεις ενός αποτελεσματικού επιστήμονα γνωριμιών χρειάζονται βαθιές συνδέσεις με άλλες φυσικές νοημοσύνη σε αυτή την επιστημονική κοινότητα λεπτομερειών. Ο λόγος είναι απλώς ότι ο τομέας της επιστήμης δεδομένων είναι τεράστιος και υπάρχουν πάρα πολλά διαφορετικά θέματα για να τα κατακτήσει κάθε άτομο. Ένας κορυφαίος επιστήμονας δεδομένων είναι κύριος σε κάποιους τομείς τεχνικής εξειδίκευσης και αξιόπιστος σε πολλούς άλλους. (Προσοχή στο "jack of all trades, master of none".) 
However, we do not want the data scientist's mastery of some area of technical expertise to turn into the proverbial hammer for which all problems are nails. A top-notch data scientist will pull in the expertise the problem at hand requires. This is facilitated tremendously by strong and deep professional contacts. Data scientists call on each other to help steer them to the right solutions. The better the professional network, the better the solution. And the best data scientists have the best connections.
1. ... best data miners in the world. Many top-notch data scientists have never competed in such a competition; some compete once and then focus their efforts on other things.
Superior Data Science Management
Possibly even more critical to the success of data science in business is having good management of the data science team. Good data science managers are especially hard to find. They need to understand the fundamentals of data science well, possibly even being competent data scientists themselves. Good data science managers must also possess a set of other abilities that are rare in a single individual:
• They need to truly understand and appreciate the needs of the business. What's more, they should be able to anticipate the needs of the business, so that they can interact with their counterparts in other functional areas to develop ideas for new data science products and services.
• They need to be able to communicate well with, and be respected by, both "techies" and "suits". Often this means translating data science jargon (which we have tried to minimize in this book) into business jargon and vice versa.
• They need to coordinate technically complex activities, such as the integration of multiple models or procedures with business constraints and costs. They often need to understand the technical architectures of the business, such as the data systems or production software systems, to ensure that the solutions the team produces are actually useful in practice.
• They need to be able to anticipate the outcomes of data science projects. As we have discussed, data science is more similar to R&D than to any other business activity. Whether a particular data science project will produce positive results is highly uncertain at the outset, and possibly even well into the project. Elsewhere we discussed the importance of producing proof-of-concept studies quickly, but neither positive nor negative outcomes of such studies are highly predictive of the success or failure of the larger project. They just give guidance to investments in the next cycle of the data mining process (recall Chapter 2).
If we look to R&D management for clues about data science management, we find only one reliable predictor of the success of a research project, and it is highly predictive: the prior success of the investigator. We see a similar situation with data science projects. There are individuals who seem to have an intuitive sense of which projects will pay off. We do not know of a careful analysis of why this is the case, but experience shows that it is. As with data science competitions, where we see remarkable repeat performances by the same individuals, we also see individuals repeatedly envisioning new data science opportunities and managing them to great success. This is particularly impressive, since many data science managers never see even one project through to great success.
• They need to do all of the above within the culture of a particular firm.
Finally, our data science capability may be difficult or expensive for a competitor to duplicate because we can hire data scientists and data science managers better. This may be due to our reputation and brand appeal with data scientists: a data scientist may prefer to work for a company known to be friendly to data science and data scientists. Or our firm may have a more subtle appeal. So let's examine in a little more detail what it takes to attract top-notch data scientists.
Attracting and Nurturing Data Scientists and Their Teams
At the beginning of the chapter, we noted that the two most important factors in ensuring that our firm gets the most from its data assets are: (i) the firm's management must think data-analytically, and (ii) the firm's management must create a culture where data science, and data scientists, will thrive. As we mentioned above, there can be a huge difference between the effectiveness of a great data scientist and an average data scientist, and between a great data science team and an individually great data scientist. But how can one confidently engage top-notch data scientists? How can we create great teams? This is a very difficult question to answer in practice. At the time of this writing, the supply of top-notch data scientists is quite thin, resulting in a very competitive market for them. The best companies at hiring data scientists are the IBMs, Microsofts, and Googles of the world, which clearly demonstrate the value they place on data science via compensation, perks, and/or intangibles, such as one particular factor not to be taken lightly: data scientists like to be around other top-notch data scientists. One might argue that they need to be around other top-notch data scientists, not only to enjoy their day-to-day work, but also because the field is vast and the collective mind of a group of data scientists can bring to bear a much broader array of solution techniques.
However, just because the market is difficult does not mean that all is lost. Many data scientists want more individual influence than they would have at a corporate behemoth. Many want more responsibility (and the concomitant experience) with the broader process of producing a data science solution. Some have visions of becoming chief scientist of a firm, and understand that the path to chief scientist may be better paved with projects at smaller and more varied firms.
Some have visions of becoming entrepreneurs, and understand that being an early data scientist at a startup can give them invaluable experience. And some simply enjoy the thrill of taking part in a fast-growing venture: working at a company growing 20% or 50% a year is very different from working at a company growing 5% or 10% a year (or not growing at all).
In all these cases, the firms that have an advantage in hiring are those that create an environment for nurturing data science and data scientists. If you do not have a critical mass of data scientists, be creative. Encourage your data scientists to become part of local data science technical communities and global data science academic communities.
Note on publication
Science is a social endeavor, and the best data scientists often want to stay engaged in the community by publishing their advances. Firms sometimes have trouble with this idea, feeling that they are "giving away the store" or tipping their hand to competitors by revealing what they are doing. On the other hand, if they do not allow it, they may not be able to hire or retain the very best. Publishing also has some advantages for the firm, such as increased publicity, exposure, and external validation of ideas. There is no clear-cut answer, but the issue needs to be considered carefully. Some firms file patents aggressively on their data science ideas, after which academic publication is natural if the idea is truly novel and important.
A firm's data science presence can be bolstered by engaging academic data scientists. There are several ways to do this. For those academics interested in practical applications of their work, it may be possible to fund their research programs. Both of your authors, when working in industry, funded academic programs and essentially extended the data science team that was focusing on, and interacting over, their problems. The best arrangement (in our experience) is a combination of data, money, and an interesting business problem; if the project ends up being part of the Ph.D. thesis of a student in a top-notch program, the benefit to the firm can far outweigh the cost. Funding a Ph.D. student might cost a firm in the ballpark of $50,000/year, which is a fraction of the fully loaded cost of a top data scientist. A key is to have enough understanding of data science to select the right professor: one with the appropriate expertise for the problem at hand.
Another tactic that can be very cost-effective is to take on one or more top-notch data scientists as scientific advisors. If the relationship is structured so that the advisors truly interact on the solutions to problems, firms that do not have the resources or the clout to hire the very best data scientists can substantially increase the quality of the eventual solutions. Such advisors can be data scientists at partner firms, data scientists from firms that share investors or board members, or academics who have some consulting time.
A different tack altogether is to hire a third party to conduct the data science. There are various third-party data science providers, ranging from massive firms specializing in business analytics (such as IBM), to data-science-specific consulting firms (such as Elder Research), to boutique data science firms that take on a very small number of clients to help them develop their data science capabilities (such as Data Scientists, LLC).2 A list of data science service companies can be found at KDnuggets. A caveat about engaging data science consulting firms is that their interests are not always well aligned with their customers' interests. This is obvious to seasoned users of consultants, but not to everyone.
Savvy managers employ all of these resources tactically. A chief scientist or an empowered manager can often assemble a substantially more powerful and diverse team for a project than most companies can hire.
Examine Data Science Case Studies
Beyond building a solid data science team, how can a manager ensure that her firm is best positioned to take advantage of opportunities for applying data science? Make sure that there is an understanding of, and appreciation for, the fundamental principles of data science. Empowered employees across the firm often envision novel applications.
After gaining command of the fundamental principles of data science, the best way to position oneself for success is to work through many examples of the application of data science to business problems. Read case studies that actually walk through the data mining process. Formulate your own case studies. Actually mining data is helpful, but even more important is working through the connection between the business problem and the possible data science solutions. The more different problems you work through, the better you will be at naturally seeing and capitalizing on opportunities to bring to bear the information and knowledge "stored" in the data. Often the same problem formulation from one problem can be applied by analogy to another, with only minor changes.
It is important to keep in mind that the examples we have presented in this book were chosen or designed for illustration. In reality, the business and data science team should be prepared for all manner of mess and constraints, and must be flexible in dealing with them. Sometimes there is a wealth of data and data science techniques available to be brought to bear. At other times, the situation seems more like the critical scene from the movie Apollo 13. In the movie, a malfunction and explosion in the command module leave the astronauts stranded a quarter of a million miles from Earth, with CO2 levels rising too rapidly for them to survive the return trip.
In a nutshell, because of the constraints imposed by what the astronauts have on hand, the engineers have to figure out how to use a large cubic filter in place of a narrower cylindrical filter (to literally put a square peg in a round hole). In the key scene, the head engineer dumps all the "stuff" from the command module onto a table and says to his team, "Okay, people ... using nothing but this."
Real data science problems often look more like the Apollo 13 situation than a textbook situation. For example, Perlich et al. (2013) describe a study of just such a case. For targeting consumers with online display advertising, obtaining an adequate supply of the ideal training data would have been prohibitively expensive. However, data were available at much lower cost from various other distributions and for other target variables. Their very effective solution cobbled together models built from these surrogate data and "transferred" those models for use on the desired task. Using these surrogate data allowed them to operate with a substantially reduced investment in data from the ideal (and expensive) training distribution.
2. Disclaimer: The authors have a relationship with Data Scientists, LLC.
Be Ready to Accept Creative Ideas from Any Source
Once different role players understand the fundamental principles of data science, creative ideas for new solutions can come from any direction: from executives examining potential new lines of business, from directors with profit-and-loss responsibility, from managers looking critically at a business process, and from line employees with detailed knowledge of exactly how a particular business process functions. Data scientists should be encouraged to interact with employees throughout the business, and part of their performance evaluation should be based on how well they produce ideas for improving the business with data science. Incidentally, doing so can pay off in unintended ways: the data processing skills that data scientists possess can often be applied in ways that are not so sophisticated but that nevertheless help other employees who lack those skills. Often a manager may have no idea that particular data can even be obtained, data that might help the manager directly, without sophisticated data science.
Be Ready to Evaluate Proposals for Data Science Projects
Ideas for improving business decisions through data science can come from any direction. Managers, investors, and employees should be able to formulate such ideas clearly, and decision makers should be prepared to evaluate them. Essentially, we need to be able both to formulate solid proposals and to evaluate proposals. The data mining process, described in Chapter 2, provides a framework to guide this. Each stage of the process reveals questions that should be asked both in formulating proposals for projects and in evaluating them:
• Is the business problem well specified? Does the data science solution solve the problem?
• Is it clear how we would evaluate a solution?
• Would we be able to see evidence of success before making a huge investment in deployment?
• Does the firm have the data assets it needs? For example, for supervised modeling, are there actually labeled training data? Is the firm ready to invest in the assets it does not yet have?
Appendix A provides a starting list of questions for evaluating data science proposals, organized by the data mining process. Let's walk through an illustrative example. (In Appendix B you will find another example proposal to evaluate, focusing on our running churn problem.)
Example Data Mining Proposal
Your company has an installed user base of 900,000 current users of its Whiz-bang® widget. You have now developed Whiz-bang® 2.0, which has substantially lower operating costs than the original. Ideally, you would like to convert ("migrate") your entire user base to version 2.0. However, using 2.0 requires users to master the new interface, and there is a serious risk that in attempting to do so, customers will become frustrated and not convert, become less satisfied with the company, or, in the worst case, switch to your competitor's popular Boppo® widget. Marketing has designed a brand-new migration incentive plan, which will cost $250 per selected customer. There is no guarantee that a customer will choose to migrate even if she accepts this incentive.
An external firm, Big Red Consulting, is proposing a plan to target customers carefully for Whiz-bang® 2.0, and given your demonstrated fluency with the fundamentals of data science, you are called in to help assess Big Red's proposal. Do Big Red's choices seem correct?
Targeted Whiz-bang Customer Migration, prepared by Big Red Consulting, Inc.
We will develop a predictive model using modern data mining technology. As discussed in our last meeting, we assume a budget of $5,000,000 for this phase of customer migration; adjusting the plan for other budgets is straightforward. Thus we can target 20,000 customers under this budget. Here is how we will select those customers: we will use data to build a model of whether or not a customer will migrate given the incentive. The dataset will comprise a set of attributes for each customer, such as the number and type of prior customer service interactions, level of usage of the widget, location of the customer, estimated technical sophistication, tenure with the firm, and other loyalty indicators, such as the number of other firm products and services in use.
The target will be whether or not the customer will migrate to the new widget if given the incentive. Using these data, we will build a linear regression to estimate the target variable. The model will be evaluated based on its accuracy on these data; in particular, we want to ensure that the accuracy is substantially greater than if we targeted randomly. To use the model: for each customer we will apply the regression model to estimate the target variable. If the estimate is greater than 0.5, we will predict that the customer will migrate; otherwise, we will say that the customer will not migrate. We will then select at random 20,000 customers from those predicted to migrate, and these 20,000 will be the recommended targets.
Flaws in the Big Red Proposal
We can use our understanding of the fundamental principles and other key concepts of data science to identify flaws in the proposal. Appendix A provides a starting guide for reviewing such proposals, with some of the main questions to ask. However, this book as a whole can really be seen as a proposal reviewing guide. Here are some of the most egregious flaws in Big Red's proposal:
Business Understanding
• The definition of the target variable is imprecise. For example, over what time period must the migration occur? (Chapter 3)
• The formulation of the data mining problem could be better aligned with the business problem. For example, what if some customers (or all of them) were likely to migrate anyway (without the incentive)? Then we would be wasting the cost of the incentive by targeting them. (Chapter 2, Chapter 11)
Data Understanding/Data Preparation
• There are no labeled training data! This is a brand-new incentive. We should invest some of our budget in obtaining labels for some examples. This can be done by targeting a (randomly) selected subset of customers with the incentive. One might also propose a more sophisticated approach. (Chapter 2, Chapter 3, Chapter 11)
• If we are worried about wasting the incentive on customers who are likely to migrate without it, we should also observe a "control group" over the period in which we obtain training data. This should be easy, since everyone we do not target to gather labels would be a "control" subject. We can build a separate model for whether a customer migrates given no incentive, and combine the two models in an expected value framework. (Chapter 11)
Modeling
• Linear regression is not a good choice for modeling a categorical target variable. Instead, one should use a classification method such as tree induction, logistic regression, k-NN, and so on. Even better, why not try several methods and evaluate them experimentally to see which performs best? (Chapter 2, Chapter 3, Chapter 4, Chapter 5, Chapter 6, Chapter 7, Chapter 8)
Evaluation
• The evaluation should not be on the training data. Some sort of holdout approach should be used (e.g., cross-validation and/or a staged approach as discussed above). (Chapter 5)
• Will there be any domain-knowledge validation of the model? What if it is capturing some quirk of the data collection process? (Chapter 7, Chapter 11, Chapter 14)
Deployment
• The idea of randomly selecting customers with regression scores greater than 0.5 is not well considered. First, it is not clear that a regression score of 0.5 really corresponds to a migration probability of 0.5. Second, 0.5 is rather arbitrary in any case. Third, since our model provides a ranking (e.g., by likelihood of migration, or by expected value if we use the more complex formulation), we should use the ranking to guide our targeting: choose the top-ranked candidates, as many as the budget will allow. (Chapter 2, Chapter 3, Chapter 7, Chapter 8, Chapter 11)
Of course, this is just one example with a particular set of flaws. A different set of concepts may need to be brought to bear for a different proposal that is flawed in other ways.
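The deployment critique can be made concrete with a small sketch. The budget and per-customer incentive cost come from the proposal; everything else is an assumption made for illustration: the `Customer` class, the function names, and the two probability estimates, which stand in for the outputs of one model trained on incentivized customers and one trained on the control group. This is one plausible reading of "rank by expected value," not a prescribed implementation.

```python
from dataclasses import dataclass

INCENTIVE_COST = 250     # dollars per targeted customer (from the proposal)
BUDGET = 5_000_000       # dollars for this migration phase (from the proposal)


@dataclass
class Customer:
    customer_id: str
    # Hypothetical model outputs; in practice these would come from two
    # classifiers, one fit on incentivized customers and one on controls.
    p_migrate_with_incentive: float
    p_migrate_without_incentive: float


def expected_uplift(customer: Customer) -> float:
    """Estimated increase in migration probability due to the incentive."""
    return (customer.p_migrate_with_incentive
            - customer.p_migrate_without_incentive)


def select_targets(customers, budget=BUDGET, cost=INCENTIVE_COST):
    """Rank by expected uplift and take as many as the budget allows."""
    max_targets = budget // cost
    ranked = sorted(customers, key=expected_uplift, reverse=True)
    return ranked[:max_targets]


if __name__ == "__main__":
    # The proposal's arithmetic: how many customers the budget covers.
    print(BUDGET // INCENTIVE_COST)

    # Invented example customers, with a tiny budget for readability.
    sample = [
        Customer("a", 0.90, 0.85),  # likely to migrate anyway
        Customer("b", 0.60, 0.20),  # incentive makes a large difference
        Customer("c", 0.40, 0.10),
        Customer("d", 0.55, 0.50),
    ]
    chosen = select_targets(sample, budget=500, cost=250)
    print([c.customer_id for c in chosen])
```

Note that customer "a" has the highest raw migration probability but is not selected, because the incentive changes little for that customer. That is exactly the waste the business-understanding critique warns about.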
A Firm's Data Science Maturity
For a firm to plan data science endeavors realistically, it should assess, frankly and rationally, its own maturity in terms of data science capability. It is beyond the scope of this book to provide a self-assessment guide, but a few words on the topic are important. Firms vary widely in their data science capabilities along many dimensions. One dimension that is very important for strategic planning is the firm's "maturity": specifically, how systematic and well founded are the processes used to guide the firm's data science projects.3
At one end of the maturity spectrum, a firm's data science processes are completely ad hoc. In many firms, the employees engaged in data science and business analytics endeavors have no formal training in these areas, and the managers involved have little understanding of the fundamental principles of data science and data-analytic thinking.
3. The reader interested in this notion of the maturity of a firm's capabilities is encouraged to read about the Capability Maturity Model for software engineering, which is the inspiration for this discussion.
A note on "immature" companies
Being "immature" does not mean that a firm is destined to fail. It means that success is highly variable and depends much more on luck than it does in a mature firm. Project success will depend on the heroic efforts of individuals who happen to have a natural acuity for data-analytic thinking. An immature firm may implement not-so-sophisticated data science solutions on a large scale, or may implement sophisticated solutions on a small scale. Rarely, though, will an immature firm implement sophisticated solutions on a large scale.
A firm with a medium level of maturity employs well-trained data scientists, as well as business managers and other stakeholders who understand the fundamental principles of data science. Both sides can think clearly about how to solve business problems with data science, and both sides participate in the design and implementation of solutions that directly address the problems of the business.
At the high end of maturity are firms that continually work to improve their data science processes (and not just the solutions). Executives at such firms continually challenge the data science team to instill processes that better align their solutions with the business problems. At the same time, they realize that pragmatic trade-offs may favor choosing a suboptimal solution that can be implemented today over a theoretically much better solution that won't be ready until next year. Data scientists at such a firm should have confidence that when they propose investments to improve data science processes, their suggestions will be met with open and informed minds. This does not mean that every such request will be approved, but that each proposal will be evaluated on its own merits in the context of the business.
Note: Data science is neither operations nor engineering.
There is some danger in making an analogy to the Capability Maturity Model from software engineering: the danger that the analogy will be taken too literally. Trying to apply the same sort of processes that work for software engineering, or worse, for manufacturing or operations, will fail for data science. Moreover, misguided attempts to do so will send a firm's best data scientists out the door before management even knows what happened. The key is to understand data science processes and how to do data science well, and to work to establish consistency and support. Remember that data science is more like R&D than like engineering or manufacturing. As a concrete example, management should consistently make available the resources needed for solid evaluation of data science projects, early and often. Sometimes this involves investing in data that would not otherwise have been obtained. Often this involves assigning engineering resources to support the data science team. The data science team should, in turn, work to provide management with evaluations that are as well aligned with the actual business problem(s) as possible.
As a concrete example, consider yet again our telecom churn problem and how firms of varying maturity might address it:
• An immature firm will have (hopefully) analytically adept employees implementing ad hoc solutions based on their intuitions about how to manage churn. These may or may not work well. In an immature firm, it will be difficult for management to evaluate these choices against alternatives, or to determine when they have implemented a nearly optimal solution.
• A firm of medium maturity will have implemented a well-defined framework for testing different alternative solutions. It will test under conditions that mimic the actual business setting as closely as possible, for example by running the latest production data through a testbed platform that compares how different methods "would have done," taking careful account of the costs and benefits involved.
• A very mature organization may have deployed exactly the same methods as the medium-maturity firm for identifying the customers with the highest probability of leaving, or even the highest expected loss if they were to churn. It would also be working to implement the processes, and gather the data, necessary to judge the effect of the incentives as well, and thereby work toward finding those individuals for whom the incentives will produce the largest expected increase in value (over not giving the incentive). Such a firm may also be working to integrate such a procedure into an experimentation and/or optimization framework for assessing different offers or different parameters (such as the level of discount) of a given offer.
A frank self-assessment of data science maturity is difficult, but it is essential for getting the best out of one's current capabilities and for improving them.
CHAPTER 14
Conclusion
If you can't explain it simply, you don't understand it well enough. -Albert Einstein
The practice of data science can best be described as a combination of analytical engineering and exploration. The business presents a problem we would like to solve. Rarely is the business problem directly one of our basic data mining tasks. We decompose the problem into subtasks that we think we can solve, usually starting with existing tools. For some of these tasks we may not know how well we can solve them, so we have to mine the data and conduct an evaluation to see. If that does not succeed, we may need to try something completely different. In the process, we may discover knowledge that will help us solve the problem we set out to solve, or we may discover something unexpected that leads us to other important successes.
Neither the analytical engineering nor the exploration should be omitted when considering the application of data science methods to solve a business problem. Omitting the engineering aspect usually makes it much less likely that the results of mining data will actually solve the business problem. Failing to understand the process as one of exploration and discovery often keeps an organization from putting in place the right management, incentives, and investments for the project to succeed.
The Fundamental Concepts of Data Science
Both the analytical engineering and the exploration and discovery are made more systematic, and thereby more likely to succeed, by understanding and embracing the fundamental concepts of data science. In this book, we have introduced a collection of the most important fundamental concepts. Some of these concepts we made into headliners for the chapters, and others were introduced more naturally through the discussion (and not necessarily labeled as fundamental concepts). These concepts span the process, from envisioning how data science can improve business decisions, to applying data science techniques, to deploying the results to improve decision making. The concepts also undergird a large array of business analytics.
We can group our fundamental concepts roughly into three types:
1. General concepts about how data science fits into the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams, ways of thinking about how data science leads to competitive advantage, ways in which competitive advantage can be sustained, and tactical principles for doing well with data science projects.
2. General ways of thinking data-analytically, which help us gather appropriate data and consider appropriate methods. The concepts include the data mining process and the collection of different high-level data science tasks, as well as principles such as the following.
• The data science team should keep in mind the problem to be solved and the use scenario throughout the data mining process
• Data should be considered an asset, and therefore we should think carefully about what investments we should make to get the best leverage from that asset
• The expected value framework can help us structure business problems so that we can see the component data mining problems as well as the connective tissue of costs, benefits, and constraints imposed by the business environment
• Generalization and overfitting: if we look too hard at the data, we will find patterns; we want patterns that generalize to data we have not yet seen
• Applying data science to a well-structured problem versus exploratory data mining requires different levels of effort at different stages of the data mining process
3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science techniques. These include concepts such as the following.
• Identifying informative attributes: those that correlate with, or otherwise give us information about, an unknown quantity of interest
• Fitting a numeric function model to data by choosing an objective and finding a set of parameters based on that objective
• Controlling complexity, which is necessary to find a good trade-off between generalization and overfitting
• Calculating similarity between objects described by data
Once we think about data science in terms of its fundamental concepts, we see the same concepts underlying many different data science strategies, tasks, algorithms, and processes. As we have illustrated throughout the book, these principles not only allow us to understand the theory and practice of data science much more deeply, they also allow us to understand the methods and techniques of data science very broadly, because these methods and techniques are quite often simply particular instantiations of one or more of the fundamental principles.
At a high level, we saw how structuring business problems using the expected value framework allows us to decompose problems into data science tasks that we understand better how to solve, and this applies across many different sorts of business problems.
For extracting knowledge from data, we saw that our fundamental concept of determining the similarity of two objects described by data is used directly, for example to find customers similar to our best customers. It is used for both classification and regression, via nearest-neighbor methods. It is the basis for clustering, the unsupervised grouping of data objects. It is the basis for finding the documents most related to a search query. And it is the basis for more than one common method for making recommendations, for example by casting both customers and movies into the same "taste space" and then finding the movies most similar to a particular customer.
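As a reminder of how little machinery the similarity concept needs, here is a minimal sketch of two of the uses just listed: finding the customers most similar to a given one, and nearest-neighbor classification. The customer names, feature vectors, and labels are invented for illustration, and Euclidean distance is only one of several reasonable similarity measures; the function names are assumptions as well.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, items, k=3):
    """Return the names of the k items whose vectors are closest to query.

    items maps a name to its feature vector.
    """
    ranked = sorted(items,
                    key=lambda name: euclidean_distance(query, items[name]))
    return ranked[:k]


def knn_classify(query, items, labels, k=3):
    """Predict a label for query by majority vote among its k neighbors."""
    neighbors = nearest_neighbors(query, items, k)
    votes = Counter(labels[name] for name in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented customers described by two (already scaled) attributes.
    customers = {
        "ann": (1.0, 1.0),
        "bob": (1.2, 0.9),
        "cat": (5.0, 5.0),
        "dan": (0.9, 1.1),
        "eve": (5.2, 4.8),
    }
    responded = {
        "ann": "responds", "bob": "responds", "dan": "responds",
        "cat": "ignores", "eve": "ignores",
    }
    new_customer = (1.0, 1.05)
    print(nearest_neighbors(new_customer, customers, k=2))
    print(knn_classify(new_customer, customers, responded, k=3))
```

The same distance function could serve clustering or retrieval; only the way the ranked neighbors are used changes.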
Όσον αφορά τη μέτρηση, βλέπουμε την εντύπωση της μεγέθυνσης – που καθορίζει πόσο πιο πιθανό είναι ένα μοτίβο από ό,τι θα αναμενόταν τυχαία – να φαίνεται μεγάλη στην εκμάθηση δεδομένων, όταν αξιολογούμε πολύ διαφορετικές ταξινομήσεις μοτίβων. Εφάπαξ αξιολογεί τις διαδικασίες για τη στόχευση της διαφήμισης υπολογίζοντας την αύξηση που επιτυγχάνεται για τον πληθυσμό-στόχο. Ανύψωση του ματιού για να κρίνει κανείς το δίκτυο αποδεικτικών στοιχείων υπέρ ή κατά ενός συμπεράσματος. Η ανύψωση υπολογίζεται για να βοηθήσει να κριθεί εάν μια επαναλαμβανόμενη συν-συμβάν είναι διασκεδαστική, σε αντίθεση με το ότι η απλότητα είναι φυσικό επακόλουθο της δημοτικότητας. Επιπλέον, η κατανόηση της θεμελιώδους έννοιας διευκολύνει την επικοινωνία μεταξύ των ενδιαφερομένων στις επιχειρήσεις και των επιστημόνων δεδομένων, όχι μόνο λόγω του ποιος μοιράζεται το λεξιλόγιο, αλλά επειδή και οι δύο πλευρές καταλαβαίνουν βασικά καλύτερα. Αντί να περιμένει κανείς να συζητηθούν πλήρως οι ζωτικές πτυχές, μπορεί κανείς να εμβαθύνει και να κάνει συχνές ερωτήσεις που θα αποκαλύψουν κρίσιμες πτυχές που διαφορετικά δεν θα είχαν αποκαλυφθεί. Για παράδειγμα, ας υποθέσουμε ότι η επιχείρησή σας εξετάζει το ενδεχόμενο να επενδύσει σε μια εταιρεία που βασίζεται στην επιστήμη των δεδομένων για να δημιουργήσει μια προσαρμόσιμη διαδικτυακή υπηρεσία ειδήσεων. Ρωτάς πώς ακριβώς εξατομικεύουν τις ειδήσεις. Το άτομο λέει ότι χρησιμοποιεί μηχανές back vector. Ας προσποιηθούμε ακόμη και ότι δεν μιλάμε για μηχανές σωλήνων στήριξης σε αυτό το βιβλίο. Θα πρέπει να σκέφτεστε με αρκετή σιγουριά για τις γνώσεις σας στην επιστήμη των δεδομένων μέχρι τώρα, επομένως δεν πρέπει να λέτε απλώς "Ω, εντάξει". Θα πρέπει να μπορείτε να ρωτάτε με σιγουριά, "Τι είναι αυτό ακριβώς;" Εάν ξέρουν πραγματικά για τι πράγμα μιλάνε, πρέπει να της δώσουν αρκετές εξηγήσεις με βάση τις βασικές αρχές μας (όπως κάναμε στο Κεφάλαιο 4). 
Είστε επίσης έτοιμοι να ρωτήσετε, "Ποιες ακριβώς είναι οι ημερομηνίες διδασκαλίας που σκοπεύετε να χρησιμοποιήσετε;" Όχι μόνο μπορεί αυτό να εντυπωσιάσει τους επιστήμονες δεδομένων στο προσωπικό, αλλά είναι στην πραγματικότητα μια σημαντική ερώτηση που πρέπει να δείτε The Fundamentals of Data Science
they are doing something credible, or just using "data science" as a smokescreen to hide behind. You can go on to consider whether you really believe that building any predictive model from these data, regardless of what sort of model it is, would be likely to solve the business problem they are attacking. You should be ready to ask whether you really think they will have reliable training labels for such a task. And so on.
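As a concrete reminder of the lift notion discussed in this section, the following is a minimal Python sketch. The function names and all counts are hypothetical and chosen only for illustration; the formulas are the standard ratio of an observed rate to the rate expected by chance.

```python
def cooccurrence_lift(n_both: int, n_a: int, n_b: int, n_total: int) -> float:
    """Lift of A and B occurring together: p(A, B) / (p(A) * p(B)).

    Written in terms of counts so the arithmetic stays exact for small examples.
    """
    return (n_both * n_total) / (n_a * n_b)


def targeting_lift(resp_targeted: int, n_targeted: int,
                   resp_all: int, n_all: int) -> float:
    """Response rate among targeted people divided by the overall response rate."""
    return (resp_targeted * n_all) / (n_targeted * resp_all)


# Hypothetical counts: 1,000 baskets; item A in 200, item B in 100, both in 40.
basket_lift = cooccurrence_lift(n_both=40, n_a=200, n_b=100, n_total=1000)

# Hypothetical campaign: 30 of 1,000 targeted people respond,
# versus 100 of 10,000 people overall.
campaign_lift = targeting_lift(30, 1000, 100, 10000)

print(basket_lift, campaign_lift)
```

A lift of 1 means the pattern is no more likely than chance would suggest; values above 1 mean it occurs more often than expected.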
Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data

As we have emphasized repeatedly, once we think about data science as a collection of concepts, principles, and general methods, we will have much more success both in understanding data science activities broadly and in applying data science to new business problems. Let's consider a fresh example.

Recently (as of this writing) there has been a marked shift in consumer online activity from traditional computers to a wide variety of mobile devices. Companies, many still working to understand how to reach consumers on their desktop computers, are now scrambling to understand how to reach consumers on their mobile devices: smart phones, tablets, and even increasingly mobile laptop computers, as WiFi access becomes ubiquitous. We won't talk about most of the complexity of that problem, but from our perspective, the data-analytic thinker might notice that mobile devices provide a new sort of data from which little leverage has yet been obtained. In particular, mobile devices are associated with data on their location.

For example, in the mobile advertising ecosystem, depending on my privacy settings, my mobile device may broadcast my exact GPS location to those entities that would like to target me with advertisements, daily deals, and other offers. Figure 14-1 shows a scatterplot of a small sample of locations that a potential advertiser might see, sampled from the mobile advertising ecosystem. Even if I do not broadcast my GPS location, my device broadcasts the IP address of the network it is currently using, which often conveys location information.
Figure 14-1. A scatterplot of a sample of GPS locations captured from mobile devices.

As an interesting side point, this is just a scatterplot of the latitude and longitude broadcast by mobile devices; there is no map! It gives a striking picture of population density across the world. And it makes us wonder what is going on with mobile devices in Antarctica.
How might we use such data? Let's apply our fundamental concepts. If we want to get beyond exploratory data analysis (as we started with the visualization in Figure 14-1), we need to think in terms of some concrete business problem. A particular firm might have certain problems to solve and be focused on one or two. An entrepreneur or investor might scan across the different possible problems she sees that businesses or consumers currently have. Let's pick one related to these data.

Advertisers face the problem that in this new world we see a variety of different devices, and a particular consumer's behavior may be fragmented across several of them. In the desktop world, once advertisers identify a good prospect, perhaps via a cookie in a particular consumer's browser or a device ID, they can begin to take action accordingly, for example by presenting targeted ads. In the mobile ecosystem, this consumer's activity
is fragmented across devices. Even if a good prospect is found on one device, how can she be targeted on her other devices?

One possibility is to use the location data to winnow the space of possible other devices that could belong to this prospect. Figure 14-1 suggests that a huge portion of the space of possible alternatives would be eliminated if we could profile the location visitation behavior of a mobile device. Presumably, my location behavior on my smart phone will be fairly similar to my location behavior on my laptop, especially if I am considering the WiFi locations that I use.1 So I may want to draw on what I know about assessing the similarity of data items (Chapter 6).

1. Which incidentally can be anonymized if I am concerned about invasions of privacy. More on that later.

When working through our data understanding phase, we need to decide exactly how we will represent devices and their locations. Once we step back from the details of algorithms and applications and think instead about the fundamentals, we might notice that the ideas discussed in the example of problem formulation for text mining (Chapter 10) would apply very well here, even though this example has nothing to do with text. When mining data on documents, we often ignore much of the structure of the text, such as its sequence. For many problems we can simply treat each document as a collection of words from a potentially large vocabulary. The same thinking applies here. Obviously there is considerable structure to the locations one visits, such as the sequence in which they are visited, but for data mining a simplest-first strategy is often best. Let's just consider each device to be a "bag of locations," in analogy to the bag-of-words representation discussed in Chapter 10.

If we are trying to find other instances of the same user, we might also profitably apply the ideas of TFIDF for text to our locations. WiFi locations that are very popular (like the Starbucks on the corner of Washington Square Park) are unlikely to be very informative in a similarity calculation focused on finding the same user on different devices. Such a location would get a low IDF score (think of the "D" as standing for "Device" rather than "Document"). At the other end of the spectrum, for many people their apartment WiFi networks would have few different devices, and would therefore be quite discriminative. TFIDF on location would magnify the importance of these locations in a similarity calculation. In between these two in discriminability might be an office WiFi network, which might get a middle-of-the-road IDF score.

Now, if our device profile is a TFIDF representation based on our bag of locations, then, as with using similarity over the TFIDF formulation for the search query in the jazz musician example in Chapter 10, we might look for the devices most similar to the one that we had identified as a good prospect. Let's say that my laptop was the device identified as a good prospect. My laptop is observed on my apartment WiFi network and on my work WiFi network. The only other devices observed there are my phone, my tablet, and possibly the mobile devices of my wife and a few friends and colleagues (but note that these will get low TF scores at one or the other location, as compared to my devices). Thus, it is likely that my phone and tablet will be strongly similar, possibly most similar, to the one identified as a prospect. If the advertiser had identified my laptop as a good prospect for a particular ad, then this formulation would also identify my phone and tablet as good prospects for the same ad.

This example is not meant to be a definitive solution to the problem of finding corresponding users on different mobile devices;2 it shows how having a conceptual toolkit can be helpful in thinking about a brand-new problem. Once these ideas are conceptualized, data scientists would dig in to figure out what really works and how to flesh out and extend the ideas, applying many of the concepts we have discussed (such as how to evaluate alternative implementation options).
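The "bag of locations" idea in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used by any advertising company: the device names, WiFi location names, and visit counts are hypothetical, and the IDF form used here (1 plus the log of the number of devices divided by the number of devices seen at a location) is one common variant.

```python
import math
from collections import Counter


def idf(device_visits):
    """IDF per location: 1 + log(number of devices / devices seen at the location)."""
    n_devices = len(device_visits)
    devices_per_location = Counter()
    for visits in device_visits.values():
        devices_per_location.update(set(visits))  # count each device once per location
    return {loc: 1.0 + math.log(n_devices / count)
            for loc, count in devices_per_location.items()}


def tfidf_profile(visits, idf_scores):
    """TFIDF vector for one device: visit count (TF) times the location's IDF."""
    return {loc: count * idf_scores[loc] for loc, count in Counter(visits).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(loc, 0.0) for loc, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(target, device_visits):
    """Rank all other devices by similarity of their location profile to the target."""
    scores = idf(device_visits)
    profiles = {d: tfidf_profile(v, scores) for d, v in device_visits.items()}
    ranked = [(cosine(profiles[target], p), d)
              for d, p in profiles.items() if d != target]
    return sorted(ranked, reverse=True)


# Hypothetical observations: each device is a bag of WiFi locations it was seen on.
visits = {
    "my_laptop": ["home_wifi"] * 5 + ["office_wifi"] * 4 + ["cafe_wifi"],
    "my_phone": ["home_wifi"] * 6 + ["office_wifi"] * 3 + ["cafe_wifi"] * 2,
    "colleague_phone": ["office_wifi"] * 5 + ["cafe_wifi"] * 2 + ["colleague_home"] * 6,
    "stranger_tablet": ["cafe_wifi"] * 3 + ["stranger_home"] * 5,
}

for score, device in most_similar("my_laptop", visits):
    print(f"{device}: {score:.3f}")
```

In this toy data the popular cafe receives the lowest IDF, so sharing it contributes little to similarity, while the rarely shared home network dominates the comparison.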
Changing the Way We Think About Solutions to Business Problems

The example also provides a concrete illustration of yet another important fundamental concept (we have not exhausted them even after this many pages of a detailed book). It is quite common that in the business understanding/data understanding subcycle of the data mining process, our notion of what the problem is changes to fit what we can actually do with the data. Often the change is subtle, but it is very important to (try to) notice when this happens. Why? Because not all stakeholders are involved in the data science problem formulation. If we forget that we have changed the problem, especially if the change is subtle, we may run into resistance down the line. And it may be resistance due purely to misunderstanding! What is worse, it may be perceived as stubbornness, which might lead to hard feelings that threaten the success of the project.

Let's look back at the mobile targeting example. The astute reader might have said: Wait a minute. We started by saying that we were going to find the same users on different devices. What we have done is find users who are very similar in terms of their location information. I may be willing to agree that the set of these similar users is very likely to contain the same user, more likely than any alternative I can think of, but that is not the same as finding the same user on different devices. This reader would be correct. In working through our problem formulation, the problem changed slightly. We have now made the identification of the same user probabilistic: there may be a very high probability that the subset of devices with very similar location profiles will contain other instances of the same user, but it is not guaranteed. This needs to be clear in our own minds, and made clear to stakeholders.

It turns out that for targeting advertisements or offers, this change will probably be acceptable to all stakeholders.
Recalling our cost/benefit framework for evaluating data mining solutions (Chapter 7), it is pretty clear that for many offers, targeting some false positives will be of relatively low cost compared to the benefit of hitting more true positives. What is more, for many offers targeters may actually be happy to "miss," if a miss means hitting other people with similar interests. And my wife and close friends and colleagues are pretty good hits for many of my tastes and interests!3

2. It is, however, the essence of a real-world solution to the problem, implemented by one of the most advanced mobile advertising companies.
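The cost/benefit reasoning about false positives can be made concrete with a small expected value calculation. The sketch below is an illustration under assumed numbers: the function names, the benefit of 2.00 per correctly targeted device, and the cost of 0.01 per mistargeted ad are all hypothetical.

```python
def expected_value_per_device(p_same_user: float,
                              benefit_hit: float,
                              cost_miss: float) -> float:
    """Expected value of targeting one candidate device.

    p_same_user  : estimated probability the device belongs to the prospect
    benefit_hit  : value gained when it does (true positive)
    cost_miss    : cost incurred when it does not (false positive)
    """
    return p_same_user * benefit_hit - (1.0 - p_same_user) * cost_miss


def break_even_probability(benefit_hit: float, cost_miss: float) -> float:
    """Probability at which targeting has zero expected value."""
    return cost_miss / (benefit_hit + cost_miss)


# Hypothetical numbers: a hit is worth 2.00, a wasted ad costs 0.01.
ev = expected_value_per_device(p_same_user=0.3, benefit_hit=2.00, cost_miss=0.01)
threshold = break_even_probability(benefit_hit=2.00, cost_miss=0.01)

print(f"expected value per device: {ev:.3f}")
print(f"break-even probability:    {threshold:.4f}")
```

With costs this lopsided, targeting is worthwhile even when the probability that a device belongs to the same user is quite low, which is why a probabilistic match can be acceptable for this application.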
What Data Can't Do: Humans in the Loop, Revisited

This book has focused on how, why, and when we can get business value from data science by enhancing data-driven decision making. It is also important to consider the limits of data science and data-driven decision making.

There are things computers are good at and things people are good at, but often these are not the same things. For example, humans are much better at identifying, from everything out in the world, small sets of relevant aspects of the world from which to gather data in support of a particular task. Computers are much better at sifting through a massive collection of data, including a huge number of (possibly) relevant variables, and quantifying the variables' relevance to predicting a target.

New York Times Op-Ed columnist David Brooks has written an excellent essay entitled "What Data Can't Do" (Brooks, 2013). You should read it if you are intrigued by the idea of a magical application of data science that will solve your problems.

Data science involves the judicious integration of human knowledge and computer-based techniques to achieve what neither of them could achieve alone. (And beware of any tool vendor who suggests otherwise!) The data mining process introduced in Chapter 2 helps direct the combination of humans and computers. The structure imposed by the process emphasizes the involvement of humans early, to ensure that the application of data science methods is focused on the right tasks. Examining the data mining process also reveals that task selection and specification is not the only place where human interaction is critical. As discussed in Chapter 2, one of the places where human creativity, knowledge, and common sense add value is in selecting the right data to mine, which is far too often overlooked in discussions of data mining, especially considering its importance.
3. In an article in the Proceedings of the National Academy of Sciences, Crandall et al. (2010) show that geographic co-occurrences between individuals are very strongly predictive of the individuals being friends: "The knowledge that two people were proximate at just a few distinct locations at roughly the same times can indicate a high conditional probability that they are directly linked in the underlying social network." This means that even "misses" due to location similarity may still carry some of the advantage of social network targeting, which has been shown to be extremely effective for marketing (Hill et al., 2006).
Human interaction is also critical in the evaluation stage of the process. The combination of the right data and data science techniques excels at finding models that optimize some objective criterion. Only humans can tell what the best objective criterion is to optimize for a particular problem. This involves substantial subjective human judgment, because often the true criterion to be optimized cannot be measured, so the humans have to pick the best proxy or proxies possible, and keep these decisions in mind as sources of risk when the models are deployed. And then we need careful, and sometimes creative, attention to whether the resulting models or patterns actually help solve the problem.

We also need to keep in mind that the data to which we will apply data science techniques are the product of some process that involved human decisions. We should not fall prey to thinking that the data represent objective truth.4 Data incorporate the beliefs, purposes, biases, and pragmatics of those who designed the data collection systems. The meaning of data is colored by our own beliefs.

Consider the following simple example. Many years ago, your authors worked together as data scientists at one of the largest telephone companies. There was a terrible problem with fraud in the wireless business, and we applied data science methods to massive data on cell phone usage, social calling patterns, locations visited, and so on (Fawcett & Provost, 1996, 1997). A seemingly well-performing component of a model for detecting fraud indicated that "calling from cellsite number zero provides substantially increased risk of fraud." This was verified through careful holdout evaluation. Fortunately (in this instance), we followed good data science practice and in the evaluation phase worked to ensure domain-knowledge validation of the model. We had trouble understanding this particular model component. There were many cellsites that indicated an elevated probability of fraud,5 but cellsite zero stood out. Furthermore, the other cellsites made sense, because when you looked up their locations there was at least a good story: for example, the cellsite was in a high-crime area. Looking up cellsite zero yielded nothing at all. It was not in the cellsite lists. We went to the top data guru to divine the answer. Indeed, there was no cellsite zero. But the data clearly contained many fraudulent calls from cellsite zero!
4. The philosophically minded should read W. V. O. Quine's (1951) classic essay, "Two Dogmas of Empiricism," in which he presents a biting criticism of the common notion that there is a dichotomy between the empirical and the analytical.

5. Technically, the models were most useful if there was a significant change in behavior toward more calling from these cellsites. If you are interested, the papers describe this in detail.
To make a long story short, our understanding of the data was wrong. Briefly, when fraud was resolved on a customer's account, often a substantial amount of time passed between the bill being printed, sent out, received by the customer, opened, read, and acted upon. During this time, fraudulent activity continued. Once the fraud had been detected, these calls should not appear on the customer's next bill, so they were removed from the billing system. They were not discarded, however, but (fortunately for the data mining efforts) were kept in a different database. Unfortunately, whoever designed that database decided that it was not important to keep certain fields. One of these was the cellsite. Thus, when the data science effort asked for data on all the fraudulent calls in order to build training and test sets, these calls were included. As they did not have a cellsite, another design decision (conscious or not) led the fields to be filled with zeros. Thus, many of the fraudulent calls seemed to come from cellsite zero!

This is a "leak," as introduced in Chapter 2. You might think it should have been easy to spot. It was not, for several reasons. Consider how many phone calls are made by tens of millions of customers over many months, and that for each call there was a very large number of possible descriptive attributes. There was no possibility of manually examining the data. Further, the calls were grouped by customer, so there was not a conspicuous bunch of cellsite-zero calls; they were interspersed with each customer's other calls. Finally, and possibly most importantly, as part of the data preparation the data were cleaned to improve the quality of the target variable. Some calls credited as fraud to an account were not actually fraudulent. Many of these, in turn, could be identified by seeing that the customer had called the same numbers in a prior, nonfraud period. The result was that calling from cellsite zero had an elevated probability of fraud, but was not a perfect predictor of fraud (which would have been a red flag).

The point of this mini case study is to illustrate that "what the data is" is an interpretation that we place on it. This interpretation often changes through the process of data mining, and we need to embrace this malleability. Our fraud detection example showed a change in the interpretation of a data item. We often also change our understanding of how the data were sampled as we uncover biases in the data collection process. For example, if we want to model consumer behavior in order to design or deliver a marketing campaign, it is essential to understand exactly what consumer base the data were sampled from. This again sounds obvious in theory, but in practice it may involve an in-depth analysis of the systems and businesses from which the data came.

Finally, we need to be discerning about the sorts of problems for which data science, even with humans in the loop, is likely to add value. We must ask: are there really sufficient data pertaining to the decision at hand? Very high-level strategic decisions may be placed in a unique context. Data analyses, as well as theoretical simulations, may provide insight, but often for the highest-level decisions the decision makers must rely on their experience, knowledge, and intuition. This certainly applies to strategic decisions such as whether to acquire a particular company: data analysis can support the
decision, but ultimately each situation is unique, and the judgment of an experienced strategist will be necessary.

This idea of unique situations should be carried through. At one extreme, we might think of Steve Jobs' famous statement: "It's really hard to design products by focus groups. A lot of times, people don't know what they want until you show it to them... That doesn't mean we don't listen to customers, but it's hard for them to tell you what they want when they've never seen anything remotely like it." As we look to the future, we may hope that with the increasing ability to do careful, automated experimentation we may move from asking people what they would like or would find useful to observing what they like or find useful. To do this well, we need to follow our fundamental principle: consider data as an asset, in which we may need to invest. Our Capital One case from Chapter 1 is an example of creating many products and investing in data and data science to determine both which products people would want and, for each product, which people would be appropriate (i.e., profitable) customers.
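The cellsite-zero episode described earlier in this section suggests one simple, partial sanity check: look at how the target behaves among records whose fields hold placeholder values. The sketch below is a hypothetical illustration, not the procedure the authors used; the field names, the choice of placeholder values, and the toy records are all assumptions.

```python
def placeholder_report(records, target, placeholders=(0, None, "")):
    """For each field, compare the target rate among placeholder-valued records
    with the overall target rate.

    Returns {field: (n_placeholder_records, target_rate_among_them, ratio_to_overall)}.
    Fields with no placeholder values are omitted.
    """
    overall = sum(r[target] for r in records) / len(records)
    report = {}
    for field in records[0]:
        if field == target:
            continue
        flagged = [r for r in records if r[field] in placeholders]
        if flagged:
            rate = sum(r[target] for r in flagged) / len(flagged)
            ratio = rate / overall if overall else float("inf")
            report[field] = (len(flagged), rate, ratio)
    return report


# Toy call records (hypothetical): cell_site 0 stands in for a missing value.
calls = [
    {"cell_site": 17, "minutes": 3, "fraud": 0},
    {"cell_site": 0, "minutes": 12, "fraud": 1},
    {"cell_site": 42, "minutes": 5, "fraud": 0},
    {"cell_site": 0, "minutes": 9, "fraud": 1},
    {"cell_site": 17, "minutes": 1, "fraud": 0},
    {"cell_site": 0, "minutes": 7, "fraud": 0},
    {"cell_site": 8, "minutes": 4, "fraud": 1},
    {"cell_site": 42, "minutes": 2, "fraud": 0},
]

for field, (n, rate, ratio) in placeholder_report(calls, "fraud").items():
    print(f"{field}: {n} placeholder records, fraud rate {rate:.2f}, "
          f"{ratio:.2f}x the overall rate")
```

A check like this only raises a question; as the story shows, it still takes someone with domain knowledge to explain why a placeholder value is associated with the target.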
Privacy, Ethics, and Mining Data About Individuals

Mining data, especially data about individuals, raises important ethical issues that should not be ignored. There has recently been considerable discussion in the press and within government agencies about privacy and data (especially online data), but the issues are much broader. Most large consumer-facing companies collect or purchase detailed data on all of us. These data are used directly to make decisions in many of the business applications we have discussed throughout the book: Should we be granted credit? If so, what should our credit line be? Should we be targeted with an offer? What content would we like to be shown on the website? What products should be recommended to us? Are we likely to defect to a competitor? Is there fraud on our account?

The tension between privacy and improving business decisions is intense because there seems to be a direct relationship between increased use of personal data and increased effectiveness of the associated business decisions. For example, a study by researchers at the University of Toronto and MIT showed that after particularly stringent privacy protection was enacted in Europe, online advertising became significantly less effective. In particular, "the difference in stated purchase intent between those who were exposed to ads and those who were not dropped by approximately 65%. There was no such change for countries outside Europe" (Goldfarb & Tucker, 2011).6 This is not a phenomenon restricted to online advertising: adding fine-grained social network data (e.g., who communicates with whom) to more traditional data on individuals substantially increases the effectiveness of fraud detection (Fawcett & Provost, 1997) and targeted marketing (Hill et al., 2006). Generally, the more fine-grained data you collect on individuals, the better you can predict things about them that are important for business decision making. This seemingly direct relationship between reduced privacy and increased business performance elicits strong feelings from both the privacy and the business perspectives (sometimes within the same person).

6. See Mayer and Narayanan's website for a criticism of this and other research claims about the value of behaviorally targeted online advertising.

It is far beyond the scope of this book to resolve this problem, and the issues are extremely complicated (for example, what sort of "anonymization" would be sufficient?) and diverse. Possibly the biggest impediment to the reasoned consideration of privacy-friendly data science designs is the difficulty of even defining what privacy is. Daniel Solove is a world authority on privacy. His article "A Taxonomy of Privacy" (2006) begins:

Privacy is a concept in disarray. Nobody can articulate what it means. As one commentator has observed, privacy suffers from "an embarrassment of meanings."
Solove's article goes on to spend more than 80 pages giving a taxonomy of privacy. Helen Nissenbaum is another world authority on privacy, who has recently concentrated specifically on the relationship between privacy and massive databases (and the mining thereof). Her book in this area, Privacy in Context, is over 300 pages long (and well worth reading). We bring this up to emphasize that privacy concerns are not easy-to-understand or easy-to-deal-with issues that can be quickly dispatched, or even written about well as a section or chapter of a data science book. Whether you are a data scientist or a business stakeholder in data science projects, you should care about privacy concerns, and you will need to invest serious time in thinking carefully about them.
Is There More to Data Science?

Although this book is fairly thick, we have tried hard to pick the most relevant fundamental concepts to help the data scientist and the business stakeholder understand data science and communicate well. Of course, we have not covered all the fundamental concepts of data science, and any given data scientist may dispute whether we have included exactly the right ones. But all should agree that these are some of the most important concepts and that they underlie a vast amount of the science.

There are all manner of advanced topics and closely related topics that build on the fundamentals presented here. We will not try to list them. If you are interested, peruse the programs of recent top-notch data mining research conferences, such as the ACM SIGKDD International Conference on Data Mining and Knowledge Discovery or the IEEE International Conference on Data Mining. Both of these conferences also have top-notch industry tracks, focusing on applications of data science to business and government problems.

Let us give just one more concrete example of the sort of topic one might find when exploring further. Recall our first principle of data science: data (and data science capability) should be regarded as assets, and should be candidates for investment. Throughout the book we have discussed, in increasing complexity, the notion of investing in data. If we apply our general framework of explicitly considering the costs and benefits in data science projects, it leads us to new thinking about investing in data.
Final Example: From Crowd-Sourcing to Cloud-Sourcing

The connectivity between businesses and "consumers" brought about by the Internet has changed the economics of labor. Web-based systems such as Amazon's Mechanical Turk and oDesk (among others) facilitate a type of crowd-sourcing that might be called "cloud labor": harnessing, via the Internet, a vast pool of independent contractors. One sort of cloud labor that is particularly relevant to data science is "micro-outsourcing": the outsourcing of large numbers of very small, well-defined tasks. Micro-outsourcing is particularly relevant to data science because it changes the economics, as well as the practicalities, of investing in data.7

As one example, recall the requirements for applying supervised modeling (see Chapter 2). We need to have specified a target variable precisely, and we need to actually have values for the target variable ("labels") for a set of training data. Sometimes we can specify the target variable precisely, but we find that we do not have any labeled data. In certain cases, we can use micro-outsourcing systems such as Mechanical Turk to label data.

For example, advertisers would like to keep their ads off objectionable web pages, such as those that contain hate speech. However, with billions of pages on which to place their ads, how can they know which ones are objectionable? It would be far too costly to have employees look at them all. We might immediately recognize this as a possible candidate for text classification (Chapter 10): we can get the text of the page, represent it as feature vectors as we have discussed, and build a hate speech classifier. Unfortunately, we have no representative sample of hate speech pages to use as training data. However, if this problem is important enough8 then we should consider investing in labeled training data to see whether we can build a model to identify pages containing hate speech.
7. The interested reader can go to Google Scholar and query on "data mining mechanical turk" or, more broadly, on "human computation" to find papers on the topic, and can follow the forward citation links ("Cited by") to find even more.

8. In fact, the problem of ads appearing on objectionable pages has been reported to be a billion-dollar problem (Winterberry Group, 2010).
Cloud labor changes the economics of investing in data in our example of getting labeled training data. We can engage very inexpensive labor via the Internet to invest in data in various ways. For example, we can have workers on Amazon Mechanical Turk label pages as objectionable or not, providing us with target labels much more cheaply than hiring even student workers.

The rate of completion, when done by a trained intern, was 250 websites per hour, at a cost of $15/hr. When posted on Amazon Mechanical Turk, the labeling rate went up to 2,500 websites per hour and the overall cost remained the same. (Ipeirotis et al., 2010)
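The figures in the quotation can be turned into a quick back-of-the-envelope calculation. In the sketch below only the rates (250 and 2,500 sites per hour) and the $15/hour intern cost come from the quotation; the function names and the workload of 100,000 pages are hypothetical, and the code makes no assumption about how Mechanical Turk workers were paid.

```python
def cost_per_page(hourly_cost: float, pages_per_hour: float) -> float:
    """Labeling cost per page for a worker paid by the hour."""
    return hourly_cost / pages_per_hour


def hours_to_label(n_pages: int, pages_per_hour: float) -> float:
    """Elapsed labeling time for a workload at a given throughput."""
    return n_pages / pages_per_hour


N_PAGES = 100_000  # hypothetical workload, not from the source

intern_cost_per_page = cost_per_page(hourly_cost=15.0, pages_per_hour=250)
intern_hours = hours_to_label(N_PAGES, pages_per_hour=250)
turk_hours = hours_to_label(N_PAGES, pages_per_hour=2500)

print(f"intern cost per page: ${intern_cost_per_page:.2f}")
print(f"intern hours for {N_PAGES:,} pages: {intern_hours:.0f}")
print(f"crowd hours for {N_PAGES:,} pages:  {turk_hours:.0f}")
```

Under these assumptions the tenfold increase in throughput shortens the labeling job from hundreds of hours to tens of hours, which is the sense in which cloud labor changes the practicalities of investing in labeled data.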
The problem is that you get what you pay for, and low cost sometimes means low quality. Over the past ten years there has been a surge of research on the problems of maintaining quality while taking advantage of cloud labor. Note that page labeling is just one example of enhancing data science with cloud labor. Even in this case study there are other options, such as using cloud labor to search for positive examples of hate speech instead of labeling the pages that we give them (Attenberg & Provost, 2010), or challenging cloud laborers in a game-like system to find cases where the current model makes mistakes, that is, to "beat the machine" (Attenberg et al., 2011).
Final Words
Your authors have been working on applying data science to real business problems for more than two decades. You would think that it would all have become second nature. It is striking how useful it still can be, even for us, to have this set of explicit fundamental concepts in hand. So many times, when you reach a seeming impasse in thinking, pulling out the fundamental concepts makes the way clear. "Well, let's go back to our business and data understanding... what exactly is the problem we are trying to solve?" can resolve many problems, whether we then decide to work through the implications of the expected value framework, or to think more carefully about how the data are gathered, or about whether the costs and benefits are specified well, or about further investing in data, or to consider whether the target variable has been defined appropriately for the problem to be solved, and so on. Knowing the different sorts of data science tasks helps to keep the data scientist from treating all business problems as nails for the particular hammers that he knows well. Thinking carefully about what is important to the business problem, when considering evaluation and "baselines" for comparison, brings interactions with stakeholders to life. (Compare that with the chilling effect of reporting some statistic such as mean squared error when it is meaningless to the problem at hand.) This facilitation of data-analytic thinking applies not just to the data scientists, but to everyone involved.
If you are a business stakeholder rather than a data scientist, don't let so-called data scientists bamboozle you with jargon: the concepts of this book, plus knowledge of your own business and data systems, should allow you to understand 80% or more of the data science at a reasonable enough level to be productive for your business. After having read this book, if you don't understand what a data scientist is talking about, be wary. There are of course many, many more complex concepts in data science, but a good data scientist should be able to describe the fundamentals of the problem and its solution at the level and in the terms of this book.
If you are a data scientist, take this as our challenge: think deeply about exactly why your work is relevant to helping the business, and be able to present it as such.
APPENDIX A
Proposal Review Guide
Effective data-analytic thinking should allow you to assess potential data mining projects systematically. The material in this book should give you the necessary background to assess proposed data mining projects and to uncover potential flaws in proposals. This skill can be applied both as a self-assessment for your own proposals and as an aid in evaluating proposals from internal data science teams or external consultants.
What follows is a set of questions to keep in mind when considering a data mining project. The questions are framed by the data mining process, discussed in detail in Chapter 2 and used as a conceptual framework throughout the book. After reading this book, you should be able to apply these ideas to a new business problem. The list that follows is not meant to be exhaustive (in general, the book is not meant to be exhaustive). However, the list contains a selection of some of the most important questions to ask.
Throughout the book we have concentrated on data science projects where the focus is to mine some regularities, patterns, or models from the data. The proposal review guide reflects this. There may be data science projects in an organization where these regularities are not so explicitly defined. For example, many data visualization projects initially do not have crisply defined objectives for modeling. Nevertheless, the data mining process can help to structure data-analytic thinking about such projects; they simply resemble unsupervised data mining more than supervised data mining.
Business and Data Understanding
• What exactly is the business problem to be solved?
• Is the data science solution formulated appropriately to solve this business problem? Note: sometimes we have to make judicious approximations.
• What business entity does an instance/example correspond to?
• Is the problem a supervised or unsupervised problem?
  - If supervised:
    - Is a target variable defined?
    - If so, is it defined precisely?
    - Think about the values it can take.
• Are the attributes defined precisely?
  - Think about the values they can take.
• For supervised problems: will modeling this target variable actually improve the stated business problem? An important subproblem? If the latter, is the rest of the business problem addressed?
• Does framing the problem in terms of expected value help to structure the subtasks that need to be solved?
• If unsupervised, is there a well-defined "exploratory data analysis" path? (That is, where is the analysis going?)
Data Preparation
• Will it be practical to get values for the attributes, create feature vectors, and put them into a single table?
• If not, is an alternative data format defined clearly and precisely? Is this taken into account in the later stages of the project? (Many of the later methods/techniques assume that the data set is in feature vector format.)
• If the modeling will be supervised, is the target variable well defined? Is it clear how to get values for the target variable (for training and testing) and put them into the table?
• How exactly will the values for the target variable be acquired? Are there any costs involved? If so, are the costs taken into account in the proposal?
• Are the data being drawn from a population similar to the one to which the model will be applied? If there are discrepancies, are the selection biases noted clearly? Is there a plan for how to compensate for them?
Modeling
• Is the choice of model appropriate for the choice of target variable?
  - Classification, class probability estimation, ranking, regression, clustering, etc.
• Does the model/modeling technique meet the other requirements of the task?
  - Generalization performance, comprehensibility, speed of learning, speed of application, amount of data required, type of data, missing values?
  - Is the choice of modeling technique compatible with prior knowledge of the problem (e.g., is a linear model being proposed for a definitely nonlinear problem)?
• Should various models be tried and compared (in evaluation)?
• For clustering, is there a similarity metric defined? Does it make sense for the business problem?
Evaluation and Deployment
• Is there a plan for domain-knowledge validation?
  - Will domain experts or stakeholders want to vet the model before deployment? If so, will the model be in a form they can understand?
• Are the evaluation setup and metric appropriate for the business task? Recall the original formulation.
  - Are business costs and benefits taken into account?
  - For classification, how is a classification threshold chosen?
  - Are probability estimates used directly?
  - Is ranking more appropriate (e.g., for a fixed budget)?
  - For regression, how will you evaluate the quality of numeric predictions? Why is this the right way in the context of the problem?
• Does the evaluation use holdout data?
  - Cross-validation is one technique.
• Against what baselines will the results be compared?
  - Why do these make sense in the context of the actual problem to be solved?
  - Is there a plan to evaluate the baseline methods objectively as well?
• For clustering, how will the clustering be understood?
• Will deployment as planned actually (best) address the stated business problem?
• If the project expense has to be justified to stakeholders, what is the plan to measure the final (deployed) business impact?
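The questions about holdout data and baselines can be illustrated with a small sketch. It assumes a toy one-feature data set and two hypothetical "learners" (a majority-class baseline and a simple threshold rule), and estimates the accuracy of each with k-fold cross-validation. All names and data here are illustrative assumptions, not part of any proposal discussed in the book.

```python
from collections import Counter

def k_folds(n, k):
    """Split indices 0..n-1 into k mutually exclusive, interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_accuracy(fit, X, y, k=5):
    """Average holdout accuracy over k folds; fit() returns a predict function."""
    scores = []
    for fold in k_folds(len(X), k):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        predict = fit(train_X, train_y)          # train without the held-out fold
        correct = sum(predict(X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))       # test only on the held-out fold
    return sum(scores) / k

def fit_majority(train_X, train_y):
    """Baseline: always predict the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda x: majority

def fit_threshold(train_X, train_y):
    """Toy model: predict 1 when x exceeds the midpoint of the two class means."""
    pos = [x for x, t in zip(train_X, train_y) if t == 1]
    neg = [x for x, t in zip(train_X, train_y) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0

# Hypothetical data: one numeric attribute, binary target.
X = list(range(1, 21))
y = [0] * 12 + [1] * 8

baseline_acc = cross_val_accuracy(fit_majority, X, y, k=5)
model_acc = cross_val_accuracy(fit_threshold, X, y, k=5)
print("baseline:", baseline_acc, "model:", model_acc)
```

Reporting the model's accuracy next to the baseline's, both measured on held-out folds, gives stakeholders a reference point that a single number on the training data cannot.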
APPENDIX B
Another Sample Proposal
Appendix A presented a set of guidelines and questions useful for evaluating data science proposals. Chapter 13 contained a sample proposal ("Example Data Mining Proposal" on page 327) for a "customer migration" campaign and a critique of its weaknesses ("Flaws in the Big Red Proposal" on page 328). We have used the telecommunications churn problem as a running example throughout this book. Here we present a second sample proposal and critique, this one based on the churn problem.
Scenario and Proposal
You've landed a great job with Green Giant Consulting (GGC), managing an analytical team that is just building up its data science skill set. GGC is proposing a data science project with TelCo, the nation's second-largest provider of wireless communication services, to help address its problem of customer churn. Your team of analysts has produced the following proposal, and you are reviewing it before presenting the proposed plan to TelCo. Do you find any flaws in the plan? Do you have any suggestions for how to improve it?
Churn Reduction via Targeted Incentives: A GGC Proposal
We propose that TelCo test its ability to control its customer churn via an analysis of churn prediction. The key idea is that TelCo can use data on customer behavior to predict when customers will leave, and can then target these customers with special incentives to remain with TelCo. We propose the following modeling problem, which can be carried out using data already in TelCo's possession.
We will model the probability that a customer will (or will not) leave within 90 days of contract expiration, with the understanding that there is a separate problem of retaining customers who are continuing their service month-to-month, long after contract expiration. We believe that predicting churn in this 90-day window is an appropriate starting point, and the lessons learned may apply to other churn-prediction cases as well. The model will be built on a database of historical cases of customers who have left the company. Churn probability will be predicted based on data 45 days prior to contract expiration, in order for TelCo to have sufficient lead time to affect customer behavior with an incentive offer. We will model churn probability by building an ensemble of trees (random forest) model, which is known to have high accuracy for a wide variety of estimation problems.
We estimate that we will be able to identify 70% of the customers who will leave within the 90-day time window. We will verify this by running the model on the database to confirm that the model can indeed reach this level of accuracy. Through interactions with TelCo stakeholders, we understand that it is very important that the V.P. of Customer Retention approve any new customer retention procedures, and she has indicated that she will base her decision on her own assessment that the procedure used for identifying customers makes sense, and on the opinions about the procedure from selected firm experts in customer retention. Therefore, we will give the V.P. and the experts access to the model, so that they can verify that it will operate effectively and appropriately. We propose that every week the model be run to estimate the probabilities of churn of the customers whose contracts expire in 45 days (give or take a week). The customers will be ranked based on these probabilities, and the top N will be selected to receive the current incentive, with N based on the cost of the incentive and the weekly retention budget.
Flaws in the GGC Proposal
We can use our understanding of the fundamental principles and other basic concepts of data science to identify flaws in the proposal. Appendix A provides a starting "guide" for reviewing such proposals, with some of the main questions to ask. However, this book as a whole can really be seen as a proposal review guide. Here are some of the most egregious flaws in Green Giant's proposal:
1. The proposal currently mentions modeling based only on "customers who have left the company." For training (and testing) we will also want customers who did not leave the company, so that the modeling can find discriminative information. (Chapter 2, Chapter 3, Chapter 4, Chapter 7)
2. Why rank customers by the highest probability of churn? Why not rank them by expected loss, using a standard expected value computation? (Chapter 7, Chapter 11)
3. Even better, should we not try to model those customers who are most likely to be influenced (positively) by the incentive? (Chapter 11, Chapter 12)
4. If we are going to proceed as in (3), there is the problem of not having the training data we need. We will have to invest in obtaining training data. (Chapter 3, Chapter 11) Note that the current proposal may well be just a first step toward the business goal, but this would need to be spelled out explicitly: see whether we can estimate the probabilities well. If we can, then it makes sense to proceed. If not, we may need to rethink our investment in this project.
5. The proposal says nothing about assessing generalization performance (i.e., doing a holdout evaluation). It sounds as if they are going to test on the training set ("...running the model on the database..."). (Chapter 5)
6. The proposal does not define (or even mention) which attributes are going to be used! Is this just an omission? Is it because the team has not even thought about it? What is the plan? (Chapter 2, Chapter 3)
7. How does the team estimate that the model will be able to identify 70% of the customers who will leave? There is no mention of any pilot study having been conducted, of learning curves having been produced on data samples, or of any other support for this claim. It seems like a guess. (Chapter 2, Chapter 5, Chapter 7)
8. Furthermore, without discussing the error rate or the notion of false positives and false negatives, it is not clear what "identify 70% of the customers who will leave" really means. If I say nothing about the false positive rate, I can identify 100% of them simply by saying everyone will leave. So talking about the true positive rate only makes sense if we also talk about the false positive rate. (Chapter 7, Chapter 8)
9. Why choose one particular model? With modern toolkits, we can easily compare various models on the same data. (Chapter 4, Chapter 7, Chapter 8)
10. The V.P. of Customer Retention must sign off on the procedure, and has indicated that she will examine the procedure to see whether it makes sense (domain knowledge validation). However, ensembles of trees are black-box models. The proposal says nothing about how she is going to understand how the procedure makes its decisions. Given her desire, it would be better to sacrifice some accuracy to build a more comprehensible model. Once she is "on board," it may be possible to use less comprehensible techniques to achieve higher accuracy. (Chapter 3, Chapter 7, Chapter 12)
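Flaws 2 and 8 in the list above can be made concrete with a few lines of Python. The sketch first shows that a "model" predicting that everyone will leave reaches a true positive rate of 100% only by also reaching a false positive rate of 100%, and then contrasts ranking by churn probability with ranking by expected loss. The customer records, probabilities, and dollar values are hypothetical, and expected loss is simplified here to probability times the value at risk.

```python
def rates(actual, predicted):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    positives = sum(actual)
    negatives = len(actual) - positives
    return tp / positives, fp / negatives

# Flaw 8: predicting that everyone leaves catches every churner,
# but it also flags every customer who would have stayed.
actual = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]       # 1 = customer left
everyone_leaves = [1] * len(actual)
tpr, fpr = rates(actual, everyone_leaves)
print("TPR:", tpr, "FPR:", fpr)

# Flaw 2: rank by expected loss rather than by churn probability alone.
# Each record: (hypothetical id, estimated P(churn), hypothetical value at risk).
customers = [
    ("A", 0.90, 100.0),
    ("B", 0.40, 1000.0),
    ("C", 0.60, 300.0),
]
by_probability = [c[0] for c in sorted(customers, key=lambda c: c[1], reverse=True)]
by_expected_loss = [c[0] for c in sorted(customers, key=lambda c: c[1] * c[2], reverse=True)]
print("by probability:  ", by_probability)
print("by expected loss:", by_expected_loss)
```

With these made-up numbers the two rankings reverse: the customer most likely to leave is the least costly to lose, so a fixed incentive budget would be spent differently under each ranking.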
Glossary
Note: This glossary is an extension of one compiled by Ron Kohavi and Foster Provost (1998), used with kind permission from Springer Science and Business Media.
a priori
A priori is a term borrowed from philosophy meaning "prior to experience." In data science, an a priori belief is one that is brought to the problem as background knowledge, as opposed to a belief that is formed after examining data. For example, you might say, "There is no a priori reason to believe that this relationship is linear." After examining data you might decide that two variables have a linear relationship (and so linear regression should work fairly well), but there was no reason to believe, from prior knowledge, that they should be so related. The opposite of a priori is a posteriori.
Accuracy (error rate)
The rate of correct (incorrect) predictions made by the model over a data set (cf. coverage). Accuracy is usually estimated using an independent (holdout) data set that was not used at any time during the learning process. More complex accuracy estimation techniques, such as cross-validation and the bootstrap, are commonly used, especially with data sets containing a small number of instances.
Association mining
Techniques that find conjunctive implication rules of the form "X and Y → A and B" (associations) that satisfy given criteria.
Attribute (field, variable, feature)
A quantity describing an instance. An attribute has a domain defined by the attribute type, which denotes the values that can be taken by an attribute. The following domain types are common:
• Categorical (symbolic): A finite number of discrete values. The type nominal denotes that there is no ordering between the values, such as last names and colors. The type ordinal denotes that there is an ordering, such as in an attribute taking on the values low, medium, or high.
• Continuous (quantitative): Commonly, a subset of the real numbers, where there is a measurable difference between the possible values. Integers are usually treated as continuous in practical problems.
We do not differentiate in this book, but often the distinction is made that a feature is the specification of an attribute and its value. For example, color is an attribute. "Color is blue" is a feature of an example. Many transformations to the attribute set leave the feature set unchanged (for example, regrouping attribute values or transforming multivalued attributes to binary attributes). In this book we follow the practice of many practitioners and authors and use feature as a synonym for attribute.
Class (label)
One of a small, mutually exclusive set of labels used as possible values for the target variable in a classification problem. Labeled data has one class label assigned to each example. For example, in a dollar bill classification problem the classes could be legitimate and counterfeit. In a stock assessment task the classes might be will gain substantially, will lose substantially, and will maintain its value.
Classifier
A mapping from unlabeled instances to (discrete) classes. Classifiers have a form (e.g., a classification tree) plus an interpretation procedure (including how to handle unknown values, etc.). Most classifiers can also provide probability estimates (or other likelihood scores), which can be thresholded to yield a discrete class decision, thereby taking into account a cost/benefit or utility function.
Confusion matrix
A matrix showing the predicted and actual classifications. A confusion matrix is of size l × l, where l is the number of different label values. A variety of classifier evaluation metrics are defined based on the contents of the confusion matrix, including accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, sensitivity, specificity, positive predictive value, and negative predictive value.
Coverage
The proportion of a data set for which a classifier makes a prediction. If a classifier does not classify all the instances, it may be important to know its performance on the set of cases for which it is confident enough to make a prediction.
Cost (utility/loss/payoff)
A measurement of the cost to the performance task (and/or the benefit) of making a prediction ŷ when the actual label is y. The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications.
Cross-validation
A method for estimating the accuracy (or error) of an inducer by dividing the data into k mutually exclusive subsets (the "folds") of approximately equal size. The inducer is trained and tested k times. Each time it is trained on the data set minus one of the folds and tested on that fold. The accuracy estimate is the average accuracy over the k folds, or the accuracy on the combined ("pooled") testing folds.
Data cleaning/cleansing
The process of improving the quality of the data by modifying its form or content, for example by removing or correcting data values that are incorrect. This step usually precedes the modeling step, although a pass through the data mining process may indicate that further cleaning is desired and may suggest ways to improve the quality of the data.
Data mining
The term data mining is somewhat overloaded. It sometimes refers to the whole data mining process, and sometimes to the specific application of modeling techniques to data in order to build models or find other patterns/regularities.
Data set
A schema and a set of instances matching the schema. Generally, no ordering on instances is assumed. Most data mining work uses a single fixed-format table or collection of feature vectors.
Dimension
An attribute or several attributes that together describe a property. For example, a geographical dimension might consist of three attributes: country, state, city. A time dimension might include five attributes: year, month, day, hour, minute.
Error rate
See Accuracy (error rate).
Example
See Instance (example, case, record).
Feature
See Attribute (field, variable, feature).
Feature vector (record, tuple)
A list of features describing an instance.
Field
See Attribute.
i.i.d. sample
A set of independent and identically distributed instances.
Induction
Induction is the process of creating a general model (such as a classification tree or an equation) from a set of data. Induction may be contrasted with deduction: deduction starts with a general rule or model and one or more facts, and creates other specific facts from them. Induction goes in the other direction: induction takes a collection of facts and creates a general rule or model. In the context of this book, model induction is synonymous with learning or mining a model, and the rules or models are generally statistical in nature.
Instance (example, case, record)
A single object of the world from which a model will be learned, or on which a model will be used (e.g., for prediction). In most data science work, instances are described by feature vectors; some work uses more complex representations (e.g., containing relations between instances or between parts of instances).
KDD
Originally an abbreviation for Knowledge Discovery from Databases. It is now used to cover broadly the discovery of knowledge from data, and is often used synonymously with data mining.
Knowledge discovery
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This is the definition used in Advances in Knowledge Discovery and Data Mining, by Fayyad, Piatetsky-Shapiro, & Smyth (1996).
Loss
See Cost (utility/loss/payoff).
Machine learning
In data science, machine learning is most commonly used to mean the application of induction algorithms to data. The term is often used synonymously with the modeling stage of the data mining process. Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to learn.
Missing value
The situation where the value for an attribute is not known or does not exist. There are several possible reasons for a value to be missing, such as: it was not measured; there was an instrument malfunction; the attribute does not apply; or the attribute's value cannot be known. Some algorithms have problems dealing with missing values.
Model
A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction. Most inductive algorithms generate models that can then be used as classifiers, as regressors, as patterns for human consumption, and/or as input to subsequent stages of the data mining process.
Model deployment
The use of a learned model to solve a real-world problem. Deployment is often used specifically to contrast with the "use" of a model in the Evaluation stage of the data mining process. In the latter, deployment is usually simulated on data where the true answer is known.
OLAP (MOLAP, ROLAP)
Online Analytical Processing. Usually synonymous with MOLAP (multidimensional OLAP). OLAP engines facilitate the exploration of data along several (predetermined) dimensions. OLAP commonly uses intermediate data structures to store precalculated results on multidimensional data, allowing fast computations. ROLAP (relational OLAP) refers to performing OLAP using relational databases.
Record
See Feature vector (record, tuple).
Schema
A description of a data set's attributes and their properties.
Sensitivity
True positive rate (see Confusion matrix).
Specificity
True negative rate (see Confusion matrix).
Supervised learning
Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label). Most induction algorithms fall into the supervised learning category.
Tuple
See Feature vector (record, tuple).
Unsupervised learning
Learning techniques that group instances without a pre-specified target attribute. Clustering algorithms are usually unsupervised.
Utility
See Cost (utility/loss/payoff).
Bibliography
Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations and system approaches. Artificial Intelligence Communications, 7(1), 39–59. Available: http://www.iiia.csic.es/People/enric/AICom.html.
Adams, N. M., & Hand, D. J. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32, 1139–1147.
Aha, D. W. (Ed.). (1997). Lazy learning. Kluwer Academic Publishers, Norwell, MA, USA.
Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.
Aggarwal, C., & Yu, P. (2008). Privacy-Preserving Data Mining: Models and Algorithms. Springer, USA.
Aral, S., Muchnik, L., & Sundararajan, A. (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), 21544–21549.
Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035.
Attenberg, J., Ipeirotis, P., & Provost, F. (2011). Beat the machine: Challenging workers to find the unknown unknowns. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence.
Attenberg, J., & Provost, F. (2010). Why label when you can search?: Alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 423–432. ACM.
Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Bolton, R., & Hand, D. (2002). Statistical fraud detection: A review. Statistical Science, 17(3), 235–255.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Wadsworth International Group, Belmont, CA.
Brooks, D. (2013). What data can't do. New York Times, February 18.
Brown, L., Gans, N., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S., & Zhao, L. (2005). Statistical analysis of a telephone call center: A queueing-science perspective. Journal of the American Statistical Association, 100(469), 36–50.
Brynjolfsson, E., & Smith, M. (2000). Frictionless commerce? A comparison of Internet and conventional retailers. Management Science, 46, 563–585.
Brynjolfsson, E., Hitt, L. M., & Kim, H. H. (2011). Strength in numbers: How does data-driven decision making affect firm performance? Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486 or http://dx.doi.org/10.2139/ssrn.1819486.
Business Insider (2012). The Digital 100: The world's most valuable private tech companies. http://www.businessinsider.com/2012-digital-100.
Ciccarelli, F. D., Doerks, T., Von Mering, C., Creevey, C. J., Snel, B., & Bork, P. (2006). Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765), 1283–1287.
Clearwater, S., & Stern, E. (1991). A rule-learning program in high energy physics event classification. Comp Physics Comm, 67, 159–182.
Clemons, E., & Thatcher, M. (1998). Capital One: Exploiting an information-based strategy. In Proceedings of the 31st Hawaii International Conference on System Sciences.
Cohen, L., Diether, K., & Malloy, C. (2012). Legislating stock prices. Harvard Business School Working Paper, No. 13–010.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 13(1), 21–27.
Crandall, D., Backstrom, L., Cosley, D., Suri, S., Huttenlocher, D., & Kleinberg, J. (2010). Inferring social ties from geographic coincidences. Proceedings of the National Academy of Sciences, 107(52), 22436–22441.
Deza, E., & Deza, M. (2006). Dictionary of distances. Elsevier Science.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923.
Dietterich, T. G. (2000). Ensemble methods in machine learning. Multiple Classifier Systems, 1–15.
Duhigg, C. (2012). How companies learn your secrets. New York Times, February 19.
Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007). Duplicate record detection: A survey. Knowledge and Data Engineering, IEEE Transactions on, 19(1), 1–16.
Evans, R., & Fisher, D. (2002). Using decision tree induction to minimize process delays in the printing industry. In Klosgen, W., & Zytkow, J. (Eds.), Handbook of Data Mining and Knowledge Discovery, pp. 874–881. Oxford University Press.
Ezawa, K., Singh, M., & Norton, S. (1996). Learning goal oriented Bayesian networks for telecommunications risk management. In Saitta, L. (Ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp. 139–147. San Francisco, CA. Morgan Kaufmann.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
Fawcett, T., & Provost, F. (1996). Combining data mining and machine learning for effective user profiling. In Simoudis, Han, & Fayyad (Eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 8–13. Menlo Park, CA. AAAI Press.
Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3), 291–316.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37–54.
Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository.
Friedman, J. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1), 55–77.
Gandy, O. H. (2009). Coming to Terms with Chance: Engaging Rational Discrimination and Cumulative Disadvantage. Ashgate Publishing Company.
Goldfarb, A., & Tucker, C. (2011). Online advertising, behavioral targeting, and privacy. Communications of the ACM, 54(5), 25–27.
Haimowitz, I., & Schwartz, H. (1997). Clustering and prediction for credit line optimization. In Fawcett, Haimowitz, Provost, & Stolfo (Eds.), AI Approaches to Fraud Detection and Risk Management, pp. 29–33. AAAI Press. Available as Technical Report WS-97-07.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1).
Hand, D. J. (2008). Statistics: A Very Short Introduction. Oxford University Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition). Springer.
Hays, C. L. (2004). What they know about you. The New York Times.
Hernández, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. SIGMOD Rec., 24, 127–138.
Hill, S., Provost, F., & Volinsky, C. (2006). Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, 21(2), 256–276.
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–91.
Ipeirotis, P., Provost, F., & Wang, J. (2010). Quality management on Amazon Mechanical Turk. In Proceedings of the 2010 ACM SIGKDD Workshop on Human Computation, pp. 64–67. ACM.
Jackson, M. (1989). Michael Jackson's Malt Whisky Companion: A Connoisseur's Guide to the Malt Whiskies of Scotland. Dorling Kindersley, London.
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–450.
Japkowicz, N., & Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press.
Jensen, D. D., & Cohen, P. R. (2000). Multiple comparisons in induction algorithms. Machine Learning, 38(3), 309–338.
Junqué de Fortuny, E., Martens, D., & Provost, F. (2013). Predictive modeling with big data: Is bigger really better? Big Data, published online October 2013: http://online.liebertpub.com/doi/abs/10.1089/big.2013.0037
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119–127.
Kaufman, S., Rosset, S., Perlich, C., & Stitelman, O. (2012). Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4), 15.
Kohavi, R., Brodley, C., Frasca, B., Mason, L., & Zheng, Z. (2000). KDD-cup 2000 organizers' report: Peeling the onion. ACM SIGKDD Explorations, 2(2).
Kohavi, R., Deng, A., Frasca, B., Longbotham, R., Walker, T., & Xu, Y. (2012). Trustworthy online controlled experiments: Five puzzling outcomes explained. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 786–794. ACM.
Kohavi, R., & Longbotham, R. (2007). Online experiments: Lessons learned. Computer, 40(9), 103–105.
Kohavi, R., Longbotham, R., Sommerfield, D., & Henne, R. (2009). Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1), 140–181.
Kohavi, R., & Parekh, R. (2003). Ten supplementary analyses to improve e-commerce web sites. In Proceedings of the Fifth WEBKDD Workshop.
Kohavi, R., & Provost, F. (1998). Glossary of terms. Machine Learning, 30(2-3), 271–274.
Kolodner, J. (1993). Case-Based Reasoning. Morgan Kaufmann, San Mateo.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, doi: 10.1073/pnas.1218772110.
Lapointe, F.-J., & Legendre, P. (1994). A classification of pure malt Scotch whiskies. Applied Statistics, 43(1), 237–257.
Leigh, D. (1995). Neural networks for credit scoring. In Goonatilake, S., & Treleaven, P. (Eds.), Intelligent Systems for Finance and Business, pp. 61–69. John Wiley and Sons Ltd., West Sussex, England.
Letunic, & Bork (2006). Interactive tree of life (iTOL): An online tool for phylogenetic tree display and annotation. Bioinformatics, 23(1).
Lin, J.-H., & Vitter, J. S. (1994). A theory for memory-based learning. Machine Learning, 17, 143–167.
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms, Chapter 20. An Example Inference Task: Clustering. Cambridge University Press.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press.
Malin, B., & Sweeney, L. (2004). How (not) to protect genomic data privacy in a distributed network: Using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics, 37(3), 179–192.
Martens, D., & Provost, F. (2011). Pseudo-social network targeting from consumer transaction data. Working Paper CeDER-11-05, New York University – Stern School of Business.
McCallum, A., & Nigam, K. (1998). A comparison of event models for Naive Bayes text classification. In AAAI Workshop on Learning for Text Categorization.
McDowell, G. (2008). Cracking the Coding Interview: 150 Programming Questions and Solutions. CareerCup LLC.
McNamee, M. (2001). Credit card revolutionary. Stanford Business 69(3).
McPherson, M., Smith-Lovin, L., & Cook, J. M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27:415–444.
Mittermayer, M., & Knolmayer, G. (2006). Text mining systems for market response to news: A survey. Working Paper No. 184, Institute of Information Systems, University of Bern.
Muoio, A. (1997). They have a better idea ... do you? Fast Company, 10.
Nissenbaum, H. (2010). Privacy in context. Stanford University Press.
Papadopoulos, A. N., & Manolopoulos, Y. (2005). Nearest Neighbor Search: A Database Perspective. Springer.
Pennisi, E. (2003). A tree of life. Available online only: http://www.sciencemag.org/site/feature/data/tol/.
Perlich, C., Provost, F., & Simonoff, J. (2003). Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4, 211–255.
Perlich, C., Dalessandro, B., Stitelman, O., Raeder, T., & Provost, F. (2013). Machine learning for targeted display advertising: Transfer learning in action. Machine Learning (in press; published online: 30 May 2013. DOI 10.1007/s10994-013-5375-2).
Poundstone, W. (2012). Are You Smart Enough to Work at Google?: Trick Questions, Zen-like Riddles, Insanely Difficult Puzzles, and Other Devious Interviewing Techniques You Need to Know to Get a Job Anywhere in the New Economy. Little, Brown and Company.
Provost, F., & Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), pp. 43–48. Menlo Park, CA. AAAI Press.
Provost, F., & Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42(3), 203–231.
Provost, F., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. In Shavlik, J. (Ed.), Proceedings of ICML-98, pp. 445–453. San Francisco, CA. Morgan Kaufmann.
Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann.
Quine, W. V. O. (1951). Two dogmas of empiricism. The Philosophical Review 60: 20–43. Reprinted in his 1953 From a Logical Point of View. Harvard University Press.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Raeder, T., Dalessandro, B., Stitelman, O., Perlich, C., & Provost, F. (2012). Design principles of massive, robust prediction systems. In Proceedings of the 18th Annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Rosset, S., & Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030.
Schumaker, R., & Chen, H. (2010). A discrete stock price prediction engine based on financial news keywords. IEEE Computer, 43(1), 51–56.
Sengupta, S. (2012). Facebook's prospects may rest on trove of data.
Shakhnarovich, G., Darrell, T., & Indyk, P. (Eds., 2005). Nearest-Neighbor Methods in Learning and Vision. Neural Information Processing Series. The MIT Press, Cambridge, Massachusetts, USA.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Shearer, C. (2000). The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing, 5(4), 13–22.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310.
Silver, N. (2012). The Signal and the Noise. The Penguin Press HC.
Solove, D. (2006). A taxonomy of privacy. University of Pennsylvania Law Review, 154(3), 477–564.
Stein, R. M. (2005). The relationship between default prediction and lending profits: Integrating ROC analysis and loan pricing. Journal of Banking and Finance, 29, 1213–1236.
Sugden, A. M., Jasny, B. R., Culotta, E., & Pennisi, E. (2003). Charting the evolutionary history of life. Science, 300(5626).
Swets, J. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.
Swets, J. A. (1996). Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Lawrence Erlbaum Associates, Mahwah, NJ.
Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Better decisions through science. Scientific American, 283, 82–87.
Tambe, P. (2013). Big Data Know-How and Business Value. Working Paper, NYU Stern. Available: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2294077.
WEKA (2001). Weka machine learning software. Available: http://www.cs.waikato.ac.nz/~ml/index.html.
Wikipedia (2012). Determining the number of clusters in a data set. Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set [Online; accessed 14 February 2013].
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. Available: http://sci2s.ugr.es/keel/pdf/algorithm/articulo/wilcoxon1945.pdf.
Winterberry Group (2010). Beyond the grey areas: Transparency, brand safety and the future of online advertising. White Paper, Winterberry Group LLC. http://www.winterberrygroup.com/ourinsights/wp
Wishart, D. (2006). Whisky Classified: Choosing Single Malts by Flavours. Pavilion.
Witten, I., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco. Software available from http://www.cs.waikato.ac.nz/~ml/weka/.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 903–910.
Zadrozny, B., & Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 204–213. ACM.
Index
Symbols
2-D Gaussian distributions, 301
"and" operator, 240
A
A Taxonomy of Privacy (Solove), 344
Aberfeldy single malt whisky, 179
Aberlour single malt whisky, 146
absolute errors, 96
accuracy (term), 189
accuracy results, 128
ACM SIGKDD, 320, 344
advertising functions, variables, 22. against, 316
  historical advantages of, 319
analysis
  counterfactual, 23
  learning curves and, 132
analytic engineering, 279–289
  churn example, 283–289
  expected value decomposition and, 286–289
  incentives, assessing influence of, 285–286
  providing structure for business problem/solutions with, 280–282
  selection bias, 282–283
  targeting best prospects with, 280–283
analytic skills, software skills vs., 35
analytic solutions, 14
analytic techniques, 35–41, 187–208
  applying to business questions, 40–41
  baseline performance and, 204–207
  classification accuracy, 189–194
  confusion matrix, 189–190
  data warehousing, 38
  database queries, 37–38
  expected values, 194–204
  generalization methods for, 193–194
  machine learning and, 39–40
  OLAP, 38
  regression analysis, 39
  statistics, 35–37
analytic technologies, 29
analytic tools, 113
Angry Birds, 247
Annie Hall (film), 307
Apollo 13 (film), 325
Apple Computer, 175–178, 270
applications, 1, 187
area under ROC curves (AUC), 219, 225, 226
Armstrong, Louis, 261
assessing overfitting, 113
association discovery, 292–298
  among Facebook Likes, 295–298
  beer and lottery example, 294–295
  eWatch/eJewelry example, 292–293
  Magnum Opus system for, 296
  market basket analysis, 295–298
  surprisingness, 293–294
attribute selection, 43, 49–67, 56–62, 334
attributes, 46
  finding, 43
  heterogeneous, 156, 157
  variable features vs., 46
Audubon Society Field Guide to North American Mushrooms, 57
average customers, profitable customers vs., 40
B
bag of words approach, 254
bags, 254
base rates, 98, 115, 190
baseline classifiers, 244
baseline methods, of data science, 249
Basie, Count, 261
Bayes rate, 309
Bayes, Thomas, 238
Bayesian methods, 2428
Bayes' Rule, 237–246
beer and lottery example, 294–295
Beethoven, Ludwig van, 247
beginning cross-validation, 127
behavior description, 22
Being John Malkovich (film), 307
Bellkors Pragmatic Chaos (Netflix Challenge team), 305
benefit improvement, calculating, 203
benefits
  and underlying profit calculation, 214
  data-driven decision-making, 5
  estimating, 199
  in budgeting, 210
  nearest-neighbor methods, 157
bi-grams, 265
bias errors, ensemble methods and, 308–311
Big Data
  data science and, 7–8
  evolution of, 8–9
  on Amazon and Google, 316
big data technologies, 8
  state of, 8
  utilizing, 8
Big Red proposal example, 327–329
Bing, 252, 253
Black-Sholes model, 44
blog postings, 252
blog posts, 234
Borders (book retailer), 318
breast cancer example, 103–107
Brooks, David, 340
browser cookies, 234
Brubeck, Dave, 261
Bruichladdich single malt scotch, 179
Brynjolfsson, Erik, 5, 8
budget, 210
budget constraints, 213
building modeling labs, 127
building models, 25, 28, 127
business news stories example, 175–178
business problems
  changing definition of, to fit available data, 339–340
  data exploration vs., 183–185
  engineering problems vs., 291
  evaluating in a proposal, 326
  expected value framework, structuring with, 283–285
  exploratory data mining vs., 334
  unique context of, 342
  using expected values to provide framework for, 280–282
business strategy, 315–331
  accepting creative ideas, 326
  case studies, examining, 325
  competitive advantages, 317–318, 318–323
  data scientists, evaluating, 320–322
  evaluating proposals, 326–329
  historical advantages and, 319
  intangible collateral assets and, 320
  intellectual property and, 319
  managing data scientists effectively, 322–323
  maturity of the data science, 329–331
  thinking data-analytically for, 315–317
C
Caesars Entertainment, 11
call center example, 299–301
Capability Maturity Model, 330
Capital One, 11, 288
Case-Based Reasoning, 151
cases
  creating, 32
  ranking vs. classifying, 209–231
causal modeling, 23
causal analysis, 286
causal explanation, 311
causal radius, 269
causation, correlation vs., 178
cellular churn example
  unbalanced classes in, 190
  unequal costs and benefits in, 193
Census Bureau Economic Survey, 36
centroid locations, 173
centroid-based clustering, 175
centroids, 170–175, 175–178
characteristics, 41
characterizing customers, 41
churn, 4, 14, 191
  and expected value, 198
  finding variables, 15
  performance analytics for modeling, 223–231
churn prediction, 317
Ciccarelli, Francesca, 168
class confusion, 189
class labels, 102–103
class membership, estimating likelihood of, 235
class priors, 201, 214, 219, 222
class probability, 2, 37-210
classes
  exhaustive, 242
  mutually exclusive, 242
  probability of evidence given, 241
  separating, 123
classification, 2, 20, 141
  Bayes' Rule for, 239
  building models for, 28
  ensemble methods and, 308
  neighbors and, 147
  regression and, 21
  supervised data mining and, 25
classification accuracy
  confusion matrix, 189–190
  evaluating, with expected values, 196–198
  measurability of, 189
  unbalanced classes, 190–193
  unequal costs/benefit ratios, 193
classification function, 86
classification modeling, 193
classification tasks, 21
classification trees, 63
  as sets of rules, 71–71
  ensemble methods and, 311
  in KDD Cup churn problem, 224–231
  inducing, 67
  logistic regression and, 129
  predictive models and, 63
  visualizing, 67–69
classifier accuracy, 189
classifiers
  and ROC graphs, 216–217
  baseline, 244
  confusion matrix produced by, 210–211
  conservative, 216
  cumulative response curves of, 220–221
  discrete (binary), 217
  inability to obtain accurate probability estimates from, 210
  lift of, 220
  linear, 85
  Naive Bayes, 242
  operating conditions of, 219
  performance de-coupled from conditions for, 218
  permissive, 216
  plus thresholds, 210
  random, 213
  scores given to instances by, 210–211
climatology, 205
clipping dendrograms, 167
cloud labor, 346
clumps of instances, 119
cluster centers, 170
cluster distortion, 173
clustering, 21, 163–183, 251
  algorithm, 1717
  business news stories example, 175–178
  centroid-based, 175
  creating, 167
  data preparation for, 175–176
  hierarchical, 165–170
  indicating, 165
  interpreting results of, 178–180
  nearest neighbors and, 170–175
  profiling and, 299
  soft, 303
  supervised learning and, 180–183
  whiskey example, 164–166
clusters, 141, 179
co-occurrence grouping, 21–22, 292–298
  beer and lottery example, 294–295
  eWatch/eJewelry example, 292–293
  surprisingness, 293–294
Coelho, Paul, 247
cognition, 40
Coltrane, John, 261
combining functions, 147, 162–163
common tasks, 19–23, 19
communication, between scientists and business people, 322, 335
company culture, as intangible asset, 320
comparisons, multiple, 139–139
complex functions, 118, 123
complexity, 131
complexity control, 133–138, 136
  ensemble method and, 310
  nearest-neighbor reasoning and, 151–153
complications, 50
comprehensibility, of models, 31
computing errors, 96
computing likelihood, 102
conditional independence
  and Bayes' Rule, 238
  unconditional vs., 241
conditional probability, 236
conditioning bar, 236
confidence, in association mining, 293
confusion matrix
  and points in ROC space, 217
  evaluating models with, 189–190
  expected value corresponding to, 212
  produced by classifiers, 210–211
  true positive and false negative rates for, 215
constraints
  budget, 213
  workforce, 214
consumer movie-viewing preferences example, 304
consumer voice, 9
consumers, describing, 234–235
content pieces, online consumer targeting based on, 234
context, importance of, 253
control group, evaluating data models with, 328
converting data, 30
cookies, browser, 234
corpus, 253
correlations, 20, 37
  causation vs., 178
  general-purpose meaning, 37
  specific technical meaning, 37
cosine distance, 160, 160
cosine similarity, 160
Cosine Similarity function, 261
cost matrix, 212
cost-benefit matrix, 199, 200, 203
costs
  and underlying profit calculation, 214
  estimating, 199
  in budgeting, 210
  of data, 28
counterfactual analysis, 23
Cray Computer Corporation, 272
credit-card transactions, 29, 298
creditworthiness model, as example of selection bias, 282
CRISP cycle, 34
  approaches and, 34
  strategy and, 34
CRISP-DM, 14, 26
Cross Industry Standard Process for Data Mining (CRISP), 14, 26–34, 26
  business understanding, 27–28
  data preparation, 29–30
  data understanding, 28–29
  deployment, 32–34
  evaluation, 31–32
  modeling, 31
  software development cycle vs., 34–35
cross-validation, 126, 140
  beginning, 127
  datasets and, 126
  nested, 135
  overfitting and, 126–129
cumulative response curves, 219–222
curse of dimensionality, 156
customer churn example
  analytic engineering example, 283–289
  and data firm maturity, 331
customer churn, predicting, 4
  with cross-validation, 129–129
  with tree induction, 73–78
customer retention, 4
customers, characterizing, 41
D
data
  as a strategic asset, 11
  converting, 30
  cost, 28
  holdout, 113
  investment in, 288
  labeled, 47
  objective truth vs., 341
  obtaining, 288
  training, 45, 47
data analysis, 4, 20
data exploration, 183–185
data landscape, 167
data mining, 19–42
  and Bayes' Rule, 240
  applying, 40–41, 48
  as strategic component, 12
  CRISP codification of, 26–34
  data science and, 2, 14–15
  domain knowledge and, 156
  stages, 14
  structuring projects, 19
  supervised vs. unsupervised methods of, 24–25
  systems, 33
  tasks, fitting business problems to, 19–23, 19
  techniques, 33
Data Mining (field), 40
data mining algorithms, 20
data mining proposal example, 327–329
data preparation, 30, 251
data preprocessing, 272–273
data processing technologies, 7
data processing, data science vs., 7–8
data reduction, 22–23, 304–308
data requirements, 29
data science, 1–17, 315–331, 333–347
  and adding value to applications, 187
  as craft, 321
  as strategic asset, 9–12
  baseline methods of, 249
  behavior predictions based on past actions, 3
  Big Data and, 7–8
  case studies, examining, 325
  classification modeling for issues in, 193
  cloud labor and, 345–346
  customer churn, predicting, 4
  data mining about individuals, 343–344
  data mining and, 2, 14–15
  data processing vs., 7–8
  data science engineers, 34
  data-analytic thinking in, 12–13
  data-driven business vs., 7
  data-driven decision-making, 4–7
  engineering, 4–7
  engineering and, 15
  evolving uses for, 8–9
  fitting problem to available data, 339–340
  fundamental principles, 2
  history, 39
  human interaction and, 340–343
  human knowledge and, 340–343
  Hurricane Frances example, 3
  learning path for, 321
  limits of, 340–343
  mining mobile device data example, 336–339
  opportunities for, 1–3
  principles, 4, 19
  privacy and ethics of, 343–344
  processes, 4
  software development vs., 330
  structure, 39
  techniques, 4
  technology vs. theory of, 15–16
  understanding, 2, 7
data science maturity, of firms, 329–331
data scientists
  academic, 324
  as scientific advisors, 324
  attracting/nurturing, 323–325
  evaluating, 320–322
  managing, 322–323
Data Scientists, LLC, 325
data sources
data understanding, 28–29
  expected value decomposition and, 286–289
  expected value framework and, 283–285
data warehousing, 38
data-analytic thinking, 12–13
  and unbalanced classes, 190
  for business strategies, 315–317
data-driven business
  data science vs., 7
  understanding, 7
data-driven causal explanations, 311–312
data-driven decision-making, 4–7
  benefits, 5
  discoveries, 6
  repetition, 6
database queries, as analytic technique, 37–38
database tables, 47
dataset entropy, 46
Dictionary of Distances (Deza & Deza), 159
differential descriptions, 183
Digital 100 companies, 12
Dillman, Linda, 6
dimensionality, of nearest-neighbor reasoning, 156–157
directed marketing example, 280–283
discoveries, 6
discrete (binary) classifiers, 217
discrete classifiers, 215
discretized numeric variables, 56
discriminants, linear, 86
discriminative modeling methods, generative vs., 248
disorder, measuring, 51
display advertising, 233
distance functions, for nearest-neighbor reasoning, 158–161
distance, measuring, 143
distribution
  Gaussian, 96
  Normal, 96
distribution of properties, 56
Doctor Who (television show), 247
document (term), 253
domain knowledge
  data mining processes and, 156
  nearest-neighbor reasoning and, 156–157
domain knowledge validation, 298
domains, in association discovery, 298
Dotcom Boom, 275, 319
double counting, 203
draws, statistical, 103
E
edit distance, 161, 161
Einstein, Albert, 333
Elder Research, 324
Ellington, Duke, 259, 261
email, 252
engineering, 15, 28
engineering problems, business problems vs., 291
ensemble method, 308–311
entropy, 51, 58, 78
  and Inverse Document Frequency, 263
  change in, 52
  equation for, 51
  graphs, 58
equations
  cosine distance, 160
  entropy, 51
  Euclidean distance, 144
  general linear model, 86
  information gain (IG), 53
  Jaccard distance, 159
  L2 norm, 158
  log-odds linear function, 100
  logistic function, 101
  majority scoring function, 162
  majority vote classification, 162
  Manhattan distance, 159
  similarity-moderated classification, 162
  similarity-moderated regression, 163
  similarity-moderated scoring, 163
error costs, 219
error rates, 189, 198
errors
  absolute, 96
  computing, 96
  false negative vs. false positive, 189
  squared, 95
estimating generalization performance, 126
estimation, frequency based, 72
ethics of data mining, 343–344
Euclid, 144
Euclidean distance, 144
evaluating models, 187–208
  baseline performance, 204–207
  classification accuracy, 189–194
  confusion matrix, 189–190
  expected values, 194–204
  generalization methods for, 193–194
  procedure, 329
evaluating training data, 113
evaluation
  in vivo, 32
  purpose, 31
evaluation framework, 32
events
  calculating probability of, 236–236
  independent, 236–237
evidence
  computing probability from, 238, 239
  determining strength of, 235
  likelihood of, 240
  strongly dependent, 243
evidence lifts
  Facebook "Likes" example, 246–248
  modeling, with Naive Bayes, 244–246
eWatch/eJewelry example, 292–293
examining clusters, 179
examples, 46
  analytic engineering, 280–289
  associations, 295–298
  beer and lottery association, 294–295
  biases in data, 341
  Big Red proposal, 327–329
  breast cancer, 103–107
  business news stories, 175–178
  call center metrics, 299–301
  cellular churn, 190, 193
  centroid-based clustering, 170–175
  cloud labor, 345–346
  clustering, 163–183
  consumer movie-viewing preferences, 304
  cooccurrence/association, 292–293, 294–295
  cross-validation, 126–129
  customer churn, 4, 73–78, 331
  data mining proposal evaluation, 327–329
  data-driven causal explanations, 311–312
  detecting credit-card fraud, 298
  directed marketing, 280–283
  evaluating proposals, 353–355
  evidence lift, 246–248
  eWatch/eJewelry, 292–293
  Facebook "Likes", 246–248, 295–298
  Green Giant Consulting, 353–355
  Hurricane Frances, 3
  information gain, attribute selection with, 56–62
  iris overfitting, 89, 119–123
  jazz musicians, 258–262
  junk email classifier, 243
  market basket analysis, 295–298
  mining linear discriminants from data, 89–110
  mining mobile device data, 336–339
  mining news stories, 268–276
  mushroom, 56–62
  Naive Bayes, 248
  nearest-neighbor reasoning, 145–147
  overfitting linear functions, 119–123
  overfitting, performance degradation and, 124–126
  PEC, 233–235
  profiling, 298, 299–301
  stock price movement, 268–276
  supervised learning to generate cluster descriptions, 180–183
  targeted ad, 233–235, 248, 343
  text representation tasks, 258–262, 268–276
  tree induction vs. logistic regression, 103–107
  viral marketing, 311–312
  whiskey analytics, 145–147
  whiskey clustering, 164–166
  Whiz-bang widget, 327–329
  wireless fraud, 341
exhaustive classes, 242
expected profit, 212–214
  and relative levels of costs and benefits, 214
  calculation of, 198
  for classifiers, 193
  uncertainty of, 215
expected value
  calculation of, 265
  general form, 194
  in aggregate, 197
  negative, 210
expected value framework, 334
  providing structure for business problem/solutions with, 280–282
  structuring complicated business problems with, 283–285
expected values, 194–204
  cost-benefit matrix and, 199–204
  decomposition of, moving to data science solution with, 286–289
  error rates and, 198
  framing classifier evaluation with, 196–198
  framing classifier use with, 195–196
explanatory variables, 47
exploratory data mining vs. defined problems, 334
extract patterns, 14
F
Facebook, 11, 252, 317
  online consumer targeting by, 234
  "Likes" example, 246–248
Fairbanks, Richard, 9
false alarm rate, 216, 217
false negative rate, 203
false negatives, 189, 190, 193, 200
false positive rate, 203, 216–219
false positives, 189, 190, 193, 200
feature vectors, 46
features, 46, 47
Federer, Roger, 247
Fettercairn single malt scotch, 179
Fight Club, 247
financial markets, 268
firmographic data, 21
first-layer models, 108
fitting, 102, 113–115, 126, 131, 140, 225–226
folds, 127, 129
fraud detection, 29, 214, 317
free Web services, 233
frequency, 256
frequency-based estimates, 72, 73
functions
  adding variables to, 123
  classification, 86
  combining, 147
  complex, 118, 123
  kernel, 108
  linkage, 167
  log-odds, 100
  logistic, 101
  loss, 95–96
  objective, 110
fundamental ideas, 6
G
Gaussian distribution, 96, 299
Gaussian Mixture Model (GMM), 302
GE Capital, 185
generalization, 116, 334
  mean of, 126, 140
  overfitting and, 111–112
  variance of, 126, 140
generalization performance, 121
generalizations, incorrect, 124
generative modeling methods, discriminative vs., 248
generative questions, 240
geometric interpretation, nearest-neighbor reasoning and, 151–153
Gillespie, Dizzie, 261
Gini Coefficient, 219
Glen Albyn single malt scotch, 181
Glen Grant single malt scotch
Glen Mhor single malt whisky, 179
Glen Spey single malt whisky, 179
Glenfiddich single malt whisky, 179
Glenglassaugh single malt whisky, 169
Glengoyne single malt whisky, 181
Glenlossie single malt whisky, 179
Glentauchers single malt whisky, 179
Glenugie single malt whisky, 179
goals, 88
Goethe, Johann Wolfgang von, 1
Goodman, Benny, 261
Google, 252, 253, 323
  Prediction API, 316
  search advertising on, 233
Google Finance, 270
Google Scholar, 345
Graepel, Thore, 246–246
graphical user interface (GUI), 37
graphs
  entropy, 58
  fitting, 126, 140
Green Giant Consulting example, 353–355
GUI, 37
H
Haimowitz, Ira, 185
Harrah's casinos, 7, 11
hashing methods, 157
heterogeneous attributes, 156
Hewlett-Packard, 141, 175, 266
hierarchical clustering, 165–170
Hilton, Perez, 272
hinge loss
history, 39
hit rate, 216, 220
holdout data, 113
  creating, 113
  overfitting and, 113–115
holdout evaluations, of overfitting, 126
holdout testing, 126
homogenous regions, 83
homographs, 253
How I Met Your Mother (television show), 247
Howls Moving Castle, 247
human interaction and data science, 340–343
Hurricane Frances example, 3
hyperplanes, 69, 86
hypotheses, computing probability of, 238
hypothesis generation, 37
hypothesis testing, 133
I
IBM, 141, 179, 323, 324
IEEE International Conference on Data Mining, 344
immature data firms, 330
impurity, 50
in vivo evaluation, 32
in-sample accuracy, 114
Inception (film), 247
incorrect generalizations, 124
incremental learning, 243
independence
  and evidence lift, 246
  in probability, 236–237
  unconditional vs. conditional, 241
independent events, probability of, 236–237
independent variables, 47
indices, 174
induction, deduction vs., 47
inferring missing values, 30
influence, 23
information
  judging, 48
  measuring, 52
information gain (IG), 51, 78
informative variables, selecting, 49
instance scoring, 188
instances, 46
  clumping, 119
  comparing, with evidence lift, 246
  for targeting online consumers, 234
intangible collateral assets, 320
intellectual property, 319
intelligence test score, 247–248
intelligent methods, 418
intelligibility
Internet, 252
inverse document frequency (IDF), 256–257
  and entropy, 263–277
  in TFIDF, 258
  term frequency, combining with, 258
investments in data, evaluating, 204–207
iPhone, 177, 287
IQ, evidence lifts for, 247–248
iris example
  for overfitting linear functions, 119–123
  mining linear discriminants from data, 89–110
iTunes, 22, 178
J
Jaccard distance (equation), 159
Jackson, Michael, 145
jazz musicians example, 258–262
Jobs, Steve, 176, 343
joint probability, 236–237
judging information, 48
judgments, email sorter example, spam example 231, 14
K
k-means algorithm, 170, 172
KDD Cup, 320
kernel function, 108
kernels, polynomial, 108
Kerouac, Jack, 256
Knowledge Discovery and Data Mining (KDD), 40
  analytic techniques for, 39–40
  data mining competition (2009), 223–231
knowledge extraction, 335
Kosinski, Michal, 246–246
L
L2 norm (equation), 158
labeled data, 47
labels, 24
Ladyburn single malt scotch, 179
Laphroaig single malt scotch, 179
Lapointe, François-Joseph, 145, 169, 179
Latent Dirichlet Allocation, 267
latent information, 304–308
  consumer movie-viewing preferences example, 304
  weighted scoring, 307
latent information model, 268
Latent Semantic Indexing, 267
learning
  incremental, 243
  machine, 39–40
  parameter, 81
  supervised, 24
  unsupervised, 24
learning curves, 126, 140
  analytical use, 132
  fitting graphs and, 131
  logistic regression, 131
  overfitting vs., 130–132
  tree induction, 131
least squares regression, 96, 97
Legendre, Pierre, 145, 169, 179
leverage, 293–294
Lie to Me (television show), 247
lift, 244, 293–294, 335
lift curves, 219–222, 228–229
likelihood, computing, 102
likely responders, 195
Likes, Facebook
limited datasets, 126
linear boundaries, 122
linear classifiers, 83, 85
  linear discriminant functions and, 85–88
  objective functions, optimizing, 88
  parametric modeling and, 83
  support vector machines, 92–94
linear discriminants, 86
  functions for, 85–88
  mining, from data, 89–110
  scoring/ranking instances of, 91
  support vector machines and, 92–94
linear estimation, logistic regression and, 99
linear models, 82
linear regression, standard, 96
linguistic structure, 252
link prediction, 22, 303–304
linkage functions, 167
Linkwood single malt scotch, 181
local demand, 3
location visitation behavior of mobile devices, 338
log-normal distribution, 301
log-odds, 99
log-odds linear function, 100
logistic function, 101
logistic regression, 88, 97–107, 119
  breast cancer example, 103–107
  classification trees and, 129
  in KDD Cup churn problem, 224–231
  learning curves for, 131
  linear estimation and, 99
  mathematics of, 100–103
  tree induction vs., 103–107
  understanding, 98
Lord Of The Rings, 247
loss functions, 95–96
Lost (television series), 247
M
machine learning
  analytic techniques for, 39–40
  methods, 39
Magnum Opus, 296
majority classifiers, 205
majority scoring function (equation), 162
majority vote classification (equation), 162
majority voting, 150
Manhattan distance (equation), 159
Mann-Whitney-Wilcoxon measure, 219
margin-maximizing boundary, 93
margins, 92
market basket analysis, 295–298
Massachusetts Institute of Technology (MIT), 5, 343
mathematical functions, overfitting in, 118–119
matrix factorization, 308
maximizing objective functions, 136
maximizing the margin, 93
maximum likelihood model, 299
McCarthy, Cormac, 256
McKinsey and Company, 13
mean generalization, 126, 140
Mechanical Turk, 345
Medicare fraud, detecting, 29
Michael Jackson's Malt Whisky Companion (Jackson), 145
Microsoft, 255, 323
Mingus, Charles, 261
missing values, 30
mobile devices
  location of, finding, 336
  mining data from, 336–339
model accuracy, 114
model building, test data and, 134
model evaluation and classification, 190
model induction, 47
model intelligibility, 155
model performance, visualizing, 209–231
  area under ROC curves, 219
  cumulative response curves, 219–222
  lift curves, 219–222
  profit curves, 212–214
  ranking vs. classifying cases, 209–231
model types, 44
  Black-Sholes option pricing, 44
  descriptive, 46
  predictive, 45
modelers, 118
modeling algorithms, 135, 328
modeling labs, 127
models
  comprehensibility, 31
  creating, 47
  first-layer, 108
  fitting to data, 82, 334
  linear, 82
  parameterizing, 81
  problems, 72
  producing, 127
  second-layer, 108
  structure, 81
  table, 112
  understanding types of, 67
  worsening, 124
modifiers (of words), 276
Monk, Thelonius, 261
Moonstruck (film), 307
Morris, Nigel, 9
multiple comparisons, 139–139
multisets, 254
mushroom example, 56–62
mutually exclusive classes, 242
N
n-gram sequences, 265
Naive Bayes, 241–242 advantages/disadvantages, 243–244 conditional independence and, 241–246 in KDD Cup churn problem, 224–231 modeling data generated by, 244–246 reporting, 243 example of targeted advertising, 248 Naive-Naive Bayes, 245–246 named entity extraction, 266–266 NASDAQ, 270 National Public Radio (NPR), 247 centroid neighbor capture, 170–175 clustering and, 170–175 fitting scheme because, 308 benefits of nearest neighbor procedure, 157 related to KDD Cup churn, 224–231 nearest-neighbor reasoning, 144–163 computing neighbor scores, 162–163 ranking, 147–148 link functions, 162–163 complexity control and, 151–153 computational efficiency 17. source, 149 dimensions for, 156–157 distance functions for, 158–161 domain knowledge and, 156–157 for predictive modeling, 147 geometric interpretation and, 151–153 heterogeneous features and, 157 neighbor influence, determination of, 150–151, 155–156 overfitting and, 151–153 performance against, 157 likelihood estimation, 148 regression, 149 whiskey analysis, 145–147 negative gain, 212 no, 188 neighbor taking, speedup, 157 extra 14 14 direct groupings, eve 149 nested cross-validation, 135 Netflix, 7, 142, 305 Netflix Challenge, 304–308, 320
neural networks, 107, 108 parametric modeling and, 107–110 using, 109 New York Stock Exchange, 270 New York University (NYU), 8 Nissenbaum, Helen, 344 nonlinear support vector machines, 92, 107 Normal distribution, 96, 29 , 255 North Port single malt scotch, 181 probably not responding, 195 not spam (target category), 235 numbers, 255 numeric variables, 56 numerical predictions, 25
O
Oakland Raiders, 266 objective functions, 110 benefits, 97 creation, 88 disadvantages, 97 maximization, 136 optimization, 88 goals, 88 probabilities, 98, 99 oDesk, 345 On the Road (Kerouac), 256 On-line Analytical Processing (OLAP), 38 electronic processing, 38 One Draw, 247 Orange (French telecommunications company), 223 outliers, 167 over the wall transfers, 34 overfitting, 15, 73, 111–139, 334 and tree induction, 116–131 reviews, 11, 113, 119, 133–138 complexity control, 133–138 cross-validation example, 126–129 fitting method and, 310 fitting graphs and, 113–115 general methodology for how, 134–136 generalization and, 111–112 validation, 113–115 validation assessments of, 126 in mathematical functions, 118–119 learning curves vs., 130–132 linear functions, 119–123
nearest neighbor reasoning, 151–153 further structure optimization, 136–138 degraded performance and, 124–126 techniques to avoid, 126
P
parabola, 107, 123 parameter learning, 81 parameterized models, 81 parameterized numerical functions, 301 parametric modeling, 81 class probability estimation, 97–107 linear classification, 83 linear regression, 95–97 logistic regression, 97–107 nonlinear functions in, 107–110 support vector machines and, 107–110 Parker, Charlie, 259, 261 Pasteur, Louis, 316 patents, as intellectual property, 319 summary patterns, 27 analysis patterns, 14 penalties in rollover modeling, 223–231 performance 124–126 nearest neighbor reasoning, 157 phrase extraction, 266 pilot studies, 355 fall (stock prices), 269 polynomial kernels, 108 absolute, 188 posterior probability, 239 –24 , 204 prediction, 6, 45 Prediction API (Google), 316 predictive learning methods, 181 predictive modeling, 43–44, 81 alternative methods, 81 key concepts, 78 causal explanations and, 311 rank pine and, 67–71 churn, tree induction prediction, 73–78 special, 48 initial and, 44–48
link prediction, 303–304 nearest neighbor reasoning, 147 parametric modeling and, 81 likelihood estimation and, 71–73 social recommendations, 303–304 supervised segmentation, 48–79 predictions, 47 planning, 30 principles, 4, 23 prior belief, probability based on, 240 prior permutation, 14 prior probability, class, 239 privacy and data mining, 343–344 Privacy in Context (Nissenbaum), 344 privacy, 343 probabilistic evidence combination (PEC), 233–249 Bayes' Rule and, 237–246 probability theory for, 235–237 targeted ad serving, 233–235 probabilistic topic models, 267 probabilities, 102–103 plus nearest neighbor reasoning, 148 ground rule of, 201 construction models for assessment of, 2236 links, 236–237 error, 198 elements, 239 independent event, 236–237 posterior, 239–240 prior, 239 unconditional, 238, 240 probability estimation trees, 64, 72 probability notation, 235–233–23 probability process personalized, 22, 298–303 examples of consumer movie viewing preferences, 304 when distribution is symmetric, 300 profit curves, 212–214, 229–230 winners, negative, 212 profitability, 40 sustainable customers, average customers vs., 40 propositions, evaluation, 326–329, 353–355 proxy labels, 288 psychometric data, 295 publications, 324
purity, 49–56 Pythagorean Theorem, 143
Q
queries, 37 skills, 38 formulas, 37 implementations, 38 queries, 37 Quine, W. V. O., 341
R
Ra, Sun, 261 rating containers, rating vs., 209–231 alloy set, 48 reasoning, 141 recall metric, Receiver Operating Characteristics (ROC) graphs, 214–219 area under ROC curves (AUC), 219 in KDD Cup churn problem, 227–227 recommendations, 142 Reddit, 252 regional distribution centers, clustering/correlations both, 292 regression, 20, 21, 141 creating patterns for, 28 classification, 21 methods or, 308 minimum exposures, 911 logistic, ridge, 138 supervised data mining and, 25 supervised segmentation and, 56 regression modeling, 194 regression trees, 64, 311 regularization, 136, 140 subtraction of missing values, 30 iteration, 6 requirements, 29 respondents, potential contrast. unlikely, 195 recover, 141 neighbors recovering, 149 Reuters news agency, 175 ridge regression, 138 mean square error, 194
S
Saint Magdalene single malt scotch, 181 Scapa single malt scotch, 179 Schwartz, Henry, 185 rating, 21 finding ads, showing against, 233 search engine, 252 second-tier models, 108 segmentation that creates the best, 56 supervision, 163 unsupervised, 184 selection features, 43 informative variables, 49 variables, 43 selection bias, 282–283 semantic similarity, syntactic vs., 178 distinct classes, 123 sequential backward elimination, 135 sequential forward selection (SFS), 13 use of help, 13 , 254 Shannon, Claude, 51 Sheldon Cooper (fictional character), 247 signal coherence, in cost-benefit matrix, 203 Signet Bank, 9, 288 Silver Lake, 255 Silver, Nate, 205 similarity, 141–1463 application 334 clustering, 163–178 cosine, 160 data mining vs. business problems and, 183–185 distance and, 142–144 heterogeneous attributes and, 157 link validation in addition, 303 measurement, 143 neighbor reasoning4 nearest matching163, 21 similarity class mediation (equation), 162 moderated similarity regression (equation), 163 moderate similarity score (equation), 163 Simone, Nina, 261 skewed, 190 Skype Global, 255
smoothing, 73 social recommendations, 303–304 soft clustering, 303 software development, 34 software engineering, data science vs., 330 software skills, analytical skills vs., 35 Solove, Daniel, 344 solution paths, change, 29 spam (target category), 235 spam detection systems, 235 specified class values, 26 overall specified targets, 26 speech recognition systems, 317 accelerated neighbor retrieval, 157 Spirited Away, 247 spreadsheets, Naive Bayes implementation with, 35 stemming, 255, 254, 259, David, 259 stock market, 268 stock price movement example, 268–276 Stoker (film thriller), 256 stopwords, 255, 256 strategic considerations, 9 strategy, 34 strength, in association mining, 293, 295 strong dependencies, 243 structure, 39 Structured Query Language (SQL), 37 structured thinking, 14 structuring, 28 subjective priors, 240 subtasks, 20 general statistics, 35, 36 Summit Product, Inc., 271 Sun Ra, 261
supervised data, 43–44, 78 supervised data mining classification, 25 conditions, 24 regression, 25 subclassing, 25 unsupervised vs., 24–25 supervised learning generating cluster descriptors with, 180–183 methods of, 181 condition, 24 supervised segmentation, 43–44, 48–67, 163 selection selection, 49–62 generation, 62 entropy, 49–56 induction, 64 execution, 44 purity of data sets, 49–56 regression problems and, 56 tree induction, 64–67 available tree copies, 62–64 support vector machines, 88, 119 linear and, 92–94, 92 nonlinear, 92, 107 objective function, 92 parametric modeling and, 107–110 technique , in copper combination, 295 wave (stock values), 269 surprise, 293–294 synonyms, 253 syntactic similarity, semantic vs., 178
T
table setup, 112, 114 tables, 47 Tambe, Prasanna, 8 Tamdhu single malt whiskey, 181 target, 6 target variables, 47, 149 estimated price, 56 review, 328 targeted advertising pattern, 233–235 protection from, 28 Naive Bayes, in International and, example of best attack capabilities 343, 280–283 tasks/techniques, 4, 291–313 associations, 292–298 bias, 308–311
classification, 21 co-occurrence, data reduction 292–298, 304–308 causal explanations based on data, 311–312 ensemble method, latent information 308–311, link prediction 304–308, shopping basket analysis 303–304, 295 Overlay, 39 Underlying Principles, 23 Profiles, 298–303 Social Recommendations, 303–304 Variation, 308–311 Viral Marketing Example, 311–312 Tatum, Art, 261 Analytics, 29 Application, 385 the Big Data, 385 the Big Data, ., frequency 15 –16 terms (TF), 254–256 defined, 254 in TFIDF, 258 inverse document frequency, combining with, 258 values for, 260 terms in documents, 253 supervised learning, 24 unsupervised learning, 24 weights, 267 Terry, Clark, 261 test data, model building and, 134 test sets, 114 tests, wait, 126 text, 251 as unstructured data, 252–253 data, 251 fields, variable number of words, 252 meaning of, 252 jazz musicians example, 258–262 relative dirt, 252 text processing, 251 text representation task, 253–258 bag of words approach, 254 data preparation, 270–272 data preprocessing, 272–273 definition, 268–270
inverse document frequency, 256–257 jazz musicians example, 258–262 mining site, 338 prevalence measure, 254–256 spread measure, 256–257 message extraction example stories, 268–276 sequential n-gram approach, collection name, 266–266 results, interpretation, 273–276 stock price movement example, 268–276 term frequency, 254–256 TFIDF value and, 258 topic models for, 266–268 TFIDF scores (TFIDF values), 175 applied to locations, 338 work text implementation, 258 The Big Bang Theory, 247 The Colbert Report, 247 The Daily Show, 247 The Godfather, 247 The New York Times, 3, 340 The Onion, 247 The Road (McCarthy), 256 The Signal and the Noise (Silver), 205 The Sound of Music (film), 307 The Stoker (film comedy), 256 The Wizard of Oz (film), 307 Thomson Reuters Text Research Collection (TRC2), 175 thresholds and classifiers, 210–211 and performance curves, 212 time series (data), 270 Tobermory single malt scotch, 179 tokens, 253 tools, analytics, 113 topic levels, 266 topic models for text representation, 266–268 trade secrets, 319 training data, 45, 4 , 113 evaluation, 113, 328 bounds, 310 using, 126, 131, 140 training sets, 114 transfers, over the wall, 34 tree induction, 44 ensemble methods and, 311 learning curves for, 131 bounds, 133
logistic regression vs., 103–107 supervised functional, 64–67 overfitting and, 116–118, 133–134 problems with, 133 Tree of Life (Sugden et al, Pennisi), 167 tree-structured model classification, 63 generating, 64 decision, 63 for supervised segmentation, 62–64 objectives, 64 probability estimation, 64, 72 pruning, 134 regression, 64 limiting, 118 trigrams, 265 Tron, 247 true negative rate, 203 true negatives, 200 true positive rate, 203, 216–217, 221 true positives, 200 Tullibardine single malt whiskey, 169 Tumblr, online consumer targeting by, 234 Twitter, 252 Two Dogmas of Empiricism (Quine), 341
U
UCI Dataset Repository, 89–94 unconditional independence, conditional vs., 241 assumption and proof of unconditional probabilities, 238 prior probability based on, 240 unique conditions, strategic decisions, 342 University of California at Irvine, 57, 104 University of Montréal, 14 University of Toronto, 343 unstructured data, 252 unstructured data, text-like, 252–253 unsupervised learning, 24 unsupervised data mining methods, supervised vs., 24–25 unsupervised problems, 185 unsupervised segmentation, 184 user-generated content, 5
V
value (worth), adding, to applications, 187
value estimation, 21 dependent variables, 47 reviews, 47 findings, 15, 43 independent, 47 informative, 49 differentiation, 56 ratings, 48 ratio bets, 46 options, 43 objectives, 47, 56, 149 variance, 56 errors methods and , creation 308–311, 126, 140 viral marketing example, 311–312 views, vs. charts, 209 Volinsky, Chris, 306
W
Wal-Mart, 1, 3, 6 Waller, Fats, 261 Wang, Wally, 247, 296 Washington Square Park, 338 weather conditions, 205 Web 2.0, 252 web pages, personal, 252 web properties, such as content pieces, 234 web services, free, 233 Weeds (TV series), 247 weighted scoring, 150, 307 weighted voting, 150 What Data Can't Do (Brooks), 340 whiskey example clustering and, 164–166 for nearest neighbors, 145 for creating tree descriptions, 180–183 Ingenious demonstration, 327–329 Wikileaks, 247 wireless fraud example, 341 Wisconsin Breast Cancer Dataset, 104 word lengths, 252 modifiers, 276 sequences, 265 workforce constraints 247, , 124
Y
Yahoo! Finance, 270 Yahoo!, online consumer targeting by, 234
Z
zero-one loss, 95
About the Authors
Foster Provost is Professor and NEC Faculty Fellow at the NYU Stern School of Business, where he teaches in the Business Analytics, Data Science, and MBA programs. His award-winning research is widely read and cited. Before joining NYU, he worked as a data scientist for five years for what is now Verizon. Over the past decade, Professor Provost has co-founded several successful data-science-driven companies.
Tom Fawcett holds a Ph.D. in machine learning and has worked in industry R&D for more than two decades (GTE Laboratories, NYNEX/Verizon Labs, HP Labs, etc.). His published work has become standard reading in data science, both on methodology (e.g., evaluating data mining results) and on applications (e.g., fraud detection and spam filtering).
Colophon
The cover font is Adobe ITC Garamond. The text font is Adobe Minion Pro and the heading font is Adobe Myriad Condensed.
FAQs
How do you pass data science? ›
- Review the job posting. ...
- Go to their website. ...
- Study their competitors. ...
- Check the company's values and culture. ...
- Find out the company's recent achievements. ...
- Research your interviewer. ...
- Any data science interview. ...
- Phone interview.
- Writing Online. I make a significant portion of my income from writing online. ...
- Affiliate Marketing. ...
- Performing Market Research. ...
- Creating Courses & Workshops. ...
- Other Freelance Tasks.
- Creating and automating financial forecasting models.
- Cleaning up and capturing accurate financial data points.
- Developing more accurate business planning models.
- Identifying realistic data-driven objectives along with ways to monitor achievement and to trouble-shoot shortfalls.
Adding data science to your business practices can make a marked difference in productivity, decision-making, and product development. It can help you minimize or eradicate the risk of fraud and error, increase efficiency, and provide better customer service.
Is data science hard for beginners? ›Because of the often technical requirements for Data Science jobs, it can be more challenging to learn than other fields in technology. Getting a firm handle on such a wide variety of languages and applications does present a rather steep learning curve.
Is it hard to break into data science? ›Data science is a major that can be incredibly difficult to get into. The field is growing rapidly, and there are a lot of people who want to get into it. If you're interested in data science, you need to start thinking about how you can position yourself for success in the highly competitive job market.
Can data scientists make 300k? ›The 75th percentile of the base salary for a data science manager at level 3 is $310,000, an annual increase of 13%.
Is data science highest salary? ›...
These are the top skills of a Data Scientist based on 8834 jobs posted by employers.
- Python.
- Machine Learning.
- Data Science.
- SQL.
- Deep Learning.
Employees working as Data Scientists earn an average of ₹25.3 lakhs, mostly ranging from ₹16.9 lakhs to ₹99.0 lakhs, based on 1163 profiles.
What is an example of data science in business? ›The first data science real-life example is the manufacturing industry. Many manufacturers depend on data science to create forecasts of product demand. It helps them in optimizing supply chains and delivering orders without risk of over/under-ordering.
How do I prepare my business for data science? ›
- Step 1: Understand what data science isn't. ...
- Step 2: Audit your data fluency. ...
- Step 3: Give data science a try. ...
- Step 4: Hire the right people. ...
- Step 5: Get buy-in. ...
- Step 6: Create a data driven culture.
- Understanding business objectives and data-based information needs. ...
- Collecting the right data. ...
- Analyzing the data to gain relevant insights. ...
- Communicating the data effectively to inform decision making. ...
- Understanding how evidence-based decisions are made.
Data Scientists have more education and a higher degree of specialization, and so they typically command a higher salary than Business Analysts.
What is the difference between business science and data science? ›Business analytics is the statistical study of business data to gain insights, and it uses mostly structured data. Data science is the study of data using statistics, algorithms, and technology, and it uses both structured and unstructured data.
Why data science is better than MBA? ›Technical skills: The main difference between MBA in Analytics and Data Science and other domains is that the other domains are solely focused on managerial aspects. But data science clubs technical skills and managerial skills. A person will learn technical skills like SQL, Python and data visualisation.
Can I learn data science if I am bad at math? ›Being mathematically gifted isn't a strict prerequisite for being a data scientist. Sure, it helps, but being a data scientist is more than just being good at math and statistics. Being a data scientist means knowing how to solve problems and communicate them in an effective and concise manner.
How many days it will take to learn data science? ›As we outline in our data science FAQs, on average, to a person with no prior coding experience and/or mathematical background, it takes around 7 to 12 months of intensive studies to become an entry-level data scientist.
Can I learn data science in 6 months? ›Becoming a data scientist in six months is possible if you have a strong background in mathematics and coding.
What is the hardest part of data science? ›Although data pre-processing is often considered the worst part of a data scientist's job, it is crucial that models are built on clean, high-quality data. Otherwise, machine learning models learn the wrong patterns, ultimately leading to wrong predictions.
Does data science require a lot of math? ›Data science careers require mathematical study because machine learning algorithms, data analysis, and discovering insights from data all rely on math. While math will not be the only requirement for your educational and career path in data science, it's often one of the most important.
What is the hardest thing in data science? ›
The hardest part of data science is not building an accurate model or obtaining good, clean data, but defining feasible problems and coming up with reasonable ways of measuring solutions.
Why are data scientists paid so much? ›To an economist, this is a simple case of supply and demand, and that is arguably one of the prime reasons why data science pays so well. Companies today are in search of qualified candidates who can help them better understand big data, but these qualified candidates are scarce.
Who pays data scientists best? ›Certain industries tend to pay data scientists higher salaries than others. Data scientists who work in the finance and insurance industries, for example, tend to earn higher salaries compared to those who work in other industries.
Can an average person be a data scientist? ›Data science is fully based on mathematics and statistics. If you are from the same background it will be easy to learn data science, and it will be easy to be a data scientist.
Is data science a happy career? ›Data science is a fantastic career with a tonne of potential for future growth. Already, there is a lot of demand, competitive pay, and several benefits. Companies are actively looking for data scientists that can glean valuable information from massive amounts of data.
Is data scientist a stressful job? ›Data Science can be a stressful job because it has its challenges. But whether it is truly a stressful job or not is pretty subjective, depending on the circumstances, working environment, and the project. People with a passion for the job enjoy it while others may experience undeniable stress.
Does data science require coding? ›Yes, data science requires coding because it uses languages like Python and R to create machine-learning models and deal with large datasets.
How many hours do data scientists work? ›Working hours can vary, but usually full-time hours will be Monday to Friday and around 37 hours per week. Some jobs or projects might require you to work longer hours or weekends.
Does data science pay well in USA? ›The national average salary for a Data Scientist is $103,951 in the United States. Salary estimates are based on 31,676 salaries submitted anonymously to Glassdoor by Data Scientist employees.
Do data scientists work from home? ›If you're looking for a data scientist job and want to work remotely, there are opportunities not just in technology-focused industries, but across sectors like healthcare, education, sales, and computer and information technology.
What are 2 examples of business data? ›
Sales data. Warehouse and inventory data. Website traffic statistics. Customer contact information.
Can data science be used for small business? ›The short answer is yes, data science is definitely relevant to small businesses. In fact, it can be an incredibly powerful tool for them. By using data science, small businesses can make better decisions about everything from product development to marketing to customer service.
Is data science business profitable? ›Data monetization is very profitable since any data a firm acquires is essential to that company and others. The data you sell will be covered by dozens of companies, and these companies will be in the telecommunications and information services sectors.
How do I start a data science side hustle? ›
- Freelancing. Starting a freelancing job or becoming a free agent gives you flexibility in working hours. ...
- Technical Writing. ...
- Blogging. ...
- Copywriting. ...
- Ghostwriting. ...
- Contract work. ...
- Global Contract. ...
- Local Contract.
- Pursue a Bachelor's Degree (in a Related Field) or Bootcamp.
- Develop a Strong Portfolio.
- Network.
- Find a Mentor.
- Tailor Your Resume and Prep Well For Interviews.
- Step 1: Earn a Bachelor's Degree. A great way to get started in Data Science is to get a bachelor's degree in a relevant field such as data science, statistics, or computer science. ...
- Step 2: Learn Relevant Programming Languages. ...
- Step 3: Learn Related Skills.
The best way to learn data science is to work on projects so you can gain data science skills that can be applied immediately and are useful from a real-world implementation perspective. The sooner you start working on diverse data science projects, the faster you will learn the related concepts.
Can I learn data analysis on my own? ›Yes, you can learn the fundamentals of data analysis on your own.
Is business analytics a lot of coding? ›Technical Skills for Business Analytics
Having both a conceptual and working understanding of tools and programming languages is important to translate data sources into tangible solutions. SQL is the coding language of databases and one of the most important tools in an analytics professional's toolkit.
It's moderately hard to become a business analyst. You should have soft and technical skills and the proper education to become a successful business analyst.
What is the difference between a data scientist and a business data analyst? ›
A Data Scientist specializes in high-level data manipulation, including writing complex algorithms and computer programming. Business Analysts are more focused on creating and interpreting reports on how the business is operating day to day, and providing recommendations based on their findings.
Should I become a business analyst or data scientist? ›Typically, data science can be taken up by early career professionals but business analytics is better suited for professionals with experience in business development, technology and project management.
Can a business analyst become data scientist? ›By investing in the right training and experience, business analysts can leverage their knowledge to excel in data analysis, unlocking new professional growth and development opportunities. Therefore, if you are a business analyst looking to move into data analysis, it is a viable career path worth pursuing.
Who earns more data analyst or business analyst? ›The average data analyst's salary could go up to $72,250 per year, depending on the company, job role, and geographical location. A business analyst's salary is typically higher, averaging $78,500 per year.
Is MBA after data science worth it? ›As the industry is expanding, the need for management in it is also increasing and hence an MBA in Data Science is definitely worth considering.
Which MBA is best after data science? ›
- MBA in Data Science and Machine Learning. ...
- MBA in Fintech and Data Analytics. ...
- MBA in Strategic Data-Driven Management. ...
- Master of Business Administration [Online] ...
- Professional MBA Digital Transformation & Data Science. ...
- The UCL MBA. ...
- MBA (Data & Cyber Management) ...
- MBA Data Analytics.
An MBA with a data science specialization helps prepare you for careers in brand management, strategic planning, risk management, public finance and management consulting. Make sure to check the job description before applying.
Is data science the hardest major? ›The short answer is no. The idea that data science is hard to learn is primarily a misconception that beginners have during their initial days. As they discover more of the domain, they realise that data science is just another field of study that can be learned by working hard.
Is data science hard in college? ›Since the field is relatively new, a Data Science major can sometimes be hard to come by, both in undergraduate or graduate programs. The scarcity has resulted in many students having to pick alternative concentrations.
Is data science easy for average students? ›
Data science is fully based on mathematics and statistics. If you are from the same background it will be easy to learn data science, and it will be easy to be a data scientist. If you are from a non-IT background, first you have to learn mathematics and statistics.
Is data science hard for non it students? ›The short answer to this question is, yes. Data Scientists spend most of their time coding or programming to implement various steps involved in a Data Science project. Data Scientists must have a sound understanding of various programming languages such as Python, SQL, R, etc.
What level of math do you need for data science? ›Data Scientists use three main types of math—linear algebra, calculus, and statistics. Probability is another math data scientists use, but it is sometimes grouped together with statistics.
Is data science a lot of math? ›Mathematics is an integral part of data science. Any practicing data scientist or person interested in building a career in data science will need to have a strong background in specific mathematical fields.
Is data science a stressful job? ›Data science can be a stressful job because it has its challenges. But whether it is truly a stressful job or not is pretty subjective, depending on the circumstances, working environment, and the project. People with a passion for the job enjoy it while others may experience undeniable stress.
Can I teach myself data science? ›It's definitely possible to become a data scientist without any formal education or experience. The most important thing is that you have the drive to learn and are motivated to solve problems. And if you can find a mentor or community who can help guide and support your learning then that's even better!
Who gets paid more data scientist or software engineer? ›The average yearly salary for data scientists is $120,103 . The average yearly salary for software engineers is $102,234 . Software engineers also receive an average of $4,000 in bonuses each year. Your salary may vary depending on your experience, skills, training, certifications and your employer.
What GPA do you need for data science? ›A bachelor's degree, although not necessarily in computer science, is required with a minimum overall GPA of 3.0 on a 4.0 scale. Minimum GRE scores of 304 and 2.5 must be submitted.
People from various backgrounds especially with zero coding experiences have proven to become good data scientists in just one year by learning to code smartly.
What is the average age for data science? ›Among senior data scientists, the most common age group is 40+ years old, which represents 49% of the population.
The data science scope is high, you are an average student, and yes, you can become a data scientist. Just focus on improving your skills. You can do so by working through online data science courses & real-time projects offered by The IoT Academy.