Learning Theory from First Principles

Name: Learning Theory from First Principles
Brand: MIT Press Ltd
Price: 99.49 EUR

(0 Bewertungen)15

Buch (gebunden)

eBook epub77,99 €

99,49 €inkl. Mwst.

Zustellung: Sa, 09.08. - Fr, 15.08.

Versand in 3-4 Wochen

Versandkostenfrei

"The aim of this book is to provide the simplest formulations that can be derived "from first principles" with simple arguments"--

Inhaltsverzeichnis

Preface ix
I Preliminaries 1
1 Mathematical preliminaries 3
1. 1 Linear algebra and differentiable calculus. . . . . . . . . . . . . . . . . . 3
1. 1. 1 Minimization of quadratic forms. . . . . . . . . . . . . . . . . . . 3
1. 1. 2 Inverting a 2 × 2 matrix. . . . . . . . . . . . . . . . . . . . . . . . 4
1. 1. 3 Inverting matrices defined by blocks, matrix inversion lemma. . . 4
1. 1. 4 Eigenvalue and singular value decomposition. . . . . . . . . . . . 6
1. 1. 5 Differential calculus. . . . . . . . . . . . . . . . . . . . . . . . . . 7
1. 2 Concentration inequalities. . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1. 2. 1 Hoeffding’ s inequality. . . . . . . . . . . . . . . . . . . . . . . . . 9
1. 2. 2 McDiarmid’ s inequality. . . . . . . . . . . . . . . . . . . . . . . . 12
1. 2. 3 Bernstein’ s inequality (_x0007_). . . . . . . . . . . . . . . . . . . . . . . 14
1. 2. 4 Expectation of the maximum. . . . . . . . . . . . . . . . . . . . . 16
1. 2. 5 Estimation of expectations through quadrature (_x0007_). . . . . . . . . 17
1. 2. 6 Concentration inequalities for matrices (_x0007__x0007_). . . . . . . . . . . . . 18
2 Introduction to supervised learning 21
2. 1 From training data to predictions. . . . . . . . . . . . . . . . . . . . . . . 22
2. 2 Decision theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2. 2. 1 Loss functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2. 2. 2 Risks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2. 2. 3 Bayes risk and Bayes predictor. . . . . . . . . . . . . . . . . . . . 27
2. 3 Learning from data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2. 3. 1 Local averaging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2. 3. 2 Empirical risk minimization. . . . . . . . . . . . . . . . . . . . . . 31
2. 4 Statistical learning theory. . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2. 4. 1 Measures of performance. . . . . . . . . . . . . . . . . . . . . . . 35
2. 4. 2 Notions of consistency over classes of problems. . . . . . . . . . . 35
2. 5 No free lunch theorems (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . 36
2. 6 Quest for adaptivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2. 7 Beyond supervised learning. . . . . . . . . . . . . . . . . . . . . . . . . . 39
2. 8 Summary - book outline. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Linear least-squares regression 43
3. 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3. 2 Least-squares framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3. 3 Ordinary least-squares (OLS) estimator. . . . . . . . . . . . . . . . . . . 45
3. 3. 1 Closed-form solution. . . . . . . . . . . . . . . . . . . . . . . . . . 45
3. 3. 2 Geometric interpretation. . . . . . . . . . . . . . . . . . . . . . . . 46
3. 3. 3 Numerical resolution. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3. 4 Statistical analysis of OLS. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3. 5 Fixed design setting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3. 5. 1 Statistical properties of the OLS estimator. . . . . . . . . . . . . 50
3. 5. 2 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3. 6 Ridge least-squares regression. . . . . . . . . . . . . . . . . . . . . . . . . 53
3. 7 Lower-bound (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3. 8 Random design analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3. 8. 1 Gaussian designs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3. 8. 2 General designs (_x0007__x0007_). . . . . . . . . . . . . . . . . . . . . . . . . 61
3. 9 Principal component analysis (_x0007_). . . . . . . . . . . . . . . . . . . . . . . 63
3. 10 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
II Generalization bounds for learning algorithms 67
4 Empirical risk minimization 69
4. 1 Convexification of the risk. . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4. 1. 1 Convex surrogates. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4. 1. 2 Geometric interpretation of the support vector machine (_x0007_). . . . 72
4. 1. 3 Conditional Φ -risk and classification calibration (_x0007_). . . . . . . . 74
4. 1. 4 Relationship between risk and Φ -risk (_x0007__x0007_). . . . . . . . . . . . . . 76
4. 2 Risk minimization decomposition. . . . . . . . . . . . . . . . . . . . . . . 80
4. 3 Approximation error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4. 4 Estimation error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4. 4. 1 Application of McDiarmid’ s inequality. . . . . . . . . . . . . . . . 82
4. 4. 2 Easy case I: quadratic functions. . . . . . . . . . . . . . . . . . . . 83
4. 4. 3 Easy case II: Finite number of models. . . . . . . . . . . . . . . . 84
4. 4. 4 Beyond finitely many models through covering numbers (_x0007_). . . 84
4. 5 Rademacher complexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4. 5. 1 Symmetrization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4. 5. 2 Lipschitz-continuous losses. . . . . . . . . . . . . . . . . . . . . . 89
4. 5. 3 Ball-constrained linear predictions. . . . . . . . . . . . . . . . . . 91
4. 5. 4 Putting things together (linear predictions). . . . . . . . . . . . . 92
4. 5. 5 From constrained to regularized estimation (_x0007_). . . . . . . . . . . 93
4. 5. 6 Extensions and improvements. . . . . . . . . . . . . . . . . . . . . 96
4. 6 Model selection (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4. 6. 1 Structural risk minimization. . . . . . . . . . . . . . . . . . . . . . 98
4. 6. 2 Selection based on validation set. . . . . . . . . . . . . . . . . . . 99
4. 7 Relationship with asymptotic statistics (_x0007_). . . . . . . . . . . . . . . . . 99
4. 8 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5 Optimization for machine learning 103
5. 1 Optimization in machine learning. . . . . . . . . . . . . . . . . . . . . . . 103
5. 2 Gradient descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5. 2. 1 Simplest analysis: ordinary least-squares. . . . . . . . . . . . . . . 106
5. 2. 2 Convex functions and their properties. . . . . . . . . . . . . . . . 110
5. 2. 3 Analysis of GD for strongly convex and smooth functions. . . . . 112
5. 2. 4 Analysis of GD for convex and smooth functions (_x0007_). . . . . . . 117
5. 2. 5 Beyond gradient descent (_x0007_). . . . . . . . . . . . . . . . . . . . . 120
5. 2. 6 Non-convex objective functions (_x0007_). . . . . . . . . . . . . . . . . 122
5. 3 Gradient methods on non-smooth problems. . . . . . . . . . . . . . . . . 123
5. 4 Convergence rate of stochastic gradient descent (SGD). . . . . . . . . . . 127
5. 4. 1 Strongly convex problems (_x0007_). . . . . . . . . . . . . . . . . . . . 132
5. 4. 2 Adaptive methods (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . 134
5. 4. 3 Bias-variance trade-offs for least-squares (_x0007_). . . . . . . . . . . . . 135
5. 4. 4 Variance reduction (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . 138
5. 5 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6 Local averaging methods 145
6. 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6. 2 Local averaging methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6. 2. 1 Linear estimators. . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6. 2. 2 Partition estimators. . . . . . . . . . . . . . . . . . . . . . . . . . 148
6. 2. 3 Nearest-neighbors. . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6. 2. 4 Nadaraya-Watson estimator a. k. a. kernel regression (_x0007_). . . . . . 151
6. 3 Generic “ simplest” consistency analysis. . . . . . . . . . . . . . . . . . . 153
6. 3. 1 Fixed partition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6. 3. 2 k-nearest neighbor. . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6. 3. 3 Kernel regression (Nadaraya-Watson) (_x0007_). . . . . . . . . . . . . . 160
6. 4 Universal consistency (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6. 5 Adaptivity (_x0007__x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6. 6 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7 Kernel methods 169
7. 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7. 2 Representer theorem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7. 3 Kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7. 3. 1 Linear and polynomial kernels. . . . . . . . . . . . . . . . . . . . 175
7. 3. 2 Translation-invariant kernels on [0, 1]. . . . . . . . . . . . . . . . . 176
7. 3. 3 Translation-invariant kernels on Rd. . . . . . . . . . . . . . . . . . 179
7. 3. 4 Beyond vectorial input spaces (_x0007_). . . . . . . . . . . . . . . . . . 182
7. 4 Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7. 4. 1 Representer theorem. . . . . . . . . . . . . . . . . . . . . . . . . . 184
7. 4. 2 Column sampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7. 4. 3 Random features. . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7. 4. 4 Dual algorithms (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . 186
7. 4. 5 Stochastic gradient descent (_x0007_). . . . . . . . . . . . . . . . . . . . 187
7. 4. 6 “ Kernelization” of linear algorithms. . . . . . . . . . . . . . . . . 188
7. 5 Generalization guarantees - Lipschitz-continuous losses. . . . . . . . . . . 189
7. 5. 1 Risk decomposition. . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7. 5. 2 Approximation error for translation-invariant kernels on Rd. . . . 191
7. 6 Theoretical analysis of ridge regression (_x0007_). . . . . . . . . . . . . . . . . 194
7. 6. 1 Kernel ridge regression as a “ linear” estimator. . . . . . . . . . . 194
7. 6. 2 Bias and variance decomposition (_x0007_). . . . . . . . . . . . . . . . 195
7. 6. 3 Relating empirical and population covariance operators. . . . . . 198
7. 6. 4 Analysis for well-specified problems (_x0007_). . . . . . . . . . . . . . . 200
7. 6. 5 Analysis beyond well-specified problems (_x0007_). . . . . . . . . . . . . 201
7. 6. 6 Balancing bias and variance (_x0007_). . . . . . . . . . . . . . . . . . . 202
7. 7 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7. 8 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
8 Sparse methods 207
8. 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8. 1. 1 Dedicated proof technique for constrained least-squares. . . . . . 209
8. 1. 2 Probabilistic and combinatorial lemmas. . . . . . . . . . . . . . . 210
8. 2 Variable selection by the ℓ 0-penalty. . . . . . . . . . . . . . . . . . . . . . 212
8. 2. 1 Assuming k is known. . . . . . . . . . . . . . . . . . . . . . . . . . 212
8. 2. 2 Estimating k (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
8. 3 Variable selection by ℓ 1-regularization. . . . . . . . . . . . . . . . . . . . 216
8. 3. 1 Intuition and algorithms. . . . . . . . . . . . . . . . . . . . . . . . 217
8. 3. 2 Slow rates - random design. . . . . . . . . . . . . . . . . . . . . . 221
8. 3. 3 Slow rates - fixed design. . . . . . . . . . . . . . . . . . . . . . . . 221
8. 3. 4 Fast rates (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
8. 3. 5 Zoo of conditions (_x0007__x0007_). . . . . . . . . . . . . . . . . . . . . . . . . 225
8. 3. 6 Random design (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . 227
8. 4 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
8. 5 Extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
8. 6 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
9 Neural networks 233
9. 1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
9. 2 Single hidden layer neural network. . . . . . . . . . . . . . . . . . . . . . 235
9. 2. 1 Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
9. 2. 2 Rectified linear units and homogeneity. . . . . . . . . . . . . . . . 237
9. 2. 3 Estimation error. . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9. 3 Approximation properties. . . . . . . . . . . . . . . . . . . . . . . . . . . 241
9. 3. 1 Universal approximation property in one dimension. . . . . . . . 242
9. 3. 2 Infinitely many neurons and variation norm. . . . . . . . . . . . . 243
9. 3. 3 Variation norm in one dimension. . . . . . . . . . . . . . . . . . . 244
9. 3. 4 Variation norm in arbitrary dimension. . . . . . . . . . . . . . . . 248
9. 3. 5 Precise approximation properties. . . . . . . . . . . . . . . . . . . 249
9. 3. 6 From the variation norm to a finite number of neurons (_x0007_). . . . 250
9. 4 Generalization performance for neural networks. . . . . . . . . . . . . . . 253
9. 5 Relationship with kernel methods (_x0007_). . . . . . . . . . . . . . . . . . . . 255
9. 5. 1 From a Banach space F1 to a Hilbert space F2 (_x0007_). . . . . . . . . 255
9. 5. 2 Kernel function (_x0007__x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . 257
9. 5. 3 Upper-bound on RKHS norm (_x0007__x0007_). . . . . . . . . . . . . . . . . . 258
9. 6 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
9. 7 Extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9. 8 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
III Special topics 263
10 Ensemble learning 265
10. 1 Averaging / bagging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10. 1. 1 Independent datasets. . . . . . . . . . . . . . . . . . . . . . . . . . 266
10. 1. 2 Bagging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
10. 2 Random projections and averaging. . . . . . . . . . . . . . . . . . . . . . 269
10. 2. 1 Gaussian sketching. . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10. 2. 2 Random projections. . . . . . . . . . . . . . . . . . . . . . . . . . 273
10. 3 Boosting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10. 3. 1 Problem set-up. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
10. 3. 2 Incremental learning. . . . . . . . . . . . . . . . . . . . . . . . . . 281
10. 3. 3 Matching pursuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
10. 3. 4 Adaboost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
10. 3. 5 Greedy algorithm based on gradient boosting. . . . . . . . . . . . 284
10. 3. 6 Convergence of expected risk. . . . . . . . . . . . . . . . . . . . . 287
10. 3. 7 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
10. 4 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
11 From online learning to bandits 291
11. 1 First-order online convex optimization. . . . . . . . . . . . . . . . . . . . 292
11. 1. 1 Convex case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
11. 1. 2 Strongly-convex case (_x0007_). . . . . . . . . . . . . . . . . . . . . . . 295
11. 1. 3 Online mirror descent (_x0007_). . . . . . . . . . . . . . . . . . . . . . . 295
11. 1. 4 Lower bounds (_x0007__x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . 297
11. 2 Zero-th order convex optimization. . . . . . . . . . . . . . . . . . . . . . 299
11. 2. 1 Smooth stochastic gradient descent. . . . . . . . . . . . . . . . . . 301
11. 2. 2 Stochastic smoothing (_x0007_). . . . . . . . . . . . . . . . . . . . . . . 303
11. 2. 3 Extensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
11. 3 Multi-armed bandits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
11. 3. 1 Need for an exploration-exploitation trade-off. . . . . . . . . . . . 308
11. 3. 2 “ Explore-then-commit” . . . . . . . . . . . . . . . . . . . . . . . . 308
11. 3. 3 Optimism in the face of uncertainty (_x0007_). . . . . . . . . . . . . . . 310
11. 3. 4 Adversarial bandits (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . 312
11. 4 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
12 Over-parameterized models 315
12. 1 Implicit bias of gradient descent. . . . . . . . . . . . . . . . . . . . . . . . 316
12. 1. 1 Least-squares. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
12. 1. 2 Separable classification. . . . . . . . . . . . . . . . . . . . . . . . . 318
12. 1. 3 Beyond convex problems (_x0007_). . . . . . . . . . . . . . . . . . . . . 323
12. 2 Double descent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
12. 2. 1 The double descent phenomenon. . . . . . . . . . . . . . . . . . . 325
12. 2. 2 Empirical evidence. . . . . . . . . . . . . . . . . . . . . . . . . . . 326
12. 2. 3 Linear regression with Gaussian projections (_x0007_). . . . . . . . . . . 327
12. 3 Global convergence of gradient descent. . . . . . . . . . . . . . . . . . . . 332
12. 3. 1 Mean field limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
12. 3. 2 From linear networks to positive definite matrices. . . . . . . . . . 337
12. 3. 3 Global convergence for positive definite matrices. . . . . . . . . . 338
12. 3. 4 Special case of Oja flow. . . . . . . . . . . . . . . . . . . . . . . . 340
12. 4 Lazy regime and neural tangent kernels (_x0007_). . . . . . . . . . . . . . . . . 341
12. 5 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
13 Structured prediction 345
13. 1 Multi-category classification. . . . . . . . . . . . . . . . . . . . . . . . . . 346
13. 1. 1 Extension of classical convex surrogates. . . . . . . . . . . . . . . 346
13. 1. 2 Generalization bound I: stochastic gradient descent. . . . . . . . . 348
13. 1. 3 Generalization bound II: Rademacher complexities (_x0007_). . . . . . . 350
13. 2 General set-up and examples. . . . . . . . . . . . . . . . . . . . . . . . . 352
13. 2. 1 Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
13. 2. 2 Structure encoding loss functions. . . . . . . . . . . . . . . . . . . 354
13. 3 Surrogate methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
13. 3. 1 Score functions and decoding step. . . . . . . . . . . . . . . . . . 356
13. 3. 2 Fisher consistency and calibration functions. . . . . . . . . . . . . 357
13. 3. 3 Main surrogate frameworks. . . . . . . . . . . . . . . . . . . . . . 357
13. 4 Smooth/quadratic surrogates. . . . . . . . . . . . . . . . . . . . . . . . . 358
13. 4. 1 Quadratic surrogate. . . . . . . . . . . . . . . . . . . . . . . . . . 358
13. 4. 2 Theoretical guarantees. . . . . . . . . . . . . . . . . . . . . . . . . 358
13. 4. 3 Linear estimators and decoding steps. . . . . . . . . . . . . . . . . 359
13. 4. 4 Smooth surrogates (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . 360
13. 5 Max-margin formulations. . . . . . . . . . . . . . . . . . . . . . . . . . . 362
13. 5. 1 Structured SVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
13. 5. 2 Max-min formulations (_x0007__x0007_). . . . . . . . . . . . . . . . . . . . . . 363
13. 6 Generalization bounds (_x0007_). . . . . . . . . . . . . . . . . . . . . . . . . . . 365
13. 7 Experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
13. 7. 1 Robust regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
13. 7. 2 Ranking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
13. 8 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
14 Probabilistic methods 371
14. 1 From empirical risks to log-likelihoods. . . . . . . . . . . . . . . . . . . . 371
14. 1. 1 Conditional likelihoods. . . . . . . . . . . . . . . . . . . . . . . . . 373
14. 1. 2 Classical priors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
14. 1. 3 Sparse priors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
14. 1. 4 On the relationship between MAP and MMSE (_x0007_). . . . . . . . . 375
14. 2 Discriminative vs. generative models. . . . . . . . . . . . . . . . . . . . . 378
14. 2. 1 Linear discriminant analysis and softmax regression. . . . . . . . 379
14. 2. 2 Naive Bayes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
14. 2. 3 Maximum likelihood estimations. . . . . . . . . . . . . . . . . . . 380
14. 3 Bayesian inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
14. 3. 1 Computational handling of posterior distributions. . . . . . . . . 382
14. 3. 2 Model selection through marginal likelihood. . . . . . . . . . . . . 383
14. 4 PAC-Bayesian analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
14. 4. 1 Set-up. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
14. 4. 2 Uniformly bounded loss functions. . . . . . . . . . . . . . . . . . . 385
14. 5 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
15 Lower bounds on performance 389
15. 1 Statistical lower bounds. . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
15. 1. 1 Minimax lower bounds. . . . . . . . . . . . . . . . . . . . . . . . . 390
15. 1. 2 Reduction to a hypothesis test. . . . . . . . . . . . . . . . . . . . 391
15. 1. 3 Review of information theory. . . . . . . . . . . . . . . . . . . . . 393
15. 1. 4 Lower-bound on hypothesis testing based on information theory. 395
15. 1. 5 Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
15. 1. 6 Minimax lower bounds through Bayesian analysis. . . . . . . . . . 399
15. 2 Optimization lower bounds. . . . . . . . . . . . . . . . . . . . . . . . . . 402
15. 2. 1 Convex optimization. . . . . . . . . . . . . . . . . . . . . . . . . . 402
15. 2. 2 Non-convex optimization (_x0007_). . . . . . . . . . . . . . . . . . . . . 404
15. 3 Lower bounds for stochastic gradient descent (_x0007_). . . . . . . . . . . . . . 407
15. 4 Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409

Mehr aus dieser Reihe

Deep Learning

Ian Goodfellow, Yoshua Bengio, Aaron Courville, Joshua Bengio

Buch (gebunden)

91,49 €*

Reinforcement Learning

Richard S. Sutton, Andrew G. Barto

Buch (gebunden)

100,49 €*

Elements of Causal Inference

Bernhard Scholkopf, Dominik Janzing, Jonas Peters

Buch (gebunden)

57,99 €*

Knowledge Graphs

Mayank Kejriwal, Craig Knoblock

Buch (gebunden)

69,99 €*

Introduction to Machine Learning

Ethem Alpaydin

Buch (gebunden)

104,49 €*

Introduction to Online Convex Optimization, second edition

Elad Hazan

Buch (gebunden)

74,99 €*

Probabilistic Machine Learning

Kevin P. Murphy

Buch (gebunden)

186,49 €*

Probabilistic Graphical Models

Daphne Koller, Nir Friedman

Buch (gebunden)

144,99 €*

Foundations of Machine Learning

Mehryar Mohri, Ameet Talwalkar, Afshin Rostamizadeh

Buch (gebunden)

104,49 €*

Veridical Data Science

Bin Yu, Rebecca L. Barter

Buch (gebunden)

99,49 €*

Fairness and Machine Learning

Solon Barocas, Moritz Hardt

Buch (gebunden)

81,49 €*

Produktdetails

Erscheinungsdatum

24. Dezember 2024

Sprache

englisch

Seitenanzahl

478

Reihe

Adaptive Computation and Machine Learning series

Autor/Autorin

Francis Bach

Verlag/Hersteller

MIT Press Ltd

Produktart

gebunden

Gewicht

1096 g

Größe (L/B/H)

231/190/34 mm

ISBN

9780262049443

Entdecken Sie mehr

Maschinelles Lernen

Mustererkennung

Maschinelles Sehen, Bildverstehen

Maschinelles Lernen

Mustererkennung

Maschinelles Sehen, Bildverstehen

Portrait

Francis Bach

Francis Bach is a researcher at Inria where he leads the machine learning team which is part of the Computer Science department at Ecole Normale Supé rieure. His research focuses on machine learning and optimization.

Bewertungen

0 Bewertungen

Es wurden noch keine Bewertungen abgegeben. Schreiben Sie die erste Bewertung zu "Learning Theory from First Principles" und helfen Sie damit anderen bei der Kaufentscheidung.

Service

eBook Favoriten

eBook Genres

Unsere Bestseller

Spiegel Bestseller Belletristik

Spiegel Bestseller Sachbuch

Internationale eBooks

eBooks auf Englisch

Weitere Sprachen

tolino eReader

Service

Abonnements

Abo jetzt neu

Hörbuch Downloads

Beliebte Genres

Hörbuch CDs

Beliebte Genres

Top Themen

Buch Kategorien

Weitere Sortimente

Kalenderformate

Kategorien

eBook Preishits4

Preishits nach Genre4

Weiteres4

Learning Theory from First Principles

Mehr aus dieser Reihe

Produktdetails

Entdecken Sie mehr

Portrait

Francis Bach

Bewertungen