Great Papers

Aggregators

Short Science - (summaries) http://www.shortscience.org/venue?key=conf/nips
VK - Deep Learning community (Papers, codes, articles) https://vk.com/deeplearning
Arix-Sanity - better arxiv search http://www.arxiv-sanity.com/
GitXiv (Arxiv and Git code) http://gitxiv.com/
Arxiv-Vanity - HTML rendeing of Arxiv pages https://www.arxiv-vanity.com/

Breiman, L. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16:199–231.

Regression: Panik, M. J. 2009. Regression Modeling: Methods, Theory, and Computation with SAS. Boca Raton, FL: CRC Press. (Disclosure: my favorite regression book.)
Decision tree: Breiman, L., Friedman, J., Olshen, R., and Stone, C. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth.
Random forest: Breiman, L. 2001. “Random Forests.” Machine Learning 45:5–32.
Gradient boosting: Friedman, J. H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29:1189–1232.
Neural network: Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323:533–536.
Support vector machine: Cortes, C. and Vapnik, V. 1995. “Support-Vector Networks.” Machine Learning 20:273–297.
Naïve Bayes: Friedman, N., Geiger, D., and Goldszmidt, M. 1997. “Bayesian Network Classifiers.” Machine Learning 29:131–163.
Neighbors: Cover, T. and Hart, P. 1967. “Nearest Neighbor Pattern Classification.” IEEE Transactions on Information Theory 13:21–27.
Gaussian processes: Seeger, M. 2004. “Gaussian Processes for Machine Learning.” International Journal of Neural Systems 14:69–106.

A priori rules: Agrawal, R., Imieliński, T., and Swami, A. 1993. “Mining Association Rules between Sets of Items in Large Databases.” ACM SIGMOD Record 22:207–216.
k-means clustering: Hartigan, J. A. and Wong, M. A. 1979. “Algorithm AS 136: A k-Means Clustering Algorithm.” Journal of the Royal Statistical Society, Series C 28:100–108.
Mean shift clustering: Cheng, Y. 1995. “Mean Shift, Mode Seeking, and Clustering.” IEEE Transactions on Pattern Analysis and Machine Intelligence 17:790–799.
Spectral clustering: Von Luxburg, U. 2007. “A Tutorial on Spectral Clustering.” Statistics and Computing 17:395–416.
Kernel density estimation: Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Vol. 26. Boca Raton, FL: CRC Press.
Non-negative matrix factorization: Lee, D. D. and Seung, H. S. 1999. “Learning the Parts of Objects by Non-negative Matrix Factorization.” Nature 401:788–791.
Kernel PCA: Schölkopf, B., Smola, A., and Müller, K.-R. 1997. “Kernel Principal Component Analysis.” In Artificial Neural Networks—ICANN’97, 583–588. Berlin: Springer.
Sparse PCA: Zou, H., Hastie, T., and Tibshirani, R. 2006. “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics 15:265–286.
Singular value decomposition: Golub, G. H. and Reinsch, C. 1970. “Singular Value Decomposition and Least Squares Solutions.” Numerische Mathematik 14:403–420.

Denoising autoencoders: Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. 2008. “Extracting and Composing Robust Features with Denoising Autoencoders.” Proceedings of the 25th International Conference on Machine Learning. New York: ACM.
Expectation maximization: Nigam, K., McCallum, A.K., Thrun, S. and Mitchell, T. 2000. “Text Classification from Labeled and Unlabeled Documents using EM.” Machine Learning 39:103-134.
Manifold regularization: Belkin, M., Niyogi, P., and Sindhwani, V. 2006. “Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples.” The Journal of Machine Learning Research 7:2399-2434.
Transductive support vector machines: Joachims, T. 1999. “Transductive Inference for Text Classification Using Support Vector Machines.” Proceedings of the 16th International Conference on Machine Learning. New York: ACM. (source (for ML) : http://qr.ae/LPuHs )