
For anyone interested in R without a background in statistics: I would highly recommend learning the two in parallel (if not statistics first).

R is first and foremost a language for statistical computing. You really aren't going to get much out of it without working on some interesting data/stats problems. Plus, for most hacker types, I think being able to play with the statistics you're learning about in R can be a great learning aid.

However, not only is it beneficial to learn stats with R, it is imho dangerous to learn R without some stats. There's already too much research being published where 'p-value' means "the thing t.test() output that I was told needs to be in the paper".

Because R lets you play so freely with stats I find it a great tool to gain greater intuition about certain mathematical principles, but there is a temptation to let the tool do the work and the thinking for you.
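
To make the p-value point concrete, here is a minimal sketch of a permutation test in plain Python rather than R. It is not what t.test() does internally; it is a swapped-in technique that shows what a p-value measures: how often shuffled group labels produce a difference at least as large as the one observed. The function name and the data are made up for illustration.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are arbitrary,
        # so reshuffle them and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Made-up illustrative measurements, not real data.
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8, 5.2, 5.4]
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.5]
p = permutation_p_value(control, treated)
print(f"p = {p:.4f}")
```

A small p here only says the observed difference would be rare if the labels did not matter; it says nothing about how large or how important the effect is.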



This is a very, very good point. Though, many of the functions that R provides just won't make any sense at all if you don't have an intuition for the statistics behind it. I have found myself reading the papers published about specific functions in order to understand the results.

Do you have any resources that you suggest for beginners in statistics looking to learn on their own?


Here are the resources I've used and found helpful (plus notes) for getting at least a basic grounding in statistics.

Khan Academy + "Head First Statistics"

-No matter what direction you want to go with your learning, and before you start getting into street fights over Bayesian vs. Frequentist philosophies, "stats 101" is the common base of 'statistics' for: Physicists, Engineers, Sociologists, Doctors, Economists, Journalists and more. Yes, there are many "truths" in 'stats 101' that can be questioned, but this is the common language you'll want when talking to any of these people.

"What is a p-value anyway?"

-A very quick, non-math heavy read on avoiding magical thinking in classical statistics.

"Data Analysis with Open Source Tools"

-I really love this book because it steps way out of "stats 101" land and gives a lot of really great real world advice and touches on many mathematical tools that can help you understand data. The chapter "What you really need to know about classical statistics" gives a refreshing overview of all the basic stats you've covered and explains many things that seem strange. Also the bibliography/recommended reading list from this book is fantastic.

The biggest thing I've noticed as I've learned more is that statistics is a gigantic field, and can mean extremely different things depending on what field you're in. Obviously learn the advanced statistics relevant to your interests, but when you get a chance listen to how other people think about stats, and keep learning.


You probably can't go wrong with the Introduction to Statistics class from Udacity

http://www.udacity.com/overview/Course/st101/CourseRev/1


I'm taking this right now. Can somebody provide some perspective on how thoroughly the course prepares you for using R? Does it give you a good enough grounding in statistics so that you won't shoot yourself in the foot with R?


I thought http://oli.cmu.edu/courses/free-open/statistics-course-detai... was a great introduction, after completing it.


Thanks for the link, I have been meaning to revise some statistics, four years of barely doing any makes you forget a lot.


After that you can take the AI course on Udacity.


Take a look at the free online course on Statistical Reasoning from CMU:

http://oli.cmu.edu/courses/free-open/statistical-reasoning-c...

It includes interactive exercises with an option to do them in R.


How about not-beginners looking to refresh / deepen their intuitions?

I've recently been working with the Python toolset in this space -- pandas, numpy, matplotlib -- and run smack dab into my rusty regression analysis. In particular I need to better understand the distribution assumptions underlying the error distributions and the variances around the coefficient and intercept values.

Any suggestions for some deeper study / refresher?
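
For the question about coefficient and intercept variances, a minimal sketch of the textbook formulas for one-predictor least squares may help as a refresher. It assumes the classical setup (errors with mean zero, constant variance, uncorrelated); normality is only needed on top of that for exact t-based intervals. The function name and the sample numbers are hypothetical.

```python
import math

def simple_ols(x, y):
    """Fit y = intercept + slope * x and return classical standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Unbiased estimate of the error variance: two parameters were
    # estimated, so divide by n - 2 degrees of freedom.
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    # These variance formulas hold only under the homoskedastic,
    # uncorrelated-error assumptions stated above.
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
    }

# Made-up illustrative data.
fit = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(fit)
```

If the residuals fan out or trend with x, these standard errors are not trustworthy, which is where the deeper texts recommended in the replies come in.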


Depends on the data sets you want to work with. For straight-up linear regression, with a heavy emphasis on observational data appropriate for microeconometrics, "Introductory Econometrics: A Modern Approach" by Jeff Wooldridge is absolutely phenomenal (an old edition is fine). (This is usually assigned for advanced undergraduate econ majors or non-advanced masters students; I don't know what the equivalent would be for undergraduate stats majors).

For more "intuition about working with data, especially if you're a visual person," Howard Wainer's books are wonderful; one example is "Graphic Discovery: A Trout in the Milk and Other Visual Adventures." They're non-technical, short chapters, discussions of different data sets.

Bill Cleveland's "Visualizing Data" and "Elements of Graphing Data" cover the same material -- graphing data -- at a more technical level. I don't know whether Cleveland's books would help with the issues you asked about, but... they are amazing books and if you're interested in the subject at all I can't recommend them highly enough.

I don't have any free recommendations, unfortunately.


Seconding Wooldridge. It's the only economics text I'm keeping from university - I'm getting rid of all the rest. It really digs into the material and highlights the pitfalls and incorrect assumptions in regression and forecasting. I'm planning to consult it when I start working on analytics in current/future projects.


Honestly, the only way to understand regression is to study something like McCullagh & Nelder's book. Anything else and you are going to have a very hard time being really useful without misinterpreting the results. There are some real subtleties to the interpretation of regression coefficients and, more importantly, to structuring your data in such a way that you will answer the questions you want.

It's not an easy book, but if you've gone through to at least third year level in statistics it's approachable and you will understand it to a deep level.


If you can follow the maths of things like Poisson distributions, the central-limit theorem, etc., then it's the modelling issues that are tricky.

Take a look at Dudley's Statistics for Applications from MIT's Open Courseware - http://ocw.mit.edu/courses/mathematics/18-443-statistics-for... - it is supposed to be very thorough and easy to follow, but be warned, it's based on an expensive textbook.


Gelman and Hill is a nice book organized specifically around regression.


Gelman and Hill is a wonderful and under-rated book. My guess is that its clumsy title (Data Analysis Using Regression and Multilevel/Hierarchical Models) hides the fact that it's an introductory textbook that takes the reader from knowing nothing to eventually constructing complex Bayesian models. Plus, it's a pretty good tutorial on R and BUGS.


While it is predominantly a statistics language, there is also a huge wealth of data manipulation capabilities in packages and functions like plyr, aggregate, *apply, ave, subset, etc.

Just in terms of organizing data sets, ignoring any statistical analysis, R is fantastic.


I've found Python + Pandas much better in this regard than R. Maybe it's just me, but for grouping, indexing, and manipulating tabular data, Python syntax just makes more sense.

That said, R is better for stats and matrix operations.


Are you using Pandas? If so, your comment would be ironic because pandas borrows heavily from R ;)


They might have borrowed from R. Wes McKinney admits to being influenced by R, especially data frames... but it makes data analysis all the easier when I can do everything I want within the Python environment. pandas is proving to be a bit of a longer learning curve, I must admit, but then the Python environment and native Matplotlib support made life oh so much simpler.

That's just me though.


What has pandas borrowed from R, other than a 2D data structure with heterogeneously-typed columns?

I guess the data frame merge invocations are similar.

(I know patsy/statsmodels are introducing R's formula syntax to python, but that's not pandas.)


The split-apply-combine framework dealing with group by tasks (http://www.jstatsoft.org/v40/i01/paper, not that there aren't other precedents) for one. But generally, Wes has used R to figure out what people want to do, and then ported an elegant interface to python.
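
For readers who haven't met the term, split-apply-combine can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not how plyr or pandas implement it; the helper name and the rows are made up. In pandas the same idea is spelled roughly `df.groupby("species")["petal_length"].mean()`.

```python
from collections import defaultdict
from statistics import mean

def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's values, collect results."""
    groups = defaultdict(list)
    for row in rows:
        # Split: route each value to the bucket for its group.
        groups[row[key]].append(row[value])
    # Apply the summary function per group and combine into one mapping.
    return {group: func(values) for group, values in groups.items()}

# Made-up illustrative rows.
rows = [
    {"species": "setosa", "petal_length": 1.0},
    {"species": "virginica", "petal_length": 5.0},
    {"species": "setosa", "petal_length": 2.0},
    {"species": "virginica", "petal_length": 6.0},
]
result = split_apply_combine(rows, "species", "petal_length", mean)
print(result)
```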


I would agree with you if it wasn't for the data.table package in R. It is a game changer. Really.


Can you elaborate on data.table being 'a game changer'? I am inclined to agree, but I am just starting to get a handle on it. I am still hesitant and switching between sqldf, reshape2, base::merge and data.table more than I would like. Do you think it could become a dominant method for data preparation?


Python has PyTables which complements Pandas nicely and seems to offer the same sort of features as data.table (note, I've not actually used data.table)


I am using R to analyse and document (knitr and latex) epidemiologic data which does not involve parsing a lot of text to extract my analysis data set. Data preparation for this type of research involves more combining data from different source tables, restructuring repeated measures, etc. I only know how to do that using R. Can Python be incorporated into the knitr literate programming framework and is it worth learning another language?
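
As a rough idea of what that kind of preparation looks like outside R, here is a minimal plain-Python sketch of the two steps mentioned: restructuring repeated measures from wide to long, and combining with a second source table. The helper names, column names, and records are all hypothetical; in practice a library such as pandas would handle this.

```python
def wide_to_long(rows, id_key, measure_keys, var_name="visit", value_name="value"):
    """Turn one row per subject into one row per subject per measurement."""
    long_rows = []
    for row in rows:
        for measure in measure_keys:
            long_rows.append(
                {id_key: row[id_key], var_name: measure, value_name: row[measure]}
            )
    return long_rows

def left_join(left, right, key):
    """Attach matching `right` columns to each `left` row (unique keys in `right`)."""
    lookup = {row[key]: row for row in right}
    # Left rows without a match are kept unchanged.
    return [{**row, **lookup.get(row[key], {})} for row in left]

# Made-up illustrative records.
patients = [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}]
visits_wide = [
    {"id": 1, "bp_visit1": 120, "bp_visit2": 118},
    {"id": 2, "bp_visit1": 135, "bp_visit2": 130},
]
visits_long = wide_to_long(visits_wide, "id", ["bp_visit1", "bp_visit2"])
merged = left_join(visits_long, patients, "id")
print(merged)
```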


Python will be better supported in knitr in future; for now it only has preliminary support: http://yihui.name/knitr/demo/engines/


Can you advise on good resources to learn statistics?


Udacity has an intro to statistics class that you might like.



