Posted on Thursday, 26th January 2012
Please read sections 4.3 and 5.4-5.5, and post a comment.
As always, students without a strong math background may skim
the more technical material and try to focus on the concepts.
Posted in Class | Comments (18)

January 29th, 2012 at 2:49 pm
I am still unsure, from the details section on pg 111, why the function psi in the KL divergence is necessarily logarithmic.
January 30th, 2012 at 3:42 pm
In experiments, are there examples of real data that follow a beta distribution?
January 30th, 2012 at 4:33 pm
For the definition of entropy, I was wondering why base 2 is used; I’ve always used the natural log. I’ve often seen Boltzmann’s constant used as well.
Is there any use of Gibbs free energy or enthalpy in statistics?
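To convince myself that the base is only a choice of units, I ran a quick numeric check (a sketch in Python with numpy; the four-outcome distribution is something I made up): switching from log base 2 to the natural log just rescales the entropy by a constant factor of ln 2 (bits vs. nats).

```python
import numpy as np

# Made-up discrete distribution over 4 outcomes, just for illustration
p = np.array([0.5, 0.25, 0.125, 0.125])

H_bits = -np.sum(p * np.log2(p))   # entropy in bits (base 2)
H_nats = -np.sum(p * np.log(p))    # entropy in nats (natural log)

print(H_bits)                      # 1.75
print(H_nats)                      # about 1.213
print(H_nats / np.log(2))          # equals H_bits: the two bases differ only by a constant factor
```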
January 30th, 2012 at 9:16 pm
I appreciated the section on mutual information as a way to describe how uncertainty about one variable can be reduced by knowing about another variable, beyond the linear relationship captured by correlation. That said, I don’t know how I would talk about mutual information in an experimental setting. What statistics are used to report mutual information? Is there a standard for what level of mutual information is “enough” to be meaningful?
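For concreteness, here is the kind of plug-in calculation I had in mind (a sketch in numpy; the data and the 20-bin discretization are made up): estimate the joint distribution from a 2-D histogram and compute I(X;Y) = sum p(x,y) log2[ p(x,y) / (p(x)p(y)) ]. With Y roughly equal to X^2 the correlation is near zero but the mutual information is clearly positive, which is the “beyond linear” point I was referring to.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = x**2 + 0.1 * rng.normal(size=10000)   # nonlinear dependence, correlation near 0

# Plug-in mutual information estimate from a 2-D histogram (bin count chosen arbitrarily)
counts, _, _ = np.histogram2d(x, y, bins=20)
pxy = counts / counts.sum()               # estimated joint distribution
px = pxy.sum(axis=1, keepdims=True)       # marginal of x
py = pxy.sum(axis=0, keepdims=True)       # marginal of y
nz = pxy > 0
mi_bits = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

print(np.corrcoef(x, y)[0, 1])            # close to 0
print(mi_bits)                            # clearly greater than 0
```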
January 30th, 2012 at 11:34 pm
Is the spectral decomposition mentioned in section 4.3.1 (A = PDP^T) the same as projecting the data from matrix A onto the eigenvectors?
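Here is how I currently read it (a numeric sketch in numpy; the data are made up): the spectral decomposition itself is just a factorization of the matrix A, and projecting the centered data onto the columns of P (the eigenvectors) is a separate step that uses that factorization.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
A = np.cov(X, rowvar=False)          # sample covariance matrix of the data

# Spectral decomposition A = P D P^T: a factorization of the matrix itself
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)
print(np.allclose(A, P @ D @ P.T))   # True

# Projecting the centered data onto the eigenvectors is a separate operation
Z = (X - X.mean(axis=0)) @ P         # coordinates of the data in the eigenvector basis
print(np.allclose(np.cov(Z, rowvar=False), D))  # variances along the eigenvectors match D
```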
January 31st, 2012 at 12:43 am
I have never seen the Beta distribution before. I don’t quite understand what it looks like and when it might be used. Could you give a concrete example?
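In case it helps anyone else, here is the kind of toy example I pieced together (a sketch using scipy.stats; the parameter values are arbitrary). The Beta distribution lives on [0, 1], so it tends to come up for quantities that are themselves probabilities or proportions.

```python
import numpy as np
from scipy.stats import beta

# Evaluate a few Beta(a, b) densities on [0, 1]; the parameter choices are arbitrary
x = np.linspace(0.01, 0.99, 5)
for a, b in [(1, 1), (2, 5), (5, 2), (0.5, 0.5)]:
    print((a, b), np.round(beta.pdf(x, a, b), 3))
# Beta(1, 1) is flat (uniform); Beta(2, 5) is skewed toward 0; Beta(5, 2) toward 1;
# Beta(0.5, 0.5) is U-shaped, piling up near the endpoints.
```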
January 31st, 2012 at 1:50 am
Can you talk about spectral decomposition? What is an example of data for which you would analyze the variance matrix using spectral decomposition rather than other methods?
January 31st, 2012 at 4:03 am
I think graphs of the PDFs of all the new distributions would have helped me understand their uses more intuitively.
January 31st, 2012 at 6:58 am
In the KL discrepancy, it seems strange to me that you’re evaluating the deviation between two pdfs by using some random variable X from some third, unknown distribution. I would think that the value of D_KL(f,g) would depend on your choice of X. What am I missing? … (reading on) … Ah! In your illustration of two normal distributions, it appears as though the pdf of X should be f. Is that correct?
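To check that reading, I tried a quick numerical experiment (a sketch with numpy and scipy; the two normals are ones I made up): if the expectation in D_KL(f, g) is taken with X drawn from f, then a Monte Carlo average of log f(X) - log g(X) with X ~ f should match the closed-form value for two normals, and it does.

```python
import numpy as np
from scipy.stats import norm

m1, s1 = 0.0, 1.0      # f = N(m1, s1^2)
m2, s2 = 1.0, 2.0      # g = N(m2, s2^2)

# Monte Carlo: E_f[ log f(X) - log g(X) ] with X drawn from f
rng = np.random.default_rng(0)
x = rng.normal(m1, s1, size=200000)
kl_mc = np.mean(norm.logpdf(x, m1, s1) - norm.logpdf(x, m2, s2))

# Closed form for two normals (in nats)
kl_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_mc, kl_exact)   # both approximately 0.44
```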
January 31st, 2012 at 8:51 am
Could you please go over the section where you talk about using spectral decomposition to analyze a covariance matrix?
January 31st, 2012 at 9:01 am
I had thought the geometric distribution also had the memoryless property – is this the case? Also, I’m curious: does the geometric distribution apply to any neural phenomena?
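As a partial answer to my own first question, a quick check (arbitrary p, t, and h) using P(X > k) = (1 - p)^k for a geometric random variable suggests that it is indeed memoryless:

```python
p, t, h = 0.3, 4, 3            # arbitrary success probability and integer times

surv = lambda k: (1 - p) ** k  # P(X > k) for a geometric random variable (trials to first success)

lhs = surv(t + h) / surv(t)    # P(X > t + h | X > t)
rhs = surv(h)                  # P(X > h)
print(lhs, rhs)                # equal: the geometric distribution is memoryless
```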
January 31st, 2012 at 9:03 am
I don’t follow the logic at the top of page 120 for the proof in 4.3.4. Don’t we have to specify which x (x1 or x2) is classified to f(x) in order to minimize the error, instead of a generic x? f(x2) > g(x2), so does that mean that for x2 we should always classify x2 as f(x)? What about x1? Do we specify x1 as g(x), since g(x1) > f(x1)?
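My current guess is that the rule is applied pointwise: at each x you compare f(x) with g(x) and classify to whichever density is larger (assuming equal priors), so x2 would go to f and x1 to g. A toy check of that reading (a sketch with scipy; the two densities are made up):

```python
from scipy.stats import norm

# Two made-up class densities, f = N(0, 1) and g = N(2, 1), with equal priors
f = lambda x: norm.pdf(x, 0, 1)
g = lambda x: norm.pdf(x, 2, 1)

for x in [-0.5, 0.5, 1.5, 2.5]:
    label = "f" if f(x) > g(x) else "g"          # pointwise comparison of the densities
    print(x, round(f(x), 3), round(g(x), 3), "-> classify as", label)
# Points left of 1 go to f, points right of 1 go to g; the rule is applied at each x separately.
```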
January 31st, 2012 at 9:21 am
The KL discrepancy seems like a very useful tool for considering differences in distributions. I’m not sure exactly what requirements the distributions must satisfy in order to apply it; could you explain them in more depth?
January 31st, 2012 at 9:21 am
Could you also go over how the inverse Gaussian distribution relates to the integrate-and-fire neuron model?
January 31st, 2012 at 9:23 am
In the description of the “memoryless” property – considering the probability that a channel will remain open for the interval h – I’m not sure why the length of time a channel will remain open starting at time t must be greater than t (X > t). It would make more sense to me, given the discussion, that X > h, so that P(X > t+h | X > t) = P(X > h). Is the X > t requirement a typo?
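While waiting for an answer, I worked through the algebra numerically (arbitrary rate and times): if X is the total time the channel stays open, then “still open at time t” is the event X > t, and conditioning on it gives P(X > t + h | X > t) = P(X > t + h) / P(X > t) = P(X > h) in the exponential case.

```python
import numpy as np

lam, t, h = 2.0, 1.5, 0.7                 # arbitrary rate and times

surv = lambda s: np.exp(-lam * s)         # P(X > s) for an exponential random variable

lhs = surv(t + h) / surv(t)               # P(X > t + h | X > t): still open at t, stays open h longer
rhs = surv(h)                             # P(X > h): a "fresh" channel staying open for h
print(lhs, rhs)                           # equal, which is the memoryless property
```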
January 31st, 2012 at 9:32 am
Using the Kullback-Leibler discrepancy to quantify the dependence between two random vectors, it seems that there is a requirement that the variances be the same; is this true? Or is that only the case for the two normal distributions used in the illustration?
January 31st, 2012 at 9:43 am
5.4.7 discusses the degrees of freedom needed for the t and F distributions to be approximately normal (12 is specified for t, but for F it just says when F is large). I remember from learning the Central Limit Theorem that an experiment with over 30 observations (29 degrees of freedom) can be treated as approximately normal. In practice, does this mean that t-tests in experiments with between 12 and 30 participants would give similarly valid results (because the t distribution is approximately normal) even though the sample is not yet large enough to invoke the CLT? If so, does this fact have any practical significance for doing hypothesis tests in experiments with small n?
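To get a rough sense of how close “approximately normal” is in practice, I compared the .975 quantiles of the t distribution with the normal one (a sketch using scipy; the degrees of freedom are just a few values around the ones mentioned above):

```python
from scipy.stats import t, norm

q = 0.975
print("normal:", round(norm.ppf(q), 3))          # about 1.96
for df in [5, 12, 29, 100]:
    print("t, df =", df, ":", round(t.ppf(q, df), 3))
# df = 5: 2.571, df = 12: 2.179, df = 29: 2.045, df = 100: 1.984
# so by df around 12 the t quantile is within about 0.22 of the normal value, and the gap keeps shrinking
```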
January 31st, 2012 at 9:46 am
On page 114 at the top, shouldn’t we say that information about Y is associated with X whenever abs(rho) > 0?
We say that Bayes’ classifiers have the lowest expected number of misclassifications. However, I see other classifiers such as LDA being used. Why might we prefer something other than an “optimal” classifier?