"Online" (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis?
Is there an algorithm to estimate the median, mode, skewness, and/or kurtosis of a set of values, but one that does NOT require storing all the values in memory at once?
I'd like to calculate the basic statistics:
- mean: arithmetic average
- variance: average of squared deviations from the mean
- standard deviation: square root of the variance
- median: value that separates the larger half of the numbers from the smaller half
- mode: most frequent value found in the set
- skewness: tl;dr
- kurtosis: tl;dr
The basic formulas for calculating any of these are grade-school arithmetic, and I do know them. There are also many stats libraries that implement them.
My problem is the large number (billions) of values in the sets I'm dealing with: working in Python, I can't just make a list or hash with billions of elements. Even if I wrote this in C, billion-element arrays aren't very practical.
The data is not sorted. It's produced randomly, on the fly, by other processes. The size of each set is highly variable, and the sizes will not be known in advance.
I've already figured out how to handle the mean and variance pretty well, iterating through each value in the set in any order. (Actually, in my case, I take them in the order in which they're generated.) Here's the algorithm I'm using, courtesy of http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm (a short Python sketch follows the list below):
- Initialize three variables: count, sum, and sum_of_squares.
- For each value:
  - Increment count.
  - Add the value to sum.
  - Add the square of the value to sum_of_squares.
- Divide sum by count, storing it as the variable mean.
- Divide sum_of_squares by count, storing it as mean_of_squares.
- Square mean, storing it as square_of_mean.
- Subtract square_of_mean from mean_of_squares, storing it as variance.
- Output mean and variance.
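A minimal Python sketch of these steps (variable names are mine; "total" stands in for the "sum" variable above, since sum is a Python builtin):

def naive_mean_variance(values):
    count = 0
    total = 0.0
    sum_of_squares = 0.0
    for x in values:
        count += 1
        total += x
        sum_of_squares += x * x
    mean = total / count
    mean_of_squares = sum_of_squares / count
    square_of_mean = mean * mean
    variance = mean_of_squares - square_of_mean
    return mean, variance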
This "online" algorithm has weaknesses (e.g., accuracy problems as sum_of_squares quickly grows larger than the integer range or the floating-point precision), but it basically gives me what I need without having to store every value from each set.
But I don't know whether similar techniques exist for estimating the additional statistics (median, mode, skewness, kurtosis). I could live with a biased estimator, or even a method that compromises accuracy to some degree, as long as the memory required to process N values is substantially less than O(N).
Pointing me to an existing stats library will help, too, if that library has functions to calculate one or more of these quantities "online".
Skewness and kurtosis
For on-line algorithms for skewness and kurtosis (along the lines of the variance algorithm), see the parallel algorithms for higher-order statistics on the same wiki page.
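A Python sketch of the single-pass central-moment updates described on that page (class and method names are my own):

import math

class RunningMoments:
    # Single-pass (online) mean, variance, skewness and kurtosis.
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0   # running sum of squared deviations
        self.M3 = 0.0   # running third central moment (times n)
        self.M4 = 0.0   # running fourth central moment (times n)

    def push(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.M4 += (term1 * delta_n2 * (self.n * self.n - 3 * self.n + 3)
                    + 6 * delta_n2 * self.M2 - 4 * delta_n * self.M3)
        self.M3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.M2
        self.M2 += term1

    def variance(self):
        return self.M2 / self.n                              # population variance

    def skewness(self):
        return math.sqrt(self.n) * self.M3 / self.M2 ** 1.5

    def kurtosis(self):
        return self.n * self.M4 / (self.M2 * self.M2) - 3.0  # excess kurtosis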
Median
The median is tough without sorted data. If you know how many values you have, in theory you only have to partially sort, e.g. by using a selection algorithm. However, that doesn't help much with billions of values. I would suggest using frequency counts; see the next section.
Median and mode with frequency counts
If they are integers, I would count frequencies, probably cutting off the highest and lowest values beyond some point where I am sure they are no longer relevant. For floats (or too many integers), I would probably create buckets/intervals and then use the same approach as for integers. (Approximate) mode and median calculation then becomes easy, based on the frequency table.
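A rough Python sketch of that idea (bucket_width is an assumed tuning knob; choose it for the accuracy you need):

from collections import defaultdict

counts = defaultdict(int)        # bucket index -> frequency
bucket_width = 0.01              # assumed: pick for the accuracy you need

def add_value(x):
    counts[round(x / bucket_width)] += 1    # one counter per bucket, not per value

def approx_mode_and_median():
    mode_bucket = max(counts, key=counts.get)   # most frequent bucket
    total = sum(counts.values())
    running = 0
    for b in sorted(counts):                    # walk buckets in ascending order
        running += counts[b]
        if running >= total / 2:                # crossed the half-way point
            return mode_bucket * bucket_width, b * bucket_width

Memory use is proportional to the number of distinct buckets rather than to the number of values.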
Normally distributed random variables
If it is normally distributed, I would use the sample mean, variance, skewness, and kurtosis of a small subset as maximum-likelihood estimators. The (on-line) algorithms to calculate those you already know by now. E.g. read in a couple of hundred thousand or a million data points, until your estimation error gets small enough. Just make sure that you pick randomly from your set (e.g. that you don't introduce a bias by picking the first 100,000 values). The same approach can also be used for estimating the mode and the median in the normal case (for both, the sample mean is an estimator).
Further comments
All the algorithms above can be run in parallel (including many sorting and selection algorithms, e.g. QuickSort and QuickSelect), if that helps.
I have always assumed (with the exception of the section on the normal distribution) that we talk about sample moments, median, and mode, not estimators for theoretical moments given a known distribution.
In general, sampling the data (i.e. only looking at a sub-set) should be pretty successful given the amount of data, as long as all observations are realizations of the same random variable (have the same distributions) and the moments, mode and median actually exist for this distribution. The last caveat is not innocuous. For example, the mean (and all higher moments) for the Cauchy Distribution do not exist. In this case, the sample mean of a "small" sub-set might be massively off from the sample mean of the whole sample.
I use these incremental/recursive mean and median estimators, which both use constant storage:
mean += eta * (sample - mean)
median += eta * sgn(sample - median)
where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)
This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.
I would love to see an incremental mode estimator of a similar form...
UPDATE
I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function (http://en.wikipedia.org/wiki/Quantile_function) tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:
quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)
The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.
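A minimal Python rendering of these update rules (eta, p, and the initial estimates are up to you):

def sgn(x):
    return (x > 0) - (x < 0)          # signum: one of -1, 0, 1

def update_mean(mean, sample, eta=0.001):
    return mean + eta * (sample - mean)

def update_quantile(quantile, sample, p=0.5, eta=0.001):
    # p = 0.5 reduces to the incremental median estimator
    return quantile + eta * (sgn(sample - quantile) + 2.0 * p - 1.0)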
I implemented the P-Square Algorithm for Dynamic Calculation of Quantiles and Histograms without Storing Observations in a neat Python module I wrote called LiveStats. It should solve your problem quite effectively. The library supports every statistic that you mention except for mode. I have not yet found a satisfactory solution for mode estimation.
Ryan, I'm afraid you are not doing the mean and variance right... This came up a few weeks ago here. One of the strong points of the online version (which actually goes by the name of Welford's method) is that it is especially accurate and stable; see the discussion here. Another strong point is that you do not need to store the total sum or total sum of squares...
I can't think of any on-line approach to the mode and median, which seem to require considering the whole list at once. But it may very well be that an approach similar to the one for the variance and mean will also work for the skewness and kurtosis...
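For reference, a minimal Python sketch of Welford's update (my own naming):

def welford(values):
    n = 0
    mean = 0.0
    M2 = 0.0                            # running sum of squared deviations
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    variance = M2 / n                   # use M2 / (n - 1) for the sample variance
    return mean, variance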
The Wikipedia article quoted in the question contains the formulas for calculating skewness and kurtosis on-line.
For the mode, I believe there is no way of doing this on-line. Why? Assume that all values of your input are different, except for the last one, which duplicates a previous value. In this case you have to remember all the values already seen in the input in order to detect that the last value duplicates a value seen before, making it the most frequent one.
For the median it is almost the same: up to the last input you don't know which value will become the median if all input values are different, because it could be before or after the current median. If you know the length of the input, you can find the median without storing all values in memory, but you will still have to store many of them (I guess around half), because a bad input sequence could shift the median heavily in the second half, possibly making any value from the first half the median.
(Note that I am referring to exact calculation only.)
If you have billions of data points, then it's not likely that you need exact answers, as opposed to close answers. Generally, if you have billions of data points the underlying process which generates them will likely obey some kind of statistical stationarity / ergodicity / mixing property. Also it may matter whether you expect the distributions to be reasonably continuous or not.
In these circumstances, there exist algorithms for on-line, low memory, estimation of quantiles (the median is a special case of 0.5 quantile), as well as modes, if you don't need exact answers. This is an active field of statistics.
quantile estimation example: http://www.computer.org/portal/web/csdl/doi/10.1109/WSC.2006.323014
mode estimation example: Bickel DR. Robust estimators of the mode and skewness of continuous data. Computational Statistics and Data Analysis. 2002;39:153–163. doi: 10.1016/S0167-9473(01)00057-3.
These are active fields of computational statistics. You are getting into the fields where there isn't any single best exact algorithm, but a diversity of them (statistical estimators, in truth), which have different properties, assumptions and performance. It's experimental mathematics. There are probably hundreds to thousands of papers on the subject.
The final question is whether you really need skewness and kurtosis by themselves, or more likely some other parameters which may be more reliable at characterizing the probability distribution (assuming you have a probability distribution!). Are you expecting a Gaussian?
Do you have ways of cleaning/preprocessing the data to make it mostly Gaussianish? (for instance, financial transaction amounts are often somewhat Gaussian after taking logarithms). Do you expect finite standard deviations? Do you expect fat tails? Are the quantities you care about in the tails or in the bulk?
Everyone keeps saying that you can't compute the mode in an online manner, but that is simply not true. Here is an article describing an algorithm for just this problem, invented in 1982 by Michael E. Fischer and Steven L. Salzberg of Yale University. From the article:
The majority-finding algorithm uses one of its registers for temporary storage of a single item from the stream; this item is the current candidate for majority element. The second register is a counter initialized to 0. For each element of the stream, we ask the algorithm to perform the following routine. If the counter reads 0, install the current stream element as the new majority candidate (displacing any other element that might already be in the register). Then, if the current element matches the majority candidate, increment the counter; otherwise, decrement the counter. At this point in the cycle, if the part of the stream seen so far has a majority element, that element is in the candidate register, and the counter holds a value greater than 0. What if there is no majority element? Without making a second pass through the data—which isn't possible in a stream environment—the algorithm cannot always give an unambiguous answer in this circumstance. It merely promises to correctly identify the majority element if there is one.
It can also be extended to find the top N with more memory but this should solve it for the mode.
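A Python sketch of the majority-vote routine described in the quote (constant memory: one candidate and one counter):

def majority_candidate(stream):
    candidate = None
    count = 0
    for x in stream:
        if count == 0:
            candidate = x               # install a new majority candidate
        if x == candidate:
            count += 1
        else:
            count -= 1
    # candidate is the majority element if one exists; otherwise the answer is
    # ambiguous and would need a second pass (not possible on a pure stream).
    return candidate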
Ultimately if you have no a priori parametric knowledge of the distribution I think you have to store all the values.
That said, unless you are dealing with some sort of pathological situation, the remedian (Rousseeuw and Bassett 1990) may well be good enough for your purposes.
Very simply, it involves calculating the median of batches of medians.
Median and mode can't be calculated online using only constant space. However, because median and mode are anyway more "descriptive" than "quantitative", you can estimate them, e.g. by sampling the data set.
If the data is normally distributed in the long run, then you could just use your mean to estimate the median.
You can also estimate median using the following technique: establish a median estimation M[i] for every, say, 1,000,000 entries in the data stream so that M[0] is the median of the first one million entries, M[1] the median of the second one million entries etc. Then use the median of M[0]...M[k] as the median estimator. This of course saves space, and you can control how much you want to use space by "tuning" the parameter 1,000,000. This can be also generalized recursively.
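A small Python sketch of that batch-of-medians idea (block_size is the tunable parameter mentioned above):

import statistics

def batched_median(stream, block_size=1000000):
    block_medians = []
    block = []
    for x in stream:
        block.append(x)
        if len(block) == block_size:
            block_medians.append(statistics.median(block))   # keep only the block median
            block = []
    if block:                                                 # last, possibly partial block
        block_medians.append(statistics.median(block))
    return statistics.median(block_medians)

Memory use is bounded by the block size plus the number of blocks, and the scheme can be applied recursively to the block medians themselves.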
OK dude, try these:
For C++:
double skew(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);      // standard deviation
    double mu = avg(v, n);                    // mean
    double* t = new double[n];
    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu) / sigma, 3);   // cubed standardized deviation
    }
    double ret = avg(t, n);
    delete [] t;
    return ret;
}

double kurt(double* v, unsigned long n){
    double sigma = pow(svar(v, n), 0.5);      // standard deviation
    double mu = avg(v, n);                    // mean
    double* t = new double[n];
    for(unsigned long i = 0; i < n; ++i){
        t[i] = pow((v[i] - mu) / sigma, 4) - 3;   // excess-kurtosis contribution
    }
    double ret = avg(t, n);
    delete [] t;
    return ret;
}
Since you say you can already calculate the sample variance (svar) and average (avg), point those at your own functions for doing that.
Also, have a look at Pearson's approximation: 3 * (mean − median) / standard deviation. On such a large dataset it would be pretty similar. As a rough stand-in you have the median as (max − min) / 2.
For floats the mode has no meaning. One would typically stick them into bins of a significant size (like 1/100 * (max − min)).
This problem was solved by Pebay et al:
https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2008/086212.pdf
I would tend to use buckets, which could be adaptive. The bucket size should be the accuracy you need. Then as each data point comes in you add one to the relevant bucket's count. These should give you simple approximations to median and kurtosis, by counting each bucket as its value weighted by its count.
The one problem could be loss of resolution in floating point after billions of operations, i.e. adding one does not change the value any more! To get round this, if the maximum bucket size exceeds some limit you could take a large number off all the counts.
# Generates M sample paths of an AR(1) process of length T with Gaussian noise.
# (M, T, y0, var, and corr are assumed to be defined elsewhere; the vector
# length is taken to be T to match the inner loop.)
import numpy as np
import numpy.random as npr

yy = []
for j in range(1, M):
    y = np.zeros(T)                          # build the vector y
    y[0] = y0
    # generate the white noise
    eps = npr.randn(T - 1) * np.sqrt(var)
    # increment the y vector
    for k in range(1, T):
        y[k] = corr * y[k - 1] + eps[k - 1]
    yy.append(y)