c c Code to form POLEC and Gaussian-based frequency polygons. The calling c program takes as input a file with the data in it, as well as the sample c size, and outputs to a file the mid-bin values and value of the density c estimate at those values, for each of the two frequency polygons. The c frequency polygon is formed by drawing straight lines connecting these c points. c c The Gaussian-based frequency polygon differs from that used in Simonoff c and Hurvich (1993) in two ways. First, the anchor position of the estimate c is taken to be immediately below the smallest sample value; a slightly c different rule was used in Simonoff and Hurvich, but the resultant c estimators are virtually identical. c c A more important difference is in the estimation of the underlying standard c deviation needed to choose the Gaussian-based bin width. In this code, c the scale estimate specifically derived for use in density estimation by c Janssen, Marron, Veraverbeke and Sarle (1995) is used, rather than the c Silverman-based ad hoc rule. The estimator based on the Janssen et al. c scale estimate has similar ISE and RISE properties to that used in Simonoff c and Hurvich for the densities examined there, but further Monte Carlo c simulation indicates that it is considerably more accurate for densities c with more pronounced multimodal structure. For this reason, I recommend c using the Janssen et al. scale estimate, as is done here. c c The code is written to accomodate samples up to size 1000. This can c be increased by increasing the dimension of x( ) in the calling c program. The given limit is 300 cells for a frequency polygon. In c the unlikely event that it is desired to increase this maximum (to c say the number KBIG), the following steps should be taken: c set the dimension of c to KBIG+2 in the calling program c set the dimension of t to KBIG+2 in the calling program c change the data/maxk/ setting to KBIG in the calling program c change the parameter(mmk=) setting to KBIG in the subroutines c vfplscv c splscvfp c gaussfp c c References: c (1) J.S. Simonoff and C.M. Hurvich (1993), "A study of the effectiveness c of simple density estimation methods," Computational c ------------- c Statistics, 8, 259--278. c ---------- c c (2) Janssen, P., Marron, J.S., Veraverbeke, N., and Sarle, W. (1995), c "Scale measures for bandwidth selection," Journal of c ---------- c Nonparametric Statistics, 5, 359-380. c ------------------------ c c Code written by Jeffrey S. Simonoff (jsimonoff@stern.nyu.edu). Last c revision of code: October 30, 1993. c dimension x(1000),c(302),t(302) character*80 infile,outfile data maxk/300/ xmaxk=float(maxk) write(*,*) 'Input file for data?' read(*,10) infile 10 format(a80) write(*,*) 'Output file for data?' read(*,10) outfile write(*,*) 'Sample size?' read(*,*) n xn=float(n) open(19,file=infile,status='old') open(20,file=outfile,status='unknown') write(20,20) infile 20 format('Input data file is ',a80) write(20,*) ' ' do 100 i=1,n read(19,*) x(i) 100 continue call sort(x,n) epsilon=(x(n)-x(1))*1.e-6 b1=x(1)-epsilon rng=x(n)+epsilon-b1 xmn=0. sd=0. do 110 i=1,n xi=float(i) dx=(x(i)-xmn)/xi xmn=xmn+dx sd=sd+dx*dx*xi*(xi-1.) 110 continue sd=sqrt(sd/(xn-1.)) ss=sscale(x,n,.2,sd) call vfplscv(x,n,c,t,it,maxk) write(20,*) 'Mid-bin values and density estimate values' write(20,*) write(20,*) 'POLEC variable width frequency polygon' do 120 i=1,it+2 120 write(20,*) c(i),t(i) hlnrm=2.15*ss*(xn**(-.2)) if ((rng/hlnrm).gt.xmaxk) hlnrm=rng/xmaxk call gaussfp(x,n,hlnrm,c,t,it,epsilon) write(20,*) ' ' write(20,*) ' ' write(20,*) 'Gaussian-based fixed bin frequency polygons' do 130 i=1,it+2 130 write(20,*) c(i),t(i) stop end subroutine vfplscv(x,n,c,t,it,maxk) c c Subroutine to form POLEC frequency polygon. Interval of interest is c split until the CV criterion is not reduced; then, each subinterval c is split until the CV criterion is not reduced. c parameter(mmk=300) dimension x(n),b(mmk+1),z(mmk),bb(mmk+1),h(mmk),c(*),t(*) b(1)=x(1)-(x(n)-x(1))/50. c(1)=b(1)-(x(n)-b(1))/100. b(2)=x(n) it=1 xmlscv=1.e30 kl=1 ku=2 call splscvfp(x,n,h,z,b,c,t,it,xmlscv,kl,ku,maxk) ifspl=it do 50 i=1,it+1 bb(i)=b(i) 50 continue do 100 i=1,ifspl kl=i ku=i+1 call splscvfp(x,n,h,z,bb,c,t,it,xmlscv,kl,ku,maxk) 100 continue call count(x,n,h,z,bb,it) call setfp(float(n),bb,h,z,c,t,it) return end subroutine splscvfp(x,n,h,z,b,c,t,it,xmlscv,kl,ku,maxk) c c Subroutine to do interval splitting and checking in construction of c POLEC frequency polygon c parameter(mmk=300) dimension x(n),h(*),z(*),b(*),bb(mmk+1),back(mmk+1),c(*),t(*) xn=float(n) kmax=it+1 ibit=it do 50 i=1,kmax back(i)=b(i) 50 continue do 100 i=1,kl bb(i)=b(i) 100 continue rng=b(ku)-b(kl) ii=2 xi=2. 110 iit=it+ii-1 if (iit.gt.maxk) goto 140 bw=rng/xi do 120 i=1,ii-1 bb(kl+i)=b(kl)+float(i)*bw 120 continue do 130 i=ku,kmax bb(i+ii-1)=b(i) 130 continue call count(x,n,h,z,bb,iit) call setfp(xn,bb,h,z,c,t,iit) xlscv=clscvfp(x,n,bb,z,c,t,iit) if (xlscv.lt.xmlscv) then ii=ii+1 xi=xi+1. xmlscv=xlscv ibit=iit do 135 i=1,iit+1 back(i)=bb(i) 135 continue goto 110 endif 140 do 150 i=1,ibit+1 b(i)=back(i) 150 continue it=ibit return end subroutine count(x,n,h,z,b,it) c c Subroutine to count the number of observations that fall in each c cell. c dimension x(n),h(it),b(it+1),z(it) do 100 i=1,it h(i)=0. z(i)=b(i+1)-b(i) 100 continue i=1 do 200 j=1,n 110 if (x(j).le.b(i+1)) then h(i)=h(i)+1. else i=i+1 goto 110 endif 200 continue return end subroutine setfp(xn,b,h,z,c,t,it) c c Subroutine to form frequency polygon based on given bin edges and c counts of observations. c dimension b(it+1),h(it),z(it),c(it+2),t(it+2) do 100 i=2,it+1 c(i)=(b(i-1)+b(i))/2. 100 continue c(it+2)=b(it+1)+b(1)-c(1) do 110 i=2,it+1 t(i)=h(i-1)/(xn*z(i-1)) 110 continue t(1)=0. t(it+2)=0. return end function clscvfp(x,n,b,z,c,t,it) c c CV function for a frequency polygon. c dimension z(it),x(n),b(it+1),c(it+2),t(it+2) xn=float(n) clscvfp=0. do 50 j=1,n do 30 i=2,it+1 if (x(j).gt.b(i)) goto 30 if (x(j).le.c(i)) then tl=t(i-1)*xn/(xn-1.) tu=(t(i)*xn*z(i-1)-1.)/((xn-1.)*z(i-1)) clscvfp=clscvfp-(tl*c(i)-tu*c(i-1)+(tu-tl)*x(j))/(c(i)-c(i-1)) else tl=(t(i)*xn*z(i-1)-1.)/((xn-1.)*z(i-1)) tu=t(i+1)*xn/(xn-1.) clscvfp=clscvfp-(tl*c(i+1)-tu*c(i)+(tu-tl)*x(j))/(c(i+1)-c(i)) endif goto 50 30 continue 50 continue clscvfp=2.*clscvfp/xn do 100 i=1,it+1 al=(c(i+1)*t(i)-c(i)*t(i+1))/(c(i+1)-c(i)) bet=(t(i+1)-t(i))/(c(i+1)-c(i)) if (bet.eq.0.) then clscvfp=clscvfp+al*al*(c(i+1)-c(i)) else clscvfp=clscvfp+((al+bet*c(i+1))**3-(al+bet*c(i))**3)/(3.*bet) endif 100 continue return end subroutine gaussfp(x,n,hlnrm,c,t,it,epsilon) c c Subroutine to form Gaussian-based frequency polygon. The anchor c is chosen to be just below the smallest sample value. In the very c unlikely event of extremely long-tailed data causing trouble, the c dimension of b can be increased past 300. c parameter(mmk=300) dimension x(n),b(mmk+1),c(*),t(*),h(mmk),z(mmk) b(1)=x(1)-epsilon c(1)=b(1)-hlnrm/2. it=1 100 b(it+1)=b(it)+hlnrm if (b(it+1).ge.x(n)) then go to 110 else it=it+1 go to 100 end if 110 call count(x,n,h,z,b,it) call setfp(float(n),b,h,z,c,t,it) return end subroutine sort(ra,n) C c Subroutine to sort entries 1 through n in ascending order from a c vector ra, using heapsort C dimension ra(n) l=n/2+1 ir=n 10 if (l.gt.1) then l=l-1 rra=ra(l) else rra=ra(ir) ra(ir)=ra(1) ir=ir-1 if (ir.eq.1) then ra(1)=rra return end if end if i=l j=l+l 20 if (j.le.ir) then if (j.lt.ir) then if (ra(j).lt.ra(j+1)) j=j+1 end if if (rra.lt.ra(j)) then ra(i)=ra(j) i=j j=j+j else j=ir+1 end if go to 20 end if ra(i)=rra go to 10 end function sscale(x,n,beta,s) c c Routine to calculate scale measure ss from "Scale measures for bandwidth c selection," by P. Janssen, J.S. Marron, N. Veraverbeke and W. Sarle (1995). c c x is data, sorted in ascending order; beta is one-half of the "data window" c width - it is typically taken to be .2; s is the sample standard deviation. c dimension x(n) double precision t1,ppnd16 xn=float(n) frac=2.*beta nf=ifix((xn-1.)*frac+.5) c c Calculation of the "first" scale estimate, d1 c rd1=1.e30 do 100 j=1,n-nf rd1j=x(j+nf)-x(j) if (rd1j.lt.rd1) then rd1=rd1j icent=j end if 100 continue t1=ppnd16(dble(.5+beta),ifault) xmu1=2.*t1-(4.**(-.875))*((xn/25.)**(-2./3.)) d1=rd1/xmu1 c c Calculation of third differences, to get second scale estimate, d3 c phi0=.3989422804 dd1=d1*frac/phi0 xkap=.001934+.2832*beta+.3803*beta*beta-1.542*(beta**3) icent=ifix(float(icent)+float(nf)/2.+.5) cfrac=float(icent-1)/(xn-1.) coeff=(2.*beta)/(beta+xkap) is=ifix(1.5+(xn-1.)*(cfrac+beta)) d3=x(is) is=ifix(1.5+(xn-1.)*(cfrac+xkap)) ie=ifix(1.5+(xn-1.)*(cfrac+beta)) xmn=submean(x,is,ie,n) d3=d3-coeff*xmn ie=ifix(1.5+(xn-1.)*(cfrac-xkap)) is=ifix(1.5+(xn-1.)*(cfrac-beta)) xmn=submean(x,is,ie,n) d3=d3+coeff*xmn is=ifix(1.5+(xn-1.)*(cfrac-beta)) d3=d3-x(is) if (d3.le.0.) then wt=1. else wt=(sqrt(dd1/(3.*d3))*2.*beta)/(3.*phi0) end if w1=min(wt,1.) w1=max(w1,2.*beta) d3=(w1**(.8))*d1 c c Final scale measure is minimum of d3 and s c sscale=min(d3,s) return end function submean(x,is,ie,n) c c Function that returns the mean of the values in the cells indexed c from is through ie in the vector x. c dimension x(n) submean=0. xis=float(ie-is+1) do 100 i=is,ie submean=submean+x(i) 100 continue submean=submean/xis return end double precision function ppnd16 (p, ifault) c c algorithm as241 appl. statist. (1988) vol. 37, no. 3 c c produces the normal deviate z corresponding to a given lower c tail area of p; z is accurate to about 1 part in 10**16. c c the hash sums below are the sums of the mantissas of the c coefficients. they are included for use in checking c transcription. c double precision zero, one, half, split1, split2, const1, * const2, a0, a1, a2, a3, a4, a5, a6, a7, b1, b2, b3, * c0, c1, c2, c3, c4, c5, c6, c7, d1, d2, d3, d4, d5, * d6, d7, e0, e1, e2, e3, e4, e5, e6, e7, f1, f2, f3, * f4, f5, f6, f7, p, q, r, b4, b5, b6, b7 parameter (zero = 0.d0, one = 1.d0, half = 0.5d0, * split1 = 0.425d0, split2 = 5.d0, * const1 = 0.180625d0, const2 = 1.6d0) c c coefficients for p close to 0.5 c parameter (a0 = 3.38713 28727 96366 6080d0, * a1 = 1.33141 66789 17843 7745d+2, * a2 = 1.97159 09503 06551 4427d+3, * a3 = 1.37316 93765 50946 1125d+4, * a4 = 4.59219 53931 54987 1457d+4, * a5 = 6.72657 70927 00870 0853d+4, * a6 = 3.34305 75583 58812 8105d+4, * a7 = 2.50908 09287 30122 6727d+3, * b1 = 4.23133 30701 60091 1252d+1, * b2 = 6.87187 00749 20579 0830d+2, * b3 = 5.39419 60214 24751 1077d+3, * b4 = 2.12137 94301 58659 5867d+4, * b5 = 3.93078 95800 09271 0610d+4, * b6 = 2.87290 85735 72194 2674d+4, * b7 = 5.22649 52788 52854 5610d+3) c hash sum ab 55.88319 28806 14901 4439 c c coefficients for p not close to 0, 0.5 or 1. c parameter (c0 = 1.42343 71107 49683 57734d0, * c1 = 4.63033 78461 56545 29590d0, * c2 = 5.76949 72214 60691 40550d0, * c3 = 3.64784 83247 63204 60504d0, * c4 = 1.27045 82524 52368 38258d0, * c5 = 2.41780 72517 74506 11770d-1, * c6 = 2.27238 44989 26918 45833d-2, * c7 = 7.74545 01427 83414 07640d-4, * d1 = 2.05319 16266 37758 82187d0, * d2 = 1.67638 48301 83803 84940d0, * d3 = 6.89767 33498 51000 04550d-1, * d4 = 1.48103 97642 74800 74590d-1, * d5 = 1.51986 66563 61645 71966d-2, * d6 = 5.47593 80849 95344 94600d-4, * d7 = 1.05075 00716 44416 84324d-9) c hash sum cd 49.33206 50330 16102 89036 c c coefficients for p near 0 or 1. c parameter (e0 = 6.65790 46435 01103 77720d0, * e1 = 5.46378 49111 64114 36990d0, * e2 = 1.78482 65399 17291 33580d0, * e3 = 2.96560 57182 85048 91230d-1, * e4 = 2.65321 89526 57612 30930d-2, * e5 = 1.24266 09473 88078 43860d-3, * e6 = 2.71155 55687 43487 57815d-5, * e7 = 2.01033 43992 92288 13265d-7, * f1 = 5.99832 20655 58879 37690d-1, * f2 = 1.36929 88092 27358 05310d-1, * f3 = 1.48753 61290 85061 48525d-2, * f4 = 7.86869 13114 56132 59100d-4, * f5 = 1.84631 83175 10054 68180d-5, * f6 = 1.42151 17583 16445 88870d-7, * f7 = 2.04426 31033 89939 78564d-15) c hash sum ef 47.52583 31754 92896 71629 c ifault = 0 q = p - half if (abs(q) .le. split1) then r = const1 - q * q ppnd16 = q * (((((((a7 * r + a6) * r + a5) * r + a4) * r + a3) * * r + a2) * r + a1) * r + a0) / * (((((((b7 * r + b6) * r + b5) * r + b4) * r + b3) * * r + b2) * r + b1) * r + one) return else if (q .lt. zero) then r = p else r = one - p end if if (r .le. zero) then ifault = 1 ppnd16 = zero return end if r = sqrt(-log(r)) if (r .le. split2) then r = r - const2 ppnd16 = (((((((c7 * r + c6) * r + c5) * r + c4) * r + c3) * * r + c2) * r + c1) * r + c0) / * (((((((d7 * r + d6) * r + d5) * r + d4) * r + d3) * * r + d2) * r + d1) * r + one) else r = r - split2 ppnd16 = (((((((e7 * r + e6) * r + e5) * r + e4) * r + e3) * * r + e2) * r + e1) * r + e0) / * (((((((f7 * r + f6) * r + f5) * r + f4) * r + f3) * * r + f2) * r + f1) * r + one) end if if (q .lt. zero) ppnd16 = - ppnd16 return end if end