and the Hessian matrix of f(x) have a special structure, but is based on the premise that first-order terms will eventually dominate second-order terms. This assumption is not justified when the residuals at the solution are very large, because the second-order term (4.3.8) contains the product of second derivatives and residuals. In practice, in light curve analysis, this assumption turns out to be valid, i.e., the residuals at the solution are small enough. The bounds (4.1.14) are not taken into account in the method of Differential Corrections. The procedure can be applied successfully if an initial solution x0 close enough to the solution x+ is known and if the correlations between the parameters are not too large. However, these requirements are not always met. The following disadvantages are sometimes mentioned in the context of Differential Corrections. Note that some of the following points (1 and 3) apply to any derivative-based least-squares method:

1. The initial solution x0 might not be well known and may not lie in the local convergence region. The result can be divergence or oscillation. It is not unusual to observe higher standard deviations σ_fit after a few iterations.

2. For light curve solutions, most of the partial derivatives ∂ℓ/∂x_i can be derived only numerically. This means that to adjust m parameters, m + 1 or 2m + 1 light curves have to be calculated at each iteration, depending on whether asymmetric [see (4.5.25)] or symmetric [see (4.5.26)] finite differences are used to compute the partial derivatives. Accuracy depends critically on the choice of the increments Δx_i and is difficult to control (see Sect. 4.5.5). As shown in Sect. 4.5.4, for some parameters it is possible to calculate derivatives analytically.

3. Convergence problems, due to both nonlinearity and correlations, have long been observed in applications of Differential Corrections (Wilson & Biermann 1976). The linearized normal equations are sometimes ill-conditioned, with condition numbers of the order of 10^6, i.e., parameters are strongly correlated. In light curve analysis, the problem is dominant when the mass ratio is an adjustable parameter. Wilson & Biermann (1976) cope with these problems by solving alternately and iteratively for subsets of parameters in successive iterations (see Sect. 4.3.2).

4. The need for precision, and hence grid density, in the calculation of the theoretical light curve by numerical quadrature of the flux over the stellar surface is obviously much greater than for a direct search method. Computing-time costs and/or restricted memory have prevented many light curve analyzers from using all observed data points. Instead, they form normal points and analyze them to estimate the parameters. We do not recommend this procedure enthusiastically, because subjective decisions must be made in the binning/averaging procedure, such as the width of the bins in different parts of the light curve; in practice, however, the results are not significantly different in most cases. See Sect. 4.1.1.5 for further discussion of the intrinsic merits of binning and not binning. Machines are now becoming so fast and memory so inexpensive that normal points are no longer needed.
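Point 2 above can be made concrete with a small sketch. Here `model` stands for a generic function that returns a synthetic light curve for a parameter vector; the function name, the toy model in the usage below, and the increments are illustrative assumptions, not taken from any light curve code. The two variants differ only in cost (m + 1 versus 2m + 1 model evaluations per iteration) and in accuracy:

```python
import numpy as np

def jacobian_fd(model, x, dx, symmetric=False):
    """Finite-difference Jacobian of a model light curve.

    Asymmetric (forward) differences cost m + 1 model evaluations per
    iteration; symmetric (central) differences cost 2m + 1 but are one
    order more accurate in the increments dx.
    """
    x = np.asarray(x, float)
    m = x.size
    f0 = np.asarray(model(x), float)   # baseline curve (used by forward differences)
    J = np.empty((f0.size, m))
    for i in range(m):
        e = np.zeros(m)
        e[i] = dx[i]
        if symmetric:
            J[:, i] = (np.asarray(model(x + e)) - np.asarray(model(x - e))) / (2 * dx[i])
        else:
            J[:, i] = (np.asarray(model(x + e)) - f0) / dx[i]
    return J
```

The critical and hard-to-control choice is `dx`: too large and the truncation error dominates, too small and roundoff in the model evaluation dominates (cf. Sect. 4.5.5).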

### 4.3.2 Multiple Subset Method and Interactive Branching

To overcome some of the limitations of Differential Corrections, Wilson & Biermann (1976) introduced the method of multiple subsets, by which they could separate the most correlated parameters during the optimization process (which still proceeds by Differential Corrections). If P denotes the set of free or adjustable parameters, then this method solves for parameter subsets P_i ⊂ P in each iteration. P is separated into two or (rarely) more, not necessarily disjoint, subsets P_i, but, of course, we expect that the union of all, say, N_S subsets coincides with the original set P. The solution proceeds by alternating subset solutions. As the derivatives are already calculated for the full set P, several combinations of free parameters can be established by interactive branching without much additional computing time. Associated with these subsets are normal equations of smaller dimension and smaller condition number. The standard errors of the parameters are derived by a final run on the whole parameter set (ignoring the corrections). The method requires some interactive intervention by the human analyzer, but many authors have nevertheless proved its efficiency (Wilson 1988). Applying the subset with the smallest predicted σ_fit has turned out to be an effective way to arrive at the deepest minimum in most cases. Intuitively, this method can be understood as a decomposition technique. In the example of two-dimensional minimization in the x- and y-directions, this decomposition would lead to minimizing in the x- and y-directions separately, i.e., fixing y and minimizing with respect to x, fixing x and minimizing with respect to y, and so on.

The multiple subset method addresses the difficulties caused by correlations between parameters. To overcome problems related to nonlinearity, a damping strategy is required (see Sect. 4.3.3, or Appendix A.1).
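The alternation over parameter subsets can be sketched on a linear toy problem (all names and the toy data here are illustrative; a real application would use the light curve Jacobian). Each subset solve reuses columns of the already-computed full Jacobian, so the normal equations are smaller and typically better conditioned:

```python
import numpy as np

def subset_step(J, r, idx):
    """Differential-corrections step restricted to the parameter subset idx,
    reusing columns of the full Jacobian J; the subset normal-equation
    matrix is smaller and typically better conditioned."""
    Js = J[:, idx]
    return np.linalg.solve(Js.T @ Js, Js.T @ r)

# Alternating subset solutions on a linear toy problem with residuals r = y - J x.
rng = np.random.default_rng(1)
J = rng.normal(size=(50, 4))
x_true = np.array([1.0, -2.0, 0.5, 3.0])
y = J @ x_true

x = np.zeros(4)
subsets = [[0, 1], [2, 3], [1, 2]]   # not necessarily disjoint; union covers all parameters
for _ in range(100):
    for idx in subsets:
        x[idx] += subset_step(J, y - J @ x, idx)
```

After the alternating sweeps, x approaches x_true; in practice, Wilson & Biermann's procedure applies the subset with the smallest predicted σ_fit rather than a fixed cycle as in this sketch.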

### 4.3.3 Damped Differential Corrections and the Levenberg-Marquardt Algorithm

A frequently used method to increase the convergence region of nonlinear least-squares problems is the Levenberg-Marquardt algorithm (Levenberg 1944, Marquardt 1963), sometimes also called the Marquardt method. Early users of the Levenberg-Marquardt algorithm in the context of EB light curves were Hill (1979) and Djurasevic (1992). Wilson & Terrell (1998) also mentioned the use of this method in the Wilson-Devinney program.

The damped least-squares algorithm implemented in WD95 (Kallrath et al. 1998), a version of the Wilson-Devinney program described in Chap. 7, is based on the normal equations and the modification of the system matrix (4.3.13) in Sect. 4.3.1, and is identical to the Levenberg-Marquardt damping strategy. In its simplest form, the key idea of the Levenberg-Marquardt algorithm is to add a multiple λ_k 𝟙 of the identity matrix 𝟙 to the Hessian in (4.3.9), with a suitable nonnegative damping factor λ_k. In place of the normal equation (4.3.9), we get a regular system of linear equations of the form

(C + D_k) Δx_k = d,   (4.3.15)

where d denotes the right-hand side of (4.3.9) and D_k is a diagonal matrix, usually chosen as D_k = λ_k 𝟙. In general, the crucial point of the method is to choose the damping properly,6 as discussed by Moré (1978). A heuristic procedure to choose λ_k, given in Kallrath et al. (1998), looks like this:

1. for given x_k compute f(x_k) and establish the normal equation; in particular, compute C;

2. for the initial damping constant λ pick a modest value, for example, λ = 10^-4, and set λ_0 = λ;

3. the diagonal elements c_jj are replaced by (1 + λ_k)c_jj;

4. solve the modified normal equation (4.3.15) for Δx_k and set the potential iterate x' = x_k + Δx_k;

5. rejection test: if f(x') ≥ f(x_k), set λ_k = 10 λ_k and goto step 3;

6. acceptance test: if f(x') < f(x_k), set x_{k+1} = x' and λ_{k+1} = 0.1 λ_k, goto step 2.

The question remains how to define a condition for terminating the procedure. That topic is discussed in Sect. 4.4.2.
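The heuristic above can be condensed into a few lines. This is an illustrative sketch, not the WD95 code; the rejection factor 10 and the toy exponential fit below are the usual Marquardt choices and an assumed example, respectively:

```python
import numpy as np

def lm_fit(residuals, jac, x, lam=1e-4, iters=50):
    """Damped differential corrections with heuristic damping updates
    (factor 0.1 on acceptance, factor 10 on rejection)."""
    f = lambda z: 0.5 * np.sum(residuals(z) ** 2)
    x = np.asarray(x, float)
    for _ in range(iters):
        r, J = residuals(x), jac(x)
        C, g = J.T @ J, J.T @ r              # normal-equation matrix and right-hand side
        while True:
            Ck = C + lam * np.diag(np.diag(C))   # c_jj -> (1 + lam) c_jj
            dx = np.linalg.solve(Ck, g)
            if f(x + dx) < f(x):             # acceptance test
                x, lam = x + dx, 0.1 * lam
                break
            lam *= 10.0                      # rejection: increase the damping
            if lam > 1e12:                   # no acceptable step: effectively converged
                return x
    return x

# Toy usage: fit y = a * exp(-b * t) by least squares, starting from (1, 1).
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-3.0 * t)
res = lambda p: y - p[0] * np.exp(-p[1] * t)
jac = lambda p: np.column_stack([np.exp(-p[1] * t), -p[0] * t * np.exp(-p[1] * t)])
p = lm_fit(res, jac, [1.0, 1.0])
```

Small λ makes the step essentially a Gauss-Newton step; large λ shortens the step and turns it toward steepest descent, which is the damping behavior discussed next.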

As the Levenberg-Marquardt algorithm is sometimes referred to as a "damped least-squares method," let us consider this point for a moment. A good property of the Levenberg-Marquardt method is its numerical stability, which follows from the fact that the matrix R'(x_k)ᵀR'(x_k) + λ_k 𝟙 is positive definite even for rank-deficient Jacobians R'(x_k) (see Appendix A.3.3 for nomenclature). However, often the "solution" is not the minimum of f(x) but only premature termination. Great care is needed because the damping effect of adding λ𝟙 to the Hessian hides to some extent the uniqueness problem associated with rank-deficient Jacobians.
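The positive-definiteness claim is easy to check numerically for a deliberately rank-deficient Jacobian (illustrative values):

```python
import numpy as np

# A rank-deficient Jacobian: the second column is twice the first,
# i.e., two perfectly correlated parameters.
J = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])

H = J.T @ J                      # normal-equation matrix, singular here
lam = 1e-4
Hd = H + lam * np.eye(2)         # Levenberg-Marquardt damping

eig_H = np.linalg.eigvalsh(H)    # smallest eigenvalue is 0: H is not invertible
eig_Hd = np.linalg.eigvalsh(Hd)  # all eigenvalues >= lam: positive definite
```

The damped matrix is always invertible, but the zero eigenvalue of H signals that the undamped problem has no unique solution; the damping masks, rather than removes, this degeneracy.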

### 4.3.4 Derivative-Free Methods

Derivative-free optimization is performed with direct search algorithms which were reviewed by Murray (1972) and Lootsma (1972). Among others, this group of

6 In light curve analysis, the method seems not to depend strongly on the choice of λ if λ is chosen between 10^-4 and 10^-8.

algorithms includes the Fibonacci line search and the Simplex algorithm7 invented by Spendley et al. (1962). The Simplex algorithm becomes efficient if there are more than two or three parameters to be adjusted. In the context of light curve analysis, this algorithm was first used by Kallrath & Linnell (1987). Since then, several authors have successfully applied this method. Experiences with the Simplex algorithm in the context of light curve analysis can be found in Kallrath (1993). More recently, Powell's Direction Set Method has been implemented by Prša & Zwitter (2005) in the light curve package PHOEBE.

### 4.3.4.1 The Simplex Algorithm

The Simplex algorithm does not require any derivatives to be computed; it compares only the function values f_i := f(x_i) at the m + 1 vertices of a simplex in parameter space, S. In the first version of the method by Spendley et al. (1962), the simplex was moved through S by an operation called reflection. This operation bears a close relationship to the method of steepest descent. A more efficient version by Nelder & Mead (1965) uses three additional operations: expansion, contraction, and shrinkage, which move the simplex, adapt it to the hypersurface, and iterate it with decreasing volume toward the solution x*. The basic idea is to eliminate the worst vertex by one of the four operations and to replace it with a better one. Figure 4.6 demonstrates these geometrical operations. f(x) = σ_fit has been minimized w.r.t. inclination i and mass ratio q. In the initial simplex S^(0), the vertex with the highest value of σ_fit has been marked with 0. Via reflection, S^(0) is mapped onto S^(1). S^(2) is the result of contracting S^(1) toward the center of the simplex. This is a simulated case, so the true solution x* = (i, q) is known and marked as a circle in the (i, q)-plane. The construction of S^(0), introduced by Kallrath & Linnell (1987) and used in the light curve packages, deviates from the usual procedure as expressed in (4.3.21) because that appeared to be advantageous in EB light curve analysis.

Let us consider an arbitrary real-valued continuous function f(x) depending on a vector x ∈ S ⊂ ℝ^m, and the m + 1 vertices of a simplex S^(0) constructed below and embedded in ℝ^m. Let V_i^(k), 1 ≤ i ≤ m + 1, be the vertices of the simplex S^(k) with coordinates x_i^(k) after k iterations. For convenience, we represent S^(k) as an (m + 1) × (m + 1)-dimensional matrix S_ij with

S_ij := x_j^(i), for 1 ≤ j ≤ m;   S_i,m+1 := f_i := f(x_i), for j = m + 1.

The following definitions hold for each iteration; therefore, the superscript k is omitted. In particular, we define the highest function value as

f_h := max {f_i, 1 ≤ i ≤ m + 1},   (4.3.17)

7 Not to be confused with Dantzig's Simplex algorithm used in Linear Programming, introduced by Dantzig in 1947 and first published in Dantzig (1951).

Fig. 4.6 The geometry of the Simplex algorithm. This figure [Figs. 1 and 2 in Kallrath and Linnell (1987, p. 349)] shows the change of simplex shape while moving through the (i, q) parameter plane. The upper part shows the first seven iterations, the lower part the iterations while moving toward the local minimum at i = 82.43° and q = 0.770

the second highest value as

f_s := max {f_i, 1 ≤ i ≤ m + 1, i ≠ h},   (4.3.18)

and the lowest function value as

f_l := min {f_i, 1 ≤ i ≤ m + 1},   (4.3.19)

and eventually the center x_c and the geometrical center x_g of the simplex as

x_c := (1/m) Σ_{i=1, i≠h}^{m+1} x_i,   x_g := (1/(m+1)) Σ_{i=1}^{m+1} x_i.   (4.3.20)

Note that for the computation of the center x_c, the vertex x_h is omitted.

The initial simplex S^(0) depends on the number of parameters m, an initial vector x^(0), and a scaling vector s := (s_1, s_2, ..., s_m) determining the size of S^(0); s_j is the size of S^(0) projected onto the x_j-axis. Spendley et al. (1962) and Yarbro & Deming (1974), based on statistical considerations, derive the auxiliary quantities

and eventually construct S^(0) = S^(0)(m, x_1^(0), s) as

According to that definition, we have x_j^(0) ≥ x_1^(0). The geometrical interpretation is that the simplex is constructed from one of its vertices. However, if information about the minimum of f(x), such as an initial solution x^(0), is available, it seems better to construct the simplex S^(0) = S^(0)(m, x^(0) - qs, s), whose geometrical center x_g approximately coincides with x^(0), i.e., x_g ≈ x^(0). In order to cover a parameter space defined by simple bounds, the simplex S^(0) = S^(0)(m, x_1^(0), s/p) is appropriate because its vertices lie on the edges of a hypercube.

The iteration process is summarized as a flow chart in Fig. 4.7 using the following definitions:

x_α = x_α(x) := (1 + α)x_c - αx,  0 < α,
x_β = x_β(x) := (1 - β)x_c + βx,  0 < β < 1,
x_γ = x_γ(x) := (1 + γ)x_c - γx,  0 < γ,
x_δ = x_δ(x) := x_l + δ(x - x_l),  0 < δ < 1.   (4.3.23)

Fig. 4.7 Flow chart of the Simplex algorithm. This figure [Fig. 9 in Kallrath and Linnell (1987, p. 356)] represents the rules embedded in the Simplex algorithm

The four operations α (reflection), β (contraction), γ (expansion), and δ (shrinkage) are applied according to the following scheme:

reflection:  S^(k+1) = α(S^(k)),  x_h ↦ x_α(x_h),  if f_s ≥ f_α ≥ f_l,
expansion:   S^(k+1) = γ(S^(k)),  x_h ↦ x_γ(x_h),  if f_γ < f_α < f_l,
contraction: S^(k+1) = β(S^(k)),  x_h ↦ x_β(x_h),  if f_α > f_s,   (4.3.24)
shrinkage:   S^(k+1) = δ(S^(k)),  x_i ↦ x_δ(x_i), i ≠ l,  if f_β > f_h.

By careful selection of the parameters α, β, γ, and δ, the four operations can be controlled. The values α = 1.0, β = 0.35, γ = 2.0, and δ = 0.5 are recommended by Parkinson & Hutchinson (1972). For most light curve analyses, this set of parameters is used, but sometimes β = 0.5 is substituted.

From the definitions of the operations, it follows that the volume of the simplex and the function value at the vertex with minimal function value are monotonically decreasing functions of k unless the operation expansion occurs. The iterations are halted by one of three stopping criteria. The first is a maximum number k_max of iterations. The second stops the iterations when all mean errors, σ_j, of the column averages, S̄_j,

become smaller than a given tolerance ε_j, i.e., σ_j < ε_j for all j. Third, iterations may be stopped if f_l becomes smaller than a threshold value f_t. If f(x) represents a least-squares function, e.g., the standard deviation of the fit, σ_fit, then f_t may be set to the inner noise, σ_data, of the data.

As the Simplex algorithm is a direct search method, it cannot provide standard errors for the parameters. It is also difficult to add parameters during the course of the iterations.
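The operations (4.3.23) and (4.3.24) can be condensed into a short sketch. This is an illustrative implementation with the Parkinson & Hutchinson parameter values; the initial simplex here is built naively from x0 plus scaled unit steps rather than by the Spendley et al. construction discussed above:

```python
import numpy as np

def simplex_minimize(f, x0, s, alpha=1.0, beta=0.35, gamma=2.0, delta=0.5,
                     kmax=500, tol=1e-10):
    x0 = np.asarray(x0, float)
    m = x0.size
    S = [x0] + [x0 + s * e for e in np.eye(m)]    # naive initial simplex
    F = [f(v) for v in S]
    for _ in range(kmax):
        order = np.argsort(F)                     # sort vertices: best first, worst last
        S = [S[i] for i in order]
        F = [F[i] for i in order]
        if F[-1] - F[0] < tol:                    # spread of function values as stop test
            break
        xc = np.mean(S[:-1], axis=0)              # center, worst vertex omitted
        xh, fh = S[-1], F[-1]
        xa = (1 + alpha) * xc - alpha * xh        # reflection
        fa = f(xa)
        if fa < F[0]:                             # best value so far: try expansion
            xg = (1 + gamma) * xc - gamma * xh
            fg = f(xg)
            S[-1], F[-1] = (xg, fg) if fg < fa else (xa, fa)
        elif fa < F[-2]:                          # acceptable: keep the reflection
            S[-1], F[-1] = xa, fa
        else:                                     # contraction toward the center
            xb = (1 - beta) * xc + beta * xh
            fb = f(xb)
            if fb < fh:
                S[-1], F[-1] = xb, fb
            else:                                 # shrink all vertices toward the best
                S = [S[0]] + [S[0] + delta * (v - S[0]) for v in S[1:]]
                F = [F[0]] + [f(v) for v in S[1:]]
    return S[int(np.argmin(F))]
```

The stopping test on the spread of the vertex function values corresponds to the second of the three criteria described above.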

### 4.3.4.2 Powell's Direction Set Method

Powell's Direction Set Method (Powell 1964a,b) is an efficient method for finding the unconstrained minimum of a function of n variables without calculating derivatives. The method has a quadratic termination property, i.e., quadratic functions are minimized in a predetermined number of operations, performing n^2 + O(n) line search steps along conjugate directions in parameter space. In Appendix A.1, we briefly mention variable metric and conjugate direction methods. The method was first applied to light curve analysis by Prša & Zwitter (2005b). A more recent write-up of Powell's method by Vassiliadis & Conejeros (2008) provides the computational details (line search, choice of direction) and a discussion of the termination criterion.

### 4.3.4.3 Simulated Annealing

Simulated annealing (SA) dates back to Metropolis et al. (1953) and is a function-evaluation-based stochastic search method often used to solve combinatorial optimization problems; see Pardalos & Mavridou (2008) for a recent description. It is capable of escaping local minima and can, in principle, find the global optimum, however without any guarantee. During its iterations, SA accepts not only solutions better than the previous one, but also worse solutions, controlled probabilistically through a control parameter T. The basic requirements are an initial starting point, a neighborhood concept, and a cooling scheme involving a few tuning parameters. For continuous optimization problems, the neighborhood concept leads to parameter variations of the form

x_i → x'_i = x_i + r·v_i,

where x_i is a parameter, r is a random number in the range [-1, +1], and v_i is the step length for this parameter. The function to be minimized, f(x), is then computed for the new values of the parameters, giving the value f' = f(x'). In our program, this function is the sum of squares of the weighted residuals. If f' < f, the function is said to have moved downhill, and the values x' are taken as the new optimum values. All downhill steps are accepted, and the process repeats from this new point. An uphill step may also be accepted, allowing the parameter values to escape from local minima. Uphill movements (i.e., a value f' such that f' > f) are accepted with probability p = e^{(f-f')/T}, where f is the previously found minimum value of the function to be minimized, f' is the most recently computed value of the function, and T is the current value of the "temperature" (the thermodynamic analogue, not the light curve parameter). If p is larger than a random number between 0 and 1, the new point is accepted (the "Metropolis criterion"). Both the values of v_i and T are adjusted. The step length is changed such that a fixed percentage of the new evaluations is accepted; this percentage is one of the tuning parameters mentioned above. When more points are accepted, v_i is increased, and more points may be "out of bounds," that is, beyond the parameter limits established in a constraints file. The temperature is decreased after a specified number of iterations, at which point it is changed to T' = r_T·T, where 0 < r_T < 1. In the course of the iterations, the step lengths and the temperature decrease, and the algorithm ideally terminates close to the global optimum. The computational cost to reach this state can be high, as the temperature annealing has to be sufficiently slow. A pseudocode of SA and a detailed discussion of the parameters and temperature adjustment are presented by Alizamir et al. (2008).
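The scheme just described (random steps, Metropolis acceptance, step-length tuning toward a fixed acceptance rate, and geometric cooling) can be sketched as follows; the tuning constants and the toy objective in the usage are illustrative assumptions, not values from any published light curve code:

```python
import math, random

def anneal(f, x, v, T=1.0, rT=0.85, steps_per_T=200, T_min=1e-6, target=0.4):
    fx = f(x)
    best, fbest = list(x), fx
    while T > T_min:
        accepted = 0
        for _ in range(steps_per_T):
            i = random.randrange(len(x))
            xp = list(x)
            xp[i] += random.uniform(-1.0, 1.0) * v[i]   # x_i -> x_i + r * v_i
            fp = f(xp)
            # Metropolis criterion: downhill always, uphill with p = exp((f - f') / T)
            if fp < fx or random.random() < math.exp((fx - fp) / T):
                x, fx = xp, fp
                accepted += 1
                if fx < fbest:
                    best, fbest = list(x), fx
        rate = accepted / steps_per_T
        v = [vi * (1.2 if rate > target else 0.8) for vi in v]  # steer acceptance rate
        T *= rT                                                  # geometric cooling
    return best, fbest
```

The cooling factor rT controls the trade-off the text mentions: values close to 1 anneal slowly and explore thoroughly at high computational cost, while small values cool quickly and risk freezing into a local minimum.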

SA has been applied to EB research by Milone & Kallrath (2008) and, in a slightly modified form named Adaptive Simulated Annealing, by Prša (2005). SA is also part of NightFall, a freely available amateur code for modeling eclipsing binary stars described in Sect. 8.3.

### 4.3.5 Other Approaches

In EB light curve analysis, other optimization techniques have been used in addition to the methods described in the previous subsections. Napier (1981) used a sequential (creeping random) search technique for fitting the parameters in his triaxial ellipsoid-ellipsoid geometry light curve model. Since this light curve model did not require a large amount of computer time, Napier was able to find a solution parameter set within minutes. A similar approach, namely a "controlled random search optimization procedure" based on the Price algorithm (Price 1976), has been applied to the Wilson-Devinney model by Barone et al. (1988). In this case, the efficiency was low: Barone et al. (1988) reported solutions which required about 30,000 iterations and 11 days of CPU time on a VAX 8600. Finally, we mention Metcalfe (1999), who used genetic algorithms to analyze EBs.

Various semi-empirical modifications of Differential Corrections were outlined by Khaliullina & Khaliullin (1984) and implemented in a computer program capable of automatic iteration (which, however, is designed to analyze the light curves of EBs with spherical components only). Hill (1979), in his program LIGHT2, which is based on Roche geometry, may have been the first to change his light curve solver from Differential Corrections to the Marquardt (1963) algorithm, which has much better convergence properties and has become a standard among nonlinear least-squares routines.

Recently, the WD program has been enhanced by a damped least-squares solver based on the idea of Levenberg (1944). Further material on the Levenberg-Marquardt algorithm can be found in Moré (1978); it provides the basis for the light curve package WD95 by Kallrath et al. (1998) and later versions, which use the Simplex algorithm for the initial search and global investigation and switch to a damped least-squares solver when approaching a local minimum. The Levenberg-Marquardt procedure efficiently combines a steepest descent method with a step-size controlled Gauß-Newton algorithm. A similar step in this direction had already been taken by Plewa (1988) with the program MINW. MINW combines the Simplex algorithm, a variable metric method [Fletcher (1970), or Appendix A.1], and the computation of the Hessian matrix for statistical purposes and error estimation.

## 4.4 A Priori and A Posteriori Steps in Light Curve Analysis

Ex necessitate rei (From the necessity of the case)

### 4.4.1 Estimating Initial Parameters

The heuristic rules summarized in this section may be useful for estimating some of the initial parameters and also for checking the consistency of light curve solutions as well as interpreting their physical meaning. It may happen that the least-squares adjustment, for some reason, produces a seriously inadequate solution. A formal criterion for detecting an inadequate solution is to compare the standard deviation of the fit, σ_fit, with the internal noise, σ_data, of the data. The solution is acceptable only if σ_fit ≈ σ_data. It certainly helps to plot the light curve and the data together! A more physical criterion is to check the astrophysical plausibility. If astrophysically implausible results are derived, this may be due to a deficiency of the light curve model or to a local instead of a global minimum.

Hints for estimating parameters are summarized below; as usual, light curves are assumed to be measured in relative flux units ("light") and not in magnitude.

1. Estimation of the temperature difference: For circular orbits, the ratio of primary to secondary eclipse depths is equal to the corresponding ratio of eclipsed mean stellar surface brightnesses at the places sampled. It has been argued that surface brightness is sufficient by itself as a modeling parameter and that the use of temperatures as light curve parameters is both unnecessary and, because of stellar atmosphere effects that might not follow the blackbody law, potentially misleading [see, e.g., Popper (1993, p. 193)]. Our position on this matter is that the effective temperature is an important curve-independent physical parameter because, with a correct understanding of the stellar atmospheres involved, it constrains the relative fluxes in different passbands and thus reduces the number of free parameters; moreover, it provides an important handle on the astrophysics of the stellar components. If the atmospheric physics is not correctly understood, then the temperature will vary from curve to curve. Assuming that the atmosphere computation is accurate, the ratio of depths can give the temperature of one component if the temperature of the other one is known, even though stars are not blackbodies. This procedure holds for both partial and complete eclipses, but only for circular orbits. Even so, for small eccentricities the depths provide an initial guess for one star's temperature as a function of the other's. The temperature of one of the stars needs to be known from spectra, or from color indices if the interstellar reddening is known. We may guess that the total light of the system is dominated by the hotter star. This may be wrong, but we can proceed with the modeling until we confirm it to be so, or prove the opposite, in which case we continue with new T_1 and T_2. Convergence should be reachieved quite quickly.

2. Estimation of the luminosity ratio: For spherical stars, the ratio of the light loss to the light remaining at the bottom of a total eclipse is the monochromatic luminosity ratio of the smaller to the larger star. The size ratio can be estimated as well; in the absence of limb darkening, we have

k² = (r_s/r_g)² = (1 − ℓ_0^tr)/(1 − ℓ_0^oc),   (4.4.1)

where 1 − ℓ_0^tr and 1 − ℓ_0^oc are the transit and occultation eclipse depths. Wilson (1966) makes a slight modification of (4.4.1) to include limb darkening. For circular orbits, as shown in Fig. 3.10, the two eclipses are of equal duration. The following idea gives only a lower limit of k (because of uncertainties in i).

Although we often find similar formulations in textbooks, this approach is of limited use because there is no reliable way to know i. The same concern holds for the following considerations: The duration of the entire eclipse (first contact to last contact) is a measure of the sum of the relative radii, r_1 + r_2. If k is known (usually it would not be known a priori), then r_1 and r_2 can also be found. If the eclipse is total, it may be possible to identify contacts 2 and 3 (onset and completion of the totality phase). In this case, the timings provide the radii. If t_1, ..., t_4 denote the times of first through fourth contact, as shown in Fig. 4.8, then k is approximately

k ≈ [(t_4 − t_1) − (t_3 − t_2)] / [(t_4 − t_1) + (t_3 − t_2)].   (4.4.2)

Note that (4.4.2) is strictly true only if the stars move along straight lines, not in circular orbits. Which is the larger and which is the smaller star? If the smaller star transits the larger at the time of the primary (deeper) eclipse, the smaller one is star 2; if the smaller star is occulted during primary minimum, it is star 1. Occultation eclipses have a flatter light curve during totality (corresponding to t_3 − t_2) than do transits, as a rule. But trial and error may be required to ascertain which of the two possibilities applies. For eccentric orbits, the eclipse which
