n5321 | June 14, 2025, 17:55

Tags: CAE


ZOLTAN J. CENDES (Ansoft founder), 1985 paper

INTRODUCTION

Development of the finite element method has been closely linked to advances in computer technology, allowing users to solve problems of increasing complexity. However, there remains a class of problems of a higher order of complexity with computational requirements exceeding the limits of current machines. Recent developments in hardware technology provide parallel processing capabilities that appear well suited to exploit the parallelism present in the finite element method.

Alternative hardware architectures for parallel processing include:

  • Tightly coupled networks where processors cooperate closely on the solution of the problem (e.g. Cm* [6]).

  • Loosely coupled networks where interprocessor communication is handled on a message-passing basis (e.g. an Ethernet-like interprocessor communications network).

  • Attached processors that range from array processors to special-purpose finite element machines and VLSI systolic arrays [9, 15].

However, some issues must be considered to fully exploit the capabilities of multiprocessor architectures.

  • Since most multiprocessor machines are still in the experimental stage, these systems lack user-friendly interfaces and support for software development, making programming costly.

  • The selection of a finite element system (both hardware and software) is complicated by the large number of possible hardware and software configurations and the high costs associated with the development of any such system.

  • Tuning the program to the hardware to optimize performance becomes harder with more complex architectures because additional issues must be considered (management of private and shared resources, data sharing between processors, interprocessor communications, etc.).

Because of the wide range of options and the high costs incurred in developing both the hardware and the software, the need arises to evaluate a proposed system without an actual implementation. A methodology for the preliminary evaluation of proposed system architectures (both hardware and software organization) is presented here. The methodology does not require that hardware be configured or software written; instead, it uses a resource-constrained network scheduling framework to model and simulate a system’s performance. Thus, proposed system architectures may be compared to determine which alternatives appear promising and deserve further study.

PARALLELISM AND PARALLEL PROCESSING IN THE FINITE ELEMENT METHOD

Parallelism in the finite element method

Increasing capabilities of microprocessors and VLSI technology make it feasible to build a machine with multiple processors cooperating on the solution of a problem. Special-purpose processors that efficiently manipulate arrays are being developed. Such alternatives to the traditional von Neumann machine architecture can exploit the parallelism explicitly displayed in parts of the finite element method algorithms and may reduce computational costs.

The basic finite element analysis algorithm for a linear static problem consists of four major steps:

  1. Generate the stiffness matrices and internal load vectors for all elements.

  2. Assemble the element stiffness matrices and load vectors into a structure stiffness matrix and internal load vector.

  3. Solve the system of equations for the structure displacements corresponding to the applied loads.

  4. Recover element quantities (stresses and strains) for all elements from the displacements computed for the structure.
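
To make these steps concrete, the following sketch runs through all four for a hypothetical chain of two-node bar elements (the element type, sizes and loads are invented for illustration; they are not from the paper). Step 1 is written as a parallel map over elements, anticipating the observations that follow.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def element_stiffness(elem):
    """Step 1: stiffness of one 2-node bar element; it depends only on
    that element's own data, so all elements can be computed in parallel."""
    node_i, node_j, EA, length = elem
    k = (EA / length) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    return (node_i, node_j), k

def solve_bar(n_nodes, elements, loads, fixed_dofs):
    # Step 1 in parallel: one independent task per element.
    with ProcessPoolExecutor() as pool:
        element_data = list(pool.map(element_stiffness, elements))

    # Step 2: assemble the structure stiffness matrix.
    K = np.zeros((n_nodes, n_nodes))
    for (i, j), k in element_data:
        dofs = (i, j)
        for a in range(2):
            for b in range(2):
                K[dofs[a], dofs[b]] += k[a, b]

    # Step 3: apply supports and solve for the free displacements.
    free = [d for d in range(n_nodes) if d not in fixed_dofs]
    u = np.zeros(n_nodes)
    u[free] = np.linalg.solve(K[np.ix_(free, free)], loads[free])

    # Step 4: recover element strains (also one independent task per element).
    strains = [(u[j] - u[i]) / length for (i, j, _, length) in elements]
    return u, strains

if __name__ == "__main__":
    elements = [(e, e + 1, 1.0e7, 1.0) for e in range(4)]  # 5-node bar chain
    loads = np.zeros(5)
    loads[4] = 1000.0                                      # axial tip load
    u, strains = solve_bar(5, elements, loads, fixed_dofs={0})
```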

The generation of the stiffness matrices and load vectors for all the elements can be performed in parallel, since the computation of each element’s quantities is independent of all other elements. Likewise, the recovery of element quantities for all elements can be done in parallel once the total system of equations has been solved. The assembly of the structure stiffness matrix can start as soon as the stiffness matrix of any one element is available. Parallelism also exists in the solution step.

  • The solution of the system of equations governing the behavior of the structure can begin as soon as all contributions to a node have been assembled. Likewise, the recovery of element quantities can start as soon as the displacements for any element have been computed. Marching solvers exploit this property.

  • If the solution of the system of equations does not start until all elements have been assembled, parallelism in the solver can be exploited through vectorization, as the sketch following this list illustrates.
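
As one concrete (and purely illustrative) picture of vectorization inside the solver, consider a dense Cholesky factorization: the trailing-submatrix update can be phrased as a single rank-1 array operation per column rather than a scalar double loop, which is precisely the regular, repetitive kernel that pipelined/vector hardware executes efficiently. The code below is a minimal sketch, not a solver from the paper.

```python
import numpy as np

def cholesky(A):
    """Dense Cholesky factorization A = L @ L.T. The trailing update is
    one vectorized rank-1 operation per column -- the kernel a vector
    machine would pipeline."""
    A = A.copy()
    n = A.shape[0]
    L = np.zeros_like(A)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j])
        L[j + 1:, j] = A[j + 1:, j] / L[j, j]
        # Vectorizable kernel: update the whole trailing block at once.
        A[j + 1:, j + 1:] -= np.outer(L[j + 1:, j], L[j + 1:, j])
    return L
```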

Parallel processing architectures

A hardware configuration consists of a set of processors and their associated resources. A processor may be a stand-alone unit such as a CPU (central processing unit), a functional unit of a CPU such as a vector adder, an attached processor, or a customized VLSI chip. Resources include primary memory, secondary storage devices, communications channels or data buses, and input and output devices. In addition, resources can be private to one processor or shared by multiple processors. The performance of a program executing on a configuration depends on the size, organization and type of available processors and resources. Hardware configurations may be classified according to the number of processors in the system and the nature of the network linking the processors. Four general classes of hardware configurations include the following:

  • Uniprocessor systems consist of only one CPU and its associated resources. Examples are the IBM 3083 and the Digital VAX 11/780. A large number of finite element analysis programs have been written for this class of machines [11]. However, sequential machines fail to exploit parallelism and thereby constrain program performance to the speed of the single processor.

  • Dual processor machines consist of a uniprocessor, the host, linked to an attached processor by a very high bandwidth channel. The attached processor might be an array/vector processor [14, 15] or a VLSI systolic array [8, 9]. Two finite element analysis programs executing on dual processor systems have been reported, with speedup ratios of 5-10 [14] and 13-16 [15]. The speedup ratio is the execution time of the program running on the host alone divided by the execution time on the dual processor system. Since the attached processor acts as a slave to the host, parallel processing does not occur.

  • Loosely coupled networks consist of several autonomous processors that are not necessarily identical. The processors can access private or shared resources and communicate via a medium-bandwidth (~10 Mbps) Ethernet-like network. Examples are the Spice network [12] and the Apollo Domain Architecture. No finite element programs using the capabilities of loosely coupled networks are known to the authors. Such a program would be composed of several processes that may execute on different processors.

  • Tightly coupled networks consist of several processors where interprocessor communication is hardware supported and is routed via data buses. Examples of such machines are Cm* [6], the NASA Finite Element Machine [13] and ILLIAC IV [2]. Tightly coupled networks may be subdivided into four classes according to the number of instruction and data streams concurrently processed:

    1. An SIMD (Single Instruction stream, Multiple Data stream) machine consists of one control or master processor and several slave processors that simultaneously execute the same instruction stream on independent sets of data as in ILLIAC IV.

    2. MISD (Multiple Instruction stream, Single Data stream) machines are also known as pipelined or vector machines. A hardware pipeline consists of a set of functional units (i.e. fixed point arithmetic, floating point arithmetic, index, data fetch, etc.) linked together in series where multiple instructions simultaneously act on the data stream as it goes through the pipeline as in the CRAY-1 or the CDC STAR. Finite element programs developed for uniprocessor systems can be transported to MISD machines.

    3. MIMD (Multiple Instruction stream, Multiple Data stream) machines consist of at least two general-purpose processors linked together by data buses. Multiple instruction streams acting on multiple data streams execute concurrently in MIMD machines. Examples of such machines are Cm* and the NASA Finite Element Machine. No commercial finite element software is available for this class of machines. Using Cm* for the solution of electrical power networks [4] (which is similar to finite element analysis), performance improved as the number of processors was increased but leveled off with a system of nine processors. Also, using the NASA Finite Element Machine, the speedup ratios for the various steps of the solution algorithm (element stiffness generation, assembly, solution of equations, element recovery) varied between 2.82 and 4.00 on a four processor configuration.

    4. Systolic arrays [8, 9] are computational cells connected in a regular network. A cell is a processor with limited capabilities and resources. Data flows continuously through the network, and each processor performs simple computations on the data. In a finite element system, they can be used as special-purpose processors that perform matrix operations such as multiplication, symmetric triple product, LU-factorization and backsubstitution.
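
To make the data-flow idea concrete, the following is a small cycle-by-cycle simulation of an n × n grid of multiply-accumulate cells computing a matrix product (an illustrative sketch; it does not describe any particular array in [8, 9]). Rows of A stream in from the left and columns of B from the top, each skewed by one cycle, so cell (i, j) sees the operand pair A[i, k], B[k, j] at cycle t = i + j + k.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an n x n systolic grid computing C = A @ B, one
    multiply-accumulate per cell per cycle."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for t in range(3 * n - 2):       # cycles until the wavefront drains
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:       # an operand pair reaches cell (i, j) now
                    C[i, j] += A[i, k] * B[k, j]
    return C

# Sanity check against the direct product:
# A, B = np.random.rand(4, 4), np.random.rand(4, 4)
# assert np.allclose(systolic_matmul(A, B), A @ B)
```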

Architectures offering parallel processing can exploit the parallelism displayed by the finite element method. However, only loosely coupled networks and MIMD machines maximize the use of parallelism throughout the finite element analysis algorithm. Effective programming becomes harder with increasing architectural complexity. In particular, scheduling of solution steps on particular processors and allowing for necessary data communications will greatly influence system performance. Developing experimental systems is extremely costly and no formal guidelines exist to determine which alternatives are most cost effective.

MODELING AND SCHEDULING A FINITE ELEMENT SYSTEM ARCHITECTURE

Problem description

The objective of the methodology described is to provide a comparison of various software and hardware designs that exploit parallelism to improve performance. Designs are evaluated by simulating the scheduling or assignment of computational tasks to particular processors. Alternative designs are compared based on the estimated costs of sample runs developed from the simulation. As a first estimate, the cost of a run is taken as its total execution time.

To study the behavior of algorithms on various architectures, the representations of hardware and software must be independent. Furthermore, the software representation must support parallelism in the algorithm and be independent of the size of the problem being solved. Representations satisfying these conditions treat the hardware as a set of processors and their associated resources, while the software is modeled as a directed acyclic network where nodes stand for independent computational tasks in the program and branches represent data flow paths and precedence relationships between tasks.

The finite element program is modeled as a generic non-iterative program structure independent of the finite element problem size. Nodes represent tasks and may be subscripted; a subscripted node stands for the computation of all the variables generated by varying each of its subscripts over its range. Network links represent data precedence relations between computational steps. The size-independent generic network model is then expanded to build an explicit network, including all the tasks for a given problem, where each task computes one quantity. This expansion is performed using a set of rules for expanding the subscripted nodes and generating the appropriate precedence relations and resource constraints [7].
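
One possible encoding of such a generic network and its expansion is sketched below (the template names and the single subscript are invented for illustration; the paper's actual expansion rules [7] are richer). Each subscripted template is instantiated over its subscript range, and precedence rules between templates generate the explicit task-level links.

```python
from itertools import product

# Generic, size-independent network: task templates with subscripts,
# plus precedence rules between templates (names are illustrative).
TEMPLATES = {
    "gen":   ("e",),   # generate the stiffness of element e
    "asm":   ("e",),   # assemble element e into the structure matrix
    "solve": (),       # solve the assembled system (a single task)
    "rec":   ("e",),   # recover quantities for element e
}
RULES = [("gen", "asm"), ("asm", "solve"), ("solve", "rec")]

def expand(n_elements):
    """Expand the generic network into the explicit task graph for a
    given problem size: one task per subscript value."""
    ranges = {"e": range(n_elements)}

    def instances(name):
        subs = TEMPLATES[name]
        return [(name, idx) for idx in product(*(ranges[s] for s in subs))]

    tasks = [t for name in TEMPLATES for t in instances(name)]
    edges = []
    for pred, succ in RULES:
        shared = set(TEMPLATES[pred]) & set(TEMPLATES[succ])
        for p in instances(pred):
            for s in instances(succ):
                # Shared subscripts must match; a subscript-free task
                # links to every instance of the other template.
                if all(p[1][TEMPLATES[pred].index(v)] ==
                       s[1][TEMPLATES[succ].index(v)] for v in shared):
                    edges.append((p, s))
    return tasks, edges

# expand(3) yields 10 tasks and the chains gen_e -> asm_e -> solve -> rec_e.
```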

The simulation of the program on a hardware configuration is performed by solving an integer programming problem: find a mapping from the set of tasks onto the set of processors that satisfies the precedence and resource constraints while minimizing the objective function (i.e. total execution time). The solution results are the task-processor assignments and the starting, finishing and execution times for all the tasks. Processor and resource utilization levels are also generated by the scheduling procedures.
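
The paper poses this as an integer program; as a lighter-weight illustration of the same ingredients (integer time, non-preemptible tasks, precedence constraints and a longest-path priority rule in the spirit of the critical-path heuristic mentioned in the conclusions), here is a small greedy list scheduler. The single duration per task and the pool of identical processors are simplifying assumptions made for the sketch; the full model also carries resource constraints.

```python
def list_schedule(tasks, preds, duration, n_procs):
    """Greedy list scheduling: at each integer time step, start the
    highest-priority ready task on any idle processor. Tasks are
    non-preemptible and time is an integer space, as assumed above."""
    succs = {t: [] for t in tasks}
    for t in tasks:
        for p in preds.get(t, ()):
            succs[p].append(t)

    priority = {}
    def rank(t):                       # longest remaining path from t
        if t not in priority:
            priority[t] = duration[t] + max((rank(s) for s in succs[t]), default=0)
        return priority[t]
    for t in tasks:
        rank(t)

    finish, schedule = {}, {}
    free_at = [0] * n_procs
    pending, time = set(tasks), 0
    while pending:
        ready = sorted((t for t in pending
                        if all(finish.get(p, float("inf")) <= time
                               for p in preds.get(t, ()))),
                       key=lambda t: -priority[t])
        for t in ready:
            for proc in range(n_procs):
                if free_at[proc] <= time:
                    schedule[t] = (proc, time)   # (processor, start time)
                    finish[t] = time + duration[t]
                    free_at[proc] = finish[t]
                    pending.remove(t)
                    break
        time += 1
    return schedule, max(finish.values())        # assignments and makespan
```

Applied to the task graph produced by the expansion sketch above, such a scheduler returns the task-processor assignments, start and finish times, and the total execution time used to compare configurations.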

Because of the deterministic nature of the solution technique used, the programs that are modeled are restricted to be non-iterative. This forces nonlinear analyses to have a predetermined number of iterations per load step and a predefined number of load steps. Representing the hardware as a set of processors and resources imposes some limitations on the modeling of interprocessor communications because some of the information concerning the physical connections is lost. For example, the cost of communicating between two processors depends on how the two processors are connected.

Furthermore, it has been shown that the problem of finding an optimal assignment resulting in minimum total execution time of a set of non-preemptible tasks to a set of processors subject to resource and precedence constraints is NP-complete in almost all cases [5]. To reduce the complexity of the general scheduling problem, some simplifying assumptions are made, although the simplified problem is still NP-complete. These assumptions are:

  • The resource requirements of a given task executing on a certain processor are assumed to be independent of all other tasks and processors. An example that violates this assumption is the case of two tasks that must execute in a sequential manner with the second task using data computed by the first. The data must be communicated over the network connecting the two processors executing the two tasks. The resource requirement of the communication will vary from minimal if the two tasks execute on the same processor to substantial if the data must be transferred over many physical links of the networks. Task-processor uncoupling also assumes that resources utilized by a task during execution are released after the completion of the task. In certain cases, data generated by a task must be saved for later use. These data utilize a memory resource that is released only when the data are no longer needed.

  • The time domain is assumed to be an integer space. Although time is usually conceived of as a continuous quantity, computer time is discrete because it can be represented by machine cycles, each of which is indivisible.

  • The tasks are assumed to be non-preemptible. A task is said to be preemptible if its execution can be interrupted by another task and then resumed at a later time. The non-preemptibility assumption eliminates the need to compute the cost of saving the status of a task when it is preempted and of restoring it when the task is resumed. To provide a degree of preemptibility, a task can be decomposed into a set of sequential tasks; however, this eliminates the constraint that all subparts of a task execute on the same processor.

The scheduling and computer configuration model as formalized below is capable of representing a wide variety of analysis algorithms and all of the architectures described above.


CONCLUSIONS

This paper presents a methodology for the preliminary evaluation of the performance of finite element system architectures. This methodology provides a tool for the selection of a system for further study without initially incurring the costs of developing such a system. The evaluation is performed by simulating the execution of the program on the proposed hardware configuration. The physical system is modeled as a set of processors and resources, while the finite element program is represented by a directed network where the nodes are independent computational tasks and the links represent precedence constraints on the order of computation. The simulation is based on a heuristic scheduling algorithm representing a modification of the critical path method. The result of the scheduling is an estimate of the cost of solving a problem on a proposed hardware and software configuration. Based on a set of such simulations, alternatives can be ranked and those deserving further study or implementation can be selected.

The methodology provides one approach to the evaluation of finite element systems. It provides a flexible mechanism to represent a wide variety of algorithms and hardware configurations. Regrettably, the current methodology does not place sufficient importance on the issues of interprocessor communications and data sharing, which have been identified as potential bottlenecks in actual system performance. However, these reservations can be alleviated by extending the hardware and software models and the scheduling procedure.

Possible extensions to the methodology include:

  • Better models for the hardware and software. The hardware model can be improved to include information describing the communications network linking the processors. An extension to the software model is needed to allow the representation of iterative, conditional and asynchronous algorithms.

  • Extensions to the scheduling problem. The scheduling problem can be modified to allow the preemption of tasks, to handle resource requirements that depend on the processor used for another task, and to model global resource usage.

  • A more elaborate cost function. This could include such items as resource utilization and resource costs.

  • Extensions to the scheduling technique. The technique can be extended by developing more sophisticated selection rules and using slacks as priorities to generate an improved schedule.