jpg

Contents

Foreword

Preface

Acknowledgments

List of Figures

List of Tables

Acronyms

CHAPTER 1: INTRODUCTION

1.1 PROGRAMMING GUIDELINES

1.2 A BRIEF HISTORY OF F#

1.3 BENEFITS OF F#

1.4 INTRODUCING F#

1.5 IMPERATIVE PROGRAMMING

1.6 FUNCTIONAL PROGRAMMING

CHAPTER 2: PROGRAM STRUCTURE

2.1 NESTING

2.2 FACTORING

2.3 MODULES

2.4 OBJECTS

2.5 FUNCTIONAL DESIGN PATTERNS

2.6 F# DEVELOPMENT

CHAPTER 3: DATA STRUCTURES

3.1 ALGORITHMIC COMPLEXITY

3.2 ARRAYS

3.3 LISTS

3.4 SETS

3.5 HASH TABLES

3.6 MAPS

3.7 CHOOSING A DATA STRUCTURE

3.8 SEQUENCES

3.9 HETEROGENEOUS CONTAINERS

3.10 TREES

CHAPTER 4: NUMERICAL ANALYSIS

4.1 NUMBER REPRESENTATION

4.2 ALGEBRA

4.3 INTERPOLATION

4.4 QUADRATIC SOLUTIONS

4.5 MEAN AND VARIANCE

4.6 OTHER FORMS OF ARITHMETIC

CHAPTER 5: INPUT AND OUTPUT

5.1 PRINTING

5.2 GENERIC PRINTING

5.3 READING FROM AND WRITING TO FILES

5.4 SERIALIZATION

5.5 LEXING AND PARSING

CHAPTER 6: SIMPLE EXAMPLES

6.1 FUNCTIONAL

6.2 NUMERICAL

6.3 STRING RELATED

6.4 LIST RELATED

6.5 ARRAY RELATED

6.6 HIGHER-ORDER FUNCTIONS

CHAPTER 7: VISUALIZATION

7.1 WINDOWS FORMS

7.2 MANAGED DIRECTX

7.3 TESSELATING OBJECTS INTO TRIANGLES

CHAPTER 8: OPTIMIZATION

8.1 TIMING

8.2 PROFILING

8.3 ALGORITHMIC OPTIMIZATIONS

8.4 LOWER-LEVEL OPTIMIZATIONS

CHAPTER 9: LIBRARIES

9.1 LOADING .NET LIBRARIES

9.2 CHARTING AND GRAPHING

9.3 THREADS

9.4 RANDOM NUMBERS

9.5 REGULAR EXPRESSIONS

9.6 VECTORS AND MATRICES

9.7 DOWNLOADING FROM THE WEB

9.8 COMPRESSION

9.9 HANDLING XML

9.10 CALLING NATIVE LIBRARIES

9.11 FOURIER TRANSFORM

9.12 METAPROGRAMMING

CHAPTER 10: DATABASES

10.1 PROTEIN DATA BANK

10.2 WEB SERVICES

10.3 RELATIONAL DATABASES

CHAPTER 11: INTEROPERABILITY

11.1 EXCEL

11.2 MATLAB

11.3 MATHEMATICA

CHAPTER 12: COMPLETE EXAMPLES

12.1 FAST FOURIER TRANSFORM

12.2 SEMI-CIRCLE LAW

12.3 FINDING NTH-NEAREST NEIGHBORS

12.4 LOGISTIC MAP

12.5 REAL-TIME PARTICLE DYNAMICS

Appendix A: Troubleshooting

A.1 VALUE RESTRICTION

A.2 MUTABLE ARRAY CONTENTS

A.3 NEGATIVE LITERALS

A.4 ACCIDENTAL CAPTURE

A.5 LOCAL AND NON-LOCAL VARIABLE DEFINITIONS

A.6 MERGING LINES

A.7 APPLICATIONS THAT DO NOT DIE

A.8 BEWARE OF “IT”

Glossary

Bibliography

INDEX

Image

To my family

Foreword

Computational science is one of the wonders of the modern world. In almost all areas of science the use of computational techniques is rocketing, and software has moved from being a supporting tool to being a key site where research activities are performed. This has meant a huge increase in the importance of controlling and orchestrating computers as part of the daily routine of a scientific laboratory, from large teams making and running the computers performing global climate simulations to the individual scientist/programmer working alone. Across this spectrum, the productivity of teams and the happiness of scientists depends dramatically on their overall competency as programmers, as well as on their skills as researchers within their field. So, in the last 30 years we have seen the continued rise of that new profession: the scientific programmer. A good scientific programmer will carry both epithets with pride, knowing that programming is a key foundation for a successful publication record.

However, programming cultures differ widely, and, over time, gaping divides can emerge that can be to the detriment of all. In this book, Dr. Harrop has taken great steps forward to bridging three very different cultures: managed code programming, scientific programming and functional programming. At a technical level, each has its unique characteristics. Managed code programming, epitomized by .NET and Java, focuses on the productivity of the (primarily commercial) programmer. Scientific programmers focus on high performance computations, data manipulation, numerical computing and visualization. Functional programming focuses on crisp, declarative solutions to problems using compositional techniques. The challenge, then, is to bring these disparate worlds together in a productive way.

The language F#, which Dr. Harrop uses in this book, itself bridges two of these cultures by being a functional language for the .NET platform. F# is an incredibly powerful language: the .NET libraries give a rich and solid foundation of software functionality for many tasks, from routine programming to accessing web services and high performance graphics engines. F# brings an approach to programming that routinely makes even short programs powerful, simple, elegant and correct. However Dr. Harrop has gone a step further, showing how managed code functional programming can revolutionize the art of scientific programming itself by being a powerful workhorse tool that unifies and simplifies many of the tasks scientific programmers face.

But what of the future? The next 20 years will see great changes in scientific programming. It is customary to mention the ever-increasing challenges of parallel, concurrent, distributed and reactive programming. It is widely expected that future micro-processors will use ever-increasing transistor counts to host multiple processing cores, rather than more sophisticated microprocessor designs. If computations can be parallelized and distributed on commodity hardware then the computing resources that can be brought can be massively increased. It is well known that successful concurrent and distributed computing requires a combination of intelligent algorithm design, competent programming, and core components that abstract some details of concurrent execution, e.g. databases and task execution libraries. This needs a language that can interoperate with key technologies such as databases, and parallelism engines. Furthermore, the ability to declaratively and crisply describe solutions to concurrent programming problems is essential, and F# is admirably suited to this task.

The future, will, however, bring other challenges as well. Truly massive amounts of data are now being generated by scientific experiments. Web-based programming will become more and more routine for scientific teams: a good web application can revolutionize a scientific field. Shared databases will soon be used in almost every scientific field, and programmatic access to these will be essential. F# lends itself to these challenges: for example, it is relatively easy to perform sophisticated and high-performance analysis of these data sources by bringing them under the static type discipline of functional programming, as shown by some of the samples in this book.

You will learn much about both programming and science through this book. Dr. Harrop has chosen the style of F# programming most suited to the individual scientist: crisp, succinct and efficient, with a discursive presentation style reminiscent of Mathematica. It has been a pleasure to read, and we trust it will launch you on a long and productive career as a managed code, functional scientific programmer.

Preface

The face of scientific computing has changed. Computational scientists are no longer writing their programs in Fortran and competing for time on supercomputers. Scientists are now streamlining their research by choosing more expressive programming languages, parallel processing on desktop machines and exploiting the wealth of scientific information distributed across the internet.

The landscape of programming languages saw a punctuation in its evolution at the end of the 20th century, marked by the advent of a new breed of languages. These new languages incorporate a multitude of features that are all designed to serve a single purpose: to make life easier. Modern programming languages offer so much more expressive power than traditional languages that they even open up new avenues of scientific research that were simply intractable before.

The next few years will usher in a new era of computing, where parallelism becomes ubiquitous. Few approaches to programming will survive this transition, and functional programming is one of them.

Seamlessly interoperating with computers across the world is of pivotal importance not only because of the breadth of information now available on-line but also because this is the only practicable way to interrogate the enormous amount of data available. The amount of genomic and proteinomic data published every year continues to grow exponentially, as each generation of technology fuels the next.

Only one mainstream programming language combines awesome expressive power, interoperability and performance: F#. This book introduces all of the aspects of the F# programming language needed by a working scientist, emphasizing aspects not covered by existing literature. Consequently, this book is the ideal complement to a detailed overview of the language itself, such as the F# manual or the book Expert F#[25].

Chapters 1–5 cover the most important aspects of F# programming needed to start developing useful F# programs. Chapter 6 ossifies this knowledge with a variety of enlightening and yet simple examples. Chapters 7–11 cover advanced topics including real-time visualization, interoperability and parallel computing. Chapter 12 concludes the book with a suite of complete working programs relevant to scientific computing.

The source code from this book is available from the following website: http://www.ffconsultancy.com/products/fsharp_for_scientists/

J. D. HARROP

Cambridge, UK

June, 2008

Acknowledgments

I would like the thank Don Syme, the creator of F#, for pioneering research into programming languages and for thrusting the incredibly powerful ML family of languages into the limelight of mainstream software development.

Xavier Leroy, everyone at projet Cristal and the Debian package maintainers for making OCaml so practically useful.

Stephen Elliott and Sergei Taraskin and their group at the University of Cambridge for teaching me how to be a research scientist and letting me pursue crazy ideas when I should have been working.

Ioannis Baltopoulos and Enrique Nell for proofreading this book and giving essential feedback.

J. D. H.

List of Figures

2.1 The fold_range function can be used to accumulate the result of applying a function f to a contiguous sequence of integers, in this case the sequence [1, 9).
2.2 Developing an application written entirely in F# using Microsoft Visual Studio 2005.
2.3 Visual Studio provides graphical throwback of the type information inferred by the F# compiler: hovering the mouse over the definition of a variable r in the source code brings up a tooltip giving the inferred type of r.
2.4 A project’s properties page allows the compiler to be controlled.
2.5 The Add-in Manager is used to provide the F# interactive mode.
2.6 Creating a new C# class library project called ClassLibraryl inside a new solution called Interop.
2.7 Creating a new F# project called Project1 also inside the Interop solution.
2.8 Setting the startup project of the Interop solution to the F# project Project1 rather than the C# project ClassLibrary1 as a DLL cannot be used to start an application.
3.1 Complexities of the ipow_1 and ipow_2 functions in terms of the number T(n) of multiplies performed.
3.2 Complexities of the ipow_2 function in terms of the number of multiplies performed, showing: exact complexity T(n) (dots) and lower- and upper-bounds algorithmic complexities log2(n) – 1 ≤ T(n) ≤ 2(1 + log2 n) for n > 1 (lines).
3.3 Measured performance of the ipow_1 and ipow_2 functions which have asymptotic algorithmic complexities of Θ(n) and Θ(ln n), respectively.
3.4 Arrays are the simplest data structure, allowing fast, random access (reading or writing) to the ith element ∀i ∈ {0 … n – 1} where n is the number of elements in the array. Elements cannot be added or removed without copying the whole array.
3.5 The higher-order Array.init function creates an array ai = f(i) for i ∈ {0 … n – 1} using the given function f.
3.6 The higher-order Array.map function creates an array containing the result of applying the given function f to each element in the given array a.
3.7 The higher-order Array.fold_left function repeatedly applies the given function f to the current accumulator and the current array element to produce a new accumulator to be applied with the next array element.
3.8 Lists are the simplest, arbitrarily-extensible data structure. Decapitation splits a list li i ∈ {0 … n – 1} into the head element h and the tail list ti i ∈ {0 … n – 2}.
3.9 Measured performance (time t in seconds) for inserting key-value pairs into hash tables and functional maps containing n – 1 elements. Although the hash table implementation results in better average-case performance, the O(n) time-complexity incurred when the hash table is resized internally produces much slower worst-case performance by the hash table.
3.10 A perfectly-balanced binary tree of depth x = 3 containing 2x+1 – 1 = 15 nodes, including the root node and 2x = 8 leaf nodes.
3.11 The result of inserting an integer counter into each node of the tree depicted in figure 3.10 using the counted_ptree_of_tree function.
3.12 An unbalanced binary tree with the remaining depth stored in every node.
3.13 A maximally-unbalanced binary tree of depth x = 7 containing 2x + 1 = 15 nodes, including the root node and x + 1 = 8 leaf nodes.
3.14 An unbalanced binary tree used to partition the space r ∈ [0, 1) in order to approximate the gravitational effect of a cluster of particles in a system.
3.15 Measured performance of the tree-based approach relative to a simple array-based approach for the evaluation of long-range forces showing the resulting fractional error δ = |OE|/E vs time taken t = ttree/tarray relative to the array-based method.
4.1 Values i of the type int, called machine-precision integers, are an exact representation of a consecutive subset of the set of integers fm01_img01a.jpg where l and u are given by min_int and max_int, respectively.
4.2 Values of the type float, called double-precision floating-point numbers, are an approximate representation of real-valued numbers, showing: a) full-precision (normalized) numbers (black), and b) denormalized numbers (gray).
4.3 Accuracy of two equivalent expressions when evaluated using floating-point arithmetic: a) fm01_img01.jpg (solid line), and b) fm01_img02.jpg (dashed line).
5.1 Parsing character sequences often entails lexing into a token stream and then parsing to convert patterns of tokens into grammatical constructs represented hierarchically by a tree data structure.
6.1 The first seven rows of Pascal’s triangle.
7.1 A blank Windows form.
7.2 A form with a single control, a button.
7.3 A thousand generations of the rule 30 cellular automaton.
7.4 A DirectX viewer that clears the display to a single color (called “coral”).
7.5 Abutting triangles can be amortised into triangle fans and strips to reduce the number of vertices required to describe a geometry.
7.6 A triangle rendered programmatically and visualized using an orthographic projection.
7.7 A DirectX viewer that draws an icosahedron.
7.8 Progressively more refined uniform tesselations of a sphere, obtained by subdividing the triangular faces of an icosahedron and normalising the resulting vertex coordinate vectors to push them onto the surface of a sphere.
7.9 3D surface plot of y = sin(r + 3x)/r where fm01_img03.jpg.
8.1 Profiling results generated by the freely-available NProf profiler for a program solving the queens problem on an 11 × 11 board.
8.2 Measured performance (time t in seconds) of mem functions over set, list and array data structures containing n elements.
8.3 Relative time taken t = ts/ta for testing membership in a set (ts) and an array (ta) as a function of the number of elements n in the container, showing that arrays are up to 2 × faster for n < 35.
8.4 Measured performance (time t in seconds per element) of List.of_array, Array.copy and Set.of_array data structures containing n elements.
8.5 Measured performance (time t in seconds per element) of iter functions over list, array and set data structures containing n elements.
8.6 Measured performance (time t in seconds per element) of the fold_left functions over list and array data structures containing n elements.
8.7 Measured performance (time t in seconds per element) of the fold_right functions over list, array and set data structures containing n elements.
8.8 Controlling the optimization flags passed to the F# compiler by Visual Studio 2005.
8.9 Deforestation refers to methods used to reduce the size of temporary data, such as the use of composite functions to avoid the creation of temporary data structures illustrated here: a) mapping a function f over a list l twice, and b) mapping the composite function f c08_img09.jpg f over the list l once.
8.10 An array of tuples or records containing pairs of float values incurs a level of indirection.
8.11 A struct can be used to completely unbox the float values, placing them directly into the array. This is often the most efficient representation.
9.1 The XYGraph tool from ComponentXtra.
9.2 Fourier series of a discretely sampled sine wave, showing: a) the samples ur r ∈ [0, 16) and Fourier series fm01_img04.jpg, and b) the corresponding Fourier coefficients vs computed numerically using FFTW.
10.1 Using a Windows Forms TreeView control to visualize the contents of an XML document from the Protein Data Bank.
10.2 Visualizing the contents of a SQL Server database using the Windows Forms DataGrid control.
11.1 An Excel spreadsheet with cell values generated from a F# interactive session.
11.2 A plot of sin(x) created in MATLAB by an F# program.
12.1 The approximately semi-circular eigenvalue density P(λ) for a dense, random, square matrix Mij = ±1 with n = 1024, showing the computed eigenvalue distribution.
12.2 Conventionally, atoms are referenced only by their index fm01_img04.jpg within the supercell. Consequently, atoms fm01_img04.jpg at opposite ends of the supercell are considered to be bonded.
12.3 We use an unconventional representation that allows all atoms to be indexed by the supercell they are in as well as their index within the supercell. In this case, the pair of bonded atoms are referenced as ((0, 0), ii) and ((1, 0), ij), i.e. with i in the origin supercell (0, 0) and j in the supercell with offset (1, 0).
12.4 The 50th-nearest neighbour shell from a 105-atom model of amorphous silicon [1] rendered using the F# for Visualization library.
12.5 Chaotic behaviour of the logistic map.
12.6 Real-time interactive simulation and visualization of non-interacting particles bouncing on a 3D surface.

List of Tables

3.1 Complexity of the ipow_2 function measured as the number of multiply operations performed.
3.2 Functions implementing common operations over data structures.
3.3 Asymptotic algorithmic complexities of operations over different data structures containing n elements where i ∈ {0 … n – 1} is a parameter of some of the operations, e.g. insert at index i.
7.1 DirectX primitive drawing functions.

Acronyms

ADO Active Data Objects
ASP Active Server Pages
AST Abstract-syntax tree
BNF Backus-Naur form
CAML Categorical Abstract Machine Language
FFT Fast Fourier Transform
FFTW Fastest Fourier Transform in the West
GOE Gaussian Orthogonal Ensemble
HOF Higher-Order Function
IDE Integrated Development Environment
INRIA Institut National de Recherche en Informatique et en Automatique
IO Input and Output
LCF Logic of Computable Functions
ML Meta-Language
OCaml Objective CAML
OO Object-Oriented
OOP Object-Oriented Programming
OpenGL Open Graphics Library
RPC Remote Procedure Call
SOAP Simple Object Access Protocol
UDDI Universal Description, Discovery and Integration
VM Virtual Machine
VS Visual Studio
WSDL Web Service Definition Language
XML eXtensible Markup Language
XSLT eXtensible Stylesheet Language Transformations