reading

Contents

Foreword

Preface

Acknowledgments

List of Figures

List of Tables

Acronyms

CHAPTER 1: INTRODUCTION

1.1 PROGRAMMING GUIDELINES

1.2 A BRIEF HISTORY OF F#

1.3 BENEFITS OF F#

1.4 INTRODUCING F#

1.5 IMPERATIVE PROGRAMMING

1.6 FUNCTIONAL PROGRAMMING

CHAPTER 2: PROGRAM STRUCTURE

2.1 NESTING

2.2 FACTORING

2.3 MODULES

2.4 OBJECTS

2.5 FUNCTIONAL DESIGN PATTERNS

2.6 F# DEVELOPMENT

CHAPTER 3: DATA STRUCTURES

3.1 ALGORITHMIC COMPLEXITY

3.2 ARRAYS

3.3 LISTS

3.4 SETS

3.5 HASH TABLES

3.6 MAPS

3.7 CHOOSING A DATA STRUCTURE

3.8 SEQUENCES

3.9 HETEROGENEOUS CONTAINERS

3.10 TREES

CHAPTER 4: NUMERICAL ANALYSIS

4.1 NUMBER REPRESENTATION

4.2 ALGEBRA

4.3 INTERPOLATION

4.4 QUADRATIC SOLUTIONS

4.5 MEAN AND VARIANCE

4.6 OTHER FORMS OF ARITHMETIC

CHAPTER 5: INPUT AND OUTPUT

5.1 PRINTING

5.2 GENERIC PRINTING

5.3 READING FROM AND WRITING TO FILES

5.4 SERIALIZATION

5.5 LEXING AND PARSING

CHAPTER 6: SIMPLE EXAMPLES

6.1 FUNCTIONAL

6.2 NUMERICAL

6.3 STRING RELATED

6.4 LIST RELATED

6.5 ARRAY RELATED

6.6 HIGHER-ORDER FUNCTIONS

CHAPTER 7: VISUALIZATION

7.1 WINDOWS FORMS

7.2 MANAGED DIRECTX

7.3 TESSELATING OBJECTS INTO TRIANGLES

CHAPTER 8: OPTIMIZATION

8.1 TIMING

8.2 PROFILING

8.3 ALGORITHMIC OPTIMIZATIONS

8.4 LOWER-LEVEL OPTIMIZATIONS

CHAPTER 9: LIBRARIES

9.1 LOADING .NET LIBRARIES

9.2 CHARTING AND GRAPHING

9.3 THREADS

9.4 RANDOM NUMBERS

9.5 REGULAR EXPRESSIONS

9.6 VECTORS AND MATRICES

9.7 DOWNLOADING FROM THE WEB

9.8 COMPRESSION

9.9 HANDLING XML

9.10 CALLING NATIVE LIBRARIES

9.11 FOURIER TRANSFORM

9.12 METAPROGRAMMING

CHAPTER 10: DATABASES

10.1 PROTEIN DATA BANK

10.2 WEB SERVICES

10.3 RELATIONAL DATABASES

CHAPTER 11: INTEROPERABILITY

11.1 EXCEL

11.2 MATLAB

11.3 MATHEMATICA

CHAPTER 12: COMPLETE EXAMPLES

12.1 FAST FOURIER TRANSFORM

12.2 SEMI-CIRCLE LAW

12.3 FINDING N^TH-NEAREST NEIGHBORS

12.4 LOGISTIC MAP

12.5 REAL-TIME PARTICLE DYNAMICS

Appendix A: Troubleshooting

A.1 VALUE RESTRICTION

A.2 MUTABLE ARRAY CONTENTS

A.3 NEGATIVE LITERALS

A.4 ACCIDENTAL CAPTURE

A.5 LOCAL AND NON-LOCAL VARIABLE DEFINITIONS

A.6 MERGING LINES

A.7 APPLICATIONS THAT DO NOT DIE

A.8 BEWARE OF “IT”

Glossary

Bibliography

INDEX

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., Ill River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Harrop, Jon D.

F# for scientists / Jon Harrop.

p. cm.

Includes index.

ISBN 978-0-470-24211-7 (cloth)

1. F# (Computer program language) 2. Functional programming (Computer science) 3. Science—Data processing. I. Title.

QA76.73.F163H37 2008

005.1'14—dc22

2008009567

Foreword

Computational science is one of the wonders of the modern world. In almost all areas of science the use of computational techniques is rocketing, and software has moved from being a supporting tool to being a key site where research activities are performed. This has meant a huge increase in the importance of controlling and orchestrating computers as part of the daily routine of a scientific laboratory, from large teams making and running the computers performing global climate simulations to the individual scientist/programmer working alone. Across this spectrum, the productivity of teams and the happiness of scientists depends dramatically on their overall competency as programmers, as well as on their skills as researchers within their field. So, in the last 30 years we have seen the continued rise of that new profession: the scientific programmer. A good scientific programmer will carry both epithets with pride, knowing that programming is a key foundation for a successful publication record.

However, programming cultures differ widely, and, over time, gaping divides can emerge that can be to the detriment of all. In this book, Dr. Harrop has taken great steps forward to bridging three very different cultures: managed code programming, scientific programming and functional programming. At a technical level, each has its unique characteristics. Managed code programming, epitomized by .NET and Java, focuses on the productivity of the (primarily commercial) programmer. Scientific programmers focus on high performance computations, data manipulation, numerical computing and visualization. Functional programming focuses on crisp, declarative solutions to problems using compositional techniques. The challenge, then, is to bring these disparate worlds together in a productive way.

The language F#, which Dr. Harrop uses in this book, itself bridges two of these cultures by being a functional language for the .NET platform. F# is an incredibly powerful language: the .NET libraries give a rich and solid foundation of software functionality for many tasks, from routine programming to accessing web services and high performance graphics engines. F# brings an approach to programming that routinely makes even short programs powerful, simple, elegant and correct. However Dr. Harrop has gone a step further, showing how managed code functional programming can revolutionize the art of scientific programming itself by being a powerful workhorse tool that unifies and simplifies many of the tasks scientific programmers face.

But what of the future? The next 20 years will see great changes in scientific programming. It is customary to mention the ever-increasing challenges of parallel, concurrent, distributed and reactive programming. It is widely expected that future micro-processors will use ever-increasing transistor counts to host multiple processing cores, rather than more sophisticated microprocessor designs. If computations can be parallelized and distributed on commodity hardware then the computing resources that can be brought can be massively increased. It is well known that successful concurrent and distributed computing requires a combination of intelligent algorithm design, competent programming, and core components that abstract some details of concurrent execution, e.g. databases and task execution libraries. This needs a language that can interoperate with key technologies such as databases, and parallelism engines. Furthermore, the ability to declaratively and crisply describe solutions to concurrent programming problems is essential, and F# is admirably suited to this task.

The future, will, however, bring other challenges as well. Truly massive amounts of data are now being generated by scientific experiments. Web-based programming will become more and more routine for scientific teams: a good web application can revolutionize a scientific field. Shared databases will soon be used in almost every scientific field, and programmatic access to these will be essential. F# lends itself to these challenges: for example, it is relatively easy to perform sophisticated and high-performance analysis of these data sources by bringing them under the static type discipline of functional programming, as shown by some of the samples in this book.

You will learn much about both programming and science through this book. Dr. Harrop has chosen the style of F# programming most suited to the individual scientist: crisp, succinct and efficient, with a discursive presentation style reminiscent of Mathematica. It has been a pleasure to read, and we trust it will launch you on a long and productive career as a managed code, functional scientific programmer.

Preface

The face of scientific computing has changed. Computational scientists are no longer writing their programs in Fortran and competing for time on supercomputers. Scientists are now streamlining their research by choosing more expressive programming languages, parallel processing on desktop machines and exploiting the wealth of scientific information distributed across the internet.

The landscape of programming languages saw a punctuation in its evolution at the end of the 20th century, marked by the advent of a new breed of languages. These new languages incorporate a multitude of features that are all designed to serve a single purpose: to make life easier. Modern programming languages offer so much more expressive power than traditional languages that they even open up new avenues of scientific research that were simply intractable before.

The next few years will usher in a new era of computing, where parallelism becomes ubiquitous. Few approaches to programming will survive this transition, and functional programming is one of them.

Seamlessly interoperating with computers across the world is of pivotal importance not only because of the breadth of information now available on-line but also because this is the only practicable way to interrogate the enormous amount of data available. The amount of genomic and proteinomic data published every year continues to grow exponentially, as each generation of technology fuels the next.

Only one mainstream programming language combines awesome expressive power, interoperability and performance: F#. This book introduces all of the aspects of the F# programming language needed by a working scientist, emphasizing aspects not covered by existing literature. Consequently, this book is the ideal complement to a detailed overview of the language itself, such as the F# manual or the book Expert F#[25].

Chapters 1–5 cover the most important aspects of F# programming needed to start developing useful F# programs. Chapter 6 ossifies this knowledge with a variety of enlightening and yet simple examples. Chapters 7–11 cover advanced topics including real-time visualization, interoperability and parallel computing. Chapter 12 concludes the book with a suite of complete working programs relevant to scientific computing.

The source code from this book is available from the following website: http://www.ffconsultancy.com/products/fsharp_for_scientists/

J. D. HARROP

Cambridge, UK

June, 2008

Acknowledgments

I would like the thank Don Syme, the creator of F#, for pioneering research into programming languages and for thrusting the incredibly powerful ML family of languages into the limelight of mainstream software development.

Xavier Leroy, everyone at projet Cristal and the Debian package maintainers for making OCaml so practically useful.

Stephen Elliott and Sergei Taraskin and their group at the University of Cambridge for teaching me how to be a research scientist and letting me pursue crazy ideas when I should have been working.

Ioannis Baltopoulos and Enrique Nell for proofreading this book and giving essential feedback.

J. D. H.

List of Figures

2.1 The fold_range function can be used to accumulate the result of applying a function f to a contiguous sequence of integers, in this case the sequence [1, 9).

2.2 Developing an application written entirely in F# using Microsoft Visual Studio 2005.

2.3 Visual Studio provides graphical throwback of the type information inferred by the F# compiler: hovering the mouse over the definition of a variable r in the source code brings up a tooltip giving the inferred type of r.

2.4 A project’s properties page allows the compiler to be controlled.

2.5 The Add-in Manager is used to provide the F# interactive mode.

2.6 Creating a new C# class library project called ClassLibraryl inside a new solution called Interop.

2.7 Creating a new F# project called Project1 also inside the Interop solution.

2.8 Setting the startup project of the Interop solution to the F# project Project1 rather than the C# project ClassLibrary1 as a DLL cannot be used to start an application.

3.1 Complexities of the ipow_1 and ipow_2 functions in terms of the number T(n) of multiplies performed.

3.2 Complexities of the ipow_2 function in terms of the number of multiplies performed, showing: exact complexity T(n) (dots) and lower- and upper-bounds algorithmic complexities log₂(n) – 1 ≤ T(n) ≤ 2(1 + log₂ n) for n > 1 (lines).

3.3 Measured performance of the ipow_1 and ipow_2 functions which have asymptotic algorithmic complexities of Θ(n) and Θ(ln n), respectively.

3.4 Arrays are the simplest data structure, allowing fast, random access (reading or writing) to the i^th element ∀i ∈ {0 … n – 1} where n is the number of elements in the array. Elements cannot be added or removed without copying the whole array.

3.5 The higher-order Array.init function creates an array a_i = f(i) for i ∈ {0 … n – 1} using the given function f.

3.6 The higher-order Array.map function creates an array containing the result of applying the given function f to each element in the given array a.

3.7 The higher-order Array.fold_left function repeatedly applies the given function f to the current accumulator and the current array element to produce a new accumulator to be applied with the next array element.

3.8 Lists are the simplest, arbitrarily-extensible data structure. Decapitation splits a list l_i i ∈ {0 … n – 1} into the head element h and the tail list t_i i ∈ {0 … n – 2}.

3.9 Measured performance (time t in seconds) for inserting key-value pairs into hash tables and functional maps containing n – 1 elements. Although the hash table implementation results in better average-case performance, the O(n) time-complexity incurred when the hash table is resized internally produces much slower worst-case performance by the hash table.

3.10 A perfectly-balanced binary tree of depth x = 3 containing 2^x+1 – 1 = 15 nodes, including the root node and 2^x = 8 leaf nodes.

3.11 The result of inserting an integer counter into each node of the tree depicted in figure 3.10 using the counted_ptree_of_tree function.

3.12 An unbalanced binary tree with the remaining depth stored in every node.

3.13 A maximally-unbalanced binary tree of depth x = 7 containing 2x + 1 = 15 nodes, including the root node and x + 1 = 8 leaf nodes.

3.14 An unbalanced binary tree used to partition the space r ∈ [0, 1) in order to approximate the gravitational effect of a cluster of particles in a system.

3.15 Measured performance of the tree-based approach relative to a simple array-based approach for the evaluation of long-range forces showing the resulting fractional error δ = |O – E|/E vs time taken t = t_tree/t_array relative to the array-based method.

4.1 Values i of the type int, called machine-precision integers, are an exact representation of a consecutive subset of the set of integers

where l and u are given by min_int and max_int, respectively.

4.2 Values of the type float, called double-precision floating-point numbers, are an approximate representation of real-valued numbers, showing: a) full-precision (normalized) numbers (black), and b) denormalized numbers (gray).

4.3 Accuracy of two equivalent expressions when evaluated using floating-point arithmetic: a)

(solid line), and b)

(dashed line).

5.1 Parsing character sequences often entails lexing into a token stream and then parsing to convert patterns of tokens into grammatical constructs represented hierarchically by a tree data structure.

6.1 The first seven rows of Pascal’s triangle.

7.1 A blank Windows form.

7.2 A form with a single control, a button.

7.3 A thousand generations of the rule 30 cellular automaton.

7.4 A DirectX viewer that clears the display to a single color (called “coral”).

7.5 Abutting triangles can be amortised into triangle fans and strips to reduce the number of vertices required to describe a geometry.

7.6 A triangle rendered programmatically and visualized using an orthographic projection.

7.7 A DirectX viewer that draws an icosahedron.

7.8 Progressively more refined uniform tesselations of a sphere, obtained by subdividing the triangular faces of an icosahedron and normalising the resulting vertex coordinate vectors to push them onto the surface of a sphere.

7.9 3D surface plot of y = sin(r + 3x)/r where

8.1 Profiling results generated by the freely-available NProf profiler for a program solving the queens problem on an 11 × 11 board.

8.2 Measured performance (time t in seconds) of mem functions over set, list and array data structures containing n elements.

8.3 Relative time taken t = t_s/t_a for testing membership in a set (t_s) and an array (t_a) as a function of the number of elements n in the container, showing that arrays are up to 2 × faster for n < 35.

8.4 Measured performance (time t in seconds per element) of List.of_array, Array.copy and Set.of_array data structures containing n elements.

8.5 Measured performance (time t in seconds per element) of iter functions over list, array and set data structures containing n elements.

8.6 Measured performance (time t in seconds per element) of the fold_left functions over list and array data structures containing n elements.

8.7 Measured performance (time t in seconds per element) of the fold_right functions over list, array and set data structures containing n elements.

8.8 Controlling the optimization flags passed to the F# compiler by Visual Studio 2005.

8.9 Deforestation refers to methods used to reduce the size of temporary data, such as the use of composite functions to avoid the creation of temporary data structures illustrated here: a) mapping a function f over a list l twice, and b) mapping the composite function f

f over the list l once.

8.10 An array of tuples or records containing pairs of float values incurs a level of indirection.

8.11 A struct can be used to completely unbox the float values, placing them directly into the array. This is often the most efficient representation.

9.1 The XYGraph tool from ComponentXtra.

9.2 Fourier series of a discretely sampled sine wave, showing: a) the samples u_r r ∈ [0, 16) and Fourier series

, and b) the corresponding Fourier coefficients v_s computed numerically using FFTW.

10.1 Using a Windows Forms TreeView control to visualize the contents of an XML document from the Protein Data Bank.

10.2 Visualizing the contents of a SQL Server database using the Windows Forms DataGrid control.

11.1 An Excel spreadsheet with cell values generated from a F# interactive session.

11.2 A plot of sin(x) created in MATLAB by an F# program.

12.1 The approximately semi-circular eigenvalue density P(λ) for a dense, random, square matrix M_ij = ±1 with n = 1024, showing the computed eigenvalue distribution.

12.2 Conventionally, atoms are referenced only by their index

within the supercell. Consequently, atoms

at opposite ends of the supercell are considered to be bonded.

12.3 We use an unconventional representation that allows all atoms to be indexed by the supercell they are in as well as their index within the supercell. In this case, the pair of bonded atoms are referenced as ((0, 0), i_i) and ((1, 0), i_j), i.e. with i in the origin supercell (0, 0) and j in the supercell with offset (1, 0).

12.4 The 50^th-nearest neighbour shell from a 10⁵-atom model of amorphous silicon [1] rendered using the F# for Visualization library.

12.5 Chaotic behaviour of the logistic map.

12.6 Real-time interactive simulation and visualization of non-interacting particles bouncing on a 3D surface.

ADO	Active Data Objects
ASP	Active Server Pages
AST	Abstract-syntax tree
BNF	Backus-Naur form
CAML	Categorical Abstract Machine Language
FFT	Fast Fourier Transform
FFTW	Fastest Fourier Transform in the West
GOE	Gaussian Orthogonal Ensemble
HOF	Higher-Order Function
IDE	Integrated Development Environment
INRIA	Institut National de Recherche en Informatique et en Automatique
IO	Input and Output
LCF	Logic of Computable Functions
ML	Meta-Language
OCaml	Objective CAML
OO	Object-Oriented
OOP	Object-Oriented Programming
OpenGL	Open Graphics Library
RPC	Remote Procedure Call
SOAP	Simple Object Access Protocol
UDDI	Universal Description, Discovery and Integration
VM	Virtual Machine
VS	Visual Studio
WSDL	Web Service Definition Language
XML	eXtensible Markup Language
XSLT	eXtensible Stylesheet Language Transformations