# ilex: Generating fast Lexers from an Idris2 DSL

This is a library for generating performant lexers that can be derived from a simple regular expression DSL (domain specific language). However, this library also offers a way to enhance the lexers to become proper parsers. The following list describes the main characteristics of this library:

* Performance: Lexers are compiled at runtime to discrete, array-backed state machines that operate directly on raw byte arrays and therefore perform exceedingly well in comparison with common combinator-based solutions that operate on lists of characters.
* Stack safety: The drivers running the lexers over byte arrays are tail recursive and therefore safe to use on backends with limited stack size such as the JavaScript backends.
* Convenience: A high-level DSL for describing the regular expressions used in the lexers is provided together with a few basic conversion functions to extract lexical tokens from the matched byte arrays.
* Full unicode (UTF-8) support: Regular expressions based on characters (unicode code points) are automatically converted to state machines that handle non-ascii code points correctly at the byte level.
* Streaming: It is straightforward to use the state machines in a setting where large amounts of data have to be streamed chunk-wise without loading them all at once into memory.
* Context awareness: It is possible to combine several lexers into a single context-aware tokenizer. Common use cases for this are, for instance, the lexing of string literals with support for character escape sequences, string interpolation where string literals can contain regular code snippets, or nested multi-line comments.
* Parsing: As an experimental feature, we explore the possibilities this library offers for writing proper LR-parsers by keeping track of the parsing state in a (possibly indexed) stack-like data type. While this approach gives us all of the benefits listed above, it remains to be seen if writing parsers in such a way is convenient enough to compete with the well-established approaches offered by parser generators, parser combinators, or hand-written recursive descent parsers.

This is a literate Idris file, so we start with some imports and a utility function used to run the lexers and parsers described in this section and print their results.

```idris
module README

import Derive.Prelude
import Examples.Basics
import Examples.Types
import FS.Posix
import FS.System
import IO.Async.Loop.Epoll
import IO.Async.Loop.Posix
import Syntax.T1
import Text.ILex.FS

%default total
%language ElabReflection
%hide Data.Linear.(.)
%hide Language.Reflection.TT.Str

pretty :
     {auto ie : Interpolation e}
  -> {auto ia : Interpolation a}
  -> Parser1 (BoundedErr e) r s (List a)
  -> String
  -> IO ()
pretty p s =
  case parseString p Virtual s of
    Left x   => putStrLn $ interpolate x
    Right vs => traverse_ (putStrLn . interpolate) vs
```

## A basic CSV Lexer

A [lexer](https://en.wikipedia.org/wiki/Lexical_analysis) cuts text (or a raw byte vector) into syntactically meaningful tokens.
For instance, assume we have a CSV file with the following content:

```idris
csvStr1 : String
csvStr1 =
  """
  foo,12,true
  bar,13,true
  baz,14,false
  """
```

What kinds of tokens can we recognize here? We certainly have the commas separating the fields on every line. We also have line breaks separating table rows. Between those, we have arbitrary data. Let us start with these very basic observations and define and test a lexer for this structure. Since we are going to implement several iterations of a CSV lexer, we use an integer suffix in our names:

```idris
data CSV0 : Type where
  Comma0 : CSV0
  NL0    : CSV0
  Cell0  : String -> CSV0

%runElab derive "CSV0" [Show,Eq]

Interpolation CSV0 where interpolate = show
```

The data constructors of `CSV0` describe the different tokens we can recognize in a (very basic) CSV file. Without further ado, here is our first CSV lexer:

```idris
csv0 : L1 q Void 1 CSV0
csv0 =
  lexer
    [ ctok ',' Comma0
    , nltok '\n' NL0
    , readTok (plus (dot && not ',')) Cell0
    ]
```

Before we describe the definition above in some detail, let's quickly test it at the REPL:

```repl
README> :exec pretty csv0 csvStr1
Cell0 "foo": 1:1--1:4
Comma0: 1:4--1:5
Cell0 "12": 1:5--1:7
Comma0: 1:7--1:8
Cell0 "true": 1:8--1:12
NL0: 1:12--1:13
Cell0 "bar": 2:1--2:4
Comma0: 2:4--2:5
Cell0 "13": 2:5--2:7
Comma0: 2:7--2:8
Cell0 "true": 2:8--2:12
NL0: 2:12--2:13
Cell0 "baz": 3:1--3:4
Comma0: 3:4--3:5
Cell0 "14": 3:5--3:7
Comma0: 3:7--3:8
Cell0 "false": 3:8--3:13
```

As you can see, the input has been cut into a list of tokens, each annotated with the region (its *bounds*) where in the string it appeared. Please note that the functions in this library cannot be evaluated directly at the REPL, so we have to actually compile a small program and run it using `:exec`.

Feel free to experiment with this a bit. For instance, you might want to force an error by passing it a string that contains a tab character (`"\t"`), something our lexer cannot yet handle. You will get a detailed error message highlighting the exact position of the error in the input string.

Now, let us have a closer look at the definition of `csv0`: We use a `TokenMap` to describe a lexer: a list of regular expressions, each paired with a conversion function. To create these pairs, we often use one of the utility functions from `Text.ILex.Stack`. Function `ctok`, for instance, matches a fixed sequence of characters and emits a properly bounded token. Cell tokens are a bit more involved: We expect one or more (`plus`) printable characters (`dot`) that are also not commas (`&& not ','`) and convert them to a `Cell0` token. For such variable-length tokens, we can use utilities `readTok` (if we want the token as a `String`) or `convTok` (if we want the token as a `ByteString`; see the sketch at the end of this subsection).

Please note that the utilities described so far assume that there are no line breaks in the tokens they recognize. We therefore use a special utility (`nltok`) to recognize the line break tokens, which makes sure that the current line number is properly increased internally.

A quick note about the types: `L1 q Void 1 CSV0` describes a lexer that operates on state thread `q` (you can ignore this part for now), emits custom errors of type `Void` (no custom errors in this case, because `Void` is uninhabited), and emits bounded tokens of type `CSV0`. After every token, we get a new state to start with of type `Index 1`, which is a natural number (encoded as a value of type `Bits32`) strictly less than 1. Again, we will see more about this when we talk about context-sensitive lexers.
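To illustrate the difference between `readTok` and `convTok`, here is a hypothetical variant of our lexer that keeps cell contents as raw `ByteString`s instead of converting them to `String`s. This is a sketch only; the names `CSVBS`, `csv0bs`, and friends are made up for this example and are not part of the library:

```idris
data CSVBS : Type where
  CommaB : CSVBS
  NLB    : CSVBS
  CellB  : ByteString -> CSVBS

csv0bs : L1 q Void 1 CSVBS
csv0bs =
  lexer
    [ ctok ',' CommaB
    , nltok '\n' NLB
    , convTok (plus (dot && not ',')) CellB -- keep the raw bytes
    ]
```

Keeping the raw bytes avoids the UTF-8 decoding step, which can be useful when cells are passed on unchanged anyway.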
### Additional Cell Types

While our first example already demonstrates how we can tokenize a very basic language, one could argue that something similar could be achieved just by using functions `lines` and `split` from `Data.String`. Therefore, we are going to enhance our lexer in several ways. First, we'd like to support boolean literals (`"true"` and `"false"`). Next, we'd like to support positive and negative integer literals. And finally, we'd like to be able to read not only data with Unix-style line breaks but also the ASCII line-break sequences used on other systems.

First, our updated token type:

```idris
data CSV1 : Type where
  Comma1 : CSV1
  NL1    : CSV1
  Txt1   : String -> CSV1
  Bool1  : Bool -> CSV1
  Int1   : Integer -> CSV1

%runElab derive "CSV1" [Show,Eq]

Interpolation CSV1 where interpolate = show
```

And here's the corresponding lexer:

```idris
linebreak : RExp True
linebreak = '\n' <|> "\n\r" <|> "\r\n" <|> '\r' <|> '\RS'

csv1 : L1 q Void 1 CSV1
csv1 =
  lexer
    [ ctok ',' Comma1
    , nltok linebreak NL1
    , ctok "true" (Bool1 True)
    , ctok "false" (Bool1 False)
    , convTok (opt '-' >> decimal) (Int1 . integer)
    , readTok (plus (dot && not ',')) Txt1
    ]
```

The above is *much* more powerful than our original `csv0` lexer, and trying to implement this manually might be quite a frustrating experience. Go ahead and experiment with this a bit at the REPL, and see if it behaves as expected.

### Overlapping Expressions and Maximal Munch

Before we continue enhancing our lexer, we have to quickly explain two important concepts. When you look at the definition of `csv1`, you will note that there is some overlap between certain token types. For instance, `"true"` could be interpreted either as a boolean or as a text token. In addition, `"\r\n"` could be read as one or two line breaks. Go ahead and check at the REPL how `pretty` deals with these cases.

As you can easily verify, `"foo"` is interpreted as a single text token instead of three tokens (one for each character), even though an interpretation as three tokens would be in agreement with the regular expressions we defined. Likewise, `"\r\n"` is interpreted as a single line break instead of two. This strategy is called the [maximal munch](https://en.wikipedia.org/wiki/Maximal_munch) or *longest match* principle: The lexer tries to generate the longest possible token that matches an expression.

In addition, there can be some overlap between tokens of the same length, so that they could be interpreted in more than one way. In this case, the first matching expression takes precedence. Therefore, `"true"` is interpreted as a boolean and `"123"` as an integer, although both would also be valid text tokens.
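Here is what such a REPL session might look like (output format as in the example above; the exact rendering may differ slightly):

```repl
README> :exec pretty csv1 "true"
Bool1 True: 1:1--1:5
README> :exec pretty csv1 "truest"
Txt1 "truest": 1:1--1:7
```

The first call shows rule precedence: for two matches of the same length, the boolean rule wins because it is listed first. The second shows maximal munch: `"truest"` is lexed as one six-character text token rather than the boolean `true` followed by the text `st`.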
### Quoted Text

Currently, our CSV lexer cannot process any text that contains commas. Likewise, it is not possible to include any control characters such as tabs or line breaks in our text tokens. We'd now like to add support for these.

Again, there are several strategies we could use here. What we are going to do is to add support for quoted strings. This will allow us to have text with commas by wrapping the corresponding text in double quotes. However, a quoted string can no longer contain double quote characters, so we need a way to escape those. Likewise, we'd like to include and escape some other characters such as tabs and line breaks.

Here's a regular expression for quoted strings:

```idris
quoted : RExp True
quoted = '"' >> star qchar >> '"'
  where
    qchar : RExp True
    qchar = (dot && not '"') <|> #""""# <|> #""n"# <|> #""r"# <|> #""t"#
```

With this, we can enhance our lexer:

```idris
unquote : ByteString -> String

csv1_2 : L1 q Void 1 CSV1
csv1_2 =
  lexer
    [ ctok ',' Comma1
    , nltok linebreak NL1
    , ctok "true" (Bool1 True)
    , ctok "false" (Bool1 False)
    , convTok (opt '-' >> decimal) (Int1 . integer)
    , readTok (plus (dot && not ',' && not '"')) Txt1
    , convTok quoted (Txt1 . unquote)
    ]
```

In order to keep things simple, we are going to use character lists to implement `unquote`:

```idris
unquote = go [<] . unpack . toString
  where
    go : SnocList Char -> List Char -> String
    go sc []              = pack (sc <>> [])
    go sc ('"'::'"' ::xs) = go (sc :< '"') xs
    go sc ('"'::'t' ::xs) = go (sc :< '\t') xs
    go sc ('"'::'n' ::xs) = go (sc :< '\n') xs
    go sc ('"'::'r' ::xs) = go (sc :< '\r') xs
    go sc ('"'::xs)       = go sc xs
    go sc (x::xs)         = go (sc :< x) xs
```

We are going to look at how to do this in a more performant and elegant manner in a moment. I suggest you fire up a REPL session and experiment a bit with the kinds of tokens our lexer is now capable of recognizing.

As promised, here's an alternative version of `unquote`: We just tokenize the byte string again. This avoids having to convert the byte string to a list of characters and traversing that. In general, this should be faster than the above, but I suggest profiling this properly if it is used in performance-critical code.

```idris
lexUQ : L1 q Void 1 String
lexUQ =
  lexer
    [ ctok #""""# "\""
    , ctok #""n"# "\n"
    , ctok #""r"# "\r"
    , ctok #""t"# "\t"
    , ctok #"""# ""
    , readTok (plus (dot && not '"' && not '\\')) id
    ]

fastUnquote : ByteString -> String
fastUnquote bs =
  case parseBytes lexUQ Virtual bs of
    Left _   => ""
    Right xs => fastConcat (map val xs)
```
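Here is roughly what lexing a line with a quoted cell might now look like at the REPL (a sketch; the bounds assume single-byte characters):

```repl
README> :exec pretty csv1_2 #"a,"b,c",true"#
Txt1 "a": 1:1--1:2
Comma1: 1:2--1:3
Txt1 "b,c": 1:3--1:8
Comma1: 1:8--1:9
Bool1 True: 1:9--1:13
```

Note that the comma inside the quoted cell no longer separates fields, and that the bounds of the text token cover the surrounding quotes.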
## Under the Hood: Context-sensitive Lexers

In this section we are going to write a context-sensitive lexer. The techniques we learn here are going to be used in the next section, where we'll write a proper CSV parser. Before we continue, please note that ilex uses thread-local mutable state to keep track of the parts we parsed so far. This makes use of linear types and the [ref1 library](https://github.com/stefan-hoeck/idris2-ref1). It is recommended that readers familiarize themselves with the techniques described there in order to better understand what's coming next.

### A Custom Mutable Parser Stack

We'd like to reimplement the behavior of `csv1_2` but without having to read every quoted string token twice. In order to do this, we will already have to implement a mini-parser, because we are going to use different lexers depending on whether we are currently within a quoted string token or not.

```idris
0 QSTCK : Type -> Type
QSTCK = Stack Void (SnocList $ Bounded CSV1) 2
```

Parser stack `QSTCK q` is a wrapper around several mutable references that can be manipulated in state thread `q`.

### Parser State

Unlike simple lexers, which always operate in the same state, a proper parser can be in one of several states, and each state expects and recognizes different lexical tokens. For our very simple context-sensitive lexer, we need two states, so our state size is `2`:

```idris
QSz : Bits32
QSz = 2
```

Our lexer can be in one of two states: `0`, which is the default state and is predefined as `Ini`, and `InStr`, which lets us know that we are currently inside a quoted string token:

```idris
QST : Type
QST = Index QSz

InStr : QST
InStr = 1
```

As you can see, a parser state is just an index into an array of the given size. In ilex, almost everything is array-backed for maximum performance.

### Lexers

The next step is to define a lexer for each of the states our parser can be in. For every token recognized by one of the lexers, we not only need to describe how the token affects our mutable parser stack, but also return the state the parser will be in *after* the given token was recognized.

For the initial state, almost nothing changes compared to `csv1_2`:

```idris
lexInit : DFA q 2 QSTCK
lexInit =
  dfa
    [ ctok ',' Comma1
    , nltok linebreak NL1
    , ctok "true" (Bool1 True)
    , ctok "false" (Bool1 False)
    , convTok (opt '-' >> decimal) (Int1 . integer)
    , readTok (plus (dot && not ',' && not '"')) Txt1
    , copen '"' (pure InStr)
    ]
```

*DFA* is an acronym for "deterministic finite automaton", and utility `dfa` converts a list of pairs (regular expressions plus state transformations) to such an automaton. As you can see, since `QSTCK` implements interfaces `LC` and `LexST`, we can just reuse the utilities from the previous example. However, we no longer recognize a whole quoted string but just the opening quote. We use the `copen` utility, which will push the current token bounds onto a stack of bounds of things that need to be closed. Other than that, we just return the new parser state, which will be `InStr`.

Once we are in a quoted string, we recognize escape sequences as well as chunks of unescaped text. These will be appended to the snoc list in the `strs` mutable reference. Once we encounter an unescaped closing quote, we end quoted string recognition and append the recognized, concatenated string to the list of tokens. Here are the necessary utilities and token map:

```idris
closeStr : (x : QSTCK q) => F1 q QST

lexStr : DFA q QSz QSTCK
lexStr =
  dfa
    [ cexpr #""""# $ pushStr InStr "\""
    , cexpr #""n"# $ pushStr InStr "\n"
    , cexpr #""r"# $ pushStr InStr "\r"
    , cexpr #""t"# $ pushStr InStr "\t"
    , cexpr '"' closeStr
    , read (plus $ dot && not '"') (pushStr InStr)
    ]
```

Function `closeStr` is more involved: In this case, we need to manually construct the token bounds, append the bounded token to the list of recognized tokens, and change the parser state back to `Ini`:

```idris
closeStr = T1.do
  bs <- closeBounds
  s  <- getStr
  push1 x.stack_ (B (Txt1 s) bs)
  pure Ini
```

We now wrap up the different lexers in an array of lexers, which describes the possible state transformations for every parser state:

```idris
quotedTrans : Lex1 q QSz QSTCK
quotedTrans = lex1 [E Ini lexInit, E InStr lexStr]
```
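Taken together, the two entries of `quotedTrans` implement the following little state machine (an illustration only, not library code):

```idris
-- Ini   --- '"' (copen) -----------> InStr
-- InStr --- escapes/text (pushStr) -> InStr
-- InStr --- '"' (closeStr) --------> Ini
--
-- All other tokens are recognized in `Ini` and leave the state unchanged.
```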
### Error Handling

To finalize our context-aware lexer, we need to generate proper error messages. Again, the possible error messages are stored in an array (one handler for each parser state). This is actually quite nice, because it allows us to decouple the handling of errors from the parser's state transformations. Whenever the current DFA ends in an error state, the corresponding error handler is picked from the array and passed the parser stack and byte sequence parsed so far:

```idris
quotedErr : Arr32 QSz (QSTCK q -> F1 q (BoundedErr Void))
quotedErr = arr32 QSz (unexpected []) [E InStr $ unclosedIfEOI "\"" []]
```

We also need to describe what happens when we reach the end of input. Here, we check if we are in a valid parser state (a CSV string can end in any state except within a quoted string) and either produce a result or fail with an error:

```idris
quotedEOI : QST -> QSTCK q -> F1 q (Either (BoundedErr Void) (List $ Bounded CSV1))
quotedEOI st x =
  case st == Ini of
    False => arrFail QSTCK quotedErr st x
    True  => getList x.stack_ >>= pure . Right
```

Finally, we wrap everything up in a record of type `P1` (a "linear parser"). We are going to have a look at the `snocChunk` part when we talk about streaming large amounts of data:

```idris
csv1_3 : P1 q (BoundedErr Void) QSz QSTCK (List $ Bounded CSV1)
csv1_3 = P Ini (init [<]) quotedTrans snocChunk quotedErr quotedEOI
```
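The new parser behaves just like `csv1_2`, but every quoted string is now read only once. A quick check at the REPL might look something like this (a sketch; exact rendering may differ):

```repl
README> :exec pretty csv1_3 #""a""b",true"#
Txt1 "a\"b": 1:1--1:7
Comma1: 1:7--1:8
Bool1 True: 1:8--1:12
```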
## Parsing CSV Files

We are now ready to learn how to properly parse comma-separated values. We use the same techniques as in the last section (a mutable parser stack plus several parser states with corresponding state transformations). However, instead of just listing the lexical tokens, we are going to actually accumulate these tokens into a larger data structure. We therefore start with a definition of said data structure:

```idris
data Cell : Type where
  Null  : Cell
  CStr  : String -> Cell
  CBool : Bool -> Cell
  CInt  : Integer -> Cell

%runElab derive "Cell" [Show,Eq]

record Line where
  constructor L
  nr    : Nat
  cells : List Cell

%runElab derive "Line" [Show,Eq]

Interpolation Line where interpolate = show

0 Table : Type
Table = List Line
```

As you can see, a CSV table is just a list of lines, with a line consisting of a line number and a list of cells.

### Parser Stack and State

Below is the record of mutable references we are going to use to accumulate lines. As usual, we include the fields for keeping track of token bounds. In addition, we have mutable references for accumulating quoted strings, the current line, and the whole table. Typically, we call this the parser's *stack*, even though it consists of more than a single (snoc)list:

```idris
public export
record CSTCK (q : Type) where
  constructor C
  line  : Ref q Nat
  col   : Ref q Nat
  psns  : Ref q (SnocList Position)
  strs  : Ref q (SnocList String)
  err   : Ref q (Maybe $ BoundedErr Void)
  cells : Ref q (SnocList Cell)
  lines : Ref q (SnocList Line)
  bytes : Ref q ByteString

export %inline
HasPosition CSTCK where
  line      = CSTCK.line
  col       = CSTCK.col
  positions = CSTCK.psns

export %inline
HasError CSTCK Void where
  error = err

export %inline
HasStringLits CSTCK where
  strings = strs

export %inline
HasStack CSTCK (SnocList Line) where
  stack = lines

export %inline
HasBytes CSTCK where
  bytes = CSTCK.bytes

export
cinit : F1 q (CSTCK q)
cinit = T1.do
  l  <- ref1 Z
  c  <- ref1 Z
  bs <- ref1 [<]
  ss <- ref1 [<]
  er <- ref1 Nothing
  cs <- ref1 [<]
  ls <- ref1 [<]
  by <- ref1 ""
  pure (C l c bs ss er cs ls by)
```

Next, we'll define the different parser states. We want to make sure we add an empty `Null` cell in case we encounter two consecutive commas or a comma followed by a line break. Also, there must not be two consecutive values, and we want to know whether we are currently parsing a quoted string or not. This leads to four distinct parser states (the initial state `Ini` marks the beginning of a line). However, we add one more state, which will only be reached when we encounter an unescaped line break inside a quoted string: In our case, this is interpreted as an error, and we want to make sure that this error produces an error message explaining that we have encountered an unclosed quote.

```idris
public export
CSz : Bits32
CSz = 5

public export
0 CST : Type
CST = Index CSz

export
Val, Str, Com, NL : CST
Val = 1; Str = 2; Com = 3; NL = 4
```

### State Transitions

We are now going to define the parser's state transitions: the lexical tokens we recognize at each parser state, how they affect the parser stack, and the parser state *after* encountering a token.

First: After encountering a CSV cell, we push it onto the `cells` stack and switch to the `Val` state, because we just encountered a value:

```idris
onCell : (x : CSTCK q) => Cell -> F1 q CST
onCell v = push1 x.cells v >> pure Val
```

Second: After encountering a line break, we increase the line number and assemble the CSV line from the current line number and collected cells. If the list of cells is empty, we do not append an empty line to the table. There is one special case we have to consider: If the line break occurred right after a comma, we make sure to push a `Null` cell onto the stack of cells:

```idris
onNL : (x : CSTCK q) => (afterComma : Bool) -> F1 q CST
onNL afterComma = T1.do
  incline 1
  when1 afterComma (push1 x.cells Null)
  cs@(_::_) <- getList x.cells | [] => pure Ini
  ln        <- read1 x.line
  push1 x.lines (L ln cs)
  pure Ini
```

With the above, we can already define the default lexer. This comes with two additional state transitions that can be given inline: After encountering a comma, we push an empty cell onto the stack and switch to the `Com` state. Likewise, after encountering a double quote, we *open* a new thing that needs to be closed (`copen`) and switch to the `Str` state:

```idris
csvDflt : (afterComma : Bool) -> DFA q CSz CSTCK
csvDflt afterComma =
  dfa
    [ cexpr ',' $ push1 (cells %search) Null >> pure Com
    , conv linebreak (\_ => onNL afterComma)
    , cexpr "true" $ onCell (CBool True)
    , cexpr "false" $ onCell (CBool False)
    , conv (opt '-' >> decimal) (onCell . CInt . integer)
    , read (plus $ dot && not ',' && not '"') (onCell . CStr)
    , copen '"' (pure Str)
    ]
```

The string lexer pushes all encountered text regions and escape sequences onto the corresponding stack until a closing double quote is encountered, which causes the accumulated string to be pushed onto the stack of cells (`onCell`):

```idris
csvStr : DFA q CSz CSTCK
csvStr =
  dfa
    [ cexpr #""""# $ pushStr Str "\""
    , cexpr #""n"# $ pushStr Str "\n"
    , cexpr #""r"# $ pushStr Str "\r"
    , cexpr #""t"# $ pushStr Str "\t"
    , cclose '"' $ getStr >>= onCell . CStr
    , read (plus $ dot && not '"') $ pushStr Str
    , conv linebreak (const $ pure NL)
    ]
```

Please note the special handling of a line break inside a quoted string: We immediately switch to the `NL` state, which, as we will see in a moment, comes without any valid state transformations and thus will automatically lead to an error being raised.
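Before we assemble everything, here is a summary of the complete state machine (the two transitions for the `Val` state are the inline rules we are about to add below; this is an illustration only, not library code):

```idris
-- Ini: cell --> Val    ',' (Null) --> Com    '"' --> Str    linebreak --> Ini
-- Val: ','  --> Com    linebreak --> Ini
-- Com: cell --> Val    ',' (Null) --> Com    '"' --> Str    linebreak (Null) --> Ini
-- Str: text/escape --> Str    '"' --> Val    linebreak --> NL
-- NL : no transitions; any further input results in an error
--
-- "(Null)" marks transitions that also push a `Null` cell onto the stack.
```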
Finally, we gather the state transformations for every parser state in an array of transformations. Since the `NL` state is always invalid, we don't have to provide an automaton of state transformations for it: it will automatically be given the empty automaton that always fails.

```idris
csvSteps : Lex1 q CSz CSTCK
csvSteps =
  lex1
    [ E Ini (csvDflt False)
    , E Val $ dfa [cexpr ',' (pure Com), conv linebreak (const $ onNL False)]
    , E Str csvStr
    , E Com (csvDflt True)
    ]
```

### Finalizing the Parser

The only thing missing is error handling and the final assembly of the parser. States `Str`, `Val`, and `NL` throw custom errors:

```idris
csvErr : Arr32 CSz (CSTCK q -> F1 q (BoundedErr Void))
csvErr =
  arr32 CSz (unexpected [])
    [ E Str $ unclosedIfEOI "\"" []
    , E Val $ unexpected [","]
    , E NL  $ unclosed "\""
    ]

csvEOI : CST -> CSTCK q -> F1 q (Either (BoundedErr Void) Table)
csvEOI st x =
  case st == Str || st == NL of
    True  => arrFail CSTCK csvErr st x
    False => T1.do
      _   <- onNL (st == Com)
      tbl <- getList x.lines
      pure (Right tbl)
```

And here's the final CSV parser:

```idris
csv : P1 q (BoundedErr Void) CSz CSTCK Table
csv = P Ini cinit csvSteps snocChunk csvErr csvEOI
```

A quick note about the `snocChunk` part: Since the parsers and lexers in this library maintain their own internal stack, they can be used directly to stream large amounts of data via the *ilex-streams* library (see below). Every time a chunk of bytes has been loaded into memory and consumed by the streaming run loop, the values accumulated so far should be emitted and removed from the parser stack to release the memory they consumed. This is done via field `P1.chunk`. In our case, we use utility `snocChunk` to simply extract and emit the contents of a mutable snoc list.

With all this said and done, here's a small CSV string that can be used to test our parser:

```idris
csvStr3 : String
csvStr3 =
  #"""
  entry 1,"this is a ""test""",true,12
  entry 2,"below is an empty line",false,0

  entry 3,"this one has an empty cell",,-20
  entry 4,"finally, a linebreak: "r"n",true,4
  entry 5,"next one is almost empty",true,18
  ,,,
  """#
```

Below are a couple of erroneous CSV strings. Feel free to run our parser against those and check out the error messages we get:

```idris
twoValues : String
twoValues = "this is wrong,\"foo\"false"

invalidTab : String
invalidTab = "this is wrong,foo,false,\"Ups: \t\",bar"

unclosedQuote : String
unclosedQuote = "this is wrong,foo,false,\"Ups , bar"
```

The error message from the last example is important: Instead of complaining about having reached the end of input (which, in my opinion, would be rather unspecific), we mark the opening quote that has not been properly closed. We'd like to do the same in case we arrive at a line break in the middle of a quoted string. Again, the resulting error message will be more specific. This is only possible by including the line break state `NL` as one of the states reachable from the quoted string lexer. Here's an example where this is relevant. Feel free to remove the `linebreak` line from the definition of `csvStr` and re-check the parser's behavior at the REPL.

```idris
unclosedMultiline : String
unclosedMultiline =
  """
  foo,"this is a,12
  bar,test,13
  """
```
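Since `Table` is a list of `Line`s and `Line` comes with an `Interpolation` implementation, our `pretty` helper from the beginning of this file works for the full parser as well. The following commands should reproduce the error messages discussed above (output omitted here):

```repl
README> :exec pretty csv twoValues
README> :exec pretty csv invalidTab
README> :exec pretty csv unclosedQuote
README> :exec pretty csv unclosedMultiline
```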
## Streaming Files

Functions such as `parse` and `parseString` are supposed to process a (byte)string as a whole, so all the data needs to fit into memory. Sometimes, however, we would like to process huge amounts of data. Package *ilex-streams*, which is also part of this project, provides utilities for this occasion. Fortunately, data type `P1` already holds all the ingredients needed to process large data sets in chunks and emit the values accumulated so far after each chunk of data. We are going to demonstrate how this is done when reading a large file containing CSV-encoded data.

First, we define a type alias and runner for the data stream (see the [idris2-streams library](https://github.com/stefan-hoeck/idris2-streams) for proper documentation about `Pull` and `Stream`). Since we want to read stuff from file descriptors, we use the `Async` monad (from [idris2-async](https://github.com/stefan-hoeck/idris2-async)) with polling capabilities. We also want to deal with possible errors: parsing errors as well as errors from the operating system. This leads to the following type alias for our data stream plus a runner that will pretty-print all errors to `stderr`:

```idris
0 Prog : Type -> Type -> Type
Prog o r = AsyncPull Poll o [ParseError Void, Errno] r

covering
runProg : Prog Void () -> IO ()
runProg prog =
  let handled := handle [stderrLn . interpolate, stderrLn . interpolate] prog
   in epollApp $ mpull handled
```

If the above is foreign to you, please work through the tutorials provided with the *async* and *streams* libraries.

And here's an example of how to stream a single, possibly huge, CSV file (this will print the total number of non-empty CSV lines encountered):

```idris
streamCSV : String -> Prog Void ()
streamCSV pth =
     readBytes pth
  |> streamParse csv (FileSrc pth)
  |> C.count
  |> printLnTo Stdout
```

Functions `readBytes`, `C.count`, and `printLnTo` are just standard streaming utilities, while function `streamParse` is provided by the *ilex-streams* library. With the above, you can stream arbitrarily large CSV files in constant memory. However, there are some minor differences compared to tokenizing whole strings of data: For obvious reasons, we don't store the whole byte stream in memory, so error messages will only contain the region (start and end line and column) where the error occurred but not a highlighted excerpt of the string with the error.

Just as easily, we can stream a whole sequence of CSV files:

```idris
streamCSVFiles : Prog String () -> Prog Void ()
streamCSVFiles pths =
     flatMap pths (\p => readBytes p |> streamParse csv (FileSrc p))
  |> C.count
  |> printLnTo Stdout
```

Below is a `main` function so you can test our CSV parser against your own CSV files (or some files downloaded from the web). I strongly suggest you experiment some more with this. For instance, you might add support for single-line comments: Everything after a specific character (for instance `'#'`) until the end of the line is to be considered a comment and should be dropped (a possible starting point is sketched at the end of this README). You could also add support for floating-point literals, single characters (in single quotes like in Idris), date and time values, and so on.

```idris
covering
main : IO ()
main = runProg $ streamCSVFiles (P.tail args)
```

In order to compile this and check it against your own CSV files:

```sh
pack -o csv exec examples/src/README.md
examples/build/exec/csv ~/downloads/*.csv
```
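As a possible starting point for the comment extension suggested above, a rule along the following lines could be added to the token maps. This is an untested sketch: it assumes that `dot` matches everything except line breaks and that adding `cexpr comment (pure Ini)` to the rules in `csvDflt` is an acceptable way to silently drop a match.

```idris
-- Hypothetical comment expression: a '#' followed by the rest of the
-- line. A rule such as `cexpr comment (pure Ini)` would then drop the
-- matched text without pushing anything onto the parser stack.
comment : RExp True
comment = '#' >> star dot
```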