
Creating Custom Pandoc Readers in Lua

Introduction

If you need to parse a format not already handled by pandoc, you can create a custom reader using the Lua language. Pandoc has a built-in Lua interpreter, so you needn’t install any additional software to do this.

A custom reader is a Lua file that defines a function called Reader, which takes two arguments:

  • the raw input to be parsed, as a list of sources
  • optionally, a table of reader options, e.g. { columns = 62, standalone = true }.

The Reader function should return a Pandoc AST. This can be created using functions in the pandoc module, which is automatically in scope. (Indeed, all of the utility functions that are available for Lua filters are available in custom readers, too.)

Each source item corresponds to a file or stream passed to pandoc and contains its text and name. E.g., if a single file input.txt is passed to pandoc, then the list of sources will contain just a single element s, where s.name == 'input.txt' and s.text contains the file contents as a string.

The sources list, as well as each of its elements, can be converted to a string via the Lua standard library function tostring.

A minimal example would be

function Reader(input)
  return pandoc.Pandoc({ pandoc.CodeBlock(tostring(input)) })
end

This just returns a document containing a big code block with all of the input. Or, to create a separate code block for each input file, one might write

function Reader(input)
  return pandoc.Pandoc(input:map(
    function (s) return pandoc.CodeBlock(s.text) end))
end

In a nontrivial reader, you’ll want to parse the input. You can do this using standard Lua library functions (for example, the patterns library), or with the powerful and fast lpeg parsing library, which is automatically in scope. You can also use external Lua libraries (for example, an XML parser).
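For simple formats, the standard string library can be enough. The following is a minimal sketch, not a definitive implementation: it splits the input into paragraphs at blank lines using only Lua patterns. The helper name split_paragraphs is made up for this example, and the sketch assumes that pandoc accepts a plain string where a list of inlines is expected.

```lua
-- Hypothetical helper: split text into paragraphs at blank lines,
-- using only the Lua standard library.
local function split_paragraphs(text)
  local paragraphs, current = {}, {}

  -- Join the lines collected so far into one paragraph.
  local function flush()
    if #current > 0 then
      paragraphs[#paragraphs + 1] = table.concat(current, " ")
      current = {}
    end
  end

  -- Append a newline so that the last line is matched even when the
  -- input does not end with one; "\r?" tolerates Windows line endings.
  for line in (text .. "\n"):gmatch("(.-)\r?\n") do
    if line:match("^%s*$") then
      flush()            -- a blank line ends the current paragraph
    else
      current[#current + 1] = line
    end
  end
  flush()
  return paragraphs
end

function Reader(input)
  local blocks = {}
  for _, para in ipairs(split_paragraphs(tostring(input))) do
    -- Assumption: pandoc converts the plain string to inlines here.
    blocks[#blocks + 1] = pandoc.Para(para)
  end
  return pandoc.Pandoc(blocks)
end
```

If the script were saved under a name such as paragraphs.lua (a placeholder), it could be tried with pandoc -f paragraphs.lua -t native.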

Earlier pandoc versions passed a raw string instead of a list of sources to the Reader function. Reader functions that rely on this are obsolete, but still supported: pandoc analyzes any script error to detect code that assumed the old behavior. In that case the code is rerun with raw string input, thereby ensuring backwards compatibility.

Bytestring readers

In order to read binary formats, including docx, odt, and epub, pandoc supports the ByteStringReader function. A ByteStringReader function is similar to the Reader function that processes text input. Instead of a list of sources, the ByteStringReader function is passed a bytestring, i.e., a string that contains the binary input.

-- read input as epub
function ByteStringReader (input)
  return pandoc.read(input, 'epub')
end
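Because the input arrives as an ordinary Lua string, it can be inspected with the string library before it is handed on. The following sketch checks for the ZIP signature, since EPUB files are ZIP containers, and fails early otherwise. The helper name looks_like_zip and the error message are assumptions made for this example.

```lua
-- Hypothetical helper: true if the bytestring starts with the local
-- file header signature of a ZIP archive (bytes 0x50 0x4B 0x03 0x04).
local function looks_like_zip(bytes)
  return bytes:sub(1, 4) == "PK\3\4"
end

function ByteStringReader(input)
  if not looks_like_zip(input) then
    error("input does not look like a ZIP container; expected an EPUB file")
  end
  return pandoc.read(input, 'epub')
end
```

A check like this is optional; it merely replaces a lower-level parse failure with a message that names the expected format.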

Format extensions

Custom readers can be built such that their behavior is controllable through format extensions, such as smart, citations, or hard_line_breaks. Supported extensions are those that are present as a key in the global Extensions table. Fields of extensions that are enabled by default have the value true or 'enable', while those that are supported but disabled have the value false or 'disable'.

Example: A reader with the following global table supports the extensions smart, citations, and foobar, with smart enabled and the other two disabled by default:

Extensions = {
  smart = 'enable',
  citations = 'disable',
  foobar = false
}

Users control extensions as usual, e.g., pandoc -f my-reader.lua+citations. The extensions are accessible through the reader options’ extensions field, e.g.:

function Reader (input, opts)
  print(
    'The citations extension is',
    opts.extensions:includes 'citations' and 'enabled' or 'disabled'
  )
  -- ...
end

Extensions that are neither enabled nor disabled in the Extensions field are treated as unsupported by the reader. Trying to modify such an extension via the command line will lead to an error.
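Putting the pieces together, a complete reader that changes its output depending on an extension might look like the sketch below. The extension name shout and the helper render_text are invented for illustration; only the Extensions table and the opts.extensions:includes call are taken from the description above.

```lua
-- 'shout' is a made-up extension, supported but disabled by default.
Extensions = {
  shout = false,
}

-- Hypothetical helper: uppercase the text when the flag is set.
local function render_text(text, shout)
  if shout then
    return text:upper()
  end
  return text
end

function Reader(input, opts)
  local shout = opts.extensions:includes 'shout'
  return pandoc.Pandoc{
    pandoc.CodeBlock(render_text(tostring(input), shout))
  }
end
```

With the script saved as, say, shout.lua, the extension would be switched on with pandoc -f shout.lua+shout.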

Example: plain text reader

This is a simple example using lpeg to parse the input into space-separated strings and blankline-separated paragraphs.

-- A sample custom reader that just parses text into blankline-separated
-- paragraphs with space-separated words.

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local wordchar = (1 - whitespacechar)
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blanklines = newline * (spacechar^0 * newline)^1
local endline = newline - blanklines

-- Grammar
G = P{ "Pandoc",
  Pandoc = Ct(V"Block"^0) / pandoc.Pandoc;
  Block = blanklines^0 * V"Para" ;
  Para = Ct(V"Inline"^1) / pandoc.Para;
  Inline = V"Str" + V"Space" + V"SoftBreak" ;
  Str = wordchar^1 / pandoc.Str;
  Space = spacechar^1 / pandoc.Space;
  SoftBreak = endline / pandoc.SoftBreak;
}

function Reader(input)
  return lpeg.match(G, tostring(input))
end

Example of use:

% pandoc -f plain.lua -t native
*Hello there*, this is plain text with no formatting
except paragraph breaks.

- Like this one.
^D
[ Para
    [ Str "*Hello"
    , Space
    , Str "there*,"
    , Space
    , Str "this"
    , Space
    , Str "is"
    , Space
    , Str "plain"
    , Space
    , Str "text"
    , Space
    , Str "with"
    , Space
    , Str "no"
    , Space
    , Str "formatting"
    , SoftBreak
    , Str "except"
    , Space
    , Str "paragraph"
    , Space
    , Str "breaks."
    ]
, Para
    [ Str "-"
    , Space
    , Str "Like"
    , Space
    , Str "this"
    , Space
    , Str "one."
    ]
]

Example: a wiki Creole reader

This is a parser for Creole common wiki markup. It uses an lpeg grammar. Fun fact: this custom reader is faster than pandoc’s built-in creole reader! This shows that high-performance readers can be designed in this way.

-- A sample custom reader for Creole 1.0 (common wiki markup)
-- http://www.wikicreole.org/wiki/CheatSheet

-- For better performance we put these functions in local variables:
local P, S, R, Cf, Cc, Ct, V, Cs, Cg, Cb, B, C, Cmt =
  lpeg.P, lpeg.S, lpeg.R, lpeg.Cf, lpeg.Cc, lpeg.Ct, lpeg.V,
  lpeg.Cs, lpeg.Cg, lpeg.Cb, lpeg.B, lpeg.C, lpeg.Cmt

local whitespacechar = S(" \t\r\n")
local specialchar = S("/*~[]\\{}|")
local wordchar = (1 - (whitespacechar + specialchar))
local spacechar = S(" \t")
local newline = P"\r"^-1 * P"\n"
local blankline = spacechar^0 * newline
local endline = newline * #-blankline
local endequals = spacechar^0 * P"="^0 * spacechar^0 * newline
local cellsep = spacechar^0 * P"|"

local function trim(s)
   return (s:gsub("^%s*(.-)%s*$", "%1"))
end

local function ListItem(lev, ch)
  local start
  if ch == nil then
    start = S"*#"
  else
    start = P(ch)
  end
  local subitem = function(c)
    if lev < 6 then
      return ListItem(lev + 1, c)
    else
      return (1 - 1) -- fails
    end
  end
  local parser = spacechar^0
               * start^lev
               * #(- start)
               * spacechar^0
               * Ct((V"Inline" - (newline * spacechar^0 * S"*#"))^0)
               * newline
               * (Ct(subitem("*")^1) / pandoc.BulletList
                  +
                  Ct(subitem("#")^1) / pandoc.OrderedList
                  +
                  Cc(nil))
               / function (ils, sublist)
                   return { pandoc.Plain(ils), sublist }
                 end
  return parser
end

-- Grammar
G = P{ "Doc",
  Doc = Ct(V"Block"^0)
      / pandoc.Pandoc ;
  Block = blankline^0
        * ( V"Header"
          + V"HorizontalRule"
          + V"CodeBlock"
          + V"List"
          + V"Table"
          + V"Para") ;
  Para = Ct(V"Inline"^1)
       * newline
       / pandoc.Para ;
  HorizontalRule = spacechar^0
                 * P"----"
                 * spacechar^0
                 * newline
                 / pandoc.HorizontalRule;
  Header = (P("=")^1 / string.len)
         * spacechar^1
         * Ct((V"Inline" - endequals)^1)
         * endequals
         / pandoc.Header;
  CodeBlock = P"{{{"
            * blankline
            * C((1 - (newline * P"}}}"))^0)
            * newline
            * P"}}}"
            / pandoc.CodeBlock;
  Placeholder = P"<<<"
              * C(P(1) - P">>>")^0
              * P">>>"
              / function() return pandoc.Div({}) end;
  List = V"BulletList"
       + V"OrderedList" ;
  BulletList = Ct(ListItem(1,'*')^1)
             / pandoc.BulletList ;
  OrderedList = Ct(ListItem(1,'#')^1)
             / pandoc.OrderedList ;
  Table = (V"TableHeader" + Cc{})
        * Ct(V"TableRow"^1)
        / function(headrow, bodyrows)
            local numcolumns = #(bodyrows[1])
            local aligns = {}
            local widths = {}
            for i = 1,numcolumns do
              aligns[i] = pandoc.AlignDefault
              widths[i] = 0
            end
            return pandoc.utils.from_simple_table(
              pandoc.SimpleTable({}, aligns, widths, headrow, bodyrows))
          end ;
  TableHeader = Ct(V"HeaderCell"^1)
              * cellsep^-1
              * spacechar^0
              * newline ;
  TableRow   = Ct(V"BodyCell"^1)
             * cellsep^-1
             * spacechar^0
             * newline ;
  HeaderCell = cellsep
             * P"="
             * spacechar^0
             * Ct((V"Inline" - (newline + cellsep))^0)
             / function(ils) return { pandoc.Plain(ils) } end ;
  BodyCell   = cellsep
             * spacechar^0
             * Ct((V"Inline" - (newline + cellsep))^0)
             / function(ils) return { pandoc.Plain(ils) } end ;
  Inline = V"Emph"
         + V"Strong"
         + V"LineBreak"
         + V"Link"
         + V"URL"
         + V"Image"
         + V"Str"
         + V"Space"
         + V"SoftBreak"
         + V"Escaped"
         + V"Placeholder"
         + V"Code"
         + V"Special" ;
  Str = wordchar^1
      / pandoc.Str;
  Escaped = P"~"
          * C(P(1))
          / pandoc.Str ;
  Special = specialchar
          / pandoc.Str;
  Space = spacechar^1
        / pandoc.Space ;
  SoftBreak = endline
            * # -(V"HorizontalRule" + V"CodeBlock")
            / pandoc.SoftBreak ;
  LineBreak = P"\\\\"
            / pandoc.LineBreak ;
  Code = P"{{{"
       * C((1 - P"}}}")^0)
       * P"}}}"
       / trim / pandoc.Code ;
  Link = P"[["
       * C((1 - (P"]]" + P"|"))^0)
       * (P"|" * Ct((V"Inline" - P"]]")^1))^-1 * P"]]"
       / function(url, desc)
           local txt = desc or {pandoc.Str(url)}
           return pandoc.Link(txt, url)
         end ;
  Image = P"{{"
        * #-P"{"
        * C((1 - (S"}"))^0)
        * (P"|" * Ct((V"Inline" - P"}}")^1))^-1
        * P"}}"
        / function(url, desc)
            local txt = desc or ""
            return pandoc.Image(txt, url)
          end ;
  URL = P"http"
      * P"s"^-1
      * P":"
      * (1 - (whitespacechar + (S",.?!:;\"'" * #whitespacechar)))^1
      / function(url)
          return pandoc.Link(pandoc.Str(url), url)
        end ;
  Emph = P"//"
       * Ct((V"Inline" - P"//")^1)
       * P"//"
       / pandoc.Emph ;
  Strong = P"**"
         * Ct((V"Inline" -P"**")^1)
         * P"**"
         / pandoc.Strong ;
}

function Reader(input, reader_options)
  return lpeg.match(G, tostring(input))
end

Example of use:

% pandoc -f creole.lua -t markdown
== Wiki Creole

You can make things **bold** or //italic// or **//both//** or //**both**//.

Character formatting extends across line breaks: **bold,
this is still bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

You can use [[internal links]] or [[http://www.wikicreole.org|external links]],
give the link a [[internal links|different]] name.
^D
## Wiki Creole

You can make things **bold** or *italic* or ***both*** or ***both***.

Character formatting extends across line breaks: \*\*bold, this is still
bold. This line deliberately does not end in star-star.

Not bold. Character formatting does not cross paragraph boundaries.

You can use [internal links](internal links) or [external
links](http://www.wikicreole.org), give the link a
[different](internal links) name.

Example: parsing JSON from an API

This custom reader consumes the JSON output of https://www.reddit.com/r/haskell.json and produces a document containing the current top articles on the Haskell subreddit.

It assumes that the pandoc.json library is available; this library ships with pandoc versions later than 3.1. It’s still possible to use this reader with older pandoc versions by using a different JSON library. E.g., luajson can be installed using luarocks install luajson, but be sure you are installing it for Lua 5.4, which is the version packaged with pandoc.

-- consumes the output of https://www.reddit.com/r/haskell.json

local json = require 'pandoc.json'

local function read_inlines(raw)
  local doc = pandoc.read(raw, "commonmark")
  return pandoc.utils.blocks_to_inlines(doc.blocks)
end

local function read_blocks(raw)
  local doc = pandoc.read(raw, "commonmark")
  return doc.blocks
end

function Reader(input)

  local parsed = json.decode(tostring(input))
  local blocks = {}

  for _,entry in ipairs(parsed.data.children) do
    local d = entry.data
    table.insert(blocks, pandoc.Header(2,
                  pandoc.Link(read_inlines(d.title), d.url)))
    for _,block in ipairs(read_blocks(d.selftext)) do
      table.insert(blocks, block)
    end
  end

  return pandoc.Pandoc(blocks)

end

Similar code can be used to consume JSON output from other APIs.

Note that the content of the text fields is markdown, so we convert it using pandoc.read().
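When adapting this approach to another API, it can help to validate the decoded data before building the document, since fields may be missing or null. The sketch below assumes a hypothetical response shape with an items array whose entries have title, link, and body fields; none of these names come from a real API. The helper collect_entries is plain Lua and skips malformed entries.

```lua
-- Hypothetical response shape, for illustration only:
--   { "items": [ { "title": "...", "link": "...", "body": "..." } ] }

-- Normalize decoded entries; entries without a string title are skipped,
-- and missing or non-string link/body fields become empty strings.
local function collect_entries(parsed)
  local entries = {}
  local items = type(parsed) == "table" and parsed.items
  if type(items) ~= "table" then
    return entries
  end
  for _, item in ipairs(items) do
    if type(item) == "table" and type(item.title) == "string" then
      entries[#entries + 1] = {
        title = item.title,
        url = type(item.link) == "string" and item.link or "",
        body = type(item.body) == "string" and item.body or "",
      }
    end
  end
  return entries
end

function Reader(input)
  local json = require 'pandoc.json'
  local parsed = json.decode(tostring(input))
  local blocks = {}
  for _, e in ipairs(collect_entries(parsed)) do
    table.insert(blocks, pandoc.Header(2, pandoc.Link(e.title, e.url)))
    -- The body is assumed to be markdown, as in the example above.
    for _, block in ipairs(pandoc.read(e.body, "commonmark").blocks) do
      table.insert(blocks, block)
    end
  end
  return pandoc.Pandoc(blocks)
end
```

Checking types with type(...) == "string" avoids depending on how the JSON library represents null values.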

Example: syntax-highlighted code files

This is a reader that puts the content of each input file into a code block, sets the file’s extension as the block’s class to enable code highlighting, and places the filename as a header above each code block.

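function to_code_block (source)
  -- split_extension returns the extension with its leading dot
  -- (e.g. ".lua"), so strip the dot before using it as a class.
  local _, ext = pandoc.path.split_extension(source.name)
  local lang = ext:gsub('^%.', '')
  return pandoc.Div{
    pandoc.Header(1, source.name == '' and '<stdin>' or source.name),
    pandoc.CodeBlock(source.text, {class=lang}),
  }
end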

function Reader (input, opts)
  return pandoc.Pandoc(input:map(to_code_block))
end
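File extensions do not always coincide with the language names that the highlighter expects (for example, py versus python). One possible refinement, sketched here under that assumption, is a small lookup table. The table entries and the helper name language_of are illustrative only and far from complete.

```lua
-- Illustrative mapping from file extensions to language names;
-- extend it as needed.
local language_for_extension = {
  py = "python",
  hs = "haskell",
  rb = "ruby",
  js = "javascript",
}

-- Hypothetical helper: derive a class name from a file name.
-- Returns nil when the name has no extension.
local function language_of(filename)
  local ext = filename:match("%.([^./\\]+)$")
  if not ext then
    return nil
  end
  ext = ext:lower()
  -- Fall back to the bare extension when no mapping is known.
  return language_for_extension[ext] or ext
end

function Reader(input)
  return pandoc.Pandoc(input:map(function (source)
    local lang = language_of(source.name)
    local attr = pandoc.Attr('', lang and { lang } or {})
    return pandoc.CodeBlock(source.text, attr)
  end))
end
```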

Example: extracting the content from web pages

This reader uses the command-line program readable (install via npm install -g readability-cli) to clean out parts of HTML input that have to do with navigation, leaving only the content.

-- Custom reader that extracts the content from HTML documents,
-- ignoring navigation and layout elements. This preprocesses input
-- through the 'readable' program (which can be installed using
-- 'npm install -g readability-cli') and then calls the HTML reader.
-- In addition, Divs that seem to have only a layout function are removed
-- to avoid clutter.

function make_readable(source)
  local result
  if not pcall(function ()
      local name = source.name
      if not name:match("http") then
        name = "file:///" .. name
      end
      result = pandoc.pipe("readable",
                 {"--keep-classes","--base",name},
                 source.text)
    end) then
      io.stderr:write("Error running 'readable': do you have it installed?\n")
      io.stderr:write("npm install -g readability-cli\n")
      os.exit(1)
  end
  return result
end

local boring_classes =
        { row = true,
          page = true,
          container = true
        }

local boring_attributes = { "role" }

local function is_boring_class(cl)
  return boring_classes[cl] or cl:match("col%-") or cl:match("pull%-")
end

local function handle_div(el)
  for i,class in ipairs(el.classes) do
    if is_boring_class(class) then
      el.classes[i] = nil
    end
  end
  for i,k in ipairs(boring_attributes) do
    el.attributes[k] = nil
  end
  if el.identifier:match("readability%-") then
    el.identifier = ""
  end
  if #el.classes == 0 and #el.attributes == 0 and #el.identifier == 0 then
    return el.content
  else
    return el
  end
end

function Reader(sources)
  local readable = ''
  for _,source in ipairs(sources) do
    readable = readable .. make_readable(source)
  end
  local doc = pandoc.read(readable, "html", PANDOC_READER_OPTIONS)
  -- Now remove Divs used only for layout
  return doc:walk{ Div = handle_div }
end

Example of use:

pandoc -f readable.lua -t markdown https://pandoc.org

and compare the output to

pandoc -f html -t markdown https://pandoc.org
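A note on handle_div in the reader above: it removes classes by assigning nil inside the loop, which can leave gaps in the list and make the length operator unreliable. A possible alternative, sketched here rather than taken from the reader itself, is to build a fresh list of the classes to keep and assign it back. The helper names interesting_classes and strip_layout_classes are hypothetical, and the sketch assumes that a new list may be assigned to el.classes.

```lua
local boring_classes = { row = true, page = true, container = true }

local function is_boring_class(cl)
  return boring_classes[cl] or cl:match("col%-") or cl:match("pull%-")
end

-- Build a fresh, gap-free list of the classes worth keeping.
local function interesting_classes(classes)
  local kept = {}
  for _, cl in ipairs(classes) do
    if not is_boring_class(cl) then
      kept[#kept + 1] = cl
    end
  end
  return kept
end

-- Possible use inside a Div handler.
local function strip_layout_classes(el)
  el.classes = interesting_classes(el.classes)
  return el
end
```

Because the new list has no gaps, a later test such as #el.classes == 0 behaves as expected.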