No tengo la menor idea

2009-10-26

Blogging Literately in Haskell

Filed under: haskell — robgreayer @ 22:05

I’ve made one blog post about Haskell, and found the process too tedious to tolerate. The WYSIWYG TinyMCE editor that is available for WordPress is useless for formatting code segments, and editing directly in the raw HTML editor WordPress provides is agonizing. Figuring there must be a better way, I looked around for an offline tool that would just work for me. I turned up a few editors, free and non-free, some of which looked just horrid, a few almost reasonable, but none that would make the mix of Haskell and narrative as easy as it seems it ought to be. What I really want is an easy way to take a Literate Haskell (you know, those seemingly innocuous text messages that are suddenly interrupted with ominous passages like:

> {-# LANGUAGE DeriveDataTypeable #-}
> module BlogLiterately where

for no immediately apparent reason) file, perhaps with simple markdown encoded text commentary, and just magically upload it to my blog, with good HTML formatting of the markdown and Haskell sections.

Haskell, theoretically, has all sorts of libraries for doing all sorts of wonderful things. Perhaps it wouldn’t be too hard to mash something together from Haskell libraries that did precisely what I wanted.

Somehow I’d osmotically absorbed from the ether the knowledge that there was a Haskell-based tool around called Pandoc, by John MacFarlane, which deals with markdown, HTML and other formats. It is, like all good Haskelly things, available on Hackage, both as a program and as a library. So it seemed like a good place to start.

> import Text.Pandoc

Looking over the documentation for Pandoc, I observed that although it was able to do wonderful things with loads of formats, it didn’t automatically do what I wanted with the actual Haskell source code. What I want is to make the source look the way Haskell code looks in Haddock documentation, especially the Haddock documentation available on Hackage. Haddock uses hscolour, a tool by Malcolm Wallace, to format source listings.

> import Language.Haskell.HsColour(hscolour,Output(..))
> import Language.Haskell.HsColour.Colourise(defaultColourPrefs)

Assuming Pandoc and hscolour do what I want, I’d still need some way of actually publishing the blog post. With a bit of googliness I learned that my blog software supports something called the MetaWeblog API, which is an XML-RPC-based protocol for interacting with blogs.

There’s a Haskell XML-RPC library, HaXR, by Bjorn Bringert, (on hackage) which seems like it should be appropriate.

> import Network.XmlRpc.Client
> import Network.XmlRpc.Internals

And it works out that I’ll need some miscellaneous other stuff. Since I’m writing a command line tool, I’ll need to process the command line arguments, and Neil Mitchell’s CmdArgs library ought to work for that:

> import System.Console.CmdArgs

I’m going to end up needing to parse and manipulate XHTML, so I’ll use Malcolm Wallace’s HaXml XML combinators:

> import Text.XML.HaXml
> import Text.XML.HaXml.Verbatim

I’ll need to do some text IO, which I’ll use the UTF8 encoding for, leading to Eric Mertens’ ubiquitous (until GHC 6.12!) utf8-string library:

> import qualified System.IO.UTF8 as U

And I’ll need a couple other bits and pieces:

> import Control.Monad(liftM,unless)
> import Text.XHtml.Transitional(showHtml)
> import Text.ParserCombinators.Parsec

The program I envision will read in a literate Haskell file, use Pandoc to parse it as markdown, then somehow find the code blocks in the (parsed) input, and use hscolour to transform those. Pandoc turns its input into a structure of type:

data Pandoc = Pandoc Meta [Block]

where a Block (the interesting bit, for my purposes) looks like:

-- | Block element.
data Block  
    = Plain [Inline]        -- ^ Plain text, not a paragraph
    | Para [Inline]         -- ^ Paragraph
    | CodeBlock Attr String -- ^ Code block (literal) with attributes 
    | RawHtml String        -- ^ Raw HTML block (literal)
    | BlockQuote [Block]    -- ^ Block quote (list of blocks)
    | OrderedList ListAttributes [[Block]] -- ^ Ordered list (attributes
                            -- and a list of items, each a list of blocks)
    | BulletList [[Block]]  -- ^ Bullet list (list of items, each
                            -- a list of blocks)
    | DefinitionList [([Inline],[Block])]  -- ^ Definition list 
                            -- (list of items, each a pair of an inline list,
                            -- the term, and a block list)
    | Header Int [Inline]   -- ^ Header - level (integer) and text (inlines) 
    | HorizontalRule        -- ^ Horizontal rule
    | Table [Inline] [Alignment] [Double] [[Block]] [[[Block]]]  -- ^ Table,
                            -- with caption, column alignments,
                            -- relative column widths, column headers
                            -- (each a list of blocks), and rows
                            -- (each a list of lists of blocks)
    | Null                  -- ^ Nothing
    deriving (Eq, Read, Show, Typeable, Data)

The literate Haskell that Pandoc finds in a file ends up in various CodeBlock elements of the Pandoc document. Other code (normal markdown-formatted code) can also wind up in CodeBlock elements. The literate Haskell code seems to be differentiated from other code by the Attr component, which has the form:

type Attr = (String, [String], [(String, String)])

Experimentation reveals that CodeBlock elements that have an Attr of the form (_,["sourceCode","haskell"],_) are literate Haskell code blocks, and other CodeBlock elements are the markdown code blocks. I want to syntax-highlight both kinds of code blocks, but when both get rendered to the output HTML, I want to preserve the literate Haskell as literate (it needs the prepended > character). Actually, I want to do just a little bit more…
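
To make the observation concrete, here is a minimal sketch of that test. It uses only a local copy of the Attr synonym shown above rather than importing Pandoc, and the name isLiterateHaskell is a hypothetical helper, not something Pandoc exports. It checks for membership of both classes instead of matching the exact two-element list, on the assumption that extra classes should not matter.

```haskell
-- Local stand-in for Pandoc's attribute triple, copied from the type above.
type Attr = (String, [String], [(String, String)])

-- Hypothetical helper: True when the class list marks a literate Haskell
-- block, i.e. it carries both the "sourceCode" and "haskell" classes.
isLiterateHaskell :: Attr -> Bool
isLiterateHaskell (_, classes, _) =
  "sourceCode" `elem` classes && "haskell" `elem` classes
```

With this definition, `isLiterateHaskell ("", ["sourceCode","haskell"], [])` is True, while a plain markdown code block with an empty class list gives False.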

When writing a Literate Haskell File, I might want to include non-Haskell but nevertheless ‘code’ examples in the text. Samples of how one might express the same thing in ML, or examples of how to run a program, etc. So not all code blocks should be colourised by hscolour, just specific Haskelly ones. Markdown doesn’t provide a way to express what kind of code a code block might be, but since I’m munging all the code blocks anyway, I can adopt a simple convention. If a code example looks like:


    [Haskell]
    foo :: String -> String

it is a Haskell block. If it looks like something else, e.g.


    [C++]
    cout << "Hello World!";

or


    [other]
    foo bar baz

etc., it is something else.

ed: It turns out that although Markdown doesn't provide a way to do this, Pandoc's extensions to Markdown do allow it, in a way fairly similar to what I've proposed. If I revise this further, I'll use the Pandoc convention. Thanks to John MacFarlane for pointing this out in the comments.

I can strip off the code type indicator from the beginning of each block as I examine them for code to colourise. I use Parsec to recognize the opening tag:

> unTag :: String -> (String, String)
> unTag s = either (const ("",s)) id $ parse tag "" s
>    where tag = do
>              tg <- between (char '[') (char ']') $ many $ noneOf "[]"
>              skipMany $ oneOf " \t"
>              (string "\r\n" <|> string "\n")
>              txt <- many $ anyToken
>              eof
>              return (tg,txt)
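
To show what the tag convention does to a block, here is a sketch of the same behaviour written with plain pattern matching instead of Parsec. The name stripTag is hypothetical and this is not the code the tool uses; it only mirrors the rules described above: a bracketed tag, optional spaces or tabs, a line ending, then the remaining text.

```haskell
-- Hypothetical base-only equivalent of the tag-stripping convention.
-- Returns (tag, body); when the input does not start with a well-formed
-- "[tag]" line, the tag is empty and the input is returned unchanged.
stripTag :: String -> (String, String)
stripTag s = case s of
  '[' : rest ->
    case break (`elem` "[]") rest of
      (tg, ']' : after) ->
        -- skip trailing spaces and tabs, then require a line ending
        case dropWhile (`elem` " \t") after of
          '\r' : '\n' : body -> (tg, body)
          '\n' : body        -> (tg, body)
          _                  -> ("", s)
      _ -> ("", s)
  _ -> ("", s)
```

For example, `stripTag "[C++]\ncout << \"Hello World!\";"` gives `("C++", "cout << \"Hello World!\";")`, and text without a leading tag comes back untouched with an empty tag.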

To highlight the syntax using hscolour (which produces HTML), I'm going to need to transform the String from a CodeBlock element to a String suitable for the RawHtml element (because the hscolour library transforms Haskell text to HTML). Pandoc strips off the prepended > characters from the literate Haskell, so I need to put them back, and also tell hscolour whether the source it is colouring is literate or not. The hscolour function looks like:

hscolour :: Output      -- ^ Output format.
         -> ColourPrefs -- ^ Colour preferences (for formats that support them).
         -> Bool        -- ^ Whether to include anchors.
         -> Bool        -- ^ Whether output document is partial or complete.
         -> String      -- ^ Title for output.
         -> Bool        -- ^ Whether input document is literate haskell or not
         -> String      -- ^ Haskell source code.
         -> String      -- ^ Coloured Haskell source code.

Hscolour supports a few different HTML-based Output formats: HTML, CSS, and ICSS. HTML is HTML with font tags, which are inherently evil, so I won't bother exploring this option. CSS is formatting using HTML class attributes, which gives you the flexibility of using a separate stylesheet to control how hscolour-annotated code is styled. ICSS is 'inline CSS', a.k.a. HTML 'style' attributes, which is really what I need for my WordPress blog, as I refuse to pay the $15 a year WordPress.com charges for the privilege of storing a stylesheet on their site. Unfortunately, hscolour's inline styling options are quite limited (very few colours, little control of fonts), but I've come up with a slightly more convoluted way of turning CSS-coloured source into inline-CSS HTML. So I will start off by using hscolour in CSS mode to transform my source:

> colourIt literate srcTxt = 
>     hscolour CSS defaultColourPrefs False True "" literate src'
>     where src' | literate = prepend srcTxt
>                | otherwise = srcTxt

Prepending the literate Haskell markers on the source is trivial:

> prepend s = unlines $ map p $ lines s where p s = '>':' ':s
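
As a quick illustration of what this produces, here is a standalone, point-free version that should be equivalent to the definition above (the explicit type signature is my addition). One detail worth noticing is that unlines always ends the result with a newline, even when the input did not have one.

```haskell
-- Equivalent to the post's prepend: put "> " in front of every line.
prepend :: String -> String
prepend = unlines . map ("> " ++) . lines
```

So `prepend "module Main where\nmain = return ()"` yields `"> module Main where\n> main = return ()\n"`.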

Hscolour uses HTML span elements and CSS classes like 'hs-keyword' or 'hs-keyglyph' to mark up Haskell code. What I want to do is take each marked span element and replace the class attribute with an inline style attribute that has the markup I want for that kind of source. I can capture the style possibilities with a type:

> data StylePrefs = StylePrefs {
>     keyword  :: String,
>     keyglyph :: String,
>     layout   :: String,
>     comment  :: String,
>     conid    :: String,
>     varid    :: String,
>     conop    :: String,
>     varop    :: String,
>     str      :: String,
>     chr      :: String,
>     number   :: String,
>     cpp      :: String,
>     selection :: String,
>     variantselection :: String,
>     definition :: String 
>     } deriving (Read,Show)

Each field in the above type will contain the desired style for the class hscolour assigns to a span of code. A default style that produces something like what the source listings on Hackage look like is:

> defaultStylePrefs = StylePrefs {
>     keyword  = "color: blue; font-weight: bold;"
>   , keyglyph = "color: red;"
>   , layout   = "color: red;"
>   , comment  = "color: green;"
>   , conid    = ""
>   , varid    = ""
>   , conop    = ""
>   , varop    = ""
>   , str      = "color: teal;"
>   , chr      = "color: teal;"
>   , number   = ""
>   , cpp      = ""
>   , selection = ""
>   , variantselection = ""
>   , definition = ""
>   }

I can read these preferences in from a file using the Read instance for StylePrefs. I could handle errors better, but this should work:

> getStylePrefs "" = return defaultStylePrefs
> getStylePrefs fname = liftM read (U.readFile fname)
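
Since the style file is parsed with the derived Read instance, its contents have to be a record expression in Haskell syntax. The sketch below shows this with a hypothetical two-field cut-down of StylePrefs (MiniPrefs is not part of the tool), and also one way the error handling could be tightened: falling back to defaults when the text does not parse. It assumes a version of base that provides Text.Read.readMaybe.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Hypothetical two-field cut-down of StylePrefs, to keep the example short.
data MiniPrefs = MiniPrefs
  { keyword :: String
  , comment :: String
  } deriving (Eq, Read, Show)

-- Fallback used when the file contents do not parse.
defaultMini :: MiniPrefs
defaultMini = MiniPrefs
  { keyword = "color: blue; font-weight: bold;"
  , comment = "color: green;"
  }

-- Parse style-file contents; a malformed file yields the defaults
-- instead of the runtime error that plain 'read' would raise.
parsePrefs :: String -> MiniPrefs
parsePrefs = fromMaybe defaultMini . readMaybe
```

A file for this cut-down type would contain something like `MiniPrefs { keyword = "color: navy;", comment = "color: gray;" }`; the real file would use the StylePrefs constructor and list all of its fields.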

Hscolour produces a String of HTML. To transform it, we need to parse it, manipulate it and then re-render it as a String. I'll use HaXml to do all of this:

> xformXml :: StylePrefs -> String -> String
> xformXml prefs s =  verbatim $ filtDoc (xmlParse "input" s) where
>     -- filter the document (an Hscoloured fragment of Haskell source)
>     filtDoc (Document p s e m) =  c where
>         [c] = filts (CElem e)
>     -- the filter is a fold of individual filters for each CSS class
>     filts = foldXml $ foldl o keep [
>             filt "keyword" keyword,
>             filt "keyglyph" keyglyph,
>             filt "layout" layout,
>             filt "comment" comment,
>             filt "conid" conid,
>             filt "varid" varid,
>             filt "conop" conop,
>             filt "varop" varop,
>             filt "str" str,
>             filt "chr" chr,
>             filt "num" number,
>             filt "cpp" cpp,
>             filt "sel" selection,
>             filt "variantselection" variantselection,
>             filt "definition" definition
>         ]
>     -- an individual filter replaces the attributes of a tag with
>     -- a style attribute when it has a specific 'class' attribute.
>     filt lbl f =
>         replaceAttrs [("style",f prefs)] `when`
>             (attrval $ ("class",AttValue [Left ("hs-" ++ lbl)]))
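
The effect of the filter is easier to see on a small piece of markup. The sketch below is a deliberately naive, string-level imitation of it, not the HaXml approach used above: it scans the text and swaps a class attribute for a style attribute. The function name inlineStyles is hypothetical, and it assumes the attribute is written with double quotes, which may not match hscolour's actual output.

```haskell
import Data.List (isPrefixOf)

-- Naive textual stand-in for the HaXml filter: wherever the input contains
-- class="hs-NAME" for a NAME in the table, emit style="..." instead.
inlineStyles :: [(String, String)] -> String -> String
inlineStyles table = go
  where
    go [] = []
    go str@(c : cs) =
      case [ (style, drop (length attr) str)
           | (cls, style) <- table
           , let attr = "class=\"hs-" ++ cls ++ "\""
           , attr `isPrefixOf` str ] of
        (style, rest) : _ -> "style=\"" ++ style ++ "\"" ++ go rest
        []                -> c : go cs
```

Given the table `[("keyword", "color: blue;")]`, the input `<span class="hs-keyword">let</span>` becomes `<span style="color: blue;">let</span>`, and spans whose class is not in the table pass through unchanged. Working on the parsed document, as the real code does, avoids the quoting and escaping pitfalls of this textual shortcut.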

To completely colourise a CodeBlock we now can create a function that transforms a CodeBlock into a RawHtml block, where the content contains marked up Haskell (possibly with literate markers):

> colouriseCodeBlock prefs (CodeBlock attr@(_,inf,_) s) =
>     if tag == "Haskell" || lit
>         then RawHtml $ xformXml prefs $ colourIt lit s'
>         else CodeBlock attr s'
>     where (tag,s') = unTag s
>           lit = "sourceCode" `elem` inf && "haskell" `elem` inf
> colouriseCodeBlock _ b = b

And colourising a Pandoc document is simply:

> colourisePandoc prefs (Pandoc m blocks) = 
>     Pandoc m $ map (colouriseCodeBlock prefs) blocks

Transforming a complete input document string to an HTML output string:

> xformDoc :: StylePrefs -> String -> String
> xformDoc prefs s = 
>     showHtml 
>     $ writeHtml writeOpts -- from Pandoc
>     $ colourisePandoc prefs
>     $ readMarkdown parseOpts -- from Pandoc
>     $ fixLineEndings s
>     where writeOpts = defaultWriterOptions {
>               writerLiterateHaskell = True,
>               writerReferenceLinks = True }
>           parseOpts = defaultParserState { 
>               stateLiterateHaskell = True }
>           -- readMarkdown is picky about line endings
>           fixLineEndings [] = []
>           fixLineEndings ('\r':'\n':cs) = '\n':fixLineEndings cs
>           fixLineEndings (c:cs) = c:fixLineEndings cs

Now that I can transform a document, I need to be able to post the document to my blog. The MetaWeblog API defines newPost and editPost procedures that look like:

metaWeblog.newPost (blogid, username, password, struct, publish) returns string
metaWeblog.editPost (postid, username, password, struct, publish) returns true

For my blog (a WordPress blog), the blogid is just default. The user name and password are simply strings, and publish is a flag indicating whether to load the post as a draft, or to make it public immediately. The postid is an identifier string which is assigned when you initially create a post. The interesting bit is the struct field, which is an XML-RPC structure defining the post along with some meta-data, like the title. All I need is to be able to provide the post text and a title, so I can create the right struct like so:

> mkPost title text = 
>     [("title",title),("description",text)]
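
For readers who have not seen XML-RPC on the wire, the sketch below renders such a list of pairs roughly the way an XML-RPC struct of strings is encoded. It is only an illustration with hypothetical names (escapeXml, renderStruct); HaXR does the real encoding, and its exact output (whitespace, escaping details) may differ.

```haskell
-- Escape the characters that are special in XML text content.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- Render name/value pairs as an XML-RPC <struct> whose members are strings.
renderStruct :: [(String, String)] -> String
renderStruct members = "<struct>" ++ concatMap member members ++ "</struct>"
  where
    member (name, value) =
      "<member><name>" ++ escapeXml name ++ "</name><value><string>"
        ++ escapeXml value ++ "</string></value></member>"
```

Applied to `[("title", "A & B")]` this gives `<struct><member><name>title</name><value><string>A &amp; B</string></value></member></struct>`.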

The HaXR library exports a function for invoking XML-RPC procedures:

remote :: Remote a => 
    String -- ^ Server URL. May contain username and password on
           --   the format username:password@ before the hostname.
       -> String -- ^ Remote method name.
       -> a      -- ^ Any function 
     -- @(XmlRpcType t1, ..., XmlRpcType tn, XmlRpcType r) => 
                 -- t1 -> ... -> tn -> IO r@

The function requires a URL and a method name, and returns a function of type Remote a => a. Based on the instances defined for Remote, any function with zero or more parameters in the class XmlRpcType and a return type of XmlRpcType r => IO r will work, which means you can simply 'feed' remote additional arguments as required by the remote procedure, and as long as you make the call in an IO context, it will typecheck. So to call the metaWeblog.newPost procedure, I can do something like:

> postIt :: String -> String -> String -> String -> String -> String -> Bool -> IO String
> postIt url blogId user password title text publish =
>     remote url "metaWeblog.newPost" blogId user password (mkPost title text) publish

To update (replace) a post, the function would be:

> updateIt :: String -> String -> String -> String -> String -> String -> Bool -> IO Bool
> updateIt url postId user password title text publish =
>     remote url "metaWeblog.editPost" postId user password (mkPost title text) publish
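
The trick that lets remote accept a varying number of arguments can be sketched with a tiny type class of my own. This is not HaXR's code: Collect and call are hypothetical names, arguments are recorded with show instead of being encoded as XML-RPC values, and the final result is a plain list rather than an IO action, purely to keep the example small.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- A result type 'r' that can absorb the arguments gathered so far.
class Collect r where
  collect :: [String] -> r

-- Base case: no more arguments, hand back everything in call order.
instance Collect [String] where
  collect = reverse

-- Recursive case: accept one more showable argument, then continue.
instance (Show a, Collect r) => Collect (a -> r) where
  collect acc x = collect (show x : acc)

-- Start a "call" with the method name; the requested result type decides
-- how many further arguments are accepted.
call :: Collect r => String -> r
call method = collect [method]
```

With this in scope, `call "metaWeblog.newPost" "default" True :: [String]` evaluates to `["metaWeblog.newPost", "\"default\"", "True"]`. The type annotation plays the role that the signatures of postIt and updateIt play above: it tells the compiler where the argument list ends.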

I've got most of the pieces in place -- I just need to turn it into a command line program. I can capture the command line controls in a type:

> data BlogLiterately = BlogLiterately {
>        style :: String,    -- name of a style file
>        publish :: Bool,    -- an indication of whether the post should be
>                            -- published, or loaded as a draft
>        blogid :: String,   -- blog-specific identifier (e.g. for blogging
>                            -- software handling multiple blogs)
>        blog :: String,     -- blog xmlrpc URL
>        user :: String,     -- blog user name
>        password :: String, -- blog password
>        title :: String,    -- post title
>        file :: String,     -- file to post
>        postid :: String    -- id of a post to update
>     } deriving (Show,Data,Typeable)

And using CmdArgs, this bit of impure evil defines how the command line arguments work:

> bl = mode $ BlogLiterately {
>     style = "" &= text "Style Specification" & typFile,
>     publish = def &= text "Publish post",
>     blogid = "default" &= text "Blog specific identifier",
>     blog = def &= argPos 0 & typ "URL" 
>         & text "URL of blog's xmlrpc address (e.g. http://example.com/blog/xmlrpc.php)",
>     user = def &= argPos 1 & typ "USER" & text "blog author's user name" ,
>     password = def &= argPos 2 & typ "PASSWORD" & text "blog author's password",
>     title = def &= argPos 3 & typ "TITLE",
>     file = def &=  argPos 4 & typ "FILE" & text "literate haskell file",
>     postid = "" &= text "Post to replace (if any)"}

The main blogging function uses the information captured in the BlogLiterately type to read the style preferences, read the input file and transform it, and post it to the blog:

> blogLiterately (BlogLiterately style pub blogid url user pw title file postid) = do
>     prefs <- getStylePrefs style
>     html <- liftM (xformDoc prefs) $ U.readFile file
>     if null postid 
>         then do
>             postid <- postIt url blogid user pw title html pub
>             putStrLn $ "post Id: " ++ postid
>         else do
>             result <- updateIt url postid user pw title html pub
>             unless result $ putStrLn "update failed!"

And the main program is simply:

> main = cmdArgs "Blog Literately v0.1, (C) Robert Greayer 2009" [bl] >>= blogLiterately

I can run it to get some help:

$ ./BlogLiterately --help
Blog Literately v0.1, (C) Robert Greayer 2009

blogliterately [FLAG] URL USER PASSWORD TITLE FILE

  -? --help[=FORMAT]  Show usage information (optional format)
  -V --version        Show version information
  -v --verbose        Higher verbosity
  -q --quiet          Lower verbosity
  -s --style=FILE     Style Specification
     --publish        Publish post
  -b --blogid=VALUE   Blog specific identifier (default=default)
     --postid=VALUE   Post to replace (if any)

Which tells me I can actually upload a post something like:

$ ./BlogLiterately http://greayer.wordpress.com/xmlrpc.php myuser mypass \
    "Blogging Literately in Haskell" BlogLiterately.lhs

This is a great start for what I want. Handling of exceptions is non-existent; I simply cross my fingers and hope that the default error message will be self-explanatory. I ought to also allow the author and categories for the document to be specified. But it works as is, and it's a tool I'd actually use (and did, to post this).

10 Comments »

  1. Cool post. Note also that instead of hscolour you can build pandoc with the flag “-f highlighting” it’ll build in haskell source code highlighting using the highlighting-kate library.

    Comment by Gregory Collins — 2009-10-26 @ 22:38

  2. Looks very useful to us — thanks for publicising this!

    Comment by Kevin Hammond — 2009-10-27 @ 04:04

    • You’re probably aware of this, but in case you’re not: recent versions of pandoc directly support literate Haskell. Be sure you’ve installed pandoc with highlighting support:

      cabal install --reinstall -fhighlighting pandoc

      Then you can do

      pandoc -s -f markdown+lhs -t html+lhs mypost.txt -o mypost.html

      and the input (in “bird-style” literate Haskell) will be converted into HTML, with highlighted Haskell source code.

      It wouldn’t be too hard to modify pandoc to allow the option of using hscolour instead of highlighting-kate for highlighting Haskell code. I may put that on the todo list…

      Comment by John MacFarlane — 2009-10-27 @ 10:10

      • Also: if you want, say, an OCaml example, you can do that in pandoc using delimited code blocks:

        ~~~~~~~~~~~~~~~~~~~~~ {.ml}
        fun fact (n) =
        if n=0 then 1
        else n*fact(n-1);
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~

        This will be highlighted as OCaml.

        Comment by John MacFarlane — 2009-10-27 @ 10:30

      • I’ll have to try this out — it looks like -kate has definitions for everything from Ada to Yacc (somebody should contribute definitions for Z80 assembly language, for completeness). The stumbling block for me is the PCRE dependency which means it takes more than just a cabal install on Windows.

        Comment by robgreayer — 2009-10-27 @ 11:20

  3. Testing BlogLiterally, and HTML DSL changes…

    Robert Greayer has just released his BlogLiterally program on hackage, so I thought I’d try it out. I’ve made some small changes to the HTML DSL I’ve been working on since i last posted, which I’ll outline here: (Turns out that….

    Trackback by Data.Random — 2009-11-03 @ 03:01

  4. It is a joy even for private use to look at stable code as html pages. I kept crude tools to do this, with markdown comments, for many years in various languages. I stopped only in the last few years for Haskell, because my editor allows great syntax coloring, and GHC accepts a custom literate preprocessor.

    Yes, once I could adopt a “code is indented, comments are flush” idiom, and get rid of the danged bird tracks, my editor window looked close enough to a nicely formatted web page that I just stopped bothering with the creation of an explicit web page. (I also allow single periods to delimit comment blocks, and process -> Unicode, etc.)

    You’d be surprised how many Haskell programmers share my dislike of bird tracks. We understand how others could get used to them, but that’s a slippery slope, one can also get used to the parentheses in Lisp, but why live that way? (I also did all of my Lisp programming with a parenthesis preprocessor.)

    The way to put this all together would be to integrate GHC preprocessing with the html creation. Markdown does make for the best comments, and the time to write comments is while coding.

    Comment by Dave — 2009-11-03 @ 07:26

  5. Hi, your post is the best library use tutorial I have ever seen :) I translated your text word for word to Polish language here http://gracjanpolak.wordpress.com/2009/11/01/literackie-blogowanie-w-haskellu/. Hope you don’t mind! :)

    Comment by gracjanpolak — 2009-11-03 @ 16:44

    • Thanks, and no problem!

      Comment by robgreayer — 2009-11-03 @ 17:24

  6. > skipMany $ oneOf ” t”
    It seemed that program swallows backslashes..

    Comment by Dmitry Olyenyov — 2009-11-08 @ 21:56

