Archive for March, 2009

Dirty little stemmer in OCaml (part 1)

Tuesday, March 31st, 2009

1958 Volkswagen Pickup photographed at the Hen...Image via Wikipedia

This is not a good example of word stemmer or anything like that. I post it more as an example of some simple and practical code of this fine language – OCaml. The example looks interesting to me since it mixes “practical” stuff like regexes with all 3 basic hof-s map, fold and filter. Remember, I am an OCaml newbie.. so while this works I don’t guarantee anything else. All code here is side effects free.

I wrote this when I urgently needed something that would resolve the quite banal cases I was seeing on my oglasko.com classifieds ads search engine. For example “Ljubljana, Ljubljani, Ljubljano”, “kuhar, kuharica” and case specific synonyms like “vw, volkswagen”. Even if stemmer is mega-crude it resolved most of the cases quite well. It’s probably one of those 20-80 things or hopefully 10-90. After all people don’t have that wild imagination when searching for classifieds. Of course I intend to improve it all when it’s time comes again. It’s a stemmer for my (Slovene) language.

The function stemm that does it all is a straight “pipe”:

let stemm s =
  rebuild ( regex_replace to_regex_replace ( swap to_swap ( remove to_remove ( tokenize s ) ) ) )

Contrary to stack languages like Factor you have to read it from inside out. s is our string, we tokenize it, we remove the to_remove stuff, we swap the to_swap stuff and we regex_replace the to_regex_replace stuff then we rebuild it and return it.

Tokenize is just a fancy word for separating string to separate tokens – words. It could be a little more complex but ours just splits on space:

let tokenize s = Str.split (Str.regexp " ") s

to_remove is a list of words to remove:

let to_remove = [ "in"; "na"; "v"; "in"; "ali"; "pred"; "za"; "pod" ]

let remove bad tags =
  let not_in_bad tag =
    let equal_string rtag = equal ( String.compare tag rtag ) in
    not ( List.exists equal_string bad )
  in
  List.filter not_in_bad tags

I will try to explain, it’s basically very simple and clean even if it might not look that way to you.

remove is a function that takes list of “bad” words and our list of tokens. We will use the List.filter to filter out the bad words (obviously). Filter accepts a function (A) and a list (Lin) and returns a new list (Lout) with filtered values. Function (A) accepts current item in list and returns true/false which determines if the item will be a part of returned list (Lout).

So we just need to create that function A. We call it not_in_bad and it’s a closure in our case because it encloses the bad list. This function again uses only one obvious function from List module – List.exists.

Exists again takes a function (B) and a list. Function B determines how we want to qualify if something is the same as something else. We create the function B one line above and name it equal_string and it just calls String.compare and checks if result is 0. Code for equal function is just:

let equal x = x == 0

Let’s do the swap now. to_swap is a assoc list that we turn into hash-table for quicker key-ed access.

let to_swap = assoc_to_hash [
  "volkswagen", 	"vw";
  "lj", 		"ljubljana";
  "kopru", 		"koper"; ]

let swap hash tags =
  let swap_one_if tag =
    try Hashtbl.find hash tag
    with Not_found -> tag
  in
  List.map swap_one_if tags

swap swaps tags that are the same as keys in hash-table with their values. It uses just one function List.map. Map accepts a function that determines the mapping and a list. We deliver it swap_one_if closure that tries to find the current tag in hashtable and if it does returns it’s value, if not it returns the tag. Very simple and clean.

regex_replace uses List.map in the same way as the swap function.

let to_regex_replace = [
  "ico$", "";
  "ica$", "";
  "ah$", "";
  "em$", "";
]

let regex_replace res tags =
  let regex_one tag =
    List.fold_left (fun sumer (k, v) ->
      Str.global_replace ( Str.regexp k ) v sumer
    ) tag res
  in
  List.map regex_one tags

It’s mapping function regex_one uses a List.fold_left tu run all regexes (with key being the regular expression and value being the replacement) sequentially on the tag and return the result of it.

Now we only have the rebuild function missing in our “pipe” and we are done. This functions seems short boring string concat but there is something weird about it. Can you see it?

let rebuild = String.concat " "

Rebuild function has no arguments? and string concat probably doesn’t takes just 1 string value?

What happened here is not a bug and not a hack but something called Currying which lies deep at the theoretical basis of most(?) functional languages. OCaml functions, in a way don’t take multiple arguments but “curry them in”. This is explained with examples if you click here and scroll down a little. So what happens here is: rebuild doesn’t take a list and concats it with ” “, but it takes nothing and returns a *function* that takes a list and concats it with ” “.

The stemmer is a practical beast and works in two modes (because we need it to). It can start, accept string through unix pipe and return stemmed string or it can run continuously and behave as a very simple TCP server which accepts a string through TCP and returns stemmed back. I will post that and full source code next time as this post is getting too long.

If you aren’t familiar with the basic hof-s map, fold and filter you can find out more about them here or consult the wikipedia.

Reblog this post [with Zemanta]

REBOL notes: WITH convention

Sunday, March 15th, 2009

What’s going on

Saturday, March 14th, 2009