An LLM tokenizer implemented as a streamly application #78
twitu wants to merge 10 commits into composewell:master
Conversation
Map stream of bytes to index values
I have not fully understood the program, I just had a cursory look at it:
Yeah, I will change the example to use Streamly's Array. There are two pipes: one of them can be written as a scanl, but the other might be quite difficult. 🤔 I was hoping to make two examples for this application: one that's easy to read and one that's performance oriented. Some parts of the logic can even benefit from parallelism. What do you think?
Perhaps you can make the performance-oriented version easy to read :-)
```haskell
-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
    { byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
```
This `Map Word8 Int` can just be a mutable array and can benefit from O(1) peeking and poking.
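A sketch of that suggestion using `IOUArray` from the boot `array` package (streamly's own mutable array type would serve the same role); the helper names here are hypothetical:

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.Word (Word8)

-- Replace the Map Word8 Int with a mutable unboxed array indexed
-- directly by the byte value, giving O(1) reads and writes.
newByteToIndex :: IO (IOUArray Word8 Int)
newByteToIndex = newArray (minBound, maxBound) (-1) -- -1 marks "unassigned"

assignIndex :: IOUArray Word8 Int -> Word8 -> Int -> IO ()
assignIndex = writeArray

lookupIndex :: IOUArray Word8 Int -> Word8 -> IO Int
lookupIndex = readArray
```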
```haskell
addPair acc chunk =
    case Array.toList chunk of
        [b1, b2] -> M.insertWith (+) (b1, b2) 1 acc
        _ -> acc
```
You can index into the Array directly.
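The suggestion might look like this, using `Data.Array` from the boot `array` package as a stand-in for streamly's Array (which also exposes O(1) length and indexing), so the chunk is never materialised as a list:

```haskell
import qualified Data.Map.Strict as M
import Data.Array (Array, bounds, listArray, (!))
import Data.Word (Word8)

-- Check the chunk length via its bounds and index directly,
-- instead of converting the whole chunk to a list first.
addPair :: M.Map (Word8, Word8) Int -> Array Int Word8 -> M.Map (Word8, Word8) Int
addPair acc chunk
    | hi - lo == 1 = M.insertWith (+) (chunk ! lo, chunk ! hi) 1 acc
    | otherwise = acc -- not a two-element chunk: leave counts unchanged
  where
    (lo, hi) = bounds chunk
```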
```haskell
updateMappings (ByteMappings b2i s2i i2t nidx) (i1, i2) =
    let text1 = M.findWithDefault "?" i1 i2t
        text2 = M.findWithDefault "?" i2 i2t
        newToken = text1 ++ text2
```
You should use Text or a utf8-encoded Array Word8 instead of String.
```haskell
{-# INLINE replaceMostFrequentPair #-}
replaceMostFrequentPair :: (Monad m) => (Int, Int) -> Int -> Pipe m Int Int
```
Could you describe what this function does? Some examples would help.
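For context, judging from the surrounding BPE logic this function presumably replaces every adjacent occurrence of the given pair with the new merged index. A pure list sketch of that presumed behaviour (function name hypothetical):

```haskell
-- Replace each adjacent occurrence of the pair (a, b) with the
-- merged index `new`; consumed pairs do not overlap, so after a
-- match we continue past both elements.
replacePair :: (Int, Int) -> Int -> [Int] -> [Int]
replacePair (a, b) new (x : y : rest)
    | x == a && y == b = new : replacePair (a, b) new rest
replacePair p new (x : rest) = x : replacePair p new rest
replacePair _ _ [] = []
```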
```haskell
-- Stores byte-sequence-to-index mapping and index-to-text mapping
data ByteMappings = ByteMappings
    { byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
      seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices
```
I'm curious to see how a cuckoo hash table might behave in this case.
We can try using it and checking performance.
You can use an unboxed Array instead of Vector Word8, i.e. Array Word8 basically.
```haskell
data ByteMappings = ByteMappings
    { byteToIndex :: !(M.Map Word8 Int), -- Maps bytes to unique indices
      seqToIndex :: !(M.Map (V.Vector Word8) Int), -- Maps sequences of bytes to unique indices
      indexToText :: !(M.Map Int String), -- Maps indices to text representation
```
You can maybe use Text or a utf8-encoded Array Word8 for this?
```haskell
let text1 = M.findWithDefault "?" i1 i2t
    text2 = M.findWithDefault "?" i2 i2t
    newToken = text1 ++ text2
    bytes = V.fromList $ map charToWord8 newToken
```
This looks incorrect. A Char is essentially 4 bytes (a full Unicode code point), so you are losing information here; the conversion needs to be Char -> [Word8].
Unless you're strictly using ASCII. In that case, you needn't use Char at all.
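To make the point concrete: casting with `fromIntegral . ord` keeps only the low byte of the code point, while a correct `Char -> [Word8]` must UTF-8 encode. A small sketch of both:

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- The lossy cast: anything above code point 255 is truncated.
-- E.g. the euro sign U+20AC comes out as 0xAC.
charToWord8 :: Char -> Word8
charToWord8 = fromIntegral . ord

-- A correct conversion must UTF-8 encode the code point,
-- producing one to four bytes per Char.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
    | n < 0x80 = [f n]
    | n < 0x800 = [f (0xC0 + shiftR n 6), cont 0]
    | n < 0x10000 = [f (0xE0 + shiftR n 12), cont 6, cont 0]
    | otherwise = [f (0xF0 + shiftR n 18), cont 12, cont 6, cont 0]
  where
    n = ord c
    f = fromIntegral
    cont k = f (0x80 + (shiftR n k .&. 0x3F)) -- continuation byte
```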
```haskell
-- reset the state (starting with the current byte), and continue.
{-# INLINE greedyTokenizer #-}
greedyTokenizer :: (Monad m) => ByteMappings -> Pipe m Word8 String
greedyTokenizer mapping = Pipe consume produce (V.empty, "", 0)
```
You can just write this as a Stream.
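For comparison, here is a pure list sketch of the greedy longest-match loop the Pipe's state machine implements (helper names hypothetical; a streamly version would express the same step function over the input Stream):

```haskell
import qualified Data.Map.Strict as M
import Data.Word (Word8)

-- Emit the longest known token starting at the current position;
-- if no sequence starting here is in the vocabulary, emit a
-- placeholder and restart from the next byte.
greedyTokenize :: M.Map [Word8] String -> [Word8] -> [String]
greedyTokenize vocab = go
  where
    go [] = []
    go bs@(_ : rest) =
        case longestMatch [] bs Nothing of
            Just (tok, after) -> tok : go after
            Nothing -> "?" : go rest
    -- Extend the candidate one byte at a time, remembering the
    -- longest candidate that was an actual vocabulary entry.
    longestMatch _ [] best = best
    longestMatch acc (b : bs) best =
        let acc' = acc ++ [b]
            best' = case M.lookup acc' vocab of
                Just t -> Just (t, bs)
                Nothing -> best
         in longestMatch acc' bs best'
```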
A greedy tokenizer breaks text into tokens based on data-driven rules it has learnt. The learning phase finds the most common pair of tokens in the data and merges them into a new token.
This is a pure text-processing application re-imagined as a streaming application: a study of all three fundamental constructs of streaming (Streams, Folds and Pipes) and a demonstration of the streamly framework.
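The learning phase described above (count adjacent pairs, merge the most frequent one) can be sketched in plain Haskell as follows; function names are illustrative, not the PR's actual API:

```haskell
import qualified Data.Map.Strict as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Count every adjacent pair of tokens in the stream.
countPairs :: Ord a => [a] -> M.Map (a, a) Int
countPairs xs = M.fromListWith (+) [(p, 1) | p <- zip xs (drop 1 xs)]

-- One learning step: find the most frequent pair and merge all
-- of its occurrences into the fresh token index `new`.
mergeStep :: [Int] -> Int -> [Int]
mergeStep toks new
    | M.null pairs = toks
    | otherwise = replace best toks
  where
    pairs = countPairs toks
    best = fst (maximumBy (comparing snd) (M.toList pairs))
    replace (a, b) (x : y : rest)
        | x == a && y == b = new : replace (a, b) rest
    replace p (x : rest) = x : replace p rest
    replace _ [] = []
```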
A review is welcome.