Matching bytestrings in Parsec


Problem description

I am currently trying to use the Full CSV Parser presented in Real World Haskell. I tried to modify the code to use ByteString instead of String, but it relies on the string combinator, which only works with String.

Is there a Parsec combinator similar to string that works with ByteString, without having to do conversions back and forth?

I've seen there is an alternative parser that handles ByteString: attoparsec, but I would prefer to stick with Parsec, since I'm just learning how to use it.

Solution

I'm assuming you're starting with something like

import Prelude hiding (getContents, putStrLn)
import Data.ByteString
import Text.Parsec.ByteString

Here's what I've got so far. There are two versions. Both compile. Probably neither is exactly what you want, but they should aid discussion and help you to clarify your question.

Something I noticed along the way:

  • If you import Text.Parsec.ByteString then this uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert all read bytes to Chars. This enables Parsec's line and column number error reporting to work sensibly, and also enables you to use string and friends without problem.
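The point above can be checked in isolation. The following is a minimal sketch, assuming the parsec and bytestring packages are available; header and parseHeader are hypothetical names introduced only for this illustration. It runs the stock string combinator directly over a strict ByteString, with no replacement combinator and no manual conversion of the input.

```haskell
import qualified Data.ByteString.Char8 as C8
import Text.Parsec (ParseError, parse, string)
import Text.Parsec.ByteString (Parser)

-- Matches the literal "name," at the start of the input.
-- `string` here is the ordinary Parsec combinator: the input is a
-- ByteString, but the parser sees its bytes as Char tokens.
header :: Parser String
header = string "name,"

-- Hypothetical helper for this sketch: run `header` on a strict ByteString.
parseHeader :: C8.ByteString -> Either ParseError String
parseHeader = parse header "(example)"
```

Note that the result is still a String; only the input stays a ByteString, which is the trade-off the easy version accepts.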

Thus, here is the easy version of the CSV parser, which relies on exactly that:

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (getContents)

import Text.Parsec
import Text.Parsec.ByteString

csvFile :: Parser [[String]]
csvFile = endBy line eol
line :: Parser [String]
line = sepBy cell (char ',')
cell :: Parser String
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser String
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

quotedChar :: Parser Char
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return '"')

eol :: Parser String
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[String]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

But this was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.

But, it compiles and has complete type annotations, and therefore should make a sound starting point.

import Prelude hiding (getContents, putStrLn)
import Data.ByteString (ByteString)
import Control.Monad (liftM)

import qualified Prelude (getContents, putStrLn)
import qualified Data.ByteString as ByteString (pack, getContents)
import qualified Data.ByteString.Char8 as Char8 (pack)

import Data.Word (Word8)
import Data.ByteString.Internal (c2w)

import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
import Text.Parsec.ByteString
import Text.Parsec.Prim (tokens, tokenPrim)
import Text.Parsec.Pos (updatePosChar, updatePosString)
import Text.Parsec.Error (ParseError)

csvFile :: Parser [[ByteString]]
csvFile = endBy line eol
line :: Parser [ByteString]
line = sepBy cell (char ',')
cell :: Parser ByteString
cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

quotedCell :: Parser ByteString
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return (ByteString.pack content)

quotedChar :: Parser Word8
quotedChar =
        noneOf "\""
    <|> try (string "\"\"" >> return (c2w '"'))

eol :: Parser ByteString
eol =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

parseCSV :: ByteString -> Either ParseError [[ByteString]]
parseCSV = parse csvFile "(unknown)"

main :: IO ()
main =
    do c <- ByteString.getContents
       case parse csvFile "(stdin)" c of
            Left e -> do Prelude.putStrLn "Error parsing input:"
                         print e
            Right r -> mapM_ print r

-- replacements for some of the functions in the Parsec library

noneOf :: String -> Parser Word8
noneOf cs   = satisfy (\b -> b `notElem` [c2w c | c <- cs])

char :: Char -> Parser Word8
char c      = byte (c2w c)

byte :: Word8 -> Parser Word8
byte c      = satisfy (==c)  <?> show [c]

satisfy :: (Word8 -> Bool) -> Parser Word8
satisfy f   = tokenPrim (\c -> show [c])
                        (\pos c _cs -> updatePosChar pos c)
                        (\c -> if f (c2w c) then Just (c2w c) else Nothing)

string :: String -> Parser ByteString
string s    = liftM Char8.pack (tokens show updatePosString s)

Efficiency-wise, your main concern should probably be the two ByteString.pack calls in the definitions of cell and quotedCell. You might try to replace the Text.Parsec.ByteString module so that, instead of "making strict ByteStrings an instance of Stream with Char token type," you make ByteStrings an instance of Stream with Word8 token type. This won't help with efficiency, though; it will just give you a headache trying to reimplement all the SourcePos functions that keep track of your position in the input for error messages.
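To make the Word8-token idea concrete, here is a rough sketch of what it could involve; it is an illustration under assumptions, not a recommended design. Parsec's Stream class ties each stream type to a single token type, and ByteString is already tied to Char, so the sketch introduces a hypothetical newtype Bytes. The names ByteParser, satisfyByte, field and parseField are likewise invented for this example.

```haskell
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

import qualified Data.ByteString as B
import Data.Word (Word8)
import Text.Parsec.Error (ParseError)
import Text.Parsec.Pos (incSourceColumn)
import Text.Parsec.Prim (Parsec, Stream (..), many, parse, tokenPrim)

-- Hypothetical wrapper so that the token type can be Word8.
newtype Bytes = Bytes B.ByteString

instance Monad m => Stream Bytes m Word8 where
  -- Take one byte off the front and re-wrap the remainder.
  uncons (Bytes b) = return (fmap (fmap Bytes) (B.uncons b))

type ByteParser = Parsec Bytes ()

-- Accept one byte satisfying the predicate. Position tracking is the
-- awkward part: this sketch only bumps the column by one per byte, so
-- line numbers in error messages are never advanced.
satisfyByte :: (Word8 -> Bool) -> ByteParser Word8
satisfyByte f =
  tokenPrim show
            (\pos _ _ -> incSourceColumn pos 1)
            (\b -> if f b then Just b else Nothing)

-- Hypothetical example parser: a run of non-comma bytes.
field :: ByteParser [Word8]
field = many (satisfyByte (/= 44)) -- 44 is ASCII ','

parseField :: B.ByteString -> Either ParseError [Word8]
parseField = parse field "(example)" . Bytes
```

Even in this small form, every combinator from Text.Parsec.Char would have to be rewritten against Word8, and the position update shown here is deliberately naive.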

No, the way to make it more efficient would be to change the types of cell, quotedCell and string to Parser [Word8], and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (), since its result is never used. The necessary changes would look something like this:

cell :: Parser [Word8]
cell = quotedCell <|> many (noneOf ",\n\r")

quotedCell :: Parser [Word8]
quotedCell = 
    do _ <- char '"'
       content <- many quotedChar
       _ <- char '"' <?> "quote at end of cell"
       return content

string :: String -> Parser [Word8]
string s    = liftM (map c2w) (tokens show updatePosString s)
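The change of eol to Parser () mentioned above is not shown in the answer. A minimal sketch of it follows, written against the stock Char-token string combinator rather than the replacement defined earlier, so that it stands alone; parseEol is a hypothetical helper added only to exercise it.

```haskell
import Control.Monad (void)
import qualified Data.ByteString.Char8 as C8
import Text.Parsec (ParseError, parse, string, try, (<|>), (<?>))
import Text.Parsec.ByteString (Parser)

-- Same alternatives as before, but the matched text is discarded,
-- so nothing has to be packed or converted for a line ending.
eol :: Parser ()
eol = void (   try (string "\n\r")
           <|> try (string "\r\n")
           <|> string "\n"
           <|> string "\r")
      <?> "end of line"

-- Hypothetical helper for this sketch.
parseEol :: C8.ByteString -> Either ParseError ()
parseEol = parse eol "(example)"
```

Because endBy ignores the separator's result, csvFile = endBy line eol should typecheck unchanged with this definition.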

You don't need to worry about all the calls to c2w as far as efficiency is concerned, because they cost nothing.
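For readers who have not met c2w before, the short sketch below shows what it and its counterpart w2c do; comma and roundTrip are hypothetical names. One caveat worth keeping in mind: c2w keeps only the low 8 bits of a character's code point, so the conversion is only faithful for characters up to '\255'.

```haskell
import Data.ByteString.Internal (c2w, w2c)
import Data.Word (Word8)

-- The byte value of an ASCII comma.
comma :: Word8
comma = c2w ','        -- 44

-- Char -> Word8 -> Char. This is the identity for characters up to
-- '\255'; anything above that is truncated to its low byte.
roundTrip :: Char -> Char
roundTrip = w2c . c2w
```

For CSV delimiters, quotes and line endings, which are all ASCII, the truncation never comes into play.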

If this doesn't answer your question, please say what would.
