Matching bytestrings in Parsec
Problem description
I am currently trying to use the Full CSV Parser presented in Real World Haskell. I tried to modify the code to use ByteString instead of String, but there is a string combinator which works only with String.
Is there a Parsec combinator similar to string that works with ByteString, without having to do conversions back and forth?
I've seen there is an alternative parser that handles ByteString: attoparsec, but I would prefer to stick with Parsec, since I'm just learning how to use it.
I'm assuming you're starting with something like:

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString
    import Text.Parsec.ByteString

Here's what I've got so far. There are two versions. Both compile. Probably neither is exactly what you want, but they should aid discussion and help you to clarify your question.
Something I noticed along the way:
- If you import Text.Parsec.ByteString then this uses uncons from Data.ByteString.Char8, which in turn uses w2c from Data.ByteString.Internal, to convert all read bytes to Chars. This enables Parsec's line and column number error reporting to work sensibly, and also enables you to use string and friends without problem.
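
As a small illustration of that point, here is a minimal sketch (not part of the CSV parser) in which the stock string combinator from Text.Parsec runs directly over a strict ByteString. The names idField and parseId are made up for this example.

```haskell
import qualified Data.ByteString.Char8 as C
import Text.Parsec (ParseError, digit, many1, parse, string)
import Text.Parsec.ByteString (Parser)

-- Hypothetical example parser: the literal "id=" followed by one or more digits.
-- 'string' is the ordinary Parsec combinator; the input stream is a ByteString
-- whose bytes Parsec hands to us as Chars.
idField :: Parser String
idField = string "id=" >> many1 digit

-- Run the parser on a strict ByteString; no manual conversion of the input.
parseId :: C.ByteString -> Either ParseError String
parseId = parse idField "(example)"
```

With these definitions, parseId (C.pack "id=42") should yield Right "42". Note that the result is still a String: only the input is a ByteString.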
Thus, the easy version of the CSV parser, which does exactly that:

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)

    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (getContents)

    import Text.Parsec
    import Text.Parsec.ByteString

    csvFile :: Parser [[String]]
    csvFile = endBy line eol

    line :: Parser [String]
    line = sepBy cell (char ',')

    cell :: Parser String
    cell = quotedCell <|> many (noneOf ",\n\r")

    quotedCell :: Parser String
    quotedCell =
        do _ <- char '"'
           content <- many quotedChar
           _ <- char '"' <?> "quote at end of cell"
           return content

    quotedChar :: Parser Char
    quotedChar =
            noneOf "\""
        <|> try (string "\"\"" >> return '"')

    eol :: Parser String
    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"

    parseCSV :: ByteString -> Either ParseError [[String]]
    parseCSV = parse csvFile "(unknown)"

    main :: IO ()
    main =
        do c <- ByteString.getContents
           case parse csvFile "(stdin)" c of
                Left e -> do Prelude.putStrLn "Error parsing input:"
                             print e
                Right r -> mapM_ print r

But this was so trivial to get working that I assume it cannot possibly be what you want. Perhaps you want everything to remain a ByteString or [Word8] or something similar all the way through? Hence my second attempt below. I am still importing Text.Parsec.ByteString, which may be a mistake, and the code is hopelessly riddled with conversions.
But it compiles and has complete type annotations, and therefore should make a sound starting point.

    import Prelude hiding (getContents, putStrLn)
    import Data.ByteString (ByteString)
    import Control.Monad (liftM)

    import qualified Prelude (getContents, putStrLn)
    import qualified Data.ByteString as ByteString (pack, getContents)
    import qualified Data.ByteString.Char8 as Char8 (pack)

    import Data.Word (Word8)
    import Data.ByteString.Internal (c2w)

    import Text.Parsec ((<|>), (<?>), parse, try, endBy, sepBy, many)
    import Text.Parsec.ByteString
    import Text.Parsec.Prim (tokens, tokenPrim)
    import Text.Parsec.Pos (updatePosChar, updatePosString)
    import Text.Parsec.Error (ParseError)

    csvFile :: Parser [[ByteString]]
    csvFile = endBy line eol

    line :: Parser [ByteString]
    line = sepBy cell (char ',')

    cell :: Parser ByteString
    cell = quotedCell <|> liftM ByteString.pack (many (noneOf ",\n\r"))

    quotedCell :: Parser ByteString
    quotedCell =
        do _ <- char '"'
           content <- many quotedChar
           _ <- char '"' <?> "quote at end of cell"
           return (ByteString.pack content)

    quotedChar :: Parser Word8
    quotedChar =
            noneOf "\""
        <|> try (string "\"\"" >> return (c2w '"'))

    eol :: Parser ByteString
    eol =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"

    parseCSV :: ByteString -> Either ParseError [[ByteString]]
    parseCSV = parse csvFile "(unknown)"

    main :: IO ()
    main =
        do c <- ByteString.getContents
           case parse csvFile "(stdin)" c of
                Left e -> do Prelude.putStrLn "Error parsing input:"
                             print e
                Right r -> mapM_ print r

    -- replacements for some of the functions in the Parsec library

    noneOf :: String -> Parser Word8
    noneOf cs = satisfy (\b -> b `notElem` [c2w c | c <- cs])

    char :: Char -> Parser Word8
    char c = byte (c2w c)

    byte :: Word8 -> Parser Word8
    byte c = satisfy (==c) <?> show [c]

    satisfy :: (Word8 -> Bool) -> Parser Word8
    satisfy f = tokenPrim (\c -> show [c])
                          (\pos c _cs -> updatePosChar pos c)
                          (\c -> if f (c2w c) then Just (c2w c) else Nothing)

    string :: String -> Parser ByteString
    string s = liftM Char8.pack (tokens show updatePosString s)

Probably your concern, efficiency-wise, should be the two ByteString.pack calls, in the definitions of cell and quotedCell. You might try to replace the Text.Parsec.ByteString module so that, instead of "making strict ByteStrings an instance of Stream with Char token type," you make ByteStrings an instance of Stream with Word8 token type. But this won't help you with efficiency; it will just give you a headache trying to reimplement all the SourcePos functions to keep track of your position in the input for error messages.
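
To make that headache concrete, here is a rough sketch of what a Word8-token stream could look like. It is an illustration under assumptions, not something from Parsec itself: the newtype Bytes and the helper byteSatisfy are hypothetical names, and the newtype is there because Parsec already provides a Stream instance for ByteString with Char tokens. Position tracking is reduced to a bare column counter, which is exactly the part you would have to reimplement properly.

```haskell
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

import qualified Data.ByteString as B
import Data.Word (Word8)
import Text.Parsec.Pos (incSourceColumn)
import Text.Parsec.Prim (Parsec, Stream (..), many, parse, tokenPrim)

-- Hypothetical wrapper, so this instance does not clash with Parsec's own
-- ByteString instance (which uses Char tokens).
newtype Bytes = Bytes B.ByteString

instance Monad m => Stream Bytes m Word8 where
  uncons (Bytes b) =
    return (fmap (\(w, rest) -> (w, Bytes rest)) (B.uncons b))

-- Hypothetical primitive: accept one byte that satisfies the predicate.
-- The position update only bumps the column; there is no line tracking,
-- so error positions are cruder than with the Char-based instance.
byteSatisfy :: (Word8 -> Bool) -> Parsec Bytes () Word8
byteSatisfy p =
  tokenPrim show
            (\pos _ _ -> incSourceColumn pos 1)
            (\w -> if p w then Just w else Nothing)
```

For instance, parse (many (byteSatisfy (/= 44))) "" (Bytes (B.pack [97, 98, 44])) should stop at the comma byte (44) and return Right [97,98].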
No, the way to make it more efficient would be to change the types of cell, quotedCell and string to Parser [Word8], and the types of line and csvFile to Parser [[Word8]] and Parser [[[Word8]]] respectively. You could even change the type of eol to Parser (). The necessary changes would look something like this:

    cell :: Parser [Word8]
    cell = quotedCell <|> many (noneOf ",\n\r")

    quotedCell :: Parser [Word8]
    quotedCell =
        do _ <- char '"'
           content <- many quotedChar
           _ <- char '"' <?> "quote at end of cell"
           return content

    -- 'tokens' yields a parser, so map c2w over its result
    -- rather than using a list comprehension on the parser itself
    string :: String -> Parser [Word8]
    string s = liftM (map c2w) (tokens show updatePosString s)

You don't need to worry about all the calls to c2w as far as efficiency is concerned, because they cost nothing.
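
The eol :: Parser () variant is mentioned above but not spelled out. A minimal sketch of one way to write it follows. For brevity it assumes the stock Char-based combinators from the first version rather than the Word8 replacements, and it uses void from Control.Monad to discard the matched text.

```haskell
import Control.Monad (void)
import qualified Data.ByteString.Char8 as C
import Text.Parsec (parse, string, try, (<?>), (<|>))
import Text.Parsec.ByteString (Parser)

-- Recognise any of the four line endings and throw the matched text away,
-- since callers such as endBy only need to know that a line ended.
eol :: Parser ()
eol = void (   try (string "\n\r")
           <|> try (string "\r\n")
           <|> string "\n"
           <|> string "\r"
           <?> "end of line" )
```

Since endBy ignores the result of its separator parser, csvFile = endBy line eol should work unchanged with this definition.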
If this doesn't answer your question, please say what would.