Why doesn't runConduit send all the data?

Question

Here's some XML I'm parsing:

<?xml version="1.0" encoding="utf-8"?>
<data>
<row ows_Document='Weekly Report 10.21.2020'
     ows_Category='Weekly Report'/>
<row ows_Document='Daily Update 10.20.2020'
     ows_Category='Daily Update'/>
<row ows_Document='Weekly Report 10.14.2020'
     ows_Category='Weekly Report'/>
<row ows_Document='Weekly Report 10.07.2020'
     ows_Category='Weekly Report'/>
<row ows_Document='Spanish: Reporte Semanal 07.10.2020' 
     ows_Category='Weekly Report'/>
</data>

I've been trying to figure out how to get the conduit parser to reject records unless ows_Category is Weekly Report and ows_Document doesn't contain Spanish. At first, I used a dummy value (in parseDoc' below) to filter them out after parsing, but then I realized I should be able to use Maybe (in the otherwise identical parseDoc below), together with join to collapse my Maybe layer into the one used by the tag' event parser, which fails based on name or attribute matches. It compiles, but behaves bizarrely, apparently not even trying to send certain elements to the parser! How could this be?

{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import           Control.Monad
import qualified Data.ByteString.Lazy.Char8 as L8
import           Data.Foldable
import           Data.String
import qualified Data.Text                  as T
import           Data.XML.Types
import           Text.XML.Stream.Parse

newtype Doc = Doc
  { name :: String
  } deriving (Show)

main :: IO ()
main = do
  r <- L8.readFile "oha.xml"

  let doc = Doc . T.unpack
      check (x,y) a b = if y == "Weekly Report" && not (T.isInfixOf "Spanish" x) then a else b

      t :: (MonadThrow m, MonadIO m) => ((T.Text, T.Text) -> ConduitT Event o m c)
                                     -> ConduitT Event o m (Maybe c)
      t f = tag' "row" ((,) <$> requireAttr "ows_Document" <*> requireAttr "ows_Category") $ \x -> do
        liftIO $ print x
        f x

      parseDoc, parseDoc' :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
      parseDoc  = (join <$>) . t $ \z@(x,_) -> return $       check z (Just $ doc x)  Nothing -- this version doesn't get sent all of the data! why!?!?
      parseDoc' =              t $ \z@(x,_) -> return $ doc $ check z             x $ T.pack bad -- dummy value

      parseDocs :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
                                             -> ConduitT Event o m [Doc]
      parseDocs = f tagNoAttr "data" . many'
      f g n = force (n <> " required") . g (fromString n)

      go p = runConduit $ parseLBS def r .| parseDocs p
      bad = "no good"

  traverse_ print =<<                              go parseDoc
  putStrLn ""
  traverse_ print =<< filter ((/= bad) . name) <$> go parseDoc'

Output: notice how parseDoc isn't even sent one of the records (one that should succeed, from 10.14), while parseDoc' behaves as expected:

("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.07.2020"}

("Weekly Report 10.21.2020","Weekly Report")
("Daily Update 10.20.2020","Daily Update")
("Weekly Report 10.14.2020","Weekly Report")
("Weekly Report 10.07.2020","Weekly Report")
("Spanish: Reporte Semanal 07.10.2020","Weekly Report")
Doc {name = "Weekly Report 10.21.2020"}
Doc {name = "Weekly Report 10.14.2020"}
Doc {name = "Weekly Report 10.07.2020"}

When I tried simplifying further by removing everything to do with ows_Category, parseDoc suddenly worked fine, which seems to establish the soundness of the idea. When I instead removed everything to do with ows_Document, the problem remained.

I suspect I'm supposed to be doing this with requireAttrRaw, but I haven't been able to make sense of it and can't find docs or examples.

Does this have to do with Applicative? Now that I think about it, it shouldn't be able to fail based on examining values, right?

Update

I found this answer from the author for a previous version of the library, which includes the intriguing force "fail msg" $ return Nothing in a similar situation, but that abandons all parsing instead of just failing the current parse.

This comment suggests I need to throw an exception, and in the source, they use something like lift $ throwM $ XmlException "failed check" $ Just event, but like force ... return Nothing, this kills all parsing instead of just the current parser. Also, I don't know how to get my hands on the event.

Here's a merged pull request claiming to have addressed this issue, but it doesn't discuss how to use it, only that it is "trivial" :)

Answer

To make the answer explicit:

  -- Succeeds (returning the document name) only for non-Spanish weekly reports.
  parseAttributes :: AttrParser T.Text
  parseAttributes = do
    d <- requireAttr "ows_Document"
    c <- requireAttr "ows_Category"
    ignoreAttrs
    guard $ not (T.isInfixOf "Spanish" d) && c == "Weekly Report"
    return d

  parseDoc :: (MonadThrow m, MonadIO m) => ConduitT Event o m (Maybe Doc)
  parseDoc = tag' "row" parseAttributes $ return . doc

Or, since in this case the attribute values can be checked independently:

  parseAttributes = requireAttrRaw' "ows_Document" (not . T.isInfixOf "Spanish")
                 <* requireAttrRaw' "ows_Category" ("Weekly Report" ==)
                 <* ignoreAttrs
    where requireAttrRaw' n f = requireAttrRaw ("required attr value failed condition: " <> n) $ \(n',as) ->
            asum $ (\(ContentText a) -> guard (n' == fromString n && f a) *> pure a) <$> as

But the latter leaves open these questions regarding requireAttrRaw:

  • Shouldn't we need to know the namespace if we're in charge of verifying Name?
  • Why does requireAttrRaw send us [Content] instead of two Maybe Content, one each for ContentText and ContentEntity?
  • What are we supposed to do with ContentEntity "For pass-through parsing"?

Recommended answer

tl;dr In tag' "row" parseAttributes parseContent, the check function belongs in parseAttributes, not in parseContent.

xml-conduit is (notably) designed around the following invariants:

  1. When parsers are of type ConduitT Event o m (Maybe a), the Maybe layer encodes whether Events have been consumed.
  2. tag' parseName parseAttributes parseContent consumes Events if and only if both parseName and parseAttributes succeed.
  3. tag' parseName parseAttributes parseContent runs parseContent if and only if both parseName and parseAttributes succeed.

In parseDoc:

  • The check function is called in the parseContent part; at this stage, tag' is already committed to consuming Events, as per invariant 2.
  • A stack of 2 Maybe layers is joined together:
    • the output of the check function, which encodes whether the current <row/> element is relevant
    • the "standard" Maybe layer from the tag' signature, which encodes whether Events have been consumed, as per invariant 1

This essentially breaks invariant 1: when check returns Nothing, parseDoc returns Nothing despite having consumed the Events of the whole <row/> element. This results in undefined behavior in all combinators of xml-conduit, notably many' (analyzed below).
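To see why joining the two layers loses information, here is a tiny sketch using only base. The names consumedButRejected and notConsumed are made up for illustration; the outer Maybe stands for "did tag' consume the element?" and the inner one for "did check accept it?".

```haskell
import Control.Monad (join)

-- Outer Maybe: were Events consumed? Inner Maybe: did check accept the row?
-- (Illustrative values only; these are not xml-conduit types.)
consumedButRejected :: Maybe (Maybe String)
consumedButRejected = Just Nothing

notConsumed :: Maybe (Maybe String)
notConsumed = Nothing

main :: IO ()
main = do
  -- After join, the two situations can no longer be told apart.
  print (join consumedButRejected == join notConsumed)
```

Any combinator that reads Nothing as "nothing was consumed" is therefore misled in the first case.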

The many' combinator relies on invariant 1 to do its job. Its definition amounts to:

  • try consumer
  • if consumer returns Nothing, skip the next element or content using ignoreAnyTreeContent, on the assumption that consumer has not consumed it, then try consumer again

In your case, consumer returns Nothing for the Daily Update 10.20.2020 item, even though the complete <row/> element has been consumed. Therefore, ignoreAnyTreeContent is run as a means to skip that particular <row/>, but actually ends up skipping the next one instead (Weekly Report 10.14.2020).
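The skipping behavior can be reproduced with a toy model that uses only base. This is a sketch under loud assumptions: Parser, manyModel, skipOne, consumeThenCheck and checkThenConsume are invented stand-ins, rows replace the real Event stream, and none of this is xml-conduit's actual implementation.

```haskell
import Data.List (isInfixOf)

-- Toy model, NOT xml-conduit's real types: a "parser" takes the remaining
-- rows and returns a result together with the rows it left unconsumed.
type Row = (String, String)              -- (ows_Document, ows_Category)
type Parser a = [Row] -> (Maybe a, [Row])

relevant :: Row -> Bool
relevant (d, c) = c == "Weekly Report" && not ("Spanish" `isInfixOf` d)

-- Like the question's parseDoc: always consumes the row, then may say Nothing.
consumeThenCheck :: Parser String
consumeThenCheck []       = (Nothing, [])
consumeThenCheck (r : rs) = (if relevant r then Just (fst r) else Nothing, rs)

-- Like the fixed parseDoc: consumes the row only when the check passes.
checkThenConsume :: Parser String
checkThenConsume (r : rs) | relevant r = (Just (fst r), rs)
checkThenConsume rs                    = (Nothing, rs)

-- Stand-in for ignoreAnyTreeContent: skip one row if there is one.
skipOne :: Parser ()
skipOne []       = (Nothing, [])
skipOne (_ : rs) = (Just (), rs)

-- Stand-in for many': try the consumer; on Nothing, skip one row and retry;
-- stop when there is nothing left to skip.
manyModel :: Parser a -> [Row] -> [a]
manyModel consumer input = case consumer input of
  (Just x, rest)  -> x : manyModel consumer rest
  (Nothing, rest) -> case skipOne rest of
    (Just (), rest') -> manyModel consumer rest'
    (Nothing, _)     -> []

rows :: [Row]
rows =
  [ ("Weekly Report 10.21.2020", "Weekly Report")
  , ("Daily Update 10.20.2020", "Daily Update")
  , ("Weekly Report 10.14.2020", "Weekly Report")
  , ("Weekly Report 10.07.2020", "Weekly Report")
  , ("Spanish: Reporte Semanal 07.10.2020", "Weekly Report")
  ]

main :: IO ()
main = do
  print (manyModel consumeThenCheck rows)  -- loses the 10.14 report
  print (manyModel checkThenConsume rows)  -- keeps all three weekly reports
```

In the first run, the consumer swallows the Daily Update row and reports Nothing, so the skip step discards the 10.14 row that follows it, which matches the output shown in the question.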

Move the check logic to the parseAttributes part, so that Event consumption becomes coupled to whether check passes.
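For reference, a standalone sketch of that fix might look as follows. It assumes the conduit and xml-conduit packages are installed; rowAttrs, sample and weeklyReports are names chosen here for illustration, and the sample document is the question's XML with each row written on one line.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Conduit
import           Control.Monad              (guard)
import qualified Data.ByteString.Lazy.Char8 as L8
import qualified Data.Text                  as T
import           Data.XML.Types             (Event)
import           Text.XML.Stream.Parse

newtype Doc = Doc { name :: String } deriving (Show, Eq)

-- The check lives in the attribute parser, so a rejected <row/> makes tag'
-- return Nothing without consuming the element.
rowAttrs :: AttrParser T.Text
rowAttrs = do
  d <- requireAttr "ows_Document"
  c <- requireAttr "ows_Category"
  ignoreAttrs
  guard $ c == "Weekly Report" && not ("Spanish" `T.isInfixOf` d)
  return d

parseDoc :: MonadThrow m => ConduitT Event o m (Maybe Doc)
parseDoc = tag' "row" rowAttrs $ return . Doc . T.unpack

parseDocs :: MonadThrow m => ConduitT Event o m [Doc]
parseDocs = force "data required" $ tagNoAttr "data" $ many' parseDoc

-- Hypothetical helper: parse a whole document held in memory.
weeklyReports :: L8.ByteString -> IO [Doc]
weeklyReports xml = runConduit $ parseLBS def xml .| parseDocs

sample :: L8.ByteString
sample = L8.unlines
  [ "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
  , "<data>"
  , "<row ows_Document='Weekly Report 10.21.2020' ows_Category='Weekly Report'/>"
  , "<row ows_Document='Daily Update 10.20.2020' ows_Category='Daily Update'/>"
  , "<row ows_Document='Weekly Report 10.14.2020' ows_Category='Weekly Report'/>"
  , "<row ows_Document='Weekly Report 10.07.2020' ows_Category='Weekly Report'/>"
  , "<row ows_Document='Spanish: Reporte Semanal 07.10.2020' ows_Category='Weekly Report'/>"
  , "</data>"
  ]

main :: IO ()
main = mapM_ print =<< weeklyReports sample
```

If the analysis above holds, rejected rows are left for many' to skip, so the three English weekly reports should come through, including the 10.14 one.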
