Indexing static HTML blob storage content with Azure Cognitive Search is not working as expected

Problem description

I'm working on indexing static HTML content in blob storage. The documentation states that preprocessing analyzers will strip surrounding HTML tags when indexing content from that data source. However, our content value is always the entire raw HTML document. I'm also unable to pull out the value of our "meta description" tags. According to the documentation on Indexing Blob Storage, HTML content should automatically produce a metadata_description property, but the value is always null.

I've tried many different indexer configurations, but thus far have not been able to tell if I have something misconfigured or if Azure Search doesn't recognize the content type properly.

All of the files in blob storage have a .html file extension, and the Content Type column shows text/html.

This is the indexer configuration (some bits <redacted>):

{
  "@odata.context": "https://<instance>.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"<tag>\"",
  "name": "<name>",
  "description": null,
  "dataSourceName": "<datasource name>",
  "skillsetName": null,
  "targetIndexName": "<target index>",
  "disabled": null,
  "schedule": {
    "interval": "PT2H",
    "startTime": "0001-01-01T00:00:00Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "text",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_description",
      "targetFieldName": "description",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url",
      "mappingFunction": {
        "name": "extractTokenAtPosition",
        "parameters": {
          "delimiter": "<delimiter>",
          "position": 1
        }
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null
}

Solution

This is likely caused by the "parsingMode": "text" setting in your indexer configuration.

This parsing mode extracts the literal text of each document. For an HTML file, that includes all of the HTML tags.

Change the setting to "parsingMode": "default" to strip the HTML tags from your documents.
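
As a rough sketch of that change, the "configuration" block from the question might look like the fragment below once the parsing mode is switched. Only the "parsingMode" value differs from the posted indexer; the other settings are copied from the question unchanged, and the rest of the indexer definition is assumed to stay as it was.

```json
{
  "parameters": {
    "configuration": {
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  }
}
```

Blobs that were already indexed under the old setting may not be reprocessed until the indexer is reset and run again, so it is worth checking a few documents after the next full run.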
