C strip html between <...>

Question

How can I strip everything between and including the <...> tags in an HTML document using C? My current program uses curl to fetch the contents of a web page and writes them to a text file. It then reads the text file and removes the < and > characters, but I am unsure how to remove everything between them.

#include <curl/curl.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define WEBPAGE_URL "http://homepages.paradise.net.nz/adrianfu/index.html"
#define DESTINATION_FILE "/home/user/data.txt"

/* libcurl write callback: append the received bytes to the FILE* in stream */
size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
    return fwrite(ptr, size, nmemb, (FILE *)stream);
}

int main(void)
{
    int i, nRead, fd;
    char buf[1024];

    FILE *file = fopen(DESTINATION_FILE, "w+");
    if (file == NULL) {
        fputs("File error\n", stderr);
        exit(1);
    }

    /* Download the page into DESTINATION_FILE */
    CURL *handle = curl_easy_init();
    curl_easy_setopt(handle, CURLOPT_URL, WEBPAGE_URL); /* Using the http protocol */
    curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_data);
    curl_easy_setopt(handle, CURLOPT_WRITEDATA, file);
    curl_easy_perform(handle);
    curl_easy_cleanup(handle);
    fclose(file); /* flush the download to disk before reading it back */

    /* Read back the same file that was just written */
    if ((fd = open(DESTINATION_FILE, O_RDONLY)) == -1) {
        printf("Cannot open the file\n");
        return 1;
    }

    /* Only the first 1024 bytes of the page are read */
    nRead = read(fd, buf, sizeof buf);
    printf("Original String ");
    for (i = 0; i < nRead; i++) {
        printf("%c", buf[i]);
    }

    printf("\nReplaced String ");
    for (i = 0; i < nRead; i++) {
        /* Blanks out only the angle brackets; the text between them remains */
        if (buf[i] == '<' || buf[i] == '>') {
            buf[i] = ' ';
        }
        printf("%c", buf[i]);
    }
    close(fd);

    return 0;
}

Recommended answer

Here is just the code that removes everything between '<' and '>', brackets included. It assumes well-formed HTML, meaning that no tag is nested inside the declaration of another, as in <html <body> >. I am changing only a small portion of your code. I also remove the tags from the buf variable instead of replacing the unwanted characters with spaces, because I think this will be more useful to you (correct me if I am wrong).

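int idx = 0;    // next write position in buf
int opened = 0; // false: not currently inside a tag
for(i=0; i<nRead; i++)
{
    if(buf[i]=='<') {
        opened = 1; // true: skip everything until the closing '>'
    } else if (buf[i] == '>') {
        opened = 0; // false: the tag has ended
    } else if (!opened) {
        buf[idx++] = buf[i]; // keep text outside tags, compacting in place
    }
}
// idx can equal nRead, so buf needs one spare byte for the terminator:
// read at most sizeof(buf) - 1 bytes into it.
buf[idx] = '\0';
printf("%s\n", buf);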
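
The loop above works on a single buffer, so a tag that straddles two read() calls would be cut in half. As a minimal sketch, assuming the same well-formed HTML as the answer, the same logic can be wrapped in a function that keeps the "inside a tag" flag between calls. The names strip_tags_chunk and strip_tags are hypothetical helpers introduced here for illustration; they are not part of the original answer or of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): removes every
 * '<...>' span from buf[0..len) in place and returns the new length.
 * *in_tag carries the "inside a tag" state from one chunk to the next,
 * so a tag split across two reads is still removed. Like the answer's
 * loop, it also drops a stray '>' that appears outside any tag. */
size_t strip_tags_chunk(char *buf, size_t len, int *in_tag)
{
    size_t out = 0;

    assert(buf != NULL && in_tag != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '<') {
            *in_tag = 1;            /* start skipping */
        } else if (buf[i] == '>') {
            *in_tag = 0;            /* tag closed, resume copying */
        } else if (!*in_tag) {
            buf[out++] = buf[i];    /* keep text outside tags */
        }
    }
    return out;
}

/* Convenience wrapper for a whole NUL-terminated string. */
void strip_tags(char *s)
{
    int in_tag = 0;
    size_t n = strip_tags_chunk(s, strlen(s), &in_tag);

    s[n] = '\0'; /* n <= strlen(s), so this stays inside the string */
}
```

In the question's program this would presumably be called in a loop: initialise an int flag to 0, read a chunk into buf, pass it with nRead to strip_tags_chunk, and write out only the returned number of bytes with fwrite before reading the next chunk. This is still plain character filtering, not HTML parsing: a '<' or '>' inside a comment, a script, or an attribute value will confuse it.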
