Separate keywords and @ mentions from dataset

Views: 194 Published: 2017/4/2 13:45:23 Tags: python regex pandas dataset data-cleaning

Problem description

I have a large data set with several columns and about 10k rows spread across more than 100 CSV files. For now I only care about one column, which holds messages, and I want to extract two parameters from it. I searched extensively and found two solutions that come close, but not close enough to solve the problem here: ONE & TWO

Input: the column is named "Text", and every message is a separate row in the CSV.

"Let's Bounce!ðŸ˜‰  #[message_1]

 Loving the energy &amp; Microphonic Mayhem whileâ€¦" #[message_2]

RT @IVijayboi: #[message_3]   @Bdutt@sardesairajdeep@rahulkanwal@abhisarsharma@ppbajpayi@Abpnewd@Ndtv@Aajtak#Jihadimedia@Ibn7 happy #PresstitutesDay

 "RT @RakeshKhatri23: MY LIFE #[message_4]

        WITHOUT YOU 

        IS

        LIKE 

        FLOWERS WITHOUT 

        FRAGRANCE ðŸ’žðŸ’ž

        ~True Love~"


  Me &amp; my baby ðŸ¶â¤ï¸ðŸ‘ @ Home Sweet Home  #[message_5]

The input is a CSV file with several other columns, but I am interested only in this one. I want to separate the @name and #keyword values from the input into new columns, like this:

Expected output:

text, mentions, keywords 
[message], NAN, NAN
[message], NAN, NAN
[message], @IVijayboi, #Jihadimedia  
           @Bdutt      #PresstitutesDay
           @sardesairajdeep 
           @rahulkanwal 
           @abhisarsharma 
           @ppbajpayi 
           @Abpnewd 
           @Ndtv 
           @Aajtak  
           @Ibn7

As the input shows, the first and second messages contain no @ or #, so their column values are NAN, whereas the third message has 10 @ mentions and 2 # keywords.

In simple words: how do I separate the @ mentioned names and # keywords from each message into separate columns?

Solution

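I suspect you want to use a regular expression. I don't know the exact format that your @ mentions and # keywords are allowed to take, but I would guess that something of the form @([a-zA-Z0-9]+) would work.

#!/usr/bin/env python3
import re

# Sample input: a header line followed by three messages from the "Text" column.
test_string = """Text
"Let's Bounce!ðŸ˜‰
Loving the energy &amp; Microphonic Mayhem whileâ€¦"
RT @IVijayboi: etc etc"""

# The greedy character class already stops at the first character that is not
# a letter or digit, so no trailing [^a-zA-Z0-9] is needed. Requiring one would
# miss a mention at the very end of the text and would swallow the "@" of a
# directly adjacent mention such as "@Bdutt@sardesairajdeep".
mention_match = re.compile(r'@([a-zA-Z0-9]+)')
for match in mention_match.finditer(test_string):
    print(match.group(1))  # prints IVijayboi

hashtag_match = re.compile(r'#([a-zA-Z0-9]+)')
for match in hashtag_match.finditer(test_string):
    print(match.group(1))  # this sample has no hashtags, so nothing is printed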

Hopefully that gives you enough to get started with.
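To go from printing matches to the per-row columns the question asks for, the same idea can be applied to each row of a CSV file. The following is only a minimal sketch under stated assumptions: the helper name add_columns, the id column, and the in-memory sample file are hypothetical, underscores are assumed to be allowed in names, and matches are joined with spaces, with the string "NAN" used when a message has none.

```python
import csv
import io
import re

# Assumed formats: the marker followed by letters, digits or underscores.
MENTION = re.compile(r"@[A-Za-z0-9_]+")
KEYWORD = re.compile(r"#[A-Za-z0-9_]+")

def add_columns(rows, text_column="Text"):
    """Yield each row dict with 'mentions' and 'keywords' columns added."""
    for row in rows:
        text = row.get(text_column) or ""
        # Join all matches with a space; fall back to "NAN" when there are none.
        row["mentions"] = " ".join(MENTION.findall(text)) or "NAN"
        row["keywords"] = " ".join(KEYWORD.findall(text)) or "NAN"
        yield row

# Hypothetical stand-in for one CSV file; a real run would use open(path, newline="").
sample_csv = io.StringIO(
    'id,Text\n'
    '1,"Let\'s Bounce! #[message_1]"\n'
    '2,"RT @IVijayboi: #[message_3] @Bdutt@sardesairajdeep#Jihadimedia@Ibn7 happy #PresstitutesDay"\n'
    '3,"Me & my baby @ Home Sweet Home"\n'
)

result = list(add_columns(csv.DictReader(sample_csv)))
for row in result:
    print(row["id"], row["mentions"], row["keywords"], sep=" | ")
```

Note that "#[message_1]" and the lone "@ " in the third row are not matched, because the marker is not directly followed by a letter, digit or underscore. If the data is already loaded in pandas, df["Text"].str.findall(pattern) returns the same per-row lists of matches.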
