Separate keywords and @ mentions from dataset
I have a large data set with several columns and about 10k rows in more than 100 CSV files. For now I am concerned with only one column, which holds messages, and I want to extract two parameters from it. I searched extensively and found two solutions that seem close, but not close enough to solve this question: ONE & TWO.
Input: the column is named "Text", and every message is a separate row in the CSV.
"Let's Bounce!😉 #[message_1]
Loving the energy & Microphonic Mayhem while…" #[message_2]
RT @IVijayboi: #[message_3] @Bdutt@sardesairajdeep@rahulkanwal@abhisarsharma@ppbajpayi@Abpnewd@Ndtv@Aajtak#Jihadimedia@Ibn7 happy #PresstitutesDay
"RT @RakeshKhatri23: MY LIFE #[message_4]
WITHOUT YOU
IS
LIKE
FLOWERS WITHOUT
FRAGRANCE 💞💞
~True Love~"
Me & my baby ðŸ¶â¤ï¸ðŸ‘ @ Home Sweet Home #[message_5]
The input is a CSV file with several other columns, but I am interested only in this column. I want to separate the @name and #keyword from the input into new columns like this:
Expected output:
text, mentions, keywords
[message], NAN, NAN
[message], NAN, NAN
[message], @IVijayboi, #Jihadimedia
@Bdutt #PresstitutesDay
@sardesairajdeep
@rahulkanwal
@abhisarsharma
@ppbajpayi
@Abpnewd
@Ndtv
@Aajtak
@Ibn7
As we can see in the input, the first and second messages have no @ or #, so their column values are NAN, but the third message has 10 @ mentions and 2 # keywords.
In simple words: how do I separate the @ mentioned names and # keywords from the message into separate columns?
I suspect you want to use a regular expression. I don't know the exact format that your @ mentions and # keywords are allowed to take, but I would guess that something of the form @([a-zA-Z0-9]+)[^a-zA-Z0-9] would work.
#!/usr/bin/env python3
import re

# Sample input: the "Text" header followed by three messages.
test_string = """Text
"Let's Bounce!😉
Loving the energy & Microphonic Mayhem while…"
RT @IVijayboi: etc etc"""

# A name is a run of ASCII letters/digits; the trailing [^a-zA-Z0-9]
# consumes one more character after each name.
mention_match = re.compile('@([a-zA-Z0-9]+)[^a-zA-Z0-9]')
for match in mention_match.finditer(test_string):
    print(match.group(1))

hashtag_match = re.compile('#([a-zA-Z0-9]+)[^a-zA-Z0-9]')
for match in hashtag_match.finditer(test_string):
    print(match.group(1))
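One caveat with the pattern as guessed: because it requires (and consumes) one non-alphanumeric character after each name, it skips a mention at the very end of the string, and in back-to-back runs such as @Bdutt@sardesairajdeep it skips every second name. A minimal sketch of a variant without the trailing character class, still assuming that names contain only ASCII letters and digits (the helper name extract_tags is hypothetical):

```python
import re

# Assumption: names are runs of ASCII letters and digits only.
MENTION_RE = re.compile(r'@([a-zA-Z0-9]+)')
HASHTAG_RE = re.compile(r'#([a-zA-Z0-9]+)')

def extract_tags(text):
    """Return (mentions, hashtags) found in text, without the @ / # prefix."""
    return MENTION_RE.findall(text), HASHTAG_RE.findall(text)

mentions, hashtags = extract_tags(
    "RT @IVijayboi: @Bdutt@Ndtv#Jihadimedia@Ibn7 happy #PresstitutesDay"
)
print(mentions)
print(hashtags)
```

Since the character class is greedy, each name still stops at the first character that is not a letter or digit, so the "#[message_3]" style markers in the sample are not picked up as keywords.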
Hopefully that gives you enough to get started with.
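To get from matches to the extra columns the question asks for, one option is to read each file with the standard csv module and add two fields per row. The following is only a sketch under assumptions: the column is called "Text" as in the question, multiple hits are joined with a space, "NAN" marks rows with no hits, and the function name add_tag_columns and the in-memory sample are made up for illustration.

```python
import csv
import io
import re

# Assumption: names are runs of ASCII letters and digits only.
MENTION_RE = re.compile(r'@[a-zA-Z0-9]+')
HASHTAG_RE = re.compile(r'#[a-zA-Z0-9]+')

def add_tag_columns(rows, text_column="Text"):
    """Yield a copy of each row dict with 'mentions' and 'keywords' added."""
    for row in rows:
        text = row[text_column]
        out = dict(row)
        # Join multiple hits with a space; fall back to "NAN" when there are none.
        out["mentions"] = " ".join(MENTION_RE.findall(text)) or "NAN"
        out["keywords"] = " ".join(HASHTAG_RE.findall(text)) or "NAN"
        yield out

# Hypothetical in-memory CSV standing in for one of the real files.
sample = io.StringIO(
    'Text\n'
    '"Let\'s Bounce!"\n'
    '"RT @IVijayboi: @Bdutt@Ndtv#Jihadimedia@Ibn7 happy #PresstitutesDay"\n'
)
result = list(add_tag_columns(csv.DictReader(sample)))
for row in result:
    print(row["mentions"], "|", row["keywords"])
```

For real files, the StringIO object would be replaced by an opened file, and the resulting dicts could be written back out with csv.DictWriter.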