Splitting a string containing letters and numbers not separated by any particular delimiter in PHP


Question

I am currently developing a web application that fetches the Twitter stream, and I am trying to build my own natural language processing for it.

Since my data comes from Twitter (limited to 140 characters), many words are shortened or, as in this case, written with the space between them omitted.

For example:

"Hi, my name is Bob. I m 19yo and 170cm tall"

Should be tokenized to:

- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall

Notice that 19 and yo in 19yo have no space between them. I mostly use this to extract numbers together with their units.

Simply put, what I need is a way to 'explode' each token that contains a number into chunks of digits or letters, even though there is no delimiter between them.

123ABC ['123','ABC']

ABC123 ['ABC','123']

abc123xyz ['abc','123','xyz']

and so on.

What is the best way to achieve this in PHP?

I found something close to it, but it is in C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers

Recommended answer

You can use preg_split:

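$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
// Split on whitespace (with an optional preceding comma) and on the
// zero-width boundaries between a letter and a digit, in either direction.
$parts = preg_split("/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i", $string);
var_dump($parts);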

When matching the digit-letter boundary, the regular expression match must be zero-width: the characters themselves must not be included in the match, or they would be consumed as part of the delimiter. Zero-width lookarounds are useful for this.
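To illustrate the zero-width idea on its own, here is a minimal sketch that applies only the two lookaround alternatives to a single token. The helper name splitAlnum is hypothetical and not part of the original answer; it assumes ASCII letters and digits.

```php
<?php
// Hypothetical helper (not from the original answer): split one token at
// every letter/digit boundary. Both alternatives are pure lookarounds, so
// the match has zero width and no character is removed from the result.
function splitAlnum(string $token): array
{
    return preg_split('/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i', $token);
}

foreach (['123ABC', 'ABC123', 'abc123xyz'] as $token) {
    // Prints e.g. "abc123xyz => abc | 123 | xyz"
    echo $token, ' => ', implode(' | ', splitAlnum($token)), PHP_EOL;
}
```

A token with no letter/digit boundary, such as tall, comes back as a single-element array, so the helper can be applied to every token without checking first.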

http://codepad.org/i4Y6r6VS
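Note that the preg_split pattern leaves the period attached to "Bob." and keeps the original case. If lowercase tokens without punctuation are wanted, as in the list in the question, one possible alternative (a different technique from the answer above) is to collect the runs of letters and digits with preg_match_all instead of splitting. This is only a sketch: the function name tokenize is hypothetical, and it assumes ASCII-only text.

```php
<?php
// Hypothetical helper: collect runs of letters or runs of digits, so that
// punctuation and whitespace are simply skipped rather than split on.
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+|\d+/i', strtolower($text), $matches);
    return $matches[0];
}

$tokens = tokenize("Hi, my name is Bob. I m 19yo and 170cm tall");
echo implode(' ', $tokens), PHP_EOL;
// prints: hi my name is bob i m 19 yo and 170 cm tall
```

The result still contains is, m and and, which the list in the question leaves out; any such filtering would be a separate step that this sketch does not cover.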
