
Selecting words/patterns from sentences


soltek


Hello,

 

 

I'm looking for a way to extract a couple of words from a sentence. For example:

«There was a man who had 13 legs and looked nice»

«My pocket had 41 coins before I spent them»

 

I want to keep only the part of each sentence that comes before the «had [number]», so it becomes:

«There was a man»

«My pocket»

 

And also to select everything from the «had» through the number, so it'd become:

«had 13»

«had 41»

 

Any idea, mates?

Thank you.


You contradicted yourself. You said you wanted everything before the word "had", but your first expected result was "«There was a man»". What about the word "who"?

 

But based upon your explanation, what you want is fairly simple:

function splitString($string)
{
    if(preg_match("#(.*?)(had \d*)#", $string, $matches) > 0)
    {
        unset($matches[0]);
        return $matches;
    }
    //Return false if there was no match
    return false;
}

$parts = splitString('There was a man who had 13 legs and looked nice');
print_r($parts);

$parts = splitString('My pocket had 41 coins before I spent them');
print_r($parts);

 

Output

Array
(
    [1] => There was a man who 
    [2] => had 13
)
Array
(
    [1] => My pocket 
    [2] => had 41
)


You're right, mjdamato.

And thanks!

 

That function seems pretty awesome, can you give me some links that explain this part - "#(.*?)(had \d*)#" ?

I'd like to understand them for the next time I need something similar.


That function seems pretty awesome, can you give me some links that explain this part - "#(.*?)(had \d*)#" ?

I'd like to understand them for the next time I need something similar.

 

That is a "Regular Expression". I can explain THAT regular expression, but really understanding them will take quite a bit of research on your part. In my opinion, Regular Expressions are one of the most powerful features in programming, but also one of the hardest to master.

 

In the above regular expression the hash marks (#) are just delimiters and do not define anything to match on. The parentheses are capture groups: whatever each pair matches is stored in $matches. The "(.*?)" matches ANY characters. The period is a wildcard for any character, the * means to match 0 or more occurrences of it, and the ? makes the match non-greedy. That means it will stop matching as soon as a match is found for the next part of the pattern.

 

The next part is "(had \d*)", which will match the letters "had" followed by a space. Immediately following the space there must be 0 or more digits (\d matches any digit and the * matches 0 or more). On second thought, I should have used a +, which matches one or more occurrences.
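To make the difference between * and + concrete, here is a minimal sketch. The helper name hadPart() and the sentence 'She had no idea' are made up for illustration and are not part of the function posted above.

```php
<?php
// Hypothetical helper for illustration only:
// returns the text captured by group 1, or null when nothing matches.
function hadPart(string $pattern, string $sentence): ?string
{
    return preg_match($pattern, $sentence, $m) === 1 ? $m[1] : null;
}

$star = '#(had \d*)#'; // "had ", then zero or more digits
$plus = '#(had \d+)#'; // "had ", then at least one digit

// With *, "had " alone is enough, so a sentence without a number still matches.
var_dump(hadPart($star, 'She had no idea'));        // string(4) "had "

// With +, at least one digit is required, so the same sentence does not match.
var_dump(hadPart($plus, 'She had no idea'));        // NULL

// A sentence that does contain a number matches with either quantifier.
var_dump(hadPart($plus, 'My pocket had 41 coins')); // string(6) "had 41"
```

So for input that always contains "had [number]" both versions behave the same; + only matters for sentences where "had" is not followed by a number.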


I went to a coffee shop and took the example given + the information on the first link to understand, or try to, and... jeez, I had no idea. Looks really powerful, organized and precise. I read the basic regular expressions and that helped to understand this part (had \d*). And yeah, I got kinda confused 'cause the + would have made more sense :)

If I have #had \d{2}+# -> It would match only «had [two_digits_number]», correct?

 

 

But I didn't get the first part: (.*?)

I mean, using the link you guys shared with me as a guide, I understood what it means in words, but why do I need to select everything?

And also, how did the «There was a man who» get matched?

 

 


If I have #had \d{2}+# -> It would match only «had [two_digits_number]», correct?

 

Actually, I'm not sure how that would be interpreted. You don't need the '+'. The '+' as a quantifier means to match one or more occurrences. But you are already using '{2}', which means to match exactly two occurrences. So, if you removed the '+', then yes, it would match 'had [two_digits_number]'.
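For what it's worth, PCRE (the engine behind preg_match) treats a '+' placed directly after another quantifier as "possessive", so \d{2}+ is still exactly two digits. Below is a small sketch to check this; the helper name firstMatch() and the test sentences are assumptions made up for the example. It also shows that \d{2} on its own will happily match the first two digits of a longer number, and that adding a \b word boundary is one way to avoid that.

```php
<?php
// Hypothetical helper for illustration only:
// returns the whole matched text, or null when nothing matches.
function firstMatch(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

// {2}+ is accepted by PCRE and still means exactly two digits here.
var_dump(firstMatch('#had \d{2}+#', 'had 13 legs'));    // string(6) "had 13"

// Without a boundary, the first two digits of "123" are enough for a match.
var_dump(firstMatch('#had \d{2}#', 'had 123 legs'));    // string(6) "had 12"

// \b requires the number to end after two digits, so "123" is rejected...
var_dump(firstMatch('#had \d{2}\b#', 'had 123 legs'));  // NULL

// ...while a real two-digit number still matches.
var_dump(firstMatch('#had \d{2}\b#', 'had 13 legs'));   // string(6) "had 13"
```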

 

But I didnt get the first part: (.*?)

 

Hmm, I thought I explained that. The period is a wildcard for any character. The asterisk is a quantifier that says to match 0 or more occurrences. By default the '*' is greedy, which means it will match as many characters as it can. Even digits are "any character", so the * would first run all the way to the end of the string and only give back as much as is needed for the rest of the pattern to match. But you only want to match up to where the word 'had' starts. So we add the '?' to make the '*' non-greedy, so it will match as few characters as possible. In this case it will match every character up to the first 'had [number]', which is how 'There was a man who ' ends up in the first group.
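The greedy versus non-greedy difference only shows up when there is more than one place the rest of the pattern could match. Here is a minimal sketch; the sentence with two "had [number]" parts is made up for the example, and \d+ is used instead of \d* as suggested earlier in the thread.

```php
<?php
// Made-up sentence with two "had [number]" parts.
$sentence = 'A man who had 13 legs had 2 hats';

// Non-greedy: (.*?) takes as little as possible, so it stops at the FIRST "had 13".
preg_match('#(.*?)(had \d+)#', $sentence, $lazy);

// Greedy: (.*) takes as much as possible, so it only stops at the LAST "had 2".
preg_match('#(.*)(had \d+)#', $sentence, $greedy);

echo $lazy[1], '|', $lazy[2], "\n";     // A man who |had 13
echo $greedy[1], '|', $greedy[2], "\n"; // A man who had 13 legs |had 2
```

With only one "had [number]" in the sentence, as in the original examples, both versions return the same parts.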

 
