Split a string by spaces, preserving quoted substrings, in Python

269

I have a string like this:

this is "a test"

I want to write something in Python that splits it by spaces while ignoring spaces inside quotes. The result I'm looking for is:

['this','is','a test']

P.S. I know you are going to ask "what happens if there are quotes within the quotes?" In my application, that will never happen.

python regex

— Adam Pierce

1

Thank you for asking this question. It is exactly what I needed to fix the pypar build module.

— Martlark, 2010

393

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

— Jerub

13

Use posix=False to preserve the quotes: shlex.split('this is "a test"', posix=False) returns ['this', 'is', '"a test"']

— Boon

@MatthewG. The "fix" in Python 2.7.3 means that passing a Unicode string to shlex.split() triggers a UnicodeEncodeError exception.

— Rockallite
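
To make the difference between the two modes mentioned above concrete, here is a small sketch that runs shlex.split on the question's string with the default POSIX rules and with posix=False. The variable names are only for illustration.

```python
import shlex

text = 'this is "a test"'

# Default (POSIX) mode: quotes group words and are then removed.
posix_tokens = shlex.split(text)

# Non-POSIX mode: quotes still group words but stay in the token.
non_posix_tokens = shlex.split(text, posix=False)

print(posix_tokens)      # ['this', 'is', 'a test']
print(non_posix_tokens)  # ['this', 'is', '"a test"']
```

Which mode fits depends on whether the caller needs to know later that a token was quoted.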

57

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

— Allen

40

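I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?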

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

That said, shlex probably provides more features.

1

I was thinking much the same, but would suggest instead [t.strip('"') for t in re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')]

— Darius Bacon

2

+1 I'm using this because it was much faster than shlex.

— hanleyp, 2009

3

Why the triple backslash? Won't a simple backslash do the same?

— Doppelganger, 2011

1

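Actually, one thing I don't like about this is that anything directly before or after the quotes is not split properly. Given a string like 'PARAMS val1="Thing" val2="Thing2"', I expect it to split into 3 pieces, but it splits into 5. It has been a while since I've done regex, so I don't feel like trying to solve it with your solution right now.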

— leetNightshade 2013

1

You should use raw strings when using regular expressions.

— asmeurer 2013
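
On the backslash and raw-string comments above: the following sketch compares the answer's escaped pattern with a raw-string spelling of the same idea. The names escaped_pattern, raw_pattern and params are made up for this illustration. It also reproduces the 5-piece split described in the earlier comment.

```python
import re

test = 'this is "a test"'

# The pattern as written in the answer (regular string, heavy escaping).
escaped_pattern = "( |\\\".*?\\\"|'.*?')"
# An equivalent pattern as a raw string; no backslashes are needed at all.
raw_pattern = r"""( |".*?"|'.*?')"""

pieces_escaped = [p for p in re.split(escaped_pattern, test) if p.strip()]
pieces_raw = [p for p in re.split(raw_pattern, test) if p.strip()]

print(pieces_raw)                    # ['this', 'is', '"a test"']
print(pieces_escaped == pieces_raw)  # True

# Text glued to a quoted part becomes its own piece, hence 5 pieces here.
params = 'PARAMS val1="Thing" val2="Thing2"'
param_pieces = [p for p in re.split(raw_pattern, params) if p.strip()]
print(param_pieces)  # ['PARAMS', 'val1=', '"Thing"', 'val2=', '"Thing2"']
```

Note that this approach keeps the surrounding quotes in the result, unlike the output requested in the question.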

29

Depending on your use case, you may also want to check out the csv module.

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']

— Ryan Ginstrom

2

Useful when shlex strips out characters that you need.

— scraplesh, 2013

1

CSV uses two double quotes in a row (side by side, as in "") to represent one double quote ("), so it will turn two double quotes into a single one: 'this is "a string""' and 'this is "a string"""' will both map to ['this', 'is', 'a string"']

— Boris
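
The quoting rules discussed in these comments can be tried out with a small wrapper. This is only a sketch: csv_split here is a hypothetical helper for this illustration, and the quotechar argument is the standard csv.reader formatting parameter.

```python
import csv

def csv_split(line, quotechar='"'):
    """Split one line on single spaces using the csv module's quoting rules."""
    return next(csv.reader([line], delimiter=" ", quotechar=quotechar))

print(csv_split('this is "a string"'))            # ['this', 'is', 'a string']
# A doubled quote inside a quoted field stands for one literal quote.
print(csv_split('say "he said ""hi"""'))          # ['say', 'he said "hi"']
# Single quotes are ordinary characters unless quotechar is changed.
print(csv_split("'abc def' ghi"))                 # ["'abc", "def'", 'ghi']
print(csv_split("'abc def' ghi", quotechar="'"))  # ['abc def', 'ghi']
```

Only one quote character can be active at a time, so mixed single- and double-quoted input needs a different approach.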

15

I use shlex.split to process 70,000,000 lines of Squid log, and it is very slow. So I switched to re.

If you have performance problems with shlex, try this.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

— Daniel Dai
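
For a log-processing loop like the one described above, one possible arrangement is to compile the pattern once and stream the lines through a generator. This is a sketch under assumptions: TOKEN_RE, split_lines and the sample lines are invented for illustration, and whether precompiling gives a measurable gain should be checked on real data, since the re module already caches compiled patterns.

```python
import re

# Same pattern as line_split above, compiled once at import time.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')

def split_lines(lines):
    """Lazily yield the token list for each line of an iterable of lines."""
    for line in lines:
        yield TOKEN_RE.findall(line)

sample = [
    'GET /index.html "Mozilla/5.0 (X11; Linux)"',
    'this is "a test"',
]
results = list(split_lines(sample))
for tokens in results:
    print(tokens)
# ['GET', '/index.html', '"Mozilla/5.0 (X11; Linux)"']
# ['this', 'is', '"a test"']
```

The quoted tokens keep their quotes with this pattern, so strip them afterwards if they are not wanted.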

8

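Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.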

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

— elifiner

You should have used re.Scanner instead. It's more reliable (and I have in fact implemented something shlex-like using re.Scanner).

— Devin Jeanpierre

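+1 Hm, this is a pretty smart idea: breaking the problem down into multiple steps keeps the answer from getting terribly complex. Shlex didn't do exactly what I needed, even when I tried to tweak it. And the single-pass regex solutions were getting really weird and complicated.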

— leetNightshade 2013
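
The re.Scanner suggestion in the comment above could look roughly like the sketch below. re.Scanner exists in CPython's re module but is not covered by the official documentation, so treat this as an illustration rather than a supported recipe. The lexicon rules are my own assumption about what a minimal tokenizer might contain, not the commenter's implementation.

```python
import re

# Each rule is (pattern, action). Rules are tried in order at each position;
# an action of None means "match and discard".
scanner = re.Scanner([
    (r'"[^"]*"', lambda sc, tok: tok[1:-1]),  # double-quoted run, quotes removed
    (r"'[^']*'", lambda sc, tok: tok[1:-1]),  # single-quoted run, quotes removed
    (r'\S+', lambda sc, tok: tok),            # any other run of non-space characters
    (r'\s+', None),                           # skip whitespace
])

tokens, remainder = scanner.scan('this is "a test" and \'one more\'')
print(tokens)           # ['this', 'is', 'a test', 'and', 'one more']
print(repr(remainder))  # ''
```

In this sketch a quote that appears in the middle of a word is not treated specially, because the \S+ rule consumes it as part of the word.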

6

For performance reasons, re seems to be faster. Here is my solution using a non-greedy operator that preserves the outer quotes:

re.findall("(?:\".*?\"|\S)+", s)

Result:

['this', 'is', '"a test"']

It keeps constructs like aaa"bla blub"bbb together, since these tokens are not separated by spaces. If the string contains escaped characters, you can match like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Note that this also matches the empty string "" by means of the \S part of the pattern.

— hochl

1

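Another important advantage of this solution is its versatility with respect to the delimiting character (for example, a comma via '(?:".*?"|[^,])+'). The same applies to the quoting (enclosing) character(s).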

— a_guest
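
The delimiter point from the comment above can be wrapped in a small function. This is a sketch only: split_outside_quotes and its delimiter parameter are hypothetical names, and only single-character delimiters and double quotes are handled.

```python
import re

def split_outside_quotes(text, delimiter=" "):
    """Split on a single-character delimiter, keeping double-quoted runs intact."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    # A token is one or more of: a quoted run, or any non-delimiter character.
    pattern = r'(?:".*?"|[^%s])+' % re.escape(delimiter)
    return re.findall(pattern, text)

print(split_outside_quotes('this is "a test"'))
# ['this', 'is', '"a test"']
print(split_outside_quotes('a,"b,c",d', delimiter=","))
# ['a', '"b,c"', 'd']
print(split_outside_quotes('PARAMS val1="Thing" val2="Thing2"'))
# ['PARAMS', 'val1="Thing"', 'val2="Thing2"']
```

Empty fields between two adjacent delimiters are dropped, because every match must contain at least one character.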

4

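The main problem with the accepted shlex approach is that it does not ignore escape characters outside of quoted substrings, and it gives slightly unexpected results in some corner cases.

I have the following use case: I need a split function that splits input strings such that either single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: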

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function, which splits a string so that the expected output is produced for all of the input strings:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

So the performance is much better than shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.

— Ton van den Heuvel
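
The precompiling idea mentioned in this answer might look like the sketch below. It reuses the answer's pattern and quote-stripping logic; the names _TOKEN_RE, _strip_quotes and quoted_split_compiled are made up here, and no timing claim is attached to it.

```python
import re

# The answer's pattern, compiled once instead of on every call.
_TOKEN_RE = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')

def _strip_quotes(token):
    """Drop one pair of matching outer quotes, if present."""
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(s):
    """Same behaviour as quoted_split above, using the precompiled pattern."""
    return [_strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in _TOKEN_RE.findall(s)]

print(quoted_split_compiled('"abc \\" def" ghi'))  # ['abc " def', 'ghi']
print(quoted_split_compiled("abc'def' ghi"))       # ["abc'def'", 'ghi']
print(quoted_split_compiled('"" test'))            # ['', 'test']
```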

I'm not sure what you're talking about:

>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
>>> shlex.split('this is \\"a test\\"')
['this', 'is', '"a', 'test"']
>>> shlex.split('this is "a \\"test\\""')
['this', 'is', 'a "test"']

— morsik

@morsik, what is your point? Maybe your use case does not match mine? If you look at the test cases, you'll see all the cases where shlex does not behave as expected for my use case.

— Ton van den Heuvel

3

To preserve the quotes, use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

— THE_MAD_KING

With larger strings, your function is very slow.

— Faran2007
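
Regarding the speed remark above, one thing that might help is collecting characters in a list and joining once per token, instead of concatenating strings character by character. The sketch below, with the hypothetical name get_args, is meant to behave like getArgs; whether it is actually faster on a given input would need measuring.

```python
def get_args(s):
    """Split on single spaces outside double quotes, keeping the quotes."""
    args = []
    current = []        # characters of the token being built
    in_quotes = False
    for char in s.strip():
        if char == ' ' and not in_quotes:
            args.append(''.join(current))
            current = []
        else:
            if char == '"':
                in_quotes = not in_quotes
            current.append(char)
    args.append(''.join(current))
    return args

print(get_args('this is "a test"'))  # ['this', 'is', '"a test"']
print(get_args('a  b'))              # ['a', '', 'b']
```

As in the original function, consecutive spaces produce empty strings in the result.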

3

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

— har777
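
The %timeit lines above need IPython. For a plain Python script, a comparable measurement could be sketched with the standard timeit module as below. The candidates dict and the best_time_us helper are assumptions made for this illustration, and the numbers will differ from machine to machine, so none are shown.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# The four approaches from the speed test, wrapped as zero-argument callables.
candidates = {
    "re.split": lambda: [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

def best_time_us(func, number=2000, repeat=3):
    """Best-of-`repeat` time per call, in microseconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number * 1e6

for name, func in candidates.items():
    print("%-12s %8.2f us per call" % (name, best_time_us(func)))
```

Keep in mind that the four candidates do not return identical results (some keep the quotes, csv returns a list of rows), so this compares speed only.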

1

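Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on Kate's approach, but it correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings: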

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

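This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings of the returned list are not wanted, you can use this slightly altered version of the function: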

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

1

To get around the Unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

— moschlar

For Python 2.7.5 this should be: split = lambda a: [b.decode('utf-8') for b in _split(a)] otherwise you get: UnicodeDecodeError: 'ascii' codec can't decode byte ... in position ...: ordinal not in range(128)

— Peter Varo
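
For comparison with the Python 2 workaround above: in Python 3, str is already Unicode, so shlex.split can be called directly on non-ASCII text. The sample strings in this short sketch are invented for illustration.

```python
import shlex

german = shlex.split('das ist "ein Täst"')
japanese = shlex.split('これ は "テスト です"')

print(german)    # ['das', 'ist', 'ein Täst']
print(japanese)  # ['これ', 'は', 'テスト です']
```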

1

As an option, try tssplit:

In [1]: from tssplit import tssplit
In [2]: tssplit('this is "a test"', quote='"', delimiter='')
Out[2]: ['this', 'is', 'a test']

— Mikhail Zakharov

0

I suggest:

Test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

To also capture "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

To ignore the empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

— 騒々しい

This can also be written as re.findall("(?:\".*?\"|'.*?'|[^\s'\"]+)", s).

— hochl, 2018
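
If the surrounding quotes should be dropped rather than kept, one variation on the patterns in this answer is to put a capture group in each alternative and take whichever group matched. This is a sketch; TOKEN_RE and split_strip_quotes are hypothetical names.

```python
import re

# One capture group per alternative: double-quoted, single-quoted, bare word.
TOKEN_RE = re.compile(r'"([^"]*)"|\'([^\']*)\'|([^"\'\s]+)')

def split_strip_quotes(s):
    """Return tokens with their outer quotes removed; empty quoted strings are kept."""
    # Exactly one group takes part in each match; lastindex identifies it.
    return [m.group(m.lastindex) for m in TOKEN_RE.finditer(s)]

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
print(split_strip_quotes(s))
# ['abc', 'ad', 'fg', "kk'rdt'", 'zzz', '34', 'zzz', '', '']
```

As with the original patterns, zzz"34"zzz is split into three tokens because a quote always starts a new token.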

-3

If you don't care about substrings, then a simple split will do:

>>> 'a short sized string with spaces '.split()

Performance:

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or the string module:

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: the string module seems to perform better than the string methods.

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use the RE engine:

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance:

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings, you should not load the entire string into memory; instead, either split it into lines or use an iterative loop.

— Gregory

11

You seem to have missed the whole point of the question. There are quoted sections in the string that must not be split.

— rjmunro, 2008

-3

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

— pjz

Please provide the repr of any strings you think will fail.

— pjz

Think? adamsplit("This is 'a test'") → ['This', 'is', "'a", "test'"]

— Matthew Schinckel 16

The OP only says "within quotes" and only has an example with double quotes.

— pjz
