Python fuzzy string matching as correlation style table/matrix










4















I have a file with x number of string names and their associated IDs. Essentially two columns of data.



What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.



This is sort of what I had in mind. Just to show my intent:



import pandas as pd
from fuzzywuzzy import fuzz

df = pd.read_csv('random_data_file.csv')

df = df[['ID','String']]
df['String_Dup'] = df['String'] #creating duplicate of data in question
df = df.set_index('ID')

df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())


But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.



I hope my issue is clearly worded, and really, any input is appreciated,










share|improve this question




























    4















    I have a file with x number of string names and their associated IDs. Essentially two columns of data.



    What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.



    This is sort of what I had in mind. Just to show my intent:



    import pandas as pd
    from fuzzywuzzy import fuzz

    df = pd.read_csv('random_data_file.csv')

    df = df[['ID','String']]
    df['String_Dup'] = df['String'] #creating duplicate of data in question
    df = df.set_index('ID')

    df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())


    But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.



    I hope my issue is clearly worded, and really, any input is appreciated,










    share|improve this question


























      4












      4








      4








      I have a file with x number of string names and their associated IDs. Essentially two columns of data.



      What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.



      This is sort of what I had in mind. Just to show my intent:



      import pandas as pd
      from fuzzywuzzy import fuzz

      df = pd.read_csv('random_data_file.csv')

      df = df[['ID','String']]
      df['String_Dup'] = df['String'] #creating duplicate of data in question
      df = df.set_index('ID')

      df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())


      But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.



      I hope my issue is clearly worded, and really, any input is appreciated,










      share|improve this question
















      I have a file with x number of string names and their associated IDs. Essentially two columns of data.



      What I would like, is a correlation style table with the format x by x (having the data in question both as the x-axis and y axis), but instead of correlation, I would like the fuzzywuzzy library's function fuzz.ratio(x,y) as the output using the string names as input. Essentially running every entry against every entry.



      This is sort of what I had in mind. Just to show my intent:



      import pandas as pd
      from fuzzywuzzy import fuzz

      df = pd.read_csv('random_data_file.csv')

      df = df[['ID','String']]
      df['String_Dup'] = df['String'] #creating duplicate of data in question
      df = df.set_index('ID')

      df = df.groupby('ID')[['String','String_Dup']].apply(fuzz.ratio())


      But clearly this approach is not working for me at the moment. Any help appreciated. It doesn't have to be pandas, it is just an environment I am relatively more familiar with.



      I hope my issue is clearly worded, and really, any input is appreciated,







      python pandas matrix fuzzy






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 12 '18 at 11:59









      Brian Tompsett - 汤莱恩

      4,1931337101




      4,1931337101










      asked Nov 12 '18 at 11:28









      WayOutofDepthWayOutofDepth

      323




      323






















          3 Answers
          3






          active

          oldest

          votes


















          1














          Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
          This is considerably more elegant than my first answer.



          import pandas as pd
          from fuzzywuzzy import fuzz

          # Create sample data frame.
          df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
          columns=['id', 'strings'])
          # Create the cartesian product between the strings column with itself.
          ct = pd.crosstab(df['strings'], df['strings'])
          # Note: for pandas versions <0.22, the two series must have different names.
          # In case you observe a "Level XX not found" error, the following may help:
          # ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

          # Apply the fuzz (column-wise). Argument col has type pd.Series.
          ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

          # This results in the following:
          # strings abc abracadabra brabra cadra
          # strings
          # abc 100 43 44 25
          # abracadabra 43 100 71 62
          # brabra 44 71 100 55
          # cadra 25 62 55 100


          For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:



          def cross_fuzz(df):
          ct = pd.crosstab(df['strings'], df['strings'])
          ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
          return ct

          df.groupby('id').apply(cross_fuzz)





          share|improve this answer

























          • Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

            – WayOutofDepth
            Nov 12 '18 at 15:43











          • The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

            – normanius
            Nov 12 '18 at 16:11



















          0














          import csv
          from fuzzywuzzy import fuzz
          import numpy as np

          input_file = csv.DictReader(open('random_data_file.csv'))
          string =
          for row in input_file: #file is appended row by row into a python dictionary
          string.append(row["String"]) #keys for the dict. are the headers



          #now you have a list of the string values
          length = len(string)
          resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X

          for i in range (length):
          for j in range (length):
          resultMat[i][j] = fuzz.ratio(string[i], string[j])

          print resultMat


          I did the implementation in a numby 2D matrix. I am not that good in pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning: string[i] will be matched with string_dub[i], all results will be 100



          Hope it helps






          share|improve this answer






























            0














            In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.



            Here is what this could look like:



            import pandas as pd
            from fuzzywuzzy import fuzz

            # Create sample data frame.
            df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
            columns=['id', 'strings'])

            # Cross product, using a temporary column.
            df['_tmp'] = 0
            mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])

            # Apply the function between the two strings.
            mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)

            # Reorganize data.
            ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
            ret.index.name = None
            ret.columns.name = None

            # This results in the following:
            # abc abracadabra brabra cadra
            # abc 100 43 44 25
            # abracadabra 43 100 71 62
            # brabra 44 71 100 55
            # cadra 25 62 55 100





            share|improve this answer
























              Your Answer






              StackExchange.ifUsing("editor", function ()
              StackExchange.using("externalEditor", function ()
              StackExchange.using("snippets", function ()
              StackExchange.snippets.init();
              );
              );
              , "code-snippets");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "1"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader:
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              ,
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );













              draft saved

              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53261214%2fpython-fuzzy-string-matching-as-correlation-style-table-matrix%23new-answer', 'question_page');

              );

              Post as a guest















              Required, but never shown

























              3 Answers
              3






              active

              oldest

              votes








              3 Answers
              3






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              1














              Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
              This is considerably more elegant than my first answer.



              import pandas as pd
              from fuzzywuzzy import fuzz

              # Create sample data frame.
              df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
              columns=['id', 'strings'])
              # Create the cartesian product between the strings column with itself.
              ct = pd.crosstab(df['strings'], df['strings'])
              # Note: for pandas versions <0.22, the two series must have different names.
              # In case you observe a "Level XX not found" error, the following may help:
              # ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

              # Apply the fuzz (column-wise). Argument col has type pd.Series.
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

              # This results in the following:
              # strings abc abracadabra brabra cadra
              # strings
              # abc 100 43 44 25
              # abracadabra 43 100 71 62
              # brabra 44 71 100 55
              # cadra 25 62 55 100


              For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:



              def cross_fuzz(df):
              ct = pd.crosstab(df['strings'], df['strings'])
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
              return ct

              df.groupby('id').apply(cross_fuzz)





              share|improve this answer

























              • Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

                – WayOutofDepth
                Nov 12 '18 at 15:43











              • The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

                – normanius
                Nov 12 '18 at 16:11
















              1














              Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
              This is considerably more elegant than my first answer.



              import pandas as pd
              from fuzzywuzzy import fuzz

              # Create sample data frame.
              df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
              columns=['id', 'strings'])
              # Create the cartesian product between the strings column with itself.
              ct = pd.crosstab(df['strings'], df['strings'])
              # Note: for pandas versions <0.22, the two series must have different names.
              # In case you observe a "Level XX not found" error, the following may help:
              # ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

              # Apply the fuzz (column-wise). Argument col has type pd.Series.
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

              # This results in the following:
              # strings abc abracadabra brabra cadra
              # strings
              # abc 100 43 44 25
              # abracadabra 43 100 71 62
              # brabra 44 71 100 55
              # cadra 25 62 55 100


              For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:



              def cross_fuzz(df):
              ct = pd.crosstab(df['strings'], df['strings'])
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
              return ct

              df.groupby('id').apply(cross_fuzz)





              share|improve this answer

























              • Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

                – WayOutofDepth
                Nov 12 '18 at 15:43











              • The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

                – normanius
                Nov 12 '18 at 16:11














              1












              1








              1







              Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
              This is considerably more elegant than my first answer.



              import pandas as pd
              from fuzzywuzzy import fuzz

              # Create sample data frame.
              df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
              columns=['id', 'strings'])
              # Create the cartesian product between the strings column with itself.
              ct = pd.crosstab(df['strings'], df['strings'])
              # Note: for pandas versions <0.22, the two series must have different names.
              # In case you observe a "Level XX not found" error, the following may help:
              # ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

              # Apply the fuzz (column-wise). Argument col has type pd.Series.
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

              # This results in the following:
              # strings abc abracadabra brabra cadra
              # strings
              # abc 100 43 44 25
              # abracadabra 43 100 71 62
              # brabra 44 71 100 55
              # cadra 25 62 55 100


              For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:



              def cross_fuzz(df):
              ct = pd.crosstab(df['strings'], df['strings'])
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
              return ct

              df.groupby('id').apply(cross_fuzz)





              share|improve this answer















              Use pandas' crosstab function, followed by a column-wise apply to compute the fuzz.
              This is considerably more elegant than my first answer.



              import pandas as pd
              from fuzzywuzzy import fuzz

              # Create sample data frame.
              df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
              columns=['id', 'strings'])
              # Create the cartesian product between the strings column with itself.
              ct = pd.crosstab(df['strings'], df['strings'])
              # Note: for pandas versions <0.22, the two series must have different names.
              # In case you observe a "Level XX not found" error, the following may help:
              # ct = pd.crosstab(df['strings'].rename(), df['strings'].rename())

              # Apply the fuzz (column-wise). Argument col has type pd.Series.
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])

              # This results in the following:
              # strings abc abracadabra brabra cadra
              # strings
              # abc 100 43 44 25
              # abracadabra 43 100 71 62
              # brabra 44 71 100 55
              # cadra 25 62 55 100


              For simplicity, I omitted the groupby operation as suggested in your question. In case need want to apply the fuzzy string matching on groups, simply create a separate function:



              def cross_fuzz(df):
              ct = pd.crosstab(df['strings'], df['strings'])
              ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
              return ct

              df.groupby('id').apply(cross_fuzz)






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Nov 12 '18 at 21:55

























              answered Nov 12 '18 at 13:33









              normaniusnormanius

              1,4841128




              1,4841128












              • Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

                – WayOutofDepth
                Nov 12 '18 at 15:43











              • The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

                – normanius
                Nov 12 '18 at 16:11


















              • Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

                – WayOutofDepth
                Nov 12 '18 at 15:43











              • The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

                – normanius
                Nov 12 '18 at 16:11

















              Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

              – WayOutofDepth
              Nov 12 '18 at 15:43





              Thanks! This works as long as i do not attempt to cross tabulate the same series. i.e. i had to create a copy of 'strings' and call it something else, otherwise it gave me a 'level strings not found error'. I am happy to accept this as the answer otherwise.

              – WayOutofDepth
              Nov 12 '18 at 15:43













              The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

              – normanius
              Nov 12 '18 at 16:11






              The above code works for pandas 0.22.0. But you're right, former versions of pandas (e.g. 0.20.3) required the series to have different names. I edited my answer accordingly. Thanks for pointing this out.

              – normanius
              Nov 12 '18 at 16:11














              0














              import csv
              from fuzzywuzzy import fuzz
              import numpy as np

              input_file = csv.DictReader(open('random_data_file.csv'))
              string =
              for row in input_file: #file is appended row by row into a python dictionary
              string.append(row["String"]) #keys for the dict. are the headers



              #now you have a list of the string values
              length = len(string)
              resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X

              for i in range (length):
              for j in range (length):
              resultMat[i][j] = fuzz.ratio(string[i], string[j])

              print resultMat


              I did the implementation in a numby 2D matrix. I am not that good in pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning: string[i] will be matched with string_dub[i], all results will be 100



              Hope it helps






              share|improve this answer



























                0














                import csv
                from fuzzywuzzy import fuzz
                import numpy as np

                input_file = csv.DictReader(open('random_data_file.csv'))
                string =
                for row in input_file: #file is appended row by row into a python dictionary
                string.append(row["String"]) #keys for the dict. are the headers



                #now you have a list of the string values
                length = len(string)
                resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X

                for i in range (length):
                for j in range (length):
                resultMat[i][j] = fuzz.ratio(string[i], string[j])

                print resultMat


                I did the implementation in a numby 2D matrix. I am not that good in pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning: string[i] will be matched with string_dub[i], all results will be 100



                Hope it helps






                share|improve this answer

























                  0












                  0








                  0







                  import csv
                  from fuzzywuzzy import fuzz
                  import numpy as np

                  input_file = csv.DictReader(open('random_data_file.csv'))
                  string =
                  for row in input_file: #file is appended row by row into a python dictionary
                  string.append(row["String"]) #keys for the dict. are the headers



                  #now you have a list of the string values
                  length = len(string)
                  resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X

                  for i in range (length):
                  for j in range (length):
                  resultMat[i][j] = fuzz.ratio(string[i], string[j])

                  print resultMat


                  I did the implementation in a numby 2D matrix. I am not that good in pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning: string[i] will be matched with string_dub[i], all results will be 100



                  Hope it helps






                  share|improve this answer













                  import csv
                  from fuzzywuzzy import fuzz
                  import numpy as np

                  input_file = csv.DictReader(open('random_data_file.csv'))
                  string =
                  for row in input_file: #file is appended row by row into a python dictionary
                  string.append(row["String"]) #keys for the dict. are the headers



                  #now you have a list of the string values
                  length = len(string)
                  resultMat = np.zeros((length, length)) #zeros 2D matrix, with size X * X

                  for i in range (length):
                  for j in range (length):
                  resultMat[i][j] = fuzz.ratio(string[i], string[j])

                  print resultMat


                  I did the implementation in a numby 2D matrix. I am not that good in pandas, but I think what you were doing is adding another column and comparing it to the string column, meaning: string[i] will be matched with string_dub[i], all results will be 100



                  Hope it helps







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 12 '18 at 13:05









                  Hossam26644Hossam26644

                  446




                  446





















                      0














                      In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.



                      Here is what this could look like:



                      import pandas as pd
                      from fuzzywuzzy import fuzz

                      # Create sample data frame.
                      df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                      columns=['id', 'strings'])

                      # Cross product, using a temporary column.
                      df['_tmp'] = 0
                      mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])

                      # Apply the function between the two strings.
                      mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)

                      # Reorganize data.
                      ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
                      ret.index.name = None
                      ret.columns.name = None

                      # This results in the following:
                      # abc abracadabra brabra cadra
                      # abc 100 43 44 25
                      # abracadabra 43 100 71 62
                      # brabra 44 71 100 55
                      # cadra 25 62 55 100





                      share|improve this answer





























                        0














                        In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.



                        Here is what this could look like:



                        import pandas as pd
                        from fuzzywuzzy import fuzz

                        # Create sample data frame.
                        df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                        columns=['id', 'strings'])

                        # Cross product, using a temporary column.
                        df['_tmp'] = 0
                        mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])

                        # Apply the function between the two strings.
                        mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)

                        # Reorganize data.
                        ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
                        ret.index.name = None
                        ret.columns.name = None

                        # This results in the following:
                        # abc abracadabra brabra cadra
                        # abc 100 43 44 25
                        # abracadabra 43 100 71 62
                        # brabra 44 71 100 55
                        # cadra 25 62 55 100





                        share|improve this answer



























                          0












                          0








                          0







                          In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.



                          Here is what this could look like:



                          import pandas as pd
                          from fuzzywuzzy import fuzz

                          # Create sample data frame.
                          df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                          columns=['id', 'strings'])

                          # Cross product, using a temporary column.
                          df['_tmp'] = 0
                          mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])

                          # Apply the function between the two strings.
                          mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)

                          # Reorganize data.
                          ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
                          ret.index.name = None
                          ret.columns.name = None

                          # This results in the following:
                          # abc abracadabra brabra cadra
                          # abc 100 43 44 25
                          # abracadabra 43 100 71 62
                          # brabra 44 71 100 55
                          # cadra 25 62 55 100





                          share|improve this answer















                          In pandas, the cartesian cross product between two columns can be created using a dummy variable and pd.merge. The fuzz operation is applied using apply. A final pivot operation will extract the format you had in mind. For simplicity, I omitted the groupby operation, but of course, you could apply the procedure to all group-tables by moving the code below into a separate function.



                          Here is what this could look like:



                          import pandas as pd
                          from fuzzywuzzy import fuzz

                          # Create sample data frame.
                          df = pd.DataFrame([(1, 'abracadabra'), (2,'abc'), (3,'cadra'), (4, 'brabra')],
                          columns=['id', 'strings'])

                          # Cross product, using a temporary column.
                          df['_tmp'] = 0
                          mrg = pd.merge(df, df, on='_tmp', suffixes=['_1','_2'])

                          # Apply the function between the two strings.
                          mrg['fuzz'] = mrg.apply(lambda s: fuzz.ratio(s['strings_1'], s['strings_2']), axis=1)

                          # Reorganize data.
                          ret = mrg.pivot(index='strings_1', columns='strings_2', values='fuzz')
                          ret.index.name = None
                          ret.columns.name = None

                          # This results in the following:
                          # abc abracadabra brabra cadra
                          # abc 100 43 44 25
                          # abracadabra 43 100 71 62
                          # brabra 44 71 100 55
                          # cadra 25 62 55 100






                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Nov 12 '18 at 13:39

























                          answered Nov 12 '18 at 13:07









                          normaniusnormanius

                          1,4841128




                          1,4841128



























                              draft saved

                              draft discarded
















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid


                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.

                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53261214%2fpython-fuzzy-string-matching-as-correlation-style-table-matrix%23new-answer', 'question_page');

                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Use pre created SQLite database for Android project in kotlin

                              Darth Vader #20

                              Ondo