 
  • Fixing a semi-redundant text file, best practice opinions wanted


    First a little background: Link to attachments

     

    I have a lot of text generated by a voice-to-text application. (I honestly do not know the name of the application, since I do not have physical access to it, but I do have access to the live output.) I am mining this data in real time, and the output text looks like the first attachment: some parts are very clean, and some are, well, very redundant.

     

    I have now written a piece of software in Python that cleans up the text (attachment two). The catch is that it only works on a lot of text at a time, e.g. my backups, which hold hundreds of megabytes of pure text. When the text comes in in real time, it is hard to process only a few strings at once, since the semi-redundancy lasts 15-25 lines (as you can see in attachment one).

     

    The software works on the bigger files, and I am now trying to rewrite the code so it works with the live output.

     

    But since I am a self-taught programmer, I was wondering if anybody could share how they would approach the job.

     

    My approach is as follows (also see attachment two; I am bad at commenting, though, so I do not know if you would get much out of it):

     

    1. Open the file (plain text) and wait until 25 lines have been written to it
    2. Read 25 lines into a list, let's call it MasterList
    3. Run the clean-up functions (1-7) on MasterList (see below)
    4. Print lines 10-14 to the cleaned-up file (the first time, print lines 0-14)
    5. Shift lines 5-24 of MasterList to the beginning of MasterList, so that they now have indices 0-19
    6. Read 5 new lines into MasterList, or wait until 5 new lines are ready
    7. Go back to #3
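
The loop in steps 1-7 could be sketched roughly as follows. This is only a sketch under assumptions: `windowed_clean` and `drop_exact_repeats` are hypothetical names, not taken from attachment two, and the clean-up step is assumed to return a list of the same length as its input, with `None` marking deleted lines, so that "lines 10-14" still refers to the same raw lines after cleaning. Waiting on a live file is left out; any iterable of lines will do.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# Assumed interface: a cleaner returns one entry per input line,
# with None in place of every line it wants deleted.
Cleaner = Callable[[List[str]], List[Optional[str]]]


def windowed_clean(lines: Iterable[str], clean: Cleaner,
                   window: int = 25, step: int = 5) -> Iterator[str]:
    """Yield cleaned lines, keeping context on both sides of each emitted slice."""
    lead = (window - step) // 2   # 10 lines of context before the emitted slice
    buf: List[str] = []           # plays the role of MasterList
    first = True
    for line in lines:
        buf.append(line)
        if len(buf) < window:
            continue              # steps 1, 2 and 6: wait for a full window
        cleaned = clean(buf)      # step 3
        start = 0 if first else lead
        for item in cleaned[start:lead + step]:   # step 4
            if item is not None:
                yield item
        first = False
        del buf[:step]            # step 5: shift the window forward
    if buf:
        # End of input: emit whatever has not been written out yet.
        cleaned = clean(buf)
        start = 0 if first else lead
        for item in cleaned[start:]:
            if item is not None:
                yield item


def drop_exact_repeats(window: List[str]) -> List[Optional[str]]:
    """Stand-in cleaner: keep the first copy of each line, mark repeats as None."""
    seen = set()
    out: List[Optional[str]] = []
    for line in window:
        out.append(None if line in seen else line)
        seen.add(line)
    return out


if __name__ == "__main__":
    raw = [f"line {i // 2}" for i in range(60)]   # every line appears twice
    for cleaned_line in windowed_clean(raw, drop_exact_repeats):
        print(cleaned_line)
```

Keeping positions stable is a design choice made for this sketch: if the cleaner removed lines outright, the indices in steps 4 and 5 would drift from one pass to the next.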

     

    --> Note regarding #3: The clean-up functions do the following:

    * Compare lines using fuzzy string matching (fuzzywuzzy), and delete duplicate or semi-duplicate lines
    * Check if the first word in a sentence is the same as the last word in the previous sentence, and in that case delete the last word in the previous sentence
    * Smaller stuff, to make the text look clean.
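
The first two clean-up steps might look something like the sketch below. It is not the code from attachment two. To stay dependency-free it swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy (whose `fuzz.ratio` gives a comparable 0-100 score). The function names, the 0.85 threshold, and the choice to keep the first of several similar lines are all assumptions for illustration; a real transcript may call for keeping the last, most complete version instead.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_near_duplicates(window: List[str],
                         threshold: float = 0.85) -> List[Optional[str]]:
    """Keep the first of each group of similar lines; mark the rest as None."""
    kept: List[str] = []
    out: List[Optional[str]] = []
    for line in window:
        if any(similarity(line, earlier) >= threshold for earlier in kept):
            out.append(None)          # semi-duplicate of a line already kept
        else:
            kept.append(line)
            out.append(line)
    return out


def trim_repeated_boundary_word(lines: List[Optional[str]]) -> List[Optional[str]]:
    """If a line starts with the word the previous surviving line ends with,
    delete that last word from the previous line. None entries are skipped."""
    result = list(lines)
    prev = None                       # index of the previous surviving line
    for i, line in enumerate(result):
        if line is None:
            continue
        words = line.split()
        if prev is not None and words:
            prev_words = result[prev].split()
            if prev_words and prev_words[-1].lower() == words[0].lower():
                result[prev] = " ".join(prev_words[:-1])
        prev = i
    return result


if __name__ == "__main__":
    sample = [
        "we went to the",
        "we went to the",
        "the store today",
    ]
    print(trim_repeated_boundary_word(drop_near_duplicates(sample)))
```

The word comparison here is a plain whitespace split, so punctuation attached to a word ("the," versus "the") would defeat it; that is one of the "smaller stuff" details left out.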

     

    My questions are: Would you go about it in a completely different way? Maybe machine learning? Would another language be better suited? Are there any libraries, or even existing software, that already do this?

     

    If you do read my code, I am also eager to learn from my mistakes. If you see some stupid thing I am doing, criticism (even harsh criticism, if you feel like bashing me) is very welcome.

     

    Thank you ever so much for your time.

 0 Answer(s)
