  • Fixing a semi-redundant text file, best practice opinions wanted


    First a little background: Link to attachments


    I have a lot of text generated by a voice-to-text application. (I do not know its name, since I do not have physical access to it, but I do have access to its live output.) I am mining this data in real time, and the output looks like the first attachment: some parts are very clean, and some are very redundant.


    I have now written a piece of software in Python that cleans up the text (attachment two). The catch is that it only works on a lot of text at a time, e.g. my backups, which hold hundreds of megabytes of pure text. When the text arrives in real time, it is hard to process only a few strings, since the semi-redundancy lasts 15-25 lines (as you can see in attachment one).


    The software works on the bigger files, and I am now trying to rewrite the code so it works with the live output.


    Since I am a self-taught programmer, I was wondering whether anybody could share how they would approach the job.


    My approach is as follows (also see attachment two, although I am bad at commenting, so I do not know how much you would get out of it):


    1. Open the file (plain text) and wait until 25 lines have been written to it
    2. Read 25 lines into a list, let's call it MasterList
    3. Run the clean-up functions on MasterList (see the note below)
    4. Print lines 10-14 to the cleaned-up file (the first time, print lines 0-14)
    5. Push lines 5-24 of MasterList to the beginning of MasterList, so that they now have indices 0-19
    6. Read 5 new lines into MasterList, or wait until 5 new lines are ready
    7. Go back to #3
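
To make the loop above concrete, here is a minimal sketch of it as a generator. It is not the code from attachment two. The names `sliding_clean`, `clean`, `window`, `step` and `emit_from` are made up for illustration, and it assumes the clean-up step marks a deleted line as `None` instead of removing it, so that the indices in steps 4 and 5 stay valid. It also adds a final flush at end of input, which the steps do not mention.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Line = Optional[str]  # None marks a line that a clean-up pass deleted


def sliding_clean(
    lines: Iterable[str],
    clean: Callable[[List[Line]], List[Line]],
    window: int = 25,    # size of MasterList
    step: int = 5,       # new lines read per round
    emit_from: int = 10,  # first index printed after the first round
) -> Iterator[str]:
    """Yield cleaned lines, each exactly once, using an overlapping window.

    `clean` is assumed to return a list of the same length as its input.
    """
    source = iter(lines)
    master: List[Line] = []
    first = True
    while True:
        # Steps 1-2 and 6: top the window up; blocks if `lines` blocks.
        while len(master) < window:
            nxt = next(source, None)
            if nxt is None:
                break
            master.append(nxt.rstrip("\n"))
        exhausted = len(master) < window
        master = clean(master)  # step 3
        start = 0 if first else emit_from
        # Step 4, or flush everything not yet printed at end of input.
        stop = len(master) if exhausted else emit_from + step
        for line in master[start:stop]:
            if line is not None:
                yield line
        if exhausted:
            return
        first = False
        master = master[step:]  # step 5: old lines 5-24 become 0-19
```

With a live source, `lines` could be any iterator that blocks until the next line is available (for example a generator that polls the growing file), and the generator then simply waits inside `next()`. Storing the cleaned window back into `master` means deletions from one round are still visible as context in the next.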


    --> Note regarding #3: The clean-up functions do the following:

    * Compare lines using fuzzy string matching (fuzzywuzzy), and delete duplicate or semi-duplicate lines
    * Check whether the first word in a sentence is the same as the last word in the previous sentence, and if so, delete the last word in the previous sentence
    * Smaller stuff, to make the text look clean
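
For readers without the attachment, the first two clean-up rules could look roughly like the sketch below. This is an illustration, not the attached code: the standard library's `difflib.SequenceMatcher` stands in for fuzzywuzzy so that the example has no third-party dependency, and the function names, the 0.85 threshold, and the choice to keep the earliest of two similar lines are all assumptions. Deleted lines are marked `None` so the list keeps its length.

```python
from difflib import SequenceMatcher
from typing import List, Optional

Line = Optional[str]  # None marks a deleted line


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in for a fuzzywuzzy ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_near_duplicates(window: List[Line], threshold: float = 0.85) -> List[Line]:
    """Keep the earliest line of each group of near-duplicates (assumed policy)."""
    out = list(window)
    for i, cur in enumerate(out):
        if cur is None:
            continue
        for j in range(i + 1, len(out)):
            other = out[j]
            if other is not None and similarity(cur, other) >= threshold:
                out[j] = None
    return out


def trim_repeated_boundary_word(window: List[Line]) -> List[Line]:
    """If a line starts with the word the previous line ends with,
    delete that last word from the previous line."""
    out = list(window)
    prev_idx = None
    for i, cur in enumerate(out):
        if cur is None or not cur.split():
            continue  # skip deleted and blank lines
        if prev_idx is not None:
            prev_words = out[prev_idx].split()
            if prev_words and prev_words[-1].lower() == cur.split()[0].lower():
                out[prev_idx] = " ".join(prev_words[:-1])
        prev_idx = i
    return out


def clean(window: List[Line]) -> List[Line]:
    """Apply both rules; the result has the same length as the input."""
    return trim_repeated_boundary_word(drop_near_duplicates(window))
```

The pairwise comparison is quadratic in the window size, which is harmless for 25 lines but would be slow on a whole backup file at once. Punctuation is not stripped before comparing boundary words, so "the." and "the" would not match in this sketch.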


    My questions are: Would you go about it in a completely different way? Maybe with machine learning? Might another language be better suited? Are there any libraries, or even existing software, that would already do this?


    If you do read my code, I am also eager to learn from my mistakes. If you see some stupid thing I am doing, criticism (even harsh criticism, if you feel like bashing me) is very welcome.


    Thank you ever so much for your time.

